Is cuNumeric the future of Python numerical computation?
Late last year, NVIDIA made a significant announcement regarding the future of Python-based numerical computing. I wouldn’t be surprised if you missed it. After all, every other announcement from every AI company, then and now, seems mega-important.
That announcement introduced cuNumeric, a drop-in replacement for the ubiquitous NumPy library, built on top of the Legate framework.
Who are Nvidia?
Most people will probably know Nvidia from their ultra-fast chips that power computers and data centres all over the world. You may also be familiar with Nvidia’s charismatic, leather jacket-loving CEO, Jensen Huang, who seems to pop up on the stage of every AI conference these days.
What many people don’t know is that Nvidia also designs and creates innovative device architectures and associated software. One of its most prized products is the Compute Unified Device Architecture (CUDA). CUDA is NVIDIA’s proprietary parallel-computing platform and programming model. Since its launch in 2007, it has evolved into a comprehensive ecosystem comprising drivers, runtime, compilers, math libraries, debugging and profiling tools, and container images. The result is a neatly tuned hardware and software loop that keeps NVIDIA GPUs at the centre of modern high-performance and AI workloads.
What is Legate?
Legate is an NVIDIA-led open-source runtime layer that lets you run familiar Python data-science libraries (NumPy, cuNumeric, Pandas-style APIs, sparse linear-algebra kernels, …) on multi-core CPUs, single or multi-GPU nodes, and even multi-node clusters without changing your Python code. It translates high-level array operations into a graph of fine-grained tasks and hands that graph to the C++ Legion runtime, which schedules the tasks, partitions the data, and moves tiles between CPUs, GPUs and network links for you.
In a nutshell, Legate lets familiar single-node Python libraries scale transparently to multi-GPU, multi-node machines.
What is cuNumeric?
cuNumeric is a drop-in replacement for NumPy whose array operations are executed by Legate’s task engine and accelerated on one or many NVIDIA GPUs (or, if no GPU is present, on all CPU cores). In practice, you install it and need only change one import line to start using it in place of your regular NumPy code. For example …
# old
import numpy as np
...
...
# new
import cupynumeric as np # everything else stays the same
...
...
… and run your script on the terminal with the legate command.
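For example, if your script is saved as my_script.py (a made-up name for illustration), a single-GPU launch looks something like this; the exact flags I use on my own machine appear in the benchmark sections later:

$ legate --gpus 1 my_script.py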
Behind the scenes, cuNumeric converts each NumPy call you make (for example, np.sin, np.linalg.svd, fancy indexing, broadcasting, reductions, and so on) into Legate tasks. Those tasks will:
- Partition your arrays into tiles sized to fit GPU memory.
- Schedule each tile on the best available device (GPU or CPU).
- Overlap compute with communication when the workload spans multiple GPUs or nodes.
- Spill tiles to NVMe/SSD automatically when your dataset outruns GPU RAM.
Because the API of cuNumeric mirrors NumPy’s nearly 1-for-1, existing scientific or data-science code can scale from a laptop to a multi-GPU cluster without a rewrite.
Performance benefits
So, this all seems great, right? But it only makes sense if it delivers tangible performance improvements over plain NumPy, and Nvidia is making some strong claims that it does. Since data scientists, machine learning engineers and data engineers typically use NumPy a lot, performance here can be a crucial aspect of the systems we write and maintain.
Now, I don’t have a cluster of GPUs or a supercomputer to test this on, but my desktop PC does have an Nvidia GeForce RTX 4070 Ti GPU, and we’re going to use that to test out some of Nvidia’s claims.
(base) tom@tpr-desktop:~$ nvidia-smi
Sun Jun 15 15:26:36 2025
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.75 Driver Version: 566.24 CUDA Version: 12.7 |
|-----------------------------------------+------------------------+----------------------+
| GPU Name Persistence-M | Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+========================+======================|
| 0 NVIDIA GeForce RTX 4070 Ti On | 00000000:01:00.0 On | N/A |
| 32% 29C P8 9W / 285W | 1345MiB / 12282MiB | 2% Default |
| | | N/A |
+-----------------------------------------+------------------------+----------------------+
+-----------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=========================================================================================|
| No running processes found |
+-----------------------------------------------------------------------------------------+
I’ll install cuNumeric and NumPy on my PC to conduct comparative tests. This will help us assess whether Nvidia’s claims are accurate and understand the performance differences between the two libraries.
Setting up a development environment
As always, I like to set up a separate development environment to run my tests. That way, nothing I do in that environment will affect any of my other projects. At the time of writing, cuNumeric is not available to install on Windows, so I’ll be using WSL2 Ubuntu for Windows instead.
I’ll be using Miniconda to set up my environment, but feel free to use whichever tool you’re comfortable with.
$ conda create -n cunumeric-env python=3.10 -c conda-forge
$ conda activate cunumeric-env
$ conda install -c conda-forge -c legate cupynumeric
$ conda install -c conda-forge ucx cuda-cudart cuda-version=12
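Before benchmarking anything, it is worth a quick smoke test to confirm the install works end to end. Here is a minimal throwaway sketch (check.py is just a name I made up):

# check.py - a throwaway smoke test for the cuNumeric install
import cupynumeric as np

a = np.arange(10)
print(a.sum())  # should print 45 and exit cleanly

Run it with legate check.py; if it prints 45 without errors, the environment is ready.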
Code example 1 — A simple matrix multiplication
Matrix multiplication is the bread and butter of the mathematical operations that underpin so many AI systems, so it makes sense to try that operation first.
Note that in all my examples, I will run the NumPy and cuNumeric code snippets five times in a row and average the time taken for each. I also perform a “warm-up” step on the GPU before the timed runs to account for overheads such as just-in-time (JIT) compilation.
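Distilled to its essence, every benchmark below follows the same measurement pattern. Here is a stripped-down sketch of that structure (a simplification: the real scripts also synchronise the GPU before stopping the clock):

import time

def average_time(op, runs=5):
    """Warm up once, then time op() over the given number of runs."""
    op()  # untimed warm-up run (JIT compilation, cache warming, etc.)
    times = []
    for _ in range(runs):
        start = time.time()
        op()
        times.append(time.time() - start)
    return sum(times) / len(times)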
import time
import gc
import argparse
import sys


def benchmark_numpy(n, runs):
    """Runs the matrix multiplication benchmark using standard NumPy on the CPU."""
    import numpy as np

    print(f"--- NumPy (CPU) Benchmark ---")
    print(f"Multiplying two {n}×{n} matrices ({runs} runs)\n")

    # 1. Generate data ONCE before the timing loop.
    print(f"Generating two {n}x{n} random matrices on CPU...")
    A = np.random.rand(n, n).astype(np.float32)
    B = np.random.rand(n, n).astype(np.float32)

    # 2. Perform one untimed warm-up run.
    print("Performing warm-up run...")
    _ = np.matmul(A, B)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed. The @ operator is a convenient
        # shorthand for np.matmul.
        C = A @ B
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.4f}s")
        del C  # Clean up the result matrix
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.4f}s\n")
    return avg


def benchmark_cunumeric(n, runs):
    """Runs the matrix multiplication benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np  # Import numpy for the canonical sync

    print(f"--- cuNumeric (GPU) Benchmark ---")
    print(f"Multiplying two {n}×{n} matrices ({runs} runs)\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    print(f"Generating two {n}x{n} random matrices on GPU...")
    A = cn.random.rand(n, n).astype(np.float32)
    B = cn.random.rand(n, n).astype(np.float32)

    # 2. Perform a crucial untimed warm-up run for JIT compilation.
    print("Performing warm-up run...")
    C_warmup = cn.matmul(A, B)
    # The best practice for synchronization: force a copy back to the CPU.
    _ = np.array(C_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # Launch the operation on the GPU
        C = A @ B
        # Synchronize by converting the result to a host-side NumPy array.
        np.array(C)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.4f}s")
        del C
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.4f}s\n")
    return avg


if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description="Benchmark matrix multiplication on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    parser.add_argument(
        "-n", "--n", type=int, default=3000, help="Matrix size (n x n)"
    )
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    parser.add_argument(
        "--cunumeric", action="store_true", help="Run the cuNumeric (GPU) version"
    )
    args, unknown = parser.parse_known_args()

    # The dispatcher logic
    if args.cunumeric or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n, args.runs)
    else:
        benchmark_numpy(args.n, args.runs)
Running the NumPy side of things uses the regular python example1.py command-line syntax. Running under Legate is more involved: the command disables Legate’s automatic configuration and then launches the example1.py script under Legate with one CPU, one GPU, and zero OpenMP threads, using the cuNumeric backend.
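For reference, here is that command on a single line (it wraps in the terminal capture below):

$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example1.py --cunumeric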
Here is the output.
(cunumeric-env) tom@tpr-desktop:~$ python example1.py
--- NumPy (CPU) Benchmark ---
Multiplying two 3000×3000 matrices (5 runs)
Generating two 3000x3000 random matrices on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.0976s
Run 2: time = 0.0987s
Run 3: time = 0.0957s
Run 4: time = 0.1063s
Run 5: time = 0.0989s
NumPy average: 0.0994s
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example1.py --cunumeric
[0 - 7f2e8fcc8480] 0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7f2e8fcc8480] 0.000000 {4}{topology}: can't open /sys/devices/system/node/
[0 - 7f2e8fcc8480] 0.000049 {4}{threads}: reservation ('GPU ctxsync 0x55cd5fd34530') cannot be satisfied
--- cuNumeric (GPU) Benchmark ---
Multiplying two 3000×3000 matrices (5 runs)
Generating two 3000x3000 random matrices on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.0113s
Run 2: time = 0.0089s
Run 3: time = 0.0086s
Run 4: time = 0.0090s
Run 5: time = 0.0087s
cuNumeric average: 0.0093s
Well, that’s an impressive start. cuNumeric is registering a 10x speedup over NumPy.
The warnings that Legate outputs can be ignored. They are informational, indicating that Legate couldn’t detect details about the machine’s CPU/memory layout (NUMA) or reserve enough CPU cores to manage the GPU.
Code example 2 — Logistic regression
Logistic regression is a foundational tool in data science because it provides a simple, interpretable way to model and predict binary outcomes (yes/no, pass/fail, click/no-click). In this example, we’ll measure how long it takes to train a simple binary classifier on synthetic data. For each of the five runs, the script first generates N samples with D features (X) and a corresponding random 0/1 label vector (y). It initialises the weight vector w to zeros, then performs 500 iterations of batch gradient descent: computing the linear predictions z = X.dot(w), applying the sigmoid p = 1/(1 + exp(-z)), computing the gradient grad = X.T.dot(p - y) / N, and updating the weights with w -= 0.1 * grad. The script records the elapsed time for each run, cleans up memory, and finally prints the average training time.
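Written out in matrix form, each gradient-descent iteration computes

$$z = Xw, \qquad p = \frac{1}{1 + e^{-z}}, \qquad g = \frac{X^{\top}(p - y)}{N}, \qquad w \leftarrow w - \alpha g$$

where N is the number of samples and the learning rate is α = 0.1.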
import time
import gc
import argparse
import sys


# --- Reusable Training Function ---
# By putting the training loop in its own function, we avoid code duplication.
# The `np` argument allows us to pass in either the numpy or cupynumeric module.
def train_logistic_regression(np, X, y, iters, alpha):
    """Performs a set number of gradient descent iterations."""
    # Ensure w starts on the correct device (CPU or GPU)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        z = X.dot(w)
        p = 1.0 / (1.0 + np.exp(-z))
        grad = X.T.dot(p - y) / X.shape[0]
        w -= alpha * grad
    return w


def benchmark_numpy(n_samples, n_features, iters, alpha, runs):
    """Runs the logistic regression benchmark using standard NumPy on the CPU."""
    import numpy as np

    print(f"--- NumPy (CPU) Benchmark ---")
    print(f"Training on {n_samples} samples, {n_features} features for {iters} iterations\n")

    # 1. Generate data ONCE before the timing loop.
    print("Generating random dataset on CPU...")
    X = np.random.rand(n_samples, n_features)
    y = (np.random.rand(n_samples) > 0.5).astype(np.float64)

    # 2. Perform one untimed warm-up run.
    print("Performing warm-up run...")
    _ = train_logistic_regression(np, X, y, iters, alpha)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs. The run count is passed in explicitly
    # rather than read from the global args.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed
        _ = train_logistic_regression(np, X, y, iters, alpha)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.3f}s")
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.3f}s\n")
    return avg


def benchmark_cunumeric(n_samples, n_features, iters, alpha, runs):
    """Runs the logistic regression benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np  # Also import numpy for the canonical synchronization

    print(f"--- cuNumeric (GPU) Benchmark ---")
    print(f"Training on {n_samples} samples, {n_features} features for {iters} iterations\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    print("Generating random dataset on GPU...")
    X = cn.random.rand(n_samples, n_features)
    y = (cn.random.rand(n_samples) > 0.5).astype(np.float64)

    # 2. Perform a crucial untimed warm-up run for JIT compilation.
    print("Performing warm-up run...")
    w_warmup = train_logistic_regression(cn, X, y, iters, alpha)
    # The best practice for synchronization: force a copy back to the CPU.
    _ = np.array(w_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # Launch the operation on the GPU
        w = train_logistic_regression(cn, X, y, iters, alpha)
        # Synchronize by converting the final result back to a NumPy array.
        np.array(w)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.3f}s")
        del w
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.3f}s\n")
    return avg


if __name__ == "__main__":
    # A more robust argument parsing setup
    parser = argparse.ArgumentParser(
        description="Benchmark logistic regression on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    # Hyperparameters for the model
    parser.add_argument(
        "-n", "--n_samples", type=int, default=2_000_000, help="Number of data samples"
    )
    parser.add_argument(
        "-d", "--n_features", type=int, default=10, help="Number of features"
    )
    parser.add_argument(
        "-i", "--iters", type=int, default=500, help="Number of gradient descent iterations"
    )
    parser.add_argument(
        "-a", "--alpha", type=float, default=0.1, help="Learning rate"
    )
    # Benchmark control
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    parser.add_argument(
        "--cunumeric", action="store_true", help="Run the cuNumeric (GPU) version"
    )
    args, unknown = parser.parse_known_args()

    # Dispatcher logic
    if args.cunumeric or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n_samples, args.n_features, args.iters, args.alpha, args.runs)
    else:
        benchmark_numpy(args.n_samples, args.n_features, args.iters, args.alpha, args.runs)
And the outputs.
(cunumeric-env) tom@tpr-desktop:~$ python example2.py
--- NumPy (CPU) Benchmark ---
Training on 2000000 samples, 10 features for 500 iterations
Generating random dataset on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 12.292s
Run 2: time = 11.830s
Run 3: time = 11.903s
Run 4: time = 12.843s
Run 5: time = 11.964s
NumPy average: 12.166s
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example2.py --cunumeric
[0 - 7f04b535c480] 0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7f04b535c480] 0.000000 {4}{topology}: can't open /sys/devices/system/node/
[0 - 7f04b535c480] 0.001149 {4}{threads}: reservation ('GPU ctxsync 0x55fb037cf140') cannot be satisfied
--- cuNumeric (GPU) Benchmark ---
Training on 2000000 samples, 10 features for 500 iterations
Generating random dataset on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 1.964s
Run 2: time = 1.957s
Run 3: time = 1.968s
Run 4: time = 1.955s
Run 5: time = 1.960s
cuNumeric average: 1.961s
Not quite as impressive as our first example, but a 5x to 6x speedup on an already fast NumPy program is not to be sniffed at.
Code example 3 — Solving linear equations
This script benchmarks how long it takes to solve a dense 3000×3000 linear system. This is a fundamental operation in linear algebra: solving an equation of the form Ax = b, where A is a giant grid of numbers (a 3000×3000 matrix in this case) and b is a list of numbers (a vector).
The goal is to find the unknown vector x that makes the equation true. This computationally intensive task is at the heart of many scientific simulations, engineering problems, financial models, and even some AI algorithms.
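As a quick sanity check (separate from the benchmark itself), it’s worth confirming that the returned x actually satisfies the system. Here is a minimal sketch using NumPy; because cuNumeric mirrors the API, swapping the import for cupynumeric and launching it with legate should work the same way:

# residual_check.py - verify a linear solve by checking the residual
import numpy as np

n = 1000
A = np.random.randn(n, n).astype(np.float32)
b = np.random.randn(n).astype(np.float32)

x = np.linalg.solve(A, b)

# The relative residual ||Ax - b|| / ||b|| should be tiny for a
# well-conditioned system (around 1e-4 or smaller in float32).
rel_residual = np.linalg.norm(A @ x - b) / np.linalg.norm(b)
print(f"relative residual: {rel_residual:.2e}")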
import time
import gc
import argparse
import sys  # Import sys to check arguments

# Note: The library imports (numpy and cupynumeric) are now done *inside*
# their respective functions to keep them separate and avoid import errors.


def benchmark_numpy(n, runs):
    """Runs the linear solve benchmark using standard NumPy on the CPU."""
    import numpy as np

    print(f"--- NumPy (CPU) Benchmark ---")
    print(f"Solving {n}×{n} A x = b ({runs} runs)\n")

    # 1. Generate data ONCE before the timing loop.
    print("Generating random system on CPU...")
    A = np.random.randn(n, n).astype(np.float32)
    b = np.random.randn(n).astype(np.float32)

    # 2. Perform one untimed warm-up run. This is good practice even for
    # the CPU to ensure caches are warm and any one-time setup is done.
    print("Performing warm-up run...")
    _ = np.linalg.solve(A, b)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # The operation being timed
        x = np.linalg.solve(A, b)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        # Clean up the result to be safe with memory
        del x
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.6f}s\n")
    return avg


def benchmark_cunumeric(n, runs):
    """Runs the linear solve benchmark using cuNumeric on the GPU."""
    import cupynumeric as cn
    import numpy as np  # Also import numpy for the canonical synchronization

    print(f"--- cuNumeric (GPU) Benchmark ---")
    print(f"Solving {n}×{n} A x = b ({runs} runs)\n")

    # 1. Generate data ONCE on the GPU before the timing loop.
    # This ensures we are not timing the data transfer in our main loop.
    print("Generating random system on GPU...")
    A = cn.random.randn(n, n).astype(np.float32)
    b = cn.random.randn(n).astype(np.float32)

    # 2. Perform a crucial untimed warm-up run. This handles JIT
    # compilation and other one-time GPU setup costs.
    print("Performing warm-up run...")
    x_warmup = cn.linalg.solve(A, b)
    # The best practice for synchronization: force a copy back to the CPU.
    _ = np.array(x_warmup)
    print("Warm-up complete.\n")

    # 3. Perform the timed runs.
    times = []
    for i in range(runs):
        start = time.time()
        # Launch the operation on the GPU
        x = cn.linalg.solve(A, b)
        # Synchronize by converting the result to a host-side NumPy array.
        # This is guaranteed to block until the GPU has finished.
        np.array(x)
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        # Clean up the GPU array result
        del x
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.6f}s\n")
    return avg


if __name__ == "__main__":
    # A more robust argument parsing setup
    parser = argparse.ArgumentParser(
        description="Benchmark linear solve on NumPy (CPU) vs. cuNumeric (GPU)."
    )
    parser.add_argument(
        "-n", "--n", type=int, default=3000, help="Matrix size (n x n)"
    )
    parser.add_argument(
        "-r", "--runs", type=int, default=5, help="Number of timing runs"
    )
    # Use parse_known_args() to handle potential extra arguments from Legate
    args, unknown = parser.parse_known_args()

    # The dispatcher logic: check if "--cunumeric" is in the command line.
    # This is a simple and effective way to switch between modes.
    if "--cunumeric" in sys.argv or "--cunumeric" in unknown:
        benchmark_cunumeric(args.n, args.runs)
    else:
        benchmark_numpy(args.n, args.runs)
The outputs.
(cunumeric-env) tom@tpr-desktop:~$ python example4.py
--- NumPy (CPU) Benchmark ---
Solving 3000×3000 A x = b (5 runs)
Generating random system on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.133075s
Run 2: time = 0.126129s
Run 3: time = 0.135849s
Run 4: time = 0.137383s
Run 5: time = 0.138805s
NumPy average: 0.134248s
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example4.py --cunumeric
[0 - 7f29f42ce480] 0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7f29f42ce480] 0.000000 {4}{topology}: can't open /sys/devices/system/node/
[0 - 7f29f42ce480] 0.000053 {4}{threads}: reservation ('GPU ctxsync 0x562e88c28700') cannot be satisfied
--- cuNumeric (GPU) Benchmark ---
Solving 3000×3000 A x = b (5 runs)
Generating random system on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.009685s
Run 2: time = 0.010043s
Run 3: time = 0.009966s
Run 4: time = 0.009739s
Run 5: time = 0.009383s
cuNumeric average: 0.009763s
That is a tremendous result. The Nvidia cuNumeric run is roughly 14x faster than the NumPy run (0.1342s versus 0.0098s on average).
Code example 4 — Sorting
Sorting is such a fundamental part of everything that happens in computing, and modern computers are so fast that most developers don’t even think about it. But let’s see how much of a difference using cuNumeric can make to this ubiquitous operation. We’ll sort a large 1D array of 30,000,000 numbers.
# benchmark_sort.py
import time
import sys
import gc

# Array size
n = 30_000_000  # 30 million elements


def benchmark_numpy():
    import numpy as np

    print("--- NumPy (CPU) Benchmark ---")
    print(f"Sorting an array of {n} elements (5 runs)\n")

    # Generate the data once, then warm up before timing,
    # as in the earlier examples.
    print("Creating random array on CPU...")
    data = np.random.randn(n).astype(np.float32)

    print("Performing warm-up run...")
    _ = np.sort(data)
    print("Warm-up complete.\n")

    times = []
    for i in range(5):
        start = time.time()
        _ = np.sort(data)  # np.sort returns a sorted copy
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\nNumPy average: {avg:.6f}s\n")


def benchmark_cunumeric():
    import cupynumeric as np

    print("--- cuNumeric (GPU) Benchmark ---")
    print(f"Sorting an array of {n} elements (5 runs)\n")

    print("Creating random array on GPU...")
    data = np.random.randn(n).astype(np.float32)

    print("Performing warm-up run...")
    _ = np.sort(data)
    _ = np.linalg.norm(np.zeros(()))  # Force GPU sync
    print("Warm-up complete.\n")

    times = []
    for i in range(5):
        start = time.time()
        _ = np.sort(data)
        # Force GPU sync before stopping the clock
        _ = np.linalg.norm(np.zeros(()))
        end = time.time()
        duration = end - start
        times.append(duration)
        print(f"Run {i+1}: time = {duration:.6f}s")
        gc.collect()

    avg = sum(times) / len(times)
    print(f"\ncuNumeric average: {avg:.6f}s\n")


if __name__ == "__main__":
    if "--cunumeric" in sys.argv:
        benchmark_cunumeric()
    else:
        benchmark_numpy()
The outputs.
(cunumeric-env) tom@tpr-desktop:~$ python example5.py
--- NumPy (CPU) Benchmark ---
Sorting an array of 30000000 elements (5 runs)
Creating random array on CPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.588777s
Run 2: time = 0.586813s
Run 3: time = 0.586745s
Run 4: time = 0.586525s
Run 5: time = 0.583783s
NumPy average: 0.586529s
-----------------------------
(cunumeric-env) tom@tpr-desktop:~$ LEGATE_AUTO_CONFIG=0 legate --cpus 1 --gpus 1 --omps 0 example5.py --cunumeric
[0 - 7fd9e4615480] 0.000000 {5}{module_config}: Module numa can not detect resources.
[0 - 7fd9e4615480] 0.000000 {4}{topology}: can't open /sys/devices/system/node/
[0 - 7fd9e4615480] 0.000082 {4}{threads}: reservation ('GPU ctxsync 0x564489232fd0') cannot be satisfied
--- cuNumeric (GPU) Benchmark ---
Sorting an array of 30000000 elements (5 runs)
Creating random array on GPU...
Performing warm-up run...
Warm-up complete.
Run 1: time = 0.010857s
Run 2: time = 0.007927s
Run 3: time = 0.007921s
Run 4: time = 0.008240s
Run 5: time = 0.007810s
cuNumeric average: 0.008551s
-------------------------------
Yet another hugely impressive performance from cuNumeric and Legate.
Summary
This article introduced cuNumeric, an NVIDIA library designed as a high-performance, drop-in replacement for NumPy. The key takeaway is that data scientists can accelerate their existing Python code on NVIDIA GPUs with minimal effort, often by simply changing a single import line and running the script with the ‘legate’ command.
Two main components power the technology:
- Legate: An open-source runtime layer from NVIDIA that automatically translates high-level Python operations into tasks. It intelligently manages distributing these tasks across single or multiple GPUs, handling data partitioning, memory management (even spilling to disk if needed), and optimising communication.
- cuNumeric: The user-facing library that mirrors the NumPy API. When you make a call like np.matmul(), cuNumeric converts it into a task for the Legate engine to execute on the GPU.
I was able to validate Nvidia’s performance claims by running four benchmark tests on my desktop PC (with an NVIDIA RTX 4070 Ti GPU), comparing standard NumPy on the CPU against cuNumeric on the GPU.
The results demonstrate significant performance gains for cuNumeric:
- Matrix Multiplication: ~10x faster than NumPy.
- Logistic Regression Training: ~6x faster.
- Solving Linear Equations: roughly 14x faster.
- Sorting a Large Array: Another huge improvement, running approximately 70x faster.
In conclusion, I showed that cuNumeric successfully delivers on its promise, making the immense computational power of GPUs accessible to the broader Python data science community without requiring a steep learning curve or a complete code rewrite.
For more information and links to related resources, check out Nvidia’s original announcement of cuNumeric.