GPU support in NuMojo #273

@shivasankarka

Description

GPU Support in NuMojo

Motivation

NuMojo aims to serve as a drop-in replacement for NumPy while leveraging Mojo’s performance characteristics and native compilation. With GPU backends (e.g., Metal, CUDA, ROCm), we can approach, and in some cases exceed, the performance of hand-written C++/CUDA. The challenge lies in designing a unified and ergonomic device model that allows users to transparently scale their code from CPU to GPU without major API changes or performance regressions.

Proposed approaches

I outline three main architectural approaches:

Option 1: Unified NDArray with Device-Aware Storage

This approach extends the existing NDArray with a compile-time device parameter.
It provides a PyTorch-like .to[device]() API for explicit device transfers while keeping a single unified interface across backends, similar to torch.Tensor. Users create an NDArray on either the CPU or the GPU simply by supplying the Device parameter.

Key Properties:

  • Unified API for CPU, GPU, and future devices
  • Compile-time device specialization
  • Minimal breaking changes to current codebase
  • Simple integration path for future devices

Cons:

  • Requires many compile-time if branches inside shared methods to differentiate CPU and GPU code paths, which clutters the implementation.
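To make the trade-off concrete, a device-aware method would branch at compile time inside each shared operation. A minimal sketch, assuming a hypothetical Device parameter and internal _matmul_cpu/_matmul_gpu kernels (none of these names are current NuMojo API):

```mojo
struct NDArray[dtype: DType, device: Device = Device.CPU]:
    var _buf: UnsafePointer[Scalar[dtype]]
    var shape: Shape

    fn __matmul__(self, other: Self) raises -> Self:
        # Resolved at compile time: only one code path is emitted
        # for a given [dtype, device] specialization.
        @parameter
        if device == Device.CPU:
            return _matmul_cpu(self, other)
        else:
            return _matmul_gpu[device](self, other)
```

Because the branch is a compile-time @parameter if, there is no runtime dispatch cost; the downside is that every shared method accumulates one such branch per backend.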

Example Usage:

fn main() raises:
    alias SIZE: Int = 1024
    alias cpu: Device = Device.CPU
    alias mps: Device = Device.MPS

    # Create CPU arrays
    var arr_cpu_1 = arange[f32](1.0, 101.0, 1).reshape(Shape(10, 10))
    var arr_cpu_2 = arange[f32](1.0, 101.0, 1).reshape(Shape(10, 10))
    var matmul_cpu = arr_cpu_1 @ arr_cpu_2

    # Create GPU arrays (Metal backend)
    var arr_gpu_1 = arange[f32, device=mps](1.0, 101.0, 1).reshape(Shape(10, 10))
    var arr_gpu_2 = arange[f32, device=mps](1.0, 101.0, 1).reshape(Shape(10, 10))
    var matmul_gpu = arr_gpu_1 @ arr_gpu_2

    # Matrix API variant
    var mat_cpu_1 = Matrix[f32, cpu]((SIZE, SIZE), fill_value=1.0)
    var mat_cpu_2 = Matrix[f32, cpu]((SIZE, SIZE), fill_value=2.0)
    var matmul_mat_cpu = mat_cpu_1 @ mat_cpu_2

    var mat_gpu_1 = Matrix[f32, mps]((SIZE, SIZE), fill_value=1.0)
    var mat_gpu_2 = Matrix[f32, mps]((SIZE, SIZE), fill_value=2.0)
    var matmul_mat_gpu = mat_gpu_1 @ mat_gpu_2

    # Transfer between devices
    var gpu_from_cpu_1 = mat_cpu_1.to[mps]()
    var gpu_from_cpu_2 = mat_cpu_2.to[mps]()
    var matmul_gpu_from_cpu = gpu_from_cpu_1 @ gpu_from_cpu_2

Option 2: Separate Device-Specific Classes

This design introduces explicit device-specific classes, e.g. NDArrayCPU and NDArrayGPU. Each type directly manages its own memory layout and compute kernels.

Pros:

  • Zero device abstraction overhead
  • Enables backend-specific optimizations

Cons:

  • Significant code duplication for function overloading
  • Poor ergonomics for users switching between CPU/GPU

Example:

alias mps = Device.MPS
var x_cpu_1 = NDArrayCPU[f32](Shape(1024, 1024))
var x_cpu_2 = NDArrayCPU[f32](Shape(1024, 1024))
var result_cpu = x_cpu_1 @ x_cpu_2

var x_gpu_1 = NDArrayGPU[f32](Shape(1024, 1024))
var x_gpu_2 = NDArrayGPU[f32](Shape(1024, 1024))
var result_gpu = x_gpu_1 @ x_gpu_2

var x_cpu_to_gpu = x_cpu_1.to[mps]()

This model may be more suitable for low-level or embedded contexts, but less ideal for NuMojo’s NumPy-compatibility goals.
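The code-duplication con can be made concrete: every operation needs one overload per device type, even when the signatures are otherwise identical. A hypothetical sketch (names are illustrative, not existing NuMojo API):

```mojo
# One overload per device type; adding a backend means
# duplicating every such function.
fn matmul(a: NDArrayCPU[f32], b: NDArrayCPU[f32]) raises -> NDArrayCPU[f32]:
    # CPU kernel: vectorized/parallelized loops over host memory
    ...

fn matmul(a: NDArrayGPU[f32], b: NDArrayGPU[f32]) raises -> NDArrayGPU[f32]:
    # GPU kernel: enqueue a tiled matmul on the device queue
    ...
```

The overloads give each backend full freedom to optimize, at the cost of maintaining parallel implementations of the whole API surface.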

Option 3: Static Shape GPU Arrays

This approach introduces a StaticNDArray type with compile-time known shapes and dtypes, enabling aggressive optimizations such as loop unrolling and vectorization.

Pros

  • Maximum performance and compile-time safety
  • Enables highly optimized kernels for fixed-size data

Cons

  • Limited flexibility for dynamic workloads
  • Increased API and implementation complexity
  • Requires separate type definitions (NDArray vs StaticNDArray)

This model could coexist with the dynamic NDArray, targeting scientific computing and ML inference workloads where shapes are known ahead of time.
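A static-shape type along these lines might carry its dimensions as compile-time parameters, so element count, strides, and loop bounds are all known to the compiler. A rough sketch under those assumptions (StaticMatrix is illustrative, not an implemented type):

```mojo
struct StaticMatrix[dtype: DType, rows: Int, cols: Int]:
    # Shape is a compile-time parameter, so storage can be
    # fixed-size and inner loops fully unrolled/vectorized.
    var _data: InlineArray[Scalar[dtype], rows * cols]

    fn __matmul__[other_cols: Int](
        self, other: StaticMatrix[dtype, cols, other_cols]
    ) -> StaticMatrix[dtype, rows, other_cols]:
        # A dimension mismatch is a compile error rather than a
        # runtime raise; the kernel sees constant loop bounds.
        ...
```

Note that shape checking moves entirely to compile time: the inner dimension of the left operand is forced to match the outer dimension of the right operand by the type signature itself.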

Note:

  1. Many of these limitations may be mitigated in the future as Mojo evolves (e.g., with trait parameters and advanced compile-time metaprogramming features).
  2. While NuMojo aims to be largely NumPy-compatible, we shouldn’t hesitate to improve the API design where it makes sense, even if it introduces intentional deviations from NumPy’s behavior.

Preliminary Results:

Using Approaches 1 and 2:

  • Observed near-zero abstraction overhead with the unified approach (Option 1)
  • Achieved ~15× speedup on Apple Silicon GPU (MPS backend) for matmul with SIZE = 2048 using basic GPU kernels

See the attached figure for a CPU vs. GPU comparison of matmul using Approaches 1 and 2 with basic tiled GPU kernels.
[Attached figure: matmul_benchmark_plots_matrix]
