GPU Support in NuMojo
Motivation
NuMojo aims to serve as a drop-in replacement for NumPy while leveraging Mojo's performance characteristics and native compilation. With GPU backends (e.g., Metal, CUDA, ROCm), we can match, and potentially exceed, C++/CUDA performance. The challenge lies in designing a unified and ergonomic device model that lets users transparently scale their code from CPU to GPU without major API changes or performance regressions.
Proposed approaches
I outline three main architectural approaches:
Option 1: Unified NDArray with Device-Aware Storage
This approach extends the existing NDArray to include device specialization at compile time.
It provides a PyTorch-like .to[device]() API for explicit device transfers while keeping a single unified interface across backends, similar to torch.tensor. Users can create an NDArray on either the CPU or the GPU simply by supplying a Device parameter.
Key Properties:
- Unified API for CPU, GPU, and future devices
- Compile-time device specialization
- Minimal breaking changes to current codebase
- Simple integration path for future devices
Cons:
- Many compile-time if conditions are needed to differentiate CPU and GPU code paths, which clutters the implementation (sketched below).
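To make that cost concrete, here is a rough internal sketch, not NuMojo's actual code: the struct name, the Int-based device encoding, and the placeholder kernel calls are all assumptions for illustration only.
struct NDArraySketch[dtype: DType, device: Int = 0]:  # assumed encoding: 0 = CPU, 1 = MPS
    var size: Int

    fn __init__(out self, size: Int):
        self.size = size

    fn __matmul__(self, other: Self) -> Self:
        # Every device-aware method repeats a branch like this, which is
        # the source of the compile-time-if clutter noted above.
        @parameter
        if device == 0:
            return Self(self.size)  # placeholder: run the vectorized CPU kernel
        else:
            return Self(self.size)  # placeholder: dispatch a Metal/CUDA kernel
In the real design, a dedicated Device parameter type (as in the usage example below) would replace the Int here, but the branching pattern stays the same.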
Example Usage:
fn main() raises:
    alias SIZE: Int = 1024
    alias cpu: Device = Device.CPU
    alias mps: Device = Device.MPS

    # Create CPU arrays
    var arr_cpu_1 = arange[f32](1.0, 101.0, 1).reshape(Shape(10, 10))
    var arr_cpu_2 = arange[f32](1.0, 101.0, 1).reshape(Shape(10, 10))
    var matmul_cpu = arr_cpu_1 @ arr_cpu_2

    # Create GPU arrays (Metal backend)
    var arr_gpu_1 = arange[f32, device=mps](1.0, 101.0, 1).reshape(Shape(10, 10))
    var arr_gpu_2 = arange[f32, device=mps](1.0, 101.0, 1).reshape(Shape(10, 10))
    var matmul_gpu = arr_gpu_1 @ arr_gpu_2

    # Matrix API variant
    var mat_cpu_1 = Matrix[f32, cpu]((SIZE, SIZE), fill_value=1.0)
    var mat_cpu_2 = Matrix[f32, cpu]((SIZE, SIZE), fill_value=2.0)
    var matmul_mat_cpu = mat_cpu_1 @ mat_cpu_2

    var mat_gpu_1 = Matrix[f32, mps]((SIZE, SIZE), fill_value=1.0)
    var mat_gpu_2 = Matrix[f32, mps]((SIZE, SIZE), fill_value=2.0)
    var matmul_mat_gpu = mat_gpu_1 @ mat_gpu_2

    # Transfer between devices
    var gpu_from_cpu_1 = mat_cpu_1.to[mps]()
    var gpu_from_cpu_2 = mat_cpu_2.to[mps]()
    var matmul_gpu_from_cpu = gpu_from_cpu_1 @ gpu_from_cpu_2

Option 2: Separate Device-Specific Classes
This design introduces explicit device-specific classes, e.g. NDArrayCPU and NDArrayGPU. Each type directly manages its own memory layout and compute kernels.
Pros:
- Zero device abstraction overhead
- Enables backend-specific optimizations
Cons:
- Significant code duplication, since every routine needs a per-device overload (see the sketch below)
- Poor ergonomics for users switching between CPU/GPU
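As a rough sketch of that duplication cost (all names here are illustrative placeholders, not NuMojo's actual types or functions), each device gets its own type and every routine ends up with one overload per type:
struct NDArrayCPUSketch:
    var size: Int

    fn __init__(out self, size: Int):
        self.size = size

struct NDArrayGPUSketch:
    var size: Int

    fn __init__(out self, size: Int):
        self.size = size

# Every routine then needs one overload per device type, with largely parallel bodies:
fn matmul(a: NDArrayCPUSketch, b: NDArrayCPUSketch) -> NDArrayCPUSketch:
    return NDArrayCPUSketch(a.size)  # placeholder: vectorized/parallel CPU loops

fn matmul(a: NDArrayGPUSketch, b: NDArrayGPUSketch) -> NDArrayGPUSketch:
    return NDArrayGPUSketch(a.size)  # placeholder: Metal/CUDA kernel dispatch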
Example:
alias mps = Device.MPS

var x_cpu_1 = NDArrayCPU[f32](Shape(1024, 1024))
var x_cpu_2 = NDArrayCPU[f32](Shape(1024, 1024))
var result_cpu = x_cpu_1 @ x_cpu_2

var x_gpu_1 = NDArrayGPU[f32](Shape(1024, 1024))
var x_gpu_2 = NDArrayGPU[f32](Shape(1024, 1024))
var result_gpu = x_gpu_1 @ x_gpu_2

var x_cpu_to_gpu = x_cpu_1.to[mps]()

This model may be more suitable for low-level or embedded contexts, but it is less ideal for NuMojo's NumPy-compatibility goals.
Option 3: Static Shape GPU Arrays
This approach introduces a StaticNDArray type with compile-time known shapes and dtypes, enabling aggressive optimizations such as loop unrolling and vectorization.
Pros:
- Maximum performance and compile-time safety
- Enables highly optimized kernels for fixed-size data
Cons:
- Limited flexibility for dynamic workloads
- Increased API and implementation complexity
- Requires separate type definitions (NDArray vs StaticNDArray)
This model could coexist with the dynamic NDArray, targeting scientific computing and ML inference workloads where shapes are known ahead of time.
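No example exists yet for this option, so the following is only a hypothetical sketch of how usage could look; StaticNDArray, its parameter list, and fill_value are assumptions, not an implemented API.
fn main() raises:
    alias mps: Device = Device.MPS

    # Shape and dtype are compile-time parameters, so kernels can be fully
    # specialized, unrolled, and vectorized for this exact size.
    var a = StaticNDArray[f32, Shape(1024, 1024), device=mps](fill_value=1.0)
    var b = StaticNDArray[f32, Shape(1024, 1024), device=mps](fill_value=2.0)

    # A shape mismatch would surface as a compile-time error rather than a runtime check.
    var c = a @ b
A .to[device]() transfer could carry the static shape across devices unchanged, which would keep this API consistent with Option 1.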
Note:
- Many of these limitations may be mitigated in the future as Mojo evolves (e.g., with trait parameters and advanced compile-time metaprogramming features).
- While NuMojo aims to be largely NumPy-compatible, we shouldn’t hesitate to improve the API design where it makes sense, even if it introduces intentional deviations from NumPy’s behavior.
Preliminary Results:
Using Options 1 and 2:
- Observed near-zero abstraction overhead with the unified approach (Option 1)
- Achieved ~15× speedup on Apple Silicon GPU (MPS backend) for matmul with SIZE = 2048 using basic GPU kernels
See the attached figure for a CPU vs. GPU comparison of matmul using Options 1 and 2 with basic tiled GPU kernels.
