GPU Support in NuMojo
Motivation
NuMojo aims to serve as a drop-in replacement for NumPy while leveraging Mojo's performance characteristics and native compilation. With GPU backends (e.g., Metal, CUDA, ROCm), we can match, and potentially exceed, C++/CUDA performance. The challenge lies in designing a unified and ergonomic device model that lets users transparently scale their code from CPU to GPU without major API changes or performance regressions.
Proposed approaches
I outline three main architectural approaches:
Option 1: Unified NDArray with Device-Aware Storage
This approach extends the existing NDArray to include device specialization at compile time.
It provides a PyTorch-like .to[device]() API for explicit device transfers while keeping a single unified interface across backends, similar to torch.tensor. Users can create an NDArray on either the CPU or the GPU simply by supplying a Device parameter.
Key Properties:
- Unified API for CPU, GPU, and future devices
- Compile-time device specialization
- Minimal breaking changes to current codebase
- Simple integration path for future devices
Cons:
- Many compile-time if conditions are needed to differentiate CPU and GPU code paths, which clutters the implementation (sketched below).
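To make that cost concrete, here is a rough internal sketch, not NuMojo's actual code: the struct name, the Int-based device encoding, and the placeholder kernel calls are all assumptions for illustration only.
struct NDArraySketch[dtype: DType, device: Int = 0]:  # assumed encoding: 0 = CPU, 1 = MPS
    var size: Int

    fn __init__(out self, size: Int):
        self.size = size

    fn __matmul__(self, other: Self) -> Self:
        # Every device-aware method repeats a branch like this, which is
        # the source of the compile-time-if clutter noted above.
        @parameter
        if device == 0:
            return Self(self.size)  # placeholder: run the vectorized CPU kernel
        else:
            return Self(self.size)  # placeholder: dispatch a Metal/CUDA kernel
In the real design, a dedicated Device parameter type (as in the usage example below) would replace the Int here, but the branching pattern stays the same.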
Example Usage:
fn main() raises:
    alias SIZE: Int = 1024
    alias cpu: Device = Device.CPU
    alias mps: Device = Device.MPS

    # Create CPU arrays
    var arr_cpu_1 = arange[f32](1.0, 101.0, 1).reshape(Shape(10, 10))
    var arr_cpu_2 = arange[f32](1.0, 101.0, 1).reshape(Shape(10, 10))
    var matmul_cpu = arr_cpu_1 @ arr_cpu_2

    # Create GPU arrays (Metal backend)
    var arr_gpu_1 = arange[f32, device=mps](1.0, 101.0, 1).reshape(Shape(10, 10))
    var arr_gpu_2 = arange[f32, device=mps](1.0, 101.0, 1).reshape(Shape(10, 10))
    var matmul_gpu = arr_gpu_1 @ arr_gpu_2

    # Matrix API variant
    var mat_cpu_1 = Matrix[f32, cpu]((SIZE, SIZE), fill_value=1.0)
    var mat_cpu_2 = Matrix[f32, cpu]((SIZE, SIZE), fill_value=2.0)
    var matmul_mat_cpu = mat_cpu_1 @ mat_cpu_2

    var mat_gpu_1 = Matrix[f32, mps]((SIZE, SIZE), fill_value=1.0)
    var mat_gpu_2 = Matrix[f32, mps]((SIZE, SIZE), fill_value=2.0)
    var matmul_mat_gpu = mat_gpu_1 @ mat_gpu_2

    # Transfer between devices
    var gpu_from_cpu_1 = mat_cpu_1.to[mps]()
    var gpu_from_cpu_2 = mat_cpu_2.to[mps]()
    var matmul_gpu_from_cpu = gpu_from_cpu_1 @ gpu_from_cpu_2

Option 2: Separate Device-Specific Classes
This design introduces explicit device-specific classes, e.g. NDArrayCPU and NDArrayGPU. Each type directly manages its own memory layout and compute kernels.
Pros:
- Zero device abstraction overhead
- Enables backend-specific optimizations
Cons:
- Significant code duplication, since every routine needs a per-device overload (see the sketch below)
- Poor ergonomics for users switching between CPU/GPU
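As a rough sketch of that duplication cost (all names here are illustrative placeholders, not NuMojo's actual types or functions), each device gets its own type and every routine ends up with one overload per type:
struct NDArrayCPUSketch:
    var size: Int

    fn __init__(out self, size: Int):
        self.size = size

struct NDArrayGPUSketch:
    var size: Int

    fn __init__(out self, size: Int):
        self.size = size

# Every routine then needs one overload per device type, with largely parallel bodies:
fn matmul(a: NDArrayCPUSketch, b: NDArrayCPUSketch) -> NDArrayCPUSketch:
    return NDArrayCPUSketch(a.size)  # placeholder: vectorized/parallel CPU loops

fn matmul(a: NDArrayGPUSketch, b: NDArrayGPUSketch) -> NDArrayGPUSketch:
    return NDArrayGPUSketch(a.size)  # placeholder: Metal/CUDA kernel dispatch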
Example:
alias mps = Device.MPS

var x_cpu_1 = NDArrayCPU[f32](Shape(1024, 1024))
var x_cpu_2 = NDArrayCPU[f32](Shape(1024, 1024))
var result_cpu = x_cpu_1 @ x_cpu_2

var x_gpu_1 = NDArrayGPU[f32](Shape(1024, 1024))
var x_gpu_2 = NDArrayGPU[f32](Shape(1024, 1024))
var result_gpu = x_gpu_1 @ x_gpu_2

var x_cpu_to_gpu = x_cpu_1.to[mps]()

This model may be more suitable for low-level or embedded contexts, but it is less ideal for NuMojo's NumPy-compatibility goals.
Option 3: Static Shape GPU Arrays
This approach introduces a StaticNDArray type with compile-time known shapes and dtypes, enabling aggressive optimizations such as loop unrolling and vectorization.
Pros:
- Maximum performance and compile-time safety
- Enables highly optimized kernels for fixed-size data
Cons:
- Limited flexibility for dynamic workloads
- Increased API and implementation complexity
- Requires separate type definitions (NDArray vs StaticNDArray)
This model could coexist with the dynamic NDArray, targeting scientific computing and ML inference workloads where shapes are known ahead of time.
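No example exists yet for this option, so the following is only a hypothetical sketch of how usage could look; StaticNDArray, its parameter list, and fill_value are assumptions, not an implemented API.
fn main() raises:
    alias mps: Device = Device.MPS

    # Shape and dtype are compile-time parameters, so kernels can be fully
    # specialized, unrolled, and vectorized for this exact size.
    var a = StaticNDArray[f32, Shape(1024, 1024), device=mps](fill_value=1.0)
    var b = StaticNDArray[f32, Shape(1024, 1024), device=mps](fill_value=2.0)

    # A shape mismatch would surface as a compile-time error rather than a runtime check.
    var c = a @ b
A .to[device]() transfer could carry the static shape across devices unchanged, which would keep this API consistent with Option 1.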
Note:
- Many of these limitations may be mitigated in the future as Mojo evolves (e.g., with trait parameters and advanced compile-time metaprogramming features).
- While NuMojo aims to be largely NumPy-compatible, we shouldn’t hesitate to improve the API design where it makes sense, even if it introduces intentional deviations from NumPy’s behavior.
Preliminary Results:
Using Options 1 and 2:
- Observed near-zero abstraction overhead with the unified approach (Option 1)
- Achieved ~15× speedup on Apple Silicon GPU (MPS backend) for matmul with SIZE = 2048 using basic GPU kernels
See the attached figure for a CPU vs. GPU comparison of matmul using Options 1 and 2 with basic tiled GPU kernels.
