Train and deploy custom neural networks from your browser — powered by a C++ framework that compiles and runs models natively.
docker compose up

Open http://localhost:5173 — upload a dataset, describe your model, train, and deploy.
Set ANTHROPIC_API_KEY in .env to enable AI-assisted architecture design.
- AI Architecture Designer — Describe your model in natural language; Claude suggests layer configurations, hyperparameters, and training recipes
- Live Training Dashboard — Real-time loss curves, accuracy charts, stat cards, and training logs streamed to the browser
- One-Click Deploy — Push trained models to AWS EC2 as inference APIs with a single button
- Visual Model Builder — Drag-and-drop architecture graph with interactive node editing
- Dataset Management — Upload ZIPs, import from Hugging Face or URL, automatic preprocessing for images/text/tabular data
- Model Zoo — Browse pretrained architectures, inspect model cards, run predictions in the playground
- Native C++ Performance — Models compile to optimized C++ with SIMD, OpenMP, Metal, and CUDA backends
graph TB
subgraph Frontend["Frontend (Next.js)"]
UI[React UI]
Graph[Architecture Graph]
Charts[Training Charts]
end
subgraph Backend["Backend (FastAPI)"]
API[REST API]
LLM[LLM Service<br/>Claude API]
CodeGen[Code Generator]
DataMgr[Dataset Manager]
end
subgraph Core["C++ Core"]
Tensor[Tensor + Autograd]
Layers[Layers / Optimizers / Loss]
GPU[Metal / CUDA Backends]
end
UI -->|HTTP| API
Graph -->|HTTP| API
Charts -->|SSE| API
API --> LLM
API --> CodeGen
API --> DataMgr
CodeGen -->|generates| Core
Core -->|compiles & trains| Tensor
Tensor --> Layers
Layers --> GPU
Full API documentation is available at /docs (Swagger UI) and /redoc while the server is running. Both include request/response schemas, authentication details, and interactive "try it out" support.
Quick-reference for the main endpoint groups:
| Group | Endpoints | Description |
|---|---|---|
| Auth | POST /auth/register, POST /auth/login | User registration, login, and JWT authentication |
| Datasets | POST /datasets/upload, GET /datasets/{id} | Upload, list, and manage training datasets |
| Design | POST /design/suggest, POST /design/refine | AI-assisted model architecture suggestions |
| Training | POST /train, GET /train/{job_id} | Start training jobs, stream progress, retrieve results |
| Models | GET /models, GET /models/{id} | Browse trained models and metadata |
| Predict | POST /predict/{model_id} | Run inference on trained models |
| Deploy | POST /deploy | One-click deploy to AWS EC2 |
| Credentials | /credentials | Manage cloud provider and API credentials |
| Health | /health | Server health and readiness checks |
- Public Roadmap: GitHub Projects board — see what's planned and vote on priorities
- Contributing: See CONTRIBUTING.md for setup instructions and PR guidelines
- Changelog: See CHANGELOG.md for release history
Add this badge to your project's README:
[![Built with Whitematter](https://raw.githubusercontent.com/hwang2409/whitematter/main/docs/badge.svg)](https://github.com/hwang2409/whitematter)

Or in HTML:
<a href="https://github.com/hwang2409/whitematter">
<img src="https://raw.githubusercontent.com/hwang2409/whitematter/main/docs/badge.svg" alt="Built with Whitematter" />
</a>

Whitematter's core is a lightweight PyTorch-like neural network framework written in C++ with automatic differentiation, SIMD optimizations, and GPU backends (Metal / CUDA).
Python (pip)
pip install .
# or for development: pip install -e .
import whitematter as wm

C++ (CMake / FetchContent)
include(FetchContent)
FetchContent_Declare(whitematter
GIT_REPOSITORY https://github.com/hwang2409/whitematter
GIT_TAG main
)
FetchContent_MakeAvailable(whitematter)
target_link_libraries(your_app PRIVATE whitematter)

Build options: -DWHITEMATTER_METAL=ON (macOS), -DWHITEMATTER_CUDA=ON (GPU).
make # Build all examples
make test # Run unit tests
./build/ml # Train an MLP on MNIST

core/ C++ core (tensor, layers, loss, optimizer, autograd)
datasets/ MNIST and CIFAR-10 loaders
examples/ Training examples (CNN, GAN, RNN, transformer, autoencoder)
bindings/ Python bindings (pybind11)
platform/ FastAPI backend server
frontend/ Next.js React UI
The Tensor class is the core data structure with automatic gradient computation:
#include "tensor.h"
// Create tensors
auto a = Tensor::randn({3, 4}, true); // 3x4 random tensor, requires_grad=true
auto b = Tensor::zeros({4, 2}, true); // 4x2 zeros
auto c = Tensor::ones({3, 2}, false); // 3x2 ones, no gradients
auto w = Tensor::xavier(784, 256, true); // Xavier initialization
// Operations (automatically tracked for backprop)
auto d = a->matmul(b); // Matrix multiplication
auto e = d->add(c); // Addition (supports broadcasting)
auto f = e->relu(); // ReLU activation
auto g = f->sum(); // Sum to scalar
// Backpropagation
g->backward(); // Compute gradients
a->grad; // Access gradients
a->zero_grad(); // Reset gradients

Available tensor operations:
- Arithmetic: `add`, `sub`, `mul`, `div`, `neg` (with broadcasting)
- Matrix: `matmul`, `bmm`, `transpose`, `reshape`, `slice`, `concat`, `stack`
- Shape: `squeeze`, `unsqueeze`, `flatten`, `permute`
- Activations: `relu`, `sigmoid`, `tanh_`, `softmax`, `log_softmax`
- Reductions: `sum`, `mean`, `max`, `min`, `argmax`, `argmin`
- Element-wise: `log_`, `exp_`, `pow`, `sqrt`, `abs`, `clamp`
- Augmentation: `flip_horizontal`, `random_flip_horizontal`, `pad2d`, `crop`, `random_crop`
Broadcasting:
Arithmetic operations (add, sub, mul, div) support NumPy-style broadcasting:
auto a = Tensor::randn({2, 3}, true);
auto b = Tensor::randn({3}, true); // Broadcasts to [2, 3]
auto c = a->add(b); // Shape: [2, 3]
auto x = Tensor::randn({3, 1}, true);
auto y = Tensor::randn({1, 4}, true);
auto z = x->mul(y); // Outer product, Shape: [3, 4]
auto bias = Tensor::randn({1}, true); // Scalar broadcast
auto out = a->add(bias); // Shape: [2, 3]

Math operations:
auto a = Tensor::create({4, 9, 16}, {3}, true);
auto b = a->sqrt(); // [2, 3, 4]
auto c = a->pow(2.0f); // [16, 81, 256]
auto d = a->pow(0.5f); // Same as sqrt
// L2 norm: sqrt(sum(x^2))
auto x = Tensor::randn({10}, true);
auto norm = x->pow(2.0f)->sum()->sqrt();
// Element-wise power with broadcasting
auto bases = Tensor::randn({2, 3}, true);
auto exps = Tensor::create({1, 2, 3}, {3}, true);
auto result = bases->pow(exps); // Each column raised to different power
// Absolute value and clamping
auto y = Tensor::create({-3, -1, 2, 5}, {4}, true);
auto abs_y = y->abs(); // [3, 1, 2, 5]
auto clamped = y->clamp(-2, 3); // [-2, -1, 2, 3]

Max/Min operations:
auto a = Tensor::create({1, 5, 3, 2, 8, 4}, {2, 3}, true);
// [[1, 5, 3], [2, 8, 4]]
// Reduction along dimension
auto max0 = a->max(0); // max along dim 0: [2, 8, 4]
auto max1 = a->max(1); // max along dim 1: [5, 8]
auto min1 = a->min(1); // min along dim 1: [1, 2]
auto max_keep = a->max(1, true); // keepdim: [[5], [8]] shape [2, 1]
// Element-wise max/min with broadcasting
auto threshold = Tensor::create({3}, {1}, true);
auto clamped_low = a->max(threshold); // ReLU-like: max(a, 3)
auto clamped_high = a->min(threshold); // cap at 3: min(a, 3)
// Gradient flows only to the "winning" element
auto loss = a->max(1)->sum();
loss->backward(); // grad is 1 at max positions, 0 elsewhere
// Get indices of max/min values (no gradients, returns integer indices)
auto argmax_idx = a->argmax(1); // [1, 1] - column indices of max values
auto argmin_idx = a->argmin(1); // [0, 0] - column indices of min values
auto argmax_keep = a->argmax(1, true); // [[1], [1]] - keepdim preserves shape

Batch operations:
// Batch matrix multiplication: [batch, m, k] @ [batch, k, n] -> [batch, m, n]
auto a = Tensor::randn({8, 16, 32}, true); // 8 batches of 16x32 matrices
auto b = Tensor::randn({8, 32, 64}, true); // 8 batches of 32x64 matrices
auto c = a->bmm(b); // Shape: [8, 16, 64]
// Attention scores computation: Q @ K^T
auto Q = Tensor::randn({4, 8, 16}, true); // [batch, seq_len, head_dim]
auto K = Tensor::randn({4, 8, 16}, true);
auto scores = Q->bmm(K->permute({0, 2, 1})); // Shape: [4, 8, 8]

Combining tensors:
// Concatenate along existing dimension
auto a = Tensor::randn({2, 3}, true);
auto b = Tensor::randn({2, 3}, true);
auto c = Tensor::concat({a, b}, 0); // Shape: [4, 3]
auto d = Tensor::concat({a, b}, 1); // Shape: [2, 6]
// Stack along new dimension
auto e = Tensor::stack({a, b}, 0); // Shape: [2, 2, 3]
auto f = Tensor::stack({a, b}, -1); // Shape: [2, 3, 2]

Reshaping tensors:
auto a = Tensor::randn({2, 3}, true);
auto b = a->unsqueeze(0); // Shape: [1, 2, 3]
auto c = a->unsqueeze(-1); // Shape: [2, 3, 1]
auto d = b->squeeze(0); // Shape: [2, 3]
auto e = Tensor::randn({1, 2, 1, 3, 1}, true);
auto f = e->squeeze(); // Remove all size-1 dims -> [2, 3]
// Permute dimensions (reorder axes)
auto g = Tensor::randn({2, 3, 4}, true);
auto h = g->permute({2, 0, 1}); // Shape: [4, 2, 3]
auto i = g->permute({0, 2, 1}); // Shape: [2, 4, 3]
// NCHW -> NHWC conversion
auto nchw = Tensor::randn({8, 3, 32, 32}, true);
auto nhwc = nchw->permute({0, 2, 3, 1}); // Shape: [8, 32, 32, 3]

Build networks using the Module interface:
#include "layer.h"
// Individual layers
auto linear = new Linear(784, 256); // Fully connected
auto relu = new ReLU(); // Activation
auto sigmoid = new Sigmoid();
auto softmax = new Softmax(-1); // Along last dimension
auto dropout = new Dropout(0.5); // 50% dropout
// Sequential model
Sequential model({
new Linear(784, 256),
new ReLU(),
new Dropout(0.2),
new Linear(256, 128),
new ReLU(),
new Linear(128, 10)
});
// Forward pass
auto output = model.forward(input);
// Access parameters
auto params = model.parameters(); // Returns vector<TensorPtr>
// Training/eval mode (affects Dropout)
model.train();
model.eval();

Available layers:
- `Linear(in_features, out_features)` - Fully connected layer
- `ReLU()` - ReLU activation
- `Sigmoid()` - Sigmoid activation
- `Tanh()` - Tanh activation
- `Softmax(dim)` - Softmax activation
- `LogSoftmax(dim)` - Log-softmax activation
- `Dropout(p)` - Dropout regularization
- `Conv2d(in_channels, out_channels, kernel_size, stride, padding)` - 2D convolution
- `ConvTranspose2d(in_channels, out_channels, kernel_size, stride, padding, output_padding)` - 2D transposed convolution (upsampling)
- `MaxPool2d(kernel_size, stride)` - 2D max pooling
- `AvgPool2d(kernel_size, stride)` - 2D average pooling
- `BatchNorm2d(num_features, eps, momentum)` - 2D batch normalization
- `LayerNorm(normalized_shape, eps)` - Layer normalization (for transformers)
- `Flatten()` - Flatten spatial dimensions
- `Embedding(num_embeddings, embedding_dim)` - Embedding lookup table for NLP
- `LSTM(input_size, hidden_size, batch_first)` - LSTM recurrent layer
- `GRU(input_size, hidden_size, batch_first)` - GRU recurrent layer
- `MultiHeadAttention(embed_dim, num_heads)` - Multi-head attention for transformers
- `Sequential({...})` - Container for chaining layers
Transposed Convolution (Upsampling):
// ConvTranspose2d upsamples spatial dimensions (used in autoencoders, GANs, segmentation)
// Output size: (H - 1) * stride - 2 * padding + kernel_size + output_padding
// Double spatial dimensions with stride=2
auto upsample = new ConvTranspose2d(64, 32, 4, 2, 1); // [N, 64, 8, 8] -> [N, 32, 16, 16]
// Decoder for autoencoder
Sequential decoder({
new ConvTranspose2d(256, 128, 4, 2, 1), // 4x4 -> 8x8
new ReLU(),
new ConvTranspose2d(128, 64, 4, 2, 1), // 8x8 -> 16x16
new ReLU(),
new ConvTranspose2d(64, 3, 4, 2, 1), // 16x16 -> 32x32
new Sigmoid() // Output in [0, 1]
});

#include "loss.h"
CrossEntropyLoss criterion; // For multi-class classification
MSELoss mse_criterion; // For regression (L2 loss)
L1Loss l1_criterion; // For regression (mean absolute error)
SmoothL1Loss smooth_l1; // Huber loss (robust to outliers)
SmoothL1Loss smooth_l1_custom(0.5f); // Huber loss with custom beta threshold
NLLLoss nll_criterion; // Negative log likelihood
BCELoss bce_criterion; // Binary cross entropy (expects probabilities)
BCEWithLogitsLoss bce_logits; // BCE with built-in sigmoid (numerically stable)
KLDivLoss kl_div; // KL divergence (for knowledge distillation)
FocalLoss focal; // Focal loss for imbalanced multi-class
FocalLoss focal_custom(2.0f, 0.25f); // gamma=2, alpha=0.25
BinaryFocalLoss binary_focal; // Binary focal loss for imbalanced binary
// Compute loss
auto loss = criterion(predictions, targets);
loss->backward();

#include "optimizer.h"
// Stochastic Gradient Descent with momentum
SGD optimizer(model.parameters(), /*lr=*/0.01, /*momentum=*/0.9);
// Adam optimizer
Adam optimizer(model.parameters(), /*lr=*/0.001, /*beta1=*/0.9, /*beta2=*/0.999);
// AdamW optimizer (Adam with decoupled weight decay)
AdamW optimizer(model.parameters(), /*lr=*/0.001, /*beta1=*/0.9, /*beta2=*/0.999,
/*eps=*/1e-8, /*weight_decay=*/0.01);
// RMSprop optimizer
RMSprop optimizer(model.parameters(), /*lr=*/0.01, /*alpha=*/0.99, /*eps=*/1e-8,
/*momentum=*/0.0, /*weight_decay=*/0.0);
// Training step
optimizer.zero_grad(); // Clear gradients
auto loss = criterion(model.forward(x), y);
loss->backward(); // Compute gradients
optimizer.step(); // Update parameters

#include "optimizer.h"
SGD optimizer(model.parameters(), 0.1f, 0.9f);
// Step decay: multiply LR by gamma every step_size epochs
StepLR scheduler(&optimizer, 30, 0.1f); // decay by 0.1 every 30 epochs
// Exponential decay: multiply LR by gamma every epoch
ExponentialLR scheduler(&optimizer, 0.95f);
// Cosine annealing: smoothly decrease LR to eta_min over T_max epochs
CosineAnnealingLR scheduler(&optimizer, 100, 0.0001f);
// Cosine with warm restarts: reset LR periodically
CosineAnnealingWarmRestarts scheduler(&optimizer, 10, 2, 0.0001f); // T_0=10, T_mult=2
// Reduce on plateau: reduce LR when metric stops improving
ReduceLROnPlateau scheduler(&optimizer, 0.1f, 10, 0.0001f, true); // factor, patience, min_lr, mode_min
// Call at end of each epoch
for (int epoch = 0; epoch < num_epochs; epoch++) {
train_one_epoch();
scheduler.step(); // For most schedulers
// scheduler.step(validation_loss); // For ReduceLROnPlateau
}

#include "optimizer.h"
auto params = model.parameters();
// Clip by norm: scale gradients if total norm > max_norm
loss->backward();
clip_grad_norm_(params, 1.0f); // max_norm = 1.0
optimizer.step();
// Clip by value: clamp each gradient to [-clip_value, clip_value]
loss->backward();
clip_grad_value_(params, 0.5f); // clip to [-0.5, 0.5]
optimizer.step();
// Get gradient norm (useful for monitoring)
float grad_norm = get_grad_norm(params);

Mixed precision training uses fp16 (half-precision) for faster computation while maintaining fp32 accuracy through loss scaling:
#include "amp.h"
Sequential model({
new Linear(784, 256),
new ReLU(),
new Linear(256, 10)
});
SGD optimizer(model.parameters(), 0.001f);
GradScaler scaler; // Default: init_scale=65536, growth_interval=2000
CrossEntropyLoss criterion;
for (int epoch = 0; epoch < epochs; epoch++) {
for (auto [x, y] : dataloader) {
optimizer.zero_grad();
// Forward pass (in real fp16, this would use half-precision)
auto output = model.forward(x);
auto loss = criterion(output, y);
// Scale loss for numerical stability
auto scaled_loss = scaler.scale(loss);
scaled_loss->backward();
// Unscale gradients and check for overflow
scaler.unscale(&optimizer);
// Gradient clipping (optional, works with scaled gradients)
clip_grad_norm_(model.parameters(), 1.0f);
// Step optimizer only if gradients are finite
scaler.step(&optimizer, true); // true = already unscaled
// Update scale factor based on overflow history
scaler.update();
}
}

GradScaler API:
// Constructor with custom settings
GradScaler scaler(
256.0f, // init_scale: starting scale factor
2.0f, // growth_factor: multiply scale by this after growth_interval good steps
0.5f, // backoff_factor: multiply scale by this on overflow
2000, // growth_interval: steps between scale increases
true // enabled: set to false to disable scaling
);
// Core methods
float scale = scaler.get_scale(); // Get current scale factor
auto scaled = scaler.scale(loss); // Scale a tensor
bool finite = scaler.unscale(&optimizer); // Unscale gradients, returns true if all finite
scaler.step(&optimizer, already_unscaled); // Step if gradients are finite
scaler.update(); // Adjust scale based on overflow history

HalfTensor for memory optimization:
// Store weights in fp16 to save memory (50% reduction)
auto weights = Tensor::randn({1000, 1000}, false);
HalfTensor half_weights(weights);
printf("Original: %zu bytes\n", weights->data.size() * 4); // 4MB
printf("Half: %zu bytes\n", half_weights.data.size() * 2); // 2MB
// Convert back to fp32 for computation
auto restored = half_weights.to_float();

FP16 conversion utilities:
// Convert single values
uint16_t h = float_to_half(3.14159f);
float f = half_to_float(h);
// Convert vectors
std::vector<uint16_t> half_data = to_half(float_data);   // fp32 -> fp16
std::vector<float> restored_data = from_half(half_data); // fp16 -> fp32

Gradient accumulation enables training with effectively larger batch sizes when memory is limited. Instead of updating weights after every batch, accumulate gradients over multiple mini-batches:
#include "optimizer.h"
Sequential model({
new Linear(784, 256),
new ReLU(),
new Linear(256, 10)
});
SGD optimizer(model.parameters(), 0.01f);
CrossEntropyLoss criterion;
GradientAccumulator accumulator(4); // Effective batch = mini_batch * 4
for (auto [x, y] : dataloader) {
auto loss = criterion(model.forward(x), y);
// Scales loss by 1/4 and calls backward
accumulator.backward(loss);
// Step only when we've accumulated 4 batches
if (accumulator.should_step()) {
optimizer.step();
optimizer.zero_grad();
accumulator.reset();
}
}

GradientAccumulator API:
// Create accumulator (effective_batch = mini_batch * accumulation_steps)
GradientAccumulator accumulator(4);
// Option 1: Combined scale + backward
accumulator.backward(loss);
// Option 2: Manual control
auto scaled_loss = accumulator.scale(loss); // loss / accumulation_steps
scaled_loss->backward();
accumulator.increment();
// Check state
accumulator.should_step(); // True when ready for optimizer step
accumulator.is_last_step(); // True on final accumulation step
accumulator.current_step(); // Current step (0 to accumulation_steps-1)
accumulator.get_accumulation_steps(); // Total steps
accumulator.get_scale_factor(); // 1.0 / accumulation_steps
// Reset after optimizer step
accumulator.reset();

Combined with mixed precision:
GradientAccumulator accumulator(4);
GradScaler scaler(256.0f);
for (auto [x, y] : dataloader) {
auto loss = criterion(model.forward(x), y);
// Scale for accumulation, then for mixed precision
auto scaled = scaler.scale(accumulator.scale(loss));
scaled->backward();
accumulator.increment();
if (accumulator.should_step()) {
scaler.unscale(&optimizer);
scaler.step(&optimizer, true);
scaler.update();
optimizer.zero_grad();
accumulator.reset();
}
}

Early stopping prevents overfitting by halting training when validation metrics stop improving:
#include "optimizer.h"
Sequential model({...});
SGD optimizer(model.parameters(), 0.01f);
CrossEntropyLoss criterion;
EarlyStopping early_stopping(10); // patience = 10 epochs
for (int epoch = 0; epoch < max_epochs; epoch++) {
// Training
train_one_epoch(model, train_loader);
// Validation
float val_loss = evaluate(model, val_loader);
// Check if we should stop
if (early_stopping.step(val_loss)) {
printf("Early stopping at epoch %d\n", epoch);
printf("Best val loss: %.4f at epoch %d\n",
early_stopping.best_metric(), early_stopping.best_epoch());
break;
}
}

EarlyStopping API:
// Create early stopping monitor
// patience: epochs to wait after last improvement
// min_delta: minimum change to qualify as improvement
// mode_min: true = lower is better (loss), false = higher is better (accuracy)
EarlyStopping early_stopping(10, 0.001f, true); // For loss
EarlyStopping early_stopping(10, 0.0f, false); // For accuracy
// Step and check
bool should_stop = early_stopping.step(metric);
// Query state
early_stopping.best_metric(); // Best value seen
early_stopping.best_epoch(); // Epoch of best value
early_stopping.epochs_without_improvement(); // Current patience counter
early_stopping.should_stop(); // Whether training should stop
// Reset for new training run
early_stopping.reset();

ModelCheckpoint - Save best model automatically:
#include "optimizer.h"
#include "serialize.h"
EarlyStopping early_stopping(10);
ModelCheckpoint checkpoint("best_model.bin", true); // mode_min=true for loss
for (int epoch = 0; epoch < max_epochs; epoch++) {
train_one_epoch();
float val_loss = evaluate();
// Save model if validation improved
if (checkpoint.step(val_loss, &model)) {
printf("Saved best model (val_loss=%.4f)\n", val_loss);
}
// Check early stopping
if (early_stopping.step(val_loss)) {
printf("Early stopping at epoch %d\n", epoch);
break;
}
}
// Restore best model for inference/testing
checkpoint.restore(&model);

Inspect model architecture, parameter counts, and memory usage:
#include "layer.h"
Sequential model({
new Conv2d(1, 16, 3, 1, 1),
new BatchNorm2d(16),
new ReLU(),
new MaxPool2d(2, 2),
new Flatten(),
new Linear(3136, 128),
new ReLU(),
new Linear(128, 10)
});
// PyTorch-style layer-by-layer summary with output shapes
model.summary({1, 1, 28, 28}); // Pass input shape for shape tracking

Output:
==============================================================================
Layer (type) Output Shape Param #
==============================================================================
Conv2d(1, 16, kernel_size=3) [1, 16, 28, 28] 160
BatchNorm2d(16) [1, 16, 28, 28] 32
ReLU [1, 16, 28, 28] 0
MaxPool2d(kernel_size=2) [1, 16, 14, 14] 0
Flatten [1, 3136] 0
Linear(3136, 128) [1, 128] 401,536
ReLU [1, 128] 0
Linear(128, 10) [1, 10] 1,290
==============================================================================
Total params: 403,018
Trainable params: 403,018
Non-trainable params: 0
==============================================================================
Utility functions:
// Get detailed model info
ModelSummary info = get_model_summary(&model);
printf("Total params: %zu\n", info.total_params);
printf("Trainable: %zu\n", info.trainable_params);
printf("Param memory (fp32): %zu bytes\n", info.param_memory_bytes);
printf("Param memory (fp16): %zu bytes\n", info.param_memory_fp16_bytes);
printf("Gradient memory: %zu bytes\n", info.grad_memory_bytes);
printf("Total training memory: %zu bytes\n", info.total_memory_bytes);
// Convenience functions
size_t total = count_parameters(&model);
size_t trainable = count_trainable_parameters(&model);
// Human-readable formatting
std::string params = format_number(1234567); // "1,234,567"
std::string memory = format_memory(1024*1024); // "1.00 MB"
// Simple summary printout
print_model_info(&model, "My CNN");

Log metrics during training with TensorBoard-style tracking and export:
#include "logging.h"
TrainingLogger logger("logs", "my_experiment");
logger.set_total_epochs(10);
logger.set_total_steps(100); // Steps per epoch (for progress bar)
for (int epoch = 0; epoch < 10; epoch++) {
logger.new_epoch();
for (int batch = 0; batch < 100; batch++) {
// Train...
float loss = train_batch();
float acc = compute_accuracy();
// Log batch metrics (for epoch averaging)
logger.log_batch("loss", loss);
logger.log_batch("accuracy", acc);
// Show progress bar
logger.print_progress();
}
// Log epoch-level metrics
logger.log("train_loss", logger.epoch_mean("loss"));
logger.log("train_acc", logger.epoch_mean("accuracy"));
logger.log("lr", optimizer.lr);
logger.step();
logger.print_epoch_summary();
}
// Save logs and print summary
logger.save_csv(); // logs/my_experiment_metrics.csv
logger.save_json(); // logs/my_experiment_metrics.json
logger.print_summary();

Console output:
Epoch 3/10 [============> ] 60% loss: 0.4523 accuracy: 0.8721 (1m 23s)
Epoch 3/10 - loss: 0.4512 accuracy: 0.8734 - 2m 18s
==============================================================================
Training Summary: my_experiment
==============================================================================
Total epochs: 10
Total steps: 1000
Elapsed time: 23m 45s
------------------------------------------------------------------------------
train_loss: min: 0.1234 max: 1.2345 mean: 0.4567 std: 0.2345
train_acc: min: 0.7500 max: 0.9500 mean: 0.8750 std: 0.0500
==============================================================================
MetricTracker for statistics:
// Track running statistics for any metric
MetricTracker tracker;
for (float value : batch_losses) {
tracker.update(value);
}
printf("Mean: %.4f, Std: %.4f, Min: %.4f, Max: %.4f\n",
tracker.mean(), tracker.std(), tracker.min(), tracker.max());

ProgressBar for loops:
ProgressBar bar(1000, 40, "Training: ");
for (int i = 0; i < 1000; i++) {
// Do work...
bar.update();
}
bar.finish();
// Output: Training: [=================> ] 50% 500/1000 [1.2s < 1.2s]

CSV export format:
step,epoch,timestamp,train_loss,train_acc,lr
0,1,12.34,0.9876,0.7500,0.001
1,2,25.67,0.5432,0.8500,0.001
...

JSON export format:
{
"experiment": "my_experiment",
"total_steps": 10,
"total_epochs": 10,
"elapsed_seconds": 1234.56,
"summary": {
"train_loss": {"min": 0.12, "max": 1.23, "mean": 0.45, "std": 0.23, "last": 0.15}
},
"history": [
{"step": 0, "epoch": 1, "timestamp": 12.34, "train_loss": 0.9876}
]
}

Use NoGradGuard for inference to improve performance:
{
NoGradGuard no_grad; // Disables gradient tracking in this scope
auto output = model.forward(input); // No computation graph built
// ~3x faster inference
}
// Gradients automatically re-enabled when guard goes out of scope

The model zoo provides pre-defined architectures and pretrained weights:
#include "model_zoo.h"
// List available models
auto models = list_models(); // {"mnist_mlp", "mnist_cnn", "cifar10_simple", ...}
// Get model info
auto info = ModelZoo::instance().get_info("mnist_cnn");
printf("Params: %zu, Expected accuracy: %.1f%%\n", info.num_params, info.accuracy);
// Create model architecture (random weights)
Sequential* model = create_model("mnist_cnn");
// Load pretrained model (architecture + weights)
Sequential* pretrained = load_pretrained("mnist_cnn");
// Save a trained model to the zoo
ModelZoo::instance().save_to_zoo("mnist_cnn", model);
// Change weights directory (default: "pretrained/")
ModelZoo::instance().set_weights_dir("my_models/");

Available models:

| Model | Dataset | Input Shape | Params | Expected Accuracy |
|---|---|---|---|---|
| `mnist_mlp` | MNIST | [1, 784] | 203K | ~97.5% |
| `mnist_cnn` | MNIST | [1, 1, 28, 28] | 207K | ~98.5% |
| `cifar10_simple` | CIFAR-10 | [1, 3, 32, 32] | 310K | ~75% |
| `cifar10_vgg` | CIFAR-10 | [1, 3, 32, 32] | 3.2M | ~85% |
| `tiny_mlp` | synthetic | [1, 4] | 42 | 100% |
Training and saving to the zoo:
// Train a model
Sequential* model = create_model("mnist_cnn");
// ... training loop ...
// Save to pretrained/mnist_cnn.bin
ModelZoo::instance().save_to_zoo("mnist_cnn", model);
// Later, load the pretrained model
Sequential* loaded = load_pretrained("mnist_cnn");

Export models to ONNX format for use with ONNX Runtime, TensorRT, or other frameworks:
#include "onnx_export.h"
// Simple export with input shape
Sequential* model = load_pretrained("mnist_cnn");
export_onnx(model, "model.onnx", {1, 1, 28, 28});
// Export with options
ONNXExportOptions options;
options.model_name = "my_model";
options.input_shape = {1, 1, 28, 28};
options.verbose = true;
export_onnx(model, "model.onnx", options);
// Get export info (for debugging)
std::string info = get_onnx_export_info(model, {1, 1, 28, 28});

Supported layers for ONNX export:
- `Linear` → Gemm
- `Conv2d` → Conv
- `ReLU`, `Sigmoid`, `Tanh` → Relu, Sigmoid, Tanh
- `Softmax` → Softmax
- `Flatten` → Flatten
- `MaxPool2d`, `AvgPool2d` → MaxPool, AveragePool
- `BatchNorm2d` → BatchNormalization
- `Dropout` → Identity (inference mode)
Using the exported model:
import onnxruntime as ort
import numpy as np
# Load and run inference
sess = ort.InferenceSession("mnist_cnn.onnx")
x = np.random.randn(1, 1, 28, 28).astype(np.float32)
y = sess.run(None, {"input": x})[0]
print(f"Predicted class: {np.argmax(y)}")

Basic DataLoader (MNIST):
#include "mnist.h"
// Load MNIST dataset
auto train_data = load_mnist_train("data/");
auto test_data = load_mnist_test("data/");
// Create data loader with batching and shuffling
DataLoader train_loader(train_data, /*batch_size=*/64, /*shuffle=*/true);
DataLoader test_loader(test_data, /*batch_size=*/64, /*shuffle=*/false);
// Iterate over batches
while (train_loader.has_next()) {
auto [images, labels] = train_loader.next_batch();
// images: [batch_size, 784], labels: [batch_size]
}
train_loader.reset(); // Reset for next epoch

ThreadedDataLoader (Multi-threaded with Prefetching):
#include "dataloader.h"
// Convert dataset to generic Dataset struct
Dataset train_dataset{train_data.images, train_data.labels, train_data.num_samples};
// Create threaded data loader
// Args: dataset, batch_size, shuffle, num_workers, prefetch_factor
ThreadedDataLoader train_loader(train_dataset, 64, true, 2, 2); // 2 workers, prefetch 2 batches each
ThreadedDataLoader test_loader(test_dataset, 64, false, 0); // 0 workers = synchronous
// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
train_loader.reset(); // Shuffles data, starts worker threads
while (train_loader.has_next()) {
auto [images, labels] = train_loader.next_batch_pair();
// Workers prefetch next batches while you train
}
}

The ThreadedDataLoader uses background threads to prefetch batches while the main thread processes the current batch, improving training throughput on CPU-bound workloads.
CIFAR-10:
#include "cifar10.h"
// Download CIFAR-10 binary files to data/ directory first
// Files needed: data_batch_1.bin ... data_batch_5.bin, test_batch.bin
// Load CIFAR-10 dataset (automatically normalized with ImageNet stats)
auto train_data = load_cifar10_train("data/"); // 50,000 images
auto test_data = load_cifar10_test("data/"); // 10,000 images
// Create data loader with augmentation (random crop + horizontal flip)
CIFAR10DataLoader train_loader(train_data, 64, /*shuffle=*/true, /*augment=*/true);
CIFAR10DataLoader test_loader(test_data, 64, /*shuffle=*/false, /*augment=*/false);
while (train_loader.has_next()) {
auto [images, labels] = train_loader.next_batch();
// images: [batch_size, 3, 32, 32], labels: [batch_size]
}
// Get class name
const char* name = cifar10_class_name(3); // "cat"

Image augmentation:

// For images with shape [N, C, H, W] or [C, H, W]
// Horizontal flip
auto flipped = img->flip_horizontal();
auto maybe_flipped = img->random_flip_horizontal(0.5f); // 50% chance
// Padding and cropping
auto padded = img->pad2d(4); // Zero-pad by 4 pixels on each side
auto cropped = padded->crop(2, 2, 32, 32); // Crop from (top=2, left=2)
auto random_cropped = padded->random_crop(32, 32); // Random position
// Standard CIFAR-10 augmentation pipeline
auto augmented = img->pad2d(4)->random_crop(32, 32)->random_flip_horizontal(0.5f);

Complete training example (MNIST MLP):

#include "tensor.h"
#include "layer.h"
#include "loss.h"
#include "optimizer.h"
#include "mnist.h"
int main() {
// Load data
auto train_data = load_mnist_train("data/");
DataLoader train_loader(train_data, 64, true);
// Define model
Sequential model({
new Linear(784, 256),
new ReLU(),
new Linear(256, 10)
});
// Loss and optimizer
CrossEntropyLoss criterion;
SGD optimizer(model.parameters(), 0.01, 0.9);
// Training loop
for (int epoch = 0; epoch < 10; epoch++) {
train_loader.reset();
float total_loss = 0;
int batches = 0;
while (train_loader.has_next()) {
auto [images, labels] = train_loader.next_batch();
optimizer.zero_grad();
auto output = model.forward(images);
auto loss = criterion(output, labels);
loss->backward();
optimizer.step();
total_loss += loss->item();
batches++;
}
printf("Epoch %d, Loss: %.4f\n", epoch + 1, total_loss / batches);
}
return 0;
}

A convolutional neural network example is included in cnn_mnist.cpp:
make cnn_mnist
./cnn_mnist

Architecture:
Conv2d(1, 16, 3) -> BatchNorm2d -> ReLU -> MaxPool2d(2)
Conv2d(16, 32, 3) -> BatchNorm2d -> ReLU -> MaxPool2d(2)
Flatten -> Linear(1568, 128) -> ReLU -> Linear(128, 10)
Results: ~98.5% test accuracy after 1 epoch (~207k parameters)
A VGG-style CNN for CIFAR-10 color image classification in cnn_cifar10.cpp:
# Download CIFAR-10 data
mkdir -p data && cd data
curl -LO https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
tar -xzf cifar-10-binary.tar.gz
mv cifar-10-batches-bin/*.bin .
cd ..
# Build and run
make cnn_cifar10
./cnn_cifar10

Architecture:
Block 1: Conv(3,32,3) -> BN -> ReLU -> Conv(32,32,3) -> BN -> ReLU -> MaxPool(2)
Block 2: Conv(32,64,3) -> BN -> ReLU -> Conv(64,64,3) -> BN -> ReLU -> MaxPool(2)
Block 3: Conv(64,128,3) -> BN -> ReLU -> Conv(128,128,3) -> BN -> ReLU -> MaxPool(2)
Classifier: Flatten -> Linear(2048,256) -> ReLU -> Dropout(0.5) -> Linear(256,10)
Features:
- ~815k parameters (~3.1 MB)
- Adam optimizer with cosine annealing learning rate schedule
- Data augmentation: random crop (pad=4) + horizontal flip
- im2col + GEMM + OpenMP optimized convolution (~1.5s/batch on Apple M1)
- Expected: ~75-85% test accuracy after 20 epochs (~19 min/epoch)
A simple transformer language model example is included in transformer_example.cpp:
make transformer_example
./transformer_example
Architecture:
Embedding -> Positional Encoding -> 2x Transformer Blocks -> Linear
Each block: MultiHeadAttention -> LayerNorm -> FFN -> LayerNorm
The model learns to predict the next character in a simple repeating pattern ("abcdef"):
- 17k parameters with embed_dim=32, num_heads=4, 2 layers
- Trains to near-zero loss in ~100 epochs
- Generates perfect sequences from any starting prompt
Sample output:
Prompt 'abc' -> abcdefabcdefabcdefabcdef...
Prompt 'f' -> fabcdefabcdefabcdefabcdef...
A convolutional autoencoder using ConvTranspose2d for image reconstruction on MNIST:
make autoencoder
./build/autoencoder
Architecture:
Encoder: Conv2d(1,16,3,s=2) -> Conv2d(16,32,3,s=2) -> Conv2d(32,64,3,s=2) -> Linear(1024,32)
28x28 -> 14x14 -> 7x7 -> 4x4 -> 32-dim latent vector
Decoder: Linear(32,1024) -> ConvTranspose2d(64,32) -> ConvTranspose2d(32,16) -> ConvTranspose2d(16,1)
32-dim latent -> 4x4 -> 7x7 -> 14x14 -> 28x28
Features:
- ~85k parameters with 32-dimensional latent space
- Uses ConvTranspose2d for learned upsampling (decoder)
- MSE loss for pixel-wise reconstruction
- Demonstrates latent space interpolation between digits
- ASCII art visualization of reconstructions
Sample output:
Epoch 10/10 Train Loss: 0.012345 Test Loss: 0.012567
Reconstruction Examples:
Original: Reconstructed:
.::---==+++**##%@@ .::---==+++**##%@@
:::---==+++**##%%@ .::---==+++**##%%@
... ...
Latent Space Interpolation (digit 3 -> digit 7):
[image 1] [image 2] [image 3] [image 4] [image 5]
A Deep Convolutional GAN (DCGAN) for generating handwritten digits:
make gan
./build/gan
Architecture:
Generator (noise -> image):
Linear(100, 4096) -> Reshape(256,4,4) -> ConvTranspose2d -> 7x7 -> ConvTranspose2d -> 14x14 -> ConvTranspose2d -> 28x28
Discriminator (image -> real/fake):
Conv2d(1,64) -> 14x14 -> Conv2d(64,128) -> 7x7 -> Conv2d(128,256) -> 4x4 -> Linear -> sigmoid
Features:
- ~1.5M parameters (Generator: ~1M, Discriminator: ~500K)
- 100-dimensional latent space
- Adam optimizer with beta1=0.5 (standard for GANs)
- Label smoothing (real labels = 0.9) for training stability
- Dropout in discriminator to prevent overfitting
- Demonstrates latent space interpolation and diversity checking
Training dynamics:
- D(x): Discriminator output on real images (should stay ~0.5-0.8)
- D(G(z)): Discriminator output on fake images (should rise from ~0 to ~0.5)
- Balanced training when both losses are similar
Sample output:
Epoch 20/20 D_loss: 0.8234 G_loss: 1.2345 D(x): 0.72 D(G(z)): 0.45
Generated samples (epoch 20):
.::---==+++**##%@@ .::---==+++**##%@@ .::---==+++**##%@@
:::---==+++**##%%@ :::---==+++**##%%@ :::---==+++**##%%@
Latent Space Interpolation:
[z1] -> [interp1] -> [interp2] -> [interp3] -> [z2]
A character-level RNN language model for generating Shakespeare-style text:
make rnn
./build/rnn_text_gen
Architecture:
Embedding(vocab_size, 128) -> LSTM(128, 256) -> LSTM(256, 256) -> Dropout(0.3) -> Linear(256, vocab_size)
Features:
- ~500K parameters with 2-layer LSTM and 256 hidden units
- Character-level language modeling (predicts next character)
- Embedded Shakespeare corpus (~2KB) for training
- Temperature-based sampling for text generation:
- Low temperature (0.5): More conservative, repetitive text
- Medium temperature (0.8): Balanced creativity and coherence
- High temperature (1.2): More random, creative text
- Gradient clipping (max norm = 5.0) for stable RNN training
- Custom sequence cross-entropy loss with proper gradient computation
Training output:
Epoch 50/50 Loss: 1.2345 Perplexity: 3.44
Generated text (temperature=0.8):
----------------------------------------
ROMEO:
What light through yonder window breaks?
It is the east, and Juliet is the sun.
Arise, fair sun, and kill the envious moon...
----------------------------------------
Sample generation at different temperatures:
Temperature 0.5 (conservative):
"the the the the and the..."
Temperature 0.8 (balanced):
"What dreams may come when we have shuffled off..."
Temperature 1.2 (creative):
"Twas brillig sloathy toves did gyre..."
The framework includes several optimizations:
- SIMD Vectorization: ARM NEON (Apple Silicon) and x86 SSE/AVX support
- Blocked Matrix Multiplication: Cache-friendly 32x32 block tiling
- im2col + GEMM Convolution: Converts conv2d to optimized matrix multiplication
- OpenMP Parallelization: Multi-threaded convolution and GEMM operations
- Threaded Data Loading: Background workers prefetch batches during training
- Mixed Precision (fp16): GradScaler for loss scaling, HalfTensor for memory optimization
- Gradient Accumulation: Train with larger effective batch sizes on limited memory
- Early Stopping: Prevent overfitting with automatic training termination
- NoGradGuard: Skip computation graph building during inference
- O3 Optimization: Aggressive compiler optimizations enabled
brew install libomp
| Model | Batch Time | Epoch Time |
|---|---|---|
| Simple CNN (2 conv) | ~76 ms | ~1 min |
| VGG-style (6 conv) | ~1.5 s | ~19 min |
Typical MNIST training: ~18 seconds/epoch on Apple M1.
make # Build optimized release (CPU-only)
make debug # Build with debug symbols
make clean # Remove build artifacts
make run # Build and run
make test      # Run unit tests
GPU backends (optional):
- Metal (macOS / Apple Silicon): make METAL=1 — uses Metal for matmul when tensors are moved with .to(Device::metal()).
- CUDA (Linux / cloud): make CUDA=1 — uses cuBLAS for matmul and batched matmul when tensors are moved with .to(Device::cuda()). Requires nvcc and the CUDA toolkit; set CUDA_PATH if needed. The default build is CPU-only and does not require CUDA.
Device auto-selection: Use Device::auto_() or Device::default_device() to pick the best available backend (Metal on macOS when built with METAL=1, else CUDA when built with CUDA=1, else CPU) so you don’t need to hard-code backend or remember make flags.
ONNX: Export with export_onnx(model, path, input_shape); import with load_onnx(path) for the same op set (Gemm, Conv, Relu, Sigmoid, Tanh, Softmax, Flatten, MaxPool, AveragePool, BatchNormalization, Identity). Set ONNXExportOptions::export_fp16 = true to export initializers in Float16 for smaller, edge-friendly models. For memory-efficient inference in C++, use HalfTensor and fp16 helpers in core/amp.h.
Comprehensive unit tests are provided in the tests/ directory:
# Run all tests
make test
# Run specific test suites
make test-tensor # Tensor operations
make test-autograd # Automatic differentiation
make test-layers # Neural network layers
make test-loss # Loss functions
make test-optimizer  # Optimizers and schedulers
Test coverage:
- Tensor Operations (39 tests): Creation, arithmetic, matrix ops, reductions, shape manipulation
- Autograd (23 tests): Gradient computation for all differentiable operations
- Layers (42 tests): Forward pass, parameters, gradients for all layer types
- Loss Functions (22 tests): Correctness and gradients for all losses
- Optimizers (26 tests): SGD, Adam, AdamW, RMSprop, schedulers, early stopping
Sample output:
################################################################################
# WHITEMATTER UNIT TESTS #
################################################################################
================================================================================
Test Suite: Tensor Operations
================================================================================
[PASS] zeros (0.01ms)
[PASS] ones (0.00ms)
[PASS] matmul_2d (0.52ms)
...
--------------------------------------------------------------------------------
Results: 39 passed, 0 failed, 39 total (0.95ms)
================================================================================
################################################################################
TOTAL: 152 passed, 0 failed (0.01s)
################################################################################
- Link against Accelerate/BLAS — Replace hand-rolled matmul with cblas_sgemm for 5-10x speedup
- Fix matmul blocking order — Current (i,k,j) causes cache misses on B columns; transpose B or switch to (i,j,k) blocking for ~2x improvement (core/ops/matmul_cpu.cpp)
- Rewrite attention with batched matmul — Q*K^T uses 6 nested scalar loops instead of bmm; catastrophically slow for seq_len > 256 (core/layers/attention.cpp:105-170)
- Flash attention — Current implementation stores the full O(N^2) attention matrix; flash attention reduces memory to O(N) and speeds up 10-100x
- Cache im2col buffer in Conv2d — Allocates a std::vector every forward pass; caching eliminates thousands of heap allocations per epoch (core/ops/conv_ops.cpp:41)
- Winograd convolution for 3x3 kernels — ~2.5x speedup for the most common conv kernel size
- Fix BatchNorm iteration order — Iterates (c,b,h,w) but tensor layout is (b,c,h,w); swap loops for better cache locality (core/layers/normalization.cpp:31-56)
- Add FMA SIMD instructions — _mm256_fmadd_ps for AVX, vfmaq_f32 for NEON; currently unused
- Add -march=native -flto to Makefile for a free 5-15% speedup from LTO and native ISA
- Conv+BN+ReLU fusion — Fuse into a single kernel to eliminate intermediate memory traffic
- Thread-safe RNG — The global static std::mt19937 in tensor.cpp is not thread-safe; use thread_local
- Numerical gradient checking — Add finite-difference gradient verification to the test suite to catch backward-pass bugs
- Fix grad_fn circular references — Lambda captures of shared_ptrs can create cycles and leak memory (tensor.cpp:413)
- Conv signed/unsigned mismatch — Padding is computed as int but used as size_t (conv_ops.cpp:23-24)
- Bounded memory pool — Free lists grow without limit; add a max bucket size to prevent unbounded memory growth
- Mixed precision (fp16/bf16) — Halves memory bandwidth (the real bottleneck) and enables tensor cores on NVIDIA GPUs
- Metal GPU backend — Stubs exist but aren't implemented; M-series Macs have powerful GPUs sitting idle
- CUDA backend — Move beyond stubs to functional GPU compute
- Grouped/depthwise convolutions — Required for MobileNet, EfficientNet, and modern architectures
- Dilated convolutions — Common in semantic segmentation (DeepLab, WaveNet)
- INT8/INT4 quantization — GGML-style quantized inference for practical deployment
- Operator graph compilation — Record operations, optimize the graph, then execute (like TorchScript/XLA)
- C++17 compatible compiler (g++, clang++)
- No external dependencies
