Train and deploy custom neural networks from your browser — powered by a C++ framework that compiles and runs models natively.
docker compose up

Open http://localhost:5173 — upload a dataset, describe your model, train, and deploy.
Set ANTHROPIC_API_KEY in .env to enable AI-assisted architecture design.
- AI Architecture Designer — Describe your model in natural language; Claude suggests layer configurations, hyperparameters, and training recipes
- Live Training Dashboard — Real-time loss curves, accuracy charts, stat cards, and training logs streamed to the browser
- One-Click Deploy — Push trained models to AWS EC2 as inference APIs with a single button
- Visual Model Builder — Drag-and-drop architecture graph with interactive node editing
- Dataset Management — Upload ZIPs, import from Hugging Face or URL, automatic preprocessing for images/text/tabular data
- Model Zoo — Browse pretrained architectures, inspect model cards, run predictions in the playground
- Native C++ Performance — Models compile to optimized C++ with SIMD, OpenMP, Metal, and CUDA backends
graph TB
subgraph Frontend["Frontend (Next.js)"]
UI[React UI]
Graph[Architecture Graph]
Charts[Training Charts]
end
subgraph Backend["Backend (FastAPI)"]
API[REST API]
LLM[LLM Service<br/>Claude API]
CodeGen[Code Generator]
DataMgr[Dataset Manager]
end
subgraph Core["C++ Core"]
Tensor[Tensor + Autograd]
Layers[Layers / Optimizers / Loss]
GPU[Metal / CUDA Backends]
end
UI -->|HTTP| API
Graph -->|HTTP| API
Charts -->|SSE| API
API --> LLM
API --> CodeGen
API --> DataMgr
CodeGen -->|generates| Core
Core -->|compiles & trains| Tensor
Tensor --> Layers
Layers --> GPU
Full API documentation is available at /docs (Swagger UI) and /redoc while the server is running. Both include request/response schemas, authentication details, and interactive "try it out" support.
Quick-reference for the main endpoint groups:
| Group | Endpoints | Description |
|---|---|---|
| Auth | POST /auth/register, POST /auth/login | User registration, login, and JWT authentication |
| Datasets | POST /datasets/upload, GET /datasets/{id} | Upload, list, and manage training datasets |
| Design | POST /design/suggest, POST /design/refine | AI-assisted model architecture suggestions |
| Training | POST /train, GET /train/{job_id} | Start training jobs, stream progress, retrieve results |
| Models | GET /models, GET /models/{id} | Browse trained models and metadata |
| Predict | POST /predict/{model_id} | Run inference on trained models |
| Deploy | POST /deploy | One-click deploy to AWS EC2 |
| Credentials | /credentials | Manage cloud provider and API credentials |
| Health | /health | Server health and readiness checks |
- Public Roadmap: GitHub Projects board — see what's planned and vote on priorities
- Contributing: See CONTRIBUTING.md for setup instructions and PR guidelines
- Changelog: See CHANGELOG.md for release history
Add this badge to your project's README:
[![Built with Whitematter](https://raw.githubusercontent.com/hwang2409/whitematter/main/docs/badge.svg)](https://github.com/hwang2409/whitematter)

Or in HTML:
<a href="https://github.com/hwang2409/whitematter">
<img src="https://raw.githubusercontent.com/hwang2409/whitematter/main/docs/badge.svg" alt="Built with Whitematter" />
</a>

Whitematter's core is a lightweight PyTorch-like neural network framework written in C++ with automatic differentiation, SIMD optimizations, and GPU backends (Metal / CUDA).
Python (pip)
pip install .
# or for development: pip install -e .
import whitematter as wm

C++ (CMake / FetchContent)
include(FetchContent)
FetchContent_Declare(whitematter
GIT_REPOSITORY https://github.com/hwang2409/whitematter
GIT_TAG main
)
FetchContent_MakeAvailable(whitematter)
target_link_libraries(your_app PRIVATE whitematter)

Build options: -DWHITEMATTER_METAL=ON (macOS), -DWHITEMATTER_CUDA=ON (GPU).
make # Build all examples
make test # Run unit tests
./build/ml # Train an MLP on MNIST

core/ C++ core (tensor, layers, loss, optimizer, autograd)
datasets/ MNIST and CIFAR-10 loaders
examples/ Training examples (CNN, GAN, RNN, transformer, autoencoder)
bindings/ Python bindings (pybind11)
platform/ FastAPI backend server
frontend/ Next.js React UI
The Tensor class is the core data structure with automatic gradient computation:
#include "tensor.h"
// Create tensors
auto a = Tensor::randn({3, 4}, true); // 3x4 random tensor, requires_grad=true
auto b = Tensor::zeros({4, 2}, true); // 4x2 zeros
auto c = Tensor::ones({3, 2}, false); // 3x2 ones, no gradients
auto w = Tensor::xavier(784, 256, true); // Xavier initialization
// Operations (automatically tracked for backprop)
auto d = a->matmul(b); // Matrix multiplication
auto e = d->add(c); // Addition (supports broadcasting)
auto f = e->relu(); // ReLU activation
auto g = f->sum(); // Sum to scalar
// Backpropagation
g->backward(); // Compute gradients
a->grad; // Access gradients
a->zero_grad(); // Reset gradients

Available tensor operations:
- Arithmetic: `add`, `sub`, `mul`, `div`, `neg` (with broadcasting)
- Matrix: `matmul`, `bmm`, `transpose`, `reshape`, `slice`, `concat`, `stack`
- Shape: `squeeze`, `unsqueeze`, `flatten`, `permute`
- Activations: `relu`, `sigmoid`, `tanh_`, `softmax`, `log_softmax`
- Reductions: `sum`, `mean`, `max`, `min`, `argmax`, `argmin`
- Element-wise: `log_`, `exp_`, `pow`, `sqrt`, `abs`, `clamp`
- Augmentation: `flip_horizontal`, `random_flip_horizontal`, `pad2d`, `crop`, `random_crop`
Broadcasting:
Arithmetic operations (add, sub, mul, div) support NumPy-style broadcasting:
auto a = Tensor::randn({2, 3}, true);
auto b = Tensor::randn({3}, true); // Broadcasts to [2, 3]
auto c = a->add(b); // Shape: [2, 3]
auto x = Tensor::randn({3, 1}, true);
auto y = Tensor::randn({1, 4}, true);
auto z = x->mul(y); // Outer product, Shape: [3, 4]
auto bias = Tensor::randn({1}, true); // Scalar broadcast
auto out = a->add(bias); // Shape: [2, 3]

Math operations:
auto a = Tensor::create({4, 9, 16}, {3}, true);
auto b = a->sqrt(); // [2, 3, 4]
auto c = a->pow(2.0f); // [16, 81, 256]
auto d = a->pow(0.5f); // Same as sqrt
// L2 norm: sqrt(sum(x^2))
auto x = Tensor::randn({10}, true);
auto norm = x->pow(2.0f)->sum()->sqrt();
// Element-wise power with broadcasting
auto bases = Tensor::randn({2, 3}, true);
auto exps = Tensor::create({1, 2, 3}, {3}, true);
auto result = bases->pow(exps); // Each column raised to different power
// Absolute value and clamping
auto y = Tensor::create({-3, -1, 2, 5}, {4}, true);
auto abs_y = y->abs(); // [3, 1, 2, 5]
auto clamped = y->clamp(-2, 3); // [-2, -1, 2, 3]

Max/Min operations:
auto a = Tensor::create({1, 5, 3, 2, 8, 4}, {2, 3}, true);
// [[1, 5, 3], [2, 8, 4]]
// Reduction along dimension
auto max0 = a->max(0); // max along dim 0: [2, 8, 4]
auto max1 = a->max(1); // max along dim 1: [5, 8]
auto min1 = a->min(1); // min along dim 1: [1, 2]
auto max_keep = a->max(1, true); // keepdim: [[5], [8]] shape [2, 1]
// Element-wise max/min with broadcasting
auto threshold = Tensor::create({3}, {1}, true);
auto clamped_low = a->max(threshold); // ReLU-like: max(a, 3)
auto clamped_high = a->min(threshold); // cap at 3: min(a, 3)
// Gradient flows only to the "winning" element
auto loss = a->max(1)->sum();
loss->backward(); // grad is 1 at max positions, 0 elsewhere
// Get indices of max/min values (no gradients, returns integer indices)
auto argmax_idx = a->argmax(1); // [1, 1] - column indices of max values
auto argmin_idx = a->argmin(1); // [0, 0] - column indices of min values
auto argmax_keep = a->argmax(1, true); // [[1], [1]] - keepdim preserves shape

Batch operations:
// Batch matrix multiplication: [batch, m, k] @ [batch, k, n] -> [batch, m, n]
auto a = Tensor::randn({8, 16, 32}, true); // 8 batches of 16x32 matrices
auto b = Tensor::randn({8, 32, 64}, true); // 8 batches of 32x64 matrices
auto c = a->bmm(b); // Shape: [8, 16, 64]
// Attention scores computation: Q @ K^T
auto Q = Tensor::randn({4, 8, 16}, true); // [batch, seq_len, head_dim]
auto K = Tensor::randn({4, 8, 16}, true);
auto scores = Q->bmm(K->permute({0, 2, 1})); // Shape: [4, 8, 8]

Combining tensors:
// Concatenate along existing dimension
auto a = Tensor::randn({2, 3}, true);
auto b = Tensor::randn({2, 3}, true);
auto c = Tensor::concat({a, b}, 0); // Shape: [4, 3]
auto d = Tensor::concat({a, b}, 1); // Shape: [2, 6]
// Stack along new dimension
auto e = Tensor::stack({a, b}, 0); // Shape: [2, 2, 3]
auto f = Tensor::stack({a, b}, -1); // Shape: [2, 3, 2]

Reshaping tensors:
auto a = Tensor::randn({2, 3}, true);
auto b = a->unsqueeze(0); // Shape: [1, 2, 3]
auto c = a->unsqueeze(-1); // Shape: [2, 3, 1]
auto d = b->squeeze(0); // Shape: [2, 3]
auto e = Tensor::randn({1, 2, 1, 3, 1}, true);
auto f = e->squeeze(); // Remove all size-1 dims -> [2, 3]
// Permute dimensions (reorder axes)
auto g = Tensor::randn({2, 3, 4}, true);
auto h = g->permute({2, 0, 1}); // Shape: [4, 2, 3]
auto i = g->permute({0, 2, 1}); // Shape: [2, 4, 3]
// NCHW -> NHWC conversion
auto nchw = Tensor::randn({8, 3, 32, 32}, true);
auto nhwc = nchw->permute({0, 2, 3, 1}); // Shape: [8, 32, 32, 3]

Build networks using the Module interface:
#include "layer.h"
// Individual layers
auto linear = new Linear(784, 256); // Fully connected
auto relu = new ReLU(); // Activation
auto sigmoid = new Sigmoid();
auto softmax = new Softmax(-1); // Along last dimension
auto dropout = new Dropout(0.5); // 50% dropout
// Sequential model
Sequential model({
new Linear(784, 256),
new ReLU(),
new Dropout(0.2),
new Linear(256, 128),
new ReLU(),
new Linear(128, 10)
});
// Forward pass
auto output = model.forward(input);
// Access parameters
auto params = model.parameters(); // Returns vector<TensorPtr>
// Training/eval mode (affects Dropout)
model.train();
model.eval();

Available layers:
- `Linear(in_features, out_features)` - Fully connected layer
- `ReLU()` - ReLU activation
- `Sigmoid()` - Sigmoid activation
- `Tanh()` - Tanh activation
- `Softmax(dim)` - Softmax activation
- `LogSoftmax(dim)` - Log-softmax activation
- `Dropout(p)` - Dropout regularization
- `Conv2d(in_channels, out_channels, kernel_size, stride, padding)` - 2D convolution
- `ConvTranspose2d(in_channels, out_channels, kernel_size, stride, padding, output_padding)` - 2D transposed convolution (upsampling)
- `MaxPool2d(kernel_size, stride)` - 2D max pooling
- `AvgPool2d(kernel_size, stride)` - 2D average pooling
- `BatchNorm2d(num_features, eps, momentum)` - 2D batch normalization
- `LayerNorm(normalized_shape, eps)` - Layer normalization (for transformers)
- `Flatten()` - Flatten spatial dimensions
- `Embedding(num_embeddings, embedding_dim)` - Embedding lookup table for NLP
- `LSTM(input_size, hidden_size, batch_first)` - LSTM recurrent layer
- `GRU(input_size, hidden_size, batch_first)` - GRU recurrent layer
- `MultiHeadAttention(embed_dim, num_heads)` - Multi-head attention for transformers
- `Sequential({...})` - Container for chaining layers
Transposed Convolution (Upsampling):
// ConvTranspose2d upsamples spatial dimensions (used in autoencoders, GANs, segmentation)
// Output size: (H - 1) * stride - 2 * padding + kernel_size + output_padding
// Double spatial dimensions with stride=2
auto upsample = new ConvTranspose2d(64, 32, 4, 2, 1); // [N, 64, 8, 8] -> [N, 32, 16, 16]
// Decoder for autoencoder
Sequential decoder({
new ConvTranspose2d(256, 128, 4, 2, 1), // 4x4 -> 8x8
new ReLU(),
new ConvTranspose2d(128, 64, 4, 2, 1), // 8x8 -> 16x16
new ReLU(),
new ConvTranspose2d(64, 3, 4, 2, 1), // 16x16 -> 32x32
new Sigmoid() // Output in [0, 1]
});

#include "loss.h"
CrossEntropyLoss criterion; // For multi-class classification
MSELoss mse_criterion; // For regression (L2 loss)
L1Loss l1_criterion; // For regression (mean absolute error)
SmoothL1Loss smooth_l1; // Huber loss (robust to outliers)
SmoothL1Loss smooth_l1_custom(0.5f); // Huber loss with custom beta threshold
NLLLoss nll_criterion; // Negative log likelihood
BCELoss bce_criterion; // Binary cross entropy (expects probabilities)
BCEWithLogitsLoss bce_logits; // BCE with built-in sigmoid (numerically stable)
KLDivLoss kl_div; // KL divergence (for knowledge distillation)
FocalLoss focal; // Focal loss for imbalanced multi-class
FocalLoss focal_custom(2.0f, 0.25f); // gamma=2, alpha=0.25
BinaryFocalLoss binary_focal; // Binary focal loss for imbalanced binary
// Compute loss
auto loss = criterion(predictions, targets);
loss->backward();

#include "optimizer.h"
// Stochastic Gradient Descent with momentum
SGD optimizer(model.parameters(), /*lr=*/0.01, /*momentum=*/0.9);
// Adam optimizer
Adam optimizer(model.parameters(), /*lr=*/0.001, /*beta1=*/0.9, /*beta2=*/0.999);
// AdamW optimizer (Adam with decoupled weight decay)
AdamW optimizer(model.parameters(), /*lr=*/0.001, /*beta1=*/0.9, /*beta2=*/0.999,
/*eps=*/1e-8, /*weight_decay=*/0.01);
// RMSprop optimizer
RMSprop optimizer(model.parameters(), /*lr=*/0.01, /*alpha=*/0.99, /*eps=*/1e-8,
/*momentum=*/0.0, /*weight_decay=*/0.0);
// Training step
optimizer.zero_grad(); // Clear gradients
auto loss = criterion(model.forward(x), y);
loss->backward(); // Compute gradients
optimizer.step(); // Update parameters

#include "optimizer.h"
SGD optimizer(model.parameters(), 0.1f, 0.9f);
// Step decay: multiply LR by gamma every step_size epochs
StepLR scheduler(&optimizer, 30, 0.1f); // decay by 0.1 every 30 epochs
// Exponential decay: multiply LR by gamma every epoch
ExponentialLR scheduler(&optimizer, 0.95f);
// Cosine annealing: smoothly decrease LR to eta_min over T_max epochs
CosineAnnealingLR scheduler(&optimizer, 100, 0.0001f);
// Cosine with warm restarts: reset LR periodically
CosineAnnealingWarmRestarts scheduler(&optimizer, 10, 2, 0.0001f); // T_0=10, T_mult=2
// Reduce on plateau: reduce LR when metric stops improving
ReduceLROnPlateau scheduler(&optimizer, 0.1f, 10, 0.0001f, true); // factor, patience, min_lr, mode_min
// Call at end of each epoch
for (int epoch = 0; epoch < num_epochs; epoch++) {
train_one_epoch();
scheduler.step(); // For most schedulers
// scheduler.step(validation_loss); // For ReduceLROnPlateau
}

#include "optimizer.h"
auto params = model.parameters();
// Clip by norm: scale gradients if total norm > max_norm
loss->backward();
clip_grad_norm_(params, 1.0f); // max_norm = 1.0
optimizer.step();
// Clip by value: clamp each gradient to [-clip_value, clip_value]
loss->backward();
clip_grad_value_(params, 0.5f); // clip to [-0.5, 0.5]
optimizer.step();
// Get gradient norm (useful for monitoring)
float grad_norm = get_grad_norm(params);

Mixed precision training uses fp16 (half-precision) for faster computation while maintaining fp32 accuracy through loss scaling:
#include "amp.h"
Sequential model({
new Linear(784, 256),
new ReLU(),
new Linear(256, 10)
});
SGD optimizer(model.parameters(), 0.001f);
GradScaler scaler; // Default: init_scale=65536, growth_interval=2000
CrossEntropyLoss criterion;
for (int epoch = 0; epoch < epochs; epoch++) {
for (auto [x, y] : dataloader) {
optimizer.zero_grad();
// Forward pass (in real fp16, this would use half-precision)
auto output = model.forward(x);
auto loss = criterion(output, y);
// Scale loss for numerical stability
auto scaled_loss = scaler.scale(loss);
scaled_loss->backward();
// Unscale gradients and check for overflow
scaler.unscale(&optimizer);
// Gradient clipping (optional, works with scaled gradients)
clip_grad_norm_(model.parameters(), 1.0f);
// Step optimizer only if gradients are finite
scaler.step(&optimizer, true); // true = already unscaled
// Update scale factor based on overflow history
scaler.update();
}
}

GradScaler API:
// Constructor with custom settings
GradScaler scaler(
256.0f, // init_scale: starting scale factor
2.0f, // growth_factor: multiply scale by this after growth_interval good steps
0.5f, // backoff_factor: multiply scale by this on overflow
2000, // growth_interval: steps between scale increases
true // enabled: set to false to disable scaling
);
// Core methods
float scale = scaler.get_scale(); // Get current scale factor
auto scaled = scaler.scale(loss); // Scale a tensor
bool finite = scaler.unscale(&optimizer); // Unscale gradients, returns true if all finite
scaler.step(&optimizer, already_unscaled); // Step if gradients are finite
scaler.update(); // Adjust scale based on overflow history

HalfTensor for memory optimization:
// Store weights in fp16 to save memory (50% reduction)
auto weights = Tensor::randn({1000, 1000}, false);
HalfTensor half_weights(weights);
printf("Original: %zu bytes\n", weights->data.size() * 4); // 4MB
printf("Half: %zu bytes\n", half_weights.data.size() * 2); // 2MB
// Convert back to fp32 for computation
auto restored = half_weights.to_float();

FP16 conversion utilities:
// Convert single values
uint16_t h = float_to_half(3.14159f);
float f = half_to_float(h);
// Convert vectors
std::vector<uint16_t> half_data = to_half(float_data);   // fp32 -> fp16
std::vector<float> restored_data = from_half(half_data); // fp16 -> fp32

Gradient accumulation enables training with effectively larger batch sizes when memory is limited. Instead of updating weights after every batch, accumulate gradients over multiple mini-batches:
#include "optimizer.h"
Sequential model({
new Linear(784, 256),
new ReLU(),
new Linear(256, 10)
});
SGD optimizer(model.parameters(), 0.01f);
CrossEntropyLoss criterion;
GradientAccumulator accumulator(4); // Effective batch = mini_batch * 4
for (auto [x, y] : dataloader) {
auto loss = criterion(model.forward(x), y);
// Scales loss by 1/4 and calls backward
accumulator.backward(loss);
// Step only when we've accumulated 4 batches
if (accumulator.should_step()) {
optimizer.step();
optimizer.zero_grad();
accumulator.reset();
}
}

GradientAccumulator API:
// Create accumulator (effective_batch = mini_batch * accumulation_steps)
GradientAccumulator accumulator(4);
// Option 1: Combined scale + backward
accumulator.backward(loss);
// Option 2: Manual control
auto scaled_loss = accumulator.scale(loss); // loss / accumulation_steps
scaled_loss->backward();
accumulator.increment();
// Check state
accumulator.should_step(); // True when ready for optimizer step
accumulator.is_last_step(); // True on final accumulation step
accumulator.current_step(); // Current step (0 to accumulation_steps-1)
accumulator.get_accumulation_steps(); // Total steps
accumulator.get_scale_factor(); // 1.0 / accumulation_steps
// Reset after optimizer step
accumulator.reset();

Combined with mixed precision:
GradientAccumulator accumulator(4);
GradScaler scaler(256.0f);
for (auto [x, y] : dataloader) {
auto loss = criterion(model.forward(x), y);
// Scale for accumulation, then for mixed precision
auto scaled = scaler.scale(accumulator.scale(loss));
scaled->backward();
accumulator.increment();
if (accumulator.should_step()) {
scaler.unscale(&optimizer);
scaler.step(&optimizer, true);
scaler.update();
optimizer.zero_grad();
accumulator.reset();
}
}

Early stopping prevents overfitting by halting training when validation metrics stop improving:
#include "optimizer.h"
Sequential model({...});
SGD optimizer(model.parameters(), 0.01f);
CrossEntropyLoss criterion;
EarlyStopping early_stopping(10); // patience = 10 epochs
for (int epoch = 0; epoch < max_epochs; epoch++) {
// Training
train_one_epoch(model, train_loader);
// Validation
float val_loss = evaluate(model, val_loader);
// Check if we should stop
if (early_stopping.step(val_loss)) {
printf("Early stopping at epoch %d\n", epoch);
printf("Best val loss: %.4f at epoch %d\n",
early_stopping.best_metric(), early_stopping.best_epoch());
break;
}
}

EarlyStopping API:
// Create early stopping monitor
// patience: epochs to wait after last improvement
// min_delta: minimum change to qualify as improvement
// mode_min: true = lower is better (loss), false = higher is better (accuracy)
EarlyStopping early_stopping(10, 0.001f, true); // For loss
EarlyStopping early_stopping(10, 0.0f, false); // For accuracy
// Step and check
bool should_stop = early_stopping.step(metric);
// Query state
early_stopping.best_metric(); // Best value seen
early_stopping.best_epoch(); // Epoch of best value
early_stopping.epochs_without_improvement(); // Current patience counter
early_stopping.should_stop(); // Whether training should stop
// Reset for new training run
early_stopping.reset();

ModelCheckpoint - Save best model automatically:
#include "optimizer.h"
#include "serialize.h"
EarlyStopping early_stopping(10);
ModelCheckpoint checkpoint("best_model.bin", true); // mode_min=true for loss
for (int epoch = 0; epoch < max_epochs; epoch++) {
train_one_epoch();
float val_loss = evaluate();
// Save model if validation improved
if (checkpoint.step(val_loss, &model)) {
printf("Saved best model (val_loss=%.4f)\n", val_loss);
}
// Check early stopping
if (early_stopping.step(val_loss)) {
printf("Early stopping at epoch %d\n", epoch);
break;
}
}
// Restore best model for inference/testing
checkpoint.restore(&model);

Inspect model architecture, parameter counts, and memory usage:
#include "layer.h"
Sequential model({
new Conv2d(1, 16, 3, 1, 1),
new BatchNorm2d(16),
new ReLU(),
new MaxPool2d(2, 2),
new Flatten(),
new Linear(3136, 128),
new ReLU(),
new Linear(128, 10)
});
// PyTorch-style layer-by-layer summary with output shapes
model.summary({1, 1, 28, 28}); // Pass input shape for shape tracking

Output:
==============================================================================
Layer (type) Output Shape Param #
==============================================================================
Conv2d(1, 16, kernel_size=3) [1, 16, 28, 28] 160
BatchNorm2d(16) [1, 16, 28, 28] 32
ReLU [1, 16, 28, 28] 0
MaxPool2d(kernel_size=2) [1, 16, 14, 14] 0
Flatten [1, 3136] 0
Linear(3136, 128) [1, 128] 401,536
ReLU [1, 128] 0
Linear(128, 10) [1, 10] 1,290
==============================================================================
Total params: 403,018
Trainable params: 403,018
Non-trainable params: 0
==============================================================================
Utility functions:
// Get detailed model info
ModelSummary info = get_model_summary(&model);
printf("Total params: %zu\n", info.total_params);
printf("Trainable: %zu\n", info.trainable_params);
printf("Param memory (fp32): %zu bytes\n", info.param_memory_bytes);
printf("Param memory (fp16): %zu bytes\n", info.param_memory_fp16_bytes);
printf("Gradient memory: %zu bytes\n", info.grad_memory_bytes);
printf("Total training memory: %zu bytes\n", info.total_memory_bytes);
// Convenience functions
size_t total = count_parameters(&model);
size_t trainable = count_trainable_parameters(&model);
// Human-readable formatting
std::string params = format_number(1234567); // "1,234,567"
std::string memory = format_memory(1024*1024); // "1.00 MB"
// Simple summary printout
print_model_info(&model, "My CNN");

Log metrics during training with TensorBoard-style tracking and export:
#include "logging.h"
TrainingLogger logger("logs", "my_experiment");
logger.set_total_epochs(10);
logger.set_total_steps(100); // Steps per epoch (for progress bar)
for (int epoch = 0; epoch < 10; epoch++) {
logger.new_epoch();
for (int batch = 0; batch < 100; batch++) {
// Train...
float loss = train_batch();
float acc = compute_accuracy();
// Log batch metrics (for epoch averaging)
logger.log_batch("loss", loss);
logger.log_batch("accuracy", acc);
// Show progress bar
logger.print_progress();
}
// Log epoch-level metrics
logger.log("train_loss", logger.epoch_mean("loss"));
logger.log("train_acc", logger.epoch_mean("accuracy"));
logger.log("lr", optimizer.lr);
logger.step();
logger.print_epoch_summary();
}
// Save logs and print summary
logger.save_csv(); // logs/my_experiment_metrics.csv
logger.save_json(); // logs/my_experiment_metrics.json
logger.print_summary();

Console output:
Epoch 3/10 [============> ] 60% loss: 0.4523 accuracy: 0.8721 (1m 23s)
Epoch 3/10 - loss: 0.4512 accuracy: 0.8734 - 2m 18s
==============================================================================
Training Summary: my_experiment
==============================================================================
Total epochs: 10
Total steps: 1000
Elapsed time: 23m 45s
------------------------------------------------------------------------------
train_loss: min: 0.1234 max: 1.2345 mean: 0.4567 std: 0.2345
train_acc: min: 0.7500 max: 0.9500 mean: 0.8750 std: 0.0500
==============================================================================
MetricTracker for statistics:
// Track running statistics for any metric
MetricTracker tracker;
for (float value : batch_losses) {
tracker.update(value);
}
printf("Mean: %.4f, Std: %.4f, Min: %.4f, Max: %.4f\n",
tracker.mean(), tracker.std(), tracker.min(), tracker.max());

ProgressBar for loops:
ProgressBar bar(1000, 40, "Training: ");
for (int i = 0; i < 1000; i++) {
// Do work...
bar.update();
}
bar.finish();
// Output: Training: [=================> ] 50% 500/1000 [1.2s < 1.2s]

CSV export format:
step,epoch,timestamp,train_loss,train_acc,lr
0,1,12.34,0.9876,0.7500,0.001
1,2,25.67,0.5432,0.8500,0.001
...

JSON export format:
{
"experiment": "my_experiment",
"total_steps": 10,
"total_epochs": 10,
"elapsed_seconds": 1234.56,
"summary": {
"train_loss": {"min": 0.12, "max": 1.23, "mean": 0.45, "std": 0.23, "last": 0.15}
},
"history": [
{"step": 0, "epoch": 1, "timestamp": 12.34, "train_loss": 0.9876}
]
}

Use NoGradGuard for inference to improve performance:
{
NoGradGuard no_grad; // Disables gradient tracking in this scope
auto output = model.forward(input); // No computation graph built
// ~3x faster inference
}
// Gradients automatically re-enabled when guard goes out of scope

The model zoo provides pre-defined architectures and pretrained weights:
#include "model_zoo.h"
// List available models
auto models = list_models(); // {"mnist_mlp", "mnist_cnn", "cifar10_simple", ...}
// Get model info
auto info = ModelZoo::instance().get_info("mnist_cnn");
printf("Params: %zu, Expected accuracy: %.1f%%\n", info.num_params, info.accuracy);
// Create model architecture (random weights)
Sequential* model = create_model("mnist_cnn");
// Load pretrained model (architecture + weights)
Sequential* pretrained = load_pretrained("mnist_cnn");
// Save a trained model to the zoo
ModelZoo::instance().save_to_zoo("mnist_cnn", model);
// Change weights directory (default: "pretrained/")
ModelZoo::instance().set_weights_dir("my_models/");

Available models:

| Model | Dataset | Input Shape | Params | Expected Accuracy |
|---|---|---|---|---|
| `mnist_mlp` | MNIST | [1, 784] | 203K | ~97.5% |
| `mnist_cnn` | MNIST | [1, 1, 28, 28] | 207K | ~98.5% |
| `cifar10_simple` | CIFAR-10 | [1, 3, 32, 32] | 310K | ~75% |
| `cifar10_vgg` | CIFAR-10 | [1, 3, 32, 32] | 3.2M | ~85% |
| `tiny_mlp` | synthetic | [1, 4] | 42 | 100% |
Training and saving to the zoo:
// Train a model
Sequential* model = create_model("mnist_cnn");
// ... training loop ...
// Save to pretrained/mnist_cnn.bin
ModelZoo::instance().save_to_zoo("mnist_cnn", model);
// Later, load the pretrained model
Sequential* loaded = load_pretrained("mnist_cnn");

Export models to ONNX format for use with ONNX Runtime, TensorRT, or other frameworks:
#include "onnx_export.h"
// Simple export with input shape
Sequential* model = load_pretrained("mnist_cnn");
export_onnx(model, "model.onnx", {1, 1, 28, 28});
// Export with options
ONNXExportOptions options;
options.model_name = "my_model";
options.input_shape = {1, 1, 28, 28};
options.verbose = true;
export_onnx(model, "model.onnx", options);
// Get export info (for debugging)
std::string info = get_onnx_export_info(model, {1, 1, 28, 28});

Supported layers for ONNX export:
- `Linear` → Gemm
- `Conv2d` → Conv
- `ReLU`, `Sigmoid`, `Tanh` → Relu, Sigmoid, Tanh
- `Softmax` → Softmax
- `Flatten` → Flatten
- `MaxPool2d`, `AvgPool2d` → MaxPool, AveragePool
- `BatchNorm2d` → BatchNormalization
- `Dropout` → Identity (inference mode)
Using the exported model:
import onnxruntime as ort
import numpy as np
# Load and run inference
sess = ort.InferenceSession("mnist_cnn.onnx")
x = np.random.randn(1, 1, 28, 28).astype(np.float32)
y = sess.run(None, {"input": x})[0]
print(f"Predicted class: {np.argmax(y)}")

Basic DataLoader (MNIST):
#include "mnist.h"
// Load MNIST dataset
auto train_data = load_mnist_train("data/");
auto test_data = load_mnist_test("data/");
// Create data loader with batching and shuffling
DataLoader train_loader(train_data, /*batch_size=*/64, /*shuffle=*/true);
DataLoader test_loader(test_data, /*batch_size=*/64, /*shuffle=*/false);
// Iterate over batches
while (train_loader.has_next()) {
auto [images, labels] = train_loader.next_batch();
// images: [batch_size, 784], labels: [batch_size]
}
train_loader.reset(); // Reset for next epoch

ThreadedDataLoader (Multi-threaded with Prefetching):
#include "dataloader.h"
// Convert dataset to generic Dataset struct
Dataset train_dataset{train_data.images, train_data.labels, train_data.num_samples};
// Create threaded data loader
// Args: dataset, batch_size, shuffle, num_workers, prefetch_factor
ThreadedDataLoader train_loader(train_dataset, 64, true, 2, 2); // 2 workers, prefetch 2 batches each
ThreadedDataLoader test_loader(test_dataset, 64, false, 0); // 0 workers = synchronous
// Training loop
for (int epoch = 0; epoch < num_epochs; epoch++) {
train_loader.reset(); // Shuffles data, starts worker threads
while (train_loader.has_next()) {
auto [images, labels] = train_loader.next_batch_pair();
// Workers prefetch next batches while you train
}
}

The ThreadedDataLoader uses background threads to prefetch batches while the main thread processes the current batch, improving training throughput on CPU-bound workloads.
CIFAR-10:
#include "cifar10.h"
// Download CIFAR-10 binary files to data/ directory first
// Files needed: data_batch_1.bin ... data_batch_5.bin, test_batch.bin
// Load CIFAR-10 dataset (automatically normalized with ImageNet stats)
auto train_data = load_cifar10_train("data/"); // 50,000 images
auto test_data = load_cifar10_test("data/"); // 10,000 images
// Create data loader with augmentation (random crop + horizontal flip)
CIFAR10DataLoader train_loader(train_data, 64, /*shuffle=*/true, /*augment=*/true);
CIFAR10DataLoader test_loader(test_data, 64, /*shuffle=*/false, /*augment=*/false);
while (train_loader.has_next()) {
auto [images, labels] = train_loader.next_batch();
// images: [batch_size, 3, 32, 32], labels: [batch_size]
}
// Get class name
const char* name = cifar10_class_name(3); // "cat"

Image augmentation:

// For images with shape [N, C, H, W] or [C, H, W]
// Horizontal flip
auto flipped = img->flip_horizontal();
auto maybe_flipped = img->random_flip_horizontal(0.5f); // 50% chance
// Padding and cropping
auto padded = img->pad2d(4); // Zero-pad by 4 pixels on each side
auto cropped = padded->crop(2, 2, 32, 32); // Crop from (top=2, left=2)
auto random_cropped = padded->random_crop(32, 32); // Random position
// Standard CIFAR-10 augmentation pipeline
auto augmented = img->pad2d(4)->random_crop(32, 32)->random_flip_horizontal(0.5f);

Complete training example (MNIST MLP):

#include "tensor.h"
#include "layer.h"
#include "loss.h"
#include "optimizer.h"
#include "mnist.h"
int main() {
// Load data
auto train_data = load_mnist_train("data/");
DataLoader train_loader(train_data, 64, true);
// Define model
Sequential model({
new Linear(784, 256),
new ReLU(),
new Linear(256, 10)
});
// Loss and optimizer
CrossEntropyLoss criterion;
SGD optimizer(model.parameters(), 0.01, 0.9);
// Training loop
for (int epoch = 0; epoch < 10; epoch++) {
train_loader.reset();
float total_loss = 0;
int batches = 0;
while (train_loader.has_next()) {
auto [images, labels] = train_loader.next_batch();
optimizer.zero_grad();
auto output = model.forward(images);
auto loss = criterion(output, labels);
loss->backward();
optimizer.step();
total_loss += loss->item();
batches++;
}
printf("Epoch %d, Loss: %.4f\n", epoch + 1, total_loss / batches);
}
return 0;
}

A convolutional neural network example is included in cnn_mnist.cpp:
make cnn_mnist
./cnn_mnist

Architecture:
Conv2d(1, 16, 3) -> BatchNorm2d -> ReLU -> MaxPool2d(2)
Conv2d(16, 32, 3) -> BatchNorm2d -> ReLU -> MaxPool2d(2)
Flatten -> Linear(1568, 128) -> ReLU -> Linear(128, 10)
Results: ~98.5% test accuracy after 1 epoch (~207k parameters)
A VGG-style CNN for CIFAR-10 color image classification in cnn_cifar10.cpp:
# Download CIFAR-10 data
mkdir -p data && cd data
curl -LO https://www.cs.toronto.edu/~kriz/cifar-10-binary.tar.gz
tar -xzf cifar-10-binary.tar.gz
mv cifar-10-batches-bin/*.bin .
cd ..
# Build and run
make cnn_cifar10
./cnn_cifar10

Architecture:
Block 1: Conv(3,32,3) -> BN -> ReLU -> Conv(32,32,3) -> BN -> ReLU -> MaxPool(2)
Block 2: Conv(32,64,3) -> BN -> ReLU -> Conv(64,64,3) -> BN -> ReLU -> MaxPool(2)
Block 3: Conv(64,128,3) -> BN -> ReLU -> Conv(128,128,3) -> BN -> ReLU -> MaxPool(2)
Classifier: Flatten -> Linear(2048,256) -> ReLU -> Dropout(0.5) -> Linear(256,10)
Features:
- ~815k parameters (~3.1 MB)
- Adam optimizer with cosine annealing learning rate schedule
- Data augmentation: random crop (pad=4) + horizontal flip
- im2col + GEMM + OpenMP optimized convolution (~1.5s/batch on Apple M1)
- Expected: ~75-85% test accuracy after 20 epochs (~19 min/epoch)
A simple transformer language model example is included in transformer_example.cpp:
make transformer_example
./transformer_example
Architecture:
Embedding -> Positional Encoding -> 2x Transformer Blocks -> Linear
Each block: MultiHeadAttention -> LayerNorm -> FFN -> LayerNorm
The model learns to predict the next character in a simple repeating pattern ("abcdef"):
- 17k parameters with embed_dim=32, num_heads=4, 2 layers
- Trains to near-zero loss in ~100 epochs
- Generates perfect sequences from any starting prompt
Sample output:
Prompt 'abc' -> abcdefabcdefabcdefabcdef...
Prompt 'f' -> fabcdefabcdefabcdefabcdef...
A convolutional autoencoder using ConvTranspose2d for image reconstruction on MNIST:
make autoencoder
./build/autoencoder
Architecture:
Encoder: Conv2d(1,16,3,s=2) -> Conv2d(16,32,3,s=2) -> Conv2d(32,64,3,s=2) -> Linear(1024,32)
28x28 -> 14x14 -> 7x7 -> 4x4 -> 32-dim latent vector
Decoder: Linear(32,1024) -> ConvTranspose2d(64,32) -> ConvTranspose2d(32,16) -> ConvTranspose2d(16,1)
32-dim latent -> 4x4 -> 7x7 -> 14x14 -> 28x28
Features:
- ~85k parameters with 32-dimensional latent space
- Uses ConvTranspose2d for learned upsampling (decoder)
- MSE loss for pixel-wise reconstruction
- Demonstrates latent space interpolation between digits
- ASCII art visualization of reconstructions
Sample output:
Epoch 10/10 Train Loss: 0.012345 Test Loss: 0.012567
Reconstruction Examples:
Original: Reconstructed:
.::---==+++**##%@@ .::---==+++**##%@@
:::---==+++**##%%@ .::---==+++**##%%@
... ...
Latent Space Interpolation (digit 3 -> digit 7):
[image 1] [image 2] [image 3] [image 4] [image 5]
A Deep Convolutional GAN (DCGAN) for generating handwritten digits:
make gan
./build/gan
Architecture:
Generator (noise -> image):
Linear(100, 4096) -> Reshape(256,4,4) -> ConvTranspose2d -> 7x7 -> ConvTranspose2d -> 14x14 -> ConvTranspose2d -> 28x28
Discriminator (image -> real/fake):
Conv2d(1,64) -> 14x14 -> Conv2d(64,128) -> 7x7 -> Conv2d(128,256) -> 4x4 -> Linear -> sigmoid
Features:
- ~1.5M parameters (Generator: ~1M, Discriminator: ~500K)
- 100-dimensional latent space
- Adam optimizer with beta1=0.5 (standard for GANs)
- Label smoothing (real labels = 0.9) for training stability
- Dropout in discriminator to prevent overfitting
- Demonstrates latent space interpolation and diversity checking
Training dynamics:
- D(x): Discriminator output on real images (should stay ~0.5-0.8)
- D(G(z)): Discriminator output on fake images (should rise from ~0 to ~0.5)
- Balanced training when both losses are similar
Sample output:
Epoch 20/20 D_loss: 0.8234 G_loss: 1.2345 D(x): 0.72 D(G(z)): 0.45
Generated samples (epoch 20):
.::---==+++**##%@@ .::---==+++**##%@@ .::---==+++**##%@@
:::---==+++**##%%@ :::---==+++**##%%@ :::---==+++**##%%@
Latent Space Interpolation:
[z1] -> [interp1] -> [interp2] -> [interp3] -> [z2]
A character-level RNN language model for generating Shakespeare-style text:
make rnn
./build/rnn_text_gen
Architecture:
Embedding(vocab_size, 128) -> LSTM(128, 256) -> LSTM(256, 256) -> Dropout(0.3) -> Linear(256, vocab_size)
Features:
- ~500K parameters with 2-layer LSTM and 256 hidden units
- Character-level language modeling (predicts next character)
- Embedded Shakespeare corpus (~2KB) for training
- Temperature-based sampling for text generation:
- Low temperature (0.5): More conservative, repetitive text
- Medium temperature (0.8): Balanced creativity and coherence
- High temperature (1.2): More random, creative text
- Gradient clipping (max norm = 5.0) for stable RNN training
- Custom sequence cross-entropy loss with proper gradient computation
Training output:
Epoch 50/50 Loss: 1.2345 Perplexity: 3.44
Generated text (temperature=0.8):
----------------------------------------
ROMEO:
What light through yonder window breaks?
It is the east, and Juliet is the sun.
Arise, fair sun, and kill the envious moon...
----------------------------------------
Sample generation at different temperatures:
Temperature 0.5 (conservative):
"the the the the and the..."
Temperature 0.8 (balanced):
"What dreams may come when we have shuffled off..."
Temperature 1.2 (creative):
"Twas brillig sloathy toves did gyre..."
The framework includes several optimizations:
- SIMD Vectorization: ARM NEON (Apple Silicon) and x86 SSE/AVX support
- Blocked Matrix Multiplication: Cache-friendly 32x32 block tiling
- im2col + GEMM Convolution: Converts conv2d to optimized matrix multiplication
- OpenMP Parallelization: Multi-threaded convolution and GEMM operations
- Threaded Data Loading: Background workers prefetch batches during training
- Mixed Precision (fp16): GradScaler for loss scaling, HalfTensor for memory optimization
- Gradient Accumulation: Train with larger effective batch sizes on limited memory
- Early Stopping: Prevent overfitting with automatic training termination
- NoGradGuard: Skip computation graph building during inference
- O3 Optimization: Aggressive compiler optimizations enabled
brew install libomp
| Model | Batch Time | Epoch Time |
|---|---|---|
| Simple CNN (2 conv) | ~76 ms | ~1 min |
| VGG-style (6 conv) | ~1.5 s | ~19 min |
Typical MNIST training: ~18 seconds/epoch on Apple M1.
make # Build optimized release (CPU-only)
make debug # Build with debug symbols
make clean # Remove build artifacts
make run # Build and run
make test      # Run unit tests
GPU backends (optional):
- Metal (macOS / Apple Silicon): make METAL=1 — uses Metal for matmul when tensors are moved with .to(Device::metal()).
- CUDA (Linux / cloud): make CUDA=1 — uses cuBLAS for matmul and batched matmul when tensors are moved with .to(Device::cuda()). Requires nvcc and the CUDA toolkit; set CUDA_PATH if needed. The default build is CPU-only and does not require CUDA.
Device auto-selection: Use Device::auto_() or Device::default_device() to pick the best available backend (Metal on macOS when built with METAL=1, else CUDA when built with CUDA=1, else CPU) so you don’t need to hard-code backend or remember make flags.
ONNX: Export with export_onnx(model, path, input_shape); import with load_onnx(path) for the same op set (Gemm, Conv, Relu, Sigmoid, Tanh, Softmax, Flatten, MaxPool, AveragePool, BatchNormalization, Identity). Set ONNXExportOptions::export_fp16 = true to export initializers in Float16 for smaller, edge-friendly models. For memory-efficient inference in C++, use HalfTensor and fp16 helpers in core/amp.h.
Comprehensive unit tests are provided in the tests/ directory:
# Run all tests
make test
# Run specific test suites
make test-tensor # Tensor operations
make test-autograd # Automatic differentiation
make test-layers # Neural network layers
make test-loss # Loss functions
make test-optimizer  # Optimizers and schedulers
Test coverage:
- Tensor Operations (39 tests): Creation, arithmetic, matrix ops, reductions, shape manipulation
- Autograd (23 tests): Gradient computation for all differentiable operations
- Layers (42 tests): Forward pass, parameters, gradients for all layer types
- Loss Functions (22 tests): Correctness and gradients for all losses
- Optimizers (26 tests): SGD, Adam, AdamW, RMSprop, schedulers, early stopping
Sample output:
################################################################################
# WHITEMATTER UNIT TESTS #
################################################################################
================================================================================
Test Suite: Tensor Operations
================================================================================
[PASS] zeros (0.01ms)
[PASS] ones (0.00ms)
[PASS] matmul_2d (0.52ms)
...
--------------------------------------------------------------------------------
Results: 39 passed, 0 failed, 39 total (0.95ms)
================================================================================
################################################################################
TOTAL: 152 passed, 0 failed (0.01s)
################################################################################
- Link against Accelerate/BLAS — Replace hand-rolled matmul with cblas_sgemm for 5-10x speedup
- Fix matmul blocking order — Current (i,k,j) causes cache misses on B columns; transpose B or switch to (i,j,k) blocking for ~2x improvement (core/ops/matmul_cpu.cpp)
- Rewrite attention with batched matmul — Q*K^T uses 6 nested scalar loops instead of bmm; catastrophically slow for seq_len > 256 (core/layers/attention.cpp:105-170)
- Flash attention — Current implementation stores the full O(N^2) attention matrix; flash attention reduces memory to O(N) and speeds up 10-100x
- Cache im2col buffer in Conv2d — Allocates a std::vector every forward pass; caching eliminates thousands of heap allocations per epoch (core/ops/conv_ops.cpp:41)
- Winograd convolution for 3x3 kernels — ~2.5x speedup for the most common conv kernel size
- Fix BatchNorm iteration order — Iterates (c,b,h,w) but tensor layout is (b,c,h,w); swap loops for better cache locality (core/layers/normalization.cpp:31-56)
- Add FMA SIMD instructions — _mm256_fmadd_ps for AVX, vfmaq_f32 for NEON; currently unused
- Add -march=native -flto to Makefile for a free 5-15% speedup from LTO and native ISA
- Conv+BN+ReLU fusion — Fuse into a single kernel to eliminate intermediate memory traffic
- Thread-safe RNG — The global static std::mt19937 in tensor.cpp is not thread-safe; use thread_local
- Numerical gradient checking — Add finite-difference gradient verification to the test suite to catch backward-pass bugs
- Fix grad_fn circular references — Lambda captures of shared_ptrs can create cycles and leak memory (tensor.cpp:413)
- Conv signed/unsigned mismatch — Padding is computed as int but used as size_t (conv_ops.cpp:23-24)
- Bounded memory pool — Free lists grow without limit; add a max bucket size to prevent unbounded memory growth
- Mixed precision (fp16/bf16) — Halves memory bandwidth (the real bottleneck) and enables tensor cores on NVIDIA GPUs
- Metal GPU backend — Stubs exist but aren't implemented; M-series Macs have powerful GPUs sitting idle
- CUDA backend — Move beyond stubs to functional GPU compute
- Grouped/depthwise convolutions — Required for MobileNet, EfficientNet, and modern architectures
- Dilated convolutions — Common in semantic segmentation (DeepLab, WaveNet)
- INT8/INT4 quantization — GGML-style quantized inference for practical deployment
- Operator graph compilation — Record operations, optimize the graph, then execute (like TorchScript/XLA)
- C++17 compatible compiler (g++, clang++)
- No external dependencies
