Commit 776a620 (parent 50390d6)

# Seed-Regenerated Random Model + Incremental N-gram Cache — val_bpb 0.0905

**val_bpb = 0.0905** (1 seed; additional seeds pending H100 access) | **15.09 MB** | 8×MI250X (H100 validation pending)

## Results

| Seed | step_avg | steps | neural_bpb | blended_bpb | Artifact (bytes) |
|------|----------|-------|------------|-------------|------------------|
| 1337 | 272 ms (MI250X) | 9,912 | 1.503 | **0.0905** | 15,093,968 |
| 42 | | | | | pending |
| 2025 | | | | | pending |

> **Note**: This submission was developed and validated on 8×MI250X (LUMI supercomputer). H100 validation with 3 seeds is pending RunPod access. Expected H100 step_avg: ~68 ms, ~8,800 steps in 600 s.

## Key Innovation: Seed-Regenerated Weights

All weight matrices in the transformer blocks (Q, K, V, O-proj, MLP-up, MLP-down) use **frozen orthogonal random projections** regenerated from deterministic seeds at load time. The artifact stores only:

- **LoRA adapters** (rank-64 A and B matrices): ~3.9 MB at INT8
- **Embedding + control tensors**: ~1.0 MB at FP16
- **N-gram cache** (INT16 counts, LZMA-compressed): ~10.7 MB
- **Code**: ~0.1 MB

The random base weights cost **0 bytes** in the artifact — they are regenerated from 8-byte seeds per matrix via QR-decomposed orthogonal initialization.
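A sketch of what such a layer could look like (hypothetical module; the name `SeededRandomLinear` and the exact composition of base and LoRA are assumptions based on the description above, not the submission's code):

```python
import math
import torch
import torch.nn as nn

class SeededRandomLinear(nn.Module):
    """Frozen orthogonal random base (regenerated from a seed) + trainable low-rank LoRA."""

    def __init__(self, seed, in_features, out_features, rank=64):
        super().__init__()
        g = torch.Generator(device='cpu')
        g.manual_seed(seed)
        size = max(out_features, in_features)
        Q, _ = torch.linalg.qr(torch.randn(size, size, generator=g))
        base = Q[:out_features, :in_features] / math.sqrt(in_features)
        # The base is a buffer, not a parameter: it costs 0 bytes in the artifact
        # because only the seed is needed to regenerate it at load time.
        self.register_buffer('base', base)
        self.lora_A = nn.Parameter(torch.randn(rank, in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(out_features, rank))

    def forward(self, x):
        # Frozen random base plus learned rank-64 delta.
        w = self.base + self.lora_B @ self.lora_A
        return x @ w.t()

layer = SeededRandomLinear(seed=1337, in_features=512, out_features=512)
y = layer(torch.randn(2, 512))
print(y.shape)  # torch.Size([2, 512])
```

Only `lora_A`, `lora_B`, and the seed would need to be serialized; rebuilding the module with the same seed reproduces the base exactly.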

### Why Orthogonal (not Gaussian)

Prior work (PR #874) used Gaussian random bases but could not train models deeper than 5 layers — gradients vanish through deep stacks of random projections. Our **orthogonal initialization via QR decomposition** preserves singular values at exactly 1.0, enabling stable training of 11-layer random models (though we use 5 layers here for throughput).

```python
@staticmethod
def _generate_orthogonal_base(seed, rows, cols):
    # Deterministic CPU generator: the same seed always regenerates the same base.
    g = torch.Generator(device='cpu')
    g.manual_seed(seed)
    size = max(rows, cols)
    raw = torch.randn(size, size, generator=g)
    # QR of a square Gaussian yields an orthogonal Q: all singular values are 1.0.
    Q, _ = torch.linalg.qr(raw)
    return Q[:rows, :cols] / math.sqrt(cols)
```
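As a quick standalone check (illustrative only; the helper below mirrors `_generate_orthogonal_base` as a free function), one can verify both properties the argument relies on: determinism given a seed, and singular values preserved at exactly 1.0 before the 1/sqrt(cols) scaling:

```python
import math
import torch

def generate_orthogonal_base(seed, rows, cols):
    # Mirrors the method above, as a free function for testing.
    g = torch.Generator(device='cpu')
    g.manual_seed(seed)
    size = max(rows, cols)
    raw = torch.randn(size, size, generator=g)
    Q, _ = torch.linalg.qr(raw)
    return Q[:rows, :cols] / math.sqrt(cols)

base = generate_orthogonal_base(seed=1337, rows=512, cols=1536)

# Same seed regenerates bit-identical weights.
assert torch.equal(base, generate_orthogonal_base(1337, 512, 1536))

# Undo the 1/sqrt(cols) scaling: every singular value of the slice is 1.0,
# so the projection neither amplifies nor attenuates any direction.
s = torch.linalg.svdvals(base * math.sqrt(1536))
print(s.min().item(), s.max().item())  # both ~1.0
```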

### Adapter Quantization: Nearly Lossless

The LoRA adapters are quantized with simple per-row INT8 (no GPTQ needed). The quantization gap is only **+0.003 BPB** — dramatically better than INT6 GPTQ on full weight matrices (+0.006 for the baseline).
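"Per-row INT8" most likely means symmetric absmax quantization with one scale per row; a sketch of that common scheme (an assumption — the submission's exact quantizer is not shown in this excerpt):

```python
import torch

def quantize_per_row_int8(w):
    # One scale per row: scale = max|w| / 127, symmetric, no zero point.
    scale = (w.abs().amax(dim=1, keepdim=True) / 127.0).clamp(min=1e-8)
    q = torch.clamp(torch.round(w / scale), -127, 127).to(torch.int8)
    return q, scale.half()  # FP16 scales keep the side channel tiny

def dequantize(q, scale):
    return q.float() * scale.float()

A = torch.randn(64, 512) * 0.02          # e.g. a rank-64 LoRA A matrix
qA, sA = quantize_per_row_int8(A)
err = (dequantize(qA, sA) - A).abs().max()
print(qA.dtype, sA.dtype, err.item())    # int8 payload, float16 scales, tiny error
```

Because each row gets its own scale, outlier rows do not degrade the precision of the rest — one reason a plain absmax scheme can stay nearly lossless on small adapter matrices.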

## N-gram Cache: Incremental Build During Training

The n-gram cache is built **incrementally during training** with negligible overhead:

```python
# After each training microstep (cost: <1 ms per call):
ngram_counter.update_batch_fast(full_seq.cpu().numpy().astype(np.int32))
```

- **Orders**: 2-7 (hash-bucketed count tables)
- **Counts**: INT16 (stored as uint16), clipped at 65,535
- **Total counts**: 31.1 billion (from 9,912 steps × 524K tokens × 8 GPUs)
- **Multi-GPU sync**: `dist.all_reduce(SUM)` across 8 GPUs before serialization
- **Compression**: LZMA preset 9 → 10.7 MB
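The counter itself is not shown in this excerpt; a minimal sketch of a hash-bucketed, uint16-saturating counter with an `update_batch_fast`-style vectorized update (class name, hashing scheme, and bucket count are all assumptions for illustration):

```python
import numpy as np

class HashNgramCounter:
    """Per-order hash-bucketed count tables with saturating uint16 counts."""

    def __init__(self, orders=range(2, 8), num_buckets=1 << 22):
        self.orders = list(orders)
        self.num_buckets = num_buckets
        self.mult = np.uint64(0x9E3779B1)  # odd multiplier for the rolling hash
        self.tables = {n: np.zeros(num_buckets, dtype=np.uint16) for n in self.orders}

    def _bucket_ids(self, tokens, n):
        # Multiplicative hash over every length-n window, fully vectorized.
        h = np.zeros(len(tokens) - n + 1, dtype=np.uint64)
        for i in range(n):
            h = h * self.mult + tokens[i : len(tokens) - n + 1 + i].astype(np.uint64)
        return (h % np.uint64(self.num_buckets)).astype(np.int64)

    def update_batch_fast(self, tokens):
        tokens = np.asarray(tokens, dtype=np.int32).ravel()
        for n in self.orders:
            ids = self._bucket_ids(tokens, n)
            # Accumulate in uint32 (np.add.at handles repeated ids), then
            # clip back to the uint16 ceiling of 65,535.
            counts = self.tables[n].astype(np.uint32)
            np.add.at(counts, ids, 1)
            self.tables[n] = np.minimum(counts, 65535).astype(np.uint16)

counter = HashNgramCounter(num_buckets=1 << 16)
counter.update_batch_fast(np.random.randint(0, 1024, size=4096))
print(sum(int(t.sum()) for t in counter.tables.values()))  # 24555 windows counted
```

For the multi-GPU sync described above, each rank would accumulate locally and the tables would be summed (with the same saturation) via `dist.all_reduce` before serialization.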

At eval time, the cache is **frozen** — no TTT, no eval-time updates. Entropy-adaptive alpha blending:

```
alpha   = min(alpha_max, log1p(count) / 10)
P_blend = alpha * P_ngram + (1 - alpha) * P_neural
```
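Concretely (a sketch; the tensor shapes and the `alpha_max` value are assumptions), the blend can be applied per position over the vocabulary:

```python
import torch

def blend(p_neural, p_ngram, counts, alpha_max=0.95):
    """Blend neural and n-gram distributions with a count-adaptive weight.

    p_neural, p_ngram: (batch, vocab) probability distributions.
    counts: (batch,) context counts from the frozen n-gram cache.
    """
    # Rare contexts (low count) lean on the neural model; frequent ones on the cache.
    alpha = torch.clamp(torch.log1p(counts.float()) / 10.0, max=alpha_max)
    alpha = alpha.unsqueeze(-1)  # broadcast over the vocab dimension
    return alpha * p_ngram + (1.0 - alpha) * p_neural

p_neural = torch.softmax(torch.randn(4, 1024), dim=-1)
p_ngram = torch.softmax(torch.randn(4, 1024), dim=-1)
counts = torch.tensor([0, 5, 500, 65535])
p = blend(p_neural, p_ngram, counts)
print(p.sum(dim=-1))  # each row still sums to 1
```

A count of 0 gives alpha = 0 (pure neural output), while a saturated count of 65,535 hits the `alpha_max` ceiling; the convex combination keeps every row a valid distribution.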

### Why Incremental > Pre-fill

We also tested pre-filling the cache from training shards at startup. This was roughly **10× worse** (0.996 BPB vs 0.0905) because:

1. Pre-fill consumed 24-33% of the training budget (650-880 s for 10 shards)
2. NumPy hash computation on 50M-token shards was catastrophically slow
3. It covered only 10/80 shards, whereas the incremental build sees ALL training tokens

## Architecture

| Parameter | Value |
|-----------|-------|
| Layers | 5 |
| Model dim | 512 |
| Heads / KV heads | 8 / 4 |
| MLP multiplier | 3.0 (hidden = 1536) |
| Activation | LeakyReLU(0.5)² |
| Adapter rank | 64 |
| Random init | Orthogonal (QR decomposition) |
| Vocab | 1024 (BPE) |
| Sequence length | 2048 |

## Training

| Parameter | Value |
|-----------|-------|
| Optimizer (adapters) | Muon (NS5, momentum 0.99) |
| Optimizer (embed/scalar) | AdamW |
| Matrix LR | 0.04 |
| Grad clip norm | 0.1 |
| Weight decay | 0.04 |
| Batch tokens | 524,288 |
| EMA decay | 0.997 |

## Ablation

| Config | BPB | Notes |
|--------|-----|-------|
| Neural only (post-quant) | 1.503 | Adapter INT8, no cache |
| Neural, sliding-window eval | 1.474 | stride = 64 |
| **Neural + n-gram blend** | **0.0905** | Entropy-adaptive alpha, frozen cache |
| Improvement from cache | -1.413 | |

## Artifact Budget

```
Neural model (INT8 adapters + FP16 embed):  4,401,588 bytes
N-gram cache (INT16 counts, LZMA):         10,692,380 bytes
Total:                                     15,093,968 bytes
Remaining (of 16,000,000-byte budget):        906,032 bytes
```

## Credits

- PR #874 (@fielding) — Random linear maps concept
- PR #931 (@AnirudhRahul) — Packed n-gram artifact approach
- arXiv:2407.00957 — Expressivity with random weights and learned biases
- PR #549 (@abaybektursun) — LeakyReLU² activation, score-first TTT protocol

## Submission Metadata

```json
{
  "name": "Seed-Regenerated Random Model + Incremental N-gram Cache",
  "val_bpb": 0.0905,
  "bytes_total": 15093968,
  "blurb": "5L 512d model with ALL weight matrices as seeded orthogonal random projections (0 bytes in artifact) + rank-64 LoRA adapters (3.9 MB). The remaining 11 MB holds an incrementally-built INT16 n-gram cache (orders 2-7, 31B counts from 8-GPU all-reduce). The neural model achieves 1.50 BPB; entropy-adaptive blending with the n-gram cache yields 0.0905 BPB. No eval-time adaptation — the cache is frozen after training.",
  "author": "Vilhelm Toivonen",
  "github_id": "vimeto",
  "date": "2026-03-29"
}
```
