# Seed-Regenerated Random Model + Incremental N-gram Cache — val_bpb 0.0905

**val_bpb = 0.0905** (1 seed, additional seeds pending H100 access) | **15.09 MB** | 8×MI250X (H100 validation pending)

## Results

| Seed | step_avg | steps | neural_bpb | blended_bpb | Artifact (bytes) |
|------|----------|-------|------------|-------------|------------------|
| 1337 | 272ms (MI250X) | 9,912 | 1.503 | **0.0905** | 15,093,968 |
| 42 | — | — | — | — | pending |
| 2025 | — | — | — | — | pending |

> **Note**: This submission was developed and validated on 8×MI250X (LUMI supercomputer). H100 validation with 3 seeds is pending RunPod access. Expected H100 step_avg: ~68ms, ~8,800 steps in 600s.

## Key Innovation: Seed-Regenerated Weights

All weight matrices in the transformer blocks (Q, K, V, O-proj, MLP-up, MLP-down) use **frozen orthogonal random projections** regenerated from deterministic seeds at load time. The artifact stores only:

- **LoRA adapters** (rank-64 A and B matrices): ~3.9 MB at INT8
- **Embedding + control tensors**: ~1.0 MB at FP16
- **N-gram cache** (uint16 counts, LZMA-compressed): ~10.7 MB
- **Code**: ~0.1 MB

The random base weights cost **0 bytes** in the artifact — they are regenerated from an 8-byte seed per matrix via QR-decomposed orthogonal initialization.

### Why Orthogonal (not Gaussian)

Prior work (PR #874) used Gaussian random bases but could not train models deeper than 5 layers — gradients vanish through deep stacks of random projections. Our **orthogonal initialization via QR decomposition** preserves singular values at exactly 1.0, enabling stable training of 11-layer random models (though we use 5 layers here for throughput).

```python
import math
import torch

@staticmethod
def _generate_orthogonal_base(seed, rows, cols):
    # Deterministic CPU generator: the same seed yields the same base
    # at training time and at load time.
    g = torch.Generator(device='cpu')
    g.manual_seed(seed)
    size = max(rows, cols)
    raw = torch.randn(size, size, generator=g)
    Q, _ = torch.linalg.qr(raw)
    # Slicing a square orthogonal matrix keeps the retained rows/columns
    # orthonormal; rescale by 1/sqrt(cols) for unit output variance.
    return Q[:rows, :cols] / math.sqrt(cols)
```
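Given the regenerated base, each layer's effective weight is just the frozen base plus its trained rank-64 LoRA delta. A minimal NumPy sketch of the same load-time reconstruction (function and variable names here are illustrative, not taken from the submission's code):

```python
import numpy as np

def orthogonal_base(seed, rows, cols):
    """Regenerate the frozen orthogonal base from its stored 8-byte seed."""
    rng = np.random.default_rng(seed)
    size = max(rows, cols)
    q, _ = np.linalg.qr(rng.standard_normal((size, size)))
    return q[:rows, :cols] / np.sqrt(cols)

def materialize_weight(seed, lora_a, lora_b):
    """Effective weight = frozen random base + trained low-rank delta B @ A."""
    rows, _ = lora_b.shape
    _, cols = lora_a.shape
    return orthogonal_base(seed, rows, cols) + lora_b @ lora_a

# Only A, B, and the per-matrix seed travel in the artifact; the base is free.
rank, dim = 64, 512
a = np.zeros((rank, dim), dtype=np.float32)
b = np.zeros((dim, rank), dtype=np.float32)
w = materialize_weight(1337, a, b)
```

Because the generator is seeded explicitly, the reconstruction is bit-for-bit deterministic across runs on the same platform.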

### Adapter Quantization: Nearly Lossless

The LoRA adapters are quantized with simple per-row INT8 (no GPTQ needed). The quantization gap is only **+0.003 BPB** — half the degradation of INT6 GPTQ on full weight matrices (+0.006 for the baseline).
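Per-row symmetric INT8 fits in a few lines. The sketch below is a hedged illustration of that scheme, not the submission's exact code (details such as the scale dtype are assumptions):

```python
import numpy as np

def quantize_rows_int8(w):
    """Symmetric per-row INT8: one scale per row, codes in [-127, 127]."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale[scale == 0] = 1.0                      # guard all-zero rows
    codes = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return codes, scale.astype(np.float16)       # fp16 scales keep overhead tiny

def dequantize_rows(codes, scale):
    return codes.astype(np.float32) * scale.astype(np.float32)

w = np.random.default_rng(0).standard_normal((64, 512)).astype(np.float32)
codes, scale = quantize_rows_int8(w)
w_hat = dequantize_rows(codes, scale)
```

The worst-case round-trip error is about half a quantization step per row, which is consistent with the small +0.003 BPB gap reported above.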

## N-gram Cache: Incremental Build During Training

The n-gram cache is built **incrementally during training** at negligible overhead:

```python
# After each training microstep (cost: <1ms per call):
ngram_counter.update_batch_fast(full_seq.cpu().numpy().astype(np.int32))
```

- **Orders**: 2-7 (hash-bucketed count tables)
- **Counts**: uint16, saturating at 65535
- **Total counts**: 31.1 billion (9,912 steps × 524K batch tokens ≈ 5.2B tokens, counted once per each of the 6 orders)
- **Multi-GPU sync**: `dist.all_reduce(SUM)` across 8 GPUs before serialization
- **Compression**: LZMA preset 9 → 10.7 MB
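The counter behind `update_batch_fast` can be sketched as a set of saturating uint16 hash tables, one per order. The class name, bucket count, and hash constants below are illustrative, not the submission's actual values:

```python
import numpy as np

BUCKETS = 1 << 20          # hypothetical per-order table size
MOD = 2_147_483_647        # Mersenne prime modulus for the rolling hash

class NgramCounter:
    """Hash-bucketed uint16 count tables for orders 2-7 (simplified sketch)."""

    def __init__(self, orders=range(2, 8)):
        self.tables = {n: np.zeros(BUCKETS, dtype=np.uint16) for n in orders}

    def update_batch_fast(self, tokens):
        for n, table in self.tables.items():
            if tokens.size < n:
                continue
            # Polynomial hash over every length-n window, vectorized per offset.
            windows = tokens.size - n + 1
            h = np.zeros(windows, dtype=np.int64)
            for k in range(n):
                h = (h * 31 + tokens[k:k + windows]) % MOD
            hits = np.bincount(h % BUCKETS, minlength=BUCKETS)
            # Saturating add: clip at 65535 instead of wrapping around.
            table[:] = np.minimum(table + hits.astype(np.int64), 65535).astype(np.uint16)
```

The per-call cost is one vectorized pass per order, which is why the update stays well under a millisecond relative to a training step.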

At eval time, the cache is **frozen** — no TTT, no eval-time updates. Entropy-adaptive alpha blending:

```
alpha = min(alpha_max, log1p(count) / 10)
P_blend = alpha * P_ngram + (1 - alpha) * P_neural
```
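The blend above is a one-liner in practice; a minimal runnable sketch (the `alpha_max` cap value is a hypothetical default, not stated in this section):

```python
import math

def blend(p_neural, p_ngram, count, alpha_max=0.95):
    """Count-adaptive mixing: trust the n-gram distribution more as the
    context count grows; count = 0 falls back entirely to the neural model."""
    alpha = min(alpha_max, math.log1p(count) / 10.0)
    return [alpha * q + (1.0 - alpha) * p for p, q in zip(p_neural, p_ngram)]

# Unseen context: alpha = 0, so the output equals the neural distribution.
p = blend([0.7, 0.3], [0.1, 0.9], count=0)
```

Because `log1p(count) / 10` only reaches the cap for well-attested contexts, rare n-grams barely perturb the neural prediction.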

### Why Incremental > Pre-fill

We also tested pre-filling the cache from training shards at startup. This was **an order of magnitude worse** (0.996 BPB vs 0.0905) because:

1. Pre-fill consumed 24-33% of the training budget (650-880s for 10 shards)
2. NumPy hash computation on 50M-token shards was catastrophically slow
3. It covered only 10/80 shards, whereas the incremental build sees ALL training tokens

## Architecture

| Parameter | Value |
|-----------|-------|
| Layers | 5 |
| Model dim | 512 |
| Heads / KV heads | 8 / 4 |
| MLP multiplier | 3.0 (hidden=1536) |
| Activation | LeakyReLU(0.5)² |
| Adapter rank | 64 |
| Random init | Orthogonal (QR decomposition) |
| Vocab | 1024 BPE |
| Sequence length | 2048 |

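The LeakyReLU(0.5)² activation in the table (from PR #549) squares a leaky rectification. One common reading, sketched below; the PR's exact form (e.g. whether sign is preserved) may differ:

```python
import numpy as np

def leaky_relu_squared(x, slope=0.5):
    """LeakyReLU(slope) followed by elementwise squaring."""
    return np.where(x > 0, x, slope * x) ** 2
```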
## Training

| Parameter | Value |
|-----------|-------|
| Optimizer (adapters) | Muon (NS5, momentum 0.99) |
| Optimizer (embed/scalar) | AdamW |
| Matrix LR | 0.04 |
| Grad clip norm | 0.1 |
| Weight decay | 0.04 |
| Batch tokens | 524,288 |
| EMA decay | 0.997 |

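Muon's NS5 step orthogonalizes the momentum buffer with five quintic Newton-Schulz iterations. A NumPy sketch using the coefficients from the public Muon reference implementation; the update form is simplified (no Nesterov term, no weight decay) and not the submission's exact code:

```python
import numpy as np

def newton_schulz5(g, steps=5, eps=1e-7):
    """Approximately orthogonalize g: push its singular values toward 1."""
    a, b, c = 3.4445, -4.7750, 2.0315
    x = g / (np.linalg.norm(g) + eps)       # normalize so iteration converges
    tall = x.shape[0] > x.shape[1]
    if tall:
        x = x.T                              # iterate on the short side
    for _ in range(steps):
        s = x @ x.T
        x = a * x + (b * s + c * (s @ s)) @ x
    return x.T if tall else x

def muon_step(w, grad, buf, lr=0.04, momentum=0.99):
    """One simplified Muon update on a weight matrix."""
    buf = momentum * buf + grad
    return w - lr * newton_schulz5(buf), buf
```

The orthogonalized update gives every direction a comparable step size, which is why Muon tolerates the relatively high 0.04 matrix LR used here.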
## Ablation

| Config | BPB | Notes |
|--------|-----|-------|
| Neural only (post-quant) | 1.503 | Adapter INT8, no cache |
| Neural sliding window | 1.474 | stride=64 |
| **Neural + n-gram blend** | **0.0905** | Entropy-adaptive alpha, frozen cache |
| Improvement from cache | -1.413 | vs neural-only |

## Artifact Budget

```
Neural model (INT8 adapters + FP16 embed):  4,401,588 bytes
N-gram cache (uint16 counts, LZMA):        10,692,380 bytes
Total:                                     15,093,968 bytes
Remaining (of the 16,000,000-byte budget):    906,032 bytes
```

## Credits

- PR #874 (@fielding) — Random linear maps concept
- PR #931 (@AnirudhRahul) — Packed n-gram artifact approach
- arXiv:2407.00957 — Expressivity with random weights and learned biases
- PR #549 (@abaybektursun) — LeakyReLU² activation, score-first TTT protocol