Discussion #43 produced a well-validated set of improvements (126 experiments, ~10.5h on H100, best val_bpb 0.969686 from baseline 0.997900). These are not yet reflected in master, which means every new agent session starts from the pre-#43 baseline and risks re-discovering the same optimizations from scratch.
As a concrete example: my own independent autoresearch run reproduced the WARMDOWN_RATIO 0.75 finding without knowing it had already been confirmed in #43. That's wasted GPU time.
I tested the full all-in config on H100 SXM on the same pod (master baseline vs branch, identical conditions):
| run | val_bpb | peak_vram_mb | steps |
| --- | --- | --- | --- |
| master | 0.997591 | 45060.2 | 917 |
| discussion-43 all-in | 0.977085 | 50100.2 | 1748 |
| delta | −0.020506 | +5040 | +831 |
Lower val_bpb is better; the −0.020506 delta is roughly a 2% relative improvement. The VRAM increase is modest (~11%) and well within H100 headroom.
Changes vs current master:
Architecture:
- Depth 8 → 9, aspect ratio tuned to keep dim=512
- Window pattern `SSSL` → `SSSSL` (4:1 short:long ratio)
- Short window `seq_len/2` → `seq_len/8`
- RoPE base 10K → 200K
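To make the window-pattern change concrete, here is a minimal sketch of how layers could be assigned short vs. long attention windows under the new `SSSSL` cycle and `seq_len/8` short window. All names here are illustrative, not the repo's actual identifiers.

```python
def window_sizes(num_layers: int, seq_len: int, pattern: str = "SSSSL") -> list[int]:
    """Assign each layer a short or long attention window by cycling `pattern`.

    'S' = short window (seq_len // 8 after this change; seq_len // 2 on master),
    'L' = full-context window.
    """
    short = seq_len // 8
    return [
        short if pattern[i % len(pattern)] == "S" else seq_len
        for i in range(num_layers)
    ]

# Depth 9 with a 2048-token context: four short layers per long layer.
print(window_sizes(9, 2048))
# → [256, 256, 256, 256, 2048, 256, 256, 256, 256]
```

With depth 9 and a 5-layer cycle, the pattern wraps partway through, so the last four layers are all short; whether the actual repo cycles or pins the long layer elsewhere is an assumption here.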
Optimization:
- Batch 524K → 262K tokens (more steps per wall-clock budget)
- Embedding LR 0.6 → 0.9 (effective with WD)
- Unembedding LR 0.004 → 0.005
- Warmdown ratio 0.5 → 0.75
- Final LR fraction 0.0 → 0.05
- Muon momentum warmup 300 → 200 steps
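The warmdown-ratio and final-LR-fraction changes interact, so a small sketch may help: assuming the schedule is constant-then-linear-decay (the usual nanoGPT-style shape; the repo's exact schedule may differ), the decay now spans the last 75% of steps and lands at 5% of peak LR instead of 0.

```python
def lr_multiplier(step: int, total_steps: int,
                  warmdown_ratio: float = 0.75,
                  final_frac: float = 0.05) -> float:
    """Constant LR, then linear decay over the final `warmdown_ratio`
    of training down to `final_frac` of the peak (master: 0.5 and 0.0)."""
    warmdown_start = int(total_steps * (1 - warmdown_ratio))
    if step < warmdown_start:
        return 1.0
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    return 1.0 + progress * (final_frac - 1.0)

print(lr_multiplier(0, 1748))     # → 1.0 (constant phase)
print(lr_multiplier(1748, 1748))  # ≈ 0.05 (floor, not zero)
```

Ending at a nonzero floor avoids the degenerate final steps where the LR is effectively zero and no learning happens.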
Init & regularization:
- Transformer init scale ×0.68 (narrow optimum: 0.66 and 0.70 both worse)
- x0 skip scalar init 0.1 → 0.05
- Weight decay added to lm_head (0.01), embeddings (0.001), value embeddings (0.003)
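The full change set can be summarized as a config diff. The key names below are hypothetical stand-ins for whatever flags or env vars the repo actually uses; the values are the ones listed above.

```python
# Illustrative config delta for the discussion-43 all-in merge.
# Key names are hypothetical; only the old/new values come from the results above.
MASTER = dict(
    depth=8, window_pattern="SSSL", short_window_frac=1/2,
    rope_base=10_000, batch_tokens=524_288, emb_lr=0.6,
    unemb_lr=0.004, warmdown_ratio=0.5, final_lr_frac=0.0,
    muon_momentum_warmup_steps=300, init_scale=1.0,
    x0_skip_init=0.1, lm_head_wd=0.0, emb_wd=0.0, value_emb_wd=0.0,
)
PROPOSED = dict(
    MASTER,
    depth=9, window_pattern="SSSSL", short_window_frac=1/8,
    rope_base=200_000, batch_tokens=262_144, emb_lr=0.9,
    unemb_lr=0.005, warmdown_ratio=0.75, final_lr_frac=0.05,
    muon_momentum_warmup_steps=200, init_scale=0.68,
    x0_skip_init=0.05, lm_head_wd=0.01, emb_wd=0.001, value_emb_wd=0.003,
)

changed = {k: (MASTER[k], PROPOSED[k]) for k in MASTER if MASTER[k] != PROPOSED[k]}
for k, (old, new) in sorted(changed.items()):
    print(f"{k}: {old} -> {new}")
```

Expressing the merge this way also makes it easy to diff against whatever lands in master later.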
The precedent for this kind of merge is the nanochat update from March 9 — lock in what works so the next session builds on it rather than rediscovers it.