Merge discussion-43 validated config to master to avoid re-discovery #243

@dan-y

Discussion #43 produced a well-validated set of improvements (126 experiments, ~10.5h on H100, best val_bpb 0.969686 from baseline 0.997900). These are not yet reflected in master, which means every new agent session starts from the pre-#43 baseline and risks re-discovering the same optimizations from scratch.

As a concrete example: my own independent autoresearch run reproduced the WARMDOWN_RATIO 0.75 finding without knowing it had already been confirmed in #43. That's wasted GPU time.

I tested the full all-in config against the master baseline on the same H100 SXM pod, under identical conditions:

| run | val_bpb | peak_vram_mb | steps |
| --- | --- | --- | --- |
| master | 0.997591 | 45060.2 | 917 |
| discussion-43 all-in | 0.977085 | 50100.2 | 1748 |
| delta | −0.020506 | +5040 | +831 |

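Lower val_bpb is better. The all-in config cuts val_bpb by 0.0205 (about 2.1% relative) and completes nearly twice as many steps under the same conditions. Peak VRAM rises by about 11% (roughly 5 GB), which is well within H100 headroom.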
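
To make the deltas easy to re-check, here is a minimal sketch that recomputes them from the two measured rows above. The dictionary layout and the `deltas` helper are illustrative assumptions and not part of the repo.

```python
# Values copied from the comparison in this issue; nothing here is re-measured.
master = {"val_bpb": 0.997591, "peak_vram_mb": 45060.2, "steps": 917}
all_in = {"val_bpb": 0.977085, "peak_vram_mb": 50100.2, "steps": 1748}


def deltas(base, new):
    """Return {metric: (absolute delta, relative delta)} for every key in base."""
    return {k: (new[k] - base[k], (new[k] - base[k]) / base[k]) for k in base}


report = deltas(master, all_in)
for name, (abs_d, rel_d) in report.items():
    print(f"{name:>13}: {abs_d:+.6f} ({rel_d:+.2%})")
```

The relative columns are where the "about 2.1%" and "about 11%" figures come from.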

Changes vs current master:

Architecture:

  • Depth 8 → 9, aspect ratio tuned to keep dim=512
  • Window pattern SSSLSSSSL (4:1 short:long ratio)
  • Short window seq_len/2 → seq_len/8
  • RoPE base 10K → 200K
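
As a rough illustration of the window settings, the sketch below expands a pattern string into per-layer attention window sizes. It assumes one character per layer, that `L` means the full sequence length, and a hypothetical `seq_len` of 2048; the function name and arguments are made up for this example and may not match how master actually interprets the pattern.

```python
from collections import Counter


def layer_windows(pattern: str, seq_len: int, short_divisor: int = 8) -> list[int]:
    """Map each pattern character to a window size (assumed: one char per layer)."""
    sizes = {"S": seq_len // short_divisor, "L": seq_len}
    return [sizes[c] for c in pattern]


pattern = "SSSLSSSSL"
windows = layer_windows(pattern, seq_len=2048)  # seq_len is a hypothetical value
counts = Counter(pattern)
print(windows)
print(f"short layers: {counts['S']}, long layers: {counts['L']}")
```

Read one character per layer, the nine-character pattern as written gives 7 short and 2 long layers, so it is worth confirming how the pattern is tiled before relying on the quoted ratio.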

Optimization:

  • Batch 524K → 262K tokens (more steps per wall-clock budget)
  • Embedding LR 0.6 → 0.9 (effective with WD)
  • Unembedding LR 0.004 → 0.005
  • Warmdown ratio 0.5 → 0.75
  • Final LR fraction 0.0 → 0.05
  • Muon momentum warmup 300 → 200 steps
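
The warmdown ratio and final LR fraction interact, so a small sketch may help. It assumes a constant phase followed by a linear warmdown over the last `warmdown_ratio` of training, ending at `final_lr_frac` of the peak LR; warmup is ignored, and the function name and the use of a fixed step count are assumptions for illustration only (the actual schedule on master may be shaped differently).

```python
def lr_multiplier(step: int, total_steps: int,
                  warmdown_ratio: float = 0.75,
                  final_lr_frac: float = 0.05) -> float:
    """LR scale factor: 1.0 until warmdown starts, then linear decay to final_lr_frac."""
    warmdown_steps = int(round(warmdown_ratio * total_steps))
    start = total_steps - warmdown_steps
    if step < start:
        return 1.0
    # progress runs from 0.0 at the start of warmdown to 1.0 at the final step
    progress = (step - start) / max(1, warmdown_steps)
    return 1.0 - progress * (1.0 - final_lr_frac)


total = 1748  # step count from the all-in run reported in this issue
for s in (0, 437, 1000, total):
    print(s, round(lr_multiplier(s, total), 4))
```

With these settings the LR stays at its peak for only the first quarter of training and never decays fully to zero.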

Init & regularization:

  • Transformer init scale ×0.68 (narrow optimum: 0.66 and 0.70 both worse)
  • x0 skip scalar init 0.1 → 0.05
  • Weight decay added to lm_head (0.01), embeddings (0.001), value embeddings (0.003)
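
One way to express the per-group weight decay is a lookup from parameter name to decay value, sketched below in plain Python. The name prefixes (`lm_head`, `wte`, `value_embeds`) and the example parameter names are hypothetical and would need to match the real module names; parameters that match no prefix return `None` so the existing optimizer setup stays in charge of them.

```python
from typing import Optional

# Decay values from this issue; the prefixes are hypothetical placeholders.
WEIGHT_DECAY_BY_PREFIX = {
    "lm_head": 0.01,        # unembedding
    "wte": 0.001,           # token embeddings
    "value_embeds": 0.003,  # value embeddings
}


def weight_decay_for(param_name: str) -> Optional[float]:
    """Return the decay for a parameter name, or None if this table does not cover it."""
    for prefix, wd in WEIGHT_DECAY_BY_PREFIX.items():
        if param_name.startswith(prefix):
            return wd
    return None


# Hypothetical parameter names, only to show the lookup.
names = ["lm_head.weight", "wte.weight", "value_embeds.0.weight", "blocks.0.attn.c_q.weight"]
assigned = {n: weight_decay_for(n) for n in names}
print(assigned)
```

Keeping the values in one table makes it easier to see at a glance that only the three groups named above received new decay.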

The precedent for this kind of merge is the nanochat update from March 9 — lock in what works so the next session builds on it rather than rediscovers it.
