Discussion #43 produced a well-validated set of improvements (126 experiments, ~10.5h on H100, best val_bpb 0.969686 from baseline 0.997900). These are not yet reflected in master, which means every new agent session starts from the pre-#43 baseline and risks re-discovering the same optimizations from scratch.
As a concrete example: my own independent autoresearch run reproduced the WARMDOWN_RATIO 0.75 finding without knowing it had already been confirmed in #43. That's wasted GPU time.
I tested the full all-in config on H100 SXM on the same pod (master baseline vs branch, identical conditions):
| run | val_bpb | peak_vram_mb | steps |
| --- | --- | --- | --- |
| master | 0.997591 | 45060.2 | 917 |
| discussion-43 all-in | 0.977085 | 50100.2 | 1748 |
| delta | −0.020506 | +5040 | +831 |
Lower val_bpb is better; the −0.020506 delta is roughly a 2% relative improvement. The VRAM increase is modest (~11%) and well within H100 headroom.
Changes vs current master:
Architecture:
- Depth 8 → 9, aspect ratio tuned to keep dim=512
- Window pattern `SSSL` → `SSSSL` (4:1 short:long ratio)
- Short window `seq_len/2` → `seq_len/8`
- RoPE base 10K → 200K
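To make the window-pattern change concrete, here is a minimal sketch of how layers could be assigned short vs. long attention windows under the new `SSSSL` cycle and `seq_len/8` short window. All names here are illustrative, not the repo's actual identifiers.

```python
def window_sizes(num_layers: int, seq_len: int, pattern: str = "SSSSL") -> list[int]:
    """Assign each layer a short or long attention window by cycling `pattern`.

    'S' = short window (seq_len // 8 after this change; seq_len // 2 on master),
    'L' = full-context window.
    """
    short = seq_len // 8
    return [
        short if pattern[i % len(pattern)] == "S" else seq_len
        for i in range(num_layers)
    ]

# Depth 9 with a 2048-token context: four short layers per long layer.
print(window_sizes(9, 2048))
# → [256, 256, 256, 256, 2048, 256, 256, 256, 256]
```

With depth 9 and a 5-layer cycle, the pattern wraps partway through, so the last four layers are all short; whether the actual repo cycles or pins the long layer elsewhere is an assumption here.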
Optimization:
- Batch 524K → 262K tokens (more steps per wall-clock budget)
- Embedding LR 0.6 → 0.9 (effective with WD)
- Unembedding LR 0.004 → 0.005
- Warmdown ratio 0.5 → 0.75
- Final LR fraction 0.0 → 0.05
- Muon momentum warmup 300 → 200 steps
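The warmdown-ratio and final-LR-fraction changes interact, so a small sketch may help: assuming the schedule is constant-then-linear-decay (the usual nanoGPT-style shape; the repo's exact schedule may differ), the decay now spans the last 75% of steps and lands at 5% of peak LR instead of 0.

```python
def lr_multiplier(step: int, total_steps: int,
                  warmdown_ratio: float = 0.75,
                  final_frac: float = 0.05) -> float:
    """Constant LR, then linear decay over the final `warmdown_ratio`
    of training down to `final_frac` of the peak (master: 0.5 and 0.0)."""
    warmdown_start = int(total_steps * (1 - warmdown_ratio))
    if step < warmdown_start:
        return 1.0
    progress = (step - warmdown_start) / (total_steps - warmdown_start)
    return 1.0 + progress * (final_frac - 1.0)

print(lr_multiplier(0, 1748))     # → 1.0 (constant phase)
print(lr_multiplier(1748, 1748))  # ≈ 0.05 (floor, not zero)
```

Ending at a nonzero floor avoids the degenerate final steps where the LR is effectively zero and no learning happens.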
Init & regularization:
- Transformer init scale ×0.68 (narrow optimum: 0.66 and 0.70 both worse)
- x0 skip scalar init 0.1 → 0.05
- Weight decay added to lm_head (0.01), embeddings (0.001), value embeddings (0.003)
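The full change set can be summarized as a config diff. The key names below are hypothetical stand-ins for whatever flags or env vars the repo actually uses; the values are the ones listed above.

```python
# Illustrative config delta for the discussion-43 all-in merge.
# Key names are hypothetical; only the old/new values come from the results above.
MASTER = dict(
    depth=8, window_pattern="SSSL", short_window_frac=1/2,
    rope_base=10_000, batch_tokens=524_288, emb_lr=0.6,
    unemb_lr=0.004, warmdown_ratio=0.5, final_lr_frac=0.0,
    muon_momentum_warmup_steps=300, init_scale=1.0,
    x0_skip_init=0.1, lm_head_wd=0.0, emb_wd=0.0, value_emb_wd=0.0,
)
PROPOSED = dict(
    MASTER,
    depth=9, window_pattern="SSSSL", short_window_frac=1/8,
    rope_base=200_000, batch_tokens=262_144, emb_lr=0.9,
    unemb_lr=0.005, warmdown_ratio=0.75, final_lr_frac=0.05,
    muon_momentum_warmup_steps=200, init_scale=0.68,
    x0_skip_init=0.05, lm_head_wd=0.01, emb_wd=0.001, value_emb_wd=0.003,
)

changed = {k: (MASTER[k], PROPOSED[k]) for k in MASTER if MASTER[k] != PROPOSED[k]}
for k, (old, new) in sorted(changed.items()):
    print(f"{k}: {old} -> {new}")
```

Expressing the merge this way also makes it easy to diff against whatever lands in master later.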
The precedent for this kind of merge is the nanochat update from March 9 — lock in what works so the next session builds on it rather than rediscovers it.