Auto-generated by `tri experiment export`. Do not edit manually.
| Rank | Directory | Best PPL | Best Loss | Best Step | Max Step | Checkpoints |
|---|---|---|---|---|---|---|
| 1 | data/checkpoints | 2.96 | 1.086 | 100000 | 100000 | 21 |
| 2 | data/checkpoints/real | TBD | TBD | TBD | TBD | 10 |
| 3 | data/checkpoints_v3 | TBD | TBD | TBD | TBD | 10 |
| 4 | data/checkpoints_v13_lamb128 | TBD | TBD | TBD | TBD | 4 |
Run `tri experiment export` to regenerate with actual values from checkpoint headers.
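The PPL and loss columns in the table above are consistent with perplexity being the exponential of the mean per-token cross-entropy loss (in nats); a quick check against the rank-1 row:

```python
import math

def perplexity(mean_ce_loss: float) -> float:
    """Perplexity = exp of the mean per-token cross-entropy loss (nats)."""
    return math.exp(mean_ce_loss)

# Rank-1 row: best loss 1.086 -> best PPL ~= 2.96
print(round(perplexity(1.086), 2))  # -> 2.96
```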
All share: `HSLM_OPTIMIZER=lamb`, `HSLM_LR_SCHEDULE=cosine`, `HSLM_BATCH=66`, `HSLM_WARMUP=2000`
| ID | LR | Context | Steps | Seed | Special |
|---|---|---|---|---|---|
| W6-1 | 5e-4 | 27 | 100K | 61 | LR sweep |
| W6-2 | 7e-4 | 27 | 100K | 62 | LR sweep |
| W6-3 | 1.5e-3 | 27 | 100K | 63 | LR sweep |
| W6-4 | 2e-3 | 27 | 100K | 64 | LR sweep |
| W6-5 | 1e-3 | 9 | 100K | 65 | Short context |
| W6-6 | 1e-3 | 18 | 100K | 66 | Medium context |
| W6-7 | 1e-3 | 54 | 100K | 67 | Long context |
| W6-8 | 1e-3 | 81 | 100K | 68 | Overfitting test |
| W6-9 | 1e-3 | 27 | 100K | 69 | GRAD_ACCUM=1 |
| W6-10 | 1e-3 | 27 | 100K | 70 | GRAD_ACCUM=4 |
| W6-11 | 1e-3 | 27 | 100K | 71 | GRAD_ACCUM=8 |
| W6-12 | 1e-3 | 27 | 100K | 72 | PHI schedule |
| W6-13 | 1e-3 | 27 | 100K | 73 | Restart period=33K |
| W6-14 | 1e-3 | 27 | 100K | 74 | Warmup=5000 |
| W6-15 | 1e-3 | 27 | 100K | 75 | PHI_SCALE=1 |
| W6-16 | 1e-3 | 27 | 100K | 76 | Adaptive sparsity |
| W6-17 | 1e-3 | 27 | 100K | 77 | PHI+PHI_SCALE |
| W6-18 | 1e-3 | 27 | 100K | 78 | Dropout=0.15 |
| W6-19 | 1e-3 | 27 | 200K | 79 | Extended run |
| W6-20 | 1e-3 | 27 | 200K | 80 | PHI extended |
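The shared environment plus a per-run override (here W6-1 from the LR sweep) might be staged like this. Only `HSLM_OPTIMIZER`, `HSLM_LR_SCHEDULE`, `HSLM_BATCH`, and `HSLM_WARMUP` are confirmed variable names; the per-run names (`HSLM_LR`, `HSLM_CONTEXT`, `HSLM_STEPS`, `HSLM_SEED`) are assumptions following the same convention:

```shell
#!/bin/sh
# Shared settings for all Wave 6 runs (from the export above)
export HSLM_OPTIMIZER=lamb
export HSLM_LR_SCHEDULE=cosine
export HSLM_BATCH=66
export HSLM_WARMUP=2000

# Per-run overrides, e.g. W6-1 (variable names assumed, not confirmed)
export HSLM_LR=5e-4
export HSLM_CONTEXT=27
export HSLM_STEPS=100000
export HSLM_SEED=61

echo "W6-1: optimizer=$HSLM_OPTIMIZER lr=$HSLM_LR ctx=$HSLM_CONTEXT"
```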
- R5 KING: PPL=2.96 (LAMB 1e-3, cosine, ctx=27) — best result to date
- Cosine LR schedule essential — flat schedule dies by step 20K
- LAMB optimizer outperforms AdamW for ternary weights
- Context length 27 appears optimal for current architecture
- Batch size 66 with gradient accumulation 2 is default baseline
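The cosine-with-warmup behavior the findings credit (`HSLM_LR_SCHEDULE=cosine`, `HSLM_WARMUP=2000`) can be sketched as below; the linear warmup shape and zero decay floor are assumptions, not confirmed details of the trainer:

```python
import math

def lr_at(step: int, base_lr: float = 1e-3, warmup: int = 2000,
          total_steps: int = 100_000, min_lr: float = 0.0) -> float:
    """Cosine LR decay with linear warmup (sketch; floor/shape assumed)."""
    if step < warmup:
        # Linear warmup from 0 to base_lr over the first `warmup` steps
        return base_lr * step / warmup
    # Cosine decay from base_lr down to min_lr over the remaining steps
    progress = (step - warmup) / (total_steps - warmup)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))

print(lr_at(2000))    # peak LR reached at end of warmup -> 0.001
print(lr_at(100000))  # fully decayed at the final step -> 0.0
```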
| Wave | Date | Focus | Best Result |
|---|---|---|---|
| Wave 2 | 2026-03-12 | Night experiments, optimizer sweep | v4R PPL=125 |
| Wave 3 | 2026-03-12 | STE wiring fix, batch size tuning | v7 avg loss=5.73 |
| Wave 4 | 2026-03-13 | 15 experiments, 3 accounts, all cosine | R5 PPL=2.96 KING |
| Wave 5 | 2026-03-13 | Extended runs, context variations | In progress |
| Wave 6 | 2026-03-13 | 20 R5 KING variations | Planned |