
Non-record: 28 Experiments in 5 Days — What Works, What Fails, and Why Small-Scale Tests Lie#1227

Open
himanshudongre wants to merge 1 commit into openai:main from himanshudongre:nonrecord/28-experiments-scale-deception

Conversation

@himanshudongre

Summary

5 days, 28 controlled experiments, $12 in GPU spend, Mac Mini M4 + single H100. Systematic exploration of architecture, training, quantization, and eval-time techniques for the 16MB language model competition.

The most important finding: small-scale local experiments can be 180 degrees wrong. Our SSM hybrid improved cross-entropy by 18% at dim=192 locally but was 2.7% worse in BPB at dim=512 on the H100. This isn't noise; it's a systematic bias.

What Works

  • RoPE-16 partial (-23.6% BPC)
  • QAT as regularizer (quantized beats float32 by 0.66%)
  • Knowledge distillation (-0.75%)
  • Dirichlet CTW-6 Bayesian n-gram (-5.76%, properly normalized)
  • Score-first TTT 10ep cosine (-0.12% on H100)
  • Entropy-adaptive eval-time mixing
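
To make the first entry concrete, here is a minimal numpy sketch of what "RoPE-16 partial" plausibly means: rotary position embeddings applied to only the first 16 channels of each attention head, with the remaining channels passed through unrotated. The function name, the half/half channel pairing, and the base of 10000 are assumptions for illustration, not the competition code.

```python
import numpy as np

def partial_rope(x: np.ndarray, pos: np.ndarray, rope_dims: int = 16,
                 base: float = 10000.0) -> np.ndarray:
    """Apply rotary embeddings to the first `rope_dims` channels only.

    x:   (seq_len, head_dim) activations for one attention head
    pos: (seq_len,) integer positions
    """
    half = rope_dims // 2
    # Standard RoPE frequency schedule, restricted to the rotated slice.
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = pos[:, None] * inv_freq[None, :]      # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)

    out = x.copy()
    x1, x2 = x[:, :half], x[:, half:rope_dims]     # paired rotated channels
    out[:, :half] = x1 * cos - x2 * sin
    out[:, half:rope_dims] = x1 * sin + x2 * cos
    # Channels >= rope_dims are left untouched: the "partial" part.
    return out
```

The rotation preserves the norm of each rotated channel pair, so positional information is injected without rescaling activations.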

14 Dead Techniques (with numbers)

  • SSM hybrid (+2.7%)
  • JEPA-LM (-0.24%)
  • MoS (+1.7%)
  • Monarch Matrices
  • DenseFormer (+0%)
  • Complementary Training (+2.6%)
  • Residual Lambdas (+0.28%)
  • QK-Norm (-4.5%)
  • Softcapping (+2.1%)
  • LN Scaling (+11.4%)
  • Fixed-Share Mixer (+0%)
  • PAQ Logistic Mixing (BPC=19, fundamentally broken for multi-class)
  • Product Quantization (+292%)
  • Local Linear Prediction (+9.84%)
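
The PAQ failure mode can be illustrated without the full mixer. PAQ's logistic mixing combines binary bit predictors in logit space; applied naively per class to token distributions, each mixed value is an independent sigmoid, so the resulting vector does not sum to 1 and BPC explodes. A hedged sketch (the function names and two-model setup are invented for illustration, not the authors' code):

```python
import numpy as np

def logit(p):
    """Inverse sigmoid, applied elementwise."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def paq_style_mix(model_probs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """PAQ-style logistic mixing applied naively per class.

    model_probs: (n_models, n_classes), each row a normalized distribution
    weights:     (n_models,) mixing weights
    Returns per-class mixed values that need NOT form a distribution.
    """
    mixed_logits = weights @ logit(model_probs)    # (n_classes,)
    return sigmoid(mixed_logits)                   # independent sigmoids

probs = np.array([[0.7, 0.2, 0.1],    # model A: a valid distribution
                  [0.1, 0.8, 0.1]])   # model B: a valid distribution
w = np.array([0.5, 0.5])
mixed = paq_style_mix(probs, w)
# mixed.sum() is not 1: the output is not a normalized distribution,
# so cross-entropy / BPC computed against it is meaningless.
```

This is exactly why the technique works for PAQ's bit-at-a-time setting (one sigmoid per binary decision) but breaks when ported to multi-class token prediction.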

Key Contributions

  1. Scale deception finding — local tests can be 180° wrong
  2. QAT as regularizer — quantized model beats float32 (novel finding)
  3. PAQ logistic mixing failure analysis — important negative result for anyone porting compression techniques
  4. Bayesian n-gram — properly normalized, unlike 93% of n-gram submissions
  5. Distillation within single run — unexplored in competition
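
Contribution 4 hinges on normalization. Below is a minimal sketch of a Dirichlet-smoothed n-gram (shown here as a bigram leaf estimator; the CTW-6 context-tree mixture over orders is omitted) whose posterior-mean predictions always sum to 1, including for unseen contexts. The class name, alpha, and vocab size are assumptions:

```python
from collections import defaultdict

class DirichletNgram:
    """Bigram model with a symmetric Dirichlet(alpha) prior.

    prob() returns the exact posterior mean (count + alpha) / (total + alpha*V),
    so the predictive distribution is always properly normalized.
    """
    def __init__(self, vocab_size: int, alpha: float = 0.5):
        self.vocab_size = vocab_size
        self.alpha = alpha
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def update(self, context: int, token: int) -> None:
        self.counts[context][token] += 1
        self.totals[context] += 1

    def prob(self, context: int, token: int) -> float:
        num = self.counts[context][token] + self.alpha
        den = self.totals[context] + self.alpha * self.vocab_size
        return num / den

model = DirichletNgram(vocab_size=4)
for ctx, tok in [(0, 1), (0, 1), (0, 2), (1, 3)]:
    model.update(ctx, tok)
dist = [model.prob(0, t) for t in range(4)]  # sums to 1 by construction
```

An unseen context degrades gracefully to the uniform prior rather than to an unnormalized or zero vector, which is the property ad-hoc smoothing schemes typically lose.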

See full README for deep dives, reproduction code, and practical takeaways.

Hardware

  • Mac Mini M4, 16GB (local experiments)
  • Single H100 80GB, RunPod (GPU validation)
  • Total spend: ~$12

Note

We applied for the development grant on March 27 and are still awaiting approval. Several promising techniques need 8xH100 validation. @valerio-oai @0hq: we would appreciate a review of the grant application when possible.

