
Non-record: 28 Experiments in 5 Days — What Works, What Fails, and Why Small-Scale Tests Lie#1227

Open
himanshudongre wants to merge 1 commit into openai:main from himanshudongre:nonrecord/28-experiments-scale-deception

Conversation

@himanshudongre

Summary

5 days, 28 controlled experiments, $12 in GPU spend, Mac Mini M4 + single H100. Systematic exploration of architecture, training, quantization, and eval-time techniques for the 16MB language model competition.

The most important finding: small-scale local experiments can be 180 degrees wrong. Our SSM hybrid improved cross-entropy by 18% at dim=192 locally but was 2.7% worse in BPB at dim=512 on the H100. This isn't noise; it's a systematic bias.

What Works

  • RoPE-16 partial (-23.6% BPC)
  • QAT as regularizer (quantized beats float32 by 0.66%)
  • Knowledge distillation (-0.75%)
  • Dirichlet CTW-6 Bayesian n-gram (-5.76%, properly normalized)
  • Score-first TTT 10ep cosine (-0.12% on H100)
  • Entropy-adaptive eval-time mixing
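
To make the first entry concrete, here is a minimal numpy sketch of what "RoPE-16 partial" plausibly means: rotary position embeddings applied to only the first 16 channels of each attention head, with the remaining channels passed through unrotated. The function name, the half/half channel pairing, and the base of 10000 are assumptions for illustration, not the competition code.

```python
import numpy as np

def partial_rope(x: np.ndarray, pos: np.ndarray, rope_dims: int = 16,
                 base: float = 10000.0) -> np.ndarray:
    """Apply rotary embeddings to the first `rope_dims` channels only.

    x:   (seq_len, head_dim) activations for one attention head
    pos: (seq_len,) integer positions
    """
    half = rope_dims // 2
    # Standard RoPE frequency schedule, restricted to the rotated slice.
    inv_freq = 1.0 / (base ** (np.arange(half) / half))
    angles = pos[:, None] * inv_freq[None, :]      # (seq_len, half)
    cos, sin = np.cos(angles), np.sin(angles)

    out = x.copy()
    x1, x2 = x[:, :half], x[:, half:rope_dims]     # paired rotated channels
    out[:, :half] = x1 * cos - x2 * sin
    out[:, half:rope_dims] = x1 * sin + x2 * cos
    # Channels >= rope_dims are left untouched: the "partial" part.
    return out
```

The rotation preserves the norm of each rotated channel pair, so positional information is injected without rescaling activations.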

14 Dead Techniques (with numbers)

  • SSM hybrid (+2.7%)
  • JEPA-LM (-0.24%)
  • MoS (+1.7%)
  • Monarch Matrices
  • DenseFormer (+0%)
  • Complementary Training (+2.6%)
  • Residual Lambdas (+0.28%)
  • QK-Norm (-4.5%)
  • Softcapping (+2.1%)
  • LN Scaling (+11.4%)
  • Fixed-Share Mixer (+0%)
  • PAQ Logistic Mixing (BPC=19, fundamentally broken for multi-class)
  • Product Quantization (+292%)
  • Local Linear Prediction (+9.84%)
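
The PAQ failure mode can be illustrated without the full mixer. PAQ's logistic mixing combines binary bit predictors in logit space; applied naively per class to token distributions, each mixed value is an independent sigmoid, so the resulting vector does not sum to 1 and BPC explodes. A hedged sketch (the function names and two-model setup are invented for illustration, not the authors' code):

```python
import numpy as np

def logit(p):
    """Inverse sigmoid, applied elementwise."""
    return np.log(p / (1.0 - p))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def paq_style_mix(model_probs: np.ndarray, weights: np.ndarray) -> np.ndarray:
    """PAQ-style logistic mixing applied naively per class.

    model_probs: (n_models, n_classes), each row a normalized distribution
    weights:     (n_models,) mixing weights
    Returns per-class mixed values that need NOT form a distribution.
    """
    mixed_logits = weights @ logit(model_probs)    # (n_classes,)
    return sigmoid(mixed_logits)                   # independent sigmoids

probs = np.array([[0.7, 0.2, 0.1],    # model A: a valid distribution
                  [0.1, 0.8, 0.1]])   # model B: a valid distribution
w = np.array([0.5, 0.5])
mixed = paq_style_mix(probs, w)
# mixed.sum() is not 1: the output is not a normalized distribution,
# so cross-entropy / BPC computed against it is meaningless.
```

This is exactly why the technique works for PAQ's bit-at-a-time setting (one sigmoid per binary decision) but breaks when ported to multi-class token prediction.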

Key Contributions

  1. Scale deception finding — local tests can be 180° wrong
  2. QAT as regularizer — quantized model beats float32 (novel finding)
  3. PAQ logistic mixing failure analysis — important negative result for anyone porting compression techniques
  4. Bayesian n-gram — properly normalized, unlike 93% of n-gram submissions
  5. Distillation within single run — unexplored in competition
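
Contribution 4 hinges on normalization. Below is a minimal sketch of a Dirichlet-smoothed n-gram (shown here as a bigram leaf estimator; the CTW-6 context-tree mixture over orders is omitted) whose posterior-mean predictions always sum to 1, including for unseen contexts. The class name, alpha, and vocab size are assumptions:

```python
from collections import defaultdict

class DirichletNgram:
    """Bigram model with a symmetric Dirichlet(alpha) prior.

    prob() returns the exact posterior mean (count + alpha) / (total + alpha*V),
    so the predictive distribution is always properly normalized.
    """
    def __init__(self, vocab_size: int, alpha: float = 0.5):
        self.vocab_size = vocab_size
        self.alpha = alpha
        self.counts = defaultdict(lambda: defaultdict(int))
        self.totals = defaultdict(int)

    def update(self, context: int, token: int) -> None:
        self.counts[context][token] += 1
        self.totals[context] += 1

    def prob(self, context: int, token: int) -> float:
        num = self.counts[context][token] + self.alpha
        den = self.totals[context] + self.alpha * self.vocab_size
        return num / den

model = DirichletNgram(vocab_size=4)
for ctx, tok in [(0, 1), (0, 1), (0, 2), (1, 3)]:
    model.update(ctx, tok)
dist = [model.prob(0, t) for t in range(4)]  # sums to 1 by construction
```

An unseen context degrades gracefully to the uniform prior rather than to an unnormalized or zero vector, which is the property ad-hoc smoothing schemes typically lose.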

See full README for deep dives, reproduction code, and practical takeaways.

Hardware

  • Mac Mini M4, 16GB (local experiments)
  • Single H100 80GB, RunPod (GPU validation)
  • Total spend: ~$12

Note

We applied for the development grant on March 27 and are still awaiting approval. Several promising techniques need 8xH100 validation. @valerio-oai @0hq: we would appreciate a review of the grant application when possible.

