Record: 11-gram Eval Cache + Hedge Mixer (val_bpb: 0.8609) #909
sunnypatneedi wants to merge 26 commits into openai:main
Conversation
Two-phase TTT pipeline (novel combination):
- Phase 1: In-Place TTT — updates MLP output projections per-document (ICLR 2026)
- Phase 2: Per-doc LoRA TTT — adapts Q/V/LM head with surprise gating (top-K tokens)

Architecture: PR openai#486 base (11L, TrigramHash, ValueResidual, GradQuant) + LeakyReLU(0.5)^2 + eval-only XSA on all layers + Partial RoPE + LN Scale

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
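The surprise-gating step in Phase 2 can be sketched as follows — a minimal illustration assuming per-token NLLs are available; `surprise_gate` and its top-K policy are hypothetical names, not the PR's actual helpers:

```python
import numpy as np

def surprise_gate(per_token_nll, k):
    """Mask selecting the k most surprising (highest-NLL) tokens.

    Only these positions contribute to the per-document LoRA TTT loss,
    so the adapter fits what the model got wrong rather than every token.
    """
    k = min(k, len(per_token_nll))
    idx = np.argsort(per_token_nll)[-k:]   # indices of the top-k NLL tokens
    mask = np.zeros_like(per_token_nll)
    mask[idx] = 1.0
    return mask

nll = np.array([0.1, 2.5, 0.3, 4.0, 0.2])
mask = surprise_gate(nll, k=2)             # selects positions 1 and 3
```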
- gptq_calibrate(): collect Hessian H = X^T X via forward hooks on training data
- gptq_quantize_weight(): column-wise int6 with Cholesky error compensation
- _find_best_row_scales(): percentile search for optimal per-row scales
- Integrated into mixed_quantize_int6() — falls back to naive when no Hessian
- Expected: -0.0026 bpb from better quantization alone (PR openai#535 ablation)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
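A minimal sketch of the column-wise GPTQ pass described above, assuming H = X^T X was collected from calibration activations. Per-row abs-max scales stand in for the percentile search, and all names and constants here are illustrative, not the PR's implementation:

```python
import numpy as np

def gptq_quantize(W, H, bits=6, damp=1e-2):
    """Quantize W column by column, pushing each column's rounding
    error onto the not-yet-quantized columns via the inverse Hessian."""
    W = W.astype(np.float64).copy()
    H = H.astype(np.float64).copy()
    n = H.shape[0]
    # damping stabilizes the Cholesky factorization
    H[np.arange(n), np.arange(n)] += damp * np.mean(np.diag(H))
    Hinv = np.linalg.inv(H)
    U = np.linalg.cholesky(Hinv).T        # upper Cholesky factor of H^-1
    qmax = 2 ** (bits - 1) - 1            # int6 -> 31
    scale = np.maximum(np.abs(W).max(axis=1) / qmax, 1e-12)
    Q = np.zeros_like(W)
    for j in range(n):
        q = np.clip(np.round(W[:, j] / scale), -qmax, qmax)
        Q[:, j] = q
        err = (W[:, j] - q * scale) / U[j, j]
        # error compensation: spread residual over remaining columns
        W[:, j:] -= np.outer(err, U[j, j:])
    return Q, scale
```

In PyTorch this would run per linear layer, with H accumulated by forward hooks over calibration batches.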
Bug 1: The function adapted MLP weights but never scored documents, so all compute was wasted — no loss/bpb accumulation. Fix: rewrote as inplace_ttt_eval() with an apply-then-update loop: score each chunk first (accumulating bpb), then gradient-update the MLP projection.

Bug 2: The model was left in the last document's adapted state after the function returned, corrupting the subsequent LoRA TTT evaluation. Fix: reset MLP weights to the originals after all documents.

Also: made In-Place TTT and LoRA TTT alternatives (config switch) rather than sequential phases, since both produce val_bpb scores. Use INPLACE_TTT_ENABLED=1 for In-Place, =0 for LoRA TTT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
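The apply-then-update loop and the weight reset can be sketched with a toy model — `ToyModel`, `score`, and `adapt` are hypothetical stand-ins (squared error plays the role of bpb), not the PR's actual code:

```python
import numpy as np

class ToyModel:
    """Stand-in for the LM; 'w' plays the role of the MLP output projection."""
    def __init__(self, dim, rng):
        self.w = rng.normal(size=dim)

    def score(self, x, y):
        # squared error stands in for per-chunk bpb accumulation
        pred = x @ self.w
        return float(np.mean((pred - y) ** 2)), len(y)

    def adapt(self, x, y, lr):
        # one SGD step on the chunk that was just scored
        grad = 2.0 * x.T @ (x @ self.w - y) / len(y)
        self.w = self.w - lr * grad

def inplace_ttt_eval(model, docs, lr=0.01):
    orig_w = model.w.copy()              # Bug-2 fix: remember original weights
    total, n_total = 0.0, 0
    for doc in docs:
        for x, y in doc:
            loss, n = model.score(x, y)  # Bug-1 fix: score BEFORE updating
            total += loss * n
            n_total += n
            model.adapt(x, y, lr)        # then the in-place test-time update
    model.w = orig_w                     # restore after all documents
    return total / n_total
```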
Run 1 results:
- Artifact 16.35MB (352KB over the 16MB limit) — caused by GradQuant int7
- LoRA TTT took 1572s (2.6x over the 600s budget) — 20 epochs too many
- Pre-quant val_bpb: 1.1757 (46 shards, not the full 80)
- Post-quant sliding window: 1.3569

Fixes:
- GradQuant: top-10% sensitivity stays int6 (not int7)
- TTT epochs: 20 → 5 (should complete in ~400s)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 1 showed:
- Pre-quant val_bpb: 1.1757
- Post-quant sliding window: 1.3569
- Quantization penalty: 0.18 bpb (expected ~0.003)

Root cause: our GPTQ implementation (ported from PR openai#535) produced WORSE quantization than standard per-row int6. The PR openai#486 base doesn't use GPTQ at all. Possible issues: bad Hessian calibration, numerical instability in the Cholesky decomposition, or a name mismatch between hooks and state-dict keys.

Fix: disable GPTQ and revert to the standard quantization path. GPTQ code preserved for future debugging.

Also confirmed: the TTT bpb formula is algebraically correct. The 0.6185 bpb was real (20 epochs = heavy per-doc overfitting).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 0: PR openai#548 UNMODIFIED (1.0865 proven). Reproduce baseline.
Run 1: PR openai#548 + LeakyReLU(0.5)^2 (1-line change). Measure delta.

Following the retro lesson: baseline first, one change at a time. No GPTQ, no In-Place TTT, no XSA, no surprise gating.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 0: PR openai#414 UNMODIFIED (merged SOTA 1.1228, verified 3-seed)
Run 1: PR openai#414 + LeakyReLU(0.5)^2 (1-line change)

Baseline against verified numbers, not claimed scores from open PRs.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Builds on Run 1 (PR openai#414 + LeakyReLU). Adds:
- temperature param to eval_val_sliding (default 1.0, no change)
- After the main eval, sweeps T = {0.95, 0.96, 0.97, 0.98, 0.99}
- PR openai#576 reported T=0.98 gives -0.003 bpb for free

10 lines added over Run 1. Zero training cost.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
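The eval-time sweep costs one extra softmax pass per temperature — a minimal sketch (`bpb_at_temperature` is an illustrative helper, not the PR's code):

```python
import numpy as np

def bpb_at_temperature(logits, targets, T):
    """Cross-entropy in bits per token after scaling logits by 1/T."""
    z = logits / T
    z = z - z.max(axis=-1, keepdims=True)          # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    nll_nats = -logp[np.arange(len(targets)), targets].mean()
    return nll_nats / np.log(2)

rng = np.random.default_rng(0)
logits = rng.normal(size=(128, 256))
targets = rng.integers(0, 256, size=128)
# sweep the temperatures tried in this run (1.0 = unmodified baseline)
sweep = {T: bpb_at_temperature(logits, targets, T)
         for T in (0.95, 0.96, 0.97, 0.98, 0.99, 1.0)}
```

Since the model is frozen, the sweep picks the best T on the same cached logits at zero training cost.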
Builds on Run 2. Changes from the PR openai#414 base:
- MLP expansion: 3.0x → 3.5x (1536 → 1792 hidden, more params)
- Quantization: int6 → int5 (clip_range 31 → 15, fits more params)
- QAT: enabled with threshold 0.5 (early start, matching PR openai#576)
- QAT uses quantile(0.9995) clip instead of row max
- BigramHash: 2048 → 8192 buckets

From PR openai#576's "Train Larger, Quantize Harder" approach (1.1164 bpb). 8 lines changed from Run 2.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
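The quantile clip replaces the row max so a few outlier weights don't inflate the quantization scale — a sketch under the int5 symmetric range (clip_range 15) from the commit; `fake_quant_int5` is an illustrative name:

```python
import numpy as np

def fake_quant_int5(w, q=0.9995):
    """Fake-quantize rows to int5, clipping at quantile(q) of |w| per row
    instead of the row max (outlier-robust scale for QAT)."""
    clip = np.quantile(np.abs(w), q, axis=1, keepdims=True)
    clip = np.maximum(clip, 1e-8)
    scale = clip / 15.0                        # int5 symmetric: levels in [-15, 15]
    wq = np.clip(np.round(w / scale), -15, 15)
    return wq * scale                          # dequantized values seen during QAT
```

During QAT this would sit behind a straight-through estimator so gradients flow through the rounding.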
Template includes:
- README.md with placeholder results table
- submission.json with schema matching existing PRs
- submit.sh helper to collect logs and extract metrics

Fill in after successful runs, rename the folder, PR to upstream.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
In-Place TTT: loss INCREASES (2.63+), 955s+ eval time. Harmful.
GradQuant int5/int6 mix: 34KB over the 16MB limit even without int7.
PR openai#486 baseline reproduced at 1.1249 (within seed variance of 1.1233).
Added lessons 13-16 to CLAUDE.md.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#414 hardcodes `from flash_attn_interface import ...` (FA3/Hopper only). This pod has FA2 but not FA3. Added try/except + SDPA fallback in attention. Applied to all 4 runs (0-3). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Pod has flash_attn 2.8.3 (from flash_attn import flash_attn_func) but NOT flash_attn_interface (FA3/Hopper). Added cascading import. Also keeping SDPA fallback for environments with no flash_attn at all. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Run 0: PR openai#549 UNMODIFIED (merged SOTA 1.1194, verified 3-seed)
Run 1: PR openai#549 + TTT_ENABLED=1 + TTT_LR=0.0005 (2 lines changed)

Both have the FA3 → FA2 → SDPA fallback for non-Hopper GPUs.
Following the retro: one change per run, baseline first. Expected: Run 1 should achieve ~1.094-1.104 (beats the 1.1144 target).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Upgrades TTT from PR openai#549's weak 3-epoch SGD (-0.0025 bpb) to PR openai#481's proven AdamW 30-epoch cosine + per-layer LR recipe (expected -0.01 to -0.025). Changes:
- train_gpt.py: added _ttt_run_phase() + ttt_adapt() + TTT hyperparams
- run_3seeds.sh: added TTT env vars for 3-seed validation
- finalize_submission.py: extracts pre/post TTT metrics from logs
- README.md + submission.json: updated for the TTT-enabled submission

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
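The per-layer LR recipe can be sketched as a multiplier lookup over a cosine schedule. The 3x/0.5x multipliers for mlp.proj/mlp.fc come from the commit; the function names and the exact schedule shape are illustrative assumptions:

```python
import math

# per-layer multipliers from the PR openai#481 recipe (mlp.proj 3x, mlp.fc 0.5x)
LAYER_LR_MULT = {"mlp.proj": 3.0, "mlp.fc": 0.5}

def cosine_lr(base_lr, epoch, total_epochs=30):
    """Cosine decay from base_lr to 0 over the 30-epoch TTT run."""
    return base_lr * 0.5 * (1.0 + math.cos(math.pi * epoch / total_epochs))

def lr_for(param_name, base_lr, epoch, total_epochs=30):
    """LR for one parameter: layer multiplier times the scheduled base LR."""
    mult = next((m for key, m in LAYER_LR_MULT.items() if key in param_name), 1.0)
    return mult * cosine_lr(base_lr, epoch, total_epochs)
```

In PyTorch this maps naturally to one AdamW param group per multiplier, with the scheduler scaling each group's `lr` every epoch.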
Prevents "tensor does not have a device" error when torch.compile tries to recompile after TTT modified model weights. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#549 SOTA base + PR openai#481 AdamW TTT recipe. Replaces weak 3ep SGD TTT with 30ep cosine decay + per-layer LR (mlp.proj 3x, mlp.fc 0.5x). 3-seed mean: 1.0705 (std 0.0009). All artifacts under 16MB. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
PR openai#771 was listed as "0 seeds" in the competition tracker because submission.json was missing the required `seeds` and `track` fields, and used `bytes_total` instead of the expected `artifact_bytes` field.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- train_gpt_v10_safe.py: v9a + Hedge Mixer (multiplicative weights) + add-delta n-gram smoothing, dim=512
- train_gpt_v10_moonshot.py: model_dim=640 (42M params) + adaptive quant (ternary MLP / int4 attn / int6 embed) + Hedge Mixer
- auto_experiment.py: local CPU random search over 20 configs, logs to experiments.jsonl
- submit.sh: packaging and staging script for H100 runs
- PLAN.md: strategy doc with size estimates and run order

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- validate_configs.py: CPU-only artifact size estimator for moonshot configs (no GPU/data needed)
- experiments.jsonl: 20 initial random-search results from auto_experiment.py

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
v10 moonshot: ternary MLP quant + scaled model + hedge mixer + enhanced n-gram
3-seed mean 0.8609 bpb (42→0.8600, 1337→0.8611, 2025→0.8616). All artifacts under 16MB. 11-gram n-gram cache with entropy-adaptive alpha and Hedge Mixer on PR openai#549 base architecture. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Add comprehensive experiment tracking and moonshot submissions
Community Review — 11-gram Eval Cache + Hedge Mixer

BPB: 0.8609 (3-seed, std 0.0008) | Seeds: 3 (42, 1337, 2025) | Artifact: ~15.3–15.9 MB | Compliance: FLAG (same n-gram family bug pattern)

What this does: The base roundtrip neural model scores val_bpb 1.1452. The eval-time pipeline adds a 10-order hashed n-gram cache (orders 2–11, 4M buckets/order, uint32 counts) that blends n-gram probabilities into the neural distribution under an entropy-adaptive alpha, then wraps the neural and n-gram-enhanced experts in an online multiplicative-weights Hedge Mixer (beta=2.0). The ablation in the PR body attributes the full -0.2843 nat improvement to "sliding window + 11-gram + Hedge"; the training model by itself sits at 1.1452.

What I found in the code (
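The online multiplicative-weights step described in the review can be sketched for two experts (neural vs. n-gram-enhanced). `hedge_mix` is an illustrative reconstruction using the review's beta=2.0, not the PR's code:

```python
import numpy as np

def hedge_mix(expert_logp, beta=2.0):
    """Online Hedge over K experts.

    expert_logp: (T, K) log-probability each expert assigned the true token.
    Returns the mixture's log-probability at each step. Weights are updated
    multiplicatively AFTER each prediction, so step t only uses evidence
    from steps < t (a causal online mixer).
    """
    T, K = expert_logp.shape
    logw = np.zeros(K)                 # start from uniform weights
    mixed = np.empty(T)
    for t in range(T):
        w = np.exp(logw - logw.max())
        w /= w.sum()
        mixed[t] = np.log(w @ np.exp(expert_logp[t]))  # mixture prob of truth
        logw += beta * expert_logp[t]                  # reward accurate experts
    return mixed
```

The mixture tracks the better expert over time, which is why the review treats the Hedge layer and the n-gram cache as one combined eval-time gain.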
Questions/flags:
Verdict: COMPLIANCE FLAG — same n-gram family pattern as PR #779 / #770 / #798 / #808 / #825.
Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica:
Reviewed by @MatoTeziTanka — The Agora. CPU gauntlet PASS at head.
…cluster + CT2038 gauntlet provisioned

Reviewed all 20 highest-priority Tier 1 PRs from openai/parameter-golf. Two cluster-level findings:
- N-gram family bug (10 PRs CLOSED + 1 already ruled): full_key = ((ctx_hash ^ (target * primes[k])) & mask) — the target token is hashed into the eval-cache lookup key, ruled illegal by valerio-oai on PR openai#779. Same verbatim pattern in openai#770/openai#798/openai#808/openai#825/openai#786/openai#797/openai#909/openai#940/openai#761 + openai#764 follow-up. Upstream parent: lukacf (openai#659/openai#702/openai#727 — task #5 audit queued).
- Standard SLOT cluster (4 HOLD pending openai#1336, 2 CLOSE): per-window delta+logit_bias optimized N steps against (per_token_nll * mask) where mask = scored positions [s:wlen]. PRs openai#1321/openai#1324/openai#1278/openai#1263 → HOLD; openai#1319/openai#1376 → CLOSE.

Clean MERGE-eligible: openai#1420 (token_hint-only post-fix) and openai#1450 (TMA megakernel triple loop).

Eval-budget gate (openai#915/openai#889 anthony-maio pair): clean ngram code, ~14.9 min ngram stage on 8xH100 SXM. One @0hq ruling on Issue openai#17 unblocks both PRs plus ~30 ngram-cache PRs.

Infrastructure: provisioned CT2038 (proteus-engine, 128 GB RAM, 32 cores) as the dedicated parameter-golf gauntlet host. Installed Triton 3.6.0, deployed cpu_test.py + flash_attn_stub.py. Re-ran the 4 PRs originally skipped due to FA3/Triton blockers — all PASS. Edited 4 GitHub comments via gh api PATCH to add the rerun results. Coverage went from 9/20 to 14/20 fully gauntleted.

Side session handed off via SOW_HF_DATASET_REPUBLISH.md (Scylla 998→1254 fix + SP4096/SP8192/SP12288/SP16384 publish + Cloudflare R2 mirror).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
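For reference, the flagged lookup pattern can be reconstructed as follows. Only the key *shape* matches the quoted expression — the prime and mask constants here are hypothetical. The point of the ruling is that the target token participates in the key, so each candidate token probes its own bucket and the lookup is conditioned on the label being scored:

```python
# Hypothetical constants for illustration only.
primes = [1000003, 10007, 999983]
mask = (1 << 22) - 1                  # ~4M buckets per order

def flagged_key(ctx_hash, target, k):
    # the pattern ruled illegal on PR openai#779: target is part of the key
    return (ctx_hash ^ (target * primes[k])) & mask

def context_only_key(ctx_hash, k):
    # a target-independent key depends on the context alone
    return (ctx_hash * primes[k]) & mask
```

With a context-only key, one lookup yields a distribution over all next tokens; with the flagged key, the cache is indexed per (context, target) pair at scoring time.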
11-gram Eval Cache + Hedge Mixer on PR #549 Base
val_bpb: 0.8609 (3-seed mean, std 0.0008, sliding window stride=64) | ~15.9 MB | 8×H100 SXM
Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
Key Innovation: 11-gram Eval Cache with Entropy-Adaptive Mixing
The n-gram eval cache provides -0.284 bpb — the single largest improvement over the base model. It replaces TTT entirely, freeing the full eval time budget.
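The entropy-adaptive blend can be sketched as follows. The mapping from entropy to alpha (`alpha_max`, `tau`) is a hypothetical stand-in, since the PR's exact schedule isn't quoted here; the idea is to trust the n-gram cache more when the neural model is uncertain:

```python
import numpy as np

def entropy_bits(p):
    """Shannon entropy of a distribution, in bits."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log2(p)).sum())

def blend(neural_p, ngram_p, alpha_max=0.5, tau=3.0):
    """Mix n-gram probabilities in more aggressively at high neural entropy.

    alpha ramps linearly from 0 to alpha_max as entropy reaches tau bits.
    """
    alpha = alpha_max * min(1.0, entropy_bits(neural_p) / tau)
    return (1.0 - alpha) * neural_p + alpha * ngram_p
```

When the neural distribution is sharply peaked (low entropy), alpha shrinks and the cache barely perturbs it; on flat distributions the cache contributes up to `alpha_max` of the mass.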
    order_centers = 3.0 - 0.25 * (matched_order - min_order)

N-gram Protocol
Run Config
    cd /workspace/parameter-golf
    SEED=42 BIGRAM_VOCAB_SIZE=0 VE_DIM=64 GRADQUANT_ENABLED=0 \
      torchrun --standalone --nproc_per_node=8 \
      records/track_10min_16mb/2026-03-26_sunnypatneedi_moonshot/train_gpt.py

All hyperparameters are baked into the script as defaults. Key environment variables:
Timing Budget
Training Architecture (from PR #549 SOTA)
Ablation
Credits