feat: Non-record 11L PR940 Stack (no n-gram in use) + 20k Steps + Legal TTT (1.0929 BPB) #1232
Status: Open. Christopher-Lee-McClendon wants to merge 1 commit into openai:main
- Scaling study: PR940 architecture at 20k steps achieves 1.0929 BPB (legal TTT)
- Improves on prior GEPA 20k (1.0983 BPB) by -0.0054 BPB
- FlowRefiner variant at 1.0928 BPB confirms the auxiliary flow head is neutral
- Trained on 1xA100-40GB, ~10.7h per run
- Artifacts: 14,473,337 bytes (base), 14,635,871 bytes (flow)
- Includes scaling trajectory (7k-20k steps) and comparison table
val_bpb = 1.0929 (base) / 1.0928 (flow) | Pre-TTT: 1.1005 / 1.1000 | Artifact: 14.47 MB / 14.64 MB
Headline Result
Extending the PR #940 architecture stack to 20,000 steps (8,000 peak-LR + 12,000 warmdown) achieves 1.0929 BPB with legal score-first TTT — improving on our prior GEPA 20k submission (1.0983 BPB) by −0.0054 BPB. This improvement comes entirely from architectural upgrades (gated attention, value residual, all-layer XSA, LeakyReLU²) introduced in the PR #549→PR #940 evolution, applied at the same 20k training scale.
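The 20,000-step split (8,000 steps at peak LR, then a 12,000-step warmdown) can be sketched as a simple schedule function. Only the step counts come from the PR text; the peak LR value and the linear warmdown shape here are illustrative assumptions.

```python
def lr_at(step: int,
          peak_lr: float = 1e-3,      # assumed peak value, not stated in the PR
          peak_steps: int = 8_000,    # steps held at peak LR
          warmdown_steps: int = 12_000) -> float:
    """Constant peak phase followed by a warmdown to zero (linear shape assumed)."""
    if step < peak_steps:
        return peak_lr
    frac = (step - peak_steps) / warmdown_steps
    return peak_lr * max(0.0, 1.0 - frac)

print(lr_at(0), lr_at(8_000), lr_at(14_000), lr_at(20_000))
```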
Two configurations were trained: the base PR940 stack (1.0929 BPB) and a FlowRefiner variant (1.0928 BPB).
FlowRefiner adds 98,625 parameters and provides negligible benefit at 20k steps (−0.0005 BPB no-TTT, −0.0001 BPB with TTT) — the auxiliary flow head is essentially neutral at this training budget.
Comparison with Prior 20k Submission
The prior GEPA 20k submission achieved a larger TTT gain (−0.017 vs −0.008) because its weaker float base left more room for test-time adaptation. The PR940 stack's stronger float base (1.1062 vs 1.1153) means TTT has less to correct — but the net result is still 0.005 BPB better.
Note: The new submission produces a smaller artifact despite using weaker compression (zstd-16 vs zstd-22). This is due to the PR940 architecture producing better-conditioned weight matrices that compress more efficiently.
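The claim above is that weight bytes with more regular structure compress smaller even at a lower compression level. This is not the submission's pipeline, just a toy demonstration of that effect, using stdlib zlib as a stand-in for zstd: a quantized buffer whose values are tightly concentrated (as with better-conditioned matrices) compresses smaller than a heavy-tailed one.

```python
import random
import zlib

random.seed(0)
n = 65_536

# Two synthetic "quantized weight" buffers: one concentrated, one heavy-tailed.
concentrated = bytes(min(255, max(0, int(random.gauss(128, 4)))) for _ in range(n))
heavy_tailed = bytes(min(255, max(0, int(random.gauss(128, 60)))) for _ in range(n))

small = len(zlib.compress(concentrated, 9))
large = len(zlib.compress(heavy_tailed, 9))
print(small < large)  # prints True: the concentrated buffer compresses smaller
```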
Scaling Study: 7k → 20k Steps
The training trajectory shows that the warmdown phase (steps 8,000–20,000) is the primary driver of improvement:
Key observations:
Quantized Evaluation Summary
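All headline numbers are bits per byte (BPB). As a hedged sketch (assuming the eval loop reports mean cross-entropy in nats per token), the conversion is:

```python
import math

def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) to bits per byte (BPB)."""
    total_bits = mean_loss_nats * n_tokens / math.log(2)
    return total_bits / n_bytes

# For a byte-level model (one token per byte), BPB is just loss / ln 2.
print(round(bits_per_byte(0.7575, 10**6, 10**6), 4))
```

The tokens-per-byte ratio is what makes BPB comparable across tokenizers.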
Architecture Summary
- LN scale: 1/√(layer+1)
- FlowRefiner (supplementary config only)
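The 1/√(layer+1) per-layer normalization scale can be sketched as follows; whether the layer index is 0- or 1-based is not stated here, so 0-indexing is an assumption.

```python
import math

def ln_scale(layer_index: int) -> float:
    """Per-layer output scale of 1/sqrt(layer+1); 0-indexing assumed."""
    return 1.0 / math.sqrt(layer_index + 1)

# The title's 11L stack: scale decays from 1.0 toward 1/sqrt(11).
print([round(ln_scale(i), 4) for i in range(11)])
```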
Training Details
Quantization Details
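The stack uses mixed quantization (PR #76, per the credits), whose exact scheme is not reproduced here. As a hedged illustration of the basic building block only, here is a plain symmetric per-tensor int8 quantize/dequantize round-trip, not the mixed scheme itself:

```python
def quantize_int8(weights):
    """Symmetric per-tensor int8 quantization: w ≈ scale * q, q in [-127, 127]."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [round(w / scale) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

w = [0.5, -1.27, 0.01, 0.0]
q, s = quantize_int8(w)
w_hat = dequantize(q, s)
print(q, round(s, 6))  # max round-trip error is bounded by scale / 2
```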
TTT (Test-Time Training) Details
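A plausible reading of "legal score-first TTT" is that each evaluation chunk is scored under the current model *before* the model adapts on it, so no byte ever influences its own score. The sketch below illustrates that ordering with a toy Laplace-smoothed byte-frequency model standing in for gradient-based TTT; it is a shape-only illustration, not the submission's procedure.

```python
import math
from collections import Counter

def score_first_ttt(data: bytes, chunk: int = 256) -> float:
    """Toy 'legal' TTT loop: score each chunk, THEN adapt on it.
    Returns bits per byte under a Laplace-smoothed byte-unigram model."""
    counts = Counter(range(256))  # one pseudo-count per byte value
    total = 256
    bits = 0.0
    for start in range(0, len(data), chunk):
        block = data[start:start + chunk]
        for b in block:                       # score under the pre-update model
            bits += -math.log2(counts[b] / total)
        counts.update(block)                  # adapt only after scoring
        total += len(block)
    return bits / len(data)

# A repetitive stream adapts well below the 8.0 BPB of the uniform prior.
print(round(score_first_ttt(b"abab" * 1000), 3))
```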
SLURM Job Provenance
SLURM scripts:
- slurm_pr940_base_20k_ttt.sh
- slurm_pr940_flow_20k_ttt.sh
Eval jobs:
- eval_base20k_nottt
- eval_base20k_legal_ttt
- eval_flow20k_nottt
- eval_flow20k_legal_ttt
Training script: train_gpt_pr940.py (2,601 lines); environment variables control all configuration.
Credits
- Base architecture and gated attention/value residual: PR #940/#549 (@abaybektursun)
- Muon optimizer: baseline
- BigramHash/SmearGate: PR #65 (@aquariouserworkman)
- XSA: PR #187/#265 (@Idan3011/@unnir)
- Mixed quant: PR #76
- Sliding window eval: PR #50 (@mattqlf)
- Legal score-first TTT: PR #77 (@samacqua)
- VE/PartialRoPE/LN Scale: PR #315/#374 (@jfprincz/@unnir)
- EMA: PR #65 (@aquariouserworkman)
- LeakyReLU²: PR #549 (@abaybektursun)
- GEPA 20k prior work: @mcclec07
- FlowRefiner: PR #1170 (@mcclec07)
- Scaling study and this submission: @mcclec07