Skip to content

feat: Non-record 11L PR940 Stack (no n-gram in use) + 20k Steps + Legal TTT (1.0929 BPB)#1232

Open
Christopher-Lee-McClendon wants to merge 1 commit intoopenai:mainfrom
Christopher-Lee-McClendon:submission/11L-pr940-20k-scaling-legal-ttt
Open

feat: Non-record 11L PR940 Stack (no n-gram in use) + 20k Steps + Legal TTT (1.0929 BPB)#1232
Christopher-Lee-McClendon wants to merge 1 commit intoopenai:mainfrom
Christopher-Lee-McClendon:submission/11L-pr940-20k-scaling-legal-ttt

Conversation

@Christopher-Lee-McClendon
Copy link
Copy Markdown

@Christopher-Lee-McClendon Christopher-Lee-McClendon commented Apr 1, 2026

  • Scaling study: PR940 architecture at 20k steps achieves 1.0929 BPB (legal TTT)
  • Improves on prior GEPA 20k (1.0983 BPB) by -0.0054 BPB
  • FlowRefiner variant at 1.0928 BPB confirms auxiliary flow head is neutral
  • Trained on 1xA100-40GB, ~10.7h per run
  • Artifact: 14,473,337 bytes (base), 14,635,871 bytes (flow)
  • Includes scaling trajectory 7k-20k steps and comparison table

val_bpb = 1.0929 (base) / 1.0928 (flow) | Pre-TTT: 1.1005 / 1.1000 | Artifact: 14.47 MB / 14.64 MB

Non-record unlimited-compute submission (trained on 1×A100-40GB PCIe, ~10.7h per run).


Headline Result

Extending the PR #940 architecture stack to 20,000 steps (8,000 peak-LR + 12,000 warmdown) achieves 1.0929 BPB with legal score-first TTT — improving on our prior GEPA 20k submission (1.0983 BPB) by −0.0054 BPB. This improvement comes entirely from architectural upgrades (gated attention, value residual, all-layer XSA, LeakyReLU²) introduced in the PR #549→PR #940 evolution, applied at the same 20k training scale.

Two configurations were trained:

  1. Base (no auxiliary heads): 27,137,223 params → 1.0929 BPB with legal TTT
  2. FlowRefiner (lightweight flow module): 27,235,848 params → 1.0928 BPB with legal TTT

FlowRefiner adds 98,625 parameters and provides negligible benefit at 20k steps (−0.0005 BPB no-TTT, −0.0001 BPB with TTT) — the auxiliary flow head is essentially neutral at this training budget.


Comparison with Prior 20k Submission

GEPA 20k (prior work) PR940 Base 20k (this work) Δ
Legal TTT BPB 1.0983 1.0929 −0.0054
No-TTT BPB 1.1005
TTT gain −0.0170 −0.0076
Float base (step 20k) 1.1153 1.1062 −0.0091
Parameters 27,030,107 27,137,223 +107,116
Total submission size 14,985,742 B 14,473,337 B −512,405 B
Compression zstd-22 zstd-16
Hardware 4×A100-40GB 1×A100-40GB −3 GPUs
Training time ~10.6h ~10.7h comparable
XSA layers Last 4 All 11 +7 layers
Activation ReLU² LeakyReLU(0.5)²
BigramHash 2048×128 4096×128 2× buckets
Gated attention No Yes new
Value residual No Yes new

The prior GEPA 20k submission achieved a larger TTT gain (−0.017 vs −0.008) because its weaker float base left more room for test-time adaptation. The PR940 stack's stronger float base (1.1062 vs 1.1153) means TTT has less to correct — but the net result is still 0.005 BPB better.

Note: The new submission produces a smaller artifact despite using weaker compression (zstd-16 vs zstd-22). This is due to the PR940 architecture producing better-conditioned weight matrices that compress more efficiently.


Scaling Study: 7k → 20k Steps

Training trajectory showing the warmdown phase (steps 8,000–20,000) is the primary driver of improvement:

Step Base val_bpb Flow val_bpb Phase
7,000 1.2064 1.2065 Peak LR
8,000 (warmdown start) 1.2016 1.2022 ← warmdown begins
10,000 1.1898 1.1907 Warmdown
12,000 1.1801 1.1805 Warmdown
14,000 1.1658 1.1666 Warmdown
16,000 1.1511 1.1516 Warmdown
18,000 1.1307 1.1309 Warmdown
20,000 1.1062 1.1062 End

Key observations:

  • The peak-LR plateau (steps 1–8k) saturates around 1.20 BPB
  • The warmdown phase (steps 8k–20k) drives the model from 1.20 → 1.11, a gain of −0.094 BPB
  • Base and Flow track within 0.001 BPB throughout training — the FlowRefiner does not diverge at longer schedules
  • Diminishing returns: ~7.8 mbpb/kstep from step 8k→14k, ~4.9 mbpb/kstep from step 14k→20k

Quantized Evaluation Summary

Configuration Params No TTT (BPB) Legal TTT (BPB) TTT Gain Artifact
Base 20k 27,137,223 1.10050 1.09292 −0.00758 14,473,337 B
Flow 20k 27,235,848 1.10002 1.09279 −0.00724 14,635,871 B
Δ (Flow − Base) +98,625 −0.00048 −0.00014 +162,534 B

Architecture Summary

Component Configuration
Layers 11
Embedding dim 512
Heads 8 query, 4 KV (GQA)
MLP 3× expansion (1536), LeakyReLU(0.5)²
Vocab 1024 (SentencePiece BPE)
Sequence length 2048
BigramHash 4096 buckets, 128-dim
RoPE Partial 16/64, base 10000
LN Scale Depth-scaled 1/√(layer+1)
XSA All 11 layers
Value residual Yes
Gated attention Yes (QK gain init 1.5)
Logit softcap 30.0
SmearGate Yes
Tied embeddings Yes
EMA decay 0.997

FlowRefiner (supplementary config only)

  • 98,625 additional parameters
  • Lightweight logit correction network trained jointly with AR objective
  • FLOW_ENABLED=1 environment variable

Training Details

Setting Value
Hardware 1×A100-40GB PCIe
Steps 20,000
Peak LR phase Steps 0–8,000
Warmdown Cosine steps 8,000–20,000 (12,000 steps, 60%)
Warmup 20 steps
Batch size 786,432 tokens
Matrix LR (Muon) 0.025
Scalar LR (Adam) 0.025
Embed LR 0.035
Weight decay 0.04
Grad clip 0.3
Muon momentum 0.99
EMA decay 0.997
Step avg time ~1.92s (base), ~1.96s (flow)
Total train time ~10.7h (base), ~10.9h (flow)

Quantization Details

Setting Value
Method Int6 per-row with GPTQ-lite clip search
Compression zstd-16
Embedding quant Int6
Mixed quant Auto int5 fallback if needed
Base artifact 14,473,337 bytes (14.47 MB)
Flow artifact 14,635,871 bytes (14.64 MB)
Budget headroom 1.53 MB / 1.36 MB

TTT (Test-Time Training) Details

Setting Value
Protocol Legal score-first (evaluate before training)
Optimizer SGD with momentum 0.9
Learning rate 0.002
Epochs 10 per chunk
Chunk size 32,768 tokens
Frozen blocks First 2
Grad clip 1.0
Stride 64
Eval time ~2.0h (base TTT), ~0.5h (no-TTT)

SLURM Job Provenance

Run Job ID Description
Base 20k train 55364163 slurm_pr940_base_20k_ttt.sh
Flow 20k train 55364164 slurm_pr940_flow_20k_ttt.sh
Base 20k eval (no TTT) 55372104 eval_base20k_nottt
Base 20k eval (legal TTT) 55372106 eval_base20k_legal_ttt
Flow 20k eval (no TTT) 55372105 eval_flow20k_nottt
Flow 20k eval (legal TTT) 55372109 eval_flow20k_legal_ttt

Training script: train_gpt_pr940.py (2601 lines), environment variables control all configuration.


Credits

Base architecture and gated attention/value residual (PR #940/#549, @abaybektursun), Muon optimizer (baseline), BigramHash/SmearGate (PR #65, @aquariouserworkman), XSA (PR #187/#265, @Idan3011/@unnir), mixed quant (PR #76), sliding window eval (PR #50, @mattqlf), legal score-first TTT (PR #77, @samacqua), VE/PartialRoPE/LN Scale (PR #315/#374, @jfprincz/@unnir), EMA (PR #65, @aquariouserworkman), LeakyReLU² (PR #549, @abaybektursun), GEPA 20k prior work (@mcclec07), FlowRefiner (PR #1170, @mcclec07), scaling study and this submission (@mcclec07).

- Scaling study: PR940 architecture at 20k steps achieves 1.0929 BPB (legal TTT)
- Improves on prior GEPA 20k (1.0983 BPB) by -0.0054 BPB
- FlowRefiner variant at 1.0928 BPB confirms auxiliary flow head is neutral
- Trained on 1xA100-40GB, ~10.7h per run
- Artifact: 14,473,337 bytes (base), 14,635,871 bytes (flow)
- Includes scaling trajectory 7k-20k steps and comparison table
@Christopher-Lee-McClendon Christopher-Lee-McClendon changed the title feat: Non-record 11L PR940 Stack + 20k Steps + Legal TTT (1.0929 BPB) feat: Non-record 11L PR940 Stack (no n-gram in use) + 20k Steps + Legal TTT (1.0929 BPB) Apr 1, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant