Non-record: Extended Compute Scaling Analysis (50K steps, 1.0858 BPB, 3 seeds (each run ~12 hours on 4xA100MIG)) by OnlyJundong · Pull Request #1424 · openai/parameter-golf

OnlyJundong · 2026-04-06T19:03:31Z

Summary

This is a non-record submission. It extends the 20K-step scaling run to 50K steps (~12 hours), using the same architecture and code as PR #549 by @abaybektursun. Training ran on 4×A100 MIG instances, approximately 10× slower per step than 8×H100 SXM.

Results

50K steps, ~12 hours (4×A100 MIG, 3-seed comparison)

Seed	step_avg	steps	Pre-TTT bpb	Post-TTT bpb	TTT gain	Artifact
1337	829.0ms	50,000	1.0942	1.0853	-0.0089	14,330,478
42	830.9ms	50,000	1.0945	1.0857	-0.0088	14,206,210
2024	828.7ms	50,000	1.0953	1.0865	-0.0088	14,353,974
Mean	829.5ms	50,000	1.0947	1.0858 (std 0.0005)	-0.0088	14,296,887
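
As a quick sanity check, the Mean row can be recomputed from the three per-seed rows. The sketch below uses only the rounded values printed in the table, so it is an approximation of whatever the original logs contain. With these inputs the population standard deviation rounds to the reported 0.0005, while the sample standard deviation comes out slightly higher. The wall-clock estimate assumes that step_avg covers training steps only (evaluation and TTT time excluded), which is an assumption, not something stated in the PR.

```python
import statistics

# Rounded per-seed values copied from the results table above.
post_ttt_bpb = {1337: 1.0853, 42: 1.0857, 2024: 1.0865}
step_avg_ms = {1337: 829.0, 42: 830.9, 2024: 828.7}
STEPS = 50_000

mean_bpb = statistics.fmean(post_ttt_bpb.values())
pop_std = statistics.pstdev(post_ttt_bpb.values())    # divides by n
sample_std = statistics.stdev(post_ttt_bpb.values())  # divides by n - 1

# Training-only wall clock per seed, in hours (assumes step_avg excludes eval/TTT).
train_hours = {seed: ms * STEPS / 1000 / 3600 for seed, ms in step_avg_ms.items()}

print(f"mean post-TTT bpb: {mean_bpb:.4f}")
print(f"population std:    {pop_std:.4f}")
print(f"sample std:        {sample_std:.4f}")
for seed, hours in train_hours.items():
    print(f"seed {seed}: {hours:.2f} h of training steps")
```

Each seed lands at roughly 11.5 hours of pure step time, which is consistent with the "~12 hours" figure once evaluation overhead is added.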

Comparison: 20K vs 50K

Metric	20K steps (~6h)	50K steps (~12h)	Delta
Post-TTT BPB	1.0960	1.0858	−0.0102
Artifact size	~15.05 MB	~14.30 MB	−0.75 MB
TTT gain	−0.0058	−0.0088	−0.0030 (53% more)
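
The Delta column can be reproduced from the other two columns. This is a minimal sketch using the rounded table entries. From those rounded values the relative increase in TTT gain comes out near 52%, so the reported 53% presumably reflects unrounded numbers; that reading is an assumption.

```python
# Rounded values copied from the 20K vs 50K comparison table.
bpb_20k, bpb_50k = 1.0960, 1.0858
size_20k_mb, size_50k_mb = 15.05, 14.30
ttt_20k, ttt_50k = -0.0058, -0.0088

delta_bpb = bpb_50k - bpb_20k            # lower is better
delta_size_mb = size_50k_mb - size_20k_mb
delta_ttt = ttt_50k - ttt_20k            # more negative means a larger TTT gain
relative_ttt_increase = delta_ttt / ttt_20k

print(f"Post-TTT BPB delta:  {delta_bpb:+.4f}")
print(f"Artifact size delta: {delta_size_mb:+.2f} MB")
print(f"TTT gain delta:      {delta_ttt:+.4f} ({relative_ttt_increase:.0%} more)")
```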

Plots

BPB vs Steps (ASCII plot)

Power-law decay with three distinct phases: rapid early learning (0–10K), plateau (10K–30K), then warmdown-driven final drop.

BPB
4.10 |*
     |
     |
2.50 |
     |
1.26 | *
1.23 |   *
1.22 |     *
1.20 |       *  * * * * * * * *
1.18 |                           *
1.16 |                             *
1.14 |                               *
1.12 |                                 *
1.10 |                                   *
     +---------+--------+---------+-------> steps (K)
     0        10       20        30       50

     |<early>|<--- plateau --->|<warmdown>|
      (rapid)  (BPB ≈1.19–1.20)  (sharp drop)

Artifact Size vs Steps (ASCII plot)

Non-monotonic: grows to a peak of ~17.2 MB across roughly 10K–30K steps, above the 16 MB limit, then shrinks rapidly during warmdown.

MB
17.2 |       * * * * * * * * * * * *
16.0 |--------------------------------------  16MB limit
15.5 |
14.9 |                                * *
14.4 |                                    *
13.9 |    *
13.1 | *
 4.6 |*
     +---------+--------+---------+-------> steps (K)
     0        10       20        30       50

     |<fits>|<-------over budget------->|<fits>|
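
The fits / over budget / fits pattern in the plot can be checked mechanically. The sketch below classifies the y-axis values read off the ASCII plot (approximate, in step order) against the 16 MB line, then computes the headroom of the three final artifacts. It assumes 1 MB = 10^6 bytes, which matches how the results table pairs 14,296,887 bytes with ~14.30 MB; the exact byte definition of the limit is not stated in this PR.

```python
LIMIT_MB = 16.0  # limit as drawn in the plot; decimal MB is an assumption

# Approximate sizes read off the ASCII plot, in increasing step order.
plot_sizes_mb = [4.6, 13.1, 13.9, 17.2, 14.9, 14.4]
status = ["fits" if size <= LIMIT_MB else "over budget" for size in plot_sizes_mb]

# Final 50K-step artifact sizes in bytes, from the results table.
final_artifact_bytes = {1337: 14_330_478, 42: 14_206_210, 2024: 14_353_974}
headroom_mb = {seed: LIMIT_MB - size / 1e6 for seed, size in final_artifact_bytes.items()}

for size, label in zip(plot_sizes_mb, status):
    print(f"{size:5.1f} MB -> {label}")
for seed, room in headroom_mb.items():
    print(f"seed {seed}: {room:.2f} MB under the limit")
```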

MatoTeziTanka · 2026-04-11T20:04:08Z

Community Review — Non-record: Extended Compute Scaling Analysis (50K steps, 1.0858 BPB, 3 seeds (each run ~12 hours on 4xA100MIG))

BPB: 1.0858 | Compliance: LOOKS CLEAN — score-first-per-chunk TTT (legal #1416/#1423 pattern)

What I found in the code (head SHA e73d565a36ff, file records/track_non_record_16mb/2026-04-06_ExtendedCompute_20K_6hours_4xA100MIG_ScalingAnalysis/train_gpt.py):

The TTT path at line 1095 implements the score-first-per-chunk pattern: each chunk is scored under torch.no_grad() / inference_mode() before the base_model.train() + SGD adaptation runs on that same chunk, with an is_last_chunk guard so the final chunk gets no adaptation pass. This is the structural shape the legal frontier uses (PRs #1416 erichroepke, #1423 aryanbhosale).

Per Issue #402 and Issue #677, TTT is legal when each token is scored before the adapter updates on it, and that's what the code does here — chunk ci is scored under weights adapted only on chunks 0..ci-1. No prequant_ttt_adapt_adamw(val_tokens, ...) multi-epoch fine-tune, no scored-region SLOT, no target-in-key n-gram cache.
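
The ordering described above can be illustrated with a toy stand-in. This is not the code from train_gpt.py: the neural model and SGD step are replaced by a Laplace-smoothed unigram counter so that the example runs with the standard library alone, and the function name score_first_ttt is hypothetical. What it preserves is the control flow under review: chunk ci is scored with state adapted only on chunks 0..ci-1, adaptation happens after scoring, and the final chunk gets no adaptation pass. It reports bits per token, not bits per byte.

```python
import math

def score_first_ttt(chunks, vocab_size):
    """Toy score-first-per-chunk loop; returns (bits per token, number of updates)."""
    counts = [1.0] * vocab_size  # stand-in for model weights (Laplace smoothing)
    total_bits = 0.0
    total_tokens = 0
    updates = 0
    for ci, chunk in enumerate(chunks):
        is_last_chunk = ci == len(chunks) - 1
        # 1) Score chunk ci using state adapted only on chunks 0..ci-1.
        denom = sum(counts)
        for tok in chunk:
            total_bits += -math.log2(counts[tok] / denom)
        total_tokens += len(chunk)
        # 2) Adapt on the chunk that was just scored; skip after the final chunk.
        if not is_last_chunk:
            for tok in chunk:
                counts[tok] += 1.0
            updates += 1
    return total_bits / total_tokens, updates

bits_single, updates_single = score_first_ttt([[0, 1]], vocab_size=2)
bits_two, updates_two = score_first_ttt([[0, 0, 0, 0], [0, 0, 0, 0]], vocab_size=2)
print(bits_single, updates_single)
print(bits_two, updates_two)
```

With a single chunk nothing is adapted, so the score reflects the unadapted state only. With two chunks the second one benefits from adaptation on the first, but no token is ever scored by state that has already seen it.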

CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=1024, code=91502 B, SMOKE_TEST_PASS

Verdict: LOOKS CLEAN.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending standard checks (3-seed validation, 16MB artifact cap, 10-min wallclock on 8×H100 SXM). The compliance picture matches the legal reference frontier and no flags were raised by the classification pass.

Auto-classification caveat: this review was drafted by the AST-based classifier against a template derived from manually-reviewed cluster PRs (#1420, #1450, #1487, #1541, #1529, #1533, #1518). If I've misread a subtlety in your eval path — e.g., multi-epoch TTT that I mistook for single-pass, or a target-in-key lookup I missed in a helper function — please flag it and I'll re-run the audit manually.

Reviewed by @MatoTeziTanka — The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): import OK in 0.04s, dim=512, layers=11, vocab=1024, code=91502 B, SMOKE_TEST_PASS. Classification via deterministic AST-based classify_prs.py (pattern bank derived from ~65 manually-reviewed PRs earlier in the 2026-04-11 sweep). This review was auto-drafted from a template and spot-checked before posting — if the template misread your code, please call it out so I can iterate the classifier.

OnlyJundong added 3 commits April 6, 2026 01:02

80311ef: Add non-record submission: Extended Compute Scaling Analysis (20K steps, ~6h, 4×A100 MIG)
0e44679: Add non-record submission: Extended Compute Scaling Analysis (50K steps, ~12h, 4×A100 MIG)
e73d565: added 20k pr link

OnlyJundong wants to merge 3 commits into openai:main from OnlyJundong:nonrecord/extended-compute-scaling-50k-12hours