Non-record: Int6 QAT + MLP1472 + SlidingWindow + TTT (val_bpb=1.1807)#301

Open
lookin-zz wants to merge 3 commits into openai:main from lookin-zz:submission/2026-03-21_Int6_QAT_MLP1472_SlidingWindow

Conversation


@lookin-zz lookin-zz commented Mar 21, 2026

Submission (Updated)

  • Track: track_non_record_16mb
  • val_bpb: 1.1807 (with TTT) / 1.1991 (post-quant without TTT)
  • Artifact size: 15,781,354 / 16,000,000 bytes
  • Approach: 9-layer, 512-dim GPT with int6 STE QAT (0.0017 quant gap), MLP hidden size 1472, aggressive warmdown, FP16 tied embeddings, batched sliding-window eval (stride=64), and full-weight test-time training (3 epochs of SGD with the first 2 blocks frozen; -0.018 BPB).
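For readers unfamiliar with the technique, a minimal sketch of int6 fake-quantization with a straight-through estimator, the kind of QAT the submission describes. Names and the per-tensor symmetric scale are illustrative assumptions, not the PR's actual code:

```python
import torch

def int6_ste_quantize(w: torch.Tensor) -> torch.Tensor:
    """Fake-quantize weights to int6 with a straight-through estimator.

    Forward pass uses the quantized values; backward pass lets gradients
    flow through unchanged via the detach trick.
    """
    levels = 2 ** 6                              # 64 representable int6 codes
    scale = w.abs().max() / (levels // 2 - 1)    # symmetric per-tensor scale
    q = torch.clamp(torch.round(w / scale),
                    -(levels // 2), levels // 2 - 1) * scale
    return w + (q - w).detach()                  # forward = q, grad flows to w
```

The "quant gap" the PR reports would then be the val_bpb difference between scoring with `w` directly and scoring with the fake-quantized weights.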

Update from v1

  • Added test-time training: full-weight SGD on validation data during eval
  • BPB improved from 1.1958 → 1.1807

See records/track_non_record_16mb/2026-03-21_Int6_QAT_MLP1472_SlidingWindow/README.md for details.

@lookin-zz lookin-zz changed the title Non-record: Int6 QAT + MLP1472 + Sliding Window (val_bpb=1.1958) Non-record: Int6 QAT + MLP1472 + SlidingWindow + TTT (val_bpb=1.1807) Mar 21, 2026
@MatoTeziTanka

Community Review — Non-record: Int6 QAT + MLP1472 + SlidingWindow + TTT (val_bpb=1.1807)

Compliance flags: N-gram family bug + Pre-Quant TTT violation


Analysis

Check 1 — N-gram Family Bug (target token in hash lookup key)

FLAGGED. The BigramHash.forward (lines 802–805) computes:

prev_ids = torch.cat([token_ids[:, :1], token_ids[:, :-1]], dim=1)
bucket = (prev_ids * 31 + token_ids) % self.num_buckets

The hash key uses (prev_id, current_token_id): the current (target) token participates in the key at position t, so the embedding looked up is a function of the token being predicted. This is the n-gram family bug: the model can trivially extract the target token from its own embedding input, bypassing the need to actually predict it. Concretely, the model predicts y[t] = token_ids[t] from x[t] = token_ids[t-1], while bucket[t] = (token_ids[t-1] * 31 + token_ids[t]) % N, so the embedding added to the representation at position t is a direct function of token_ids[t], the target. This encodes the answer into the input features. The adjacency XOR exemption (BigramHash with adjacent-input-only XOR) does not apply, because token_ids in the hash key is the token being predicted at that position, not a strictly-past input. CLOSE flag triggered.
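To make the flagged pattern concrete, here is a minimal reproduction of the leaky hash alongside a leak-free variant keyed only on strictly-past tokens. Function names are illustrative, not the submission's code:

```python
import torch

def leaky_bucket(token_ids: torch.Tensor, num_buckets: int) -> torch.Tensor:
    # The flagged pattern: the hash key includes the current token,
    # which is the prediction target at this position.
    prev_ids = torch.cat([token_ids[:, :1], token_ids[:, :-1]], dim=1)
    return (prev_ids * 31 + token_ids) % num_buckets

def causal_bucket(token_ids: torch.Tensor, num_buckets: int) -> torch.Tensor:
    # A leak-free variant: hash only strictly-past tokens (prev and
    # prev-prev), so the target never enters its own lookup key.
    prev1 = torch.cat([token_ids[:, :1], token_ids[:, :-1]], dim=1)
    prev2 = torch.cat([prev1[:, :1], prev1[:, :-1]], dim=1)
    return (prev2 * 31 + prev1) % num_buckets
```

Perturbing the token at position t changes `leaky_bucket` at position t (the input reveals the target) but leaves `causal_bucket` at position t unchanged.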

Check 2 — Pre-Quant TTT (multi-epoch AdamW on val_tokens without score-first)

CLOSE — partially applicable. The run_ttt function (lines 348–379) uses SGD (not AdamW), but runs multi-epoch training (3 epochs per PR description, ttt_epochs=3) directly on val_tokens without score-first gating. There is no is_last_chunk guard, no torch.no_grad() before optimizer step, and no score-first-per-chunk logic. The entire validation set is used for gradient updates across all epochs unconditionally. This is TTT on val data without the required legal-TTT safeguards. CLOSE flag triggered (TTT without score-first, no is_last_chunk guard, no no_grad before step).

Check 3 — Legal TTT (score-first-per-chunk, torch.no_grad() before step, is_last_chunk guard)

NOT PRESENT. The TTT implementation lacks all three legal-TTT markers: no score-first-per-chunk, no torch.no_grad() wrapping the optimizer step, no is_last_chunk guard. Does not qualify as legal TTT.
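Based on the three markers the review names, a compliant TTT loop might be sketched as follows. This is an illustrative reconstruction of the score-first-per-chunk discipline, not code from the repository, and the exact legal-TTT rules are defined by the track, not by this sketch:

```python
import torch

def legal_ttt_eval(model, loss_fn, chunks, optimizer):
    """Score-first-per-chunk TTT: each chunk is scored with the weights
    as they were BEFORE any gradient step on that chunk, so no chunk's
    score benefits from training on itself."""
    total_loss, n = 0.0, 0
    for i, chunk in enumerate(chunks):
        is_last_chunk = (i == len(chunks) - 1)
        with torch.no_grad():                 # score first, gradient-free
            total_loss += loss_fn(model, chunk).item()
            n += 1
        if not is_last_chunk:                 # never train on the final chunk
            optimizer.zero_grad()
            loss_fn(model, chunk).backward()  # update only after scoring
            optimizer.step()
    return total_loss / n
```

By contrast, the run_ttt in this PR (per the review) trains for 3 epochs over all of val_tokens and only then scores, so every scored chunk has already been trained on.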

Check 4 — Scored-Region SLOT

HOLD. The sliding-window eval (lines 266–335) scores only logits[:, -stride:, :] — the last stride positions of each window. This is a non-standard scored-region approach that selects which tokens contribute to the BPB metric. Requires human review for SLOT classification.
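For reference, scoring only the trailing positions of each window is itself a common long-context eval pattern: each token is scored once, with the longest left context the window allows. A minimal sketch under that interpretation (assumes stride < window and batch size 1; not the submission's code):

```python
import torch

def sliding_window_nll(model, token_ids, window: int, stride: int) -> float:
    """Mean next-token NLL (nats) over a long sequence, sliding a
    fixed-size window by `stride` and scoring each target exactly once."""
    nll, scored = 0.0, 1          # index of the next unscored target token
    T = token_ids.size(1)
    start = 0
    while scored < T:
        end = min(start + window, T)
        logits = model(token_ids[:, start:end])        # (B, L, V)
        lp = torch.log_softmax(logits[:, :-1], dim=-1) # predicts start+1..end-1
        keep = end - scored                            # not-yet-scored targets
        tgt = token_ids[:, scored:end]
        nll -= lp[:, -keep:].gather(-1, tgt.unsqueeze(-1)).sum().item()
        scored = end
        start += stride
    return nll / (T - 1)  # divide by ln(2) and bytes-per-token for bpb
```

The SLOT question is whether the submission's `logits[:, -stride:, :]` selection still covers every scored token exactly once, or drops/cherry-picks positions; that is what the review holds for human confirmation.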

Check 5 — Pure Neural

Two independent violations found:

  1. N-gram family bug: BigramHash.forward computes bucket[t] = (prev_ids[t] * 31 + token_ids[t]) % N, where token_ids[t] is the target token being predicted at position t. This encodes the answer into the input features, violating the ruling on PR #779 (Record: BackoffNgramMixer + Drift-Free TTT, 3-seed mean val_bpb=0.6683).

  2. Illegal TTT: run_ttt runs 3-epoch SGD over the full val_tokens unconditionally, with no score-first-per-chunk discipline.

Verdict: CLOSE — dual violation.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: Recommend CLOSE.


Reviewed by @MatoTeziTanka (The Agora). Compliance audit via LLM agent (Sonnet) reviewing the full train_gpt.py source. If this review misread your code, please call it out so I can re-audit manually.
