
Record: Discriminative TTT — val_bpb 1.0807 (3-seed mean) #6

Closed
resouer wants to merge 4 commits into main from submission/discriminative-ttt

Conversation


@resouer resouer commented Apr 4, 2026

Summary

3-seed mean val_bpb: 1.0807 (std 0.0005) | ~15.8 MB | 8xH100 SXM | ~185s TTT eval

Merged SOTA (PR openai#1019, 3-seed mean): 1.88218 nats. This run: 1.82463 nats. Delta: -0.058 nats. Clears the 0.005-nat threshold.

Track A (fixed predictor) — zero eval-time adaptation.

Results (3-seed)

| Seed | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------------|-----------------|------------------|
| 1337 | 1.0803 | 1.8241 | 15,815,343 |
| 42 | 1.0805 | 1.8243 | 15,810,497 |
| 2025 | 1.0812 | 1.8255 | 15,804,659 |
| Mean | 1.0807 | 1.8246 | |

Changes from Merged SOTA (PR openai#1019)

1. Discriminative TTT — per-block adaptive LR (Novel)

Pre-quant AdamW TTT with per-block learning rate scaling: early blocks get 0.3x base LR (preserve learned features), later blocks get 1.0x (full adaptation). Linear interpolation across 11 blocks. Combined with freeze=0 (all blocks trainable) and 10 epochs. Inspired by ULMFiT discriminative fine-tuning (Howard & Ruder 2018).

Nearest PR: openai#1306 (flat LR, freeze=2, 6 epochs). Difference: a graduated per-block LR replaces the binary freeze, letting every block adapt at a calibrated rate. Delta: -0.010 BPB vs. flat-LR TTT.
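A minimal sketch of the graduated per-block LR described above, assuming the usual one-param-group-per-block AdamW setup; function names and the base LR are illustrative, not taken from the PR diff:

```python
# Hypothetical sketch of ULMFiT-style discriminative LRs for the
# pre-quant TTT phase. lo=0.3 and hi=1.0 mirror the 0.3x..1.0x
# scaling described above; everything else is illustrative.

def lr_multipliers(num_blocks, lo=0.3, hi=1.0):
    """Linearly interpolate the LR multiplier across blocks:
    block 0 gets `lo`, the last block gets `hi`."""
    if num_blocks == 1:
        return [hi]
    step = (hi - lo) / (num_blocks - 1)
    return [lo + step * i for i in range(num_blocks)]

def discriminative_param_groups(blocks, base_lr):
    """One optimizer param group per transformer block, each with its
    own scaled LR. `blocks` is any sequence of modules exposing
    .parameters() (e.g. a model.blocks-style ModuleList)."""
    mults = lr_multipliers(len(blocks))
    return [{"params": list(b.parameters()), "lr": base_lr * m}
            for b, m in zip(blocks, mults)]
```

The resulting groups can be handed straight to `torch.optim.AdamW(discriminative_param_groups(model.blocks, base_lr))`, replacing the single flat-LR group of the openai#1306 recipe.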

2. Pre-quant AdamW TTT (10 epochs, freeze=0)

AdamW TTT on the full-precision EMA weights before GPTQ quantization. 10 epochs (up from 6), all blocks trainable. Delta: -0.022 BPB from the base recipe, plus -0.010 BPB from the discriminative LR.

3. Coprime-stride multi-shard data loader

Weighted random shard sampling with a coprime stride. Delta: -0.003 BPB.
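The loader idea can be sketched as follows: pick a shard by weighted random choice, then walk positions inside it with a stride coprime to the shard length, so every position is visited exactly once per cycle without materializing a permutation. This is a hedged reconstruction of the technique named above (from PR openai#1184), not the PR's actual code:

```python
import math
import random

def coprime_stride(n, seed=0):
    """Pick a stride coprime to n, so (start + k*stride) % n visits
    every index exactly once before repeating."""
    rng = random.Random(seed)
    while True:
        s = rng.randrange(1, n)
        if math.gcd(s, n) == 1:
            return s

def sample_batches(shards, weights, batch, seed=0):
    """Weighted random shard choice; within a shard, advance a cursor
    by a coprime stride to get a full-coverage pseudo-shuffle."""
    rng = random.Random(seed)
    cursors = {}
    while True:
        i = rng.choices(range(len(shards)), weights=weights)[0]
        n = len(shards[i])
        if i not in cursors:
            cursors[i] = (rng.randrange(n), coprime_stride(n, seed + i))
        pos, stride = cursors[i]
        idx = [(pos + k * stride) % n for k in range(batch)]
        cursors[i] = ((pos + batch * stride) % n, stride)
        yield [shards[i][j] for j in idx]
```

The coprime property is what guarantees uniform coverage of each shard: the map `k -> (start + k*stride) mod n` is a bijection on `0..n-1` exactly when `gcd(stride, n) == 1`.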

4. Config batch (QK_GAIN=5.0, WARMDOWN=4000, GPTQ damp=0.005)

Delta: ~-0.003 BPB combined.

Compliance (Track A — Fixed Predictor)

  • No SLOT — no eval-time delta optimization
  • No TTT during eval — all TTT before quantization, within training budget
  • No n-gram cache — no eval-time statistics
  • No eval-time adaptation of any kind — model frozen after training+TTT+GPTQ
  • Standard autoregressive sliding-window eval (stride=64)
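The window bookkeeping implied by the last bullet can be sketched as follows; only stride=64 is stated above, so the 1024-token context length here is an assumption for illustration. Each window scores its final `stride` tokens with maximal left context, so every token is scored exactly once:

```python
def sliding_windows(n_tokens, context=1024, stride=64):
    """Yield (window_start, window_end, score_start) triples for
    autoregressive sliding-window eval: the window spans at most
    `context` tokens, and only [score_start, window_end) is scored.
    NOTE: context=1024 is an illustrative assumption, not a value
    stated in the PR."""
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context)
        yield (start, end, pos)
        pos = end
```

Summing per-token losses over the scored spans and dividing by total bytes yields the sliding BPB reported in the results table.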

Reproduction

pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

Base: PR openai#1019 (@abaybektursun). Pre-quant TTT: PR openai#1006. Coprime loader: PR openai#1184 (@icryo). Discriminative fine-tuning concept: ULMFiT (Howard & Ruder 2018). Freeze=0 insight: @MatoTeziTanka (Issue openai#140).

abaybektursun and others added 4 commits March 28, 2026 08:32
…11473 (3-seed mean)

AR self-generated calibration (no val/train data during quantization).
Recreated from PR openai#728 at @valerio-oai's request for clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ptq-xsa-bigramhash3072

Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean)
3-seed mean 1.0807 (std 0.0005). Beats merged SOTA (1.1147) by 0.034.
Track A — zero eval-time adaptation.

Novel: per-block adaptive LR during pre-quant TTT (0.3x early to 1.0x late).
No existing PR modulates LR per block in TTT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@resouer resouer force-pushed the submission/discriminative-ttt branch from b3b5bda to dbb4448 on April 4, 2026 at 15:19

resouer commented Apr 4, 2026

Closing: same reason. Branch is clean and ready.

@resouer resouer closed this Apr 4, 2026