
Record: Discriminative TTT — val_bpb 1.0807 (3-seed mean) #6

Closed
resouer wants to merge 4 commits into main from submission/discriminative-ttt

Conversation


@resouer resouer commented Apr 4, 2026

Summary

3-seed mean val_bpb: 1.0807 (std 0.0005) | ~15.8 MB | 8xH100 SXM | ~185s TTT eval

Merged SOTA (PR openai#1019, 3-seed mean): 1.88218 nats. This run: 1.82463 nats. Delta: -0.058 nats. Clears the 0.005-nat threshold.

Track A (fixed predictor) — zero eval-time adaptation.

Results (3-seed)

| Seed | Sliding BPB | val_loss (nats) | Artifact (bytes) |
|------|-------------|-----------------|------------------|
| 1337 | 1.0803 | 1.8241 | 15,815,343 |
| 42 | 1.0805 | 1.8243 | 15,810,497 |
| 2025 | 1.0812 | 1.8255 | 15,804,659 |
| Mean | 1.0807 | 1.8246 | |

Changes from Merged SOTA (PR openai#1019)

1. Discriminative TTT — per-block adaptive LR (Novel)

Pre-quant AdamW TTT with per-block learning rate scaling: early blocks get 0.3x base LR (preserve learned features), later blocks get 1.0x (full adaptation). Linear interpolation across 11 blocks. Combined with freeze=0 (all blocks trainable) and 10 epochs. Inspired by ULMFiT discriminative fine-tuning (Howard & Ruder 2018).

Nearest PR: openai#1306 (flat LR, freeze=2, 6 epochs). Difference: a graduated per-block LR replaces the binary freeze, letting every block adapt at a calibrated rate. Delta: -0.010 BPB vs. flat-LR TTT.
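A minimal sketch of the graduated per-block LR described above, assuming the usual one-param-group-per-block AdamW setup; function names and the base LR are illustrative, not taken from the PR diff:

```python
# Hypothetical sketch of ULMFiT-style discriminative LRs for the
# pre-quant TTT phase. lo=0.3 and hi=1.0 mirror the 0.3x..1.0x
# scaling described above; everything else is illustrative.

def lr_multipliers(num_blocks, lo=0.3, hi=1.0):
    """Linearly interpolate the LR multiplier across blocks:
    block 0 gets `lo`, the last block gets `hi`."""
    if num_blocks == 1:
        return [hi]
    step = (hi - lo) / (num_blocks - 1)
    return [lo + step * i for i in range(num_blocks)]

def discriminative_param_groups(blocks, base_lr):
    """One optimizer param group per transformer block, each with its
    own scaled LR. `blocks` is any sequence of modules exposing
    .parameters() (e.g. a model.blocks-style ModuleList)."""
    mults = lr_multipliers(len(blocks))
    return [{"params": list(b.parameters()), "lr": base_lr * m}
            for b, m in zip(blocks, mults)]
```

The resulting groups can be handed straight to `torch.optim.AdamW(discriminative_param_groups(model.blocks, base_lr))`, replacing the single flat-LR group of the openai#1306 recipe.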

2. Pre-quant AdamW TTT (10 epochs, freeze=0)

AdamW TTT on the full-precision EMA weights before GPTQ quantization. 10 epochs (up from 6), all blocks trainable. Delta: -0.022 BPB from the base recipe, plus -0.010 BPB from the discriminative LR.

3. Coprime-stride multi-shard data loader

Weighted random shard sampling with a coprime stride. Delta: -0.003 BPB.
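The loader idea can be sketched as follows: pick a shard by weighted random choice, then walk positions inside it with a stride coprime to the shard length, so every position is visited exactly once per cycle without materializing a permutation. This is a hedged reconstruction of the technique named above (from PR openai#1184), not the PR's actual code:

```python
import math
import random

def coprime_stride(n, seed=0):
    """Pick a stride coprime to n, so (start + k*stride) % n visits
    every index exactly once before repeating."""
    rng = random.Random(seed)
    while True:
        s = rng.randrange(1, n)
        if math.gcd(s, n) == 1:
            return s

def sample_batches(shards, weights, batch, seed=0):
    """Weighted random shard choice; within a shard, advance a cursor
    by a coprime stride to get a full-coverage pseudo-shuffle."""
    rng = random.Random(seed)
    cursors = {}
    while True:
        i = rng.choices(range(len(shards)), weights=weights)[0]
        n = len(shards[i])
        if i not in cursors:
            cursors[i] = (rng.randrange(n), coprime_stride(n, seed + i))
        pos, stride = cursors[i]
        idx = [(pos + k * stride) % n for k in range(batch)]
        cursors[i] = ((pos + batch * stride) % n, stride)
        yield [shards[i][j] for j in idx]
```

The coprime property is what guarantees uniform coverage of each shard: the map `k -> (start + k*stride) mod n` is a bijection on `0..n-1` exactly when `gcd(stride, n) == 1`.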

4. Config batch (QK_GAIN=5.0, WARMDOWN=4000, GPTQ damp=0.005)

Delta: ~-0.003 BPB combined.

Compliance (Track A — Fixed Predictor)

  • No SLOT — no eval-time delta optimization
  • No TTT during eval — all TTT before quantization, within training budget
  • No n-gram cache — no eval-time statistics
  • No eval-time adaptation of any kind — model frozen after training+TTT+GPTQ
  • Standard autoregressive sliding-window eval (stride=64)
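The window bookkeeping implied by the last bullet can be sketched as follows; only stride=64 is stated above, so the 1024-token context length here is an assumption for illustration. Each window scores its final `stride` tokens with maximal left context, so every token is scored exactly once:

```python
def sliding_windows(n_tokens, context=1024, stride=64):
    """Yield (window_start, window_end, score_start) triples for
    autoregressive sliding-window eval: the window spans at most
    `context` tokens, and only [score_start, window_end) is scored.
    NOTE: context=1024 is an illustrative assumption, not a value
    stated in the PR."""
    pos = 0
    while pos < n_tokens:
        end = min(pos + stride, n_tokens)
        start = max(0, end - context)
        yield (start, end, pos)
        pos = end
```

Summing per-token losses over the scored spans and dividing by total bytes yields the sliding BPB reported in the results table.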

Reproduction

pip install flash_attn_3 --no-deps --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291/
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

Base: PR openai#1019 (@abaybektursun). Pre-quant TTT: PR openai#1006. Coprime loader: PR openai#1184 (@icryo). Discriminative fine-tuning concept: ULMFiT (Howard & Ruder 2018). Freeze=0 insight: @MatoTeziTanka (Issue openai#140).

abaybektursun and others added 4 commits March 28, 2026 08:32
…11473 (3-seed mean)

AR self-generated calibration (no val/train data during quantization).
Recreated from PR openai#728 at @valerio-oai's request for clarity.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…ptq-xsa-bigramhash3072

Record: AR Self-Gen GPTQ + XSA-all + BigramHash 3072×112 — val_bpb 1.11473 (3-seed mean)
3-seed mean 1.0807 (std 0.0005). Beats merged SOTA (1.1147) by 0.034.
Track A — zero eval-time adaptation.

Novel: per-block adaptive LR during pre-quant TTT (0.3x early to 1.0x late).
No existing PR modulates LR per block in TTT.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@resouer resouer force-pushed the submission/discriminative-ttt branch from b3b5bda to dbb4448 on April 4, 2026 at 15:19

resouer commented Apr 4, 2026

Closing: same reason. Branch is clean and ready.

@resouer resouer closed this Apr 4, 2026