Record: Discriminative TTT — val_bpb 1.0807 (3-seed mean) #6
Closed
Summary
3-seed mean val_bpb: 1.0807 (std 0.0005) | ~15.8 MB | 8xH100 SXM | ~185s TTT eval
Merged SOTA (PR openai#1019, 3-seed mean): 1.88218 nats. This run: 1.82463 nats. Delta: -0.058 nats, well clear of the 0.005-nat threshold.
Track A (fixed predictor) — zero eval-time adaptation.
Results (3-seed)
Changes from Merged SOTA (PR openai#1019)
1. Discriminative TTT — per-block adaptive LR (Novel)
Pre-quant AdamW TTT with per-block learning-rate scaling: early blocks get 0.3x the base LR (preserving learned features), later blocks get 1.0x (full adaptation), with linear interpolation across the 11 blocks. Combined with freeze=0 (all blocks trainable) and 10 epochs. Inspired by discriminative fine-tuning in ULMFiT (Howard & Ruder 2018).
Nearest PR: openai#1306 (flat LR, freeze=2, 6 epochs). The difference: a graduated per-block LR replaces the binary freeze, letting every block adapt at a calibrated rate. Delta: -0.010 bpb vs flat-LR TTT.
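The per-block schedule above reduces to a linear ramp of LR multipliers. A minimal pure-Python sketch (the function name and the AdamW hookup in the comment are illustrative, not taken from the PR):

```python
def per_block_lr_scales(n_blocks: int, early: float = 0.3, late: float = 1.0) -> list[float]:
    """Linearly interpolate an LR multiplier from `early` (block 0) to `late` (last block)."""
    if n_blocks == 1:
        return [late]
    return [early + (late - early) * i / (n_blocks - 1) for i in range(n_blocks)]

# With a framework like PyTorch, these scales would feed AdamW param groups, e.g.:
#   groups = [{"params": blk.parameters(), "lr": base_lr * s}
#             for blk, s in zip(model.blocks, per_block_lr_scales(len(model.blocks)))]
scales = per_block_lr_scales(11)  # 11 blocks, as in this record
```

Block 0 trains at 0.3x the base LR, block 10 at the full rate, and the midpoint block at 0.65x.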
2. Pre-quant AdamW TTT (10 epochs, freeze=0)
AdamW TTT on full-precision EMA weights before GPTQ. 10 epochs (up from 6), all blocks trainable. Delta: -0.022 bpb from the base change, plus -0.010 bpb from the discriminative LR above.
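The ordering that matters here is TTT first, quantization second. A toy sketch of that step, assuming a stand-in model and a reconstruction loss in place of the real transformer, calibration loader, and objective:

```python
import copy
import torch
from torch import nn

torch.manual_seed(0)

# Illustrative stand-ins: the real run uses the repo's transformer and its
# self-generated calibration batches, not this toy autoencoder-style setup.
model = nn.Sequential(nn.Linear(4, 8), nn.GELU(), nn.Linear(8, 4))
ema_model = copy.deepcopy(model)            # TTT starts from full-precision EMA weights
for p in ema_model.parameters():
    p.requires_grad_(True)                  # freeze=0: every block stays trainable

opt = torch.optim.AdamW(ema_model.parameters(), lr=1e-3)
calib = [torch.randn(16, 4) for _ in range(4)]

with torch.no_grad():
    loss_before = sum((ema_model(x) - x).pow(2).mean() for x in calib).item()

for epoch in range(10):                     # 10 epochs, up from 6 in the nearest PR
    for x in calib:
        loss = (ema_model(x) - x).pow(2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

with torch.no_grad():
    loss_after = sum((ema_model(x) - x).pow(2).mean() for x in calib).item()
# GPTQ quantization runs on ema_model's weights only after this loop.
```

Quantizing after adaptation means GPTQ sees weights already fit to the calibration distribution, rather than adapting an already-quantized model.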
3. Coprime-stride multi-shard data loader
Weighted random shard sampling with coprime stride. Delta: -0.003 bpb.
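The coprime-stride half of this loader rests on a standard number-theory fact: stepping through n items with a stride coprime to n visits every index exactly once before repeating. A minimal sketch of just that walk (the real loader additionally does weighted random shard sampling, which is omitted here):

```python
from math import gcd

def coprime_stride_order(n_items: int, stride: int):
    """Yield all n_items indices exactly once via i -> (i + stride) % n_items.

    Because gcd(stride, n_items) == 1, the walk is a single full cycle:
    a cheap, deterministic shuffle across shards with no bookkeeping.
    """
    assert gcd(stride, n_items) == 1, "stride must be coprime to n_items"
    idx = 0
    for _ in range(n_items):
        yield idx
        idx = (idx + stride) % n_items

order = list(coprime_stride_order(10, 3))  # visits 0, 3, 6, 9, 2, 5, 8, 1, ...
```

Each shard is touched once per cycle at equal spacing, avoiding the clustering a naive random walk can produce.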
4. Config batch (QK_GAIN=5.0, WARMDOWN=4000, GPTQ damp=0.005)
Delta: ~-0.003 bpb combined.
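For reference, the three tweaks collected in one place; the key names below are hypothetical stand-ins, and only the values come from this record (the real flag names live in the training script):

```python
# Hypothetical config fragment; values from this record, key names illustrative.
CONFIG = {
    "qk_gain": 5.0,          # QK_GAIN raised to 5.0
    "warmdown_steps": 4000,  # WARMDOWN schedule length
    "gptq_damp": 0.005,      # GPTQ Hessian damping factor
}
```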
Compliance (Track A — Fixed Predictor)
Reproduction
Credits
Base: PR openai#1019 (@abaybektursun). Pre-quant TTT: PR openai#1006. Coprime loader: PR openai#1184 (@icryo). Discriminative fine-tuning concept: ULMFiT (Howard & Ruder 2018). Freeze=0 insight: @MatoTeziTanka (Issue openai#140).