
Non-record: AR Self-Generated GPTQ Calibration (val_bpb=1.1461) #1234

Open

ibarrajo wants to merge 1 commit into openai:main from ibarrajo:approach-g

Conversation

@ibarrajo ibarrajo commented Apr 1, 2026

Summary

Results

| Config | val_bpb | Notes |
| --- | --- | --- |
| Approach B baseline (B6) | 1.1179 | Training-data GPTQ calibration |
| Approach G (self-gen GPTQ) + TTT | 1.1461 | TTT s_0 score |
| Approach G (self-gen GPTQ) base | 1.1559 | Before TTT |

Delta: +0.028 BPB vs baseline — self-gen GPTQ loses net.

Analysis: Why Self-Gen GPTQ Loses

The technique requires reserving ~210s for AR generation + Hessian collection, leaving only 390s for training (vs 590s baseline), i.e. ~34% fewer training steps. While self-generated calibration data better matches the model's inference-time activation distribution, the quantization improvement (~0.002-0.003 BPB) is far smaller than the loss from fewer training steps (~0.03 BPB). The technique would become net positive if:

  1. AR generation were faster (batched generation, shorter sequences)
  2. Training were more sample-efficient (higher LR, better schedule)
  3. The training budget were longer (giving a smaller relative time cost)
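The budget arithmetic above can be sketched directly. The constants come from the numbers in this analysis; treating the step-count loss and calibration gain as fixed BPB deltas is a simplifying assumption, not a measured fit.

```python
# Rough tradeoff arithmetic for self-generated GPTQ calibration.
# All constants are taken from the analysis above.

BASELINE_TRAIN_S = 590.0   # Approach B training time
SELFGEN_TRAIN_S = 390.0    # training time left after reserving ~210s for AR gen
STEP_LOSS_BPB = 0.03       # observed cost of the lost training steps
CALIB_GAIN_BPB = 0.0025    # midpoint of the ~0.002-0.003 BPB quantization gain

frac_steps_lost = 1.0 - SELFGEN_TRAIN_S / BASELINE_TRAIN_S
net_bpb = STEP_LOSS_BPB - CALIB_GAIN_BPB  # positive => technique loses net

print(f"training steps lost: {frac_steps_lost:.1%}")  # ~33.9%
print(f"net BPB delta: +{net_bpb:.4f}")               # +0.0275
```

Under this model, any of the three fixes listed above works by shrinking `frac_steps_lost` (and hence `STEP_LOSS_BPB`) until it drops below the calibration gain.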

Key Changes from Approach B

  1. generate_autoregressive_calib() — generates 64 sequences of 2048 tokens at temp=0.8
  2. collect_hessians_from_tokens() — collects H = X^T X from self-generated sequences
  3. Training budget reduced from 590s to 390s to accommodate generation time
  4. Int6 GPTQ with Cholesky error compensation and column reordering

Architecture

  • 11 layers, dim=512, 8 heads, 8 KV heads
  • BigramHash embedding (6144 x 128), Value embeddings
  • XSA on all layers, SmearGate, U-Net skip connections
  • ReLU^2 MLP (3.5x width)
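The ReLU^2 MLP bullet can be made concrete with a short module. This is a generic sketch of a squared-ReLU feed-forward block at 3.5x width; the layer names and bias choices are assumptions, not the PR's actual code.

```python
import torch
import torch.nn as nn


class ReluSquaredMLP(nn.Module):
    """Feed-forward block with ReLU^2 activation at `mult` x width.

    Matches the architecture bullet above (dim=512, 3.5x width);
    structure and naming are illustrative.
    """

    def __init__(self, dim=512, mult=3.5):
        super().__init__()
        hidden = int(dim * mult)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # relu(x)^2: cheap, smooth-ish nonlinearity popular in speedrun configs
        return self.down(torch.relu(self.up(x)) ** 2)
```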

Rule Compliance

  • Training <= 600s on 8xH100 (390s train + 210s AR gen + GPTQ)
  • Eval <= 600s
  • Artifact <= 16,000,000 bytes
  • No val tokens in artifact
  • No training data accessed during quantization (self-generated only)
  • GPTQ calibration within training budget
  • TTT is score-first only (s_0 reported)
  • Single-pass evaluation

Test Plan

  • Verified AR generation produces coherent text (not garbage)
  • Confirmed no training/val data accessed during GPTQ calibration
  • Full 8xH100 training run completed within time budget
  • Artifact fits under 16MB

🤖 Generated with Claude Code

Model generates its own GPTQ calibration data (64 seqs x 2048 tokens,
temp=0.8) after training, eliminating the need for training data at eval
time. Built on the Approach B base. Reserving time for AR generation cuts
the training budget from 590s to 390s, and the resulting loss from fewer
training steps outweighs the gain from better-matched calibration
distributions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
