
Non-record: AR Self-Generated GPTQ Calibration (val_bpb=1.1461) #1234

Open

ibarrajo wants to merge 1 commit into openai:main from ibarrajo:approach-g

Conversation

@ibarrajo ibarrajo commented Apr 1, 2026

Summary

Results

| Config | val_bpb | Notes |
| --- | --- | --- |
| Approach B baseline (B6) | 1.1179 | Training-data GPTQ calibration |
| Approach G (self-gen GPTQ) + TTT | 1.1461 | TTT s_0 score |
| Approach G (self-gen GPTQ) base | 1.1559 | Before TTT |

Delta: +0.028 BPB vs baseline — self-gen GPTQ loses net.

Analysis: Why Self-Gen GPTQ Loses

The technique requires reserving ~210s for AR generation + Hessian collection, leaving only 390s for training (vs 590s baseline), i.e. ~34% fewer training steps. While self-generated calibration data better matches the model's inference-time activation distribution, the quantization improvement (~0.002-0.003 BPB) is far smaller than the loss from fewer training steps (~0.03 BPB). The technique would become net positive if:

  1. AR generation were faster (batched generation, shorter sequences)
  2. Training were more sample-efficient (higher LR, better schedule)
  3. The training budget were longer (giving a smaller relative time cost)
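The budget arithmetic above can be sketched directly. The constants come from the numbers in this analysis; treating the step-count loss and calibration gain as fixed BPB deltas is a simplifying assumption, not a measured fit.

```python
# Rough tradeoff arithmetic for self-generated GPTQ calibration.
# All constants are taken from the analysis above.

BASELINE_TRAIN_S = 590.0   # Approach B training time
SELFGEN_TRAIN_S = 390.0    # training time left after reserving ~210s for AR gen
STEP_LOSS_BPB = 0.03       # observed cost of the lost training steps
CALIB_GAIN_BPB = 0.0025    # midpoint of the ~0.002-0.003 BPB quantization gain

frac_steps_lost = 1.0 - SELFGEN_TRAIN_S / BASELINE_TRAIN_S
net_bpb = STEP_LOSS_BPB - CALIB_GAIN_BPB  # positive => technique loses net

print(f"training steps lost: {frac_steps_lost:.1%}")  # ~33.9%
print(f"net BPB delta: +{net_bpb:.4f}")               # +0.0275
```

Under this model, any of the three fixes listed above works by shrinking `frac_steps_lost` (and hence `STEP_LOSS_BPB`) until it drops below the calibration gain.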

Key Changes from Approach B

  1. generate_autoregressive_calib() — generates 64 sequences of 2048 tokens at temp=0.8
  2. collect_hessians_from_tokens() — collects H = X^T X from self-generated sequences
  3. Training budget reduced from 590s to 390s to accommodate generation time
  4. Int6 GPTQ with Cholesky error compensation and column reordering

Architecture

  • 11 layers, dim=512, 8 heads, 8 KV heads
  • BigramHash embedding (6144 x 128), Value embeddings
  • XSA on all layers, SmearGate, U-Net skip connections
  • ReLU^2 MLP (3.5x width)
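The ReLU^2 MLP bullet can be made concrete with a short module. This is a generic sketch of a squared-ReLU feed-forward block at 3.5x width; the layer names and bias choices are assumptions, not the PR's actual code.

```python
import torch
import torch.nn as nn


class ReluSquaredMLP(nn.Module):
    """Feed-forward block with ReLU^2 activation at `mult` x width.

    Matches the architecture bullet above (dim=512, 3.5x width);
    structure and naming are illustrative.
    """

    def __init__(self, dim=512, mult=3.5):
        super().__init__()
        hidden = int(dim * mult)
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        # relu(x)^2: cheap, smooth-ish nonlinearity popular in speedrun configs
        return self.down(torch.relu(self.up(x)) ** 2)
```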

Rule Compliance

  • Training <= 600s on 8xH100 (390s train + 210s AR gen + GPTQ)
  • Eval <= 600s
  • Artifact <= 16,000,000 bytes
  • No val tokens in artifact
  • No training data accessed during quantization (self-generated only)
  • GPTQ calibration within training budget
  • TTT is score-first only (s_0 reported)
  • Single-pass evaluation

Test Plan

  • Verified AR generation produces coherent text (not garbage)
  • Confirmed no training/val data accessed during GPTQ calibration
  • Full 8xH100 training run completed within time budget
  • Artifact fits under 16MB

🤖 Generated with Claude Code

Model generates its own GPTQ calibration data (64 seqs x 2048 tokens,
temp=0.8) after training, eliminating the need for training data at eval
time. Built on the Approach B base. Reserving time for AR generation cuts
the training budget from 590s to 390s, and the resulting loss from fewer
training steps outweighs the gain from better-matched calibration
distributions.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
