# Approach G: AR Self-Generated GPTQ Calibration

**Target: val_bpb < 1.12** | 8xH100 SXM, 600s | < 16MB artifact

## Technique

After training completes, the model autoregressively generates its own calibration data (64 sequences x 2048 tokens, temperature=0.8, fixed seed). These self-generated sequences are used to collect Hessians H = X^T X for full GPTQ quantization with Cholesky error compensation and column reordering.
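The generation step itself is ordinary temperature sampling. A minimal numpy sketch of one sampling step (illustrative only; `sample_next` and its signature are assumptions, not the repo's actual `generate_autoregressive_calib()`):

```python
import numpy as np

def sample_next(logits, temperature=0.8, rng=None):
    """Draw one token from temperature-scaled logits.

    Illustrates the AR self-generation step: temperature=0.8 plus a
    fixed seed makes the calibration set reproducible across runs.
    """
    if rng is None:
        rng = np.random.default_rng(1337)  # fixed seed for reproducibility
    z = logits / temperature
    z -= z.max()                # stabilize the softmax
    p = np.exp(z)
    p /= p.sum()
    return int(rng.choice(len(p), p=p))
```

During calibration the model would be rolled out this way 64 times for 2048 steps each, feeding every sampled token back in as the next input.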

**Why this is legal:** No training data and no validation data are accessed during quantization. The calibration data is entirely self-generated by the model. This approach was confirmed legal by the competition organizer on PR #1019.

**Why this is better than training-data calibration:** The self-generated text better represents the model's actual activation distribution at inference time, leading to lower quantization error. PR #1019 achieved 1.1147 BPB with this technique (current SOTA).
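Hessian collection is then a matter of re-running the model on its own generated tokens and accumulating H = X^T X from each quantized layer's inputs. A minimal numpy sketch (the function name is illustrative, not the repo's `collect_hessians_from_tokens()`):

```python
import numpy as np

def accumulate_hessian(activation_batches, dim):
    """Accumulate H = X^T X over batches of layer inputs.

    activation_batches: iterable of (n_tokens, dim) arrays captured
    from a layer's input during forward passes on self-generated text.
    """
    H = np.zeros((dim, dim))
    n = 0
    for X in activation_batches:
        X = X.astype(np.float64)   # accumulate in fp64 for stability
        H += X.T @ X
        n += X.shape[0]
    return H / max(n, 1)           # mean over calibration tokens
```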

## Base

Built on Approach B (Int5 GPTQ + 33.6M params + TTT), replacing the training-data GPTQ calibration with AR self-generated calibration.

## Key Changes from Approach B

1. **AR Self-Gen Calibration**: `generate_autoregressive_calib()` generates 64 sequences of 2048 tokens at temp=0.8
2. **Hessian Collection from Generated Data**: `collect_hessians_from_tokens()` collects H = X^T X from self-generated sequences
3. **Training Time Budget**: `max_wallclock_seconds` reduced from 590s to 390s to reserve ~210s for AR generation + Hessian collection
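The collected Hessians feed the GPTQ solver. Its core loop quantizes one column at a time, divides the residual by the diagonal of the inverse-Hessian Cholesky factor, and pushes the error onto the not-yet-quantized columns. A simplified numpy sketch (per-row symmetric scales; omits the column reordering and grouping a production implementation would use):

```python
import numpy as np

def gptq_quantize(W, H, n_bits=6, damp=0.01):
    """Quantize W (rows, cols) with GPTQ error compensation.

    H is the (cols, cols) calibration Hessian X^T X for this layer.
    A sketch of the technique, not the repo's exact code.
    """
    cols = W.shape[1]
    # Dampen the Hessian so the factorization is well-posed.
    Hd = H + damp * np.mean(np.diag(H)) * np.eye(cols)
    # Upper Cholesky factor U of the inverse: inv(Hd) = U.T @ U.
    U = np.linalg.cholesky(np.linalg.inv(Hd)).T
    W = W.astype(np.float64).copy()
    Q = np.zeros_like(W)
    qmax = 2 ** (n_bits - 1) - 1           # 31 for int6
    scale = np.abs(W).max(axis=1) / qmax   # per-row symmetric scale
    scale[scale == 0] = 1.0
    for j in range(cols):
        w = W[:, j]
        q = np.clip(np.round(w / scale), -qmax - 1, qmax) * scale
        Q[:, j] = q
        err = (w - q) / U[j, j]
        # Spread the quantization error over the remaining columns.
        W[:, j + 1:] -= np.outer(err, U[j, j + 1:])
    return Q
```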

## Architecture

- 11 layers, dim=512, 8 heads, 8 KV heads
- BigramHash embedding (6144 x 128)
- Value embeddings on layers 9,10
- XSA on last 11 layers
- SmearGate, U-Net skip connections
- ReLU^2 MLP (3.5x width)
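The architecture above could be captured in a config object like the following (a hypothetical sketch; the actual run configures these via environment variables in `run.sh`, and the derived MLP width is an assumption from "3.5x width"):

```python
from dataclasses import dataclass

@dataclass
class ApproachGConfig:
    """Hypothetical config mirroring the architecture list above."""
    num_layers: int = 11
    model_dim: int = 512
    num_heads: int = 8
    num_kv_heads: int = 8
    vocab_size: int = 1024
    bigram_vocab_size: int = 6144
    bigram_dim: int = 128
    ve_layers: tuple = (9, 10)   # value embeddings on layers 9, 10
    xsa_last_n: int = 11
    mlp_mult: float = 3.5

    @property
    def mlp_hidden_dim(self) -> int:
        # assuming "3.5x width" means 3.5 * model_dim
        return int(self.model_dim * self.mlp_mult)
```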

## Quantization

- Int6 GPTQ for attention + MLP weights (Cholesky error compensation, column reordering)
- Int8 for remaining weights
- zstd level 22 compression
- 10% magnitude pruning pre-quantization
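The pre-quantization pruning step can be sketched as unstructured global magnitude pruning (an illustration; the repo may prune per-tensor or per-layer instead):

```python
import numpy as np

def magnitude_prune(W, pct=0.10):
    """Zero the smallest-|w| fraction of entries before quantization.

    Pruned zeros quantize exactly and compress well under zstd.
    """
    k = int(W.size * pct)
    if k == 0:
        return W.copy()
    # The k-th smallest magnitude is the pruning threshold.
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    out = W.copy()
    out[np.abs(out) <= thresh] = 0.0
    return out
```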

## Eval

- Sliding window eval (stride=64)
- Score-first TTT (3 epochs, lr=0.0001, AdamW)
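With stride 64, sliding-window eval scores every token exactly once while giving each scored token up to 2048 tokens of left context: only the last 64 tokens of each window (after the first) contribute to the loss. The span bookkeeping can be sketched as follows (helper name is an assumption):

```python
def sliding_window_spans(n_tokens, seq_len=2048, stride=64):
    """Return (ctx_start, score_from, score_to) triples.

    Each window sees tokens [ctx_start, ctx_start + seq_len), but only
    tokens in [score_from, score_to) are scored, so every token is
    evaluated exactly once.
    """
    spans = []
    prev_end = 0
    for start in range(0, n_tokens, stride):
        end = min(start + seq_len, n_tokens)
        if end <= prev_end:
            break
        spans.append((start, prev_end, end))
        prev_end = end
        if end == n_tokens:
            break
    return spans
```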

## Run

```bash
NCCL_IB_DISABLE=1 RUN_ID=approach_g SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee run.log
```
---

**`records/track_10min_16mb/2026-04-01_ApproachG_SelfGenGPTQ/run.sh`**
```bash
#!/bin/bash
# Approach G: AR Self-Gen GPTQ
# Run on 8xH100 SXM within 600s budget
set -euo pipefail

export NCCL_IB_DISABLE=1
export RUN_ID="${RUN_ID:-approach_g}"
export SEED="${SEED:-1337}"

# Model architecture
export NUM_LAYERS=11
export MODEL_DIM=512
export NUM_HEADS=8
export NUM_KV_HEADS=8
export VOCAB_SIZE=1024
export TRAIN_SEQ_LEN=2048
export EVAL_SEQ_LEN=2048
export BIGRAM_VOCAB_SIZE=6144
export BIGRAM_DIM=128
export XSA_LAST_N=11
export VE_ENABLED=1
export VE_DIM=128
export VE_LAYERS="9,10"

# Optimizer
export MUON_WD=0.04
export ADAM_WD=0.04
export MATRIX_LR=0.025
export SCALAR_LR=0.025
export TIED_EMBED_LR=0.035
export MUON_MOMENTUM=0.99
export MUON_MOMENTUM_WARMUP_START=0.92
export MUON_MOMENTUM_WARMUP_STEPS=1500
export WARMDOWN_ITERS=3500

# Training budget: 390s train + ~210s AR self-gen GPTQ = 600s total
export MAX_WALLCLOCK_SECONDS=390
export ITERATIONS=20000

# Eval
export EVAL_STRIDE=64
export SWA_ENABLED=1

# Pruning
export PRUNE_PCT=0.10

# TTT
export TTT_EPOCHS=3
export TTT_LR=0.0001
export TTT_FREEZE_BLOCKS=2
export TTT_CHUNK_TOKENS=131072
export TTT_OPTIMIZER=adamw

torchrun --standalone --nproc_per_node=8 train_gpt.py 2>&1 | tee "/workspace/run_${RUN_ID}.log"
```
---

**Record metadata (JSON)**
```json
{
  "author": "elninja",
  "github_id": "elninja",
  "name": "AR Self-Gen GPTQ + Int6 + XSA + TTT",
  "blurb": "11L model with AR self-generated GPTQ calibration (64 seqs x 2048 tokens, temp=0.8). Model generates its own calibration data after training -- no training or validation data accessed during quantization. Full Hessian GPTQ with Cholesky error compensation. BigramHash 6144x128, value embeddings, sliding window eval, score-first TTT.",
  "date": "2026-04-01",
  "track": "10min_16mb",
  "val_loss": null,
  "val_bpb": null,
  "bytes_total": null,
  "bytes_code": null,
  "gpu": "8xH100 SXM",
  "training_time_seconds": 600,
  "num_layers": 11,
  "model_dim": 512,
  "num_heads": 8,
  "num_kv_heads": 8,
  "vocab_size": 1024,
  "train_seq_len": 2048,
  "eval_seq_len": 2048,
  "bigram_vocab_size": 6144,
  "bigram_dim": 128,
  "quantization": "int6 GPTQ (AR self-gen calibration, 64 seqs x 2048, temp=0.8)",
  "compression": "zstd level 22",
  "calibration": "AR self-generated (no external data during quantization)",
  "technique_summary": "AR self-gen GPTQ calibration replaces training-data calibration for legal compliance"
}
```