
Record: 12L RecycledCore Int5 — val_bpb 1.1464 (seed 1337)#1573

Open
shivangbaveja wants to merge 1 commit into openai:main from shivangbaveja:sbaveja/submission/11L-4xMLP-RecycledCore-ParallelResid

Conversation

@shivangbaveja

Summary

Single-seed (1337) run on 8xH100 SXM, 600s wallclock. 12-layer model with recycled-core depth (layers 3,4 replayed once, giving 14 virtual layers), gated attention, value residual, and int5 quantization on both MLP and attention weights to fit under 16 MB.

| Metric | Value |
| --- | --- |
| Int5 sliding-window BPB | 1.1464 |
| Int5 roundtrip BPB | 1.1699 |
| Pre-quant BPB | 1.1329 |
| Post-EMA BPB | 1.1333 |
| Steps | 5,449 |
| ms/step | 110.14 |
| Wallclock | 600s |
| Artifact size | 15,925,822 bytes (15.93 MB) |

Key Techniques

  • 12L recycled-core (default NUM_LAYERS=12): layers 3,4 replayed once for 14 virtual layers from 12 physical, with per-application learned modulation (scale+bias)
  • LeakyReLU(0.5)^2 MLP activation with 3x expansion (512->1536->512)
  • Gated attention + value residual: per-head sigmoid gate on attention output, first-layer values blended into all subsequent layers
  • Int5 quantization on both MLP and attention weights (GPTQ-lite per-row, pack 8 values -> 5 bytes) + LZMA preset=9
  • max-autotune compile with cudagraph_trees=False for tied embeddings
  • SWA (every 50 steps, 13 snapshots) + EMA (decay=0.997)
  • XSA last 4 layers, partial RoPE (16/64 dims), LN scale, U-Net skips, BigramHash (2048), SmearGate, value embeddings (layers 10,11)
  • Muon optimizer (lr=0.025, WD=0.04) + Adam (embeddings/scalars, WD=0.04)
  • Step-based warmdown over last 3500 steps
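The recycled-core depth above can be sketched as a layer schedule: 12 physical blocks are traversed with layers 3 and 4 applied a second time, yielding 14 virtual layers. The placement of the replay and the helper names below are assumptions for illustration, not the record's actual code:

```python
def build_layer_schedule(num_layers=12, recycled=(3, 4)):
    """Return the order in which physical layers are applied.

    Assumption: the recycled block (layers 3, 4) is replayed immediately
    after its first pass; the record only specifies that these layers run
    twice, not where the replay sits.
    """
    schedule = []
    for i in range(num_layers):
        schedule.append(i)
        if i == recycled[-1]:
            schedule.extend(recycled)  # second application of the recycled core
    return schedule

schedule = build_layer_schedule()  # 14 entries; layers 3 and 4 appear twice
```

Per the bullet above, each application of a recycled layer would carry its own learned modulation (scale+bias), so the two passes through the same weights are not identical functions.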
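The MLP activation is LeakyReLU with slope 0.5, squared. A minimal scalar sketch (the real model applies this elementwise to tensors):

```python
def squared_leaky_relu(x, negative_slope=0.5):
    """LeakyReLU(negative_slope) followed by squaring.

    Squaring keeps the output non-negative while the 0.5 slope
    preserves gradient signal for negative inputs.
    """
    y = x if x >= 0 else negative_slope * x
    return y * y
```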
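The int5 bit-packing (8 five-bit codes into 5 bytes, i.e. 40 bits) can be sketched in pure Python. The quantization step itself (GPTQ-lite per-row) is omitted, and the little-endian bit layout is an assumption:

```python
def pack_int5(codes):
    """Pack 8 quantized codes in [0, 31] into 5 bytes (8 * 5 = 40 bits)."""
    assert len(codes) == 8 and all(0 <= c < 32 for c in codes)
    bits = 0
    for i, c in enumerate(codes):
        bits |= c << (5 * i)          # assumed little-endian bit layout
    return bits.to_bytes(5, "little")

def unpack_int5(raw):
    """Inverse of pack_int5: recover the 8 five-bit codes."""
    bits = int.from_bytes(raw, "little")
    return [(bits >> (5 * i)) & 0x1F for i in range(8)]
```

Per the bullet above, the packed byte stream would then be LZMA-compressed (preset=9) to produce the sub-16 MB artifact.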
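The step-based warmdown keeps the learning rate flat and then decays it over the final 3,500 steps. A linear decay to zero is an assumption here; the record only states the warmdown window:

```python
def lr_multiplier(step, total_steps=5449, warmdown_steps=3500):
    """LR scale factor: 1.0 until the warmdown window, then linear to 0."""
    remaining = total_steps - step
    if remaining >= warmdown_steps:
        return 1.0
    return remaining / warmdown_steps
```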

Results

| Seed | step_avg | Steps | Pre-quant BPB | Int5 roundtrip BPB | Int5 sliding BPB | Artifact (bytes) |
| --- | --- | --- | --- | --- | --- | --- |
| 1337 | 110.14 ms | 5,449 | 1.1329 | 1.1699 | 1.1464 | 15,925,822 |

Single-seed run. Seeds 42 and 2025 not yet completed.

Reproduction

```shell
# From repo root (all params are defaults at commit bf17df6):
SEED=1337 torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-04-12_11L_4xMLP_RecycledCore_ParallelResid_Int5/train_gpt.py
```

Credits

