
Record: Vocab4096 + MLP4.0x + SLOT - val_bpb 1.0925 (3-seed mean) #1291

Open

dentity007 wants to merge 4 commits into openai:main from NathanMaine:record/vocab4096-mlp4x-slot-1.0925

Conversation

@dentity007

Record: Vocab4096 + MLP4.0x + SLOT

val_bpb: 1.0925 (3-seed mean, std 0.0018) | ~15.95 MB | 8xH100 SXM | SLOT eval-time optimization

3-Seed Results

| Seed | Steps | Sliding BPB | + SLOT BPB | Artifact (bytes) |
|---|---|---|---|---|
| 42 | 5,165 | 1.1014 | 1.0947 | 15,954,746 |
| 1337 | 5,890 | 1.0981 | 1.0913 | 15,932,192 |
| 2025 | 5,900 | 1.0986 | 1.0915 | 15,948,156 |
| Mean | | 1.0994 | 1.0925 (std 0.0018) | |

Merged SOTA (PR #1019): 1.1147 BPB (1.8822 nats).
This submission: 1.0925 BPB (~1.8432 nats).
Delta: -0.0390 nats (-0.0222 BPB). Clears the 0.005-nat threshold by 7.8x.
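The headline delta can be rechecked directly from the numbers above (a throwaway sketch; the 0.005-nat record threshold is the one stated in this PR):

```python
# Re-deriving the headline delta from the figures quoted above.
sota_bpb, sota_nats = 1.1147, 1.8822   # merged SOTA (PR #1019)
this_bpb, this_nats = 1.0925, 1.8432   # this submission
delta_bpb = round(this_bpb - sota_bpb, 4)     # -0.0222
delta_nats = round(this_nats - sota_nats, 4)  # -0.039
margin = round(abs(delta_nats) / 0.005, 1)    # 7.8x the record threshold
print(delta_bpb, delta_nats, margin)
```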

Architecture

Built on PR #1218 (@clarkkev) with SLOT from PR #1176 (@bigbag).

  • 11L, d=512, 8H/4KV GQA, MLP 4.0x, Vocab 4096
  • XSA all layers, QK_GAIN=4.0, EMA 0.997
  • Full Hessian GPTQ (AR self-gen) + int6 + brotli-11
  • 34.4M params, dynamic warmdown 66.7%
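The int6 packing in the artifact pipeline can be illustrated with a minimal sketch; `pack_int6` below is a hypothetical helper, not the repo's actual packer, and the packed bytes would then be compressed with `brotli.compress(packed, quality=11)`:

```python
def pack_int6(values):
    """Pack integers in [0, 64) into 6 bits each (4 values -> 3 bytes)."""
    acc, nbits, out = 0, 0, bytearray()
    for v in values:
        assert 0 <= v < 64
        acc = (acc << 6) | v           # append 6 new bits
        nbits += 6
        while nbits >= 8:              # flush whole bytes
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:                          # pad the final partial byte
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)

packed = pack_int6(range(64))          # 64 six-bit values
print(len(packed))                     # 48 bytes instead of 64
```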

SLOT: Per-Batch Delta Optimization

After the sliding-window eval, SLOT optimizes a small delta vector of shape [1, 1, 512] at the last hidden layer:

  1. forward_hidden() under no_grad (frozen transformer)
  2. 8 AdamW steps (lr=0.005) through compute_logits() only
  3. Score with optimized delta, full softmax distribution

Delta re-initialized to zeros per batch. No cross-batch state. SLOT contribution: -0.007 BPB.
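The loop above can be sketched in PyTorch; the method names `forward_hidden()` and `compute_logits()` follow the PR text, but their signatures and the cross-entropy objective are assumptions:

```python
import torch
import torch.nn.functional as F

def slot_score(model, tokens, steps=8, lr=0.005):
    # 1. Frozen transformer: hidden states computed once under no_grad.
    with torch.no_grad():
        hidden = model.forward_hidden(tokens)          # [B, T, d]
    # Delta re-initialized to zeros for every batch (no cross-batch state).
    delta = torch.zeros(1, 1, hidden.size(-1), requires_grad=True)
    opt = torch.optim.AdamW([delta], lr=lr)
    # 2. A few AdamW steps; gradients flow through compute_logits() only.
    for _ in range(steps):
        logits = model.compute_logits(hidden + delta)
        loss = F.cross_entropy(
            logits[:, :-1].flatten(0, 1), tokens[:, 1:].flatten()
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    # 3. Score with the optimized delta, full normalized distribution.
    with torch.no_grad():
        return F.log_softmax(model.compute_logits(hidden + delta), dim=-1)
```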

Legality

  • SLOT is score-first: hidden states frozen before optimization
  • Full normalized distributions throughout
  • No TTT, no n-gram cache, no QAT
  • GPTQ uses AR self-generated calibration only
  • Delta optimization on already-evaluated tokens only

Credits

PR #1218 (@clarkkev), PR #1176 (@bigbag), PR #1019 (@abaybektursun)

Reproduction

pip install sentencepiece zstandard brotli
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
rm -f data/manifest.json
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --train-shards 143
SEED=42 SLOT_ENABLED=1 SLOT_LR=0.005 SLOT_STEPS=8 torchrun --standalone --nproc_per_node=8 train_gpt.py

Test Plan

  • 3 seeds verified (std 0.0018, p < 0.01)
  • All artifacts under 16,000,000 bytes
  • Training under 600s, eval under 600s
  • SLOT score-first with full distributions
  • No TTT, no n-gram cache

@dentity007
Author

DGX Spark GB10 Ablation Data - PROTEUS Feature Integration

Ran overnight ablation tests on NVIDIA DGX Spark (GB10, 128GB unified memory, single GPU) to evaluate PROTEUS features before committing to 8xH100 runs. All tests use sp1024 data, SEED=42, TORCH_COMPILE_DISABLE=1 (Triton/inductor not supported on GB10 ARM).

Phase 1: 3-Run Comparison (1000 iterations each)

| Run | Config | train_bpb | post-EMA | INT6 round | Sliding | SLOT | Size |
|---|---|---|---|---|---|---|---|
| 1 | Baseline (no features) | 1.4601 | 1.5277 | 1.5521 | - | - | 8.99 MB |
| 2 | Parallel + INT5 | CRASHED | - | - | - | - | - |
| 3 | Parallel + INT5 + SLOT | 1.4479 | 1.5010 | 1.5376 | 1.5165 | 1.5077 | 8.21 MB |

Delta (Run 3 vs Run 1): -0.0122 train_bpb, -0.0267 post-EMA, -0.0145 INT6 roundtrip

Phase 1 used TRAIN_BATCH_TOKENS=49152, VAL_BATCH_TOKENS=49152, full sliding window eval. Run 2 crashed during initialization (likely OOM from torch.compile fallback before TORCH_COMPILE_DISABLE was added).

Phase 2: 7-Run Overnight Ablation (500 iterations each)

All runs: VOCAB_SIZE=1024, ITERATIONS=500, WARMUP_STEPS=10, SLIDING_WINDOW_ENABLED=0

| Run | Config | Parallel | SLOT | INT5 layers | train_bpb | post-EMA | INT6 round | Size |
|---|---|---|---|---|---|---|---|---|
| A | Baseline | 0 | Off | 2 | 1.5734 | 2.0469 | 2.1080 | 7.55 MB |
| B | INT5 only | 0 | Off | 10 | 1.5737 | 2.0462 | 2.1241 | 6.64 MB |
| C | Parallel only | 6 | Off | 2 | 1.5559 | 1.9314 | 1.9769 | 7.58 MB |
| D | Parallel + INT5 | 6 | Off | 10 | 1.5556 | 1.9283 | 2.0082 | 6.67 MB |
| E | SLOT only | 0 | On | 2 | 1.5732 | 2.0442 | 2.1009 | 7.54 MB |
| F | Parallel + SLOT | 6 | On | 10 | 1.5557 | 1.9281 | 1.9911 | 6.67 MB |
| G | Parallel + INT5(N=8) | 6 | Off | 6 | 1.5553 | 1.9280 | 1.9982 | 7.14 MB |

Isolated Feature Contributions (from ablation)

| Feature | train_bpb delta | post-EMA delta | Notes |
|---|---|---|---|
| Parallel residuals | -0.0175 | -0.1155 | Biggest win by far |
| INT5 quant (alone) | +0.0003 | -0.0007 | Neutral on BPB, saves ~0.9 MB |
| SLOT (alone) | -0.0002 | -0.0027 | Marginal improvement |
| Parallel + SLOT combined | -0.0177 | -0.1188 | SLOT adds almost nothing on top of parallel |
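The isolated contributions follow directly from the Phase 2 runs; a quick recomputation against baseline run A (run letters and values copied from the ablation above):

```python
# Phase 2 results: run -> (train_bpb, post-EMA bpb), copied from the table above.
runs = {
    "A": (1.5734, 2.0469),  # baseline
    "B": (1.5737, 2.0462),  # INT5 only
    "C": (1.5559, 1.9314),  # parallel only
    "E": (1.5732, 2.0442),  # SLOT only
    "F": (1.5557, 1.9281),  # parallel + SLOT
}

def delta(run):
    """Per-feature (train_bpb, post-EMA) delta relative to baseline run A."""
    return tuple(round(x - b, 4) for x, b in zip(runs[run], runs["A"]))

print(delta("C"))  # parallel residuals: (-0.0175, -0.1155)
```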

Key Conclusions

  1. Parallel residuals (PARALLEL_START_LAYER=6) is the dominant feature. It delivers -0.0175 train_bpb and -0.115 post-EMA bpb. The dual-stream architecture with learnable lane_merge and 4-element route vector significantly outperforms sequential attention-then-MLP.

  2. Parallel residuals are also 2.3x faster on GB10. Throughput jumped from 11k tok/s (baseline) to 26k tok/s (parallel). This may be GB10-specific (unified memory benefits from dual-stream memory access patterns), but it is worth validating on H100.

  3. INT5 middle MLP saves ~0.9 MB with minimal quality impact. The coarser quantization for middle layers (3-7) is nearly BPB-neutral but frees artifact space for a larger model.

  4. SLOT adds negligible value on top of parallel residuals. The -0.0002 delta is within noise. SLOT's optimization surface may already be covered by the parallel architecture's additional parameters.

  5. Post-EMA BPB degrades significantly on GB10 due to only 500 training steps. The EMA weights do not converge as well with fewer steps. At 1000 steps (Phase 1), the EMA gap is much smaller.

Hardware Details

  • NVIDIA DGX Spark (GB10 Grace Blackwell, 128GB unified memory)
  • Single GPU, no distributed training (WORLD_SIZE=1, grad_accum_steps=8)
  • PyTorch 2.11.0+cu130, no flash_attn (SDPA fallback), no torch.compile
  • All tests ran on sp1024 FineWeb data (80 train shards, full validation)

Next Steps

Planning to run the parallel residuals configuration on 8xH100 with sp4096 data to validate BPB improvement at competition scale. The 2.3x throughput boost on GB10 could translate to more training steps within the 600s wallclock, amplifying the architecture advantage.
