
Record: Vocab4096 + MLP4.0x + SLOT - val_bpb 1.0925 (3-seed mean) #1291

Open

dentity007 wants to merge 4 commits into openai:main from NathanMaine:record/vocab4096-mlp4x-slot-1.0925

Conversation

@dentity007

Record: Vocab4096 + MLP4.0x + SLOT

val_bpb: 1.0925 (3-seed mean, std 0.0018) | ~15.95 MB | 8xH100 SXM | SLOT eval-time optimization

3-Seed Results

| Seed | Steps | Sliding BPB | + SLOT BPB | Artifact (bytes) |
|---|---|---|---|---|
| 42 | 5,165 | 1.1014 | 1.0947 | 15,954,746 |
| 1337 | 5,890 | 1.0981 | 1.0913 | 15,932,192 |
| 2025 | 5,900 | 1.0986 | 1.0915 | 15,948,156 |
| Mean | | 1.0994 | 1.0925 (std 0.0018) | |

Merged SOTA (PR #1019): 1.1147 BPB (1.8822 nats).
This submission: 1.0925 BPB (~1.8432 nats).
Delta: -0.0390 nats (-0.0222 BPB). Clears the 0.005-nat threshold by 7.8x.
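The headline delta can be rechecked directly from the numbers above (a throwaway sketch; the 0.005-nat record threshold is the one stated in this PR):

```python
# Re-deriving the headline delta from the figures quoted above.
sota_bpb, sota_nats = 1.1147, 1.8822   # merged SOTA (PR #1019)
this_bpb, this_nats = 1.0925, 1.8432   # this submission
delta_bpb = round(this_bpb - sota_bpb, 4)     # -0.0222
delta_nats = round(this_nats - sota_nats, 4)  # -0.039
margin = round(abs(delta_nats) / 0.005, 1)    # 7.8x the record threshold
print(delta_bpb, delta_nats, margin)
```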

Architecture

Built on PR #1218 (@clarkkev) with SLOT from PR #1176 (@bigbag).

  • 11L, d=512, 8H/4KV GQA, MLP 4.0x, Vocab 4096
  • XSA all layers, QK_GAIN=4.0, EMA 0.997
  • Full Hessian GPTQ (AR self-gen) + int6 + brotli-11
  • 34.4M params, dynamic warmdown 66.7%
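The int6 packing in the artifact pipeline can be illustrated with a minimal sketch; `pack_int6` below is a hypothetical helper, not the repo's actual packer, and the packed bytes would then be compressed with `brotli.compress(packed, quality=11)`:

```python
def pack_int6(values):
    """Pack integers in [0, 64) into 6 bits each (4 values -> 3 bytes)."""
    acc, nbits, out = 0, 0, bytearray()
    for v in values:
        assert 0 <= v < 64
        acc = (acc << 6) | v           # append 6 new bits
        nbits += 6
        while nbits >= 8:              # flush whole bytes
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:                          # pad the final partial byte
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out)

packed = pack_int6(range(64))          # 64 six-bit values
print(len(packed))                     # 48 bytes instead of 64
```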

SLOT: Per-Batch Delta Optimization

After the sliding-window eval, SLOT optimizes a small delta vector of shape [1, 1, 512] at the last hidden layer:

  1. forward_hidden() under no_grad (frozen transformer)
  2. 8 AdamW steps (lr=0.005) through compute_logits() only
  3. Score with optimized delta, full softmax distribution

Delta re-initialized to zeros per batch. No cross-batch state. SLOT contribution: -0.007 BPB.
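The loop above can be sketched in PyTorch; the method names `forward_hidden()` and `compute_logits()` follow the PR text, but their signatures and the cross-entropy objective are assumptions:

```python
import torch
import torch.nn.functional as F

def slot_score(model, tokens, steps=8, lr=0.005):
    # 1. Frozen transformer: hidden states computed once under no_grad.
    with torch.no_grad():
        hidden = model.forward_hidden(tokens)          # [B, T, d]
    # Delta re-initialized to zeros for every batch (no cross-batch state).
    delta = torch.zeros(1, 1, hidden.size(-1), requires_grad=True)
    opt = torch.optim.AdamW([delta], lr=lr)
    # 2. A few AdamW steps; gradients flow through compute_logits() only.
    for _ in range(steps):
        logits = model.compute_logits(hidden + delta)
        loss = F.cross_entropy(
            logits[:, :-1].flatten(0, 1), tokens[:, 1:].flatten()
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
    # 3. Score with the optimized delta, full normalized distribution.
    with torch.no_grad():
        return F.log_softmax(model.compute_logits(hidden + delta), dim=-1)
```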

Legality

  • SLOT is score-first: hidden states frozen before optimization
  • Full normalized distributions throughout
  • No TTT, no n-gram cache, no QAT
  • GPTQ uses AR self-generated calibration only
  • Delta optimization on already-evaluated tokens only

Credits

PR #1218 (@clarkkev), PR #1176 (@bigbag), PR #1019 (@abaybektursun)

Reproduction

pip install sentencepiece zstandard brotli
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
rm -f data/manifest.json
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf python3 data/cached_challenge_fineweb.py --variant sp4096 --train-shards 143
SEED=42 SLOT_ENABLED=1 SLOT_LR=0.005 SLOT_STEPS=8 torchrun --standalone --nproc_per_node=8 train_gpt.py

Test Plan

  • 3 seeds verified (std 0.0018, p < 0.01)
  • All artifacts under 16,000,000 bytes
  • Training under 600s, eval under 600s
  • SLOT score-first with full distributions
  • No TTT, no n-gram cache

@dentity007
Author

DGX Spark GB10 Ablation Data - PROTEUS Feature Integration

Ran overnight ablation tests on NVIDIA DGX Spark (GB10, 128GB unified memory, single GPU) to evaluate PROTEUS features before committing to 8xH100 runs. All tests use sp1024 data, SEED=42, TORCH_COMPILE_DISABLE=1 (Triton/inductor not supported on GB10 ARM).

Phase 1: 3-Run Comparison (1000 iterations each)

| Run | Config | train_bpb | post-EMA | INT6 round | Sliding | SLOT | Size |
|---|---|---|---|---|---|---|---|
| 1 | Baseline (no features) | 1.4601 | 1.5277 | 1.5521 | - | - | 8.99 MB |
| 2 | Parallel + INT5 | CRASHED | - | - | - | - | - |
| 3 | Parallel + INT5 + SLOT | 1.4479 | 1.5010 | 1.5376 | 1.5165 | 1.5077 | 8.21 MB |

Delta (Run 3 vs Run 1): -0.0122 train_bpb, -0.0267 post-EMA, -0.0145 INT6 roundtrip

Phase 1 used TRAIN_BATCH_TOKENS=49152, VAL_BATCH_TOKENS=49152, full sliding window eval. Run 2 crashed during initialization (likely OOM from torch.compile fallback before TORCH_COMPILE_DISABLE was added).

Phase 2: 7-Run Overnight Ablation (500 iterations each)

All runs: VOCAB_SIZE=1024, ITERATIONS=500, WARMUP_STEPS=10, SLIDING_WINDOW_ENABLED=0

| Run | Config | Parallel | SLOT | INT5 layers | train_bpb | post-EMA | INT6 round | Size |
|---|---|---|---|---|---|---|---|---|
| A | Baseline | 0 | Off | 2 | 1.5734 | 2.0469 | 2.1080 | 7.55 MB |
| B | INT5 only | 0 | Off | 10 | 1.5737 | 2.0462 | 2.1241 | 6.64 MB |
| C | Parallel only | 6 | Off | 2 | 1.5559 | 1.9314 | 1.9769 | 7.58 MB |
| D | Parallel + INT5 | 6 | Off | 10 | 1.5556 | 1.9283 | 2.0082 | 6.67 MB |
| E | SLOT only | 0 | On | 2 | 1.5732 | 2.0442 | 2.1009 | 7.54 MB |
| F | Parallel + SLOT | 6 | On | 10 | 1.5557 | 1.9281 | 1.9911 | 6.67 MB |
| G | Parallel + INT5(N=8) | 6 | Off | 6 | 1.5553 | 1.9280 | 1.9982 | 7.14 MB |

Isolated Feature Contributions (from ablation)

| Feature | train_bpb delta | post-EMA delta | Notes |
|---|---|---|---|
| Parallel residuals | -0.0175 | -0.1155 | Biggest win by far |
| INT5 quant (alone) | +0.0003 | -0.0007 | Neutral on BPB, saves ~0.9 MB |
| SLOT (alone) | -0.0002 | -0.0027 | Marginal improvement |
| Parallel + SLOT combined | -0.0177 | -0.1188 | SLOT adds almost nothing on top of parallel |
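The isolated contributions follow directly from the Phase 2 runs; a quick recomputation against baseline run A (run letters and values copied from the ablation above):

```python
# Phase 2 results: run -> (train_bpb, post-EMA bpb), copied from the table above.
runs = {
    "A": (1.5734, 2.0469),  # baseline
    "B": (1.5737, 2.0462),  # INT5 only
    "C": (1.5559, 1.9314),  # parallel only
    "E": (1.5732, 2.0442),  # SLOT only
    "F": (1.5557, 1.9281),  # parallel + SLOT
}

def delta(run):
    """Per-feature (train_bpb, post-EMA) delta relative to baseline run A."""
    return tuple(round(x - b, 4) for x, b in zip(runs[run], runs["A"]))

print(delta("C"))  # parallel residuals: (-0.0175, -0.1155)
```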

Key Conclusions

  1. Parallel residuals (PARALLEL_START_LAYER=6) is the dominant feature. It delivers -0.0175 train_bpb and -0.115 post-EMA bpb. The dual-stream architecture with learnable lane_merge and 4-element route vector significantly outperforms sequential attention-then-MLP.

  2. Parallel residuals are also 2.3x faster on GB10. Throughput jumped from 11k tok/s (baseline) to 26k tok/s (parallel). This may be GB10-specific (unified memory benefits from dual-stream memory access patterns), but it is worth validating on H100.

  3. INT5 middle MLP saves ~0.9 MB with minimal quality impact. The coarser quantization for middle layers (3-7) is nearly BPB-neutral but frees artifact space for a larger model.

  4. SLOT adds negligible value on top of parallel residuals. The -0.0002 delta is within noise. SLOT's optimization surface may already be covered by the parallel architecture's additional parameters.

  5. Post-EMA BPB degrades significantly on GB10 due to only 500 training steps. The EMA weights do not converge as well with fewer steps. At 1000 steps (Phase 1), the EMA gap is much smaller.

Hardware Details

  • NVIDIA DGX Spark (GB10 Grace Blackwell, 128GB unified memory)
  • Single GPU, no distributed training (WORLD_SIZE=1, grad_accum_steps=8)
  • PyTorch 2.11.0+cu130, no flash_attn (SDPA fallback), no torch.compile
  • All tests ran on sp1024 FineWeb data (80 train shards, full validation)

Next Steps

Planning to run the parallel residuals configuration on 8xH100 with sp4096 data to validate BPB improvement at competition scale. The 2.3x throughput boost on GB10 could translate to more training steps within the 600s wallclock, amplifying the architecture advantage.
