
Non-record: PROTEUS Feature Ablation - Parallel Residuals + Mixed INT5/INT6 + TTT on DGX Spark GB10#1425

Open
dentity007 wants to merge 6 commits into openai:main from NathanMaine:research/proteus-integration

Conversation

@dentity007

Non-record: PROTEUS Feature Ablation on DGX Spark GB10

Best val_bpb: 1.4479 (1000 steps) | 1.5077 with SLOT | NVIDIA DGX Spark GB10, single GPU | sp1024

Systematic ablation of features from PROTEUS v1.6 (PR #1289) integrated into our PR #1218/#1287 stack. Tests parallel residuals, mixed INT5/INT6 quantization, and SLOT independently and in combination. All tests ran on a single NVIDIA GB10 (128GB unified memory) with no torch.compile (Triton unsupported on ARM).

Features Tested

  1. Parallel Residuals (PARALLEL_START_LAYER=6): Dual-stream Block architecture in which attention and MLP operate on separate residual streams, with a learnable 4-element route vector and a sigmoid lane_merge. From "PR Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA #1204" (@msisovic) and PR #1289 ("PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819 (3-seed mean)", @MatoTeziTanka).

  2. Mixed INT5/INT6 Quantization (N_INT6_MLP_LAYERS=6): Middle MLP layers (3-7) quantized to INT5 (clip_range=15); edge MLP and all attention layers stay INT6. Saves ~0.9 MB of artifact space. From PR #1289 ("PROTEUS v1.6", @MatoTeziTanka).

  3. Score-First TTT (TTT_ENABLED=1): Chunk-based eval-time training with cosine LR decay, SGD momentum, and frozen early blocks. Ported from PR #1289.

  4. SLOT (SLOT_ENABLED=1): Per-batch delta optimization at the last hidden layer. From PR #1176 ("QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0914 (3-seed mean)").
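The clip ranges above map directly onto symmetric fake-quant levels: clip_range=15 gives the 31 signed levels of INT5, clip_range=31 the 63 levels of INT6. A minimal per-tensor round-trip sketch (the actual script may use per-channel or calibrated scales; per-tensor max scaling is an assumption here):

```python
import torch

def quant_roundtrip(w: torch.Tensor, clip_range: int) -> torch.Tensor:
    """Symmetric per-tensor fake-quant: snap weights onto integer levels
    in [-clip_range, clip_range] and dequantize. clip_range=15 -> INT5,
    clip_range=31 -> INT6."""
    scale = w.abs().max().clamp(min=1e-8) / clip_range
    q = torch.round(w / scale).clamp(-clip_range, clip_range)
    return q * scale

w = torch.randn(256, 256)
w_int5 = quant_roundtrip(w, clip_range=15)  # at most 31 distinct values
w_int6 = quant_roundtrip(w, clip_range=31)  # at most 63 distinct values
```

The artifact saving comes from packing the middle-MLP layers at 5 bits per weight instead of 6; the round-trip error per element is bounded by half the scale step.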

Phase 1: 3-Run Comparison (1000 iterations each)

Common: VOCAB_SIZE=1024, TRAIN_BATCH_TOKENS=49152, SEED=42

| Run | Config | train_bpb | post-EMA | INT6 round | Sliding | SLOT | Artifact |
|-----|--------|-----------|----------|------------|---------|------|----------|
| 1 | Baseline | 1.4601 | 1.5277 | 1.5521 | - | - | 8.99 MB |
| 3 | Parallel+INT5+SLOT | 1.4479 | 1.5010 | 1.5376 | 1.5165 | 1.5077 | 8.21 MB |

Delta: -0.0122 train_bpb, -0.0267 post-EMA, -0.0145 INT6 roundtrip

Phase 2: 7-Run Overnight Ablation (500 iterations each)

Common: VOCAB_SIZE=1024, ITERATIONS=500, WARMUP_STEPS=10, SLIDING_WINDOW_ENABLED=0, SEED=42

| Run | Config | Parallel | SLOT | INT5 layers | train_bpb | post-EMA | INT6 round | Artifact |
|-----|--------|----------|------|-------------|-----------|----------|------------|----------|
| A | Baseline | 0 | Off | 2 | 1.5734 | 2.0469 | 2.1080 | 7.55 MB |
| B | INT5 only | 0 | Off | 10 | 1.5737 | 2.0462 | 2.1241 | 6.64 MB |
| C | Parallel only | 6 | Off | 2 | 1.5559 | 1.9314 | 1.9769 | 7.58 MB |
| D | Parallel+INT5 | 6 | Off | 10 | 1.5556 | 1.9283 | 2.0082 | 6.67 MB |
| E | SLOT only | 0 | On | 2 | 1.5732 | 2.0442 | 2.1009 | 7.54 MB |
| F | Parallel+SLOT | 6 | On | 10 | 1.5557 | 1.9281 | 1.9911 | 6.67 MB |
| G | Parallel+INT5(N=8) | 6 | Off | 6 | 1.5553 | 1.9280 | 1.9982 | 7.14 MB |

Isolated Feature Impact

| Feature | train_bpb delta | post-EMA delta | Artifact delta | Verdict |
|---------|-----------------|----------------|----------------|---------|
| Parallel residuals | -0.0175 | -0.1155 | +0.03 MB | Strong win |
| INT5 middle MLP | +0.0003 | -0.0007 | -0.91 MB | Neutral BPB, saves space |
| SLOT | -0.0002 | -0.0027 | -0.01 MB | Marginal |
| Parallel+SLOT | -0.0177 | -0.1188 | -0.88 MB | SLOT adds nothing on top |

Throughput Surprise

| Config | tok/s | vs baseline |
|--------|-------|-------------|
| Baseline (sequential) | 11,251 | 1.0x |
| Parallel residuals | 26,204 | 2.3x |

The dual-stream architecture is 2.3x faster on GB10. This appears to be from better memory access patterns in the unified memory architecture. The separate attention and MLP streams avoid the sequential dependency that forces a round-trip through memory between the two operations.

Conclusions

  1. Parallel residuals is the dominant feature. It delivers the largest BPB improvement and the largest throughput improvement. On 8xH100, the throughput gain would translate to more training steps in the 600s wallclock.

  2. INT5 quantization is a free lunch for artifact size. Nearly BPB-neutral, saves ~0.9 MB. Use this when you need headroom under the 16 MB cap.

  3. SLOT provides diminishing returns when combined with parallel residuals. The parallel architecture's additional learnable parameters (resid_mix_mlp, route, lane_merge) may cover the same optimization surface that SLOT exploits.

  4. TTT was not tested on GB10 due to impractical eval times on single GPU (estimated 8+ hours per run). The score-first TTT implementation is ready but needs multi-GPU validation.
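For reference, the score-first TTT loop described in the feature list (chunk-based eval-time training, cosine LR decay, SGD momentum, frozen early blocks) can be sketched as below. The names model.blocks and model.loss are placeholders, and the chunking, LR constants, and freeze boundary are assumptions, not the actual implementation:

```python
import math
import torch

def ttt_eval(model, chunks, n_frozen_blocks=6, base_lr=1e-3, momentum=0.9):
    """Score-first TTT sketch: each chunk is scored BEFORE the model
    updates on it, so the reported loss is honest; later chunks then
    benefit from the adaptation. Early blocks stay frozen."""
    for i, block in enumerate(model.blocks):
        block.requires_grad_(i >= n_frozen_blocks)
    opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                          lr=base_lr, momentum=momentum)
    total_loss, n = 0.0, len(chunks)
    for step, (x, y) in enumerate(chunks):
        loss = model.loss(x, y)            # score before updating on this chunk
        total_loss += loss.item()
        for g in opt.param_groups:         # cosine LR decay across the chunk sequence
            g["lr"] = base_lr * 0.5 * (1 + math.cos(math.pi * step / max(n - 1, 1)))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return total_loss / n
```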

Architecture Details

Parallel residuals split each transformer block (from layer 6 onward) into two independent streams:

  • Attention stream: takes input from x_attn, applies RMSNorm, self-attention, residual
  • MLP stream: takes input from x_mlp, applies RMSNorm, MLP, residual
  • Cross-blending: 4-element route vector [r0, r1, r2, r3] controls how attention and MLP deltas combine into each stream
  • Lane merge: Learnable sigmoid scalar blends the two streams back at the final layer

This adds 5,141 parameters (5 parallel blocks x ~1K each) for a 34.4M param model.
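A minimal sketch of the dual-stream block described above. Class and norm names here are placeholders (the actual module uses resid_mix/resid_mix_mlp and additional parameters); the routing algebra follows the bullet list:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Plain RMSNorm so the sketch is self-contained."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class ParallelBlock(nn.Module):
    """Dual-stream block: attention reads x_attn, MLP reads x_mlp, and a
    4-element route vector mixes both deltas into both streams."""
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.norm_attn, self.norm_mlp = RMSNorm(dim), RMSNorm(dim)
        # route = [attn->attn stream, mlp->attn stream, attn->mlp stream, mlp->mlp stream]
        self.route = nn.Parameter(torch.tensor([1.0, 0.0, 0.0, 1.0]))

    def forward(self, x_attn, x_mlp):
        da = self.attn(self.norm_attn(x_attn))  # attention delta from its own stream
        dm = self.mlp(self.norm_mlp(x_mlp))     # MLP delta from its own stream
        r = self.route
        return x_attn + r[0] * da + r[1] * dm, x_mlp + r[2] * da + r[3] * dm

# Final-layer merge: a learned scalar through a sigmoid, so the blend is
# bounded in (0, 1); a 0.5 init gives sigmoid(0.5) ~ 0.622.
lane_merge = nn.Parameter(torch.tensor(0.5))

def merge_lanes(x_attn, x_mlp, lane_merge):
    g = torch.sigmoid(lane_merge)
    return g * x_attn + (1 - g) * x_mlp
```

The route init [1, 0, 0, 1] reproduces two independent residual streams; training moves it off that point to cross-blend the lanes.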

Hardware

  • NVIDIA DGX Spark (GB10 Grace Blackwell, SM 121, 128GB unified memory)
  • Single GPU, WORLD_SIZE=1, grad_accum_steps=8
  • PyTorch 2.11.0+cu130
  • No flash_attn_interface (SDPA fallback via scaled_dot_product_attention)
  • No torch.compile (Triton/inductor broken on aarch64, TORCH_COMPILE_DISABLE=1)
  • sp1024 FineWeb data (80 train shards, full validation set)
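The SDPA fallback noted above is a guarded import; this sketch assumes a flash_attn_func-style call and simplifies its return handling:

```python
import torch
import torch.nn.functional as F

try:
    # Unavailable on GB10/aarch64, so the except path runs there.
    from flash_attn_interface import flash_attn_func
    HAVE_FA = True
except ImportError:
    HAVE_FA = False

def attention(q, k, v, causal=True):
    """Expects (batch, heads, seq, head_dim) tensors; uses flash_attn when
    importable, otherwise PyTorch's built-in SDPA kernel."""
    if HAVE_FA:
        # flash_attn expects (batch, seq, heads, head_dim); return handling
        # simplified for this sketch.
        return flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                               v.transpose(1, 2), causal=causal).transpose(1, 2)
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```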

Reproduction

```shell
# On DGX Spark or any CUDA GPU:
pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80

# Best config (parallel + INT5):
VOCAB_SIZE=1024 PARALLEL_START_LAYER=6 N_INT6_MLP_LAYERS=6 \
  TORCH_COMPILE_DISABLE=1 ITERATIONS=500 \
  python3 train_gpt_1218_slot.py

# Full comparison script:
bash run_overnight.sh
```

Credits

Request for Feedback

  1. Has anyone validated the parallel residuals throughput improvement on 8xH100? The 2.3x on GB10 seems large and may be architecture-specific.
  2. What PARALLEL_START_LAYER values have others found optimal? We tested 6 (decoder-only); PR #1334 ("SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)") uses 7.
  3. Is the INT5 middle-MLP approach better than uniform INT6 for artifact compression on sp4096?

dentity007 and others added 6 commits March 30, 2026 19:12
…3168)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, TTT, CPU tests

- Mixed quantization: INT5 (clip_range=15) for middle MLP layers, INT6 for attn+edge MLP
- Parallel residuals: dual-stream Block with resid_mix_mlp, 4-element route, lane_merge
- Score-first TTT: chunk-based eval-time training with cosine LR, frozen early blocks
- CPU test suite (test_cpu.py): 22 tests covering model creation, forward pass, quant roundtrip
- Flash attention import now conditional (CPU fallback via scaled_dot_product_attention)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Wrap all torch.compile calls in _maybe_compile() that checks TORCH_COMPILE_DISABLE=1
- Add run_spark_comparison.sh for 3-run baseline vs PROTEUS comparison on GB10
- Fix: Triton fails to compile on aarch64 (missing Python.h), this bypasses it

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Really appreciate seeing PROTEUS v1.6 features independently ablated on different hardware — this is exactly the kind of community validation that makes the competition work. A few notes from the author side:

Parallel residuals fidelity: Your implementation closely matches our canonical Scylla branch (resid_mix, resid_mix_mlp, 4-element route, lane_merge, same routing algebra, same parallel_start_layer encoder guard). One small divergence I noticed: you use torch.sigmoid(lane_merge) as the merge coefficient, while ours uses the raw parameter. Yours initializes at 0.622 vs our 0.5 — arguably better because it's bounded. We may adopt that.

Your questions, from our side:

  1. 2.3× throughput on 8×H100 — we have not seen this. Our own parallel-residual runs on H100 show negligible throughput change vs serial. Looking at the Block forward in your script (and ours), the attention and MLP lanes run sequentially in Python — there are no torch.cuda.Stream / wait_stream / fork-join primitives, so compute order is identical to serial despite the dual-stream architecture. An 11,251 → 26,204 tok/s gain from the same FLOPs is hard to explain architecturally. Worth profiling with something like torch.profiler to see if it's actually concurrent compute, or if it's torch.compile cache warming / TF32 toggles / batch-size diff between runs. I'd be genuinely surprised if this transfers off GB10.

  2. PARALLEL_START_LAYER: We enforce parallel_start_layer >= num_encoder_layers (decoder-only, assert-guarded). We haven't swept it — your value of 6 on a 12-layer model matches our decoder-start pattern, so your result is in-family with ours.

  3. INT5 middle-MLP vs uniform INT6 on sp4096: We use INT5-middle with N=6 on our Scylla runs; small but positive BPB win post-round-trip, and the ~0.9 MB artifact headroom is real. Your ablation is consistent with what we see.

Thanks for crediting the prior work cleanly — appreciated. Happy to discuss any of this further.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
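The profiling suggested above (to check whether the two lanes actually overlap on the GPU or merely run back-to-back) could start from a torch.profiler pass like this sketch; profile_forward and the model/batch arguments are placeholders:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_forward(model, batch, steps=5):
    """Profile a few forward/backward steps; inspect the CUDA timeline to
    see whether attention and MLP kernels overlap or serialize."""
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities, record_shapes=True) as prof:
        for _ in range(steps):
            loss = model(batch).sum()
            loss.backward()
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
    return prof
```

Exporting with prof.export_chrome_trace("trace.json") and viewing the timeline would settle whether the 2.3x is genuine kernel concurrency or a measurement artifact.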

taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
Mined the top 20 open PRs at openai/parameter-golf and found that
PARALLEL RESIDUALS (compute attn + mlp in parallel from the same
pre-norm input) is in 3 of the top 6 recent records:
  PR openai#1437: SP8192 + Parallel Residuals + 3L Recurrence — val_bpb 1.07800
  PR openai#1420: Triple Loop + Parallel Residuals + N-gram Tilt — val_bpb 1.08014
  PR openai#1425: PROTEUS Parallel Residuals + INT5/INT6
We never tried it. Patch 13 adds USE_PARALLEL_RESIDUALS=1 which switches
Block.forward from serial (x = x + attn(x); x = x + mlp(x)) to parallel
(x = x + attn(LN(x)) + mlp(LN(x))). Idempotent, anchors on the first 3
lines of Block.forward which are invariant under Patch 11 (smear gate).

Also discovered LESSONS.md §29 ("depth recurrence is DEAD under GPTQ")
is contradicted by 5 of the top 10 recent records — they use depth
recurrence + mixed-precision INT5/INT6 instead of pure int6 GPTQ.
Worth re-investigating in a future research fire.

experiments.json — 4 new PR_* configs:
  PR0: parallel residuals alone (no n-gram, isolated effect)
  PR1: parallel + leaky_relu + full n-gram (current best stack + new trick)
  PR2: parallel + smear + leaky + full n-gram (max stack)
  PR3: PR1 with seed=42 for noise check

RESEARCH_LOG.md — full record of the research fire findings + the
queue of techniques to investigate in future fires (n-gram tilt, depth
recurrence, MuonEq-R, PartialRoPE+FA3, SwiGLU, codebooks).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>