
Non-record: PROTEUS Feature Ablation - Parallel Residuals + Mixed INT5/INT6 + TTT on DGX Spark GB10#1425

Open
dentity007 wants to merge 6 commits into openai:main from NathanMaine:research/proteus-integration

Conversation

@dentity007

Non-record: PROTEUS Feature Ablation on DGX Spark GB10

Best val_bpb: 1.4479 (1000 steps) | 1.5077 with SLOT | NVIDIA DGX Spark GB10, single GPU | sp1024

Systematic ablation of features from PROTEUS v1.6 (PR #1289) integrated into our PR #1218/#1287 stack. Tests parallel residuals, mixed INT5/INT6 quantization, and SLOT independently and in combination. All tests ran on a single NVIDIA GB10 (128GB unified memory) with no torch.compile (Triton unsupported on ARM).

Features Tested

  1. Parallel Residuals (PARALLEL_START_LAYER=6): Dual-stream Block architecture in which attention and MLP operate on separate residual streams, with a learnable 4-element route vector and a sigmoid lane_merge. From "PR Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA #1204" (@msisovic) and PR #1289 ("PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819 (3-seed mean)", @MatoTeziTanka).

  2. Mixed INT5/INT6 Quantization (N_INT6_MLP_LAYERS=6): Middle MLP layers (3-7) quantized to INT5 (clip_range=15); edge MLP and all attention layers stay INT6. Saves ~0.9 MB of artifact space. From PR #1289 ("PROTEUS v1.6", @MatoTeziTanka).

  3. Score-First TTT (TTT_ENABLED=1): Chunk-based eval-time training with cosine LR decay, SGD momentum, and frozen early blocks. Ported from PR #1289.

  4. SLOT (SLOT_ENABLED=1): Per-batch delta optimization at the last hidden layer. From PR #1176 ("QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0914 (3-seed mean)").
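The clip ranges above map directly onto symmetric fake-quant levels: clip_range=15 gives the 31 signed levels of INT5, clip_range=31 the 63 levels of INT6. A minimal per-tensor round-trip sketch (the actual script may use per-channel or calibrated scales; per-tensor max scaling is an assumption here):

```python
import torch

def quant_roundtrip(w: torch.Tensor, clip_range: int) -> torch.Tensor:
    """Symmetric per-tensor fake-quant: snap weights onto integer levels
    in [-clip_range, clip_range] and dequantize. clip_range=15 -> INT5,
    clip_range=31 -> INT6."""
    scale = w.abs().max().clamp(min=1e-8) / clip_range
    q = torch.round(w / scale).clamp(-clip_range, clip_range)
    return q * scale

w = torch.randn(256, 256)
w_int5 = quant_roundtrip(w, clip_range=15)  # at most 31 distinct values
w_int6 = quant_roundtrip(w, clip_range=31)  # at most 63 distinct values
```

The artifact saving comes from packing the middle-MLP layers at 5 bits per weight instead of 6; the round-trip error per element is bounded by half the scale step.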

Phase 1: 3-Run Comparison (1000 iterations each)

Common: VOCAB_SIZE=1024, TRAIN_BATCH_TOKENS=49152, SEED=42

| Run | Config | train_bpb | post-EMA | INT6 round | Sliding | SLOT | Artifact |
|-----|--------|-----------|----------|------------|---------|------|----------|
| 1 | Baseline | 1.4601 | 1.5277 | 1.5521 | - | - | 8.99 MB |
| 3 | Parallel+INT5+SLOT | 1.4479 | 1.5010 | 1.5376 | 1.5165 | 1.5077 | 8.21 MB |

Delta: -0.0122 train_bpb, -0.0267 post-EMA, -0.0145 INT6 roundtrip

Phase 2: 7-Run Overnight Ablation (500 iterations each)

Common: VOCAB_SIZE=1024, ITERATIONS=500, WARMUP_STEPS=10, SLIDING_WINDOW_ENABLED=0, SEED=42

| Run | Config | Parallel | SLOT | INT5 layers | train_bpb | post-EMA | INT6 round | Artifact |
|-----|--------|----------|------|-------------|-----------|----------|------------|----------|
| A | Baseline | 0 | Off | 2 | 1.5734 | 2.0469 | 2.1080 | 7.55 MB |
| B | INT5 only | 0 | Off | 10 | 1.5737 | 2.0462 | 2.1241 | 6.64 MB |
| C | Parallel only | 6 | Off | 2 | 1.5559 | 1.9314 | 1.9769 | 7.58 MB |
| D | Parallel+INT5 | 6 | Off | 10 | 1.5556 | 1.9283 | 2.0082 | 6.67 MB |
| E | SLOT only | 0 | On | 2 | 1.5732 | 2.0442 | 2.1009 | 7.54 MB |
| F | Parallel+SLOT | 6 | On | 10 | 1.5557 | 1.9281 | 1.9911 | 6.67 MB |
| G | Parallel+INT5(N=8) | 6 | Off | 6 | 1.5553 | 1.9280 | 1.9982 | 7.14 MB |

Isolated Feature Impact

| Feature | train_bpb delta | post-EMA delta | Artifact delta | Verdict |
|---------|-----------------|----------------|----------------|---------|
| Parallel residuals | -0.0175 | -0.1155 | +0.03 MB | Strong win |
| INT5 middle MLP | +0.0003 | -0.0007 | -0.91 MB | Neutral BPB, saves space |
| SLOT | -0.0002 | -0.0027 | -0.01 MB | Marginal |
| Parallel+SLOT | -0.0177 | -0.1188 | -0.88 MB | SLOT adds nothing on top |

Throughput Surprise

| Config | tok/s | vs baseline |
|--------|-------|-------------|
| Baseline (sequential) | 11,251 | 1.0x |
| Parallel residuals | 26,204 | 2.3x |

The dual-stream architecture is 2.3x faster on GB10. This appears to be from better memory access patterns in the unified memory architecture. The separate attention and MLP streams avoid the sequential dependency that forces a round-trip through memory between the two operations.

Conclusions

  1. Parallel residuals is the dominant feature. It delivers the largest BPB improvement and the largest throughput improvement. On 8xH100, the throughput gain would translate to more training steps in the 600s wallclock.

  2. INT5 quantization is a free lunch for artifact size. Nearly BPB-neutral, saves ~0.9 MB. Use this when you need headroom under the 16 MB cap.

  3. SLOT provides diminishing returns when combined with parallel residuals. The parallel architecture's additional learnable parameters (resid_mix_mlp, route, lane_merge) may cover the same optimization surface that SLOT exploits.

  4. TTT was not tested on GB10 due to impractical eval times on single GPU (estimated 8+ hours per run). The score-first TTT implementation is ready but needs multi-GPU validation.
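For reference, the score-first TTT loop described in the feature list (chunk-based eval-time training, cosine LR decay, SGD momentum, frozen early blocks) can be sketched as below. The names model.blocks and model.loss are placeholders, and the chunking, LR constants, and freeze boundary are assumptions, not the actual implementation:

```python
import math
import torch

def ttt_eval(model, chunks, n_frozen_blocks=6, base_lr=1e-3, momentum=0.9):
    """Score-first TTT sketch: each chunk is scored BEFORE the model
    updates on it, so the reported loss is honest; later chunks then
    benefit from the adaptation. Early blocks stay frozen."""
    for i, block in enumerate(model.blocks):
        block.requires_grad_(i >= n_frozen_blocks)
    opt = torch.optim.SGD((p for p in model.parameters() if p.requires_grad),
                          lr=base_lr, momentum=momentum)
    total_loss, n = 0.0, len(chunks)
    for step, (x, y) in enumerate(chunks):
        loss = model.loss(x, y)            # score before updating on this chunk
        total_loss += loss.item()
        for g in opt.param_groups:         # cosine LR decay across the chunk sequence
            g["lr"] = base_lr * 0.5 * (1 + math.cos(math.pi * step / max(n - 1, 1)))
        opt.zero_grad()
        loss.backward()
        opt.step()
    return total_loss / n
```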

Architecture Details

Parallel residuals split each transformer block (from layer 6 onward) into two independent streams:

  • Attention stream: takes input from x_attn, applies RMSNorm, self-attention, residual
  • MLP stream: takes input from x_mlp, applies RMSNorm, MLP, residual
  • Cross-blending: 4-element route vector [r0, r1, r2, r3] controls how attention and MLP deltas combine into each stream
  • Lane merge: Learnable sigmoid scalar blends the two streams back at the final layer

This adds 5,141 parameters (5 parallel blocks x ~1K each) for a 34.4M param model.
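A minimal sketch of the dual-stream block described above. Class and norm names here are placeholders (the actual module uses resid_mix/resid_mix_mlp and additional parameters); the routing algebra follows the bullet list:

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Plain RMSNorm so the sketch is self-contained."""
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps
    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)

class ParallelBlock(nn.Module):
    """Dual-stream block: attention reads x_attn, MLP reads x_mlp, and a
    4-element route vector mixes both deltas into both streams."""
    def __init__(self, dim, attn, mlp):
        super().__init__()
        self.attn, self.mlp = attn, mlp
        self.norm_attn, self.norm_mlp = RMSNorm(dim), RMSNorm(dim)
        # route = [attn->attn stream, mlp->attn stream, attn->mlp stream, mlp->mlp stream]
        self.route = nn.Parameter(torch.tensor([1.0, 0.0, 0.0, 1.0]))

    def forward(self, x_attn, x_mlp):
        da = self.attn(self.norm_attn(x_attn))  # attention delta from its own stream
        dm = self.mlp(self.norm_mlp(x_mlp))     # MLP delta from its own stream
        r = self.route
        return x_attn + r[0] * da + r[1] * dm, x_mlp + r[2] * da + r[3] * dm

# Final-layer merge: a learned scalar through a sigmoid, so the blend is
# bounded in (0, 1); a 0.5 init gives sigmoid(0.5) ~ 0.622.
lane_merge = nn.Parameter(torch.tensor(0.5))

def merge_lanes(x_attn, x_mlp, lane_merge):
    g = torch.sigmoid(lane_merge)
    return g * x_attn + (1 - g) * x_mlp
```

The route init [1, 0, 0, 1] reproduces two independent residual streams; training moves it off that point to cross-blend the lanes.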

Hardware

  • NVIDIA DGX Spark (GB10 Grace Blackwell, SM 121, 128GB unified memory)
  • Single GPU, WORLD_SIZE=1, grad_accum_steps=8
  • PyTorch 2.11.0+cu130
  • No flash_attn_interface (SDPA fallback via scaled_dot_product_attention)
  • No torch.compile (Triton/inductor broken on aarch64, TORCH_COMPILE_DISABLE=1)
  • sp1024 FineWeb data (80 train shards, full validation set)
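The SDPA fallback noted above is a guarded import; this sketch assumes a flash_attn_func-style call and simplifies its return handling:

```python
import torch
import torch.nn.functional as F

try:
    # Unavailable on GB10/aarch64, so the except path runs there.
    from flash_attn_interface import flash_attn_func
    HAVE_FA = True
except ImportError:
    HAVE_FA = False

def attention(q, k, v, causal=True):
    """Expects (batch, heads, seq, head_dim) tensors; uses flash_attn when
    importable, otherwise PyTorch's built-in SDPA kernel."""
    if HAVE_FA:
        # flash_attn expects (batch, seq, heads, head_dim); return handling
        # simplified for this sketch.
        return flash_attn_func(q.transpose(1, 2), k.transpose(1, 2),
                               v.transpose(1, 2), causal=causal).transpose(1, 2)
    return F.scaled_dot_product_attention(q, k, v, is_causal=causal)
```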

Reproduction

```shell
# On DGX Spark or any CUDA GPU:
pip install sentencepiece brotli
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80

# Best config (parallel + INT5):
VOCAB_SIZE=1024 PARALLEL_START_LAYER=6 N_INT6_MLP_LAYERS=6 \
  TORCH_COMPILE_DISABLE=1 ITERATIONS=500 \
  python3 train_gpt_1218_slot.py

# Full comparison script:
bash run_overnight.sh
```

Credits

Request for Feedback

  1. Has anyone validated the parallel residuals throughput improvement on 8xH100? The 2.3x on GB10 seems large and may be architecture-specific.
  2. What PARALLEL_START_LAYER values have others found optimal? We tested 6 (decoder-only); PR #1334 ("SP4096 + Depth Recurrence + Parallel Residuals + MuonEq-R + QK-Gain 5.0 — val_bpb 1.0897 (3-seed mean)") uses 7.
  3. Is the INT5 middle-MLP approach better than uniform INT6 for artifact compression on sp4096?

dentity007 and others added 6 commits March 30, 2026 19:12
…3168)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, TTT, CPU tests

- Mixed quantization: INT5 (clip_range=15) for middle MLP layers, INT6 for attn+edge MLP
- Parallel residuals: dual-stream Block with resid_mix_mlp, 4-element route, lane_merge
- Score-first TTT: chunk-based eval-time training with cosine LR, frozen early blocks
- CPU test suite (test_cpu.py): 22 tests covering model creation, forward pass, quant roundtrip
- Flash attention import now conditional (CPU fallback via scaled_dot_product_attention)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Wrap all torch.compile calls in _maybe_compile() that checks TORCH_COMPILE_DISABLE=1
- Add run_spark_comparison.sh for 3-run baseline vs PROTEUS comparison on GB10
- Fix: Triton fails to compile on aarch64 (missing Python.h), this bypasses it

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@MatoTeziTanka

Really appreciate seeing PROTEUS v1.6 features independently ablated on different hardware — this is exactly the kind of community validation that makes the competition work. A few notes from the author side:

Parallel residuals fidelity: Your implementation closely matches our canonical Scylla branch (resid_mix, resid_mix_mlp, 4-element route, lane_merge, same routing algebra, same parallel_start_layer encoder guard). One small divergence I noticed: you use torch.sigmoid(lane_merge) as the merge coefficient, while ours uses the raw parameter. Yours initializes at 0.622 vs our 0.5 — arguably better because it's bounded. We may adopt that.

Your questions, from our side:

  1. 2.3× throughput on 8×H100 — we have not seen this. Our own parallel-residual runs on H100 show negligible throughput change vs serial. Looking at the Block forward in your script (and ours), the attention and MLP lanes run sequentially in Python — there are no torch.cuda.Stream / wait_stream / fork-join primitives, so compute order is identical to serial despite the dual-stream architecture. An 11,251 → 26,204 tok/s gain from the same FLOPs is hard to explain architecturally. Worth profiling with something like torch.profiler to see if it's actually concurrent compute, or if it's torch.compile cache warming / TF32 toggles / batch-size diff between runs. I'd be genuinely surprised if this transfers off GB10.

  2. PARALLEL_START_LAYER: We enforce parallel_start_layer >= num_encoder_layers (decoder-only, assert-guarded). We haven't swept it — your value of 6 on a 12-layer model matches our decoder-start pattern, so your result is in-family with ours.

  3. INT5 middle-MLP vs uniform INT6 on sp4096: We use INT5-middle with N=6 on our Scylla runs; small but positive BPB win post-round-trip, and the ~0.9 MB artifact headroom is real. Your ablation is consistent with what we see.

Thanks for crediting the prior work cleanly — appreciated. Happy to discuss any of this further.

Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
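The profiling suggested above (to check whether the two lanes actually overlap on the GPU or merely run back-to-back) could start from a torch.profiler pass like this sketch; profile_forward and the model/batch arguments are placeholders:

```python
import torch
from torch.profiler import profile, ProfilerActivity

def profile_forward(model, batch, steps=5):
    """Profile a few forward/backward steps; inspect the CUDA timeline to
    see whether attention and MLP kernels overlap or serialize."""
    activities = [ProfilerActivity.CPU]
    if torch.cuda.is_available():
        activities.append(ProfilerActivity.CUDA)
    with profile(activities=activities, record_shapes=True) as prof:
        for _ in range(steps):
            loss = model(batch).sum()
            loss.backward()
    print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=10))
    return prof
```

Exporting with prof.export_chrome_trace("trace.json") and viewing the timeline would settle whether the 2.3x is genuine kernel concurrency or a measurement artifact.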

taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
Mined the top 20 open PRs at openai/parameter-golf and found that
PARALLEL RESIDUALS (compute attn + mlp in parallel from the same
pre-norm input) is in 3 of the top 6 recent records:
  PR openai#1437: SP8192 + Parallel Residuals + 3L Recurrence — val_bpb 1.07800
  PR openai#1420: Triple Loop + Parallel Residuals + N-gram Tilt — val_bpb 1.08014
  PR openai#1425: PROTEUS Parallel Residuals + INT5/INT6
We never tried it. Patch 13 adds USE_PARALLEL_RESIDUALS=1 which switches
Block.forward from serial (x = x + attn(x); x = x + mlp(x)) to parallel
(x = x + attn(LN(x)) + mlp(LN(x))). Idempotent, anchors on the first 3
lines of Block.forward which are invariant under Patch 11 (smear gate).

Also discovered LESSONS.md §29 ("depth recurrence is DEAD under GPTQ")
is contradicted by 5 of the top 10 recent records — they use depth
recurrence + mixed-precision INT5/INT6 instead of pure int6 GPTQ.
Worth re-investigating in a future research fire.

experiments.json — 4 new PR_* configs:
  PR0: parallel residuals alone (no n-gram, isolated effect)
  PR1: parallel + leaky_relu + full n-gram (current best stack + new trick)
  PR2: parallel + smear + leaky + full n-gram (max stack)
  PR3: PR1 with seed=42 for noise check

RESEARCH_LOG.md — full record of the research fire findings + the
queue of techniques to investigate in future fires (n-gram tilt, depth
recurrence, MuonEq-R, PartialRoPE+FA3, SwiGLU, codebooks).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>