Non-record: PROTEUS Feature Ablation - Parallel Residuals + Mixed INT5/INT6 + TTT on DGX Spark GB10#1425
dentity007 wants to merge 6 commits into openai:main
Conversation
…er optimization, and SSM exploration
…3168) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…, TTT, CPU tests

- Mixed quantization: INT5 (clip_range=15) for middle MLP layers, INT6 for attn+edge MLP
- Parallel residuals: dual-stream Block with resid_mix_mlp, 4-element route, lane_merge
- Score-first TTT: chunk-based eval-time training with cosine LR, frozen early blocks
- CPU test suite (test_cpu.py): 22 tests covering model creation, forward pass, quant roundtrip
- Flash attention import now conditional (CPU fallback via scaled_dot_product_attention)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Wrap all torch.compile calls in _maybe_compile() that checks TORCH_COMPILE_DISABLE=1
- Add run_spark_comparison.sh for 3-run baseline vs PROTEUS comparison on GB10
- Fix: Triton fails to compile on aarch64 (missing Python.h); this bypasses it

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Really appreciate seeing PROTEUS v1.6 features independently ablated on different hardware; this is exactly the kind of community validation that makes the competition work. A few notes from the author side:

Parallel residuals fidelity: Your implementation is byte-equivalent to our canonical Scylla branch (…).

Your questions, from our side:
Thanks for crediting the prior work cleanly — appreciated. Happy to discuss any of this further. Disclosure: I use Claude Code CLI, Codex CLI, and Gemini Pro as tools in my workflow. Human first, AI-assisted.
Mined the top 20 open PRs at openai/parameter-golf and found that PARALLEL RESIDUALS (compute attn + mlp in parallel from the same pre-norm input) appears in 3 of the top 6 recent records:

- PR openai#1437: SP8192 + Parallel Residuals + 3L Recurrence, val_bpb 1.07800
- PR openai#1420: Triple Loop + Parallel Residuals + N-gram Tilt, val_bpb 1.08014
- PR openai#1425: PROTEUS Parallel Residuals + INT5/INT6

We never tried it. Patch 13 adds USE_PARALLEL_RESIDUALS=1, which switches Block.forward from serial (x = x + attn(x); x = x + mlp(x)) to parallel (x = x + attn(LN(x)) + mlp(LN(x))). Idempotent; anchors on the first 3 lines of Block.forward, which are invariant under Patch 11 (smear gate).

Also discovered that LESSONS.md §29 ("depth recurrence is DEAD under GPTQ") is contradicted by 5 of the top 10 recent records: they use depth recurrence + mixed-precision INT5/INT6 instead of pure INT6 GPTQ. Worth re-investigating in a future research fire.

experiments.json, 4 new PR_* configs:

- PR0: parallel residuals alone (no n-gram, isolated effect)
- PR1: parallel + leaky_relu + full n-gram (current best stack + new trick)
- PR2: parallel + smear + leaky + full n-gram (max stack)
- PR3: PR1 with seed=42 for noise check

RESEARCH_LOG.md: full record of the research fire findings plus the queue of techniques to investigate in future fires (n-gram tilt, depth recurrence, MuonEq-R, PartialRoPE+FA3, SwiGLU, codebooks).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
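The serial-to-parallel switch described above can be sketched with a toy Block; attn and mlp are plain placeholders here, not the repo's modules, and the toggle mirrors USE_PARALLEL_RESIDUALS:

```python
import torch
import torch.nn as nn

class Block(nn.Module):
    """Toy block showing serial vs parallel residual paths (sketch only)."""
    def __init__(self, d, parallel=False):
        super().__init__()
        self.ln1, self.ln2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.Linear(d, d)  # placeholder for attention
        self.mlp = nn.Linear(d, d)   # placeholder for the MLP
        self.parallel = parallel

    def forward(self, x):
        if self.parallel:
            # Both sublayers read the same pre-norm input; their outputs
            # are summed into the residual in one step.
            return x + self.attn(self.ln1(x)) + self.mlp(self.ln2(x))
        # Serial: the MLP sees the attention-updated residual.
        x = x + self.attn(self.ln1(x))
        return x + self.mlp(self.ln2(x))
```

The parallel form removes the data dependency between the two sublayers, which is also the basis of the throughput result reported in the PR body below.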
Non-record: PROTEUS Feature Ablation on DGX Spark GB10
Best val_bpb: 1.4479 (1000 steps) | 1.5077 with SLOT | NVIDIA DGX Spark GB10, single GPU | sp1024
Systematic ablation of features from PROTEUS v1.6 (PR #1289) integrated into our PR #1218/#1287 stack. Tests parallel residuals, mixed INT5/INT6 quantization, and SLOT independently and in combination. All tests ran on a single NVIDIA GB10 (128GB unified memory) with no torch.compile (Triton unsupported on ARM).
Features Tested
Parallel Residuals (PARALLEL_START_LAYER=6): Dual-stream Block architecture where attention and MLP operate on separate residual streams with learnable 4-element route vector and sigmoid lane_merge. From PR Record: ParallelResiduals + MiniDepthRecurrence, 1.1063 BPB / 1.8679 nats, -0.0072 vs PR #1179, -0.0143 vs merged SOTA #1204 @msisovic, PR Record: PROTEUS v1.6 — Scylla + Parallel Residuals + Depth Recurrence + Legal TTT — val_bpb 1.0819 (3-seed mean) #1289 @MatoTeziTanka.
Mixed INT5/INT6 Quantization (N_INT6_MLP_LAYERS=6): Middle MLP layers (3-7) quantized to INT5 (clip_range=15); edge MLP and all attention layers stay INT6. Saves ~0.9 MB of artifact space. From PR Record #1289 (PROTEUS v1.6) @MatoTeziTanka.
Score-First TTT (TTT_ENABLED=1): Chunk-based eval-time training with cosine LR decay, SGD momentum, and frozen early blocks. Ported from PR Record #1289 (PROTEUS v1.6).
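A sketch of what "score-first" chunked TTT might look like; the model interface (`model.blocks`, `model.loss`) and hyperparameters are illustrative, not the repo's. The key property is that each chunk is scored with the current weights before the model adapts on it, so no chunk influences its own score:

```python
import math
import torch

def ttt_eval(model, chunks, lr0=1e-3, momentum=0.9, freeze_below=6):
    """Score-first test-time training over a list of (x, y) chunks (sketch)."""
    # Freeze early blocks; only later blocks adapt at eval time.
    for i, blk in enumerate(model.blocks):
        if i < freeze_below:
            for p in blk.parameters():
                p.requires_grad_(False)
    params = [p for p in model.parameters() if p.requires_grad]
    opt = torch.optim.SGD(params, lr=lr0, momentum=momentum)

    total, n = 0.0, 0
    for step, (x, y) in enumerate(chunks):
        with torch.no_grad():
            total += model.loss(x, y).item()  # score FIRST, before adapting
            n += 1
        # Cosine LR decay over the chunk schedule.
        frac = step / max(len(chunks) - 1, 1)
        for g in opt.param_groups:
            g["lr"] = lr0 * 0.5 * (1 + math.cos(math.pi * frac))
        opt.zero_grad()
        model.loss(x, y).backward()  # then take one SGD step on the same chunk
        opt.step()
    return total / max(n, 1)
```

Because scoring happens before the update, the reported average loss stays "legal" in the sense used by the PROTEUS PRs: the weights that score a chunk never saw that chunk.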
SLOT (SLOT_ENABLED=1): Per-batch delta optimization at last hidden layer. From PR Record: QK-Gain 4.0 + XSA-11 + Muon-TTT + SLOT — val_bpb 1.0914 (3-seed mean) #1176.
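SLOT's per-batch delta at the last hidden layer could look roughly like the following; function names, step count, and optimizer choice are illustrative, not the repo's interface:

```python
import torch

def slot_logits(model_head, h_last, targets, steps=3, lr=0.01):
    """Sketch of SLOT: optimize one additive delta vector per batch row on
    the last hidden state, then score with the shifted state."""
    # One delta per batch row, broadcast across the sequence dimension.
    delta = torch.zeros_like(h_last[:, :1, :], requires_grad=True)
    opt = torch.optim.Adam([delta], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        logits = model_head(h_last + delta)
        loss = torch.nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
        loss.backward()  # gradients flow only into delta via opt
        opt.step()
    # Final logits with the optimized, detached delta.
    return model_head(h_last + delta.detach())
```

Only the delta is updated; the head weights stay fixed, which is what makes this a per-batch eval-time adjustment rather than weight training.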
Phase 1: 3-Run Comparison (1000 iterations each)
Common: VOCAB_SIZE=1024, TRAIN_BATCH_TOKENS=49152, SEED=42
Delta: -0.0122 train_bpb, -0.0267 post-EMA, -0.0145 INT6 roundtrip
Phase 2: 7-Run Overnight Ablation (500 iterations each)
Common: VOCAB_SIZE=1024, ITERATIONS=500, WARMUP_STEPS=10, SLIDING_WINDOW_ENABLED=0, SEED=42
Isolated Feature Impact
Throughput Surprise
The dual-stream architecture is 2.3x faster on GB10. This appears to be from better memory access patterns in the unified memory architecture. The separate attention and MLP streams avoid the sequential dependency that forces a round-trip through memory between the two operations.
Conclusions
Parallel residuals is the dominant feature. It delivers the largest BPB improvement and the largest throughput improvement. On 8xH100, the throughput gain would translate to more training steps in the 600s wallclock.
INT5 quantization is a free lunch for artifact size. Nearly BPB-neutral, saves ~0.9 MB. Use this when you need headroom under the 16 MB cap.
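The ~0.9 MB figure is consistent with saving exactly one bit per weight in the INT5 layers; a back-of-envelope check, with the INT5 weight count assumed for illustration rather than taken from the repo:

```python
# Going from 6-bit to 5-bit storage saves 1 bit per quantized weight.
int5_weights = 7_500_000            # assumed total weights in MLP layers 3-7
saved_bytes = int5_weights * 1 / 8  # one bit each, eight bits per byte
print(saved_bytes / 1e6)            # roughly 0.94 MB, in line with ~0.9 MB
```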
SLOT provides diminishing returns when combined with parallel residuals. The parallel architecture's additional learnable parameters (resid_mix_mlp, route, lane_merge) may cover the same optimization surface that SLOT exploits.
TTT was not tested on GB10 due to impractical eval times on single GPU (estimated 8+ hours per run). The score-first TTT implementation is ready but needs multi-GPU validation.
Architecture Details
Parallel residuals split each transformer block (from layer 6 onward) into two independent streams:
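A sketch consistent with the feature description; the parameter names (resid_mix_mlp, route, lane_merge) come from this PR's text, while the exact wiring and the attn/mlp internals are assumed placeholders:

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Dual-stream block sketch: attention updates stream A, the MLP updates
    stream B; a 4-element route mixes the streams at the sublayer inputs and
    a sigmoid lane_merge recombines them."""
    def __init__(self, d):
        super().__init__()
        self.ln_a, self.ln_b = nn.LayerNorm(d), nn.LayerNorm(d)
        self.attn = nn.Linear(d, d)  # attention placeholder
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))
        self.route = nn.Parameter(torch.tensor([1.0, 0.0, 0.0, 1.0]))  # 4-element route
        self.resid_mix_mlp = nn.Parameter(torch.zeros(d))  # per-channel MLP leak into A
        self.lane_merge = nn.Parameter(torch.zeros(1))     # sigmoid-gated merge

    def forward(self, a, b):
        r = self.route
        a = a + self.attn(self.ln_a(r[0] * a + r[1] * b))
        mlp_out = self.mlp(self.ln_b(r[2] * a + r[3] * b))
        b = b + mlp_out
        a = a + self.resid_mix_mlp * mlp_out  # learned leak of MLP output into stream A
        g = torch.sigmoid(self.lane_merge)
        return a, b, g * a + (1 - g) * b      # merged output for the next stage
```

With d=1024 this adds d + 4 + 1 = 1029 new parameters per block, which matches the ~1K-per-block figure quoted below.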
This adds 5,141 parameters (5 parallel blocks x ~1K each) for a 34.4M param model.
Hardware
Reproduction
Credits
Request for Feedback