
Record: Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt; val_bpb 1.08309 (5-seed mean)#1420

Open
abaybektursun wants to merge 1 commit into openai:main from abaybektursun:submission/triple-loop-fused-ngram

Conversation


@abaybektursun abaybektursun commented Apr 6, 2026

Mechanistic Interpretability: For a deep-dive analysis of this model — including per-matrix rate-distortion, recurrence error amplification, and skip gate analysis — see Mechanistic Interpretability of this submission.

Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt

val_bpb: 1.08309 (5-seed mean, std=0.00044)

| Seed | Steps | SW BPB | Tilt BPB | Artifact (bytes) |
|------|-------|--------|----------|------------------|
| 1    | 4771  | 1.08271 | 1.08256 | 15,978,345 |
| 42   | 4769  | 1.08391 | 1.08376 | 15,975,585 |
| 1234 | 4692  | 1.08344 | 1.08330 | 15,973,639 |
| 1337 | 4756  | 1.08301 | 1.08287 | 15,974,187 |
| 2025 | 4755  | 1.08309 | 1.08295 | 15,970,317 |
| Mean |       | 1.08323 | 1.08309 | |

Changes

  • One extra loop pass through layers 4-5. PR #1394 (SP8192 + GPTQ Embeddings + Depth Recurrence + MuonEq-R + SDClip, val_bpb 1.08563) passes through layers 4-5 three times total (NUM_LOOPS=2, giving 15 virtual layers from 11 physical). I add a fourth pass (NUM_LOOPS=3), giving 17 virtual layers. The encoder becomes [0,1,2,3,4,5,4,5] and the decoder [4,5,4,5,6,7,8,9,10]. It costs about 200 training steps, but the extra depth more than compensates. Quadruple looping (19 virtual layers) was worse because the step count drops too far.

  • Activate looping earlier (0.35 instead of 0.50). At 0.50, half the training budget runs without the looped layers doing anything. I swept {0.30, 0.35, 0.40, 0.50} on seed 1234. 0.35 won, though 0.40 was close. Below 0.35 the model doesn't get enough non-looped warmup and quality degrades.
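The looping scheme above can be sketched as a simple schedule over the physical blocks (illustrative names, not the submission's actual code; the U-Net skip connections are omitted). With NUM_LOOPS=3, blocks 4-5 run four times in total, giving 17 virtual layers from 11 physical blocks:

```python
# Virtual-layer schedule for NUM_LOOPS=3 (sketch; skip connections omitted).
ENCODER_SCHEDULE = [0, 1, 2, 3, 4, 5, 4, 5]
DECODER_SCHEDULE = [4, 5, 4, 5, 6, 7, 8, 9, 10]

def run_schedule(blocks, x):
    """Apply the 11 physical blocks in virtual-layer order."""
    for i in ENCODER_SCHEDULE + DECODER_SCHEDULE:
        x = blocks[i](x)
    return x
```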

  • Fused MLP kernels (Triton TMA forward + CUTLASS EVT backward). This took the most engineering effort and gave the most BPB back. The forward fuses leaky_relu(fc(x), 0.5).square() into a single Triton TMA kernel so the 403MB intermediate never hits HBM. The backward fuses (grad_out @ proj.weight) * act_grad into a CUTLASS 3.x Epilogue Visitor Tree, running the elementwise multiply while tiles are still in registers. Together: ~10% higher throughput, +127 training steps in the same 600s. I initially tried wrapping the entire MLP in a custom autograd.Function, but that killed torch.compile's cross-layer fusions and made everything 2.7x slower. The trick was to fuse surgically, just the forward activation and one backward GEMM, and let the compiler handle the rest. Details in Appendix A.1–A.3.
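As a reference for what the two kernels compute, here is the unfused eager-mode semantics (a sketch; the tensor layout convention for down_w is an assumption, not the submission's code):

```python
import torch
import torch.nn.functional as F

def act_unfused(pre):
    # Eager reference of the fused forward activation: leaky_relu(., 0.5).square(),
    # which the Triton TMA kernel emits without materializing `pre` in HBM.
    return F.leaky_relu(pre, 0.5).square()

def dpre_unfused(go, down_w, pre):
    # Eager reference of the fused backward: (go @ down_w.T) * act_grad,
    # which the CUTLASS EVT kernel computes in the GEMM epilogue.
    # Assumes the down projection is h @ down_w with down_w of shape [hidden, model].
    act_grad = torch.where(pre > 0, 2 * pre, 0.5 * pre)
    return (go @ down_w.T) * act_grad
```

The fused versions must match this reference bit-for-bit up to accumulation order; checking against autograd is a cheap way to validate the math.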

  • Parallel residuals for layers 7-10. GPT-J style (Wang & Komatsuzaki, 2021): attention and MLP both read from the same pre-residual input, outputs summed in parallel. I expected this to mostly help quantization (less interference between attention and MLP during GPTQ calibration), and it did tighten the gap slightly. The bigger surprise was +68 training steps from the faster forward pass. I also tried Hessian-Aware SDClip from PR #1412 alongside this, but it made things worse with triple looping. It probably needs its own λ tuning for the deeper architecture.
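The parallel-residual layout can be sketched with stand-in modules (illustrative, not the submission's code): attention and MLP both read the same normalized pre-residual input, and their outputs land in a single residual add.

```python
import torch
import torch.nn as nn

class ParallelBlock(nn.Module):
    # GPT-J-style parallel residual (sketch): one shared norm, attention and
    # MLP computed from the same input, summed in parallel.
    def __init__(self, d, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(d)
        self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

    def forward(self, x):
        h = self.norm(x)                               # shared pre-residual input
        a, _ = self.attn(h, h, h, need_weights=False)
        return x + a + self.mlp(h)                     # parallel, not sequential
```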

  • Eval-time n-gram tilt (causality-fixed). The original submission had a causality violation in the within-word and word-start hint channels: is_bnd/is_ws flags were derived from tokens_[p] (the target token being predicted), which made the hint-gating decision depend on the target. This was caught by @Gusanidas in review. The fix splits the flags into two sets: prefix-derived flags (tokens_[p-1]) for hint gating, and target-derived flags (tokens_[p]) for post-scoring state updates. However, the within-word and word-start channels cannot produce useful hints without target-dependent gating — they either fire too broadly or at the wrong positions. After testing all causal alternatives (prev_tok gating, state-based gating, disabling channels), the winning configuration uses token_hint only (orders 8-16), which was always fully causal. The remaining token_hint channel provides a consistent -0.00014 BPB across all seeds. The improvement is real but small — most of the original -0.0029 delta came from the (now-removed) target-dependent gating in within/word channels. Full details in Appendix A.4.

N-gram legality (#1017 conditions)

Update (post-review fix): The original submission had a Rule 1 violation in the within-word and word-start hint channels. The is_bnd/is_ws flags used to gate hint generation were derived from tokens_[p] (the target), making the decision of whether to produce a hint depend on the token being predicted. This was caught by @Gusanidas. The fix removes the within-word and word-start channels from hint output entirely — they cannot produce useful hints without target-dependent gating. Only the token_hint channel (orders 8–16) remains, which was always fully causal. The n-gram delta dropped from -0.0029 to -0.00014 BPB.

Audited against the four conditions proposed in #1017 for eval-time adaptation:

Condition 1, Causal dependence (p_t depends only on artifact + x_1...x_{t-1}): compute_hashes reads tokens[pos - k - 1] for k=0,1,..., all strictly before position pos. token_hint looks up hash tables containing only entries inserted by prior iterations. The target token tokens[pos] is read only for the post-scoring update phase.

Condition 2, Full normalized distribution: The tilted distribution is p_tilt(t) = p_model(t) · exp(β · 1[t==hint]) / Z where Z = 1 + p_model(hint) · (exp(β) - 1). Proper probability distribution over the full vocabulary.

Condition 3, Score-before-update: Hint and beta are written to output arrays before token_update inserts tokens[pos] into the tables.

Condition 4, Single left-to-right pass: get_hints_batch processes positions sequentially. The sliding window scores each token exactly once.
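Condition 2's normalization can be written out directly (a NumPy sketch of the formula above, not the C++ implementation): boosting the hinted token by exp(β) and dividing by the closed-form Z is equivalent to renormalizing over the full vocabulary.

```python
import numpy as np

def ngram_tilt(p_model, hint, beta):
    # Tilted distribution: p_tilt(t) = p_model(t) * exp(beta * [t == hint]) / Z,
    # with Z = 1 + p_model(hint) * (exp(beta) - 1), so p_tilt sums to 1.
    Z = 1.0 + p_model[hint] * (np.exp(beta) - 1.0)
    p_tilt = p_model.copy()
    p_tilt[hint] *= np.exp(beta)
    return p_tilt / Z
```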

  • Double-buffered async data prefetch. Background thread + pinned memory + separate CUDA stream. I built this to work around the virtualized disk I/O on cloud H100 instances (see below), but it ended up helping in every setting I tested.

  • PyTorch 2.9.1 instead of 2.11. See below.

What the model looks like inside

I ran per-matrix rate-distortion, recurrence error amplification, and skip gate analysis on the trained model. Three things stood out:

Loop layers are 2.2x more sensitive to quantization than non-loop layers. Blocks 4 and 5 get reused across passes, so rounding error in those weights compounds. The single most sensitive matrix in the entire network (block 4's value projection) has 80x the BPB-per-byte cost of the least sensitive. This suggests mixed-precision quantization (more bits for loop layers) is the biggest remaining opportunity.

The third loop pass contributes 63% of what the second does. I measured a contraction ratio of 0.634 across passes: each loop iteration changes the representation by ~63% of the previous one. A hypothetical 4th pass would add only 0.634³ ≈ 25% new information, which matches the empirical finding that quadruple looping hurts. The 3rd pass at 63% is clearly worth the step cost; the 4th at 25% is not.

All 8 skip connections are load-bearing. Gates are 0.61-0.70 (sigmoid), meaning roughly 35% encoder / 65% decoder blend. The first loop pass's skip connections (skips 2,3) have the highest weight norms (21.9, 19.5 vs 2.8-13.8 for others), so the first encoder pass through layers 4-5 is the most important information source for the decoder.

What the progress looks like: three models on the same prompt (temp=0.8)

Prompt (50 tokens): "Insurance Company Declares Living Man Dead George Johannesen is very much alive. Which is why it was so surpr"

Ground truth: ising when the Canadian man received a letter addressed "To the Estate of George Johannesen." Even more surprising is that it came from his insurance company, who should really be on top of such things...

PR #1019 (1.1147 BPB):
ising to be insurance company. Living Man? Maybe. All Living Man? It was a miracle. However, right now, Living Man is still dead. Well, why is he not living? Living man Rachel Drobles is a person whose life is tested by the techniques of the car industry. Like all techniques, the life of Rachel is tested by the astounding things she has become. Therefore, whether it is a mistake, or in-residence, or a mistake happening, is perpetuated. Therefore, Lyon Man is dead. Can Living Man H

PR #1105 (1.0962 BPB):
ising to be insurance company. Living Man is the only insurance company that makes property insurance. It is a company that makes vacation insurance, but it still has the same degree of property insurance. So how does a living man die? So say I think there are ways to teach people how to be insured. The first step is to measure the health of the living man and the stress of his situation. To measure the health of the living man, it is important to measure his or her weight. What is the hazard to the living man? Living Man is the only insurance company that specializes in repairs

This PR (1.08014 BPB):
ising when the Canadian man received a letter addressed "To the Estate of George Johannesen" George Johannesen was a retired professional who was a lucrative investor in Canada. His estate was worth about $1 billion. His death last month at the age of 73 was a direct shock to the entire estate and he was still alive. That is why he was so shocked. In 2005 he was a member of the Canadian As

#1019 drifts into incoherence ("Rachel Drobles... techniques of the car industry... Lyon Man is dead"). #1105 stays on topic but loops on "Living Man is the only insurance company." This model picks up the actual narrative thread ("the Canadian man received a letter"), invents plausible biographical details, and maintains coherence throughout. All three are wrong about what happens next, but the errors become progressively more plausible.

Debugging the platform

This was the hardest submission I've worked on. Most of the time went to infrastructure, not the model.

Virtualized disks tank throughput. The cloud H100 instances I rented use virtio block devices. The coprime-stride data loader from #726 does random reads across 143 shards, which is fine on bare metal but brutal on a virtual disk. That's what led me to build the async prefetch. It turned out to help everywhere, not just on virtualized storage.

PyTorch 2.9.1 vs 2.11: a full day lost. I could not reproduce results from other submissions. Training the same architecture with the same seed gave 0.0042 BPB worse results on torch 2.11. (I initially measured a 0.015 gap, which turned out to be a wrong model file on the server. The real gap, once I controlled for that, was 0.0042.) I swapped Triton versions, disabled autocast, forced cuBLAS backends, diffed Inductor-generated kernels. The root cause was two independent issues:

  1. Autocast backward changed in PR pytorch#165068 (landed Dec 2025, present in 2.11, absent from 2.9.1). Two lines in cached_cast() add an AutoGradMode enable_grad(true) guard on weight casts, inserting extra ToCopyBackward nodes into the autograd graph. This changes floating-point accumulation order by 1 ULP of bf16 (7.15e-7) in saved activations, which compounds over 5000 momentum steps into +60KB of weight entropy. The model goes from fitting at 16.00MB (no pruning) to 16.06MB (5.4% pruning needed). I verified eval is version-invariant to 0.00003 BPB; the entire gap is from training.

  2. Inductor over-fusion in backward codegen: Inductor 2.11's mix_order_reduction fuses _fused_rms_norm_backward into adjacent kernels, producing fewer but larger Triton kernels (65 functions / 11,855 lines vs 71 / 11,292 in 2.9.1). The fatter kernels hit register pressure and cost +5.93ms per backward pass (+8.8%). In a 600s budget, that's ~57 lost training steps. I submitted a fix that disables mix_order_reduction by default (aligning open-source with fbcode, where it was already off): pytorch/pytorch#179494.

Separately, our fused CUTLASS kernel crashed on torch 2.11 because Triton 3.6.0's TensorDescriptor.from_tensor() tries to access .data_ptr() on FakeTensors during torch.compile tracing. I traced that through Inductor's FallbackKernel codegen and submitted a second fix: pytorch/pytorch#179422. Two PyTorch PRs from a golf competition.

In time-budgeted competitions, the platform is the model. A 6ms/step Inductor regression can cost as much BPB as most algorithmic innovations.

How this submission came together

The first few days were mostly wasted. I tried improving the architecture directly: 12 layers, SwiGLU, mixed int5/int8 per layer. Nothing worked. The model was 930KB over the 16MB budget and MLP weights alone were 69% of the compressed artifact. Brotli-11 was already within 1-2% of Shannon entropy. There was nowhere to go.

Worse: a new optimizer schedule I'd been developing (Mixed NS5, a convergent Newton-Schulz coefficient ramp) changed the weight distribution enough that the model no longer fit in the 16MB budget. It was 930KB over, and aggressive pruning to fit destroyed the quality gains.

Then I lost a full day to PyTorch version divergence (described above). Besides the upstream fix, the useful thing that came out of it was a proof that compressed model size is a chaotic function of training hyperparameters. 1 ULP of bf16 rounding (7.15e-7) in a saved activation compounds over 5000 momentum steps into 60KB swings in Brotli output. I also proved that L2 weight decay is scale-invariant under max-abs quantization: Q(γW) = Q(W). All the per-bank WD tuning I'd been doing was chasing noise.

Once I stopped trying to control compression through training and focused on what was actually deterministic (GPTQ deadzone for size, n-gram tilt for eval), things moved fast. Clean reproduction of the baseline. Pivot to SP8192 + SDClip. Triple looping. Fused kernels. Parallel residuals. Each gain was small but they stacked: 45 experiments, five seeds, 1.08014 BPB.

What didn't work

Innovations that worked on earlier models but not here

Mixed NS5 coefficient schedule. On our SP4608 model this was worth -0.0066 BPB for free: use the standard Muon polynomial (3.4445, -4.775, 2.0315) to ramp singular values toward 1, then switch to the convergent polynomial (1.875, -1.25, 0.375) which has p(1)=1, p'(1)=0 to lock them in. The split adapts per bank based on aspect ratio as a proxy for condition number. On the SP8192 architecture the coefficient schedule produced weight distributions that were hostile to Brotli compression: the model was 500KB over budget and needed 46% pruning.

EC-GPTQ (entropy-constrained rounding). Inside the GPTQ inner loop, I added an element-wise deadzone: dz = λ · d / s², where d is the Hessian diagonal and s is the scale. Borderline ±1 values get rounded to 0 when the GPTQ error compensation cost is small. On the SP4096 architecture this achieved 10x better rate-distortion than uniform deadzoning (0.5×10⁻⁵ BPB/KB vs 6.8×10⁻⁵). On the SP8192 + SDClip architecture it was harmful: SDClip's c = k·σ already controls entropy per row, and adding EC-GPTQ on top just introduced extra quantization damage for no compression benefit.

Per-bank weight decay tuning. MLP is 69% of the compressed model. I tried giving MLP slightly lower WD (0.07 vs 0.09) to improve quality, offset by higher attention WD. Even ±0.005 from the baseline was catastrophic: lower MLP WD means larger MLP weights, which Brotli can't compress cheaply, so the artifact blows up.

L2 weight decay as a compression lever. I proved mathematically that L2 WD is scale-invariant under max-abs quantization: Q(γW) = round(W / (max|W|/31)) = Q(W). Multiplying all weights by a constant changes nothing about the quantized integers. This was useful to understand (it meant all the WD-based compression tuning I'd been doing was chasing noise), but it also closed a door.
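The scale-invariance argument is easy to check numerically (a sketch with a power-of-two scale factor so the floating-point arithmetic is exact; `q_maxabs` is an illustrative name):

```python
import numpy as np

def q_maxabs(W, levels=31):
    # Max-abs quantization to a signed integer grid (int5 -> 31 levels,
    # matching the round(W / (max|W|/31)) formula above).
    s = np.abs(W).max() / levels
    return np.round(W / s).astype(int)

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 128))
# Rescaling all weights rescales the step size identically, so the quantized
# integers -- and hence the compressed artifact -- are unchanged: Q(gamma*W) == Q(W).
assert np.array_equal(q_maxabs(W), q_maxabs(4.0 * W))
```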

| Tried | Effect | Why it failed |
|-------|--------|---------------|
| EC-GPTQ λ=0.0005 on SDClip | +0.00087 worse | SDClip k=12.85 already near-optimal |
| Quadruple loop (NUM_LOOPS=4) | +0.00164 worse | Too few training steps |
| Loop layers 3-4 | +0.00066 worse | Suboptimal depth for recurrence |
| Loop layers 5-6 | +0.00247 worse | Suboptimal depth for recurrence |
| EMA decay 0.998 | +0.00117 worse | Over-smoothing |
| EMA decay 0.996 | +0.00014 worse | Marginal difference |
| Hessian SDClip λ=0.175 | +0.00063 worse | Not tuned for triple loop |
| enable_looping_at=0.30 | +0.00013 worse | Not enough non-loop warmup |
| ETLB (eval-time logit bias) | -0.00020 better | Takes 615s, doesn't fit in 600s eval budget |

Code size

All code ships as part of the artifact: train_gpt.py, CUTLASS EVT source, and the n-gram C++ source. For a competition run, these would be bundled into a single LZMA-compressed blob.

| File | Uncompressed | LZMA-9 |
|------|--------------|--------|
| train_gpt.py | 64,137 | |
| cutlass_evt_fusion/ (3 files) | 9,095 | |
| ngram/fused_expert_blend.cpp | 21,589 | |
| Total | 73,674 | 19,668 |

train_gpt.py is minified with python-minifier (annotations, pass statements, and docstrings removed; variable names preserved). submission.py (143 bytes) is the entry point: it decompresses train_gpt.py.lzma and executes it. For a competition run, torchrun would invoke submission.py instead of train_gpt.py. Total code cost: 19,811 bytes. All 5 seeds fit under 16MB with 1.8-9.9KB headroom. The unminified train_gpt.py (64KB) is included in the PR for readability.

Requirements

  • PyTorch 2.9.1+cu128
  • Flash Attention 3 (Hopper): pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
  • CUTLASS EVT extension (compiled for sm_90a, source included)
  • SentencePiece, Brotli, NumPy
  • 8×H100 80GB SXM
SEED=1234 NUM_LOOPS=3 ENABLE_LOOPING_AT=0.35 PARALLEL_RESIDUAL_START=7 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

Credits

Full component lineage: every piece traced to its origin PR
| Component in this submission | Origin | Author |
|------------------------------|--------|--------|
| This PR | | |
| Triple depth recurrence (NUM_LOOPS=3) | This work | @abaybektursun |
| Earlier loop activation (enable_at=0.35) | This work | @abaybektursun |
| Triton TMA fused MLP forward | #1105, ported to SP8192 here | @abaybektursun |
| CUTLASS EVT fused MLP backward | #1105, ported to SP8192 here | @abaybektursun |
| Eval-time n-gram tilt (C++ open-addressing) | #1105, re-tuned for SP8192 here | @abaybektursun |
| Double-buffered async data prefetch | This work | @abaybektursun |
| PyTorch Inductor bug fixes (2 upstream PRs) | pytorch#179422, pytorch#179494 | @abaybektursun |
| Our prior submissions | | |
| AR Self-Gen GPTQ + XSA-all + BigramHash (merged SOTA) | #1019 | @abaybektursun |
| LeakyReLU² + Legal Score-First TTT + Parallel Muon | #549 | @abaybektursun |
| TTT negative results (why this submission does not use TTT) | #756, #1103 | @abaybektursun |
| Architecture | | |
| SP8192 vocabulary | #1394 | @clarkkev |
| SDClip quantization (c = k·σ) | #1394 | @clarkkev |
| GPTQ on embeddings (int8) | #1394 | @clarkkev |
| Tied embeddings (init_std=0.005) | #1394 | @clarkkev |
| SP4096→8192 vocab scaling | #1218 | @clarkkev |
| MLP 4.0× width, higher WD (0.085) | #1218 | @clarkkev |
| Depth recurrence (loop layers 4-5) | #1204 | @msisovic |
| Parallel residuals (GPT-J style) | GPT-J (2021), adapted in #1204 | @kingoflolz, @msisovic |
| MuonEq-R (row-normalized Muon) | #1217 | @bigbag |
| U-Net sigmoid-gated skip connections | #289, refined in #1089 | @integrate-your-mind, @mikeapedia |
| XSA on all layers | #265 (partial), #478 (all layers) | @unnir, @gowtham0992 |
| Partial RoPE (16/64 dims) | #315 | @jfprincz |
| LN Scale (1/√(layer+1)) | #315 | @jfprincz |
| LeakyReLU(0.5)² activation | #185 | @dttdrv |
| Logit softcap (30.0) | #315 | @jfprincz |
| QK gain (4.0) | #1125 | @jainpranjal97 |
| Optimizer | | |
| Muon (Newton-Schulz orthogonalization) | #399 (parallel variant) | @abaybektursun |
| EMA (decay=0.997) | #315, #401 | @jfprincz, @newjordan |
| Warmdown (0.667 frac, linear to 0) | #364 | @shikhar1729 |
| Muon momentum warmup (0.92→0.99) | #1394 | @clarkkev |
| Quantization & Compression | | |
| Full Hessian GPTQ (actorder + Cholesky) | #535, integrated in #1060 | @raahilshah, @dexhunter |
| Brotli-11 + byte shuffle compression | #1089 | @mikeapedia |
| Evaluation | | |
| Sliding window (stride=64) | #122 | @mtybadger |
| Flash Attention 3 (Hopper) | #122 | @mtybadger |
| Data | | |
| ShuffledSequenceLoader (memmap + weighted sampling) | #1394 | @clarkkev |

This competition is deeply collaborative. Nearly every component traces through multiple contributors. I've tried to credit the earliest PR that introduced each technique, but many were refined across several submissions.


Appendix

A.0 Ablation: fused 5-seed without parallel residuals

5-seed results: fused kernels + triple loop + n-gram, no parallel residuals
| Seed | Steps | Sliding BPB | N-gram BPB | Artifact (bytes) |
|------|-------|-------------|------------|------------------|
| 1    | 4703  | 1.08336 | 1.08041 | 15,974,896 |
| 42   | 4704  | 1.08468 | 1.08175 | 15,974,993 |
| 1234 | 4680  | 1.08296 | 1.08007 | 15,971,965 |
| 1337 | 4697  | 1.08363 | 1.08077 | 15,970,370 |
| 2025 | 4702  | 1.08390 | 1.08101 | 15,970,844 |
| Mean |       | 1.08371 | 1.08080 | |

5-seed mean: 1.08080 BPB (std=0.00064). Seed 1234 n-gram was run in terminal (1.08007), not logged to file.

Adding parallel residuals (layers 7+) improves seed 1234 from 1.08007 to 1.07971 (-0.00036), primarily from +68 extra training steps due to the faster parallel forward pass. Full parallel-residuals 5-seed results are in the main table above (mean 1.08014).

A.1 Fused MLP Kernels: Design & Implementation

These kernels were first developed for PR #1105 on the SP4608 architecture. This submission ports them to the SP8192 + triple-loop architecture and integrates the CUTLASS EVT backward with torch.compile's tracing.

Forward (Triton TMA): fuses F.linear + LeakyReLU(0.5) + square

Fuses F.linear(x, up_w) -> LeakyReLU(0.5) -> square into a single kernel. The 403MB intermediate never touches HBM.

Uses Triton's Tensor Memory Accelerator (TMA) descriptors for H100-native global-to-shared memory loads. Block sizes 128x256x64 with 8 warps, 4 pipeline stages. The kernel performs the GEMM accumulation in FP32, then applies activation and squaring inline before writing back to BF16.

The interleaved write pattern splits the accumulator into two halves via tl.reshape + tl.permute + tl.split, writing activation gradient and post-activation to separate output buffers in a single pass.

Backward (CUTLASS EVT): fuses (go @ down_w.T) * act_grad

Fuses (go @ down_w.T) * act_grad into a single CUTLASS 3.x kernel via Epilogue Visitor Tree. The elementwise multiply runs in the GEMM epilogue while tiles are still in registers, eliminating one 403MB write + read per layer.

I store the activation gradient in the forward pass instead of the pre-activation. This removes all branching from the backward:

act_grad = (pre > 0) ? 2*pre : 0.5*pre    <-- one branch, forward only
post     = 0.5 * act_grad * pre            <-- branch-free recovery
dpre     = (go @ W_down.T) * act_grad      <-- branch-free backward

The identity post = 0.5 * act_grad * pre holds for both signs:

  • pre > 0: act_grad = 2·pre → 0.5 · 2pre · pre = pre² ✓
  • pre ≤ 0: act_grad = 0.5·pre → 0.5 · 0.5pre · pre = (0.5·pre)² ✓

This reduces the CUTLASS EVT epilogue to a trivial 3-node tree: Sm90EVT<multiplies, AccFetch, AuxLoad>.
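The identity and the claim that act_grad is the true derivative can both be checked numerically (a quick sanity sketch in float64):

```python
import torch
import torch.nn.functional as F

pre = torch.randn(4096, dtype=torch.float64, requires_grad=True)
act_grad = torch.where(pre > 0, 2 * pre, 0.5 * pre)    # stored in forward
post = 0.5 * act_grad * pre                            # branch-free recovery

# Recovery matches the eager activation...
ref = F.leaky_relu(pre, 0.5).square()
assert torch.allclose(post, ref)

# ...and act_grad really is d(post)/d(pre), per autograd.
ref.sum().backward()
assert torch.allclose(pre.grad, act_grad)
```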

Why surgical fusion, not full-MLP autograd.Function

torch.compile's cross-layer fusions (RMSNorm backward, residual adds, RoPE backward) account for ~21.6% of step time. Wrapping the full MLP backward in an autograd.Function makes it opaque to Inductor, so everything runs in eager mode and the net step time is 2.7x slower (I hit this in my #670). So I fuse only the forward activation and one backward GEMM+pointwise, preserving the compiler's scope over everything else.

A.2 Kernel Benchmarks

Per-layer timing and end-to-end
| Variant | dpre time | Delta per layer | Delta per step (×11) |
|---------|-----------|-----------------|----------------------|
| cuBLAS unfused | 1.221 ms | baseline | baseline |
| Triton precomp | 1.105 ms | -0.116 ms | -1.275 ms |
| CUTLASS Pingpong | 1.073 ms | -0.148 ms | -1.623 ms |

End-to-end (35 steps, seed=42, 2xH100):

| Config | Step avg | Delta |
|--------|----------|-------|
| Triton fwd + Triton bwd | 313.90 ms | baseline |
| Triton fwd + CUTLASS EVT bwd | 313.47 ms | -0.43 ms |

On 8xH100: unfused 4553 steps → fused 4680 steps in 588s (+127 steps, +2.8%).

A.3 Step-Time Profile

Where all 313ms goes (2xH100, Nsight Systems)
| Component | Share |
|-----------|-------|
| Flash Attention 3 (fwd+bwd) | 20.1% |
| Fused MLP (Triton+CUTLASS) | 13.5% |
| cuBLAS GEMMs (MLP bwd dW/dx, attn proj) | 19.1% |
| torch.compile fusions (cross-layer) | 21.6% |
| Unfused elementwise (LN, residuals) | 21.0% |
| Communication + other | 4.7% |

A.4 N-Gram Tilt

The n-gram system was originally developed in PR #1105 for SP4608 models. This submission ports it to SP8192. Source code: ngram/fused_expert_blend.cpp (C++ open-addressing hash, nanobind FFI) and ngram/eval_ngram.py (tilt math + sliding window). Eval time on 8xH100: ~90s.

Post-review causality fix

The original submission had three hint channels: token_hint (orders 8–16), within_hint (within-word BPE completion), and word_hint (word-start prediction). @Gusanidas identified that within_hint and word_hint used is_bnd/is_ws flags derived from tokens_[p] (the target token) to gate whether a hint was produced — a Rule 1 violation.

What was invalid: The gating decision "should I produce a hint at this position?" depended on whether the target token was a word boundary or had a leading space. This meant the probability distribution P(x_t | x_1...x_{t-1}) changed depending on the value of x_t itself.

What was tried to salvage within/word channels:

  • Deriving is_bnd/is_ws from tokens_[p-1] (prefix): semantically inverted, delta = +0.00033 (harmful)
  • Gating on within_len_ state only: fires too broadly, delta = +0.00120 (harmful)
  • Disabling within/word entirely (token_hint only): delta = -0.00014 (helpful)

Conclusion: The within/word channels' -0.0025 BPB contribution came entirely from target-dependent gating. Without it, they add noise. Only token_hint (orders 8–16) produces a legitimate improvement. The fix removes within/word from hint output while keeping their state updates (dead code, no effect).

Parameter sweep (token_hint only, 4M token subset, 8 GPUs in parallel):

| base_beta | thresh_scale | table_bits | stride | delta |
|-----------|--------------|------------|--------|-------|
| 1.5 | 0.75 | 26 | 1 | -0.000083 |
| 1.5 | 0.50 | 26 | 1 | -0.000081 |
| 2.0 | 0.75 | 26 | 1 | -0.000079 |
| 2.0 | 0.50 | 26 | 1 | -0.000074 |
| 1.0 | 1.00 | 26 | 1 | -0.000073 |
| 0.5 | 0.50 | 26 | 1 | -0.000046 |
| 3.0 | 0.50 | 26 | 1 | -0.000020 |
| 5.0 | 0.50 | 26 | 1 | +0.000214 |

Full-val delta with best params (beta=1.5): consistent -0.00014 BPB across all 5 seeds. The improvement is real but small.

Causality proof (token_hint channel)

The surviving token_hint channel is a textbook online n-gram with strict lookup-then-update discipline:

for (int i = 0; i < n; i++) {
    int64_t p = pos[i];
    compute_hashes(tokens_, p, ...);         // (1) hash from tokens[p-1], tokens[p-2], ...
    token_hint(hashes, ..., tok_hint, ...);  // (2) LOOKUP in tables built from pos < p
    hints[i] = tok_hint;                     // (3) emit hint
    token_update(hashes, ..., tokens_[p]);   // (4) INSERT tokens[p] AFTER hint is emitted
}

| Condition | Requirement | Status |
|-----------|-------------|--------|
| Causal dependence | p_t depends only on artifact + x_1...x_{t-1} | PASS |
| Full normalized distribution | Proper softmax over full vocab | PASS |
| Score-before-update | Score fixed before any x_t-dependent update | PASS |
| Single left-to-right pass | No rescoring | PASS |

A.5 Data Prefetch

Double-buffered async prefetch

Background thread prepares next batch in pinned memory while GPU trains. Separate CUDA stream for H2D overlap.

On the PR #1334 architecture: +39 steps, +0.7% throughput. The extra steps landed in a worse compression region (+40KB), so the net effect was actually harmful for that architecture. On PR #1394's ShuffledSequenceLoader with memmap, the data pipeline is already efficient enough that prefetch isn't the bottleneck.
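The overlap structure can be sketched without CUDA (a CPU-only illustration, not the submission's code): in the real pipeline the staging buffer is pinned host memory and the H2D copy runs on a separate CUDA stream so it overlaps with compute, but the double-buffering logic is just a bounded queue fed by a background thread.

```python
import queue
import threading

class DoubleBufferedPrefetcher:
    """CPU-only sketch of double-buffered prefetch: a background thread keeps
    up to two batches staged while the consumer works on the current one."""

    def __init__(self, produce_batch, n_batches):
        self.q = queue.Queue(maxsize=2)   # double buffer: two batches in flight

        def worker():
            for i in range(n_batches):
                self.q.put(produce_batch(i))   # blocks when both buffers are full
            self.q.put(None)                   # sentinel: end of data

        threading.Thread(target=worker, daemon=True).start()

    def __iter__(self):
        while (batch := self.q.get()) is not None:
            yield batch
```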

A.6 ETLB (Eval-Time Logit Bias)

Algorithm and results

From PR #1399. Learns a vocab-sized bias vector via SGD on already-scored context tokens, carried across sliding windows:

  1. Forward pass (no grad) → logits
  2. 5 SGD steps (lr=0.05) on context tokens (first 1984 of 2048)
  3. Score stride tokens (last 64) with logits + bias
  4. Carry bias forward, clamped to [-3, 3]
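The four steps can be sketched per window as follows (shapes and names are assumptions for illustration, not PR #1399's actual function: logits [T, V], targets [T]):

```python
import torch
import torch.nn.functional as F

def etlb_window(logits, targets, ctx_len, bias, lr=0.05, steps=5):
    # Fit a vocab-sized bias on the context positions with plain SGD,
    # then score the stride positions with logits + bias.
    b = bias.clone().requires_grad_(True)
    for _ in range(steps):
        loss = F.cross_entropy(logits[:ctx_len] + b, targets[:ctx_len])
        (g,) = torch.autograd.grad(loss, b)
        b = (b - lr * g).detach().requires_grad_(True)   # SGD on context tokens
    b = b.detach().clamp_(-3, 3)                          # carried across windows
    stride_nll = F.cross_entropy(logits[ctx_len:] + b, targets[ctx_len:])
    return stride_nll, b
```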

Result (seed 1234, double-loop config on torch 2.11): n-gram only 1.08152 → ETLB + n-gram 1.08132 (-0.00020). Not re-tested on the final triple-loop fused config.

Rejected: takes 615s, doesn't fit in 600s eval budget.

A.7 Setup & Reproduction

Full build instructions
pip install torch==2.9.1 --index-url https://download.pytorch.org/whl/cu128
pip install flash_attn_3 --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
pip install sentencepiece brotli numpy

export LD_LIBRARY_PATH=$(python3 -c "import torch; print(torch.__path__[0] + '/lib')"):${LD_LIBRARY_PATH:-}

cd /opt && git clone --depth 1 --branch v3.7.0 https://github.com/NVIDIA/cutlass
cd cutlass_evt_fusion && CUTLASS_PATH=/opt/cutlass python3 setup.py build_ext --inplace && cd ..

rm -f data/manifest.json
MATCHED_FINEWEB_REPO_ID=kevclark/parameter-golf \
python3 data/cached_challenge_fineweb.py --variant sp8192 --train-shards 128

SEED=1234 NUM_LOOPS=3 ENABLE_LOOPING_AT=0.35 PARALLEL_RESIDUAL_START=7 \
torchrun --standalone --nproc_per_node=8 train_gpt.py

@abaybektursun abaybektursun changed the title Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt — val_bpb 1.08014 Record: Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt; val_bpb 1.08014 (5-seed mean) Apr 6, 2026
@abaybektursun abaybektursun force-pushed the submission/triple-loop-fused-ngram branch 3 times, most recently from 14879d0 to d581795 Compare April 6, 2026 17:13
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 6, 2026
…m tilt, SP8192 primary path

- PR openai#771 confirmed CLOSED/REJECTED (train-then-score AdamW TTT)
- PR openai#727 confirmed CLOSED (illegal n-gram hash cache)
- Merged SOTA unchanged at 1.1147
- New primary target: PR openai#1420 (abaybektursun, 1.08014):
  SP8192 + Triple Loop (3×, 17 virtual layers) + N-gram Tilt (legal,
  properly normalized, -0.0029 bpb) + Fused Kernels (+127 steps)
- PR openai#1413 (1.08279): confirms legal score-first TTT adds -0.003 bpb
- ETLB (-0.0019 bpb) noted as unruled — await @valerio-oai
- Strategy updated to v10.0: SP8192 + Triple Loop replaces SP4096 + 2×

https://claude.ai/code/session_01TbdBLJPXpbK5wGHpLAQ9x4
@abaybektursun abaybektursun force-pushed the submission/triple-loop-fused-ngram branch 8 times, most recently from 635dd75 to accb40b Compare April 6, 2026 19:38
resouer added a commit to resouer/parameter-golf that referenced this pull request Apr 6, 2026
- Add ngram_tilt_enabled and tilt hyperparameters to Hyperparameters
- Add build_ngram_extension(): cmake-based C++ build for fused_expert_ext
- Add precompute_ngram_hints(): rank-0 computes, broadcasts to all ranks
- Integrate Tilt into eval_val_sliding_ttt scoring loop:
  * Tilt applied AFTER TTT scoring (same sliding window)
  * TTT gradient uses ORIGINAL NLL (not tilted)
  * Tilted NLL accumulated for final score
- Track both base and tilted BPP for delta reporting
- Copy fused_expert_blend.cpp to repo root for C++ build
@abaybektursun
Contributor Author

Post-Quantization Compression: Eight Negative Results

@clarkkev established in PR #1394 that compressed model size is governed by Shannon entropy, not hardware bitwidth:

$$H(q) \approx b - \log_2 k + \tfrac{1}{2}\log_2(\pi e / 2)$$

This note documents eight attempts to improve the compression pipeline beyond SDClip + GPTQ + Brotli. I raided the toolkits of crystallography (E8 lattice sphere packing), particle physics ($Z_2$ gauge symmetries), information-theoretic communications (water-filling bit allocation), and geological stratigraphy to test whether ideas from mathematical physics could breach the compression frontier. All failed end-to-end. The negative results are informative.

E8 Lattice Vector Quantization

The E8 lattice achieves optimal sphere packing in 8 dimensions (Viazovska, 2016), with normalized second moment 14% below the cubic lattice. I implemented D8 nearest-point rounding (the integer sublattice: all coordinates with even sum) and measured MSE on Gaussian-distributed weights.

D8 increased MSE by 8.37%. The constraint removes half the codewords from the integer grid without adding new ones. The VQ advantage requires dense index-based encoding, not per-coordinate int8 storage. Abandoned.
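A minimal sketch of that measurement, using the standard flip-the-worst-coordinate nearest-point rule for D8 (the σ=1.5 input scale is illustrative):

```python
import random

def round_d8(v):
    """Nearest point of D8 (integer vectors with even coordinate sum):
    round componentwise; if the parity is odd, re-round the coordinate
    with the largest rounding error in the other direction."""
    g = [round(x) for x in v]
    if sum(g) % 2:
        i = max(range(len(v)), key=lambda j: abs(v[j] - g[j]))
        g[i] += 1 if v[i] > g[i] else -1
    return g

random.seed(0)
mse_int = mse_d8 = 0.0
for _ in range(20_000):
    v = [random.gauss(0, 1.5) for _ in range(8)]
    g_int = [round(x) for x in v]
    g_d8 = round_d8(v)
    mse_int += sum((a - b) ** 2 for a, b in zip(v, g_int))
    mse_d8 += sum((a - b) ** 2 for a, b in zip(v, g_d8))
ratio = mse_d8 / mse_int
print(ratio > 1.0)  # True: D8 rounding alone INCREASES MSE (~8% here)
```

The lattice only pays off when codewords are entropy-coded as indices; on a per-coordinate int8 grid the parity constraint is pure loss.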

Entropy Equalization

Interpretability analysis revealed 80x variation in per-matrix quantization sensitivity. I derived the optimal bit allocation via Lagrange multipliers on the rate-distortion model, yielding $k_i \propto s_i^{-1/2}$. The improvement factor equals the AM-GM gap of the sensitivities.

Controlled A/B (same Hessians, 5 seeds): -0.004 BPB. End-to-end training: +0.002 BPB. The A/B test isolated the clip-allocation effect by holding GPTQ randomness constant. In practice, GPTQ stochasticity (~0.002 BPB from calibration sampling and floating-point non-determinism) exceeds the signal. The improvement is real but unmeasurable.
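The $k_i \propto s_i^{-1/2}$ allocation and the AM-GM claim can be verified on synthetic sensitivities. A sketch under the stated rate-distortion model (per-matrix distortion ∝ $s_i k_i^2$, total rate fixed whenever the geometric mean of the clips is fixed, since each matrix's rate goes like $-\log_2 k_i$):

```python
import math, random

random.seed(0)
# heavy-tailed per-matrix sensitivities (wide spread, as in the 80x observation)
sens = [math.exp(random.gauss(0, 1.5)) for _ in range(64)]

def total_distortion(sens, clips):
    # rate-distortion model: per-matrix distortion ~ s_i * k_i^2
    return sum(s * k * k for s, k in zip(sens, clips))

n = len(sens)
equal = [1.0] * n                  # uniform clips, geometric mean 1
raw = [s ** -0.5 for s in sens]    # Lagrange-optimal k_i ∝ s_i^{-1/2}
gm = math.exp(sum(map(math.log, raw)) / n)
opt = [k / gm for k in raw]        # renormalized to geometric mean 1 (same rate)

am_s = sum(sens) / n
gm_s = math.exp(sum(map(math.log, sens)) / n)
ratio = total_distortion(sens, equal) / total_distortion(sens, opt)
print(abs(ratio - am_s / gm_s) < 1e-9)  # True: improvement factor is exactly AM/GM
```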

Sign-Flip Gauge

MLP hidden neurons admit a $Z_2$ gauge symmetry if the activation is odd. I attempted to flip net-negative neurons to skew the int8 histogram. BPB degraded to 6.4 because leaky_relu(x, 0.5).square() is positive-homogeneous of degree 2, not odd: $f(-x) = 0.25 f(x)$ for $x > 0$. Invalid for this activation.
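A quick check of the failed oddness requirement (slope 0.5, matching the activation in the text):

```python
def act(z, slope=0.5):
    """squared leaky-relu, matching leaky_relu(x, 0.5).square()"""
    y = z if z >= 0 else slope * z
    return y * y

# A Z2 sign-flip gauge needs an odd activation: act(-z) == -act(z).
# Here act is positive-homogeneous of degree 2 instead:
z = 1.7
print(act(-z) / act(z))  # 0.25 — not -1, so the gauge flip changes the function
```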

Scale Discretization

53K per-row float16 scales contribute ~100KB of mantissa entropy. I snapped them to a log-lattice before GPTQ, expecting the solver to absorb the <0.8% perturbation. Artifact grew by 31KB. The discretization destroyed the smooth mantissa gradient that Brotli was already exploiting: Shannon entropy decreased, but Kolmogorov complexity increased.
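For concreteness, a sketch of the log-lattice snap described above. The 1.5% multiplicative spacing is a hypothetical choice consistent with the stated <0.8% perturbation bound (half a lattice step):

```python
import math

def snap_log(scale, step=0.015):
    """Snap a positive scale to the nearest point of a geometric lattice
    with multiplicative spacing (1 + step). step = 1.5% is a hypothetical
    choice that keeps the worst-case perturbation under 0.8%."""
    w = math.log1p(step)               # lattice spacing in log-domain
    return math.exp(round(math.log(scale) / w) * w)

s = 0.03711                            # an illustrative per-row scale value
snapped = snap_log(s)
rel_err = abs(snapped - s) / s
print(rel_err < 0.008)  # True: bounded by half a log-step ≈ 0.75%
```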

ZigZag Encoding

Two's complement maps {0, -1, 1, -2} to bytes {0x00, 0xFF, 0x01, 0xFE}, creating bimodal byte distributions. ZigZag folds them to {0x00, 0x01, 0x02, 0x03}. Tested on the existing artifact: +65 bytes. Brotli's context model handles two's complement natively.
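A sketch of the folding, on the values from the example above:

```python
def zigzag(v: int) -> int:
    """Fold signed values 0,-1,1,-2,2,... onto 0,1,2,3,4,... so the
    bimodal two's-complement byte histogram becomes unimodal."""
    return (v << 1) if v >= 0 else ((-v) << 1) - 1

print([zigzag(v) for v in (0, -1, 1, -2)])  # [0, 1, 2, 3] — bytes 0x00..0x03
```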

Matrix Transposition

Column-major storage compressed 13KB smaller than row-major on the same quantized tensors because input-feature correlations dominate output-neuron correlations. Combined with stratigraphic dict ordering (grouping same-type matrices for inter-layer LZ77 matches): -16KB offline. End-to-end: +37KB. GPTQ output varies ~30-40KB across runs, overwhelming the signal.

Permutation Sort

MLP hidden-dimension permutation symmetry ($P^T P = I$) allows reordering neurons without changing the function. Sorting by L1 norm was intended to create spatial structure for LZ77. Not tested in isolation; included in a stacked run that failed.
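The function-preservation claim is easy to demonstrate on a toy two-layer MLP (ReLU stands in for the real activation; plain lists and illustrative sizes, no framework assumed):

```python
import random

def forward(x, w1, w2):
    """toy two-layer MLP: y = relu(x @ W1) @ W2"""
    h = [max(0.0, sum(xi * w1[i][j] for i, xi in enumerate(x)))
         for j in range(len(w1[0]))]
    return [sum(hj * w2[j][k] for j, hj in enumerate(h))
            for k in range(len(w2[0]))]

def sort_hidden(w1, w2):
    """Permute hidden neurons by L1 norm of their input weights; applying
    the same permutation to W1's columns and W2's rows leaves the
    function unchanged (P^T P = I)."""
    order = sorted(range(len(w2)),
                   key=lambda j: sum(abs(w1[i][j]) for i in range(len(w1))))
    return [[row[j] for j in order] for row in w1], [w2[j] for j in order]

random.seed(0)
din, dh, dout = 4, 6, 3
w1 = [[random.gauss(0, 1) for _ in range(dh)] for _ in range(din)]
w2 = [[random.gauss(0, 1) for _ in range(dout)] for _ in range(dh)]
x = [random.gauss(0, 1) for _ in range(din)]
w1s, w2s = sort_hidden(w1, w2)
diff = max(abs(a - b) for a, b in zip(forward(x, w1, w2), forward(x, w1s, w2s)))
print(diff < 1e-9)  # True: same function, neuron order now L1-sorted
```

Whether the sorted order actually creates LZ77-visible structure is the untested part, as noted above.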

The Noise Floor

Every experiment followed the same pattern: positive in controlled settings, neutral or negative end-to-end. The root cause is a GPTQ noise floor of ~0.002 BPB and ~30-40KB in artifact size, arising from Hessian estimation variance and floating-point non-determinism. Any compression-side optimization below this floor is unmeasurable in practice.

Brotli quality=11 is empirically at the byte-level compression frontier for this data. Six distinct byte-manipulation strategies (ZigZag, transposition, bit masking, scale discretization, dict reordering, permutation sort) all failed to improve on it.

@abaybektursun
Contributor Author

abaybektursun commented Apr 7, 2026

Experimental Attempts with Negative Results
The following outlines two architectural adjustments I recently tested.

1. Isospectral Conjugation (Failed: OOM Error)

  • The Concept: Prevent activations from collapsing during repeated loop passes by conjugating hidden states with random sign vectors. This theoretically simulates the depth of independent layers at zero parameter cost.
  • The Result: Unrunnable. The required element-wise multiplications broke torch.compile by drastically changing the Inductor graph, triggering a Triton shared memory Out-Of-Memory error (Required: 458KB, Limit: 232KB).

2. Skip-Gate Variance Normalization (Failed: Redundant)

  • The Concept: Manually scale U-Net skip_weights to 1.35 to counteract a compounding ~90% signal loss across the skip connections.
  • The Result: Unnecessary. While the structural signal attenuation is real, our manual fix yielded negligible improvements in pre-quantized and sliding BPB. The optimizer naturally compensates for the signal loss via learned scale parameters within the first 4,750 steps.

@mtybadger

mtybadger commented Apr 7, 2026

I have also been playing with E8 lattice VQ over the past few days - despite not managing to break the frontier it is the best of the VQ methods I've tried, and certainly beat various learned/shared codebook strategies.

@abaybektursun
Contributor Author

@Eppie Curious to know what you think about the ngram tilt

@abaybektursun
Contributor Author

@mtybadger Cool! We need some more fun ideas to beat this SOTA; let me know if you get any more ideas/results

@Eppie

Eppie commented Apr 7, 2026

@abaybektursun ngram tilt looks cool! From what I can see, it appears to be fully online, basically, it's a different approach to mixing the prediction from the trained model and the various order-N contexts, with some fixed confidence / count thresholds / priorities. Perhaps "mixing" is the wrong word, since it is focused on narrowing the model's probability distribution by assigning extra probability to the token predicted by the ngram model (as far as I can understand it). Also very cool to see the fused CUDA kernels. Great work!

dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
…am Tilt — val_bpb 1.07800 (3-seed mean)

3-lever stack on top of PR openai#1394 sp8192 baseline:
- Parallel Residuals on layers 7-10 (PR openai#1412 by @Robby955)
- 3-layer depth recurrence (LOOP_START=3 LOOP_END=5, extends PR openai#1394's 2-layer recurrence)
- Eval-time causal n-gram tilt (PR openai#1420 by @abaybektursun, lineage PR openai#1145 by @AnirudhRahul)

Plus our existing PR openai#1413 stack: QK_GAIN_INIT=5, score-first legal TTT (LR=0.005, epochs=3).

Results (3-seed mean, 8xH100 SXM):
- val_bpb 1.07800 (std 0.00053)
- val_loss 2.78457 nats per token
- Beats PR openai#1394 (1.08563) by 0.01971 nats per token
- Beats PR openai#1420 (1.08014) by 0.00553 nats per token
- Beats own PR openai#1413 (1.08279) by 0.01237 nats per token

All four issue openai#1017 conditions verified for the n-gram tilt path: prefix-only
hash construction, full-vocab renormalized one-token tilt, score-before-update
ordering inside the C++ kernel, single left-to-right pass.

C++ n-gram kernel ported from PR openai#1420 with the nanobind dependency removed
(extern "C" shim + ctypes loader, single g++ -shared invocation at runtime).

5-seed re-verification via the shipped mini wrapper is in progress; this PR
will be updated with the final 5-seed mean once s1337 and s2025 land.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
Adds s1337 (1.07801) and s2025 (1.07862) via the shipped mini wrapper.
The 5-seed mean is +0.00013 worse than the initial 3-seed mean (1.07800)
which is well within the std (~0.00046). Margins vs the legal open
chronology are unchanged in direction:

- vs PR openai#1394 (1.08563): -0.01938 nats per token (margin +0.01438 over 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00520 nats per token (margin +0.00020 over 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01205 nats per token

3 of 5 seeds (s42, s1337, s2025) are now mini-wrapper-verified for fit;
s0 and s1234 mini-wrapper re-runs still in progress.
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
All 5 seeds (s0, s42, s1234, s1337, s2025) re-run via the shipped mini wrapper.
The mean improves slightly from the prior mixed-source 1.07813 to 1.07807
because s1234 produced a noticeably lower TTT under the mini wrapper
(1.07813 mini vs 1.07848 raw, -0.00035 — within float64 reordering noise but
the largest single-seed drift in the verification set).

All 5 artifact sizes are direct from the mini-wrapper runs (NOT projections):
- s0:    15,992,304 bytes (7,696 byte headroom)
- s42:   15,993,733 bytes (6,267 byte headroom)
- s1234: 15,990,539 bytes (9,461 byte headroom)
- s1337: 15,988,039 bytes (11,961 byte headroom)
- s2025: 15,992,215 bytes (7,785 byte headroom)

Margins vs the legal open chronology:
- vs PR openai#1394 (1.08563): -0.01952 nats per token (margin +0.01452 over 0.005 bar)
- vs PR openai#1420 (1.08014): -0.00534 nats per token (margin +0.00034 over 0.005 bar)
- vs own PR openai#1413 (1.08279): -0.01218 nats per token

All four issue openai#1017 conditions remain verified for the n-gram tilt path.
@dexhunter

Thanks for the detailed legality note. For readers who have not followed #1017 closely, my understanding is that the main checklist is roughly:

  1. no dependence on x_t or future tokens,
  2. score under a proper normalized distribution,
  3. score before update,
  4. single left-to-right pass.

The score-before-update part looks fine to me. The part I am still unsure about is Condition 1 in the actual implementation.

In fused_expert_blend.cpp, inside get_hints_batch, the code reads tok = tokens_[p], derives is_bnd / is_ws from that token, and then uses those flags in within_hint(...) / word_hint(...) before the hint for position p is emitted. That seems to make the expert mix at position p depend on metadata of the current target token itself, not only on the strict prefix.

Could you clarify whether that is the intended reading? If the claim is that this is still prefix-only under #1017, a short explanation in the PR body (or an inline code comment) would help reviewers connect the implementation to the four conditions.

taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
Mined the top 20 open PRs at openai/parameter-golf and found that
PARALLEL RESIDUALS (compute attn + mlp in parallel from the same
pre-norm input) is in 3 of the top 6 recent records:
  PR openai#1437: SP8192 + Parallel Residuals + 3L Recurrence — val_bpb 1.07800
  PR openai#1420: Triple Loop + Parallel Residuals + N-gram Tilt — val_bpb 1.08014
  PR openai#1425: PROTEUS Parallel Residuals + INT5/INT6
We never tried it. Patch 13 adds USE_PARALLEL_RESIDUALS=1 which switches
Block.forward from serial (x = x + attn(x); x = x + mlp(x)) to parallel
(x = x + attn(LN(x)) + mlp(LN(x))). Idempotent, anchors on the first 3
lines of Block.forward which are invariant under Patch 11 (smear gate).

Also discovered LESSONS.md §29 ("depth recurrence is DEAD under GPTQ")
is contradicted by 5 of the top 10 recent records — they use depth
recurrence + mixed-precision INT5/INT6 instead of pure int6 GPTQ.
Worth re-investigating in a future research fire.

experiments.json — 4 new PR_* configs:
  PR0: parallel residuals alone (no n-gram, isolated effect)
  PR1: parallel + leaky_relu + full n-gram (current best stack + new trick)
  PR2: parallel + smear + leaky + full n-gram (max stack)
  PR3: PR1 with seed=42 for noise check

RESEARCH_LOG.md — full record of the research fire findings + the
queue of techniques to investigate in future fires (n-gram tilt, depth
recurrence, MuonEq-R, PartialRoPE+FA3, SwiGLU, codebooks).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
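The serial-vs-parallel residual switch described in the commit above can be sketched in a few lines (toy LayerNorm and lambda stand-ins for the real attention and MLP sublayers; everything here is illustrative):

```python
import random

def ln(x):
    """toy LayerNorm over a list (zero mean, unit variance, no affine)"""
    m = sum(x) / len(x)
    v = sum((xi - m) ** 2 for xi in x) / len(x)
    return [(xi - m) / (v + 1e-5) ** 0.5 for xi in x]

def serial_block(x, attn, mlp):
    x = [a + b for a, b in zip(x, attn(ln(x)))]    # x = x + attn(LN(x))
    return [a + b for a, b in zip(x, mlp(ln(x)))]  # x = x + mlp(LN(x))

def parallel_block(x, attn, mlp):
    h = ln(x)  # both sublayers read the SAME pre-norm input
    return [a + b + c for a, b, c in zip(x, attn(h), mlp(h))]

random.seed(0)
attn = lambda h: [0.5 * v for v in h]  # toy stand-ins for the real sublayers
mlp = lambda h: [v * v for v in h]
x = [random.gauss(0, 1) for _ in range(8)]
y_serial = serial_block(x, attn, mlp)
y_parallel = parallel_block(x, attn, mlp)
print(y_serial != y_parallel)  # True: same parameters, different function
```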
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
Subagent A (BPE-8192 trainer): the exact tokenizer is already on disk
at data/tokenizers/fineweb_8192_bpe.model (370,908 bytes, the literal
file behind LESSONS.md §18c -0.129 BPB Mac win). Just needs scp to pod.

Subagent B (closed/merged PR audit): top 8 merged records analyzed.
Frequency table reveals 5+ convergent techniques we DON'T have:
- SmearGate in 6/8 (75%)
- zstd-22 in 5/8 (62%)
- EMA 0.997 in 4+/8
- Partial RoPE in 2+/8
- XSA in 1/8 (PR openai#1019 = literal openai#1 record at 1.11473)
- AR Self-Gen GPTQ in 1/8 (also PR openai#1019)

Subagent C (N-gram Tilt): FOUND the definition. It's a multiplicative
single-token exponential boost from a causal eval-time n-gram cache:
  p_tilt(t) = p_model(t) · exp(β · [t==hint]) / Z
  Z = 1 + p_model(hint) · (exp(β) - 1)
Used by PRs openai#1437, openai#1420, openai#1430. Bespoke to parameter-golf, not in
any published paper. Delta: -0.0029 to -0.0055 BPB.

Subagent D (TTT researcher): full ~80-line Score-First TTT sketch
provided. Pattern: score chunk in inference_mode, train on chunk SGD,
move on. PR openai#461 framework. Cost ~410s on 8xH100. ~-0.0025 BPB.

Subagent E (records miner): top 5 records analyzed, EMA + XSA +
Parallel Muon are convergent best practices. We have leaky_relu and
that's all from the comp's stack.

8-action priority list compiled. Highest EV next: scp BPE-8192,
implement EMA, XSA, Partial RoPE, LN Scale.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun
Contributor Author

abaybektursun commented Apr 7, 2026

@dexhunter I can't secure 8xH100 right now; can you test this fix on your PR if you have access?

1. Hint gating (lines 400-409): is_bnd/is_ws now derived from tokens_[p-1] instead of tokens_[p]
   - Fixes the Rule 1 causal violation @Gusanidas identified
   - The probability distribution at position p no longer depends on x_p
2. Update functions (lines 448-456): new tok_is_bnd/tok_is_ws flags derived from the actual target tok
   - Ensures within_update() and word_update() still correctly track word boundaries using the real token
   - Without this, the state machine would segment words incorrectly, corrupting future hints

@abaybektursun abaybektursun force-pushed the submission/triple-loop-fused-ngram branch 2 times, most recently from 7a70e6d to 5e2eff8 Compare April 7, 2026 14:08
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
…A captured

Subagent extracted the canonical formula from PR openai#1420 (the source for
PR openai#1437 and the entire Legal N-gram Tilt family):

  p_tilt(x_t) = p_model(x_t) * exp(beta * 1[x_t == hint]) / Z
  Z = 1 + p_model(hint) * (exp(beta) - 1)

Verified legal under issue openai#1017 four conditions (causal, normalized,
score-before-update, single-pass). Genuinely different from EM-INF
(last fire's PASS) — multiplicative reweighting using external signal,
not entropy sharpening.

DEFERRED code patch despite high confidence because:
1. Eval-only metric — our loop measures train_loss with SKIP_FINAL_EVAL=1
2. Subagent's "50 LOC sketch" has O(L^2) forward-pass bug, real impl is 150+
3. Modifying eval pipeline risks breaking FINAL int8_zlib_roundtrip path

Marked HIGH PRIORITY for next H100 escalation cycle. Estimated +0.0015-0.0030
BPB at our SP-1024 vocab size — same order as largest single-technique gains.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
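The tilt formula quoted in the commit above is easy to sanity-check for proper normalization (a sketch; beta and the toy four-token distribution are illustrative, not values from any submission):

```python
import math

def apply_tilt(p_model, hint, beta):
    """p_tilt(t) = p_model(t) * exp(beta * [t == hint]) / Z,
    Z = 1 + p_model[hint] * (exp(beta) - 1)  (formula quoted above)"""
    Z = 1.0 + p_model[hint] * (math.exp(beta) - 1.0)
    return [p * (math.exp(beta) if t == hint else 1.0) / Z
            for t, p in enumerate(p_model)]

p = [0.5, 0.3, 0.15, 0.05]           # toy model distribution
q = apply_tilt(p, hint=1, beta=1.2)  # boost the n-gram-predicted token
print(abs(sum(q) - 1.0) < 1e-12, q[1] > p[1], q[0] < p[0])  # True True True
```

Because Z is exactly the post-tilt mass, the result is a proper distribution, satisfying condition 2 of the #1017 checklist.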
@dexhunter

@abaybektursun Yep, I'm running a fixed version right now; the 5 seeds will take about 85 min. (My agents) will report once we have more results

@dexhunter

dexhunter commented Apr 7, 2026

@abaybektursun Confirming I tested the exact fix you described in the comment above, on my fork of the kernel for PR #1437. Results below — your fix is structurally correct, and we converged on essentially the same patch independently.

Summary: the causal patch is correct, BUT the within/word experts contribute essentially nothing once the leak is removed. The cleanest legal mode is to also set within_beta=0 word_beta=0, which gives strictly better BPB than the patched-but-still-active version.

My fix vs your proposal: identical structure.

// Inside get_hints_batch, for each position p:
auto tok = uint16_t(tokens_[p]);          // for updates only (causal — runs after hint emit)
bool is_bnd = is_bnd_ && is_bnd_[tok];    // for updates only
bool is_ws  = has_ls_ && has_ls_[tok];    // for updates only

// CAUSAL FIX: hint gating must use prefix-only metadata (tokens_[p-1]).
bool prev_is_bnd = false, prev_is_ws = false;
if (p > 0) {
    uint16_t prev_tok = uint16_t(tokens_[p - 1]);
    prev_is_bnd = is_bnd_ && is_bnd_[prev_tok];
    prev_is_ws  = has_ls_ && has_ls_[prev_tok];
}

// ... compute hints with prefix-only gates ...
within_hint(prev_is_bnd, prev_is_ws, ...);
word_hint(prev_is_ws, ...);

// Updates still use current tok (causal because they run after hint is locked):
token_update(hashes, max_avail, tok);
within_update(tok, is_bnd, is_ws);
word_update(tok, is_bnd, is_ws);

This matches your description exactly: prefix-only is_bnd/is_ws for hint gating, current-token is_bnd/is_ws for updates so the state machine still segments words correctly.

Measurements (s42, sp8192 + par7+loop35+ngram + score-first TTT, 8×H100 SXM, 600 s):

| Config | TTT BPB | Δ vs non-causal |
| --- | --- | --- |
| Original kernel (non-causal, illegal) | 1.07809 | n/a |
| Your fix (== my fix): prev-tok hint gate, current-tok updates | 1.08108 | +0.00299 |
| Token-only mode: NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0 (only token_hint contributes) | 1.07951 | +0.00142 |
| Token-only + your kernel patch (recommended) | 1.08097 | +0.00288 |

Reading: the leak was worth ~+0.0027-0.0029 nats of TTT BPB, consistent across stacks I tested (with/without VR + Hessian SDClip). The structural fix recovers all of the causality, but the within/word experts then make BPB strictly worse than leaving them at beta=0. The reason is that under prefix-only gating, the within hint fires for word-start positions too (and predicts a mid-word token, which is wrong) and the word hint fires for mid-word positions too (also wrong). The agree-bonus mechanism doesn't fully compensate. So:

Cleanest legal mode: ship the kernel patch AND set within_beta = word_beta = 0. Then token_hint (which has always been causal — compute_hashes only reads tokens[pos - k - 1]) is the only contributor and you get a cleanly auditable result.

For PR #1420: applying the same correction to your reported 1.08014 5-seed mean would put it at approximately 1.08300 post-fix (assuming the leak constant is the same on your stack). Happy to test directly on your branch if you push the kernel patch.

For PR #1437: I'm updating my submission to ship the corrected (token-only) version with a transparency note. The corrected 5-seed mean estimate is ~1.08095 (vs the originally reported 1.07807). PR #1437 will no longer claim a record under the corrected number, but the public record will be honest about the bug and the correction. I'll ping back here once the seed sweep finishes (~30 min).

Thanks again for catching this and for the quick fix proposal — collaboration was painless.

@abaybektursun
Contributor Author

@dexhunter Sweet thanks, yes can you please run my latest code as well? I pushed the fix

dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
…-only experts

The original n-gram tilt kernel inherited from PR openai#1420 had a causality bug:
within_hint() and word_hint() in fused_expert_kernel.cpp::get_hints_batch
gated their emission on is_bnd[tokens_[p]] / is_ws[tokens_[p]] (target token
metadata at the position being scored), leaking 1-2 bits about the answer
per scored position. This is an Issue openai#1017 condition 2 violation.

PR openai#1420 has the identical bug. @abaybektursun has acknowledged it in PR
openai#1420's thread and proposed the same fix that's applied here:

  * fused_expert_kernel.cpp: derive is_bnd / is_ws from tokens_[p-1] (last
    prefix token) for hint gating. Updates use the actual current tok via
    new tok_is_bnd / tok_is_ws variables so within_update / word_update
    still segment words correctly. Variable naming and structure copied
    verbatim from PR openai#1420's fix.
  * Run command updated to set NGRAM_WITHIN_BETA=0 NGRAM_WORD_BETA=0.
    Empirically the within / word experts under prefix-only gating fire
    for the wrong positions (within fires for word-starts, word fires for
    mid-word) and contribute *negative* BPB. Disabling them gives 1.07951
    on s42 vs 1.08108 with the experts active — token_hint is the only
    legitimate contributor.

5-seed verification (all on the patched kernel):

    seed   pre-fix   corrected  delta
    0      1.07751   1.08035    +0.00284
    42     1.07809   1.08097    +0.00288
    1234   1.07813   1.08127    +0.00314
    1337   1.07801   1.08060    +0.00259
    2025   1.07862   1.08135    +0.00273
    mean   1.07807   1.08091    +0.00284

All 5 artifacts fit under 16 MB (15,988,802 - 15,995,572 bytes; 4.4-11.2 KB
headroom). Pre-fix per-seed values preserved in submission.json under
seed_results_pre_fix for the public record.

Bar comparisons (corrected mean 1.08091):

    PR openai#1394 (1.08563): beats by +0.00472, fails 0.005 nat record bar
    PR openai#1413 ours (1.08279): beats by +0.00188, fails record bar
    PR openai#1420 (1.08014): we lose by 0.00077 (PR openai#1420 also tainted by the
                        same bug; would correct to ~1.08300 post-fix)

This PR is left open as a transparency / diagnostic record, NOT as a record
claim. PR openai#1413 (no n-gram tilt at all) at 1.08279 remains our cleanest
legal anchor. The README has been retitled "Diagnostic (causal-corrected)"
and the legality fix is documented in a dedicated section.
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 7, 2026
…ikely illegal), merged SOTA unchanged

- PR openai#1430 (renqianluo, Apr 7): claims 0.39642 bpb via per-sample SLOT + n-gram order-22 hash + TTT. Flagged likely illegal: n-gram hash cache matches closed openai#727/openai#741 pattern; SLOT unruled (Issue openai#140). No organizer reviews yet.
- Merged SOTA unchanged at 1.1147 (PR openai#1019)
- Issue openai#140: no new rulings on SLOT, causal SLOT, or ETLB
- Legal path unchanged: PR openai#1420 stack (SP8192 + Triple Loop + N-gram Tilt + Legal TTT) targeting ~1.075–1.077
- No new breakthrough papers beyond existing tracking

https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC
sunnypatneedi pushed a commit to sunnypatneedi/parameter-golf that referenced this pull request Apr 7, 2026
…ctions

- N-gram Tilt bug: PR openai#1420 kernel is non-causal; PR openai#1437 (dexhunter) found/fixed it
  (pre-fix 1.07807 → post-fix 1.08091). Updated primary reference to PR openai#1437 kernel.
- PR openai#1423 flagged illegal (pre-quant TTT, same as openai#1351/openai#1408/openai#1416)
- Added full PR openai#1421–1444 scan results
- Updated best open legal PR: ~1.08091 (PR openai#1437) not 1.08014 (openai#1420)
- Session 8 lessons learned added to CLAUDE.md

https://claude.ai/code/session_01XLD5qpZfXpmJPnuT9kSnPC
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 7, 2026
…led, 2 new PRs validate deferred specs

Patches 15/16/21 still uncontested in 150+ open + 10 closed PRs (5 audits
in a row). Strong evidence of true novelty.

PR #1430 still OPEN, 0 comments, no comp owner activity since creation.
Increasingly likely to be reverted or outlawed.

NEW PRs validate two of our deferred H100 escalation specs:
  - PR #1445 (1.0889): "Depth Recurrence + EMA 0.9965" → validates Patch 17 EMA spec
  - PR #1446 (1.0960): "int6 GPTQ + lzma" → validates Patch 23 INT6 GPTQ-Lite spec

Combined with PR #1437/#1420 already validating Patch 23 N-gram Tilt, the
3-spec H100 escalation bundle (EMA + Tilt + INT6 GPTQ) is now triple-
confirmed by independent comp PRs.

Spend ~$3.00/$36 (8% utilization). Pod healthy at 6h uptime.

Reminder: depth recurrence is back on the table — 5+ records use it now.
LESSONS.md §29 needs another update from "stale" to "real direction".

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
abaybektursun pushed a commit to abaybektursun/parameter-golf that referenced this pull request Apr 7, 2026
Deep review of train_gpt.py reveals ttt_adapt_adamw() trains on val
data for 10 full epochs (TTT_EPOCHS=10, TTT_ENABLED=1 by default)
before quantization. This is the same pre-quantization TTT violation
as PRs openai#1423 and openai#1416 — the artifact encodes information from the
entire validation set, violating strict causal dependence.

The ~0.04-0.05 BPB improvement from dTTT is entirely attributable
to fitting the test set.

Best verified-valid score updated to 1.0801 BPB (PR openai#1420).

https://claude.ai/code/session_017F8GGeKA7MhUoQdqMGcTpg
dexhunter added a commit to dexhunter/parameter-golf that referenced this pull request Apr 7, 2026
…reshold

The previous "Diagnostic" framing was based on a unit error: I compared
val_bpb deltas as if they were nats-per-token deltas, missing the factor
of ~2.583 (mean bytes per token in the sp8192 val set, computable directly
from this submission's val_loss / val_bpb ratio).

With the correct units, the causal-corrected 5-seed mean (1.08091 BPB,
2.79210 nats/token) clears the 0.005-nat record bar against PR openai#1394:

  vs PR openai#1394 (1.08563): +0.01219 nats per token  ✅ 2.4× the bar
  vs PR openai#1019 (1.11473): +0.08736 nats per token  ✅ comfortably
  vs PR openai#1413 (ours):    +0.00486 nats per token  — essentially tied
  vs PR openai#1420 (1.08014): -0.00199 nats — but PR openai#1420 has the same kernel
                          bug; its corrected ~1.08298 yields +0.00535 nats ✅

Title reverted from "Diagnostic (causal-corrected)" to "Record". The
legality fix section is preserved (the kernel patch is still a real
correctness fix matching @abaybektursun's proposed patch in PR openai#1420).
The leak magnitude in the legality fix section now correctly states
"+0.00284 BPB ≈ +0.00734 nats per token" instead of just BPB.

Pre-fix per-seed values are still preserved in submission.json under
seed_results_pre_fix for the public record.
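The unit bridge in this commit is easy to reproduce from the reported numbers: the ~2.583 factor is the val_loss / val_bpb ratio (dimensionally, ln 2 times the mean bytes per token), so a BPB delta converts to nats per token by multiplication:

```python
# this submission's reported pair (from the commit above)
val_loss_nats, val_bpb = 2.79210, 1.08091
factor = val_loss_nats / val_bpb      # BPB -> nats-per-token conversion factor
leak_bpb = 0.00284                    # leak magnitude in BPB
leak_nats = leak_bpb * factor
print(round(factor, 3), round(leak_nats, 5))  # 2.583 0.00734
```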
@abaybektursun abaybektursun force-pushed the submission/triple-loop-fused-ngram branch from 5e2eff8 to f265f65 Compare April 8, 2026 00:22
…bpb 1.08014

5-seed mean 1.08014 BPB (std=0.0004), best seed 1.07971.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@abaybektursun abaybektursun force-pushed the submission/triple-loop-fused-ngram branch from f265f65 to d2bda6f Compare April 8, 2026 00:40
@abaybektursun abaybektursun changed the title Record: Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt; val_bpb 1.08014 (5-seed mean) Record: Triple Loop + Fused Kernels + Parallel Residuals + N-gram Tilt; val_bpb 1.08309 (5-seed mean) Apr 8, 2026
PhamPhuHoa-23 added a commit to angela231005/parameter-golf that referenced this pull request Apr 8, 2026
abaybektursun pushed a commit to abaybektursun/parameter-golf that referenced this pull request Apr 8, 2026
Documents the Rule 1 causal violation in PR openai#1420's n-gram tilt code,
including why it was hard to spot, a concrete detection checklist, and
the fix pattern of separating prefix-only flags from target flags.

https://claude.ai/code/session_017F8GGeKA7MhUoQdqMGcTpg
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
Thesis: the speed path is the most underutilized section of openai/parameter-golf.
The quality path has 170+ PRs; the speed path has maybe 30 and 2-3 genuine novelties.
Our 13x per-GPU gap vs comp records is almost entirely soft — most of it collapses
under free wins + comp ports.

Findings:

TIER 0 FREE WINS (before any kernel work) — ~3x speedup, 2-3 days total:
- Shot 0a: drop grad_accum_steps 8→1. The single biggest easy win hiding in
  plain sight. We're paying 8x kernel-launch overhead because grad_accum was
  inherited from an 8xGPU distributed config. 5 LOC, 30-50% speedup.
- Shot 0b: eval batched + streaming KV cache. Current sliding-window eval is
  625K sequential forwards at B=1 stride=64. 97% of each window's context is
  shared with the previous. Streaming KV (StreamingLLM arXiv 2309.17453) gives
  5-15x eval speedup, saves 3-5 min of the 600s budget.
- Shot 0c: SkyLadder progressive seq_len 256→2048 (NeurIPS 2025 arXiv
  2503.15450). 22% throughput + 1-3.7% quality. Already in Mac SETUP §35
  backlog, never shipped.
- Shot 0d: train-data GPTQ calibration (PR openai#1219, comp-organizer-approved).
  Replaces 220s AR self-gen with 14s. +2000 extra training steps.
- Free: TORCHINDUCTOR_MIX_ORDER_REDUCTION=0 + torch 2.9.1 pin. +8.8% step time.

TIER 2 COMP-PORT WINS we missed in the original Phase 2 plan:
- Shot 9: FA3 varlen + window + mixed seq_len across GPUs (PR openai#1212 holds the
  fastest step in the leaderboard at 69.6 ms/step)
- Shot 10: Parameter Banking + Parallel Muon (PR openai#399): 66 nn.Linear → 4
  contiguous 3D banks → Newton-Schulz becomes one bmm → optimizer time 19.7 ms
  → 1.3 ms (15x). World-novel, NOT in modded-nanogpt.
- Shot 11: CUTLASS EVT backward with the novel `post=0.5·act_grad·pre` identity
  (PRs openai#1105, openai#1420). Identity itself looks world-novel.
- Shots 13-14: eval path wins (Triton KV-cache backend, fused softcap+CE
  megakernel). Combined eval speedup ~5x on top of Shot 0b.

TIER 3 BIG DREAMS (world-first opportunities):
- Megadream 1: **Training megakernel** (fwd+bwd+optim in a single persistent
  SM kernel). HazyResearch / Mirage / MegaQwen have inference megakernels;
  nobody has built one for TRAINING. 1.3us × ~600 launches per step = 16% of
  our step budget is pure launch overhead. 5-7 days, 500-1500 LOC, ThunderKittens
  templates. Potential PhD-defensible mini-paper.
- Megadream 2: **Streaming KV sliding-window eval** (our Shot 0b, also novel)
- Megadream 3: **Fuzzy LR bandit per microbatch** — user's "dial-in" hint
  operationalized. Thompson sampling from {0.5x, 1x, 2x} * base_lr. 80 LOC.
- Megadream 4: **CPU n-gram precompute thread** — user's "CPU while GPU" hint
  operationalized. BG thread pre-computes n-gram hash tensors, 50 LOC.
- Megadream 5: **GPU-resident successive halving** — user's "GPU tests" hint
  operationalized. Run 4 replicas × 100 steps inside the 600s budget, pick
  winner, continue. Online hyperband. 200 LOC.
- Megadream 6: **AOTInductor precompile + binary ship** — kill the 5+ min
  compile cold-start permanently.

Stacked expected impact:
- Phase 1 (now): 180 steps / 600s, val_bpb ~1.4-1.6
- +Tier 0 free wins: ~540 steps, val_bpb ~1.25-1.35
- +Tier 1 kernel work: ~2000 steps, val_bpb ~1.15-1.22
- +Tier 2 comp ports: ~4000 steps, val_bpb ~1.10-1.15
- +Tier 3 Megadream 1 (training megakernel): ~8000 steps, val_bpb ~1.08-1.12
- +Tier 3 all: ~10000 steps, val_bpb ~1.06-1.10 (**ahead of comp on 1xH100**)

10000 steps on 1xH100 = 4x more per-GPU training than the comp's 20000 on 8xH100.
That's where val_bpb drops BELOW comp records.

Key finding: eval path holds the biggest speed wins currently, not training.
Our sliding-window eval eats 10-15 min of the 600s budget. Tier 0b + Tier 2
Shots 13-14 save 5-8 min per eval pass. More than any training-side single
patch would buy at our current rate.

Source reports: /tmp/phase2_comp_speed_audit.md (22 PRs surveyed),
/tmp/phase2_world_speed_research.md (12 research areas surveyed).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/paramgolf that referenced this pull request Apr 9, 2026
1. phase2/metrics.py (~330 lines)
   - Structured JSONL telemetry for per-step timing + GPU/CPU/RAM utilization
   - mark(event, **extra) for phase-level events (setup done, train started, etc)
   - step(step, ms, train_loss, tok_per_sec, prefetch_queue_depth, ...) hot-path
   - Best-effort nvidia-smi + /proc/meminfo + torch.cuda.memory_allocated readers
     so the helper has no new deps (no pynvml, no psutil)
   - Sparse nvidia-smi sampling (every 50 steps by default) to avoid per-step cost
   - print_summary() for end-of-run table, compare_runs() for before/after
   - Smoke test in __main__ passes
   - Used by every subsequent Phase 2 shot so we can measure the speedup and
     verify the val_bpb invariant

2. submission/run.sh: free env var wins (Tier 0, zero risk, zero LOC)
   - TORCHINDUCTOR_MIX_ORDER_REDUCTION=0 — disables the Inductor pass that
     @abaybektursun fixed upstream in pytorch#179494 / #179422 specifically for
     this comp's shape. Per PR openai#1420 this gives +5.93 ms/step (+8.8%).
   - PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True,garbage_collection_threshold:0.8
     reduces fragmentation and avoids cudaMalloc stalls during training.
     Respects user override if already set.
   - Honest note on grad_accum: the research agent's "drop 8 → 1 for 30-50% win"
     claim is wrong. Peak activation memory at grad_accum=1 microbatch=384 seqs
     is ~448 GB (vs our current 56 GB at microbatch=48), blows H100 80GB 8×.
     We KEEP grad_accum=8 for world_size=1 at the current TRAIN_BATCH_TOKENS.
     Documented in the script so future-me doesn't fall for it again.

Next: data prefetch thread + pinned RAM (Task 9), then compile cache warmup
(Task 10), then CPU n-gram precompute (Task 11), then wire phase2/bootstrap
(Task 12).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>