Record: 11L CANON-AC(last5)+DeltaGate Report (Humble Record Attempt, val_bpb: 1.1296) #400
…nai#400 openai#369 openai#398)
KEY DISCOVERY: PR#414 stacks EMA + Tight SWA together (-0.0006 BPB free)
GPTQ should be per-ROW not per-matrix (-0.0006 BPB)
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Community Review - Record: 11L CANON-AC(last5)+DeltaGate Report (Humble Record Attempt, val_bpb: 1.1296)
Compliance: NEEDS AUTHOR ACTION
What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:
A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep:
Recommendation: Could you run
Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.
Reviewed by @MatoTeziTanka, The Agora. CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL, SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0: invalid start byte (line 1). Classification via
@MatoTeziTanka sorry, I checked and it worked perfectly well.
Correction: Community Review for PR #400
@chanwoo-park-official you're right, my apologies. I re-ran the audit against the actual head file (
Corrected review (audited head SHA
BPB: 1.1296 (sliding-window val_bpb, seed 1337) | Compliance: LOOKS CLEAN
What I checked manually:
Verdict: LOOKS CLEAN. BigramHash + CANON(last5) + DeltaGate + tight SWA + int6 mixed quant is a legal stack against the current leaderboard pattern. The 1.1296 → 1.1297 → 1.1303 across 3 seeds is a nice tight spread.
Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation: already reported; under-16MB artifact cap: 15.58MB per the PR body, so right under the wire; ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the re-audit.
Once again sorry for the noise: the IMPORT_FAIL false positive on this PR has been retracted and I'm adding a guard to the bulk-smoke runner to prevent it on other PRs that delete/rename record folders.
Reviewed by @MatoTeziTanka, The Agora. Manual re-audit after author pushback. Original false-positive IMPORT_FAIL retracted.
Also, thanks for engaging with this, seriously. The whole reason I started running these community reviews in the first place is that 900-ish PRs have been sitting open with zero maintainer response, and a lot of good work (yours included) gets buried under the pile. One-sided drive-bys from me aren't worth much; the back-and-forth is what actually catches mistakes like the runner bug that fired on your PR. I'd much rather take a public L on a false positive and get corrected in five minutes than have nobody look at any of these at all. If more authors push back when I'm wrong I'll end up with a much better classifier, and the leaderboard gets a more trustworthy signal on what's legal vs. what isn't.
So, appreciated. Hope the mods chime in on the 1.1296 record soon, because on re-read this looks like a real one.
@MatoTeziTanka No problem at all! All of the things that you are doing are good for the community, so thx for your dedication!
Retraction: this IMPORT_FAIL was a deleted-file artifact in my smoke runner
Sorry @chanwoo-park-official, this one's on me. I re-audited the
What happened: Your PR deletes 5 old Verified at head The real
Your PR is not broken by this error. I'm retracting the IMPORT_FAIL classification. I'll re-queue the full compliance audit and post findings separately.
Again, sorry for the noise. I'm adding a "don't fetch paths marked deleted in the PR diff" guard to the runner so this doesn't hit other PRs that delete/rename records folders.
Small correction to my own re-audit above: I was sloppy with the "looks like a real one" framing and the MERGE recommendation. The current leaderboard floor for the 10min/16MB track is 1.0810 (bigbag, PR #1493, SP8192 + 3-layer recurrence + parallel residuals + legal TTT), and there are 16 entries between 1.0810 and your 1.1296. Your result would slot in between the 1.1271 and 1.1307 entries from 2026-03-20, not near the top.
So: the compliance verdict stands (code is clean, legal pure-neural stack, no flags) but "MERGE as record" is the wrong recommendation. It's a legal but non-winning result, not a new record. The title of your PR already says "Humble Record Attempt" so you clearly knew this; the error was mine in the correction comment. Sorry for muddying it.
Still a solid clean attempt on a known-legal stack; just wanted to correct the record-vs-not-record framing before it sits here unchallenged.
@MatoTeziTanka Yes, you are correct. This was top 2 when I posted it, but it is already April 11th, so there have been many new attempts.
Summary
This run builds on the current leaderboard-aligned stack (official + pending-validated direction) and focuses on a scoped CANON placement with a CANON delta gate.
Best observed result in this sweep:
final_int6_sliding_window_exact val_bpb: 1.12961770
Compared to my previous PR #312:
1.16682362 -> 1.12961770 (large improvement)
Quick Comparison (vs #312)
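| Run | Steps | Sliding-window val_bpb | Submission size |
| --- | --- | --- | --- |
| PR #312 | | 1.16682362 | 13,267,347 bytes |
| Seed 1337 | 6278 | 1.12961770 | 15,581,348 bytes |
| Seed 1336 | 6243 | 1.1303 | 15,505,544 bytes |
| Seed 1335 | 6252 | 1.12970337 | 15,579,865 bytes |

What Was Reused From Current Leaderboard (not unofficial-only additions)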
This run intentionally reuses patterns already common in official/pending leaderboard entries, to check whether CANON layers are viable on top of them:
- XSA on the last 4 blocks (XSA_LAST_N=4)
- RoPE (ROPE_DIMS=16) + LN Scale
- Sliding-window evaluation (stride=64)

Main Configuration (this report)
- CANON_SET=AC
- CANON_LAST_N=5
- CANON_DELTA_GATE=1
- SWA_ENABLED=1, TIGHT_SWA=1, TIGHT_SWA_EVERY=50, TIGHT_SWA_START_LRMUL=0.2, TIGHT_SWA_MAX_CHECKPOINTS=12
- TRAIN_BATCH_TOKENS=786432, wallclock-capped run (MAX_WALLCLOCK_SECONDS=600)

Definitions (for this report)
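As a rough illustration of the gated mixing that the definitions in this section describe, here is a minimal sketch using plain Python lists in place of tensors. The helper names `sigmoid` and `gated_mix` and the toy vectors are hypothetical, not taken from the training script. The value of XSA_GATE_INIT is not stated in this report, so -4.0 is reused purely for illustration, and applying the same sigmoid form to the CANON delta gate is an assumption made only to show why an init of -4.0 starts out near-identity.

```python
import math


def sigmoid(g: float) -> float:
    """Logistic function: maps a raw gate parameter to a weight in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-g))


def gated_mix(y, y_alt, g):
    """Element-wise y + sigmoid(g) * (y_alt - y)."""
    w = sigmoid(g)
    return [a + w * (b - a) for a, b in zip(y, y_alt)]


# Gate weight at an init value of -4.0 (a small weight, so the base path dominates).
init_weight = sigmoid(-4.0)

# Toy vectors standing in for the two outputs being mixed (illustrative values only).
y = [1.0, 2.0, 3.0]
y_xsa = [2.0, 0.0, 3.0]

near_identity = gated_mix(y, y_xsa, -4.0)  # stays close to y
half_mix = gated_mix(y, y_xsa, 0.0)        # sigmoid(0) = 0.5, i.e. the midpoint

print(f"gate weight at init: {init_weight:.4f}")
print("near-identity output:", [round(v, 3) for v in near_identity])
print("half-mixed output:", half_mix)
```

With a strongly negative init the gated branch contributes only a small fraction at the start of training, and the optimizer can raise `g` if the extra path turns out to be useful.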
- Delta (in AC(last5)+delta) means the CANON delta gate (CANON_DELTA_GATE=1). In this work, CANON_DELTA_GATE_INIT (g) = -4.0, which makes initialization near-identity and lets the model learn how strongly to use the CANON path during training.
- Last 4 means XSA is enabled only on the last 4 transformer blocks: XSA_LAST_N=4
- XSA learnable gate means an extra learnable scalar that mixes normal attention output and XSA output: y <- y + sigmoid(g) * (y_xsa - y)
- Relevant variables: XSA_LEARNABLE_GATE and XSA_GATE_INIT

Final Run Command (renamed RUN_ID)
Results
Seed-level excerpts
- 1337:
  step:6278/7000 val_loss:1.9339 val_bpb:1.1454
  final_int6_sliding_window_exact val_loss:1.90730712 val_bpb:1.12961770
  Total submission size int6+zstd: 15581348 bytes
- 1335:
  step:6252/7000 val_loss:1.9349 val_bpb:1.1460
  final_int6_sliding_window_exact val_loss:1.90745178 val_bpb:1.12970337
  Total submission size int6+zstd: 15579865 bytes
- 1336:
  step:6243/7000 val_loss:1.9365 val_bpb:1.1469
  final_int6_sliding_window_exact val_bpb: 1.1303
  Total submission size int6+zstd: 15505544 bytes

Wallclock / speed notes
6930 steps under the same cap (faster, but lower quality).

Ablations (sliding-window val_bpb)
- ACD: 1.14083538
- AC(broad): 1.13218808
- AC(first 4 layers): 1.1314
- 1.13587538 -- it was faster, but it doesn't have a better bpb.
- AC(last5)+delta: best observed 1.1296
- XSA learnable gate (XSA_LEARNABLE_GATE=1): not helpful here (~1.131)

Comparison vs Previous PR
Previous: #312
final_int6_sliding_window_exact val_bpb: 1.16682362
Current best in this report:
final_int6_sliding_window_exact val_bpb: 1.12961770
Approx improvement:
Δ bpb = -0.03720592
Δ nats ≈ -0.0258 (using bpb * ln(2) conversion)

Significance Note
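As a quick arithmetic check of the improvement figures and the nat margin discussed in this note, the sketch below recomputes them from the values quoted in this report. It follows the report's own bpb * ln(2) conversion, which yields nats per byte; whether the 0.005 margin is defined per byte or per token is not stated here, so treat the margin check as an assumption. The constant and function names are hypothetical.

```python
import math

# Values quoted in this report.
PREV_BPB = 1.16682362   # previous PR #312
BEST_BPB = 1.12961770   # best run in this report (seed 1337)
SOTA_BPB = 1.1428       # official SOTA context quoted in this report
MARGIN_NATS = 0.005     # required improvement margin quoted in this report


def bpb_to_nats(bpb: float) -> float:
    """Convert bits per byte to nats per byte via bpb * ln(2)."""
    return bpb * math.log(2)


# Negative delta means the new run is better (lower bpb).
delta_bpb = BEST_BPB - PREV_BPB

# Improvements expressed as positive gains in nats.
gain_vs_prev_nats = bpb_to_nats(PREV_BPB - BEST_BPB)
gain_vs_sota_nats = bpb_to_nats(SOTA_BPB - BEST_BPB)

# Point-estimate margin check only; this is not a significance test.
clears_margin = gain_vs_sota_nats >= MARGIN_NATS

print(f"delta bpb vs #312:  {delta_bpb:.8f}")
print(f"gain vs #312 (nats): {gain_vs_prev_nats:.4f}")
print(f"gain vs SOTA (nats): {gain_vs_sota_nats:.4f}")
print(f"clears {MARGIN_NATS} nat margin: {clears_margin}")
```

This covers only the single best seed; a formal claim would still need the per-seed values and a statistical test.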
Against official SOTA context (1.1428 BPB), this run clears the >=0.005 nat improvement margin by a comfortable amount in point estimate.
For formal p < 0.01 reporting, include the completed 3-seed list (1335/1336/1337) and test output in PR comments.

Humble Notes