
Record: 11L CANON-AC(last5)+DeltaGate Report (Humble Record Attempt, val_bpb: 1.1296)#400

Open
chanwoo-park-official wants to merge 4 commits into openai:main from chanwoo-park-official:pr/canon-deltagate-11l

Conversation


@chanwoo-park-official chanwoo-park-official commented Mar 22, 2026

Summary

This run builds on the current leaderboard-aligned stack (official + pending-validated direction) and focuses on a scoped CANON placement with a CANON delta gate.

Best observed result in this sweep:

  • final_int6_sliding_window_exact val_bpb: 1.12961770

Compared to my previous PR #312:

  • 1.16682362 -> 1.12961770 (large improvement)

Quick Comparison (vs #312)

| Run | Setup | Steps before wallclock stop | Final sliding-window val_bpb | Submission size (int6+zstd) |
| --- | --- | --- | --- | --- |
| Previous #312 | ACD (all) + SWA | 7210 (default batch size) | 1.16682362 | 13,267,347 bytes |
| This work (seed 1337) | AC(last5)+delta+tightSWA | 6278 | 1.12961770 | 15,581,348 bytes |
| This work (seed 1336) | AC(last5)+delta+tightSWA | 6243 | 1.1303 | 15,505,544 bytes |
| This work (seed 1335) | AC(last5)+delta+tightSWA | 6252 | 1.12970337 | 15,579,865 bytes |

What Was Reused From Current Leaderboard (not unofficial-only additions)

This run intentionally reuses patterns already common in official/pending leaderboard entries, in order to isolate the contribution of the CANON layers:

  • 11L / 512-dim / GQA (8 heads, 4 KV heads), MLP 3x
  • BigramHash + SmearGate
  • XSA on last 4 layers (XSA_LAST_N=4)
  • Partial RoPE (ROPE_DIMS=16) + LN Scale
  • Late QAT
  • WD 0.04, Tight SWA schedule
  • Sliding-window eval (stride=64)
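As a hedged illustration of the sliding-window eval in the list above (stride=64): each window advances by the stride and only the not-yet-scored tail of each window is scored, so every token is evaluated once with the longest available left context. This is a sketch, not the repo's actual `eval_val_sliding`; the function name and `bytes_per_token` parameter are illustrative.

```python
import math
import torch

def sliding_window_bpb(model, tokens, window=2048, stride=64, bytes_per_token=1.0):
    """Score a 1-D token stream with overlapping windows (illustrative sketch).

    Windows advance by `stride`; the first window is scored in full, later
    windows score only tokens not yet covered, so each token is scored
    exactly once with maximal left context.
    """
    model.eval()
    total_nats, n_scored = 0.0, 0
    next_to_score = 1  # token 0 has no left context, so scoring starts at token 1
    with torch.inference_mode():
        for start in range(0, len(tokens), stride):
            wlen = min(window, len(tokens) - start)
            if wlen < 2:
                break
            x = tokens[start:start + wlen].unsqueeze(0)
            logits = model(x[:, :-1])                      # (1, wlen-1, vocab)
            logp = torch.log_softmax(logits.float(), dim=-1)
            nll = -logp.gather(-1, x[:, 1:].unsqueeze(-1)).squeeze(-1)
            s = max(next_to_score - start - 1, 0)          # first nll index not yet scored
            total_nats += nll[0, s:].sum().item()
            n_scored += nll.shape[1] - s
            next_to_score = start + wlen
            if next_to_score >= len(tokens):
                break
    # nats/token -> bits/token (divide by ln 2) -> bits/byte
    return total_nats / n_scored / math.log(2) / bytes_per_token
```

With a uniform model over a vocabulary of size V, this returns log2(V) bits per byte (at one byte per token), which is a quick sanity check for the scorer.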

Main Configuration (this report)

  • CANON_SET=AC
  • CANON_LAST_N=5
  • CANON_DELTA_GATE=1
  • SWA_ENABLED=1, TIGHT_SWA=1, TIGHT_SWA_EVERY=50, TIGHT_SWA_START_LRMUL=0.2, TIGHT_SWA_MAX_CHECKPOINTS=12
  • TRAIN_BATCH_TOKENS=786432, wallclock-capped run (MAX_WALLCLOCK_SECONDS=600)

Definitions (for this report)

Delta (in AC(last5)+delta) means CANON delta gate (CANON_DELTA_GATE=1).

delta = CanonConv(X)
g = sigmoid(alpha)
Y = X + g * delta      # residual mode
Y = g * delta          # non-residual mode

In this work, CANON_DELTA_GATE_INIT=-4.0 sets the initial gate logit alpha, which makes the block near-identity at initialization (sigmoid(-4.0) ≈ 0.018) and lets the model learn how strongly to use the CANON path during training.
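A minimal PyTorch sketch of the gated delta path defined above. The class name is illustrative; the depthwise causal conv is an assumption based on CANON_KERNEL=3, CANON_BIAS=0, CANON_ACTIVATION=0, and is not the repo's actual CanonConv.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CanonDeltaGate(nn.Module):
    """Illustrative gated CANON path: Y = X + sigmoid(alpha) * CanonConv(X).

    With gate_init=-4.0, sigmoid(-4.0) ~= 0.018, so the block starts
    near-identity and training learns how strongly to use the conv path.
    """
    def __init__(self, dim, kernel=3, gate_init=-4.0, residual=True):
        super().__init__()
        # Depthwise conv over the sequence dimension (assumption: the PR
        # only specifies CANON_KERNEL=3 and CANON_BIAS=0).
        self.conv = nn.Conv1d(dim, dim, kernel, groups=dim, bias=False)
        self.alpha = nn.Parameter(torch.tensor(gate_init))
        self.kernel = kernel
        self.residual = residual

    def forward(self, x):                      # x: (batch, seq, dim)
        h = x.transpose(1, 2)                  # (batch, dim, seq)
        h = F.pad(h, (self.kernel - 1, 0))     # left-pad => causal conv
        delta = self.conv(h).transpose(1, 2)   # back to (batch, seq, dim)
        g = torch.sigmoid(self.alpha)
        return x + g * delta if self.residual else g * delta
```

At initialization the output is close to the input, which matches the near-identity behavior described above.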

  • Last 4 means XSA is enabled only on the last 4 transformer blocks:
    • XSA_LAST_N=4
  • XSA learnable gate means an extra learnable scalar that mixes normal attention output and XSA output:
    • y <- y + sigmoid(g) * (y_xsa - y)
    • controlled by XSA_LEARNABLE_GATE and XSA_GATE_INIT
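The XSA gate update above is algebraically a learnable convex mix, which a one-liner makes explicit (function name is illustrative):

```python
import torch

def mix_xsa(y, y_xsa, gate_logit):
    """Learnable mix of standard attention output y and XSA output y_xsa.

    y + sigmoid(g) * (y_xsa - y) equals
    (1 - sigmoid(g)) * y + sigmoid(g) * y_xsa,
    so XSA_GATE_INIT controls how much XSA contributes at the start.
    """
    s = torch.sigmoid(gate_logit)
    return y + s * (y_xsa - y)
```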

Final Run Command (renamed RUN_ID)

CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
env \
  RUN_ID=frontier_canon_ac_k3_8gpu_final_report_seed1336 \
  DATA_PATH=./data/datasets/fineweb10B_sp1024/ \
  TOKENIZER_PATH=./data/tokenizers/fineweb_1024_bpe.model \
  VOCAB_SIZE=1024 SEED=1336 \
  NUM_LAYERS=11 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=3.0 \
  TRAIN_SEQ_LEN=2048 \
  EVAL_SEQ_LEN=2048 EVAL_STRIDE=64 EVAL_BATCH_SEQS=32 \
  ITERATIONS=7000 WARMUP_STEPS=20 WARMDOWN_ITERS=3000 MAX_WALLCLOCK_SECONDS=600 \
  MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
  MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 MUON_MOMENTUM_WARMUP_STEPS=1500 \
  MUON_WEIGHT_DECAY=0.04 ADAM_WEIGHT_DECAY=0.04 \
  EMA_ENABLED=0 \
  SWA_ENABLED=1 TIGHT_SWA=1 TIGHT_SWA_EVERY=50 TIGHT_SWA_START_LRMUL=0.2 TIGHT_SWA_MAX_CHECKPOINTS=12 \
  XSA_LAST_N=4 ROPE_DIMS=16 LN_SCALE=1 \
  LATE_QAT=1 QAT_THRESHOLD=0.1 \
  INT6_CATEGORIES=mlp,attn TRAIN_BATCH_TOKENS=786432 GRAD_CLIP_NORM=0.3 \
  CANON_SET=AC CANON_KERNEL=3 CANON_RESIDUAL=1 CANON_ACTIVATION=0 CANON_BIAS=0 \
  CANON_FIRST_N=0 CANON_LAST_N=5 CANON_DELTA_GATE=1 CANON_DELTA_GATE_INIT=-4.0 \
  TORCHINDUCTOR_CACHE_DIR=/tmp/torchinductor \
  torchrun --standalone --nproc_per_node=8 train_gpt.py

Results

Seed-level excerpts

  • Seed 1337:
    • step:6278/7000 val_loss:1.9339 val_bpb:1.1454
    • final_int6_sliding_window_exact val_loss:1.90730712 val_bpb:1.12961770
    • Total submission size int6+zstd: 15581348 bytes
  • Seed 1335:
    • step:6252/7000 val_loss:1.9349 val_bpb:1.1460
    • final_int6_sliding_window_exact val_loss:1.90745178 val_bpb:1.12970337
    • Total submission size int6+zstd: 15579865 bytes
  • Seed 1336:
    • step:6243/7000 val_loss:1.9365 val_bpb:1.1469
    • final_int6_sliding_window_exact val_bpb: 1.1303
    • Total submission size int6+zstd: 15505544 bytes

Wallclock / speed notes

  • AC(last5)+delta runs stopped around ~6250-6280 steps due to 600s wallclock cap.
  • No-canon run reached 6930 steps under the same cap (faster, but lower quality).
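The wallclock-capped stopping behavior described above (runs halting mid-schedule at ~6250-6280 of 7000 steps under MAX_WALLCLOCK_SECONDS=600) can be sketched as a simple guard in the training loop; `step_fn` is an illustrative stand-in for one optimizer step:

```python
import time

def train_with_wallclock_cap(step_fn, max_steps=7000, max_wallclock_seconds=600):
    """Run training steps until the step budget or the wallclock cap is hit.

    Slower configurations (e.g. CANON enabled) complete fewer steps under
    the same cap, which is the trade-off reported in the notes above.
    """
    t0 = time.monotonic()
    step = 0
    while step < max_steps:
        if time.monotonic() - t0 >= max_wallclock_seconds:
            break  # wallclock cap reached; stop even mid-schedule
        step_fn(step)
        step += 1
    return step
```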

Ablations (sliding-window val_bpb)

  • Full CANON ACD: 1.14083538
  • CANON AC (broad): 1.13218808
  • CANON AC (first 4 layers): 1.1314
  • No CANON: 1.13587538 (faster under the wallclock cap, but worse bpb)
  • CANON AC(last5)+delta: best observed 1.1296
  • XSA learnable gate (XSA_LEARNABLE_GATE=1): not helpful here (~1.131)

Comparison vs Previous PR

Previous: #312

  • final_int6_sliding_window_exact val_bpb: 1.16682362

Current best in this report:

  • final_int6_sliding_window_exact val_bpb: 1.12961770

Approx improvement:

  • Δ bpb = -0.03720592
  • Δ nats ≈ 0.0258 (using bpb * ln(2) conversion)
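The bpb-to-nats conversion used above is just multiplication by ln(2), since one bit equals ln(2) nats:

```python
import math

def bpb_delta_to_nats(bpb_a, bpb_b):
    """Convert a bits-per-byte difference to nats-per-byte: nats = bits * ln 2."""
    return (bpb_a - bpb_b) * math.log(2)

# Reproduces the figures quoted above:
delta_nats = bpb_delta_to_nats(1.16682362, 1.12961770)  # ~0.0258
```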

Significance Note

Against the official SOTA context (1.1428 BPB), this run clears the >=0.005 nat improvement threshold by a comfortable amount in the point estimate.
For formal p < 0.01 reporting, include the completed 3-seed list (1335/1336/1337) and test output in PR comments.

Humble Notes

  • This is an incremental engineering result built on existing leaderboard-proven ideas plus scoped CANON placement and gating.
  • The strongest gain seems to come from the interaction of AC(last5), CANON delta gate, and tight SWA under the same compute budget.

haikosys pushed a commit to haikosys/parameter-golf that referenced this pull request Mar 30, 2026
…nai#400 openai#369 openai#398)

KEY DISCOVERY: PR#414 stacks EMA + Tight SWA together (-0.0006 BPB free)
GPTQ should be per-ROW not per-matrix (-0.0006 BPB)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

MatoTeziTanka commented Apr 11, 2026

[RETRACTED 2026-04-11] — This IMPORT_FAIL was a false positive. Root cause: runner fetched a path marked deleted in the PR diff. Your code is not broken. See correction below: #400 (comment)




Community Review — Record: 11L CANON-AC(last5)+DeltaGate Report (Humble Record Attempt, val_bpb: 1.1296)

Compliance: NEEDS AUTHOR ACTION — train_gpt.py fails to import on CT2038 (Python 3.10 / torch 2.10.0+cpu)

What I found: The CPU smoke test on CT2038 (proteus-engine, 128 GB RAM, Triton 3.6.0, flash_attn stub, cutlass_evt_fusion stub) failed at the import step with:

SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0: invalid start byte (line 1)

A few of the common patterns I've seen for this class of error in the 2026-04-11 sweep:

Recommendation: Could you run python3 -c "import py_compile; py_compile.compile('train_gpt.py')" on your records-folder train_gpt.py under Python 3.10 specifically? The eval image is Python 3.10 per Issue #17 / the README, so any parse error on 3.10 blocks the submission at import time before any of the scored-eval logic runs.

Once the parse/import issue is fixed, I'll re-run the compliance audit through the normal pipeline. No other flags identified yet because the audit halts at the import step.


Reviewed by @MatoTeziTanka (The Agora). CPU smoke test (CT2038 proteus-engine, 2026-04-11): IMPORT_FAIL — SyntaxError: (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0: invalid start byte (line 1). Classification via classify_prs.py AST-based classifier; full compliance audit deferred until the import issue is resolved. Auto-drafted from a template and spot-checked before posting.

@chanwoo-park-official (Author) commented:

@MatoTeziTanka Sorry, I checked and it works perfectly well.

@MatoTeziTanka commented:

Correction — Community Review for PR #400

@chanwoo-park-official you're right, my apologies. I re-ran the audit against the actual head file (train_gpt.py at f4952b1, 86,197 bytes) and the earlier IMPORT_FAIL was a false positive on my side — the bulk-smoke runner picked up a path from the records/* deletions in this PR's diff instead of the root train_gpt.py you're editing, and the blob it grabbed wasn't a real Python source file, which is where the byte 0x9e at position 0 decode error came from. That's a bug in my runner, not your code.

Corrected review (audited head SHA f4952b1b0cf9009f05ecf227cef98848b04db34c):

BPB: 1.1296 (sliding-window val_bpb, seed 1337) | Compliance: LOOKS CLEAN

What I checked manually:

  • Parse check: root train_gpt.py compiles fine under Python 3.10 (py_compile.compile passes)
  • N-gram cache: BigramHashEmbedding.bigram_hash at line 1083 uses torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod. This is conditioned only on the previous and current context tokens, with no target-in-key pattern, and is unrelated to the #779 ("Record: BackoffNgramMixer + Drift-Free TTT", 3-seed mean val_bpb=0.6683) family bug.
  • Eval path: eval_val_sliding at line 383 is the standard sliding-window scorer — base_model.eval() + torch.inference_mode(), scores the tail [score_start:wlen] slice of each window, first window gets full scoring. No grad updates during eval, no SLOT optimization, no TTT.
  • Quantization ordering: int6 quantize → load_state_dict → eval_val → eval_val_sliding. Quantization happens before eval, not pre-quant TTT on val tokens. Clean.
  • No val-token fine-tuning: val_tokens only appears in the eval functions and the single load site at line 1530. No AdamW/SGD step with val_tokens as input.
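The hash expression quoted in the n-gram check above can be reproduced standalone; the wrapper name and the `mod` value (hash-table size) are illustrative, while the multipliers and xor come verbatim from the review:

```python
import torch

def bigram_hash(t, mod=65536):
    """Position-wise bigram hash as quoted in the audit above.

    Combines each token with its predecessor via multiply-and-xor, so the
    key depends only on context tokens (no target leakage).  `mod` is an
    illustrative table size, not the repo's actual value.
    """
    return torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
```

Note the output has one fewer position than the input, since each hash pairs a token with the one before it.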

Verdict: LOOKS CLEAN. BigramHash + CANON(last5) + DeltaGate + tight SWA + int6 mixed quant is a legal stack against the current leaderboard pattern. The 1.1296 → 1.1297 → 1.1303 across 3 seeds is a nice tight spread.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: MERGE pending the usual record-track checks (3-seed validation — already reported, under-16MB artifact cap — 15.58MB per the PR body so right under the wire, ≤600s train + ≤600s eval on 8×H100 SXM). No compliance flags from the re-audit.

Once again sorry for the noise — the IMPORT_FAIL false positive on this PR has been retracted and I'm adding a guard to the bulk-smoke runner to prevent it on other PRs that delete/rename record folders.


Reviewed by @MatoTeziTanka (The Agora). Manual re-audit after author pushback. Original false-positive IMPORT_FAIL retracted.

@MatoTeziTanka commented:

Also — thanks for engaging with this, seriously. The whole reason I started running these community reviews in the first place is that 900-ish PRs have been sitting open with zero maintainer response, and a lot of good work (yours included) gets buried under the pile. One-sided drive-bys from me aren't worth much; the back-and-forth is what actually catches mistakes like the runner bug that fired on your PR. I'd much rather take a public L on a false positive and get corrected in five minutes than have nobody look at any of these at all.

If more authors push back when I'm wrong I'll end up with a much better classifier, and the leaderboard gets a more trustworthy signal on what's legal vs. what isn't. So — appreciated. Hope the mods chime in on the 1.1296 record soon, because on re-read this looks like a real one.

@chanwoo-park-official (Author) commented:

@MatoTeziTanka No problem at all! Everything you are doing is good for the community, so thanks for your dedication!

@MatoTeziTanka commented:

Retraction — this IMPORT_FAIL was a deleted-file artifact in my smoke runner

Sorry @chanwoo-park-official, this one's on me. I re-audited the SyntaxError (unicode error) 'utf-8' codec can't decode byte 0x9e in position 0 I reported above and it was a false positive — the fault is in my smoke runner, not in your code.

What happened:

Your PR deletes 5 old records/*/train_gpt.py path(s) while editing a different file, and my bulk smoke runner iterated the diff's file list and fetched one of the paths that's already marked for deletion. The raw GitHub content endpoint returned either a binary stub or a non-UTF8 response, and my runner tried to import it as Python source, producing the byte 0x9e at position 0 error. That error was about the deleted/non-existent file, not the train_gpt.py you're actually submitting.

Verified at head f4952b1:

The real train_gpt.py you're editing parses cleanly under Python 3.10:

py_compile.compile('train_gpt.py') → PARSES OK
86197 bytes

Your PR is not broken by this error. I'm retracting the IMPORT_FAIL classification. I'll re-queue the full compliance audit and post findings separately.

Again — sorry for the noise. I'm adding a "don't fetch paths marked deleted in the PR diff" guard to the runner so this doesn't hit other PRs that delete/rename records folders.

@MatoTeziTanka commented:

Small correction to my own re-audit above — I was sloppy with the "looks like a real one" framing and the MERGE recommendation.

The current leaderboard floor for the 10min/16MB track is 1.0810 (bigbag, PR #1493, SP8192 + 3-layer recurrence + parallel residuals + legal TTT), and there are 16 entries between 1.0810 and your 1.1296. Your result would slot in between the 1.1271 and 1.1307 entries from 2026-03-20, not near the top.

So: the compliance verdict stands (code is clean, legal pure-neural stack, no flags) but "MERGE as record" is the wrong recommendation — it's a legal but non-winning result, not a new record. The title of your PR already says "Humble Record Attempt" so you clearly knew this; the error was mine in the correction comment. Sorry for muddying it.

Still a solid clean attempt on a known-legal stack; just wanted to correct the record-vs-not-record framing before it sits here unchallenged.

@chanwoo-park-official (Author) commented:

@MatoTeziTanka Yes, you are correct. This was top 2 when I posted it, but it is already April 11, so many new attempts have landed since.
