
Midnight 12L — 1.10567949 val_bpb (seed 444)#1458

Closed
newjordan wants to merge 1 commit intoopenai:mainfrom
newjordan:submission/midnight-12l

Conversation


@newjordan newjordan commented Apr 8, 2026

Midnight 12L

12-layer decoder model using GQA (8 query heads / 4 KV heads), Bigram-2048
features, RoPE-16, and XSA on the last 11 layers. Checkpoints use
mixed-int quantization (attn=int5, mlp=int6, aux=int6, embed=int8, other=int8)
and are shipped as Brotli-compressed mixed checkpoint artifacts.
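The quantize-then-compress pipeline can be sketched as follows. This is a minimal illustrative stand-in, not the submission's actual packer: it assumes symmetric per-tensor quantization with a stored scale, and uses stdlib zlib in place of Brotli purely so the sketch is self-contained.

```python
import struct
import zlib

def quantize(values, bits):
    """Symmetric per-tensor quantization of floats to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1
    scale = max((abs(v) for v in values), default=0.0) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def pack_artifact(values, bits):
    """Quantize, then compress. zlib stands in for Brotli in this sketch."""
    q, scale = quantize(values, bits)
    raw = struct.pack(f"<{len(q)}h", *q)  # int16 container for simplicity
    return zlib.compress(raw, level=9), scale

# Synthetic weights; int6 matches the mlp/aux setting above.
weights = [0.5 * ((i * 37) % 13 - 6) / 6 for i in range(1024)]
blob, scale = pack_artifact(weights, bits=6)
q, _ = quantize(weights, 6)
recon = dequantize(q, scale)
err = max(abs(a - b) for a, b in zip(weights, recon))
assert err <= scale / 2 + 1e-9  # reconstruction error bounded by half a step
```

The point the sketch makes is the trade the PR description manages explicitly: lower bit-widths shrink the compressed artifact but raise the half-step quantization error bound.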

Pure neural submission: this run targets model quality from training-time architecture/design, not eval-time adaptation.
No eval tricks: no TTT/SLOT/ngram overlays, no eval-time optimizer loops, with standard vocab.
Score comes from core neural learning + artifact compression/quantization while explicitly managing quantization loss.

Results

| Seed | val_bpb (sliding window) | Steps | Size       |
|------|--------------------------|-------|------------|
| 444  | 1.10567949               | 6160  | 15631603 B |
| 300  | 1.10582448               | 6154  | 15624171 B |
| 42   | 1.10641160               | 6153  | 15619003 B |
| mean | 1.1060                   |       | 15631603 B |

3-seed exact mean: 1.10597186 · population std: 0.00031653

Hardware: 8xH100 SXM · 600s wallclock · bytes_code: 124698

Architecture changes

  • Added a 12th Rascal layer, fitting it under the size cap via mixed-int quantization and Brotli packing.

Reproduce

```shell
# From repo root, with flash-attention/hopper on PYTHONPATH
SKIP_GPTQ=1 SEED=444 torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-04-07_Midnight_12L_8xH100/train_gpt.py
```

(Attached video: 12L_3.mp4)


MatoTeziTanka commented Apr 11, 2026

[RETRACTED 2026-04-12] — This n-gram flag was a false positive. The n-gram eval code exists in the file but is disabled (NGRAM_EVAL_ORDER=0 default). The reported 1.1057 BPB comes from the pure neural sliding-window path. See correction: #1458 (comment)


@newjordan (Author) commented

hey man, this doesn't have ngram on, it's just a dingleberry in the code. your bot can go away yesterday.

@MatoTeziTanka commented

Correction + deep re-audit — PR #1458

@newjordan you're right about the n-gram flag. I re-audited the head SHA (97874fc) with three independent passes — manual line-by-line, an independent LLM peer audit, and an OpenAI Codex CLI review — and the n-gram code is confirmed dead. Here's the full accounting:


What I got wrong originally: My AST classifier found _ngram_bulk_update at line 1544 and flagged the full_key = ctx_hash ^ (tgt * primes[...]) & mask pattern as the PR #779 family bug. The classifier didn't check whether the function is actually reachable. It isn't — NGRAM_EVAL_ORDER defaults to 0 (line 155), the call at line 2547 is guarded by if args.ngram_eval_order >= 2:, and your training log confirms it never ran. Your own log header says no n-gram eval. I've since patched the classifier to check reachability before flagging.
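The reachability check the classifier was missing can be sketched with Python's `ast` module. This is an illustrative sketch, not the actual classifier: the flagged function name is taken from the comment above, but the gate-matching heuristic (an ancestor `if` whose test mentions the gating name) is a simplification.

```python
import ast

def find_unguarded_calls(source, suspect="_ngram_bulk_update",
                         gate="ngram_eval_order"):
    """Return line numbers where `suspect` is called with no enclosing
    `if` whose condition mentions `gate` (i.e. potentially live calls)."""
    tree = ast.parse(source)

    def guarded(ancestors):
        return any(isinstance(a, ast.If) and gate in ast.dump(a.test)
                   for a in ancestors)

    hits = []

    def walk(node, ancestors):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == suspect
                and not guarded(ancestors)):
            hits.append(node.lineno)
        for child in ast.iter_child_nodes(node):
            walk(child, ancestors + [node])

    walk(tree, [])
    return hits

guarded_src = (
    "if args.ngram_eval_order >= 2:\n"
    "    _ngram_bulk_update(cache)\n"
)
unguarded_src = "_ngram_bulk_update(cache)\n"
assert find_unguarded_calls(guarded_src) == []   # dead behind the gate: no flag
assert find_unguarded_calls(unguarded_src) == [1]  # live call: would be flagged
```

A pattern matcher that only looks for the call site fires on both inputs; checking ancestors for the gate is the one-line difference between a false positive and a correct pass.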

What the re-audit found (three independent passes):

1. Compliance — CLEAN across all checks:

  • N-gram eval: dead. NGRAM_EVAL_ORDER=0 default, guard at line 2547, zero ngram_eval: lines in training log. The 1.10568 BPB comes from final_sliding_window_exact — pure neural path.
  • BigramHashEmbedding (line 1197): torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod — both operands are adjacent context tokens (current ^ previous). No target token in the key. This is a trained embedding (learned params, frozen at eval via model.eval()), not an eval-time cache.
  • TrainNgramTracker (line 250): training-time loss weight adjustment guarded by self.training (line 1467). Default complement_alpha=0 (line 154) disables it entirely. Never active during eval.
  • Pre-Quant TTT: no ttt_adapt, prequant_ttt, or any function calling optimizer.step() on val_tokens. All optimizer updates (lines 2265, 2350) are strictly on training data.
  • SLOT optimization: none. No per-window mask optimization pattern.
  • Eval path: standard eval_val_sliding with torch.inference_mode() and base_model.eval(). No logit postprocessing beyond standard logit_softcap * tanh(logits / softcap).
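The BigramHashEmbedding point can be made concrete with a plain-Python sketch of the key computation: every key is a function of an adjacent (previous, current) token pair only, so no target token can enter it. The helper name is hypothetical; the real module computes the same expression with torch tensor ops.

```python
def bigram_keys(tokens, mod=2048):
    """Hash each adjacent (previous, current) token pair into a bucket.

    Mirrors torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    for non-negative integer token ids.
    """
    return [((36313 * cur) ^ (27191 * prev)) % mod
            for prev, cur in zip(tokens, tokens[1:])]

toks = [5, 17, 254, 9, 9]
keys = bigram_keys(toks)
assert len(keys) == len(toks) - 1
assert all(0 <= k < 2048 for k in keys)
# Key i depends only on tokens i and i+1: changing a later token
# leaves all earlier keys untouched, so nothing downstream leaks in.
assert bigram_keys([5, 17, 254, 9, 1000])[:3] == keys[:3]
```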

2. Code quality findings from the Codex pass (not compliance issues, but real bugs in dead code):

  • [P1] MTP heads never train (lines 2177–2188): when MTP_NUM_HEADS > 0, the heads are zero-initialized (line 1390) and contribute an auxiliary loss in forward() (line 1473), but they're never added to any optimizer group. optimizer_tok handles embeddings, optimizer_muon handles the 4 weight banks, optimizer_scalar handles block scalars, and optimizer_head handles lm_head. MTP heads fall through all four. They'd stay frozen at zero, sending zero gradients back through W^T @ grad. In this run MTP_NUM_HEADS=0 so it's inert, but if someone enables it, it silently adds compute without learning.
  • [P2] Distributed n-gram timeout desync (lines 1657–1660): when WORLD_SIZE > 1 and NGRAM_EVAL_MAX_SECONDS > 0, each rank checks time.perf_counter() >= deadline independently. Different ranks can exit the chunk loop at different points, making the all-reduced loss_sum/token_count aggregate different window subsets. Needs a coordinated stop flag (e.g., all-reduce a boolean). Again, NGRAM_EVAL_ORDER=0 in this run so the path never fires.
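The coordinated-stop fix proposed for the P2 finding can be sketched as a pure-Python simulation: each "rank" computes a local timeout flag per chunk, and an all-reduce over those flags (MAX semantics, simulated here with `any()`) makes the stop decision global, so every rank exits on the same chunk boundary. In real code this would be a `torch.distributed.all_reduce` on a one-element tensor with `ReduceOp.MAX`; all names below are illustrative.

```python
def run_with_coordinated_stop(deadlines, num_chunks):
    """Simulate ranks that stop together once ANY rank hits its deadline.

    deadlines[r] is the chunk index at which rank r would locally time out
    (stand-in for time.perf_counter() >= deadline). Returns chunks processed
    per rank; with coordination these are identical across ranks.
    """
    processed = [0] * len(deadlines)
    for chunk in range(num_chunks):
        # Each rank's local timeout decision for this chunk.
        local_stop = [chunk >= d for d in deadlines]
        # Simulated all_reduce(MAX): if any rank wants to stop, all stop.
        if any(local_stop):
            break
        for r in range(len(deadlines)):
            processed[r] += 1
    return processed

counts = run_with_coordinated_stop(deadlines=[7, 3, 9, 5], num_chunks=10)
assert len(set(counts)) == 1  # every rank aggregated the same window subset
assert counts[0] == 3         # earliest deadline (rank 1 at chunk 3) governs
```

Without the reduce, each rank would break at its own deadline and the subsequent `loss_sum`/`token_count` all-reduce would mix different window subsets, which is exactly the desync described above.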

3. Submission mechanics — all check out:

| Check             | Result                                            |
|-------------------|---------------------------------------------------|
| 3-seed validation | 444 / 300 / 42 — all present                      |
| val_bpb spread    | 1.10568 / 1.10582 / 1.10641 — std 0.00032, tight  |
| Artifact size     | max 15.63 MB (under 16 MB cap)                    |
| Train wallclock   | 600s per seed (wallclock-capped)                  |
| GPTQ              | Skipped (SKIP_GPTQ=1)                             |
| Code size         | 124,698 bytes (matches submission.json)           |

Corrected verdict: LOOKS CLEAN. The 1.1057 BPB is a legitimate pure-neural result from a 12-layer architecture with mixed-int quantization + Brotli. No n-gram boost, no TTT, no SLOT.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: The original CLOSE flag is retracted. This is a clean pure-neural submission eligible for standard record-track checks.

For what it's worth, the two code quality findings above (the MTP optimizer gap and the distributed timeout race) were only caught because your comment prompted the re-audit — so the pushback was productive, even if the delivery was a bit much.

For context on how this happened and why we're here — we're running community compliance reviews across 900+ open PRs because the backlog has had zero maintainer triage for weeks. We batch-classified using an AST-based pattern matcher to cover ground fast, and that batch process is what misfired on your PR — it found the n-gram pattern in the file but didn't check whether the code path was live. Every author who's pushed back on a finding has gotten a full re-audit like this one, and every time we've been wrong, we've posted the correction publicly. That's what community review looks like when nobody else is doing it. The classifier bug that caused this has been patched.


Reviewed by @MatoTeziTanka (The Agora). Triple-pass re-audit: manual line-by-line + independent LLM peer audit + OpenAI Codex CLI review. Original false positive retracted. Classifier dead-code detection bug identified and fixed.

newjordan added a commit to newjordan/Fartmagic that referenced this pull request Apr 14, 2026
The pod runs PyTorch 2.11.0+cu130 on CUDA 13.0 (vastai/pytorch:cuda-13.0.2-auto).
In every prior version of Im_sorry_pod_setup.sh, the CUDA tag was wrong (cu128),
the /venv/main/bin PATH prepend had been deleted, and the FA3 symlink fallback
had been removed by automated edits.

Changes:
- FA3 wheel URL: cu128 → cu130 (validated from pod instance 34775495 logs)
- Restore /venv/main/bin PATH prepend before any python3/pip calls
- Restore flash-attention/hopper symlink fallback in install_fa3()
- Remove cuda_tag_from_torch() dynamic detection (source of bugs)
- Remove BLOCK_CU124 variable; replace with positive cu130 assertion
- Remove WRITE_ACTIVATE_HELPER gate; always write activation helper
- Remove /workspace/venv_cu124 reference
- Add DO NOT EDIT warning header with validated pod environment specs
- Add frozen backup: Im_sorry_pod_setup.sh.cu130_frozen_20260414
- Update pod_stack.lock hash
- Add AGENTS.md (Codex reads this) with frozen-file rules
- Update CLAUDE.md with explicit frozen-file protection

Evidence: PR openai/parameter-golf#1458 seed 444 log, pod instance 34775495
status JSON (image_uuid=vastai/pytorch:cuda-13.0.2-auto), extracted pod log
(Running PyTorch 2.11.0+cu130, CUDA Version: 13.0, Driver: 580.126.09).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@newjordan newjordan closed this Apr 15, 2026
