
Midnight 12L — 1.10567949 val_bpb (seed 444)#1458

Closed
newjordan wants to merge 1 commit intoopenai:mainfrom
newjordan:submission/midnight-12l

Conversation


@newjordan newjordan commented Apr 8, 2026

Midnight 12L

12-layer decoder model using GQA (8 query heads / 4 KV heads), Bigram-2048
features, RoPE-16, and XSA on the last 11 layers. Checkpoints use
mixed-int quantization (attn=int5, mlp=int6, aux=int6, embed=int8, other=int8)
and are shipped as Brotli-compressed mixed checkpoint artifacts.
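The quantize-then-compress pipeline can be sketched as follows. This is a minimal illustrative stand-in, not the submission's actual packer: it assumes symmetric per-tensor quantization with a stored scale, and uses stdlib zlib in place of Brotli purely so the sketch is self-contained.

```python
import struct
import zlib

def quantize(values, bits):
    """Symmetric per-tensor quantization of floats to signed `bits`-bit ints."""
    qmax = 2 ** (bits - 1) - 1
    scale = max((abs(v) for v in values), default=0.0) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return q, scale

def dequantize(q, scale):
    return [v * scale for v in q]

def pack_artifact(values, bits):
    """Quantize, then compress. zlib stands in for Brotli in this sketch."""
    q, scale = quantize(values, bits)
    raw = struct.pack(f"<{len(q)}h", *q)  # int16 container for simplicity
    return zlib.compress(raw, level=9), scale

# Synthetic weights; int6 matches the mlp/aux setting above.
weights = [0.5 * ((i * 37) % 13 - 6) / 6 for i in range(1024)]
blob, scale = pack_artifact(weights, bits=6)
q, _ = quantize(weights, 6)
recon = dequantize(q, scale)
err = max(abs(a - b) for a, b in zip(weights, recon))
assert err <= scale / 2 + 1e-9  # reconstruction error bounded by half a step
```

The point the sketch makes is the trade the PR description manages explicitly: lower bit-widths shrink the compressed artifact but raise the half-step quantization error bound.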

Pure neural submission: this run targets model quality from training-time architecture/design, not eval-time adaptation.
No eval tricks: no TTT/SLOT/ngram overlays, no eval-time optimizer loops, with standard vocab.
Score comes from core neural learning + artifact compression/quantization while explicitly managing quantization loss.

Results

| Seed | val_bpb (sliding window) | Steps | Size       |
|------|--------------------------|-------|------------|
| 444  | 1.10567949               | 6160  | 15631603 B |
| 300  | 1.10582448               | 6154  | 15624171 B |
| 42   | 1.10641160               | 6153  | 15619003 B |
| mean | 1.1060                   |       | 15631603 B |

3-seed exact mean: 1.10597186 · population std: 0.00031653

Hardware: 8xH100 SXM · 600s wallclock · bytes_code: 124698

Architecture changes

  • Added a 12th Rascal layer, fitting it under the size cap via mixed-int quantization and Brotli packing.

Reproduce

```shell
# From repo root, with flash-attention/hopper on PYTHONPATH
SKIP_GPTQ=1 SEED=444 torchrun --standalone --nproc_per_node=8 \
  records/track_10min_16mb/2026-04-07_Midnight_12L_8xH100/train_gpt.py
```

(Attached video: 12L_3.mp4)


MatoTeziTanka commented Apr 11, 2026

[RETRACTED 2026-04-12] — This n-gram flag was a false positive. The n-gram eval code exists in the file but is disabled (NGRAM_EVAL_ORDER=0 default). The reported 1.1057 BPB comes from the pure neural sliding-window path. See correction: #1458 (comment)


@newjordan (Author) commented

hey man, this doesn't have ngram on, it's just a dingleberry in the code. your bot can go away yesterday.

@MatoTeziTanka commented

Correction + deep re-audit — PR #1458

@newjordan you're right about the n-gram flag. I re-audited the head SHA (97874fc) with three independent passes — manual line-by-line, an independent LLM peer audit, and an OpenAI Codex CLI review — and the n-gram code is confirmed dead. Here's the full accounting:


What I got wrong originally: My AST classifier found _ngram_bulk_update at line 1544 and flagged the full_key = ctx_hash ^ (tgt * primes[...]) & mask pattern as the PR #779 family bug. The classifier didn't check whether the function is actually reachable. It isn't — NGRAM_EVAL_ORDER defaults to 0 (line 155), the call at line 2547 is guarded by if args.ngram_eval_order >= 2:, and your training log confirms it never ran. Your own log header says no n-gram eval. I've since patched the classifier to check reachability before flagging.
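The reachability check the classifier was missing can be sketched with Python's `ast` module. This is an illustrative sketch, not the actual classifier: the flagged function name is taken from the comment above, but the gate-matching heuristic (an ancestor `if` whose test mentions the gating name) is a simplification.

```python
import ast

def find_unguarded_calls(source, suspect="_ngram_bulk_update",
                         gate="ngram_eval_order"):
    """Return line numbers where `suspect` is called with no enclosing
    `if` whose condition mentions `gate` (i.e. potentially live calls)."""
    tree = ast.parse(source)

    def guarded(ancestors):
        return any(isinstance(a, ast.If) and gate in ast.dump(a.test)
                   for a in ancestors)

    hits = []

    def walk(node, ancestors):
        if (isinstance(node, ast.Call)
                and isinstance(node.func, ast.Name)
                and node.func.id == suspect
                and not guarded(ancestors)):
            hits.append(node.lineno)
        for child in ast.iter_child_nodes(node):
            walk(child, ancestors + [node])

    walk(tree, [])
    return hits

guarded_src = (
    "if args.ngram_eval_order >= 2:\n"
    "    _ngram_bulk_update(cache)\n"
)
unguarded_src = "_ngram_bulk_update(cache)\n"
assert find_unguarded_calls(guarded_src) == []   # dead behind the gate: no flag
assert find_unguarded_calls(unguarded_src) == [1]  # live call: would be flagged
```

A pattern matcher that only looks for the call site fires on both inputs; checking ancestors for the gate is the one-line difference between a false positive and a correct pass.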

What the re-audit found (three independent passes):

1. Compliance — CLEAN across all checks:

  • N-gram eval: dead. NGRAM_EVAL_ORDER=0 default, guard at line 2547, zero ngram_eval: lines in training log. The 1.10568 BPB comes from final_sliding_window_exact — pure neural path.
  • BigramHashEmbedding (line 1197): torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod — both operands are adjacent context tokens (current ^ previous). No target token in the key. This is a trained embedding (learned params, frozen at eval via model.eval()), not an eval-time cache.
  • TrainNgramTracker (line 250): training-time loss weight adjustment guarded by self.training (line 1467). Default complement_alpha=0 (line 154) disables it entirely. Never active during eval.
  • Pre-Quant TTT: no ttt_adapt, prequant_ttt, or any function calling optimizer.step() on val_tokens. All optimizer updates (lines 2265, 2350) are strictly on training data.
  • SLOT optimization: none. No per-window mask optimization pattern.
  • Eval path: standard eval_val_sliding with torch.inference_mode() and base_model.eval(). No logit postprocessing beyond standard logit_softcap * tanh(logits / softcap).
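The BigramHashEmbedding point can be made concrete with a plain-Python sketch of the key computation: every key is a function of an adjacent (previous, current) token pair only, so no target token can enter it. The helper name is hypothetical; the real module computes the same expression with torch tensor ops.

```python
def bigram_keys(tokens, mod=2048):
    """Hash each adjacent (previous, current) token pair into a bucket.

    Mirrors torch.bitwise_xor(36313 * t[..., 1:], 27191 * t[..., :-1]) % mod
    for non-negative integer token ids.
    """
    return [((36313 * cur) ^ (27191 * prev)) % mod
            for prev, cur in zip(tokens, tokens[1:])]

toks = [5, 17, 254, 9, 9]
keys = bigram_keys(toks)
assert len(keys) == len(toks) - 1
assert all(0 <= k < 2048 for k in keys)
# Key i depends only on tokens i and i+1: changing a later token
# leaves all earlier keys untouched, so nothing downstream leaks in.
assert bigram_keys([5, 17, 254, 9, 1000])[:3] == keys[:3]
```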

2. Code quality findings from the Codex pass (not compliance issues, but real bugs in dead code):

  • [P1] MTP heads never train (lines 2177–2188): when MTP_NUM_HEADS > 0, the heads are zero-initialized (line 1390) and contribute an auxiliary loss in forward() (line 1473), but they're never added to any optimizer group. optimizer_tok handles embeddings, optimizer_muon handles the 4 weight banks, optimizer_scalar handles block scalars, and optimizer_head handles lm_head. MTP heads fall through all four. They'd stay frozen at zero, sending zero gradients back through W^T @ grad. In this run MTP_NUM_HEADS=0 so it's inert, but if someone enables it, it silently adds compute without learning.
  • [P2] Distributed n-gram timeout desync (lines 1657–1660): when WORLD_SIZE > 1 and NGRAM_EVAL_MAX_SECONDS > 0, each rank checks time.perf_counter() >= deadline independently. Different ranks can exit the chunk loop at different points, making the all-reduced loss_sum/token_count aggregate different window subsets. Needs a coordinated stop flag (e.g., all-reduce a boolean). Again, NGRAM_EVAL_ORDER=0 in this run so the path never fires.
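The coordinated-stop fix proposed for the P2 finding can be sketched as a pure-Python simulation: each "rank" computes a local timeout flag per chunk, and an all-reduce over those flags (MAX semantics, simulated here with `any()`) makes the stop decision global, so every rank exits on the same chunk boundary. In real code this would be a `torch.distributed.all_reduce` on a one-element tensor with `ReduceOp.MAX`; all names below are illustrative.

```python
def run_with_coordinated_stop(deadlines, num_chunks):
    """Simulate ranks that stop together once ANY rank hits its deadline.

    deadlines[r] is the chunk index at which rank r would locally time out
    (stand-in for time.perf_counter() >= deadline). Returns chunks processed
    per rank; with coordination these are identical across ranks.
    """
    processed = [0] * len(deadlines)
    for chunk in range(num_chunks):
        # Each rank's local timeout decision for this chunk.
        local_stop = [chunk >= d for d in deadlines]
        # Simulated all_reduce(MAX): if any rank wants to stop, all stop.
        if any(local_stop):
            break
        for r in range(len(deadlines)):
            processed[r] += 1
    return processed

counts = run_with_coordinated_stop(deadlines=[7, 3, 9, 5], num_chunks=10)
assert len(set(counts)) == 1  # every rank aggregated the same window subset
assert counts[0] == 3         # earliest deadline (rank 1 at chunk 3) governs
```

Without the reduce, each rank would break at its own deadline and the subsequent `loss_sum`/`token_count` all-reduce would mix different window subsets, which is exactly the desync described above.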

3. Submission mechanics — all check out:

| Check             | Result                                            |
|-------------------|---------------------------------------------------|
| 3-seed validation | 444 / 300 / 42 — all present                      |
| val_bpb spread    | 1.10568 / 1.10582 / 1.10641 — std 0.00032, tight  |
| Artifact size     | max 15.63 MB (under 16 MB cap)                    |
| Train wallclock   | 600s per seed (wallclock-capped)                  |
| GPTQ              | Skipped (SKIP_GPTQ=1)                             |
| Code size         | 124,698 bytes (matches submission.json)           |

Corrected verdict: LOOKS CLEAN. The 1.1057 BPB is a legitimate pure-neural result from a 12-layer architecture with mixed-int quantization + Brotli. No n-gram boost, no TTT, no SLOT.

Recommendation to @cocohearts @valerio-oai @0hq @yuzhougu-oai @notapplica: The original CLOSE flag is retracted. This is a clean pure-neural submission eligible for standard record-track checks.

For what it's worth, the two code quality findings above (the MTP optimizer gap and the distributed timeout race) were only caught because your comment prompted the re-audit — so the pushback was productive, even if the delivery was a bit much.

For context on how this happened and why we're here — we're running community compliance reviews across 900+ open PRs because the backlog has had zero maintainer triage for weeks. We batch-classified using an AST-based pattern matcher to cover ground fast, and that batch process is what misfired on your PR — it found the n-gram pattern in the file but didn't check whether the code path was live. Every author who's pushed back on a finding has gotten a full re-audit like this one, and every time we've been wrong, we've posted the correction publicly. That's what community review looks like when nobody else is doing it. The classifier bug that caused this has been patched.


Reviewed by @MatoTeziTanka (The Agora). Triple-pass re-audit: manual line-by-line + independent LLM peer audit + OpenAI Codex CLI review. Original false positive retracted. Classifier dead-code detection bug identified and fixed.

newjordan added a commit to newjordan/Fartmagic that referenced this pull request Apr 14, 2026
The pod runs PyTorch 2.11.0+cu130 on CUDA 13.0 (vastai/pytorch:cuda-13.0.2-auto).
In every prior version of Im_sorry_pod_setup.sh, the CUDA tag was wrong (cu128),
the /venv/main/bin PATH prepend had been deleted, and the FA3 symlink fallback
had been removed by automated edits.

Changes:
- FA3 wheel URL: cu128 → cu130 (validated from pod instance 34775495 logs)
- Restore /venv/main/bin PATH prepend before any python3/pip calls
- Restore flash-attention/hopper symlink fallback in install_fa3()
- Remove cuda_tag_from_torch() dynamic detection (source of bugs)
- Remove BLOCK_CU124 variable; replace with positive cu130 assertion
- Remove WRITE_ACTIVATE_HELPER gate; always write activation helper
- Remove /workspace/venv_cu124 reference
- Add DO NOT EDIT warning header with validated pod environment specs
- Add frozen backup: Im_sorry_pod_setup.sh.cu130_frozen_20260414
- Update pod_stack.lock hash
- Add AGENTS.md (Codex reads this) with frozen-file rules
- Update CLAUDE.md with explicit frozen-file protection

Evidence: PR openai/parameter-golf#1458 seed 444 log, pod instance 34775495
status JSON (image_uuid=vastai/pytorch:cuda-13.0.2-auto), extracted pod log
(Running PyTorch 2.11.0+cu130, CUDA Version: 13.0, Driver: 580.126.09).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@newjordan newjordan closed this Apr 15, 2026
