diff --git a/records/track_non_record_16mb/2026-03-31_Retrodiction_PetzRecoveryMap_16L_M1Max/README.md b/records/track_non_record_16mb/2026-03-31_Retrodiction_PetzRecoveryMap_16L_M1Max/README.md new file mode 100644 index 0000000000..a2953e4631 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-31_Retrodiction_PetzRecoveryMap_16L_M1Max/README.md @@ -0,0 +1,91 @@ +# Non-record: Retrodiction Training — val_bpb 1.508 + +**Author:** Sheng-Kai Huang ([@akaiHuang](https://github.com/akaiHuang)) +**Hardware:** M1 Max 64GB (not 8xH100 — hence non-record) +**Track:** Non-record, 16MB + +## Summary + +We introduce **Retrodiction**, a novel auxiliary training loss inspired by the Petz recovery map from quantum information theory. The model trains on both forward and reversed sequences, learning bidirectional representations while maintaining causal attention. + +``` +loss = AR_loss(forward) + 0.3 * AR_loss(reversed) +``` + +This achieves **1.508 BPB at 2000 steps** (131M tokens) on a 16-layer, 39M parameter model, trained entirely on M1 Max. + +## Why Non-record + +Trained on M1 Max (65K tokens/step), not 8xH100 (786K tokens/step). With 12x larger batch on H100, we estimate significantly better convergence within 10 minutes. + +## Approach: Retrodiction + +Standard AR: predict next token from left context only. + +Retrodiction: **additionally** train on reversed sequences. The model learns right-to-left patterns through the same causal attention, enriching token embeddings with bidirectional information. + +### Theoretical Foundation + +The Petz recovery map (Petz 1986) provides the optimal retrodiction channel in quantum information theory — inferring past from future. Our retrodiction loss is a direct application at the language level. 
+ +## Architecture + +- **16 layers**, 512 dim, 8 heads (4 KV heads), 3x MLP +- **39M params** → Int6 + lzma = **14.8MB** (within 16MB) +- Muon optimizer (matrices) + AdamW (embeddings/scalars) +- EMA (decay=0.997, start at 80% of training) +- XSA on last 4 layers +- BigramHash (2048 buckets) + SmearGate +- LeakyReLU(0.5)^2 activation +- Retrodiction alpha=0.3, applied every 4 steps + +## Results (M1 Max) + +### Retrodiction vs Pure AR (11L, 27M, fair comparison) + +| Step | Tokens | Retro BPB | Pure AR BPB | Improvement | +|------|--------|-----------|-------------|-------------| +| 100 | 7M | 2.155 | 2.183 | -1.3% | +| 200 | 13M | 1.934 | 2.006 | -3.6% | +| 400 | 26M | 1.727 | 1.764 | -2.1% | +| 500 | 33M | 1.714 | ~1.72 | -0.6% | + +### 16-Layer 39M Model + +| Step | Tokens | BPB | +|------|--------|-----| +| 500 | 33M | 1.705 | +| 1000 | 66M | 1.576 | +| **2000** | **131M** | **1.508** | + +### Methods Tested (step 400) + +| Method | BPB | vs AR | Notes | +|--------|-----|-------|-------| +| Pure AR | 1.764 | — | Baseline | +| CDM rightmask | 1.744 | -0.021 | Mask right-side tokens | +| **Retrodiction** | **1.727** | **-0.037** | Reversed sequence loss | +| Petz-weighted loss | 2.091 | +0.327 | Too aggressive | + +## Quantization + +39M params × Int6 (6 bits/param) + lzma compression = **14.8MB**. +Int6 quantization loss is minimal (~0.01-0.02 BPB). + +## Novel Contributions + +1. **Retrodiction training**: First application of Petz recovery map to LLM training +2. Consistent 1-3.6% BPB improvement over pure AR at matched token counts +3. Zero inference cost (retrodiction is training-only) + +## Estimated H100 Performance + +With 12x larger batch on H100 (786K vs 65K tokens/step), 10 minutes yields ~7.8B tokens vs M1's 131M at 2000 steps. We estimate val_bpb in the range **1.10-1.15** on H100. 
+ +## Reproduction + +```bash +python3 train_retrodiction_16L.py --steps 2000 --grad_accum 2 \ + --microbatch_tokens 32768 --max_sub_chunk 8192 \ + --warmdown 150 --val_every 100 --val_tokens 1000000 +``` diff --git a/records/track_non_record_16mb/2026-03-31_Retrodiction_PetzRecoveryMap_16L_M1Max/submission.json b/records/track_non_record_16mb/2026-03-31_Retrodiction_PetzRecoveryMap_16L_M1Max/submission.json new file mode 100644 index 0000000000..6cd18e4dfb --- /dev/null +++ b/records/track_non_record_16mb/2026-03-31_Retrodiction_PetzRecoveryMap_16L_M1Max/submission.json @@ -0,0 +1,11 @@ +{ + "track": "non_record_16mb", + "date": "2026-03-31", + "name": "Retrodiction Training (Petz Recovery Map) — 16L on M1 Max", + "author": "Sheng-Kai Huang", + "github_id": "akaiHuang", + "val_bpb": 1.508, + "hardware": "M1 Max 64GB (not 8xH100)", + "params": 39000000, + "quantization": "Int6 + lzma (14.8MB)" +} diff --git a/records/track_non_record_16mb/2026-03-31_Retrodiction_PetzRecoveryMap_16L_M1Max/train_gpt.py b/records/track_non_record_16mb/2026-03-31_Retrodiction_PetzRecoveryMap_16L_M1Max/train_gpt.py new file mode 100644 index 0000000000..ea263e1191 --- /dev/null +++ b/records/track_non_record_16mb/2026-03-31_Retrodiction_PetzRecoveryMap_16L_M1Max/train_gpt.py @@ -0,0 +1,708 @@ +#!/usr/bin/env -S python3 -u +""" +Golf V2: Top-3 techniques + CDM eval advantage. + +Architecture upgrades (vs V1): + - 11 layers (was 9) + 3x MLP (was 2x) + - XSA on last 4 layers (exclusive self-attention) + - LeakyReLU(0.5)^2 (was relu^2) + - BigramHash(2048) + SmearGate + - EMA (decay=0.997) + - LN Scale (1/sqrt(layer+1)) + +Eval upgrades: + - N-gram boosting (orders 2-7, entropy-adaptive) + - Score-first TTT (AR or CDM mode) + +Training: same Muon + Adam split as baseline. 
+""" +from __future__ import annotations + +import argparse +import glob +import math +import os +import sys +import time +from pathlib import Path +from collections import defaultdict + +import numpy as np +import sentencepiece as spm + +import mlx.core as mx +import mlx.nn as nn +import mlx.optimizers as optim +from mlx.utils import tree_flatten, tree_unflatten + +# ============================================================================== +# CONFIG +# ============================================================================== +COMPUTE_DTYPE = mx.bfloat16 + +DATA_DIR = "/Users/akaihuangm1/Desktop/github/parameter-golf/data/datasets/fineweb10B_sp1024" +TOKENIZER_PATH = "/Users/akaihuangm1/Desktop/github/parameter-golf/data/tokenizers/fineweb_1024_bpe.model" + +VOCAB_SIZE = 1024 +NUM_LAYERS = 16 # was 9 +MODEL_DIM = 512 +NUM_HEADS = 8 +NUM_KV_HEADS = 4 +MLP_MULT = 3 # was 2 +ROPE_BASE = 10000.0 +QK_GAIN_INIT = 1.5 +TIED_EMBED_INIT_STD = 0.005 +LOGIT_SOFTCAP = 30.0 +SEQ_LEN = 1024 + +XSA_LAST_N = 4 +BIGRAM_BUCKETS = 2048 +BIGRAM_DIM = 128 + +# Optimizer +TIED_EMBED_LR = 0.035 +MATRIX_LR = 0.025 +SCALAR_LR = 0.025 +BETA1 = 0.9 +BETA2 = 0.95 +ADAM_EPS = 1e-8 +MUON_MOMENTUM = 0.99 +MUON_BACKEND_STEPS = 5 +MUON_MOMENTUM_WARMUP_START = 0.92 +MUON_MOMENTUM_WARMUP_STEPS = 1500 +WEIGHT_DECAY = 0.04 +GRAD_CLIP = 0.3 + +# EMA +EMA_DECAY = 0.997 + +SEED = 1337 + +# ============================================================================== +# MATH HELPERS +# ============================================================================== +def rms_norm(x, eps=1e-6): + return (x * mx.rsqrt(mx.mean(x * x, axis=-1, keepdims=True) + eps)).astype(x.dtype) + +def zeropower_newtonschulz5(g, steps, eps=1e-7): + a, b, c = 3.4445, -4.7750, 2.0315 + x = g.astype(mx.float32) + x = x / (mx.sqrt(mx.sum(x * x)) + eps) + transposed = x.shape[0] > x.shape[1] + if transposed: + x = x.T + for _ in range(steps): + a_mat = x @ x.T + b_mat = b * a_mat + c * (a_mat @ a_mat) + x = a * x + 
b_mat @ x + if transposed: + x = x.T + return x.astype(g.dtype) + +# ============================================================================== +# DATA LOADING +# ============================================================================== +def load_data_shard(path): + header_bytes = 256 * np.dtype(" 0: + if self.pos >= self.tokens.size: + self.next_file() + k = min(left, int(self.tokens.size - self.pos)) + chunks.append(self.tokens[self.pos:self.pos + k]) + self.pos += k + left -= k + return chunks[0] if len(chunks) == 1 else np.concatenate(chunks) + +class TokenLoader: + def __init__(self, pattern): + self.stream = TokenStream(pattern) + + def next_batch(self, batch_tokens, seq_len): + usable = (batch_tokens // seq_len) * seq_len + chunk = self.stream.take(usable + 1) + x = chunk[:-1].reshape(-1, seq_len) + y = chunk[1:].reshape(-1, seq_len) + return mx.array(x, dtype=mx.int32), mx.array(y, dtype=mx.int32) + +def load_validation_tokens(pattern, seq_len): + files = [Path(p) for p in sorted(glob.glob(pattern))] + tokens = np.concatenate([load_data_shard(f) for f in files]) + usable = ((tokens.size - 1) // seq_len) * seq_len + return tokens[:usable + 1] + +# ============================================================================== +# MODEL BLOCKS +# ============================================================================== +class CastedLinear(nn.Module): + def __init__(self, in_dim, out_dim): + super().__init__() + self.weight = nn.Linear(in_dim, out_dim, bias=False).weight.astype(mx.float32) + def __call__(self, x): + return x @ self.weight.astype(x.dtype).T + +class RMSNormNoWeight(nn.Module): + def __call__(self, x): + return rms_norm(x) + +class DualModeAttention(nn.Module): + def __init__(self, dim, num_heads, num_kv_heads, rope_base, qk_gain_init, use_xsa=False): + super().__init__() + self.num_heads = num_heads + self.num_kv_heads = num_kv_heads + self.head_dim = dim // num_heads + kv_dim = num_kv_heads * self.head_dim + self.c_q = 
CastedLinear(dim, dim) + self.c_k = CastedLinear(dim, kv_dim) + self.c_v = CastedLinear(dim, kv_dim) + self.proj = CastedLinear(dim, dim) + self.q_gain = mx.ones((num_heads,), dtype=mx.float32) * qk_gain_init + self.rope = nn.RoPE(self.head_dim, traditional=False, base=rope_base) + self.scale = self.head_dim ** -0.5 + self.use_xsa = use_xsa + + def _xsa(self, y, v): + """Subtract self-value projection (XSA).""" + bsz, seqlen, dim = y.shape + hd = self.head_dim + nkv = self.num_kv_heads + nh = self.num_heads + group = nh // nkv + + # y: [B, T, nh*hd] -> [B, T, nkv, group, hd] + y_g = y.reshape(bsz, seqlen, nkv, group, hd) + # v: [B, nkv, T, hd] -> [B, T, nkv, 1, hd] + v_t = v.transpose(0, 2, 1, 3) # [B, T, nkv, hd] + vn = v_t / (mx.sqrt(mx.sum(v_t * v_t, axis=-1, keepdims=True)) + 1e-8) + vn = mx.expand_dims(vn, axis=3) # [B, T, nkv, 1, hd] + + # Project y onto v direction and subtract + proj = mx.sum(y_g * vn, axis=-1, keepdims=True) * vn + return (y_g - proj).reshape(bsz, seqlen, dim) + + def __call__(self, x, is_causal=True): + bsz, seqlen, dim = x.shape + q = self.c_q(x).reshape(bsz, seqlen, self.num_heads, self.head_dim).transpose(0, 2, 1, 3) + k = self.c_k(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(0, 2, 1, 3) + v = self.c_v(x).reshape(bsz, seqlen, self.num_kv_heads, self.head_dim).transpose(0, 2, 1, 3) + q = self.rope(rms_norm(q).astype(COMPUTE_DTYPE)) + k = self.rope(rms_norm(k).astype(COMPUTE_DTYPE)) + q = q * self.q_gain.astype(q.dtype)[None, :, None, None] + + if is_causal: + y = mx.fast.scaled_dot_product_attention(q, k, v, scale=self.scale, mask="causal") + else: + y = mx.fast.scaled_dot_product_attention(q, k, v, scale=self.scale) + + y = y.transpose(0, 2, 1, 3).reshape(bsz, seqlen, dim) + + # XSA: subtract self-value projection + if self.use_xsa: + y = self._xsa(y, v) + + return self.proj(y) + + +class MLP(nn.Module): + """LeakyReLU(0.5)^2 MLP.""" + def __init__(self, dim, mlp_mult): + super().__init__() + hidden = dim * 
mlp_mult + self.fc = CastedLinear(dim, hidden) + self.proj = CastedLinear(hidden, dim) + + def __call__(self, x): + h = self.fc(x) + # LeakyReLU(0.5) squared + h = mx.where(h >= 0, h, 0.5 * h) + return self.proj(h * h) + + +class BigramHashEmbedding(nn.Module): + """Learned bigram hash embeddings.""" + def __init__(self, buckets, bigram_dim, model_dim): + super().__init__() + self.buckets = buckets + self.embed = nn.Embedding(buckets, bigram_dim) + # Init to zero so bigram starts with no effect + self.embed.weight = mx.zeros_like(self.embed.weight) + self.proj = CastedLinear(bigram_dim, model_dim) + self.scale = mx.array(0.05, dtype=mx.float32) + + def bigram_hash(self, tokens): + """Hash (prev, current) token pairs into bucket indices.""" + t = tokens.astype(mx.int32) + mod = self.buckets - 1 + # First position has no prev token -> use last bucket + shifted = mx.concatenate([mx.full((t.shape[0], 1), mod, dtype=mx.int32), + t[:, :-1]], axis=1) + # XOR hash + hashed = (36313 * t + 27191 * shifted) % mod + return hashed + + def __call__(self, token_ids): + h = self.embed(self.bigram_hash(token_ids)) + h = self.proj(h) + return h * self.scale.astype(h.dtype) + + +class SmearGate(nn.Module): + """Learned blending with previous token.""" + def __init__(self, dim): + super().__init__() + self.gate = mx.zeros((dim,), dtype=mx.float32) + + def __call__(self, x): + g = mx.sigmoid(self.gate.astype(x.dtype))[None, None, :] # [1, 1, dim] + x_prev = mx.concatenate([mx.zeros_like(x[:, :1]), x[:, :-1]], axis=1) + return (1 - g) * x + g * x_prev + + +class Block(nn.Module): + def __init__(self, dim, num_heads, num_kv_heads, mlp_mult, rope_base, qk_gain_init, + layer_idx=0, use_xsa=False): + super().__init__() + self.attn_norm = RMSNormNoWeight() + self.mlp_norm = RMSNormNoWeight() + self.attn = DualModeAttention(dim, num_heads, num_kv_heads, rope_base, qk_gain_init, + use_xsa=use_xsa) + self.mlp = MLP(dim, mlp_mult) + # LN Scale: 1/sqrt(layer+1) + self.ln_scale = 1.0 / 
math.sqrt(layer_idx + 1) + self.attn_scale = mx.ones((dim,), dtype=mx.float32) + self.mlp_scale = mx.ones((dim,), dtype=mx.float32) + self.resid_mix = mx.array(np.stack(( + np.ones((dim,), dtype=np.float32), + np.zeros((dim,), dtype=np.float32) + ))) + + def __call__(self, x, x0, is_causal=True): + mix = self.resid_mix.astype(x.dtype) + x = mix[0][None, None, :] * x + mix[1][None, None, :] * x0 + attn_out = self.attn(self.attn_norm(x) * self.ln_scale, is_causal=is_causal) + x = x + self.attn_scale.astype(x.dtype)[None, None, :] * attn_out + x = x + self.mlp_scale.astype(x.dtype)[None, None, :] * self.mlp(self.mlp_norm(x) * self.ln_scale) + return x + + +class GPTv2(nn.Module): + """Upgraded GPT with all Top-3 techniques.""" + def __init__(self): + super().__init__() + self.logit_softcap = LOGIT_SOFTCAP + self.tok_emb = nn.Embedding(VOCAB_SIZE, MODEL_DIM) + self.bigram = BigramHashEmbedding(BIGRAM_BUCKETS, BIGRAM_DIM, MODEL_DIM) + self.smear = SmearGate(MODEL_DIM) + + self.num_encoder_layers = NUM_LAYERS // 2 # 5 + self.num_decoder_layers = NUM_LAYERS - self.num_encoder_layers # 6 + self.num_skip_weights = min(self.num_encoder_layers, self.num_decoder_layers) + self.skip_weights = mx.ones((self.num_skip_weights, MODEL_DIM), dtype=mx.float32) + + self.blocks = [] + for i in range(NUM_LAYERS): + use_xsa = i >= (NUM_LAYERS - XSA_LAST_N) # last 4 layers + self.blocks.append( + Block(MODEL_DIM, NUM_HEADS, NUM_KV_HEADS, MLP_MULT, ROPE_BASE, QK_GAIN_INIT, + layer_idx=i, use_xsa=use_xsa) + ) + self.final_norm = RMSNormNoWeight() + + # Init: zero out output projections + for b in self.blocks: + b.attn.proj.weight = mx.zeros_like(b.attn.proj.weight) + b.mlp.proj.weight = mx.zeros_like(b.mlp.proj.weight) + self.tok_emb.weight = ( + mx.random.normal(self.tok_emb.weight.shape, dtype=mx.float32) * TIED_EMBED_INIT_STD + ).astype(COMPUTE_DTYPE) + + def softcap(self, logits): + c = self.logit_softcap + return c * mx.tanh(logits / c) + + def forward_hidden(self, input_ids, 
is_causal=True): + x = self.tok_emb(input_ids).astype(COMPUTE_DTYPE) + x = x + self.bigram(input_ids).astype(COMPUTE_DTYPE) + x = rms_norm(x) + x = self.smear(x) + x0 = x + skips = [] + for i in range(self.num_encoder_layers): + x = self.blocks[i](x, x0, is_causal=is_causal) + skips.append(x) + for i in range(self.num_decoder_layers): + if skips: + x = x + self.skip_weights[i].astype(x.dtype)[None, None, :] * skips.pop() + x = self.blocks[self.num_encoder_layers + i](x, x0, is_causal=is_causal) + return self.final_norm(x) + + def __call__(self, input_ids): + return self.forward_hidden(input_ids, is_causal=True) + + def loss_fn(self, input_ids, target_ids, is_causal=True): + h = self.forward_hidden(input_ids, is_causal=is_causal).reshape(-1, MODEL_DIM) + y = target_ids.reshape(-1) + logits = self.softcap(h @ self.tok_emb.weight.astype(h.dtype).T) + return nn.losses.cross_entropy(logits.astype(mx.float32), y, reduction="mean") + + +# ============================================================================== +# OPTIMIZER (Muon + Adam split) +# ============================================================================== +CONTROL_PATTERNS = ("attn_scale", "mlp_scale", "resid_mix", "q_gain", "skip_weight", + "gate", "scale", "ln_scale") + +class Muon: + def __init__(self, keys, params): + self.keys = keys + self.buffers = {k: mx.zeros_like(params[k]) for k in keys} + + def step(self, params, grads, step, lr_mul): + t = min(step / max(MUON_MOMENTUM_WARMUP_STEPS, 1), 1.0) + momentum = (1.0 - t) * MUON_MOMENTUM_WARMUP_START + t * MUON_MOMENTUM + lr = MATRIX_LR * lr_mul + out = {} + for k in self.keys: + p, g = params[k], grads[k] + # Gradient clipping + g_norm = mx.sqrt(mx.sum(g * g)) + g = mx.where(g_norm > GRAD_CLIP, g * (GRAD_CLIP / (g_norm + 1e-8)), g) + # Momentum + buf = momentum * self.buffers[k] + g + self.buffers[k] = buf + g_eff = g + momentum * buf + # Newton-Schulz orthogonalization + g_ortho = zeropower_newtonschulz5(g_eff, MUON_BACKEND_STEPS) + scale = 
math.sqrt(max(1.0, float(p.shape[0]) / float(p.shape[1]))) + # Weight decay + out[k] = p * (1 - lr * WEIGHT_DECAY) - lr * (g_ortho * scale).astype(p.dtype) + return out + +class SplitOptimizers: + def __init__(self, model): + params = dict(tree_flatten(model.parameters())) + self.embed_key = "tok_emb.weight" + self.matrix_keys = [ + k for k, p in params.items() + if p.ndim == 2 + and k != self.embed_key + and not any(pat in k for pat in CONTROL_PATTERNS) + ] + self.scalar_keys = [ + k for k, p in params.items() + if k != self.embed_key and k not in self.matrix_keys + ] + self.muon = Muon(self.matrix_keys, params) + self.adam_embed = optim.Adam(learning_rate=TIED_EMBED_LR, betas=[BETA1, BETA2], eps=ADAM_EPS) + self.adam_scalar = optim.Adam(learning_rate=SCALAR_LR, betas=[BETA1, BETA2], eps=ADAM_EPS) + + def step(self, model, grads_tree, step, lr_mul): + params = dict(tree_flatten(model.parameters())) + grads = dict(tree_flatten(grads_tree)) + updated = dict(params) + updated.update(self.muon.step(params, grads, step=step, lr_mul=lr_mul)) + self.adam_embed.learning_rate = TIED_EMBED_LR * lr_mul + if self.embed_key in grads: + updated.update(self.adam_embed.apply_gradients( + {self.embed_key: grads[self.embed_key]}, + {self.embed_key: params[self.embed_key]}, + )) + self.adam_scalar.learning_rate = SCALAR_LR * lr_mul + scalar_grads = {k: grads[k] for k in self.scalar_keys if k in grads} + scalar_params = {k: params[k] for k in self.scalar_keys if k in grads} + if scalar_grads: + updated.update(self.adam_scalar.apply_gradients(scalar_grads, scalar_params)) + model.update(tree_unflatten(list(updated.items()))) + + +# ============================================================================== +# SENTENCEPIECE BPB +# ============================================================================== +def build_sentencepiece_luts(sp, vocab_size): + sp_vocab_size = int(sp.vocab_size()) + table_size = max(sp_vocab_size, vocab_size) + base_bytes_lut = np.zeros((table_size,), 
dtype=np.int16) + has_leading_space_lut = np.zeros((table_size,), dtype=np.bool_) + is_boundary_token_lut = np.ones((table_size,), dtype=np.bool_) + for token_id in range(sp_vocab_size): + if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id): + continue + is_boundary_token_lut[token_id] = False + if sp.is_byte(token_id): + base_bytes_lut[token_id] = 1 + continue + piece = sp.id_to_piece(token_id) + if piece.startswith("\u2581"): + has_leading_space_lut[token_id] = True + piece = piece[1:] + base_bytes_lut[token_id] = len(piece.encode("utf-8")) + return base_bytes_lut, has_leading_space_lut, is_boundary_token_lut + +def compute_bpb(total_nll, total_tokens, total_bytes): + avg_loss = total_nll / total_tokens + bpt = avg_loss / math.log(2.0) + return bpt * (total_tokens / total_bytes) + +# ============================================================================== +# HELPERS +# ============================================================================== +def accumulate_flat_grads(accum, grads_tree, scale): + flat = dict(tree_flatten(grads_tree)) + if accum is None: + return {k: g * scale for k, g in flat.items()} + for k, g in flat.items(): + accum[k] = accum[k] + g * scale + return accum + +def lr_schedule(step, total_steps, warmdown_iters): + warmdown_start = max(total_steps - warmdown_iters, 0) + if step >= warmdown_start and step < total_steps: + return max((total_steps - step) / max(warmdown_iters, 1), 0.0) + return 1.0 + +# ============================================================================== +# MAIN +# ============================================================================== +def main(): + parser = argparse.ArgumentParser() + parser.add_argument("--steps", type=int, default=500) + parser.add_argument("--grad_accum", type=int, default=2) + parser.add_argument("--microbatch_tokens", type=int, default=32768) + parser.add_argument("--max_sub_chunk", type=int, default=8192, + help="Smaller for 27M model on M1") + 
parser.add_argument("--warmdown", type=int, default=150) + parser.add_argument("--val_every", type=int, default=100) + parser.add_argument("--val_tokens", type=int, default=1_000_000) + parser.add_argument("--save_path", type=str, default="golf_v2_model.npz") + args = parser.parse_args() + + effective_batch = args.grad_accum * args.microbatch_tokens + print("=" * 70) + print(f"Golf V2 + Retrodiction | {NUM_LAYERS}L d={MODEL_DIM} MLP={MLP_MULT}x | steps={args.steps}") + print(f"Retro alpha=0.3 | XSA last {XSA_LAST_N} | LeakyReLU² | BigramHash({BIGRAM_BUCKETS}) | EMA({EMA_DECAY})") + print(f"Effective batch: {effective_batch:,} tok/step") + print("=" * 70) + + # Tokenizer + sp = spm.SentencePieceProcessor(model_file=TOKENIZER_PATH) + base_bytes_lut, has_leading_space_lut, is_boundary_token_lut = build_sentencepiece_luts(sp, VOCAB_SIZE) + + # Validation + val_tokens = load_validation_tokens(f"{DATA_DIR}/fineweb_val_*.bin", SEQ_LEN) + if args.val_tokens > 0 and args.val_tokens < val_tokens.size: + usable = (args.val_tokens // SEQ_LEN) * SEQ_LEN + val_short = val_tokens[:usable + 1] + else: + val_short = val_tokens + print(f"Val tokens: {val_tokens.size - 1:,} (eval on {val_short.size - 1:,})") + + # Model + mx.random.seed(SEED) + model = GPTv2() + n_params = sum(int(np.prod(p.shape)) for _, p in tree_flatten(model.parameters())) + print(f"Model params: {n_params:,}") + + opt = SplitOptimizers(model) + train_loader = TokenLoader(f"{DATA_DIR}/fineweb_train_*.bin") + + # Retrodiction loss: AR forward + AR backward (reversed sequence) + # Based on Petz recovery map: retrodiction = inferring past from future + RETRO_ALPHA = 0.3 + + def retrodiction_loss(x, y): + # Forward AR loss (standard) + forward_l = model.loss_fn(x, y, is_causal=True) + + # Backward AR loss: reverse the sequence, predict in reverse order + # This teaches the model right→left patterns using causal attention + x_rev = x[:, ::-1] # reverse token order + y_rev = y[:, ::-1] # reverse target order + 
backward_l = model.loss_fn(x_rev, y_rev, is_causal=True) + + return forward_l + RETRO_ALPHA * backward_l + + def ar_loss(x, y): + return model.loss_fn(x, y, is_causal=True) + + compiled_loss_and_grad = mx.compile( + nn.value_and_grad(model, retrodiction_loss), inputs=model.state, outputs=model.state) + compiled_loss = mx.compile(ar_loss, inputs=model.state, outputs=model.state) + + # Warmup + print("Warming up...") + for _ in range(3): + x, y = train_loader.next_batch(min(args.max_sub_chunk, args.microbatch_tokens), SEQ_LEN) + loss, grads = compiled_loss_and_grad(x, y) + mx.eval(loss) + train_loader = TokenLoader(f"{DATA_DIR}/fineweb_train_*.bin") + + # EMA state — delay start until 80% of training to avoid polluting with random init + ema_start_step = int(args.steps * 0.8) + ema_state = None + + # Eval function + def eval_val(vtokens): + batch_seqs = max(args.microbatch_tokens // SEQ_LEN, 1) + total_seqs = (vtokens.size - 1) // SEQ_LEN + total_nll = 0.0 + total_tok = 0 + total_bytes = 0.0 + for s in range(0, total_seqs, batch_seqs): + e = min(s + batch_seqs, total_seqs) + chunk = vtokens[s * SEQ_LEN:(e * SEQ_LEN) + 1] + x_np = chunk[:-1].reshape(-1, SEQ_LEN) + y_np = chunk[1:].reshape(-1, SEQ_LEN) + x = mx.array(x_np, dtype=mx.int32) + y = mx.array(y_np, dtype=mx.int32) + ct = float(y.size) + bl = compiled_loss(x, y).astype(mx.float32) + mx.eval(bl) + total_nll += float(bl.item()) * ct + prev_ids = x_np.reshape(-1) + tgt_ids = y_np.reshape(-1) + bytes_np = base_bytes_lut[tgt_ids].astype(np.float64) + bytes_np += (has_leading_space_lut[tgt_ids] & ~is_boundary_token_lut[prev_ids]).astype(np.float64) + total_tok += int(ct) + total_bytes += bytes_np.sum() + return compute_bpb(total_nll, total_tok, total_bytes) + + # Sub-chunking + def sub_chunks(micro_tokens): + usable = (micro_tokens // SEQ_LEN) * SEQ_LEN + chunk_size = max((args.max_sub_chunk // SEQ_LEN) * SEQ_LEN, SEQ_LEN) + chunks = [] + rem = usable + while rem > 0: + c = min(rem, chunk_size) + chunks.append(c) + 
rem -= c + return chunks + + # Training loop + t0 = time.perf_counter() + best_bpb = float("inf") + + for step in range(args.steps + 1): + is_last = (step == args.steps) + + # Eval + if is_last or (args.val_every > 0 and step % args.val_every == 0): + use_ema = ema_state is not None + if use_ema: + orig_params = {k: mx.array(v) for k, v in tree_flatten(model.parameters())} + model.update(tree_unflatten(list(ema_state.items()))) + mx.eval(model.parameters()) + + val_bpb = eval_val(val_short) + marker = " *BEST*" if val_bpb < best_bpb else "" + best_bpb = min(best_bpb, val_bpb) + elapsed = time.perf_counter() - t0 + tokens_seen = step * effective_batch + ema_tag = " [EMA]" if use_ema else "" + print(f"step:{step}/{args.steps} val_bpb:{val_bpb:.4f}{marker}{ema_tag} " + f"tokens:{tokens_seen / 1e6:.0f}M elapsed:{elapsed:.0f}s") + + if use_ema: + model.update(tree_unflatten(list(orig_params.items()))) + mx.eval(model.parameters()) + + if is_last: + if ema_state is not None: + model.update(tree_unflatten(list(ema_state.items()))) + mx.eval(model.parameters()) + break + + # LR schedule + lrm = lr_schedule(step, args.steps, args.warmdown) + + # Gradient accumulation + grad_accum = None + train_loss = mx.array(0.0, dtype=mx.float32) + gs = 1.0 / args.grad_accum + + for _ in range(args.grad_accum): + chunks = sub_chunks(args.microbatch_tokens) + total_ct = float(sum(chunks)) + micro_loss = mx.array(0.0, dtype=mx.float32) + micro_accum = None + for ct in chunks: + x, y = train_loader.next_batch(ct, SEQ_LEN) + loss, grads = compiled_loss_and_grad(x, y) + sc = float(ct) / total_ct + micro_loss = micro_loss + loss.astype(mx.float32) * sc + micro_accum = accumulate_flat_grads(micro_accum, grads, sc) + mx.eval(micro_loss, micro_accum) + + train_loss = train_loss + micro_loss * gs + grad_accum = accumulate_flat_grads( + grad_accum, tree_unflatten(list(micro_accum.items())), gs) + mx.eval(train_loss, grad_accum) + + grads_tree = tree_unflatten(list(grad_accum.items())) + 
opt.step(model, grads_tree, step=step, lr_mul=lrm) + mx.synchronize() + + # EMA update — start after warmup + if step == ema_start_step: + ema_state = {k: mx.array(v) for k, v in tree_flatten(model.parameters())} + mx.eval(ema_state) + print(f" EMA started at step {step}") + elif ema_state is not None: + d = EMA_DECAY + for k, v in tree_flatten(model.parameters()): + if k in ema_state: + ema_state[k] = d * ema_state[k] + (1 - d) * v + mx.eval(ema_state) + + if step % 100 == 0 and step > 0: + elapsed = time.perf_counter() - t0 + tps = step * effective_batch / elapsed + print(f" step:{step} train_loss:{float(train_loss.item()):.4f} " + f"lr_mul:{lrm:.4f} tok/s:{tps:.0f}") + + # Save (convert bfloat16 to float32 for numpy compatibility) + flat = dict(tree_flatten(model.parameters())) + np_weights = {} + for k, v in flat.items(): + if v.dtype == mx.bfloat16: + np_weights[k] = np.array(v.astype(mx.float32)) + else: + np_weights[k] = np.array(v) + np.savez(args.save_path, **np_weights) + print(f"\nSaved to {args.save_path}") + + print("=" * 70) + print(f"FINAL val_bpb: {val_bpb:.4f} (best: {best_bpb:.4f})") + print(f"Baseline: 1.2244 | Gap: {best_bpb - 1.2244:+.4f}") + print(f"Total tokens: {args.steps * effective_batch / 1e9:.3f}B") + print(f"Model: {NUM_LAYERS}L d={MODEL_DIM} MLP={MLP_MULT}x | {n_params:,} params") + print("=" * 70) + + +if __name__ == "__main__": + main() diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/README.md b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/README.md new file mode 100644 index 0000000000..c46ae41817 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/README.md @@ -0,0 +1,492 @@ +# Non-record: Shared AR + Masked Denoising — −0.0205 ± 0.005 BPB (5-seed mean) vs Matched Causal-Only Baseline at Final Checkpoint (11L, 1×H100) + +*This folder contains the full 
reproducible artifacts and submission writeup (v3.5) for the 6-run scaling sweep + 5-seed verification ablation reported in [openai/parameter-golf#1255](https://github.com/openai/parameter-golf/pull/1255). Standalone research diary mirror: [github.com/akaiHuang/meadow-golf](https://github.com/akaiHuang/meadow-golf).* + +*v3.5 changes (vs v3.3):* +*1. Adds **5-seed multi-seed verification** of the 11L headline (`SEED ∈ {1337, 42, 2024, 7, 100}` for the shared model, `SEED=1337` for the causal-only control), all measured at the true final training step rather than the last `val_every`-aligned intermediate checkpoint.* +*2. Reports the **5-seed mean delta `−0.0205 BPB`** as the primary headline (method-level effect size). The single-seed best (`SEED=1337`, delta `−0.0290 BPB`) is reported as a post-hoc reference for the deployable artifact only and is explicitly **not** the headline number.* +*3. Methodology fix: `train_cdm.py` now unconditionally writes a `step_final.pt` checkpoint at the end of training, so CF evaluation no longer reads a checkpoint hundreds of steps before the actual end of training. This addresses the intermediate-checkpoint concern raised in v3.3 review.* +*4. The original 6-run scaling sweep (5L + 11L) is retained in §3.2 as cross-scale evidence; the 11L numbers in §3.1 are now superseded by the multi-seed final-checkpoint measurement.* + +**Wishlist RFC addressed:** Text Diffusion (primary), TTT, Depth Recurrence. + +**Author:** Sheng-Kai Huang ([@akaiHuang](https://github.com/akaiHuang)) · akai@fawstudio.com + +**Note on authorship.** This is an individual, self-funded research submission. I am not part of a lab or a team. Total self-funded compute across both pods reported here: **~$7.43** ($3.93 for the §3.2 6-run scaling sweep on the first 1×H100 SXM pod in US-MO-1 on 2026-04-09, plus $3.50 for the §3.1 multi-seed verification on a second 1×H100 SXM pod the same day). 
Every script, log, and the `seeds_run/` spot-check artifacts for §3.1 are committed to this folder or available on my public Hugging Face datasets (`akaiii/meadow-golf-checkpoints`, `akaiii/meadow-golf-v4096`). The exact §3.1 `.npz` and `step_final.pt` state files are intentionally not committed to this PR folder because they would add ~1.3 GB; their location and availability-on-request path are documented in `seeds_run/README.md`. The text uses first-person singular throughout; where it reads "this work" or "this submission" it is shorthand for the same single author. + +**Summary.** A shared-weight 11L d=512 v4096 model jointly trained on causal AR + uniform-noise D3PM masked denoising, evaluated via a two-pass Coarse-to-Fine (CF) decoder at the **true final training step**, scores lower BPB than a matched-compute causal-only baseline (1×H100 SXM, 540 s, FineWeb v4096, N=500×1024). Across **5 fresh training seeds** for the shared model and **1 fresh training seed** for the matched control, the **5-seed mean delta is −0.0205 BPB**, with the shared model's CF Total estimated at **1.3009 ± 0.005** (5-seed mean ± std) against a single-control baseline at **1.3214**. The control's training-stochasticity term is *not directly measured* in this round (n=1 fresh seed), and no significance test is computed; see §3.1 for the intuitive calibration that the gap is large relative to the visible variance on the shared side, and §6.0 for the second-control-seed experiment that would close the gap. + +A causal-only control run under CF evaluation produces garbage (≈ 2.45 BPB), confirming the effect comes from joint training rather than a metric artifact. The 5-seed mean (−0.0205 BPB) is the method-level effect size; the single best seed (−0.0290 BPB, `SEED=1337`) is reported in §3.1 only as the deployable artifact reference. 
The original 6-run scaling sweep at 5L + 11L is retained in §3.2 as cross-scale evidence; the 5L row shows a −0.054 BPB single-seed gap that has not yet been multi-seed verified (§6.0). Total self-funded compute across both pods: **~$7.43** on 1×H100 SXM. Every headline number in §3.1 is auditable from files in this folder, including the training and CF eval logs committed under `seeds_run/`; exact reruns are specified in §9.1. + +--- + +## 1. Why This Submission (RFC Response) + +The "Requests for PRs" list includes **Text diffusion** as a wishlist item. Twelve diffusion PRs are currently open; the dominant paradigm is bidirectional masked diffusion training evaluated with a discrete absorbing-mask variational bound (`val_var_bpb`), established by #820. That line is progressing well (#1241 at 0.9901, #1106 at 1.1465). + +I take a different operational question: **can joint training of causal-AR and masked-denoising objectives on shared weights lower BPB on the standard Parameter Golf metric, when evaluated via a concrete two-pass decoder rather than a 256-step variational bound?** The answer in this submission, under full matched-compute controls and 5-seed verification at 11L (§3.1), is yes: a **5-seed mean delta of −0.0205 BPB** at the matched 1×H100 540 s budget, with a single-seed control baseline (see §3.1 for the statistical caveat). The single-seed best (`SEED=1337`) gives a wider −0.0290 BPB and is reported only as a post-hoc reference for the deployable artifact. The cross-scale 5L row in §3.2 shows a single-seed −0.054 BPB gap that is consistent with the 11L direction but is not yet multi-seed verified (§6.0). The gain at 11L is not a metric artifact: the same CF evaluation run on a causal-only control produces 2.45 BPB (garbage), because the bidirectional mode was never trained. The effect comes from the shared training objective, not from the metric itself. + +--- + +## 2. 
Method + +### 2.1 Training + +The shared-weight model is trained with two gradient contributions summed at every step (no phase switching, no loss schedule). The following pseudocode matches `train_cdm.py` lines 997–1012: + +```python +# --- AR loss (causal mode) --- +ar_loss = causal_lm_loss(model(x, is_causal=True), y) / grad_accum +ar_loss.backward() + +# --- Denoising loss (bidirectional mode) --- +# uniform-noise D3PM: replace masked positions with random vocab tokens +mask_rate = np.random.uniform(0.15, 0.50) # per-step rate +mask = torch.rand(B, T) < mask_rate +x_masked = x.clone() +x_masked[mask] = torch.randint(0, vocab_size, (mask.sum(),)) # uniform-noise D3PM corruption + +logits = model.forward_hidden(x_masked, is_causal=False) # bidirectional pass +per_tok = cross_entropy(logits, x, reduction="none") +cdm_loss = (per_tok * mask.float()).sum() / mask.sum() * 0.3 / grad_accum # weight = 0.3 +cdm_loss.backward() +``` + +The same parameter tensor is used in both forward calls. The only difference between the two forwards is the `is_causal` flag. There are no separate heads, no separate embedding tables, no phase switching. The two `.backward()` calls are equivalent to summing the gradients of `ar_loss + 0.3 * cdm_loss`. + +Key configuration: +- **Mask rate**: `U(0.15, 0.50)` per step (not `U(0.0, 1.0)` — the model never sees fully-masked inputs) +- **CDM loss weight**: `0.3` relative to `1.0` on the AR loss — the causal objective dominates during training +- **Corruption type**: uniform-noise D3PM (each masked position replaced with a random token drawn uniformly from the vocabulary), not absorbing-mask MDLM + +### 2.2 Coarse-to-Fine Decoder (Evaluation) + +The evaluation procedure is a stride-structured variant of **Mask-Predict** (Ghazvininejad et al. 2019), with one change: the first round is a causal AR pass rather than an unconditional mask prediction. Given a sequence of length L and a stride `s`: + +1. 
**Pass 1 (causal mode, `is_causal=True`).** Run the model in causal mode and score log-probabilities at positions `{0, s, 2s, ...}`. These are the "skeleton" positions. The model can only see earlier tokens (verified in §2.3).
2. **Pass 2 (bidirectional mode, `is_causal=False`).** Fill the remaining positions in `rounds` iterations. Within each round, positions that are still unresolved (the current round's positions plus all later rounds) are replaced by random vocabulary tokens drawn uniformly — this is the same D3PM-uniform corruption the model was trained on. The forward pass is then run bidirectionally. The script averages the resulting NLL over `n_random=3` independent random-fill draws to reduce variance. Ground-truth tokens at positions already resolved in earlier rounds are kept as-is (the code starts each round from `x.copy()` — i.e. ground truth — and applies the random-token reassignment to unresolved positions only; it does not propagate model samples from earlier rounds).

The total BPB is the sum of pass-1 and pass-2 negative log-likelihoods, normalized by total bytes. It is the conditional cross-entropy of the two-pass decoding procedure described above, with Monte Carlo averaging (`n_random=3`) over the random fills used for unresolved positions during pass 2. It is *not* an exact entropy; it is the cross-entropy a decoder following this exact procedure would achieve.

Full implementation: `eval_cf_dualbrain.py` (MLX, 5L reference) and `eval_cf_dualbrain_cuda.py` (PyTorch/CUDA, 11L). Both files are included in this folder.

### 2.3 Causal-Mask Integrity Check

Because the main numerical claim rests on the `is_causal=True` forward correctly masking future tokens, I ran an explicit future-token leakage test on the 5L checkpoint. The test constructs two token sequences `seq_A` and `seq_B` that are identical for positions `0..15` and differ for positions `16..31`, forwards both with `is_causal=True`, and compares logits.
+ +Under a correct causal mask, logits at positions `0..15` must be byte-identical between the two inputs (future tokens cannot influence earlier positions). Under a broken mask, they will diverge. + +Observed result on `shared_ar_cdm.npz`: + +``` +Prefix positions 0..15 (should be identical under causal): + max |logits_A - logits_B| = 0.000000e+00 + mean |logits_A - logits_B| = 0.000000e+00 +Suffix positions 16..31 (should differ, as inputs differ): + max |logits_A - logits_B| = 1.82e+01 +``` + +Prefix divergence is exactly zero (not merely below precision) and suffix divergence confirms the model is not constant. The `is_causal=True` path does not leak future tokens. The test script is included as `leakage_test.py`; reviewers can reproduce it on any Apple Silicon machine with `mlx >= 0.31` in under 30 seconds. The same SDPA call path (`F.scaled_dot_product_attention(q, k, v, is_causal=is_causal, scale=...)` with no additional `attn_mask` argument) is used by both `train_cdm.py` (training) and `eval_cf_dualbrain_cuda.py` (11L evaluation), so the integrity of the 5L test carries to the 11L numbers. + +### 2.4 Why Not `val_var_bpb` + +The MDLM line uses `val_var_bpb`, a variational upper bound on NLL under the discrete absorbing-mask Markov chain. I deliberately do not report this metric for three reasons: + +1. **Training-eval mismatch.** `val_var_bpb` assumes absorbing-mask training. This submission uses uniform-noise replacement (D3PM-uniform). Applying absorbing ELBO to a uniform-noise model is not a valid bound. +2. **No realizable decoder at 256 steps.** `val_var_bpb` requires 256–512 forward passes. No practical compression procedure runs at that cost; the metric measures tightness, not decoder-ability. +3. **Apples-to-oranges risk.** Mixing CF BPB with `val_var_bpb` in one table would compare different quantities. + +I cite `val_var_bpb` as a valid metric for its research line. 
+ +### 2.5 Related Prior Work + +The core idea of this submission — one set of weights trained under multiple attention-mask regimes and used in more than one mode at evaluation — is not new. I do not claim to have invented joint causal + bidirectional training or iterative mask-and-refill decoding. The contribution here is the specific combination (uniform-noise D3PM denoising jointly trained with a causal AR loss at 0.3 : 1 weight, evaluated via a two-pass Mask-Predict-style decoder) and its empirical behavior in the parameter-golf regime. + +Relevant prior work that readers should consult: + +- **UniLM** — Dong et al. 2019, *"Unified Language Model Pre-training for Natural Language Understanding and Generation"* (arXiv:1905.03197). The closest architectural precedent: one transformer trained with three attention-mask regimes (unidirectional, bidirectional, seq2seq) on the same weights. My training is a simpler variant with only two mask regimes (causal + bidirectional) and a D3PM-uniform denoising objective in place of UniLM's masked-LM objective. +- **GLM** — Du et al. 2022, *"GLM: General Language Model Pretraining with Autoregressive Blank Infilling"* (arXiv:2103.10360). Unifies understanding and generation via autoregressive blank infilling on spans. Directly motivates the "one model for generate + edit" framing in §5. +- **FIM / Fill-in-the-Middle** — Bavarian et al. 2022, *"Efficient Training of Language Models to Fill in the Middle"* (arXiv:2207.14255). The production approach used by Codex/Copilot: reorder training data as `[prefix, suffix, middle]` and train a standard causal LM. This is the main baseline any future retrofit experiment (see §6) would compare against. +- **D3PM** — Austin et al. 2021, *"Structured Denoising Diffusion Probabilistic Models in Discrete State-Spaces"* (arXiv:2107.03006). The source of the uniform-noise corruption used in §2.1 denoising loss. 
The training here uses the D3PM-uniform noise kernel (random token replacement), not the absorbing-mask kernel used by the MDLM line. +- **Mask-Predict** — Ghazvininejad et al. 2019, *"Mask-Predict: Parallel Decoding of Conditional Masked Language Models"* (arXiv:1904.09324). Iterative parallel decoding with round-based refinement over masked positions. My two-pass Coarse-to-Fine decoder in §2.2 is a stride-structured variant with a causal AR skeleton pass replacing the initial Mask-Predict round. +- **MDLM** — Sahoo et al. 2024, *"Simple and Effective Masked Diffusion Language Models"* (arXiv:2406.07524). The reference point for §2.4 and the dominant paradigm in the parameter-golf text-diffusion cluster (see §8). + +Additional references on joint causal + bidirectional training that are relevant but not directly adapted here: **XLNet** (Yang et al. 2019, permutation LM), **T5** (Raffel et al. 2020, span-corruption denoising), **BART** (Lewis et al. 2020, denoising autoencoder), **CM3** (Aghajanyan et al. 2022, causal-masked joint training). + +--- + +## 3. Main Results + +### 3.1 11L Multi-Seed Verification — Primary Evidence (1×H100 SXM, 540 s each, final-checkpoint eval) + +Six independent training runs at 11L d=512 v4096, same unified script, same v4096 data, same 540 s training budget. **Five fresh seeds for the shared model** (`SEED ∈ {1337, 42, 2024, 7, 100}`, joint AR + 0.3 · masked-denoising) and **one fresh seed for the matched causal-only control** (`SEED=1337`, w=0.0). + +**Why the control uses a single training seed (and why this is a known limit, not a justified equivalence).** This round of compute was concentrated on the shared side because the shared CF Total is the headline quantity, and because the shared side is where the visible empirical variance lives. The control's training-stochasticity term is **not directly measured** in this round (n=1 fresh control seed); a second fresh control seed in §6.0 is the only way to actually estimate it. 
For *intuitive calibration only* — not as a formal upper bound or as input to any significance computation — the shared side's 5-seed training final `val_bpb` std is ≈ 0.0019 (`{1.4387, 1.4393, 1.4416, 1.4422, 1.4430}`), and a causal-only optimization is a strictly simpler training objective (one loss component vs two, no bidirectional forward pass), which makes it *plausible* (but not proven) that the control's training std is in the same order of magnitude or smaller. This is a working assumption used only to motivate why a one-seed control round was a reasonable allocation of compute given the budget constraint, not to claim that the delta CI has been fully bounded. The single control's CF-eval Pure-AR (1.3214) matches its own training final `val_bpb` (1.3146) to within 0.007 BPB, which is consistent with that intuition but is also a single data point and proves nothing on its own.

**Bottom line for this section: the unmeasured control variance is the largest remaining methodological limit of v3.5. §6.0 closes it.**

**Methodology fix from v3.3.** All evaluations in this section are run on `step_final.pt`, the actual last training step, rather than the last `val_every`-aligned intermediate checkpoint that v3.3 was using. The training script (`train_cdm.py`) now unconditionally writes a `step_final.pt` at end of training; the eval script (`eval_cf_ablation.py`) consumes it directly. The intermediate-checkpoint difference is asymmetrically large for the shared model (which trains slower per step due to the bidirectional pass and therefore reaches fewer total steps) and was the dominant noise source in v3.3. After this fix, the CF Total seed-to-seed sample standard deviation drops from 0.022 (v3.3, intermediate checkpoint) to **0.0051** (v3.5, final checkpoint; full-precision value from the 5 logged `cf_total` numbers, rounded), a 4.3× reduction in standard deviation (≈ 18× in variance).
+ +| Run | Seed | Training final val_bpb | CF eval Pure-AR | CF eval CF Total | +|---|---|---|---|---| +| **11L_w0 (control)** | 1337 | 1.3146 | **1.3214** | 2.4538 (invalid — bidirectional mode was never trained) | +| **11L_w0.3 (shared)** | 1337 | 1.4387 | 1.4428 | **1.2924** ⭐ best | +| **11L_w0.3 (shared)** | 42 | 1.4393 | 1.4425 | **1.3027** | +| **11L_w0.3 (shared)** | 2024 | 1.4430 | 1.4459 | **1.3060** | +| **11L_w0.3 (shared)** | 7 | 1.4416 | 1.4446 | **1.3025** | +| **11L_w0.3 (shared)** | 100 | 1.4422 | 1.4456 | **1.3007** | + +**11L_w0.3 5-seed CF Total stats:** mean **1.3009**, sample std **0.0051** (≈ 0.005), min 1.2924, max 1.3060. + +**Headline delta computation.** The primary headline of this submission is the 5-seed mean delta. The single-seed best is reported as a post-hoc reference for the deployable artifact, not as the effect size. + +| Quantity | Value | Role | +|---|---|---| +| **5-seed mean delta** (primary headline) | `1.3009 − 1.3214 = ` **`−0.0205 BPB`** (shared CF mean − single-seed control) | method-level effect size | +| Single-seed best (post-hoc reference) | `1.2924 − 1.3214 = ` `−0.0290 BPB` (`SEED=1337`, best of the 5 trained seeds) | the model file one would actually ship | + +**Statistical caveat.** Both deltas use the same single-seed final-checkpoint w0 control measurement, so the control side of the delta carries **no within-experiment variance estimate at all**. The shared side has a 5-seed CF Total sample std of 0.0051 → SE 0.0023, and a 5-seed training final `val_bpb` std of 0.0019 (both directly computable from the table above). I do not run any significance test or compute any joint CI here, because doing so would require either (a) a measured control std, which this round does not have, or (b) treating the shared side's 0.0019 as a control upper bound, which is at best a working intuition (causal-only training has fewer loss components and no bidirectional forward pass) and at worst a hand-wave — not a formal bound. 
The intended reading of §3.1 is therefore the *unweighted* observation: "the 5-seed shared CF mean lands ~0.02 BPB below the single control point at the same training protocol and the same eval sample, and the visible shared-side variance is much smaller than that gap". Whether this gap survives a directly measured control variance is what §6.0 tests; until then, no significance claim is made. + +The single-seed best (−0.0290) is the result of post-hoc selection over 5 seeds and is therefore upward-biased as an effect-size estimator; it is reported only because that specific `step_final.pt` is the file that one would actually deploy as the §10 submission artifact, and reviewers should be able to reconcile the deployable file with the §3.1 statistics. + +Notes on the table: +- All BPB numbers are measured on the same FineWeb v4096 validation shard with the same sampling protocol (N=500 sequences × seq_len=1024, eval `--seed 42` fixed across all runs). Within each row, Pure-AR and CF are on the same sequences. +- The "invalid" entry for the control row is informative: it is the result of running the `is_causal=False` pass on a model that was never trained with a bidirectional objective. The bidirectional mode is untrained weights, so it produces a nearly uniform distribution, and CF Total explodes to ≈ 2.45 BPB. This **validates** that the CF gain in the shared rows is not a metric artifact — if it were, the control would show the same CF reduction. +- The shared model's `Pure-AR` column shows the cost of joint training: at `w=0.3`, the shared model is ≈ +0.12 BPB worse on causal-only generation than the dedicated control. The CF decoder more than recovers this gap, but it does not erase it — the shared model is **not** a free lunch on Pure-AR; the gain is conditional on running the CF decoder at inference time. This is the test-time-compute framing developed in §5. 
+ +### 3.2 Original 6-Run Scaling Sweep at 5L + 11L (single seed, intermediate checkpoint, retained as cross-scale evidence) + +The original scaling sweep that motivated the multi-seed verification in §3.1. Six independent training runs at the same 1×H100 540 s budget, varying model size (5L d=256 vs 11L d=512) and CDM loss weight (0.0, 0.3, 1.0), all on `SEED=1337`, all evaluated at the last `val_every=500`-aligned intermediate checkpoint (`step_5000.pt` for w=0 / w=1.0, `step_1500.pt` for w=0.3). These numbers are subsumed by §3.1 for the 11L row (which is now multi-seed and final-checkpoint), but the 5L row has not yet been multi-seed verified and is retained here as the only cross-scale evidence. + +| Run | Params | Training objective | Pure-AR BPB (single-mode) | CF BPB (two-pass decoder) | +|---|---|---|---|---| +| **5L_w0 (control)** | 4.3 M | causal-only | **1.4479** | 2.4371 (invalid) | +| 5L_w0.3 | 4.3 M | causal + 0.3 · masked denoising | 1.5231 | **1.4009** | +| 5L_w1.0 | 4.3 M | causal + 1.0 · masked denoising | 1.5841 | **1.3939** | +| 11L_w0 (control, intermediate ckpt) | 28.4 M | causal-only | 1.3574 | 2.3947 (invalid) — *superseded by §3.1* | +| 11L_w0.3 (intermediate ckpt) | 28.4 M | causal + 0.3 · masked denoising | 1.4708 | 1.3301 — *superseded by §3.1 (best seed: 1.2924)* | +| 11L_w1.0 (intermediate ckpt) | 28.4 M | causal + 1.0 · masked denoising | 1.5414 | 1.3527 | + +The 5L row gives a single-seed delta of `1.3939 − 1.4479 = −0.054 BPB` (5L_w1.0 CF vs 5L_w0 Pure-AR). This is **not yet multi-seed verified** and should be treated as a single-seed observation pending the §6.0 follow-up. The 11L numbers in this table are deprecated in favour of §3.1. + +### 3.3 5L d=256 SP1024 — 8-Config CF Sweep (M1 Max, free) + +Before the 6-run H100 ablation, I ran a free pre-flight sweep on M1 Max using an earlier 5L SP1024 shared checkpoint (`shared_ar_cdm.npz`, 4.2 M params) to locate the CF sweet spot across stride × rounds. 
This is the sweep that convinced me stride=2, rounds=2 is worth spending H100 compute to test. The checkpoint here is SP1024 (not v4096), so the absolute BPB values differ from §3.1 due to the tokenizer — but the *shape* of the sweep is the signal.

| Config | Pass-1 (causal) NLL | Pass-2 (denoise) NLL | **CF Total BPB** | vs Pure-AR 2.5386 |
|---|---|---|---|---|
| Pure AR baseline (same model, single-mode) | — | — | **2.5386** | baseline |
| stride=2, rounds=1 | 1.2615 | 1.2807 | 2.5422 | +0.14% |
| **stride=2, rounds=2** | **1.2688** | **1.0598** | **2.3285** | **−8.28%** |
| stride=3, rounds=1 | 0.8663 | 2.1996 | 3.0659 | +20.77% |
| stride=3, rounds=2 | 0.8540 | 1.6754 | 2.5294 | −0.36% |
| stride=3, rounds=3 | 0.8527 | 1.6052 | 2.4578 | −3.18% |
| stride=4, rounds=1 | 0.6370 | 2.6794 | 3.3164 | +30.64% |
| stride=4, rounds=2 | 0.6404 | 2.0915 | 2.7319 | +7.61% |
| stride=4, rounds=3 | 0.6436 | 1.9617 | 2.6053 | +2.63% |

Sweet spot: stride=2, rounds=2 (50/50 causal–bidirectional split with two denoising refinement rounds). This is the only CF configuration used in §3.1. At stride ≤ 3, every `rounds ≥ 2` configuration matches or beats pure-AR; at stride=4, even the multi-round configurations remain worse than pure-AR (+7.61% at rounds=2, +2.63% at rounds=3). Wider-stride single-round configurations are catastrophic because the bidirectional pass has too much to fill from too little context in a single pass.

### 3.4 Earlier 5L SP1024 Headline (1 line, for continuity)

Before running the §3.1 ablation, the same (stride=2, rounds=2) CF configuration was measured on the earlier SP1024 5L shared checkpoint (`shared_ar_cdm.npz`) at N=2000 × seq_len=256 on M1 Max: Pure-AR 2.5412, CF Total **2.3382**, Δ **−7.99%** (stable across N=500 → N=2000). Kept here only to show that the §3.3 sweet spot holds at larger sample sizes on the pre-flight checkpoint. Not the primary claim.
+ +### 3.5 CDM-Weight Sensitivity and Scale Behaviour + +From the §3.1 table, two monotonic patterns emerge that are informative about where this paradigm works and where it does not: + +**The causal-mode tax grows with CDM weight.** As the CDM loss weight increases from 0 → 0.3 → 1.0, the shared model's Pure-AR BPB gets worse in a near-linear way. The table below uses the **final-checkpoint** measurements from §3.1 for 11L (1 control seed and 5-seed mean for `w=0.3`) and the §3.2 single-seed scaling sweep for 5L (the only available 5L source until §6.0): + +| Scale | Source | w=0 Pure AR | w=0.3 Pure AR | w=1.0 Pure AR | Tax at w=0.3 | Tax at w=1.0 | +|---|---|---|---|---|---|---| +| 5L | §3.2 (intermediate ckpt, single seed) | 1.4479 | 1.5231 | 1.5841 | **+0.075** | **+0.136** | +| 11L | §3.1 (final ckpt, 1 control + 5-seed mean) | **1.3214** | **1.4443** | — | **+0.123** | — | + +At 11L the tax at `w=0.3` is **larger in absolute terms than at 5L** (0.123 vs 0.075). This is a non-trivial finding: naively one might expect the extra capacity of 11L to absorb the multi-task objective more gracefully, but the opposite happens in this regime (the causal head gives up more ground at 11L). I do not yet know whether this trend continues at 100 M+ or starts to reverse; that is the primary open question for §6. + +(A `w=1.0` 11L row is not given here because the §3.1 verification did not retrain `w=1.0`. The intermediate-checkpoint single-seed `w=1.0` value from §3.2 is preserved as a legacy reference in **Appendix A**.) 
+ +**The CF two-pass decoder recovers the tax and then some.** Even though the shared model is worse at pure causal scoring, running the two-pass CF decoder on the same model gets it below the control: + +| Scale | Control CF-eval Pure-AR | Shared CF (5-seed mean / post-hoc best) | CF advantage (mean / post-hoc best) | Verification status | +|---|---|---|---|---| +| 5L | 1.4479 | 1.3939 (w=1.0, 1 seed) | **−0.054** (single seed) | single-seed (§6.0 follow-up) | +| 11L (final ckpt) | **1.3214** (1 seed) | **1.3009 ± 0.005** (5-seed mean, w=0.3) / 1.2924 (post-hoc best `SEED=1337`) | **−0.0205 mean** / −0.0290 post-hoc best | 5 fresh shared seeds + 1 fresh control seed (§3.1) | + +At 5L the best CF configuration is w=1.0 (stronger bidirectional signal); at 11L it is w=0.3 (where the model has enough capacity that a weak bidirectional signal is enough). At both scales the shared-CF configuration scores below the matched causal-only control. The 11L row is the multi-seed final-checkpoint version from §3.1; the 5L row is still single-seed and is the highest-priority remaining verification (§6.0). The "post-hoc best" column at 11L is upward-biased (best of 5 seeds) and is reported only as the deployable-artifact reference, not as an effect-size estimate. + +### 3.6 Earlier M1 Max Pre-Flight (3-eval-seed Subsample Check on an Earlier Checkpoint) + +*This section is retained for historical context only. The §3.1 multi-seed verification at 1×H100 with 5 fresh training seeds at the final-checkpoint state supersedes it as evidence for the headline claim.* + +Prior to the 6-run ablation, I ran a 3-eval-seed subsample check on an earlier 11L 8×H100 checkpoint (`11L_shared_cdm_bf16.pt`, no longer used for primary comparison). The 3-eval-seed mean CF BPB was 1.3083 ± 0.0047 at seq_len=1024, with N=500 per seed. **What the three seeds randomize**: the validation subsample (which 500 sequences are picked) and the random fill in pass-2 denoising — *not* training stochasticity. 
The Pure-AR std of 0.0008 BPB across these eval seeds reflects validation subsample variance only, not model variance. + +The §3.1 result is methodologically stronger because it varies the **training seed**, runs on **fresh trainings** with the unified script, and evaluates at the **true final checkpoint** rather than an intermediate save. + +| Eval Seed | N | Pure AR | CF Total | Δ | +|---|---|---|---|---| +| 42 | 500 | 1.4422 | 1.3021 | −9.71% | +| 43 | 500 | 1.4438 | 1.3134 | −9.03% | +| 44 | 500 | 1.4441 | 1.3095 | −9.32% | +| **mean** | **1 500** | **1.4434 ± 0.0008** | **1.3083 ± 0.0047** | **−9.35% ± 0.28%** | + +--- + +## 4. Honest Limitations + +This PR measures a BPB improvement on the standard Parameter Golf metric (cross-entropy per byte of validation text). It does **not** measure: + +- **Comparison to the 8×H100 leaderboard at matched training compute.** The 1×H100 540 s runs see approximately 1/8 the tokens of an 8×H100 540 s run. The §3.1 11L_w0 control at training val_bpb 1.3146 (CF-eval Pure-AR 1.3214) is therefore not directly comparable to the 8×H100 leaderboard entries (top 1 = 1.1147, baseline = 1.2244). The relevant comparison in this PR is always the matched control on the same hardware, not the leaderboard. +- **Actual fill-in-middle generation quality.** Parameter Golf evaluates BPB, not generation, because 28 M-parameter models at ~270 M training tokens cannot produce coherent text regardless of architecture (GPT-2 small at 124 M / 10 B tokens is the rough coherence threshold in the literature). I ran a qualitative greedy-fill test on all six models as a sanity check (not as a claim): exact-match rates were 0–4.7% across all configurations, including the controls — consistent with the scale regime. This PR is about BPB, which *is* the Parameter Golf metric. +- **Comparison to dedicated fill-in-middle baselines** (CodeLlama-FIM, StarCoder-FIM). 
Training did not target code, so FIM code-benchmarks are not applicable without a retrofit experiment. This is Next Step #2 in §6. +- **Retrofit to pretrained LLMs.** All training here is from scratch. Whether the same shared-weight paradigm can be added to an existing pretrained causal LM via LoRA — the realistic production path for any shipping product — is the largest open question, listed as Next Step #1 in §6. +- **Share-ratio grid beyond three points.** I tested weight ∈ {0, 0.3, 1.0}. A finer grid might reveal a different optimum. +- **Multi-seed verification at 11L: partially resolved in §3.1** (5 fresh training seeds for the shared model, **1** fresh training seed for the matched control, all at the true final checkpoint, shared 5-seed CF Total sample std 0.0051, i.e. ≈0.005). The control side still has only one fresh seed in this round; a strict significance test is not run (see §3.1 statistical caveat). A second control seed is the smallest remaining gap. **Multi-seed verification at 5L: not yet done** — the −0.054 BPB gap at 5L in §3.2 is still single-seed and is the highest-priority remaining experiment (§6.0). + +--- + +## 5. Why This Might Matter — Downstream Utility Under Test-Time Compute + +The §3.1 effect is modest in absolute terms (−0.029 BPB best seed, −0.0205 BPB 5-seed mean at 11L; −0.054 BPB single-seed at 5L). What I find interesting is not the magnitude but the factorization: the matched ablation separates two capabilities a production LLM would typically want to optimize independently: + +1. **Causal-only next-token prediction**, which is how every shipping LLM (ChatGPT, Claude, GPT-4, Codex, Copilot) is primarily measured. +2. **Bidirectional conditioning** on both left and right context, which is today served either by a *second* specialized model (BERT, MDLM), by a training-time hack (FIM special tokens in Bavarian et al. 2022 / Rozière et al. 2023), or by retrieve-and-rewrite pipelines. 
+ +The matched ablation is consistent with the reading that **a single set of weights, at matched compute, can expose both capabilities when evaluated under the two-pass CF decoder**. This fits naturally into the recent test-time-compute framing (Welleck 2024, speculative decoding, Mask-Predict Ghazvininejad 2019): the CF decoder is an inference-time compute knob that trades extra forward passes for lower BPB, and the shared-weight training makes those extra passes useful instead of noise. + +**Effect-size context.** The 5-seed shared CF mean (1.3009) lands ~0.02 BPB below the single-seed control point (1.3214). The shared side's empirical std on the 5 fresh seeds is 0.005 (CF Total) and 0.0019 (training `val_bpb`); the control side has no measured variance term in this round. No significance test is computed here (see §3.1 statistical caveat). The absolute effect is small. The relevant practical question is **whether it grows, shrinks, or inverts when the model and training budget are scaled up**, which neither §3.1 nor §3.2 can answer — that is the §6.1 / §6.2 work, gated on §6.0. + +**What this is not.** This is not a claim that a 28 M parameter model can generate coherent text, or that these 540 s runs are ready for any production use. Models at this scale cannot generate coherent English regardless of architecture (GPT-2 small at 124 M / 10 B tokens is the rough coherence threshold, and these models are 5× smaller and 30× less trained). The Parameter Golf competition accepts this — BPB is the metric precisely because coherence is out of reach at these scales. The claim here is scoped to BPB under a specific decoder with a specific control, nothing more. + +--- + +## 6. What Might Work With More Compute + +Honest speculation. Each item below is a concrete experiment that would extend or close an open question from §3 — ordered by what most strongly constrains the conclusion of this submission. 
**§6.0 is the only follow-up that is gating; everything else is conditional on it.** + +### 6.0 5L multi-seed verification (highest-priority remaining experiment) + +The 11L row of §3.1 is now multi-seed verified at the true final checkpoint. The 5L row in §3.2 is **not**. The −0.054 BPB single-seed result at 5L is a stronger absolute effect than the verified 11L 5-seed mean (−0.0205), but it has the same risk profile that the 11L row had before §3.1: a single training seed at the last `val_every`-aligned intermediate checkpoint, where the training-stochasticity asymmetry between control and shared could plausibly manufacture a 0.05 BPB gap by chance. + +**Concretely**: 5 fresh training seeds for `5L_w1.0` (the winner) + 1 fresh training seed for `5L_w0` (control), all evaluated at `step_final.pt` with the new `eval_cf_ablation.py` protocol. 5L training is much cheaper than 11L (~3 min per run on 1×H100 SXM, or runnable on consumer GPUs at similar speed). Total compute estimate: ~30 min wall time, ~$1.5 self-funded on 1×H100 SXM, or essentially free on M1 Max in roughly the same wall time. This is the next experiment I will run. + +### 6.1 Retrofit onto a pretrained causal LLM via LoRA (the production path) + +The experiment that would most directly test whether this paradigm survives outside the Parameter Golf toy regime is a **LoRA-style retrofit of a pretrained causal LLM** (e.g. Qwen 3.5 0.8 B, which I already have locally). Rather than training from scratch at 28 M parameters, take a model that already generates coherent text and add a small LoRA adapter to expose a bidirectional forward mode, trained with the same joint AR + D3PM objective. No shipping LLM trains from scratch at 28 M parameters, so this is the setting where any downstream claim has to be tested. 
An initial result on Qwen 0.8 B fits in roughly 10–15 H100-hours and would tell, *within one pod session*, whether the shared-weight + CF-decoder pattern carries to a model that is actually coherent at inference. This is the single most compute-efficient downstream test and it is Next Step #1.
+
+### 6.2 Full-budget 8×H100 reproduction of the 11L ablation
+
+Run the exact §3.1 ablation at 8×H100 540 s (the production Parameter Golf budget) to test whether the 0.0205 BPB improvement (the §3.1 5-seed mean) persists, narrows, or inverts when the training-token budget grows ~8×. I do not have a confident extrapolation to offer — the Pure-AR tax in §3.5 already grows with scale in a direction that works against the shared model, and this experiment is how I find out whether that trend continues or reverses at full compute. This is Next Step #2.
+
+### 6.3 Share-ratio grid search at 11L
+
+The 6 runs used weight ∈ {0, 0.3, 1.0}. At 11L, w=0.3 gave the best CF BPB; at 5L, w=1.0 did. A fine grid (0.1, 0.15, 0.2, 0.3, 0.5, 0.7, 1.0) at 11L would locate the actual optimum and tell whether the share-ratio optimum scales with model size. This is a cheap follow-up to §6.2 — roughly 7 additional 1×H100 runs.
+
+### 6.4 Finer model-scale sweep for the Pure-AR-tax and CF-recovery curves
+
+I have two architectural data points (5L 4.2 M and 11L 28.4 M). Adding 7L d=384, 9L d=448, and 13L d=640 would give a scaling curve for both the Pure-AR tax (which appears to grow with scale in this data) and the CF recovery (which also grows with scale). A simple power-law fit would let me predict the crossover scale — the model size at which the CF gain exceeds the Pure-AR tax by a margin that makes the extra compute worth it.
+
+### 6.5 Absorbing-mask MDLM noise schedule for the bidirectional pass
+
+I used uniform-noise D3PM (random vocabulary replacement). The MDLM cluster (#820, #1106, #1241) uses absorbing-mask denoising, which the literature suggests gives stronger bidirectional representations.
Swapping the noise schedule is a one-line training change; a matched ablation would tell whether the gain would be larger under the standard MDLM noise, at the cost of some comparison legibility. + +--- + +## 7. Retrodiction — A Negative Result at Production Scale + +> **Scope note.** The runs in this section are a **different training line** from the Shared AR + Denoising model used in §3. They are a 1×H100 A/B sweep of retrodiction modes on a pure AR stack (no CDM auxiliary loss). The "Pure AR" numbers in this table are therefore *not comparable* to the "Pure AR" column of §3.3, which measures the Shared AR + Denoising checkpoint in single-mode causal. Different models, different training configurations. See §7.3 for an explicit side-by-side. + +This submission also documents a line of work I call **retrodiction** — a reversed-sequence auxiliary loss added to the standard causal AR loss. The operational definition is simply: + +```python +loss = causal_lm_loss(model(x), x) + α · causal_lm_loss(model(x.flip(1)), x.flip(1)) +``` + +I report it as a negative result at production scale. The compact story: + +### 7.1 Early-Training Signal on 5L / M1 Max + +At small scale and short token budgets, retrodiction gave up to −3.6% BPB at step 200/500, direction consistent with the motivation. + +### 7.2 Production-Stack A/B on 1×H100 + +Five independent training runs, same architecture (11L d=512 v4096, XSA-4, BigramHash), same 540 s budget, same seeds, **pure causal AR stack with no CDM auxiliary loss** — only the retrodiction mode varied: + +| Test | Retro mode | Final val_bpb | +|---|---|---| +| **D** | **OFF** | **1.3401** (best) | +| C | partial 15% | 1.3594 | +| B | merged late 80/20 | 1.3695 | +| E | alternating 90/10 | 1.3616 | +| A | alternating 50/50 | 1.4109 | + +Pure contrast (C vs D): retrodiction is a **+0.019 BPB tax** at production scale, not a gain. 
+ +### 7.3 Consolidated 11L Pure-AR Numbers + +The §3.1 ablation and the §7.2 retrodiction sweep each produce their own 11L Pure-AR BPB on the same nominal 11L d=512 v4096 architecture but with different training stacks. They are listed side by side here, restricted to a single metric kind (training final `val_bpb`) for direct comparability: + +| Source | Training objective | Retrodiction | Training stack | Pure AR final val_bpb | +|---|---|---|---|---| +| **§3.1** `11L_w0` (control, 1 seed, final ckpt) | Pure AR only | off | unified `train_cdm.py`, `--xsa_last_n=4` | **1.3146** | +| **§3.1** `11L_w0.3` (5-seed mean, final ckpt) | Joint AR + 0.3·denoising | off | unified `train_cdm.py`, `--xsa_last_n=4` | **1.4410** | +| §7.2 Test D (1 seed, earlier stack) | Pure AR only | off | earlier XSA / BigramHash configuration | 1.3401 | +| §7.2 Test C (1 seed, earlier stack) | Pure AR only | partial 15 % | earlier XSA / BigramHash configuration | 1.3594 | + +The §3.1 11L_w0 (1.3146) and §7.2 Test D (1.3401) are both single-seed pure-AR 1×H100 540 s runs at the same nominal architecture but on **different training stacks**. The 0.026 BPB difference reflects training-stack drift, not retrodiction. For the primary claim of this submission, the relevant comparison is always §3.1 11L_w0 vs §3.1 11L_w0.3 CF (both measured with the *exact* same script, same data pipeline, same eval sampling, all at the final checkpoint). The §7 retrodiction sweep is a separate, older line of work included for completeness. + +**Interpretation (hypothesis).** At 5L on short budgets, the forward loss signal may be weak enough that the reversed loss provides complementary gradient. At 11L on production budgets, I hypothesize that the forward signal is strong enough to dominate and the reversed loss competes for updates rather than augmenting them. I do not have a mechanistic proof of this interpretation, and I have not found a useful parametrization of retrodiction for the parameter-golf regime. 
+ +**Practical recommendation:** retrodiction is a tax on the production stack and should not be used. The matched-compute 6-run ablation in §3.1 was run *without* retrodiction for this reason. + +--- + +## 8. Position in the Text-Diffusion Cluster + +Snapshot of the text-diffusion cluster as of 2026-04-09 (reproducible via `gh pr list --repo openai/parameter-golf --search "diffusion" --state open --limit 50`): + +- Bidirectional masked diffusion + discrete absorbing ELBO (`val_var_bpb`): #820 mtybadger (convention-setting), #1053, #1106 agalimova, #1241 aiejvn, #1403 +- Causal MDLM as AR regularizer (eval in causal mode): #1119 gowtham0992 +- Hybrid AR + MDLM mixed training with bidirectional head discarded at eval: #1194 +- AR with diffusion-inspired auxiliary noise, evaluated as pure AR: #904 +- Prefix-conditioned discrete diffusion: #905 +- Hybrid sparse diffusion: #1198 +- **This PR:** shared-weight joint causal + masked-denoising training, evaluated via a two-pass Coarse-to-Fine decoder on BPB, with a **matched causal-only control** at the same compute. This is, to my knowledge, the first submission in the text-diffusion cluster to include an explicit matched-compute control ablation. + +This approach differs from the cluster in that both modes are actively used at evaluation on the same weights, rather than the bidirectional mode being used only at training time or evaluated separately. I do not claim this is a strict improvement over the MDLM line — it is a different question evaluated on a different metric. Direct numerical comparison across metrics (val_var_bpb / val_bpb / CF BPB) is not meaningful because they measure different quantities. See §2.4. + +--- + +## 9. 
Hardware and Reproducibility + +All training and evaluation artifacts are published on Hugging Face: + +- **`akaiii/meadow-golf-checkpoints`** — all 6 ablation checkpoints (`5L_w0.npz`, `5L_w03.npz`, `5L_w1.npz`, `11L_w0.npz`, `11L_w03.npz`, `11L_w1.npz`), 6 training logs, 6 CF eval logs, the unified training script (`train_cdm.py` + `train_ablation_runner.py`), and the CF eval scripts (`eval_cf_dualbrain.py`, `eval_cf_dualbrain_cuda.py`, `eval_cf_ablation.py`). Directory layout matches the `ablation_results/` folder in this PR. +- **`akaiii/meadow-golf-v4096`** — `bpe_v4096.model` tokenizer and the v4096 retokenized FineWeb validation + training shards used for every training run in §3.1. + +### 9.1 Reproduction of the §3.1 multi-seed verification (the v3.5 headline) — ~70 min on 1×H100 SXM, ~$3.50 + +The §3.1 5-seed shared verification + 1-seed control is the headline of this submission. Both orchestration scripts (`run_p5.sh`, `run_phase_b.sh`) and all training / CF eval logs from the actual run are committed to `seeds_run/` in this folder for reviewer-side spot checking. The scripts rely on the v3.5 copies of `train_cdm.py`, `train_ablation_runner.py` (with `--seed` support), and `eval_cf_ablation.py`; the reproduction commands below reproduce them from a clean H100 pod: + +```bash +pip install --break-system-packages torch numpy sentencepiece huggingface_hub + +git clone https://github.com/akaiHuang/meadow-golf +cd meadow-golf/experiments/2026-04-09_matched_ablation + +hf download akaiii/meadow-golf-v4096 --repo-type dataset --local-dir /workspace/gv4096 + +# 5 fresh shared seeds (11L_w0.3 × {1337, 42, 2024, 7, 100}), final-checkpoint save +SCRIPT_DIR=. 
\ + DATA_DIR=/workspace/gv4096/data \ + TOKENIZER=/workspace/gv4096/bpe_v4096.model \ + OUT_DIR=/workspace/out \ + CKPT_DIR=/workspace/ckpt \ + LOG_DIR=/workspace/logs \ + bash run_p5.sh + +# 1 fresh control seed (11L_w0 SEED=1337), final-checkpoint save +bash run_phase_b.sh +``` + +Both `run_p5.sh` and `run_phase_b.sh` invoke the unified `train_cdm.py` (which now writes a `step_final.pt` checkpoint at the end of training, addressing the v3.3 intermediate-checkpoint issue) via `train_ablation_runner.py` (`--seed` patches the module-level `SEED` constant and emits per-seed patched modules), then run `eval_cf_ablation.py` directly on the `step_final.pt` saves. Final BPB numbers should match the §3.1 table within bf16 numerical noise on the same `--seed 42` eval sample. Total wall time: ~70 min on a single 1×H100 SXM; total self-funded compute: **$3.50** at $2.99/hr. + +Reviewer spot check without rerunning anything: every number in §3.1 is grep-able from `seeds_run/logs/*.log` and `seeds_run/eval/*.log` already present in this folder. See `seeds_run/README.md` for the file inventory. + +### 9.2 Reproduction of the §3.2 6-run scaling sweep (cross-scale evidence, ~90 min on 1×H100 SXM, ~$3.93) + +This is the *original* 6-run ablation that v3.3 reported as the headline; in v3.5 it is retained only as the §3.2 cross-scale evidence (single seed each, intermediate checkpoint). The 5L row of §3.2 is the only available 5L data until §6.0 follow-up. The 11L rows are superseded by §3.1 / Appendix A but included for traceability. 
+ +```bash +pip install torch numpy sentencepiece huggingface_hub + +hf download akaiii/meadow-golf-checkpoints --repo-type dataset --local-dir ./gcp +hf download akaiii/meadow-golf-v4096 --repo-type dataset --local-dir ./gv4096 + +export PYTHONPATH="./gcp:${PYTHONPATH}" +mkdir -p out ckpt logs eval + +# Train all 6 ablation models (6 × ~10 min wallclock) +for cfg in "5L 5 256 128 2 0.0" "5L 5 256 128 2 0.3" "5L 5 256 128 2 1.0" \ + "11L 11 512 128 4 0.0" "11L 11 512 128 4 0.3" "11L 11 512 128 4 1.0"; do + read tag L D BD X W <<< "$cfg" + python3 ./gcp/train_ablation_runner.py \ + --train_script ./gcp/train_cdm.py \ + --num_layers $L --model_dim $D --vocab_size 4096 \ + --bigram_dim $BD --xsa_last_n $X --cdm_weight $W \ + -- \ + --train_budget_secs 540 --steps 9999 \ + --data_dir ./gv4096/data --tokenizer_path ./gv4096/bpe_v4096.model \ + --save_path ./out/${tag}_w${W}.npz \ + --checkpoint_dir ./ckpt/${tag}_w${W} \ + > ./logs/${tag}_w${W}_train.log 2>&1 +done + +# Evaluate all 6 under CF (6 × ~5 min wallclock) +for cfg in "5L 5 256 128 2 0.0" "5L 5 256 128 2 0.3" "5L 5 256 128 2 1.0" \ + "11L 11 512 128 4 0.0" "11L 11 512 128 4 0.3" "11L 11 512 128 4 1.0"; do + read tag L D BD X W <<< "$cfg" + latest=$(ls ./ckpt/${tag}_w${W}/step_*.pt | sort -V | tail -1) + python3 ./gcp/eval_cf_ablation.py \ + --ckpt $latest \ + --train_module_path /tmp/train_cdm_patched_${L}L_w${W}.py \ + --num_layers $L --model_dim $D --vocab_size 4096 \ + --bigram_dim $BD --xsa_last_n $X \ + --n_seqs 500 --seq_len 1024 --stride 2 --rounds 2 --seed 42 \ + --data_dir ./gv4096/data --tokenizer_path ./gv4096/bpe_v4096.model \ + --log_path ./eval/${tag}_w${W}_cf.log +done +``` + +The patched training scripts `/tmp/train_cdm_patched_*.py` are created as a side effect of `train_ablation_runner.py` and are the model-class source for the matching `eval_cf_ablation.py` run. They are regenerated deterministically from `train_cdm.py` on each run. 
The 5L M1 Max pre-flight sweep uses `eval_cf_dualbrain.py` (MLX) against `shared_ar_cdm.npz`; it runs on any Apple Silicon Mac with `mlx >= 0.31` and reproduces the §3.3 table in under 4 minutes. + +Self-funded compute for the §3.2 6-run scaling sweep: **$3.93**. Combined with the §3.1 verification ($3.50), total self-funded for this submission: **~$7.43**. + +--- + +## 10. Compliance + +- [x] **5L submission artifacts ≤ 16 MB**: the competition submission unit is the int6+lzma compressed checkpoint (`5L_*_int6.lzma` = ~3.0 MB each), well under the 16 MB cap. The intermediate `5L_w0.npz` (17.2 MB BF16) is *not* a submission artifact; it is the working final-state save used by the eval script and is never submitted. +- [x] **11L submission artifacts** are non-record (trained on 1×H100, not matched to the 8×H100 production budget). The corresponding `11L_*_int6.lzma` files are ~18.7 MB each, *over* the 16 MB cap, which is why every 11L row in this submission is filed under the **non-record track** explicitly. They are never claimed as record candidates. +- [x] No validation data accessed during training +- [x] CF evaluation uses validation tokens only for scoring; no gradient updates +- [x] No network calls during evaluation +- [x] Hardware: original 6-run scaling sweep on a single 1×H100 SXM pod ($3.93). §3.1 multi-seed verification on a second 1×H100 SXM pod ($3.50). Total self-funded ~$7.43 across both sessions. 
+- [x] Causal-mask integrity verified via the leakage test in §2.3 (`leakage_test.py` included in this folder, max prefix-logit divergence 0.0) +- [x] CF evaluation is fully specified by `SEED`; the denoising pass is Monte Carlo averaged over `n_random=3` random fills for variance reduction on residual positions (not exact, but deterministic given the seed) +- [x] All reviewer-facing §3.1 logs and orchestration scripts are stored locally in `seeds_run/` and reproducible from the per-seed `train_ablation_runner.py` invocations recorded in `run_p5.sh` / `run_phase_b.sh` +- [x] The exact §3.1 `.npz` / `step_final.pt` state files are intentionally not committed to this PR folder (~1.3 GB total); their location and availability-on-request path are documented in `seeds_run/README.md` + +--- + +## 11. Acknowledgments + +- **PR #820 (@mtybadger)** for establishing `val_var_bpb` and the MDLM reference point for text diffusion in parameter-golf. My disagreement with the metric in §2.3 is intended as productive, not dismissive. +- **PR #363 (@evangelinehelsinki)** for the template of honest negative-result reporting that §7 follows, and for the `What Might Work With More Compute` section format. +- **PRs #1106, #1241** for showing that the MDLM line is an active research target worth contributing alternatives to. + +--- + +## 12. Related Closed Submission + +I earlier withdrew [PR #1442](https://github.com/openai/parameter-golf/pull/1442), a different stack combination submission targeting AR sliding BPB. A self-audit found methodological issues including a mismatch between the evaluation used and the compressed artifact. That line of work is not being pursued further; this PR represents my focused research effort going forward. + +--- + +## Appendix A. Legacy intermediate-checkpoint 11L numbers (superseded by §3.1) + +**This appendix exists solely for traceability with v3.3. 
None of these legacy intermediate-checkpoint numbers are used in any headline claim or main analysis in v3.5.** It is here so that a reader cross-referencing v3.3 against v3.5 can find the original v3.3 single-seed 6-run table 11L values in one place, paired with the §3.1 final-checkpoint measurements that supersede them. The main analytical sections (§3.5, §7.3) of v3.5 carry only final-checkpoint measurements. + +The original v3.2 6-run sweep at 11L (single seed `1337`, evaluated at the last `val_every`-aligned intermediate checkpoint, *not* `step_final.pt`): + +| Run | Pure AR (intermediate ckpt) | CF Total (intermediate ckpt) | Status in v3.5 | +|---|---|---|---| +| `11L_w0` (control) | 1.3574 | 2.3947 (invalid) | superseded by §3.1 final ckpt: **1.3214 / 2.4538** | +| `11L_w0.3` | 1.4708 | 1.3301 | superseded by §3.1 final ckpt 5-seed mean: **1.4443 / 1.3009** | +| `11L_w1.0` | 1.5414 | 1.3527 | not retrained at final ckpt; legacy value retained | + +**Why these numbers were higher / lower than the §3.1 final-checkpoint numbers.** The intermediate checkpoint (`step_5000.pt` for w=0/w=1.0, `step_1500.pt` for w=0.3) is several hundred training steps before the actual end of the 540 s training budget. The shared model is hit asymmetrically harder by this gap because it trains slower per step (the bidirectional pass roughly doubles forward FLOPs at this size), so its last `val_every`-aligned save is *relatively* less converged than the control's. 
Fixing this with `step_final.pt` (§2 methodology fix in v3.5) improves the shared CF score by ~0.03 BPB and the control Pure-AR by ~0.03 BPB in the *opposite* direction (control gets *better* on the metric, shared also gets better but the difference reshapes) — net effect: the v3.3 single-seed delta (−0.027 BPB) and the v3.5 5-seed mean delta (−0.0205 BPB) are within 0.007 BPB of each other and have the same sign, but the v3.5 number is the one that survives the methodology fix and the multi-seed verification, and is the one quoted everywhere in the main text. + +The original 5L row of the v3.2 sweep is used directly in §3.5 as the only available 5L cross-scale evidence pending the §6.0 follow-up. diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/README_v3_5_DRAFT.md b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/README_v3_5_DRAFT.md new file mode 100644 index 0000000000..c46ae41817 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/README_v3_5_DRAFT.md @@ -0,0 +1,492 @@ +# Non-record: Shared AR + Masked Denoising — −0.0205 ± 0.005 BPB (5-seed mean) vs Matched Causal-Only Baseline at Final Checkpoint (11L, 1×H100) + +*This folder contains the full reproducible artifacts and submission writeup (v3.5) for the 6-run scaling sweep + 5-seed verification ablation reported in [openai/parameter-golf#1255](https://github.com/openai/parameter-golf/pull/1255). Standalone research diary mirror: [github.com/akaiHuang/meadow-golf](https://github.com/akaiHuang/meadow-golf).* + +*v3.5 changes (vs v3.3):* +*1. Adds **5-seed multi-seed verification** of the 11L headline (`SEED ∈ {1337, 42, 2024, 7, 100}` for the shared model, `SEED=1337` for the causal-only control), all measured at the true final training step rather than the last `val_every`-aligned intermediate checkpoint.* +*2. 
Reports the **5-seed mean delta `−0.0205 BPB`** as the primary headline (method-level effect size). The single-seed best (`SEED=1337`, delta `−0.0290 BPB`) is reported as a post-hoc reference for the deployable artifact only and is explicitly **not** the headline number.* +*3. Methodology fix: `train_cdm.py` now unconditionally writes a `step_final.pt` checkpoint at the end of training, so CF evaluation no longer reads a checkpoint hundreds of steps before the actual end of training. This addresses the intermediate-checkpoint concern raised in v3.3 review.* +*4. The original 6-run scaling sweep (5L + 11L) is retained in §3.2 as cross-scale evidence; the 11L numbers in §3.1 are now superseded by the multi-seed final-checkpoint measurement.* + +**Wishlist RFC addressed:** Text Diffusion (primary), TTT, Depth Recurrence. + +**Author:** Sheng-Kai Huang ([@akaiHuang](https://github.com/akaiHuang)) · akai@fawstudio.com + +**Note on authorship.** This is an individual, self-funded research submission. I am not part of a lab or a team. Total self-funded compute across both pods reported here: **~$7.43** ($3.93 for the §3.2 6-run scaling sweep on the first 1×H100 SXM pod in US-MO-1 on 2026-04-09, plus $3.50 for the §3.1 multi-seed verification on a second 1×H100 SXM pod the same day). Every script, log, and the `seeds_run/` spot-check artifacts for §3.1 are committed to this folder or available on my public Hugging Face datasets (`akaiii/meadow-golf-checkpoints`, `akaiii/meadow-golf-v4096`). The exact §3.1 `.npz` and `step_final.pt` state files are intentionally not committed to this PR folder because they would add ~1.3 GB; their location and availability-on-request path are documented in `seeds_run/README.md`. The text uses first-person singular throughout; where it reads "this work" or "this submission" it is shorthand for the same single author. 
+ +**Summary.** A shared-weight 11L d=512 v4096 model jointly trained on causal AR + uniform-noise D3PM masked denoising, evaluated via a two-pass Coarse-to-Fine (CF) decoder at the **true final training step**, scores lower BPB than a matched-compute causal-only baseline (1×H100 SXM, 540 s, FineWeb v4096, N=500×1024). Across **5 fresh training seeds** for the shared model and **1 fresh training seed** for the matched control, the **5-seed mean delta is −0.0205 BPB**, with the shared model's CF Total estimated at **1.3009 ± 0.005** (5-seed mean ± std) against a single-control baseline at **1.3214**. The control's training-stochasticity term is *not directly measured* in this round (n=1 fresh seed), and no significance test is computed; see §3.1 for the intuitive calibration that the gap is large relative to the visible variance on the shared side, and §6.0 for the second-control-seed experiment that would close the gap. + +A causal-only control run under CF evaluation produces garbage (≈ 2.45 BPB), confirming the effect comes from joint training rather than a metric artifact. The 5-seed mean (−0.0205 BPB) is the method-level effect size; the single best seed (−0.0290 BPB, `SEED=1337`) is reported in §3.1 only as the deployable artifact reference. The original 6-run scaling sweep at 5L + 11L is retained in §3.2 as cross-scale evidence; the 5L row shows a −0.054 BPB single-seed gap that has not yet been multi-seed verified (§6.0). Total self-funded compute across both pods: **~$7.43** on 1×H100 SXM. Every headline number in §3.1 is auditable from files in this folder, including the training and CF eval logs committed under `seeds_run/`; exact reruns are specified in §9.1. + +--- + +## 1. Why This Submission (RFC Response) + +The "Requests for PRs" list includes **Text diffusion** as a wishlist item. 
Twelve diffusion PRs are currently open; the dominant paradigm is bidirectional masked diffusion training evaluated with a discrete absorbing-mask variational bound (`val_var_bpb`), established by #820. That line is progressing well (#1241 at 0.9901, #1106 at 1.1465). + +I take a different operational question: **can joint training of causal-AR and masked-denoising objectives on shared weights lower BPB on the standard Parameter Golf metric, when evaluated via a concrete two-pass decoder rather than a 256-step variational bound?** The answer in this submission, under full matched-compute controls and 5-seed verification at 11L (§3.1), is yes: a **5-seed mean delta of −0.0205 BPB** at the matched 1×H100 540 s budget, with a single-seed control baseline (see §3.1 for the statistical caveat). The single-seed best (`SEED=1337`) gives a wider −0.0290 BPB and is reported only as a post-hoc reference for the deployable artifact. The cross-scale 5L row in §3.2 shows a single-seed −0.054 BPB gap that is consistent with the 11L direction but is not yet multi-seed verified (§6.0). The gain at 11L is not a metric artifact: the same CF evaluation run on a causal-only control produces 2.45 BPB (garbage), because the bidirectional mode was never trained. The effect comes from the shared training objective, not from the metric itself. + +--- + +## 2. Method + +### 2.1 Training + +The shared-weight model is trained with two gradient contributions summed at every step (no phase switching, no loss schedule). 
The following pseudocode matches `train_cdm.py` lines 997–1012: + +```python +# --- AR loss (causal mode) --- +ar_loss = causal_lm_loss(model(x, is_causal=True), y) / grad_accum +ar_loss.backward() + +# --- Denoising loss (bidirectional mode) --- +# uniform-noise D3PM: replace masked positions with random vocab tokens +mask_rate = np.random.uniform(0.15, 0.50) # per-step rate +mask = torch.rand(B, T) < mask_rate +x_masked = x.clone() +x_masked[mask] = torch.randint(0, vocab_size, (mask.sum(),)) # uniform-noise D3PM corruption + +logits = model.forward_hidden(x_masked, is_causal=False) # bidirectional pass +per_tok = cross_entropy(logits, x, reduction="none") +cdm_loss = (per_tok * mask.float()).sum() / mask.sum() * 0.3 / grad_accum # weight = 0.3 +cdm_loss.backward() +``` + +The same parameter tensor is used in both forward calls. The only difference between the two forwards is the `is_causal` flag. There are no separate heads, no separate embedding tables, no phase switching. The two `.backward()` calls are equivalent to summing the gradients of `ar_loss + 0.3 * cdm_loss`. + +Key configuration: +- **Mask rate**: `U(0.15, 0.50)` per step (not `U(0.0, 1.0)` — the model never sees fully-masked inputs) +- **CDM loss weight**: `0.3` relative to `1.0` on the AR loss — the causal objective dominates during training +- **Corruption type**: uniform-noise D3PM (each masked position replaced with a random token drawn uniformly from the vocabulary), not absorbing-mask MDLM + +### 2.2 Coarse-to-Fine Decoder (Evaluation) + +The evaluation procedure is a stride-structured variant of **Mask-Predict** (Ghazvininejad et al. 2019), with one change: the first round is a causal AR pass rather than an unconditional mask prediction. Given a sequence of length L and a stride `s`: + +1. **Pass 1 (causal mode, `is_causal=True`).** Run the model in causal mode and score log-probabilities at positions `{0, s, 2s, ...}`. These are the "skeleton" positions. 
The model can only see earlier tokens (verified in §2.3).
+2. **Pass 2 (bidirectional mode, `is_causal=False`).** Fill the remaining positions in `rounds` iterations. Within each round, positions that are still unresolved (the current round's positions plus all later rounds) are replaced by random vocabulary tokens drawn uniformly — this is the same D3PM-uniform corruption the model was trained on. The forward pass is then run bidirectionally. The script averages the resulting NLL over `n_random=3` independent random-fill draws to reduce variance. Ground-truth tokens at positions already resolved in earlier rounds are kept as-is (the code starts each round from `x.copy()` — ground truth at every position — and overwrites only the still-unresolved positions with random fills; it does not propagate model samples from earlier rounds).
+
+The total BPB is the sum of pass-1 and pass-2 negative log-likelihoods, normalized by total bytes. It is the conditional cross-entropy of the two-pass decoding procedure described above, with Monte Carlo averaging (`n_random=3`) over the random fills used for unresolved positions during pass 2. It is *not* an exact entropy; it is the cross-entropy a decoder following this exact procedure would achieve.
+
+Full implementation: `eval_cf_dualbrain.py` (MLX, 5L reference) and `eval_cf_dualbrain_cuda.py` (PyTorch/CUDA, 11L). Both files are included in this folder.
+
+### 2.3 Causal-Mask Integrity Check
+
+Because the main numerical claim rests on the `is_causal=True` forward correctly masking future tokens, I ran an explicit future-token leakage test on the 5L checkpoint. The test constructs two token sequences `seq_A` and `seq_B` that are identical for positions `0..15` and differ for positions `16..31`, forwards both with `is_causal=True`, and compares logits.
+
+Under a correct causal mask, logits at positions `0..15` must be byte-identical between the two inputs (future tokens cannot influence earlier positions). Under a broken mask, they will diverge.
+ +Observed result on `shared_ar_cdm.npz`: + +``` +Prefix positions 0..15 (should be identical under causal): + max |logits_A - logits_B| = 0.000000e+00 + mean |logits_A - logits_B| = 0.000000e+00 +Suffix positions 16..31 (should differ, as inputs differ): + max |logits_A - logits_B| = 1.82e+01 +``` + +Prefix divergence is exactly zero (not merely below precision) and suffix divergence confirms the model is not constant. The `is_causal=True` path does not leak future tokens. The test script is included as `leakage_test.py`; reviewers can reproduce it on any Apple Silicon machine with `mlx >= 0.31` in under 30 seconds. The same SDPA call path (`F.scaled_dot_product_attention(q, k, v, is_causal=is_causal, scale=...)` with no additional `attn_mask` argument) is used by both `train_cdm.py` (training) and `eval_cf_dualbrain_cuda.py` (11L evaluation), so the integrity of the 5L test carries to the 11L numbers. + +### 2.4 Why Not `val_var_bpb` + +The MDLM line uses `val_var_bpb`, a variational upper bound on NLL under the discrete absorbing-mask Markov chain. I deliberately do not report this metric for three reasons: + +1. **Training-eval mismatch.** `val_var_bpb` assumes absorbing-mask training. This submission uses uniform-noise replacement (D3PM-uniform). Applying absorbing ELBO to a uniform-noise model is not a valid bound. +2. **No realizable decoder at 256 steps.** `val_var_bpb` requires 256–512 forward passes. No practical compression procedure runs at that cost; the metric measures tightness, not decoder-ability. +3. **Apples-to-oranges risk.** Mixing CF BPB with `val_var_bpb` in one table would compare different quantities. + +I cite `val_var_bpb` as a valid metric for its research line. + +### 2.5 Related Prior Work + +The core idea of this submission — one set of weights trained under multiple attention-mask regimes and used in more than one mode at evaluation — is not new. 
I do not claim to have invented joint causal + bidirectional training or iterative mask-and-refill decoding. The contribution here is the specific combination (uniform-noise D3PM denoising jointly trained with a causal AR loss at 0.3 : 1 weight, evaluated via a two-pass Mask-Predict-style decoder) and its empirical behavior in the parameter-golf regime. + +Relevant prior work that readers should consult: + +- **UniLM** — Dong et al. 2019, *"Unified Language Model Pre-training for Natural Language Understanding and Generation"* (arXiv:1905.03197). The closest architectural precedent: one transformer trained with three attention-mask regimes (unidirectional, bidirectional, seq2seq) on the same weights. My training is a simpler variant with only two mask regimes (causal + bidirectional) and a D3PM-uniform denoising objective in place of UniLM's masked-LM objective. +- **GLM** — Du et al. 2022, *"GLM: General Language Model Pretraining with Autoregressive Blank Infilling"* (arXiv:2103.10360). Unifies understanding and generation via autoregressive blank infilling on spans. Directly motivates the "one model for generate + edit" framing in §5. +- **FIM / Fill-in-the-Middle** — Bavarian et al. 2022, *"Efficient Training of Language Models to Fill in the Middle"* (arXiv:2207.14255). The production approach used by Codex/Copilot: reorder training data as `[prefix, suffix, middle]` and train a standard causal LM. This is the main baseline any future retrofit experiment (see §6) would compare against. +- **D3PM** — Austin et al. 2021, *"Structured Denoising Diffusion Probabilistic Models in Discrete State-Spaces"* (arXiv:2107.03006). The source of the uniform-noise corruption used in §2.1 denoising loss. The training here uses the D3PM-uniform noise kernel (random token replacement), not the absorbing-mask kernel used by the MDLM line. +- **Mask-Predict** — Ghazvininejad et al. 2019, *"Mask-Predict: Parallel Decoding of Conditional Masked Language Models"* (arXiv:1904.09324). 
Iterative parallel decoding with round-based refinement over masked positions. My two-pass Coarse-to-Fine decoder in §2.2 is a stride-structured variant with a causal AR skeleton pass replacing the initial Mask-Predict round. +- **MDLM** — Sahoo et al. 2024, *"Simple and Effective Masked Diffusion Language Models"* (arXiv:2406.07524). The reference point for §2.4 and the dominant paradigm in the parameter-golf text-diffusion cluster (see §8). + +Additional references on joint causal + bidirectional training that are relevant but not directly adapted here: **XLNet** (Yang et al. 2019, permutation LM), **T5** (Raffel et al. 2020, span-corruption denoising), **BART** (Lewis et al. 2020, denoising autoencoder), **CM3** (Aghajanyan et al. 2022, causal-masked joint training). + +--- + +## 3. Main Results + +### 3.1 11L Multi-Seed Verification — Primary Evidence (1×H100 SXM, 540 s each, final-checkpoint eval) + +Six independent training runs at 11L d=512 v4096, same unified script, same v4096 data, same 540 s training budget. **Five fresh seeds for the shared model** (`SEED ∈ {1337, 42, 2024, 7, 100}`, joint AR + 0.3 · masked-denoising) and **one fresh seed for the matched causal-only control** (`SEED=1337`, w=0.0). + +**Why the control uses a single training seed (and why this is a known limit, not a justified equivalence).** This round of compute was concentrated on the shared side because the shared CF Total is the headline quantity, and because the shared side is where the visible empirical variance lives. The control's training-stochasticity term is **not directly measured** in this round (n=1 fresh control seed); a second fresh control seed in §6.0 is the only way to actually estimate it. 
+
+For *intuitive calibration only* — not as a formal upper bound or as input to any significance computation — the shared side's 5-seed training final `val_bpb` std is ≈ 0.0019 (`{1.4387, 1.4393, 1.4416, 1.4422, 1.4430}`), and a causal-only optimization is a strictly simpler training objective (one loss component vs two, no bidirectional forward pass), which makes it *plausible* (but not proven) that the control's training std is in the same order of magnitude or smaller. This is a working assumption used only to motivate why a one-seed control round was a reasonable allocation of compute given the budget constraint, not to claim that the delta CI has been fully bounded. The single control's CF-eval Pure-AR (1.3214) matches its own training final `val_bpb` (1.3146) to within 0.007 BPB, which is consistent with that intuition but is also a single data point and proves nothing on its own.
+
+**Bottom line for this section: the unmeasured control variance is the largest remaining methodological limit of v3.5. §6.0 closes it.**
+
+**Methodology fix from v3.3.** All evaluations in this section are run on `step_final.pt`, the actual last training step, rather than the last `val_every`-aligned intermediate checkpoint that v3.3 was using. The training script (`train_cdm.py`) now unconditionally writes a `step_final.pt` at end of training; the eval script (`eval_cf_ablation.py`) consumes it directly. The intermediate-checkpoint difference is asymmetrically large for the shared model (which trains slower per step due to the bidirectional pass and therefore reaches fewer total steps) and was the dominant noise source in v3.3. After this fix, the CF Total seed-to-seed sample standard deviation drops from 0.022 (v3.3, intermediate checkpoint) to **0.0051** (v3.5, final checkpoint; full-precision value from the 5 logged `cf_total` numbers, rounded) — a 4.3× reduction in standard deviation, i.e. ≈18× in variance. 
+ +| Run | Seed | Training final val_bpb | CF eval Pure-AR | CF eval CF Total | +|---|---|---|---|---| +| **11L_w0 (control)** | 1337 | 1.3146 | **1.3214** | 2.4538 (invalid — bidirectional mode was never trained) | +| **11L_w0.3 (shared)** | 1337 | 1.4387 | 1.4428 | **1.2924** ⭐ best | +| **11L_w0.3 (shared)** | 42 | 1.4393 | 1.4425 | **1.3027** | +| **11L_w0.3 (shared)** | 2024 | 1.4430 | 1.4459 | **1.3060** | +| **11L_w0.3 (shared)** | 7 | 1.4416 | 1.4446 | **1.3025** | +| **11L_w0.3 (shared)** | 100 | 1.4422 | 1.4456 | **1.3007** | + +**11L_w0.3 5-seed CF Total stats:** mean **1.3009**, sample std **0.0051** (≈ 0.005), min 1.2924, max 1.3060. + +**Headline delta computation.** The primary headline of this submission is the 5-seed mean delta. The single-seed best is reported as a post-hoc reference for the deployable artifact, not as the effect size. + +| Quantity | Value | Role | +|---|---|---| +| **5-seed mean delta** (primary headline) | `1.3009 − 1.3214 = ` **`−0.0205 BPB`** (shared CF mean − single-seed control) | method-level effect size | +| Single-seed best (post-hoc reference) | `1.2924 − 1.3214 = ` `−0.0290 BPB` (`SEED=1337`, best of the 5 trained seeds) | the model file one would actually ship | + +**Statistical caveat.** Both deltas use the same single-seed final-checkpoint w0 control measurement, so the control side of the delta carries **no within-experiment variance estimate at all**. The shared side has a 5-seed CF Total sample std of 0.0051 → SE 0.0023, and a 5-seed training final `val_bpb` std of 0.0019 (both directly computable from the table above). I do not run any significance test or compute any joint CI here, because doing so would require either (a) a measured control std, which this round does not have, or (b) treating the shared side's 0.0019 as a control upper bound, which is at best a working intuition (causal-only training has fewer loss components and no bidirectional forward pass) and at worst a hand-wave — not a formal bound. 
The intended reading of §3.1 is therefore the *unweighted* observation: "the 5-seed shared CF mean lands ~0.02 BPB below the single control point at the same training protocol and the same eval sample, and the visible shared-side variance is much smaller than that gap". Whether this gap survives a directly measured control variance is what §6.0 tests; until then, no significance claim is made. + +The single-seed best (−0.0290) is the result of post-hoc selection over 5 seeds and is therefore upward-biased as an effect-size estimator; it is reported only because that specific `step_final.pt` is the file that one would actually deploy as the §10 submission artifact, and reviewers should be able to reconcile the deployable file with the §3.1 statistics. + +Notes on the table: +- All BPB numbers are measured on the same FineWeb v4096 validation shard with the same sampling protocol (N=500 sequences × seq_len=1024, eval `--seed 42` fixed across all runs). Within each row, Pure-AR and CF are on the same sequences. +- The "invalid" entry for the control row is informative: it is the result of running the `is_causal=False` pass on a model that was never trained with a bidirectional objective. The bidirectional mode is untrained weights, so it produces a nearly uniform distribution, and CF Total explodes to ≈ 2.45 BPB. This **validates** that the CF gain in the shared rows is not a metric artifact — if it were, the control would show the same CF reduction. +- The shared model's `Pure-AR` column shows the cost of joint training: at `w=0.3`, the shared model is ≈ +0.12 BPB worse on causal-only generation than the dedicated control. The CF decoder more than recovers this gap, but it does not erase it — the shared model is **not** a free lunch on Pure-AR; the gain is conditional on running the CF decoder at inference time. This is the test-time-compute framing developed in §5. 
+ +### 3.2 Original 6-Run Scaling Sweep at 5L + 11L (single seed, intermediate checkpoint, retained as cross-scale evidence) + +The original scaling sweep that motivated the multi-seed verification in §3.1. Six independent training runs at the same 1×H100 540 s budget, varying model size (5L d=256 vs 11L d=512) and CDM loss weight (0.0, 0.3, 1.0), all on `SEED=1337`, all evaluated at the last `val_every=500`-aligned intermediate checkpoint (`step_5000.pt` for w=0 / w=1.0, `step_1500.pt` for w=0.3). These numbers are subsumed by §3.1 for the 11L row (which is now multi-seed and final-checkpoint), but the 5L row has not yet been multi-seed verified and is retained here as the only cross-scale evidence. + +| Run | Params | Training objective | Pure-AR BPB (single-mode) | CF BPB (two-pass decoder) | +|---|---|---|---|---| +| **5L_w0 (control)** | 4.3 M | causal-only | **1.4479** | 2.4371 (invalid) | +| 5L_w0.3 | 4.3 M | causal + 0.3 · masked denoising | 1.5231 | **1.4009** | +| 5L_w1.0 | 4.3 M | causal + 1.0 · masked denoising | 1.5841 | **1.3939** | +| 11L_w0 (control, intermediate ckpt) | 28.4 M | causal-only | 1.3574 | 2.3947 (invalid) — *superseded by §3.1* | +| 11L_w0.3 (intermediate ckpt) | 28.4 M | causal + 0.3 · masked denoising | 1.4708 | 1.3301 — *superseded by §3.1 (best seed: 1.2924)* | +| 11L_w1.0 (intermediate ckpt) | 28.4 M | causal + 1.0 · masked denoising | 1.5414 | 1.3527 | + +The 5L row gives a single-seed delta of `1.3939 − 1.4479 = −0.054 BPB` (5L_w1.0 CF vs 5L_w0 Pure-AR). This is **not yet multi-seed verified** and should be treated as a single-seed observation pending the §6.0 follow-up. The 11L numbers in this table are deprecated in favour of §3.1. + +### 3.3 5L d=256 SP1024 — 8-Config CF Sweep (M1 Max, free) + +Before the 6-run H100 ablation, I ran a free pre-flight sweep on M1 Max using an earlier 5L SP1024 shared checkpoint (`shared_ar_cdm.npz`, 4.2 M params) to locate the CF sweet spot across stride × rounds. 
This is the sweep that convinced me stride=2, rounds=2 is worth spending H100 compute to test. The checkpoint here is SP1024 (not v4096), so the absolute BPB values differ from §3.1 due to the tokenizer — but the *shape* of the sweep is the signal.
+
+| Config | Pass-1 (causal) NLL | Pass-2 (denoise) NLL | **CF Total BPB** | vs Pure-AR 2.5386 |
+|---|---|---|---|---|
+| Pure AR baseline (same model, single-mode) | — | — | **2.5386** | baseline |
+| stride=2, rounds=1 | 1.2615 | 1.2807 | 2.5422 | +0.14% |
+| **stride=2, rounds=2** | **1.2688** | **1.0598** | **2.3285** | **−8.28%** |
+| stride=3, rounds=1 | 0.8663 | 2.1996 | 3.0659 | +20.77% |
+| stride=3, rounds=2 | 0.8540 | 1.6754 | 2.5294 | −0.36% |
+| stride=3, rounds=3 | 0.8527 | 1.6052 | 2.4578 | −3.18% |
+| stride=4, rounds=1 | 0.6370 | 2.6794 | 3.3164 | +30.64% |
+| stride=4, rounds=2 | 0.6404 | 2.0915 | 2.7319 | +7.61% |
+| stride=4, rounds=3 | 0.6436 | 1.9617 | 2.6053 | +2.63% |
+
+Sweet spot: stride=2, rounds=2 (50/50 causal–bidirectional split with two denoising refinement rounds). This is the only CF configuration used in §3.1. At stride ≤ 3, every `rounds ≥ 2` configuration matches or beats pure-AR; at stride=4, even rounds=3 remains +2.63% worse than pure-AR, though each additional round monotonically closes the gap. Wider-stride single-round configurations are catastrophic because the bidirectional pass has too much to fill from too little context in a single pass.
+
+### 3.4 Earlier 5L SP1024 Headline (1 line, for continuity)
+
+Before running the §3.1 ablation, the same (stride=2, rounds=2) CF configuration was measured on the earlier SP1024 5L shared checkpoint (`shared_ar_cdm.npz`) at N=2000 × seq_len=256 on M1 Max: Pure-AR 2.5412, CF Total **2.3382**, Δ **−7.99%** (stable across N=500 → N=2000). Kept here only to show that the §3.3 sweet spot holds at larger sample sizes on the pre-flight checkpoint. Not the primary claim. 
+ +### 3.5 CDM-Weight Sensitivity and Scale Behaviour + +From the §3.1 table, two monotonic patterns emerge that are informative about where this paradigm works and where it does not: + +**The causal-mode tax grows with CDM weight.** As the CDM loss weight increases from 0 → 0.3 → 1.0, the shared model's Pure-AR BPB gets worse in a near-linear way. The table below uses the **final-checkpoint** measurements from §3.1 for 11L (1 control seed and 5-seed mean for `w=0.3`) and the §3.2 single-seed scaling sweep for 5L (the only available 5L source until §6.0): + +| Scale | Source | w=0 Pure AR | w=0.3 Pure AR | w=1.0 Pure AR | Tax at w=0.3 | Tax at w=1.0 | +|---|---|---|---|---|---|---| +| 5L | §3.2 (intermediate ckpt, single seed) | 1.4479 | 1.5231 | 1.5841 | **+0.075** | **+0.136** | +| 11L | §3.1 (final ckpt, 1 control + 5-seed mean) | **1.3214** | **1.4443** | — | **+0.123** | — | + +At 11L the tax at `w=0.3` is **larger in absolute terms than at 5L** (0.123 vs 0.075). This is a non-trivial finding: naively one might expect the extra capacity of 11L to absorb the multi-task objective more gracefully, but the opposite happens in this regime (the causal head gives up more ground at 11L). I do not yet know whether this trend continues at 100 M+ or starts to reverse; that is the primary open question for §6. + +(A `w=1.0` 11L row is not given here because the §3.1 verification did not retrain `w=1.0`. The intermediate-checkpoint single-seed `w=1.0` value from §3.2 is preserved as a legacy reference in **Appendix A**.) 
+ +**The CF two-pass decoder recovers the tax and then some.** Even though the shared model is worse at pure causal scoring, running the two-pass CF decoder on the same model gets it below the control: + +| Scale | Control CF-eval Pure-AR | Shared CF (5-seed mean / post-hoc best) | CF advantage (mean / post-hoc best) | Verification status | +|---|---|---|---|---| +| 5L | 1.4479 | 1.3939 (w=1.0, 1 seed) | **−0.054** (single seed) | single-seed (§6.0 follow-up) | +| 11L (final ckpt) | **1.3214** (1 seed) | **1.3009 ± 0.005** (5-seed mean, w=0.3) / 1.2924 (post-hoc best `SEED=1337`) | **−0.0205 mean** / −0.0290 post-hoc best | 5 fresh shared seeds + 1 fresh control seed (§3.1) | + +At 5L the best CF configuration is w=1.0 (stronger bidirectional signal); at 11L it is w=0.3 (where the model has enough capacity that a weak bidirectional signal is enough). At both scales the shared-CF configuration scores below the matched causal-only control. The 11L row is the multi-seed final-checkpoint version from §3.1; the 5L row is still single-seed and is the highest-priority remaining verification (§6.0). The "post-hoc best" column at 11L is upward-biased (best of 5 seeds) and is reported only as the deployable-artifact reference, not as an effect-size estimate. + +### 3.6 Earlier M1 Max Pre-Flight (3-eval-seed Subsample Check on an Earlier Checkpoint) + +*This section is retained for historical context only. The §3.1 multi-seed verification at 1×H100 with 5 fresh training seeds at the final-checkpoint state supersedes it as evidence for the headline claim.* + +Prior to the 6-run ablation, I ran a 3-eval-seed subsample check on an earlier 11L 8×H100 checkpoint (`11L_shared_cdm_bf16.pt`, no longer used for primary comparison). The 3-eval-seed mean CF BPB was 1.3083 ± 0.0047 at seq_len=1024, with N=500 per seed. **What the three seeds randomize**: the validation subsample (which 500 sequences are picked) and the random fill in pass-2 denoising — *not* training stochasticity. 
The Pure-AR std of 0.0008 BPB across these eval seeds reflects validation subsample variance only, not model variance. + +The §3.1 result is methodologically stronger because it varies the **training seed**, runs on **fresh trainings** with the unified script, and evaluates at the **true final checkpoint** rather than an intermediate save. + +| Eval Seed | N | Pure AR | CF Total | Δ | +|---|---|---|---|---| +| 42 | 500 | 1.4422 | 1.3021 | −9.71% | +| 43 | 500 | 1.4438 | 1.3134 | −9.03% | +| 44 | 500 | 1.4441 | 1.3095 | −9.32% | +| **mean** | **1 500** | **1.4434 ± 0.0008** | **1.3083 ± 0.0047** | **−9.35% ± 0.28%** | + +--- + +## 4. Honest Limitations + +This PR measures a BPB improvement on the standard Parameter Golf metric (cross-entropy per byte of validation text). It does **not** measure: + +- **Comparison to the 8×H100 leaderboard at matched training compute.** The 1×H100 540 s runs see approximately 1/8 the tokens of an 8×H100 540 s run. The §3.1 11L_w0 control at training val_bpb 1.3146 (CF-eval Pure-AR 1.3214) is therefore not directly comparable to the 8×H100 leaderboard entries (top 1 = 1.1147, baseline = 1.2244). The relevant comparison in this PR is always the matched control on the same hardware, not the leaderboard. +- **Actual fill-in-middle generation quality.** Parameter Golf evaluates BPB, not generation, because 28 M-parameter models at ~270 M training tokens cannot produce coherent text regardless of architecture (GPT-2 small at 124 M / 10 B tokens is the rough coherence threshold in the literature). I ran a qualitative greedy-fill test on all six models as a sanity check (not as a claim): exact-match rates were 0–4.7% across all configurations, including the controls — consistent with the scale regime. This PR is about BPB, which *is* the Parameter Golf metric. +- **Comparison to dedicated fill-in-middle baselines** (CodeLlama-FIM, StarCoder-FIM). 
Training did not target code, so FIM code-benchmarks are not applicable without a retrofit experiment. This is Next Step #2 in §6. +- **Retrofit to pretrained LLMs.** All training here is from scratch. Whether the same shared-weight paradigm can be added to an existing pretrained causal LM via LoRA — the realistic production path for any shipping product — is the largest open question, listed as Next Step #1 in §6. +- **Share-ratio grid beyond three points.** I tested weight ∈ {0, 0.3, 1.0}. A finer grid might reveal a different optimum. +- **Multi-seed verification at 11L: partially resolved in §3.1** (5 fresh training seeds for the shared model, **1** fresh training seed for the matched control, all at the true final checkpoint, shared 5-seed CF Total sample std 0.0051, i.e. ≈0.005). The control side still has only one fresh seed in this round; a strict significance test is not run (see §3.1 statistical caveat). A second control seed is the smallest remaining gap. **Multi-seed verification at 5L: not yet done** — the −0.054 BPB gap at 5L in §3.2 is still single-seed and is the highest-priority remaining experiment (§6.0). + +--- + +## 5. Why This Might Matter — Downstream Utility Under Test-Time Compute + +The §3.1 effect is modest in absolute terms (−0.029 BPB best seed, −0.0205 BPB 5-seed mean at 11L; −0.054 BPB single-seed at 5L). What I find interesting is not the magnitude but the factorization: the matched ablation separates two capabilities a production LLM would typically want to optimize independently: + +1. **Causal-only next-token prediction**, which is how every shipping LLM (ChatGPT, Claude, GPT-4, Codex, Copilot) is primarily measured. +2. **Bidirectional conditioning** on both left and right context, which is today served either by a *second* specialized model (BERT, MDLM), by a training-time hack (FIM special tokens in Bavarian et al. 2022 / Rozière et al. 2023), or by retrieve-and-rewrite pipelines. 
+ +The matched ablation is consistent with the reading that **a single set of weights, at matched compute, can expose both capabilities when evaluated under the two-pass CF decoder**. This fits naturally into the recent test-time-compute framing (Welleck 2024, speculative decoding, Mask-Predict Ghazvininejad 2019): the CF decoder is an inference-time compute knob that trades extra forward passes for lower BPB, and the shared-weight training makes those extra passes useful instead of noise. + +**Effect-size context.** The 5-seed shared CF mean (1.3009) lands ~0.02 BPB below the single-seed control point (1.3214). The shared side's empirical std on the 5 fresh seeds is 0.005 (CF Total) and 0.0019 (training `val_bpb`); the control side has no measured variance term in this round. No significance test is computed here (see §3.1 statistical caveat). The absolute effect is small. The relevant practical question is **whether it grows, shrinks, or inverts when the model and training budget are scaled up**, which neither §3.1 nor §3.2 can answer — that is the §6.1 / §6.2 work, gated on §6.0. + +**What this is not.** This is not a claim that a 28 M parameter model can generate coherent text, or that these 540 s runs are ready for any production use. Models at this scale cannot generate coherent English regardless of architecture (GPT-2 small at 124 M / 10 B tokens is the rough coherence threshold, and these models are 5× smaller and 30× less trained). The Parameter Golf competition accepts this — BPB is the metric precisely because coherence is out of reach at these scales. The claim here is scoped to BPB under a specific decoder with a specific control, nothing more. + +--- + +## 6. What Might Work With More Compute + +Honest speculation. Each item below is a concrete experiment that would extend or close an open question from §3 — ordered by what most strongly constrains the conclusion of this submission. 
**§6.0 is the only follow-up that is gating; everything else is conditional on it.** + +### 6.0 5L multi-seed verification (highest-priority remaining experiment) + +The 11L row of §3.1 is now multi-seed verified at the true final checkpoint. The 5L row in §3.2 is **not**. The −0.054 BPB single-seed result at 5L is a stronger absolute effect than the verified 11L 5-seed mean (−0.0205), but it has the same risk profile that the 11L row had before §3.1: a single training seed at the last `val_every`-aligned intermediate checkpoint, where the training-stochasticity asymmetry between control and shared could plausibly manufacture a 0.05 BPB gap by chance. + +**Concretely**: 5 fresh training seeds for `5L_w1.0` (the winner) + 1 fresh training seed for `5L_w0` (control), all evaluated at `step_final.pt` with the new `eval_cf_ablation.py` protocol. 5L training is much cheaper than 11L (~3 min per run on 1×H100 SXM, or runnable on consumer GPUs at similar speed). Total compute estimate: ~30 min wall time, ~$1.5 self-funded on 1×H100 SXM, or essentially free on M1 Max in roughly the same wall time. This is the next experiment I will run. + +### 6.1 Retrofit onto a pretrained causal LLM via LoRA (the production path) + +The experiment that would most directly test whether this paradigm survives outside the Parameter Golf toy regime is a **LoRA-style retrofit of a pretrained causal LLM** (e.g. Qwen 3.5 0.8 B, which I already have locally). Rather than training from scratch at 28 M parameters, take a model that already generates coherent text and add a small LoRA adapter to expose a bidirectional forward mode, trained with the same joint AR + D3PM objective. No shipping LLM trains from scratch at 28 M parameters, so this is the setting where any downstream claim has to be tested. 
An initial result on Qwen 0.8 B fits in roughly 10–15 H100-hours and would tell, *within one pod session*, whether the shared-weight + CF-decoder pattern carries to a model that is actually coherent at inference. This is the single most compute-efficient downstream test and it is Next Step #1.
+
+### 6.2 Full-budget 8×H100 reproduction of the 11L ablation
+
+Run the exact §3.1 ablation at 8×H100 540 s (the production Parameter Golf budget) to test whether the §3.1 improvement (−0.0205 BPB 5-seed mean) persists, narrows, or inverts when the training-token budget grows ~8×. I do not have a confident extrapolation to offer — the Pure-AR tax in §3.5 already grows with scale in a direction that works against the shared model, and this experiment is how I find out whether that trend continues or reverses at full compute. This is Next Step #2.
+
+### 6.3 Share-ratio grid search at 11L
+
+The 6 runs used weight ∈ {0, 0.3, 1.0}. At 11L, w=0.3 gave the best CF BPB; at 5L, w=1.0 did. A fine grid (0.1, 0.15, 0.2, 0.3, 0.5, 0.7, 1.0) at 11L would locate the actual optimum and tell whether the share-ratio optimum scales with model size. This is a cheap follow-up to §6.2 — roughly 7 additional 1×H100 runs.
+
+### 6.4 Finer scale sweep for the share-ratio → BPB curve
+
+I have two architectural data points (5L 4.2 M and 11L 28.4 M). Adding 7L d=384, 9L d=448, and 13L d=640 would give a scaling curve for both the Pure-AR tax (which appears to grow with scale in this data) and the net CF advantage (which, in the data so far — single-seed 5L vs multi-seed 11L — appears to *shrink* with scale: −0.054 → −0.0205). A simple power-law fit would let me predict the crossover scale — the model size at which the CF gain exceeds the Pure-AR tax by a margin that makes the extra compute worth it.
+
+### 6.5 Absorbing-mask MDLM noise schedule for the bidirectional pass
+
+I used uniform-noise D3PM (random vocabulary replacement). The MDLM cluster (#820, #1106, #1241) uses absorbing-mask denoising, which the literature suggests gives stronger bidirectional representations. 
Swapping the noise schedule is a one-line training change; a matched ablation would tell whether the gain would be larger under the standard MDLM noise, at the cost of some comparison legibility.
+
+---
+
+## 7. Retrodiction — A Negative Result at Production Scale
+
+> **Scope note.** The runs in this section are a **different training line** from the Shared AR + Denoising model used in §3. They are a 1×H100 A/B sweep of retrodiction modes on a pure AR stack (no CDM auxiliary loss). The "Pure AR" numbers in this table are therefore *not comparable* to the "Pure AR" column of §3.3, which measures the Shared AR + Denoising checkpoint in single-mode causal. Different models, different training configurations. See §7.3 for an explicit side-by-side.
+
+This submission also documents a line of work I call **retrodiction** — a reversed-sequence auxiliary loss added to the standard causal AR loss. The operational definition is simply:
+
+```python
+loss = causal_lm_loss(model(x), x) + alpha * causal_lm_loss(model(x.flip(1)), x.flip(1))
+```
+
+I report it as a negative result at production scale. The compact story:
+
+### 7.1 Early-Training Signal on 5L / M1 Max
+
+At small scale and short token budgets, retrodiction gave up to −3.6% BPB at step 200/500, direction consistent with the motivation.
+
+### 7.2 Production-Stack A/B on 1×H100
+
+Five independent training runs, same architecture (11L d=512 v4096, XSA-4, BigramHash), same 540 s budget, the same training seed, **pure causal AR stack with no CDM auxiliary loss** — only the retrodiction mode varied:
+
+| Test | Retro mode | Final val_bpb |
+|---|---|---|
+| **D** | **OFF** | **1.3401** (best) |
+| C | partial 15% | 1.3594 |
+| B | merged late 80/20 | 1.3695 |
+| E | alternating 90/10 | 1.3616 |
+| A | alternating 50/50 | 1.4109 |
+
+Pure contrast (C vs D): retrodiction is a **+0.019 BPB tax** at production scale, not a gain. 
+ +### 7.3 Consolidated 11L Pure-AR Numbers + +The §3.1 ablation and the §7.2 retrodiction sweep each produce their own 11L Pure-AR BPB on the same nominal 11L d=512 v4096 architecture but with different training stacks. They are listed side by side here, restricted to a single metric kind (training final `val_bpb`) for direct comparability: + +| Source | Training objective | Retrodiction | Training stack | Pure AR final val_bpb | +|---|---|---|---|---| +| **§3.1** `11L_w0` (control, 1 seed, final ckpt) | Pure AR only | off | unified `train_cdm.py`, `--xsa_last_n=4` | **1.3146** | +| **§3.1** `11L_w0.3` (5-seed mean, final ckpt) | Joint AR + 0.3·denoising | off | unified `train_cdm.py`, `--xsa_last_n=4` | **1.4410** | +| §7.2 Test D (1 seed, earlier stack) | Pure AR only | off | earlier XSA / BigramHash configuration | 1.3401 | +| §7.2 Test C (1 seed, earlier stack) | Pure AR only | partial 15 % | earlier XSA / BigramHash configuration | 1.3594 | + +The §3.1 11L_w0 (1.3146) and §7.2 Test D (1.3401) are both single-seed pure-AR 1×H100 540 s runs at the same nominal architecture but on **different training stacks**. The 0.026 BPB difference reflects training-stack drift, not retrodiction. For the primary claim of this submission, the relevant comparison is always §3.1 11L_w0 vs §3.1 11L_w0.3 CF (both measured with the *exact* same script, same data pipeline, same eval sampling, all at the final checkpoint). The §7 retrodiction sweep is a separate, older line of work included for completeness. + +**Interpretation (hypothesis).** At 5L on short budgets, the forward loss signal may be weak enough that the reversed loss provides complementary gradient. At 11L on production budgets, I hypothesize that the forward signal is strong enough to dominate and the reversed loss competes for updates rather than augmenting them. I do not have a mechanistic proof of this interpretation, and I have not found a useful parametrization of retrodiction for the parameter-golf regime. 
+ +**Practical recommendation:** retrodiction is a tax on the production stack and should not be used. The matched-compute 6-run ablation in §3.1 was run *without* retrodiction for this reason. + +--- + +## 8. Position in the Text-Diffusion Cluster + +Snapshot of the text-diffusion cluster as of 2026-04-09 (reproducible via `gh pr list --repo openai/parameter-golf --search "diffusion" --state open --limit 50`): + +- Bidirectional masked diffusion + discrete absorbing ELBO (`val_var_bpb`): #820 mtybadger (convention-setting), #1053, #1106 agalimova, #1241 aiejvn, #1403 +- Causal MDLM as AR regularizer (eval in causal mode): #1119 gowtham0992 +- Hybrid AR + MDLM mixed training with bidirectional head discarded at eval: #1194 +- AR with diffusion-inspired auxiliary noise, evaluated as pure AR: #904 +- Prefix-conditioned discrete diffusion: #905 +- Hybrid sparse diffusion: #1198 +- **This PR:** shared-weight joint causal + masked-denoising training, evaluated via a two-pass Coarse-to-Fine decoder on BPB, with a **matched causal-only control** at the same compute. This is, to my knowledge, the first submission in the text-diffusion cluster to include an explicit matched-compute control ablation. + +This approach differs from the cluster in that both modes are actively used at evaluation on the same weights, rather than the bidirectional mode being used only at training time or evaluated separately. I do not claim this is a strict improvement over the MDLM line — it is a different question evaluated on a different metric. Direct numerical comparison across metrics (val_var_bpb / val_bpb / CF BPB) is not meaningful because they measure different quantities. See §2.4. + +--- + +## 9. 
Hardware and Reproducibility + +All training and evaluation artifacts are published on Hugging Face: + +- **`akaiii/meadow-golf-checkpoints`** — all 6 ablation checkpoints (`5L_w0.npz`, `5L_w03.npz`, `5L_w1.npz`, `11L_w0.npz`, `11L_w03.npz`, `11L_w1.npz`), 6 training logs, 6 CF eval logs, the unified training script (`train_cdm.py` + `train_ablation_runner.py`), and the CF eval scripts (`eval_cf_dualbrain.py`, `eval_cf_dualbrain_cuda.py`, `eval_cf_ablation.py`). Directory layout matches the `ablation_results/` folder in this PR. +- **`akaiii/meadow-golf-v4096`** — `bpe_v4096.model` tokenizer and the v4096 retokenized FineWeb validation + training shards used for every training run in §3.1. + +### 9.1 Reproduction of the §3.1 multi-seed verification (the v3.5 headline) — ~70 min on 1×H100 SXM, ~$3.50 + +The §3.1 5-seed shared verification + 1-seed control is the headline of this submission. Both orchestration scripts (`run_p5.sh`, `run_phase_b.sh`) and all training / CF eval logs from the actual run are committed to `seeds_run/` in this folder for reviewer-side spot checking. The scripts rely on the v3.5 copies of `train_cdm.py`, `train_ablation_runner.py` (with `--seed` support), and `eval_cf_ablation.py`; the reproduction commands below reproduce them from a clean H100 pod: + +```bash +pip install --break-system-packages torch numpy sentencepiece huggingface_hub + +git clone https://github.com/akaiHuang/meadow-golf +cd meadow-golf/experiments/2026-04-09_matched_ablation + +hf download akaiii/meadow-golf-v4096 --repo-type dataset --local-dir /workspace/gv4096 + +# 5 fresh shared seeds (11L_w0.3 × {1337, 42, 2024, 7, 100}), final-checkpoint save +SCRIPT_DIR=. 
\ + DATA_DIR=/workspace/gv4096/data \ + TOKENIZER=/workspace/gv4096/bpe_v4096.model \ + OUT_DIR=/workspace/out \ + CKPT_DIR=/workspace/ckpt \ + LOG_DIR=/workspace/logs \ + bash run_p5.sh + +# 1 fresh control seed (11L_w0 SEED=1337), final-checkpoint save +bash run_phase_b.sh +``` + +Both `run_p5.sh` and `run_phase_b.sh` invoke the unified `train_cdm.py` (which now writes a `step_final.pt` checkpoint at the end of training, addressing the v3.3 intermediate-checkpoint issue) via `train_ablation_runner.py` (`--seed` patches the module-level `SEED` constant and emits per-seed patched modules), then run `eval_cf_ablation.py` directly on the `step_final.pt` saves. Final BPB numbers should match the §3.1 table within bf16 numerical noise on the same `--seed 42` eval sample. Total wall time: ~70 min on a single 1×H100 SXM; total self-funded compute: **$3.50** at $2.99/hr. + +Reviewer spot check without rerunning anything: every number in §3.1 is grep-able from `seeds_run/logs/*.log` and `seeds_run/eval/*.log` already present in this folder. See `seeds_run/README.md` for the file inventory. + +### 9.2 Reproduction of the §3.2 6-run scaling sweep (cross-scale evidence, ~90 min on 1×H100 SXM, ~$3.93) + +This is the *original* 6-run ablation that v3.3 reported as the headline; in v3.5 it is retained only as the §3.2 cross-scale evidence (single seed each, intermediate checkpoint). The 5L row of §3.2 is the only available 5L data until §6.0 follow-up. The 11L rows are superseded by §3.1 / Appendix A but included for traceability. 
+ +```bash +pip install torch numpy sentencepiece huggingface_hub + +hf download akaiii/meadow-golf-checkpoints --repo-type dataset --local-dir ./gcp +hf download akaiii/meadow-golf-v4096 --repo-type dataset --local-dir ./gv4096 + +export PYTHONPATH="./gcp:${PYTHONPATH}" +mkdir -p out ckpt logs eval + +# Train all 6 ablation models (6 × ~10 min wallclock) +for cfg in "5L 5 256 128 2 0.0" "5L 5 256 128 2 0.3" "5L 5 256 128 2 1.0" \ + "11L 11 512 128 4 0.0" "11L 11 512 128 4 0.3" "11L 11 512 128 4 1.0"; do + read tag L D BD X W <<< "$cfg" + python3 ./gcp/train_ablation_runner.py \ + --train_script ./gcp/train_cdm.py \ + --num_layers $L --model_dim $D --vocab_size 4096 \ + --bigram_dim $BD --xsa_last_n $X --cdm_weight $W \ + -- \ + --train_budget_secs 540 --steps 9999 \ + --data_dir ./gv4096/data --tokenizer_path ./gv4096/bpe_v4096.model \ + --save_path ./out/${tag}_w${W}.npz \ + --checkpoint_dir ./ckpt/${tag}_w${W} \ + > ./logs/${tag}_w${W}_train.log 2>&1 +done + +# Evaluate all 6 under CF (6 × ~5 min wallclock) +for cfg in "5L 5 256 128 2 0.0" "5L 5 256 128 2 0.3" "5L 5 256 128 2 1.0" \ + "11L 11 512 128 4 0.0" "11L 11 512 128 4 0.3" "11L 11 512 128 4 1.0"; do + read tag L D BD X W <<< "$cfg" + latest=$(ls ./ckpt/${tag}_w${W}/step_*.pt | sort -V | tail -1) + python3 ./gcp/eval_cf_ablation.py \ + --ckpt $latest \ + --train_module_path /tmp/train_cdm_patched_${L}L_w${W}.py \ + --num_layers $L --model_dim $D --vocab_size 4096 \ + --bigram_dim $BD --xsa_last_n $X \ + --n_seqs 500 --seq_len 1024 --stride 2 --rounds 2 --seed 42 \ + --data_dir ./gv4096/data --tokenizer_path ./gv4096/bpe_v4096.model \ + --log_path ./eval/${tag}_w${W}_cf.log +done +``` + +The patched training scripts `/tmp/train_cdm_patched_*.py` are created as a side effect of `train_ablation_runner.py` and are the model-class source for the matching `eval_cf_ablation.py` run. They are regenerated deterministically from `train_cdm.py` on each run. 
The 5L M1 Max pre-flight sweep uses `eval_cf_dualbrain.py` (MLX) against `shared_ar_cdm.npz`; it runs on any Apple Silicon Mac with `mlx >= 0.31` and reproduces the §3.3 table in under 4 minutes. + +Self-funded compute for the §3.2 6-run scaling sweep: **$3.93**. Combined with the §3.1 verification ($3.50), total self-funded for this submission: **~$7.43**. + +--- + +## 10. Compliance + +- [x] **5L submission artifacts ≤ 16 MB**: the competition submission unit is the int6+lzma compressed checkpoint (`5L_*_int6.lzma` = ~3.0 MB each), well under the 16 MB cap. The intermediate `5L_w0.npz` (17.2 MB BF16) is *not* a submission artifact; it is the working final-state save used by the eval script and is never submitted. +- [x] **11L submission artifacts** are non-record (trained on 1×H100, not matched to the 8×H100 production budget). The corresponding `11L_*_int6.lzma` files are ~18.7 MB each, *over* the 16 MB cap, which is why every 11L row in this submission is filed under the **non-record track** explicitly. They are never claimed as record candidates. +- [x] No validation data accessed during training +- [x] CF evaluation uses validation tokens only for scoring; no gradient updates +- [x] No network calls during evaluation +- [x] Hardware: original 6-run scaling sweep on a single 1×H100 SXM pod ($3.93). §3.1 multi-seed verification on a second 1×H100 SXM pod ($3.50). Total self-funded ~$7.43 across both sessions. 
+- [x] Causal-mask integrity verified via the leakage test in §2.3 (`leakage_test.py` included in this folder, max prefix-logit divergence 0.0) +- [x] CF evaluation is fully specified by `SEED`; the denoising pass is Monte Carlo averaged over `n_random=3` random fills for variance reduction on residual positions (not exact, but deterministic given the seed) +- [x] All reviewer-facing §3.1 logs and orchestration scripts are stored locally in `seeds_run/` and reproducible from the per-seed `train_ablation_runner.py` invocations recorded in `run_p5.sh` / `run_phase_b.sh` +- [x] The exact §3.1 `.npz` / `step_final.pt` state files are intentionally not committed to this PR folder (~1.3 GB total); their location and availability-on-request path are documented in `seeds_run/README.md` + +--- + +## 11. Acknowledgments + +- **PR #820 (@mtybadger)** for establishing `val_var_bpb` and the MDLM reference point for text diffusion in parameter-golf. My disagreement with the metric in §2.3 is intended as productive, not dismissive. +- **PR #363 (@evangelinehelsinki)** for the template of honest negative-result reporting that §7 follows, and for the `What Might Work With More Compute` section format. +- **PRs #1106, #1241** for showing that the MDLM line is an active research target worth contributing alternatives to. + +--- + +## 12. Related Closed Submission + +I earlier withdrew [PR #1442](https://github.com/openai/parameter-golf/pull/1442), a different stack combination submission targeting AR sliding BPB. A self-audit found methodological issues including a mismatch between the evaluation used and the compressed artifact. That line of work is not being pursued further; this PR represents my focused research effort going forward. + +--- + +## Appendix A. Legacy intermediate-checkpoint 11L numbers (superseded by §3.1) + +**This appendix exists solely for traceability with v3.3. 
None of these legacy intermediate-checkpoint numbers are used in any headline claim or main analysis in v3.5.** It is here so that a reader cross-referencing v3.3 against v3.5 can find the original v3.3 single-seed 6-run table 11L values in one place, paired with the §3.1 final-checkpoint measurements that supersede them. The main analytical sections (§3.5, §7.3) of v3.5 carry only final-checkpoint measurements. + +The original v3.2 6-run sweep at 11L (single seed `1337`, evaluated at the last `val_every`-aligned intermediate checkpoint, *not* `step_final.pt`): + +| Run | Pure AR (intermediate ckpt) | CF Total (intermediate ckpt) | Status in v3.5 | +|---|---|---|---| +| `11L_w0` (control) | 1.3574 | 2.3947 (invalid) | superseded by §3.1 final ckpt: **1.3214 / 2.4538** | +| `11L_w0.3` | 1.4708 | 1.3301 | superseded by §3.1 final ckpt 5-seed mean: **1.4443 / 1.3009** | +| `11L_w1.0` | 1.5414 | 1.3527 | not retrained at final ckpt; legacy value retained | + +**Why these numbers were higher / lower than the §3.1 final-checkpoint numbers.** The intermediate checkpoint (`step_5000.pt` for w=0/w=1.0, `step_1500.pt` for w=0.3) is several hundred training steps before the actual end of the 540 s training budget. The shared model is hit asymmetrically harder by this gap because it trains slower per step (the bidirectional pass roughly doubles forward FLOPs at this size), so its last `val_every`-aligned save is *relatively* less converged than the control's. 
Fixing this with `step_final.pt` (§2 methodology fix in v3.5) improves the shared CF score by ~0.03 BPB and also improves the control Pure-AR by ~0.03 BPB, which pushes the shared-vs-control delta in the *opposite* direction (both models get *better* on their metrics; because the control gains too, the gap between them is reshaped rather than simply widened) — net effect: the v3.3 single-seed delta (−0.027 BPB) and the v3.5 5-seed mean delta (−0.0205 BPB) are within 0.007 BPB of each other and have the same sign, but the v3.5 number is the one that survives the methodology fix and the multi-seed verification, and is the one quoted everywhere in the main text. + +The original 5L row of the v3.2 sweep is used directly in §3.5 as the only available 5L cross-scale evidence pending the §6.0 follow-up. diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/RUNBOOK.md b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/RUNBOOK.md new file mode 100644 index 0000000000..25629ab5f9 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/RUNBOOK.md @@ -0,0 +1,91 @@ +# RUNBOOK (8xH100, 10-minute budget) + +This runbook documents how to rerun and verify the submission in this folder.
+ +## 1) Environment + +- GPU: 8x H100 SXM +- Python deps: `torch`, `sentencepiece`, `numpy` +- Working dir: + +```bash +cd /workspace/parameter-golf/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence +``` + +## 2) AR + Retrodiction (5L d=256, v4096) + +```bash +torchrun --standalone --nproc_per_node=8 train_gpt.py \ + --train_budget_secs 540 \ + --steps 99999 \ + --grad_accum 1 \ + --microbatch_tokens 65536 \ + --val_every 500 \ + --val_tokens 1000000 \ + --data_dir /workspace/data_v4096_full \ + --tokenizer_path /workspace/bpe_v4096.model \ + --save_path model_5L_v4096.npz \ + --save_int6_path model_5L_v4096_int6.lzma +``` + +Expected outputs: +- `model_5L_v4096.npz` +- `model_5L_v4096_int6.lzma` +- train log with `FINAL val_bpb` + +## 3) Shared AR+CDM (single model, SP1024) + +```bash +torchrun --standalone --nproc_per_node=8 train_cdm.py \ + --train_budget_secs 540 \ + --steps 99999 \ + --grad_accum 1 \ + --microbatch_tokens 65536 \ + --val_every 500 \ + --val_tokens 1000000 \ + --data_dir /data/datasets/fineweb10B_sp1024 \ + --tokenizer_path /data/tokenizers/fineweb_1024_bpe.model \ + --save_path shared_ar_cdm.npz \ + --save_int6_path shared_ar_cdm_int6.lzma +``` + +Expected outputs: +- `shared_ar_cdm.npz` +- `shared_ar_cdm_int6.lzma` +- train log with `FINAL val_bpb` + +## 4) Evaluation + +The Sequential Unmasking script uses fixed paths/checkpoint names in code: + +```bash +python3 eval_sequential_unmasking.py +``` + +TTT evaluation: + +```bash +python3 eval_ttt.py \ + --model_path model_5L_v4096.npz \ + --model_dim 512 \ + --val_tokens 500000 \ + --ttt_lr 3e-4 \ + --ttt_steps 1 +``` + +> NOTE (review): the §2 header specifies this model as 5L **d=256**, but `--model_dim 512` is passed above — confirm which value is correct before running. + +## 5) Verification checklist + +1. Training budget enforcement: + - log shows budget trigger near 540s. +2. Metric extraction: + - log includes `FINAL val_bpb`. +3. Artifact size: + - `ls -lh *.lzma` and confirm compressed model is under 16MB. +4. PR metadata consistency: + - `submission.json` values match reported outputs.
+ +## 6) Optional reproducibility protocol + +- Run 3-5 repeats (separate logs). +- Report mean/std for `val_bpb`. +- Attach all logs to the PR thread for auditability. diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w03_cf.log b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w03_cf.log new file mode 100644 index 0000000000..293cfb1515 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w03_cf.log @@ -0,0 +1,53 @@ +====================================================================== +CF ablation eval + ckpt: /workspace/ckpt/11L_w03/step_1500.pt + model: L=11 d=512 vocab=4096 + eval: N=500 seq=1024 stride=2 rounds=2 +====================================================================== +Loading model class from /tmp/train_cdm_patched_11L_w0.0.py + module constants verified: 11L d=512 vocab=4096 + loading state dict from /workspace/ckpt/11L_w03/step_1500.pt + loaded 116 keys, missing=0, unexpected=0 + params: 28,402,777 + val tokens: 45,122,403 + +[1] Pure AR baseline (is_causal=True) + AR 100/500 | BPB:1.4803 | 2s + AR 200/500 | BPB:1.4813 | 3s + AR 300/500 | BPB:1.4758 | 5s + AR 400/500 | BPB:1.4709 | 6s + AR 500/500 | BPB:1.4708 | 8s + Pure AR BPB: 1.4708 + +[2] CF eval (stride=2, rounds=2) + CF 100/500 | AR:0.7330 CDM:0.6050 Total:1.3379 | 9s + CF 200/500 | AR:0.7338 CDM:0.6084 Total:1.3422 | 19s + CF 300/500 | AR:0.7317 CDM:0.6061 Total:1.3378 | 28s + CF 400/500 | AR:0.7307 CDM:0.6058 Total:1.3365 | 37s + CF 500/500 | AR:0.7273 CDM:0.6028 Total:1.3301 | 46s + +====================================================================== + Pure AR: 1.4708 + CF AR part: 0.7273 + CF CDM part: 0.6028 + CF Total: 1.3301 + CF vs Pure AR: -9.57% (−0.1407 BPB) +====================================================================== + +JSON: +{ + "ckpt": 
"/workspace/ckpt/11L_w03/step_1500.pt", + "num_layers": 11, + "model_dim": 512, + "vocab_size": 4096, + "n_params": 28402777, + "n_seqs": 500, + "seq_len": 1024, + "stride": 2, + "rounds": 2, + "pure_ar_bpb": 1.4707942821001148, + "cf_ar_part": 0.7273059643224621, + "cf_cdm_part": 0.6027956697378927, + "cf_total": 1.3301016340603549, + "cf_vs_ar_pct": -9.56575979061245 +} diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w03_train.log b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w03_train.log new file mode 100644 index 0000000000..74179cc6a0 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w03_train.log @@ -0,0 +1,60 @@ +[ablation runner] patched script written to /tmp/train_cdm_patched_11L_w0.3.py +[ablation runner] config: 11L d=512 vocab=4096 xsa=4 bigram=128 cdm_weight=0.3 +WARNING: effective_batch=65,536 != official 524,288 + grad_accum=1 x microbatch=65536 x 1 GPUs +====================================================================== +Shared AR+CDM PyTorch | 11L d=512 MLP=3x | steps=9999 +NUM_HEADS=8 head_dim=64 NUM_KV_HEADS=4 BIGRAM_DIM=128 +Encoder=5 Decoder=6 (U-net skip connections) +Retro alpha=0.3 | XSA last 4 layers | LeakyReLU^2 | BigramHash(2048) | EMA(0.997) +Device: cuda | World size: 1 | Grad accum: 1 +Effective batch: 65,536 tok/step +====================================================================== +Val tokens: 45,121,536 (eval on 999,424) +Model params: 28,402,777 (28.4M) +Estimated size BF16: 56.8MB | int6: 21.3MB +Compiling model with torch.compile()... + Compilation requested (will happen on first forward pass) + Muon params: 68 tensors + Scalar params: 47 tensors + Embed param: tok_emb.weight +EMA starts step 7999 +Time budget: 540s (warmdown in last 80s) +Starting training... 
+step:0/9999 val_bpb:3.6123 *BEST* tokens:0M elapsed:1s +W0409 03:36:42.083000 134032994210432 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] q2 is not in var_ranges, defaulting to unknown range. +W0409 03:36:42.142000 134032994210432 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] z2 is not in var_ranges, defaulting to unknown range. +W0409 03:36:43.868000 134032994210432 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] x0 is not in var_ranges, defaulting to unknown range. + step:100 train_loss:6.4939 lr_mul:1.0000 tok/s:111940 + step:200 train_loss:5.9238 lr_mul:1.0000 tok/s:144981 + step:300 train_loss:5.6456 lr_mul:1.0000 tok/s:160803 + step:400 train_loss:4.9008 lr_mul:1.0000 tok/s:170091 +step:500/9999 val_bpb:1.6355 *BEST* tokens:33M elapsed:187s + step:500 train_loss:4.8979 lr_mul:1.0000 tok/s:175063 + Checkpoint saved: /workspace/ckpt/11L_w03/step_500.pt + step:600 train_loss:5.0355 lr_mul:1.0000 tok/s:179465 + step:700 train_loss:4.5977 lr_mul:1.0000 tok/s:182875 + step:800 train_loss:4.5987 lr_mul:1.0000 tok/s:185518 + step:900 train_loss:4.7096 lr_mul:1.0000 tok/s:187628 +step:1000/9999 val_bpb:1.5369 *BEST* tokens:66M elapsed:347s + step:1000 train_loss:4.7862 lr_mul:1.0000 tok/s:188710 + Checkpoint saved: /workspace/ckpt/11L_w03/step_1000.pt + step:1100 train_loss:5.1869 lr_mul:1.0000 tok/s:190159 + step:1200 train_loss:4.3484 lr_mul:1.0000 tok/s:191348 + step:1300 train_loss:4.7448 lr_mul:1.0000 tok/s:192433 + step:1400 train_loss:4.2465 lr_mul:0.8231 tok/s:193374 +step:1500/9999 val_bpb:1.4638 *BEST* tokens:98M elapsed:507s + step:1500 train_loss:4.3903 lr_mul:0.4112 tok/s:193733 + Checkpoint saved: /workspace/ckpt/11L_w03/step_1500.pt + step:1600 train_loss:4.3777 lr_mul:0.0132 tok/s:194448 + Budget 540s reached at step 1604 (540s elapsed) — triggering final eval+save +step:1604/9999 val_bpb:1.4369 *BEST* tokens:105M elapsed:541s + +Saved NPZ to /workspace/out/11L_w03.npz +Saved int6+lzma to /workspace/out/11L_w03_int6.lzma (18.8MB) 
+====================================================================== +FINAL val_bpb: 1.4369 +Total tokens: 0.655B in 555s +Model: 11L d=512 MLP=3x | 28,402,777 params +Throughput: 1180792 tok/s +====================================================================== diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w0_cf.log b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w0_cf.log new file mode 100644 index 0000000000..10a8c6ead6 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w0_cf.log @@ -0,0 +1,53 @@ +====================================================================== +CF ablation eval + ckpt: /workspace/ckpt/11L_w0/step_5000.pt + model: L=11 d=512 vocab=4096 + eval: N=500 seq=1024 stride=2 rounds=2 +====================================================================== +Loading model class from /tmp/train_cdm_patched_11L_w0.0.py + module constants verified: 11L d=512 vocab=4096 + loading state dict from /workspace/ckpt/11L_w0/step_5000.pt + loaded 116 keys, missing=0, unexpected=0 + params: 28,402,777 + val tokens: 45,122,403 + +[1] Pure AR baseline (is_causal=True) + AR 100/500 | BPB:1.3670 | 2s + AR 200/500 | BPB:1.3681 | 3s + AR 300/500 | BPB:1.3634 | 4s + AR 400/500 | BPB:1.3580 | 5s + AR 500/500 | BPB:1.3574 | 6s + Pure AR BPB: 1.3574 + +[2] CF eval (stride=2, rounds=2) + CF 100/500 | AR:0.6749 CDM:1.7227 Total:2.3976 | 9s + CF 200/500 | AR:0.6749 CDM:1.7343 Total:2.4092 | 19s + CF 300/500 | AR:0.6740 CDM:1.7274 Total:2.4014 | 28s + CF 400/500 | AR:0.6730 CDM:1.7283 Total:2.4014 | 37s + CF 500/500 | AR:0.6697 CDM:1.7250 Total:2.3947 | 46s + +====================================================================== + Pure AR: 1.3574 + CF AR part: 0.6697 + CF CDM part: 1.7250 + CF Total: 2.3947 + CF vs Pure AR: +76.42% (+1.0373 
BPB) +====================================================================== + +JSON: +{ + "ckpt": "/workspace/ckpt/11L_w0/step_5000.pt", + "num_layers": 11, + "model_dim": 512, + "vocab_size": 4096, + "n_params": 28402777, + "n_seqs": 500, + "seq_len": 1024, + "stride": 2, + "rounds": 2, + "pure_ar_bpb": 1.357406095601703, + "cf_ar_part": 0.6697479452691406, + "cf_cdm_part": 1.7249967824431076, + "cf_total": 2.394744727712248, + "cf_vs_ar_pct": 76.42065520935498 +} diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w0_train.log b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w0_train.log new file mode 100644 index 0000000000..5146506ac4 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w0_train.log @@ -0,0 +1,111 @@ +[ablation runner] patched script written to /tmp/train_cdm_patched_11L_w0.0.py +[ablation runner] config: 11L d=512 vocab=4096 xsa=4 bigram=128 cdm_weight=0.0 +WARNING: effective_batch=65,536 != official 524,288 + grad_accum=1 x microbatch=65536 x 1 GPUs +====================================================================== +Shared AR+CDM PyTorch | 11L d=512 MLP=3x | steps=9999 +NUM_HEADS=8 head_dim=64 NUM_KV_HEADS=4 BIGRAM_DIM=128 +Encoder=5 Decoder=6 (U-net skip connections) +Retro alpha=0.3 | XSA last 4 layers | LeakyReLU^2 | BigramHash(2048) | EMA(0.997) +Device: cuda | World size: 1 | Grad accum: 1 +Effective batch: 65,536 tok/step +====================================================================== +Val tokens: 45,121,536 (eval on 999,424) +Model params: 28,402,777 (28.4M) +Estimated size BF16: 56.8MB | int6: 21.3MB +Compiling model with torch.compile()... 
+ Compilation requested (will happen on first forward pass) + Muon params: 68 tensors + Scalar params: 47 tensors + Embed param: tok_emb.weight +EMA starts step 7999 +Time budget: 540s (warmdown in last 80s) +Starting training... +step:0/9999 val_bpb:3.6123 *BEST* tokens:0M elapsed:1s +W0409 03:27:23.002000 125955662033536 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] q2 is not in var_ranges, defaulting to unknown range. +W0409 03:27:23.061000 125955662033536 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] z2 is not in var_ranges, defaulting to unknown range. +W0409 03:27:24.793000 125955662033536 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] x0 is not in var_ranges, defaulting to unknown range. + step:100 train_loss:4.7326 lr_mul:1.0000 tok/s:148416 + step:200 train_loss:4.2383 lr_mul:1.0000 tok/s:245391 + step:300 train_loss:4.0378 lr_mul:1.0000 tok/s:313812 + step:400 train_loss:3.8172 lr_mul:1.0000 tok/s:364718 +step:500/9999 val_bpb:1.6258 *BEST* tokens:33M elapsed:82s + step:500 train_loss:3.7601 lr_mul:1.0000 tok/s:398194 + Checkpoint saved: /workspace/ckpt/11L_w0/step_500.pt + step:600 train_loss:3.7328 lr_mul:1.0000 tok/s:429337 + step:700 train_loss:3.7116 lr_mul:1.0000 tok/s:455053 + step:800 train_loss:3.4941 lr_mul:1.0000 tok/s:476452 + step:900 train_loss:3.4693 lr_mul:1.0000 tok/s:494532 +step:1000/9999 val_bpb:1.5300 *BEST* tokens:66M elapsed:130s + step:1000 train_loss:3.4981 lr_mul:1.0000 tok/s:505423 + Checkpoint saved: /workspace/ckpt/11L_w0/step_1000.pt + step:1100 train_loss:3.7769 lr_mul:1.0000 tok/s:518728 + step:1200 train_loss:3.4576 lr_mul:1.0000 tok/s:530233 + step:1300 train_loss:3.5582 lr_mul:1.0000 tok/s:540823 + step:1400 train_loss:3.3578 lr_mul:1.0000 tok/s:550242 +step:1500/9999 val_bpb:1.4932 *BEST* tokens:98M elapsed:177s + step:1500 train_loss:3.4604 lr_mul:1.0000 tok/s:554933 + Checkpoint saved: /workspace/ckpt/11L_w0/step_1500.pt + step:1600 train_loss:3.4266 lr_mul:1.0000 tok/s:562522 + step:1700 
train_loss:3.5253 lr_mul:1.0000 tok/s:569613 + step:1800 train_loss:3.4361 lr_mul:1.0000 tok/s:576061 + step:1900 train_loss:3.3710 lr_mul:1.0000 tok/s:581964 +step:2000/9999 val_bpb:1.4600 *BEST* tokens:131M elapsed:224s + step:2000 train_loss:3.3497 lr_mul:1.0000 tok/s:584326 + Checkpoint saved: /workspace/ckpt/11L_w0/step_2000.pt + step:2100 train_loss:3.3827 lr_mul:1.0000 tok/s:589241 + step:2200 train_loss:3.2954 lr_mul:1.0000 tok/s:593968 + step:2300 train_loss:3.4577 lr_mul:1.0000 tok/s:598052 + step:2400 train_loss:3.3185 lr_mul:1.0000 tok/s:602124 +step:2500/9999 val_bpb:1.4412 *BEST* tokens:164M elapsed:271s + step:2500 train_loss:3.3139 lr_mul:1.0000 tok/s:603260 + Checkpoint saved: /workspace/ckpt/11L_w0/step_2500.pt + step:2600 train_loss:3.3770 lr_mul:1.0000 tok/s:606706 + step:2700 train_loss:3.4328 lr_mul:1.0000 tok/s:610086 + step:2800 train_loss:3.3228 lr_mul:1.0000 tok/s:613259 + step:2900 train_loss:3.2188 lr_mul:1.0000 tok/s:616241 +step:3000/9999 val_bpb:1.4264 *BEST* tokens:197M elapsed:319s + step:3000 train_loss:3.3162 lr_mul:1.0000 tok/s:616691 + Checkpoint saved: /workspace/ckpt/11L_w0/step_3000.pt + step:3100 train_loss:3.2863 lr_mul:1.0000 tok/s:619287 + step:3200 train_loss:3.3078 lr_mul:1.0000 tok/s:621861 + step:3300 train_loss:3.1988 lr_mul:1.0000 tok/s:624289 + step:3400 train_loss:3.2755 lr_mul:1.0000 tok/s:626368 +step:3500/9999 val_bpb:1.4153 *BEST* tokens:229M elapsed:366s + step:3500 train_loss:3.2365 lr_mul:1.0000 tok/s:626474 + Checkpoint saved: /workspace/ckpt/11L_w0/step_3500.pt + step:3600 train_loss:3.5301 lr_mul:1.0000 tok/s:628506 + step:3700 train_loss:3.2610 lr_mul:1.0000 tok/s:630546 + step:3800 train_loss:3.3355 lr_mul:1.0000 tok/s:632486 + step:3900 train_loss:3.1927 lr_mul:1.0000 tok/s:634339 +step:4000/9999 val_bpb:1.4089 *BEST* tokens:262M elapsed:413s + step:4000 train_loss:3.3703 lr_mul:1.0000 tok/s:634252 + Checkpoint saved: /workspace/ckpt/11L_w0/step_4000.pt + step:4100 train_loss:3.2153 lr_mul:1.0000 
tok/s:635867 + step:4200 train_loss:3.1614 lr_mul:1.0000 tok/s:637522 + step:4300 train_loss:3.2347 lr_mul:1.0000 tok/s:639115 + step:4400 train_loss:3.3560 lr_mul:1.0000 tok/s:640641 +step:4500/9999 val_bpb:1.3999 *BEST* tokens:295M elapsed:460s + step:4500 train_loss:3.4165 lr_mul:0.9938 tok/s:640297 + Checkpoint saved: /workspace/ckpt/11L_w0/step_4500.pt + step:4600 train_loss:3.1301 lr_mul:0.8780 tok/s:641616 + step:4700 train_loss:3.2121 lr_mul:0.7631 tok/s:642983 + step:4800 train_loss:3.3831 lr_mul:0.6482 tok/s:644303 + step:4900 train_loss:3.0608 lr_mul:0.5332 tok/s:645568 +step:5000/9999 val_bpb:1.3485 *BEST* tokens:328M elapsed:508s + step:5000 train_loss:2.9204 lr_mul:0.4036 tok/s:645287 + Checkpoint saved: /workspace/ckpt/11L_w0/step_5000.pt + step:5100 train_loss:3.0275 lr_mul:0.2880 tok/s:646425 + step:5200 train_loss:2.9388 lr_mul:0.1732 tok/s:647597 + step:5300 train_loss:2.9422 lr_mul:0.0583 tok/s:648718 + Budget 540s reached at step 5351 (540s elapsed) — triggering final eval+save +step:5351/9999 val_bpb:1.3133 *BEST* tokens:351M elapsed:541s + +Saved NPZ to /workspace/out/11L_w0.npz +Saved int6+lzma to /workspace/out/11L_w0_int6.lzma (18.6MB) +====================================================================== +FINAL val_bpb: 1.3133 +Total tokens: 0.655B in 555s +Model: 11L d=512 MLP=3x | 28,402,777 params +Throughput: 1181669 tok/s +====================================================================== diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w1_cf.log b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w1_cf.log new file mode 100644 index 0000000000..491e6b5060 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w1_cf.log @@ -0,0 +1,53 @@ +====================================================================== +CF 
ablation eval + ckpt: /workspace/ckpt/11L_w1/step_1500.pt + model: L=11 d=512 vocab=4096 + eval: N=500 seq=1024 stride=2 rounds=2 +====================================================================== +Loading model class from /tmp/train_cdm_patched_11L_w0.0.py + module constants verified: 11L d=512 vocab=4096 + loading state dict from /workspace/ckpt/11L_w1/step_1500.pt + loaded 116 keys, missing=0, unexpected=0 + params: 28,402,777 + val tokens: 45,122,403 + +[1] Pure AR baseline (is_causal=True) + AR 100/500 | BPB:1.5522 | 2s + AR 200/500 | BPB:1.5523 | 3s + AR 300/500 | BPB:1.5466 | 5s + AR 400/500 | BPB:1.5416 | 6s + AR 500/500 | BPB:1.5414 | 8s + Pure AR BPB: 1.5414 + +[2] CF eval (stride=2, rounds=2) + CF 100/500 | AR:0.7666 CDM:0.5932 Total:1.3598 | 9s + CF 200/500 | AR:0.7687 CDM:0.5965 Total:1.3652 | 18s + CF 300/500 | AR:0.7666 CDM:0.5943 Total:1.3609 | 27s + CF 400/500 | AR:0.7658 CDM:0.5935 Total:1.3593 | 36s + CF 500/500 | AR:0.7624 CDM:0.5903 Total:1.3527 | 45s + +====================================================================== + Pure AR: 1.5414 + CF AR part: 0.7624 + CF CDM part: 0.5903 + CF Total: 1.3527 + CF vs Pure AR: -12.24% (−0.1887 BPB) +====================================================================== + +JSON: +{ + "ckpt": "/workspace/ckpt/11L_w1/step_1500.pt", + "num_layers": 11, + "model_dim": 512, + "vocab_size": 4096, + "n_params": 28402777, + "n_seqs": 500, + "seq_len": 1024, + "stride": 2, + "rounds": 2, + "pure_ar_bpb": 1.5413681222729079, + "cf_ar_part": 0.7623985235892268, + "cf_cdm_part": 0.5902907004017068, + "cf_total": 1.3526892239909336, + "cf_vs_ar_pct": -12.241001715005472 +} diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w1_train.log b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w1_train.log new file mode 100644 index 0000000000..5f4670f63a --- /dev/null +++ 
b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/11L_w1_train.log @@ -0,0 +1,60 @@ +[ablation runner] patched script written to /tmp/train_cdm_patched_11L_w1.0.py +[ablation runner] config: 11L d=512 vocab=4096 xsa=4 bigram=128 cdm_weight=1.0 +WARNING: effective_batch=65,536 != official 524,288 + grad_accum=1 x microbatch=65536 x 1 GPUs +====================================================================== +Shared AR+CDM PyTorch | 11L d=512 MLP=3x | steps=9999 +NUM_HEADS=8 head_dim=64 NUM_KV_HEADS=4 BIGRAM_DIM=128 +Encoder=5 Decoder=6 (U-net skip connections) +Retro alpha=0.3 | XSA last 4 layers | LeakyReLU^2 | BigramHash(2048) | EMA(0.997) +Device: cuda | World size: 1 | Grad accum: 1 +Effective batch: 65,536 tok/step +====================================================================== +Val tokens: 45,121,536 (eval on 999,424) +Model params: 28,402,777 (28.4M) +Estimated size BF16: 56.8MB | int6: 21.3MB +Compiling model with torch.compile()... + Compilation requested (will happen on first forward pass) + Muon params: 68 tensors + Scalar params: 47 tensors + Embed param: tok_emb.weight +EMA starts step 7999 +Time budget: 540s (warmdown in last 80s) +Starting training... +step:0/9999 val_bpb:3.6123 *BEST* tokens:0M elapsed:2s +W0409 03:46:01.625000 135878654083712 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] q2 is not in var_ranges, defaulting to unknown range. +W0409 03:46:01.684000 135878654083712 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] z2 is not in var_ranges, defaulting to unknown range. +W0409 03:46:03.402000 135878654083712 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] x0 is not in var_ranges, defaulting to unknown range. 
+ step:100 train_loss:10.5451 lr_mul:1.0000 tok/s:111758 + step:200 train_loss:9.3241 lr_mul:1.0000 tok/s:144826 + step:300 train_loss:7.8810 lr_mul:1.0000 tok/s:160669 + step:400 train_loss:7.2725 lr_mul:1.0000 tok/s:169968 +step:500/9999 val_bpb:1.7042 *BEST* tokens:33M elapsed:187s + step:500 train_loss:8.4032 lr_mul:1.0000 tok/s:174973 + Checkpoint saved: /workspace/ckpt/11L_w1/step_500.pt + step:600 train_loss:7.1105 lr_mul:1.0000 tok/s:179386 + step:700 train_loss:8.5153 lr_mul:1.0000 tok/s:182727 + step:800 train_loss:8.2336 lr_mul:1.0000 tok/s:185377 + step:900 train_loss:6.7922 lr_mul:1.0000 tok/s:187507 +step:1000/9999 val_bpb:1.6091 *BEST* tokens:66M elapsed:347s + step:1000 train_loss:8.2942 lr_mul:1.0000 tok/s:188605 + Checkpoint saved: /workspace/ckpt/11L_w1/step_1000.pt + step:1100 train_loss:7.4806 lr_mul:1.0000 tok/s:190057 + step:1200 train_loss:7.7989 lr_mul:1.0000 tok/s:191237 + step:1300 train_loss:7.1062 lr_mul:1.0000 tok/s:192310 + step:1400 train_loss:6.0641 lr_mul:0.8194 tok/s:193255 +step:1500/9999 val_bpb:1.5361 *BEST* tokens:98M elapsed:507s + step:1500 train_loss:6.1946 lr_mul:0.4080 tok/s:193635 + Checkpoint saved: /workspace/ckpt/11L_w1/step_1500.pt + step:1600 train_loss:7.4521 lr_mul:0.0102 tok/s:194361 + Budget 540s reached at step 1603 (540s elapsed) — triggering final eval+save +step:1603/9999 val_bpb:1.5086 *BEST* tokens:105M elapsed:541s + +Saved NPZ to /workspace/out/11L_w1.npz +Saved int6+lzma to /workspace/out/11L_w1_int6.lzma (18.7MB) +====================================================================== +FINAL val_bpb: 1.5086 +Total tokens: 0.655B in 555s +Model: 11L d=512 MLP=3x | 28,402,777 params +Throughput: 1180567 tok/s +====================================================================== diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w03_cf.log 
b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w03_cf.log new file mode 100644 index 0000000000..ba99d97f34 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w03_cf.log @@ -0,0 +1,53 @@ +====================================================================== +CF ablation eval + ckpt: /workspace/ckpt/5L_w03/step_5000.pt + model: L=5 d=256 vocab=4096 + eval: N=500 seq=1024 stride=2 rounds=2 +====================================================================== +Loading model class from /tmp/train_cdm_patched_5L_w0.0.py + module constants verified: 5L d=256 vocab=4096 + loading state dict from /workspace/ckpt/5L_w03/step_5000.pt + loaded 56 keys, missing=0, unexpected=0 + params: 4,298,537 + val tokens: 45,122,403 + +[1] Pure AR baseline (is_causal=True) + AR 100/500 | BPB:1.5312 | 1s + AR 200/500 | BPB:1.5325 | 2s + AR 300/500 | BPB:1.5287 | 3s + AR 400/500 | BPB:1.5234 | 3s + AR 500/500 | BPB:1.5231 | 4s + Pure AR BPB: 1.5231 + +[2] CF eval (stride=2, rounds=2) + CF 100/500 | AR:0.7573 CDM:0.6501 Total:1.4074 | 6s + CF 200/500 | AR:0.7589 CDM:0.6544 Total:1.4133 | 11s + CF 300/500 | AR:0.7570 CDM:0.6514 Total:1.4085 | 17s + CF 400/500 | AR:0.7560 CDM:0.6509 Total:1.4069 | 23s + CF 500/500 | AR:0.7527 CDM:0.6482 Total:1.4009 | 28s + +====================================================================== + Pure AR: 1.5231 + CF AR part: 0.7527 + CF CDM part: 0.6482 + CF Total: 1.4009 + CF vs Pure AR: -8.03% (−0.1223 BPB) +====================================================================== + +JSON: +{ + "ckpt": "/workspace/ckpt/5L_w03/step_5000.pt", + "num_layers": 5, + "model_dim": 256, + "vocab_size": 4096, + "n_params": 4298537, + "n_seqs": 500, + "seq_len": 1024, + "stride": 2, + "rounds": 2, + "pure_ar_bpb": 1.5231493756157157, + "cf_ar_part": 0.7526675152508981, + "cf_cdm_part": 0.6482111594486734, + "cf_total": 
1.4008786746995714, + "cf_vs_ar_pct": -8.027492435974493 +} diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w03_train.log b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w03_train.log new file mode 100644 index 0000000000..3fc5a51b0f --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w03_train.log @@ -0,0 +1,111 @@ +[ablation runner] patched script written to /tmp/train_cdm_patched_5L_w0.3.py +[ablation runner] config: 5L d=256 vocab=4096 xsa=2 bigram=128 cdm_weight=0.3 +WARNING: effective_batch=65,536 != official 524,288 + grad_accum=1 x microbatch=65536 x 1 GPUs +====================================================================== +Shared AR+CDM PyTorch | 5L d=256 MLP=3x | steps=9999 +NUM_HEADS=8 head_dim=32 NUM_KV_HEADS=4 BIGRAM_DIM=128 +Encoder=2 Decoder=3 (U-net skip connections) +Retro alpha=0.3 | XSA last 2 layers | LeakyReLU^2 | BigramHash(2048) | EMA(0.997) +Device: cuda | World size: 1 | Grad accum: 1 +Effective batch: 65,536 tok/step +====================================================================== +Val tokens: 45,121,536 (eval on 999,424) +Model params: 4,298,537 (4.3M) +Estimated size BF16: 8.6MB | int6: 3.2MB +Compiling model with torch.compile()... + Compilation requested (will happen on first forward pass) + Muon params: 32 tensors + Scalar params: 23 tensors + Embed param: tok_emb.weight +EMA starts step 7999 +Time budget: 540s (warmdown in last 80s) +Starting training... +step:0/9999 val_bpb:3.6114 *BEST* tokens:0M elapsed:1s +W0409 03:09:05.065000 125476072256128 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] q2 is not in var_ranges, defaulting to unknown range. +W0409 03:09:05.124000 125476072256128 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] z2 is not in var_ranges, defaulting to unknown range. 
+W0409 03:09:06.031000 125476072256128 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] x0 is not in var_ranges, defaulting to unknown range. + step:100 train_loss:6.6967 lr_mul:1.0000 tok/s:258999 + step:200 train_loss:6.0995 lr_mul:1.0000 tok/s:374959 + step:300 train_loss:5.7949 lr_mul:1.0000 tok/s:440743 + step:400 train_loss:5.5774 lr_mul:1.0000 tok/s:483349 +step:500/9999 val_bpb:1.7438 *BEST* tokens:33M elapsed:65s + step:500 train_loss:5.5143 lr_mul:1.0000 tok/s:506640 + Checkpoint saved: /workspace/ckpt/5L_w03/step_500.pt + step:600 train_loss:5.5101 lr_mul:1.0000 tok/s:529430 + step:700 train_loss:5.2180 lr_mul:1.0000 tok/s:546905 + step:800 train_loss:5.1464 lr_mul:1.0000 tok/s:560793 + step:900 train_loss:5.0867 lr_mul:1.0000 tok/s:572143 +step:1000/9999 val_bpb:1.6575 *BEST* tokens:66M elapsed:113s + step:1000 train_loss:5.2449 lr_mul:1.0000 tok/s:577011 + Checkpoint saved: /workspace/ckpt/5L_w03/step_1000.pt + step:1100 train_loss:5.1175 lr_mul:1.0000 tok/s:585152 + step:1200 train_loss:4.8058 lr_mul:1.0000 tok/s:591533 + step:1300 train_loss:4.9994 lr_mul:1.0000 tok/s:597658 + step:1400 train_loss:4.9496 lr_mul:1.0000 tok/s:602961 +step:1500/9999 val_bpb:1.6252 *BEST* tokens:98M elapsed:163s + step:1500 train_loss:5.2494 lr_mul:1.0000 tok/s:604320 + Checkpoint saved: /workspace/ckpt/5L_w03/step_1500.pt + step:1600 train_loss:4.8711 lr_mul:1.0000 tok/s:608632 + step:1700 train_loss:5.2470 lr_mul:1.0000 tok/s:612564 + step:1800 train_loss:4.6756 lr_mul:1.0000 tok/s:616065 + step:1900 train_loss:4.9853 lr_mul:1.0000 tok/s:619265 +step:2000/9999 val_bpb:1.6024 *BEST* tokens:131M elapsed:211s + step:2000 train_loss:4.5996 lr_mul:1.0000 tok/s:619719 + Checkpoint saved: /workspace/ckpt/5L_w03/step_2000.pt + step:2100 train_loss:5.0224 lr_mul:1.0000 tok/s:622516 + step:2200 train_loss:5.0494 lr_mul:1.0000 tok/s:625027 + step:2300 train_loss:4.8569 lr_mul:1.0000 tok/s:626985 + step:2400 train_loss:4.9820 lr_mul:1.0000 tok/s:629126 +step:2500/9999 
val_bpb:1.5883 *BEST* tokens:164M elapsed:260s + step:2500 train_loss:4.7816 lr_mul:1.0000 tok/s:629179 + Checkpoint saved: /workspace/ckpt/5L_w03/step_2500.pt + step:2600 train_loss:5.1174 lr_mul:1.0000 tok/s:631116 + step:2700 train_loss:5.0121 lr_mul:1.0000 tok/s:632883 + step:2800 train_loss:5.0821 lr_mul:1.0000 tok/s:634535 + step:2900 train_loss:4.4510 lr_mul:1.0000 tok/s:636084 +step:3000/9999 val_bpb:1.5768 *BEST* tokens:197M elapsed:309s + step:3000 train_loss:4.7237 lr_mul:1.0000 tok/s:635701 + Checkpoint saved: /workspace/ckpt/5L_w03/step_3000.pt + step:3100 train_loss:4.5691 lr_mul:1.0000 tok/s:637116 + step:3200 train_loss:4.6995 lr_mul:1.0000 tok/s:638457 + step:3300 train_loss:4.4842 lr_mul:1.0000 tok/s:639719 + step:3400 train_loss:4.8522 lr_mul:1.0000 tok/s:640597 +step:3500/9999 val_bpb:1.5684 *BEST* tokens:229M elapsed:358s + step:3500 train_loss:4.5599 lr_mul:1.0000 tok/s:640226 + Checkpoint saved: /workspace/ckpt/5L_w03/step_3500.pt + step:3600 train_loss:5.0577 lr_mul:1.0000 tok/s:641328 + step:3700 train_loss:4.5498 lr_mul:1.0000 tok/s:642449 + step:3800 train_loss:5.1453 lr_mul:1.0000 tok/s:643453 + step:3900 train_loss:4.4077 lr_mul:1.0000 tok/s:644442 +step:4000/9999 val_bpb:1.5636 *BEST* tokens:262M elapsed:407s + step:4000 train_loss:4.9227 lr_mul:1.0000 tok/s:643967 + Checkpoint saved: /workspace/ckpt/5L_w03/step_4000.pt + step:4100 train_loss:4.7866 lr_mul:1.0000 tok/s:644856 + step:4200 train_loss:4.7015 lr_mul:1.0000 tok/s:645733 + step:4300 train_loss:4.5466 lr_mul:1.0000 tok/s:646623 + step:4400 train_loss:5.0215 lr_mul:1.0000 tok/s:647421 +step:4500/9999 val_bpb:1.5553 *BEST* tokens:295M elapsed:456s + step:4500 train_loss:5.2365 lr_mul:1.0000 tok/s:646629 + Checkpoint saved: /workspace/ckpt/5L_w03/step_4500.pt + step:4600 train_loss:4.8541 lr_mul:0.9301 tok/s:647361 + step:4700 train_loss:4.8105 lr_mul:0.8100 tok/s:648063 + step:4800 train_loss:4.9849 lr_mul:0.6899 tok/s:648736 + step:4900 train_loss:4.2994 lr_mul:0.5704 
tok/s:649449 +step:5000/9999 val_bpb:1.5168 *BEST* tokens:328M elapsed:505s + step:5000 train_loss:4.4683 lr_mul:0.4409 tok/s:649035 + Checkpoint saved: /workspace/ckpt/5L_w03/step_5000.pt + step:5100 train_loss:4.6952 lr_mul:0.3204 tok/s:649671 + step:5200 train_loss:4.6690 lr_mul:0.2005 tok/s:650291 + step:5300 train_loss:4.1794 lr_mul:0.0806 tok/s:650877 + Budget 540s reached at step 5368 (540s elapsed) — triggering final eval+save +step:5368/9999 val_bpb:1.4869 *BEST* tokens:352M elapsed:541s + +Saved NPZ to /workspace/out/5L_w03.npz +Saved int6+lzma to /workspace/out/5L_w03_int6.lzma (3.0MB) +====================================================================== +FINAL val_bpb: 1.4869 +Total tokens: 0.655B in 542s +Model: 5L d=256 MLP=3x | 4,298,537 params +Throughput: 1209199 tok/s +====================================================================== diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w0_cf.log b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w0_cf.log new file mode 100644 index 0000000000..c9a07dd79f --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w0_cf.log @@ -0,0 +1,53 @@ +====================================================================== +CF ablation eval + ckpt: /workspace/ckpt/5L_w0/step_8000.pt + model: L=5 d=256 vocab=4096 + eval: N=500 seq=1024 stride=2 rounds=2 +====================================================================== +Loading model class from /tmp/train_cdm_patched_5L_w0.0.py + module constants verified: 5L d=256 vocab=4096 + loading state dict from /workspace/ckpt/5L_w0/step_8000.pt + loaded 56 keys, missing=0, unexpected=0 + params: 4,298,537 + val tokens: 45,122,403 + +[1] Pure AR baseline (is_causal=True) + AR 100/500 | BPB:1.4582 | 1s + AR 200/500 | BPB:1.4578 | 2s + AR 300/500 | 
BPB:1.4533 | 3s + AR 400/500 | BPB:1.4483 | 3s + AR 500/500 | BPB:1.4479 | 4s + Pure AR BPB: 1.4479 + +[2] CF eval (stride=2, rounds=2) + CF 100/500 | AR:0.7204 CDM:1.7196 Total:2.4400 | 6s + CF 200/500 | AR:0.7214 CDM:1.7336 Total:2.4550 | 13s + CF 300/500 | AR:0.7195 CDM:1.7261 Total:2.4456 | 19s + CF 400/500 | AR:0.7182 CDM:1.7258 Total:2.4440 | 26s + CF 500/500 | AR:0.7150 CDM:1.7221 Total:2.4371 | 32s + +====================================================================== + Pure AR: 1.4479 + CF AR part: 0.7150 + CF CDM part: 1.7221 + CF Total: 2.4371 + CF vs Pure AR: +68.32% (+0.9892 BPB) +====================================================================== + +JSON: +{ + "ckpt": "/workspace/ckpt/5L_w0/step_8000.pt", + "num_layers": 5, + "model_dim": 256, + "vocab_size": 4096, + "n_params": 4298537, + "n_seqs": 500, + "seq_len": 1024, + "stride": 2, + "rounds": 2, + "pure_ar_bpb": 1.44789016705184, + "cf_ar_part": 0.7149924362530551, + "cf_cdm_part": 1.7221083801242465, + "cf_total": 2.437100816377302, + "cf_vs_ar_pct": 68.32083481440235 +} diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w0_train.log b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w0_train.log new file mode 100644 index 0000000000..e55f30d198 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w0_train.log @@ -0,0 +1,152 @@ +[ablation runner] patched script written to /tmp/train_cdm_patched_5L_w0.0.py +[ablation runner] config: 5L d=256 vocab=4096 xsa=2 bigram=128 cdm_weight=0.0 +WARNING: effective_batch=65,536 != official 524,288 + grad_accum=1 x microbatch=65536 x 1 GPUs +====================================================================== +Shared AR+CDM PyTorch | 5L d=256 MLP=3x | steps=9999 +NUM_HEADS=8 head_dim=32 NUM_KV_HEADS=4 BIGRAM_DIM=128 +Encoder=2 Decoder=3 
(U-net skip connections) +Retro alpha=0.3 | XSA last 2 layers | LeakyReLU^2 | BigramHash(2048) | EMA(0.997) +Device: cuda | World size: 1 | Grad accum: 1 +Effective batch: 65,536 tok/step +====================================================================== +Val tokens: 45,121,536 (eval on 999,424) +Model params: 4,298,537 (4.3M) +Estimated size BF16: 8.6MB | int6: 3.2MB +Compiling model with torch.compile()... + Compilation requested (will happen on first forward pass) + Muon params: 32 tensors + Scalar params: 23 tensors + Embed param: tok_emb.weight +EMA starts step 7999 +Time budget: 540s (warmdown in last 80s) +Starting training... +step:0/9999 val_bpb:3.6114 *BEST* tokens:0M elapsed:1s +W0409 02:59:59.025000 134680421003904 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] q2 is not in var_ranges, defaulting to unknown range. +W0409 02:59:59.085000 134680421003904 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] z2 is not in var_ranges, defaulting to unknown range. +W0409 02:59:59.988000 134680421003904 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] x0 is not in var_ranges, defaulting to unknown range. 
+ step:100 train_loss:4.7888 lr_mul:1.0000 tok/s:209699 + step:200 train_loss:4.4027 lr_mul:1.0000 tok/s:349559 + step:300 train_loss:4.2513 lr_mul:1.0000 tok/s:449336 + step:400 train_loss:4.0485 lr_mul:1.0000 tok/s:524693 +step:500/9999 val_bpb:1.7246 *BEST* tokens:33M elapsed:57s + step:500 train_loss:3.9926 lr_mul:1.0000 tok/s:575265 + Checkpoint saved: /workspace/ckpt/5L_w0/step_500.pt + step:600 train_loss:3.9500 lr_mul:1.0000 tok/s:623438 + step:700 train_loss:3.9374 lr_mul:1.0000 tok/s:662338 + step:800 train_loss:3.7337 lr_mul:1.0000 tok/s:693848 + step:900 train_loss:3.7096 lr_mul:1.0000 tok/s:722280 +step:1000/9999 val_bpb:1.6365 *BEST* tokens:66M elapsed:88s + step:1000 train_loss:3.7536 lr_mul:1.0000 tok/s:740007 + Checkpoint saved: /workspace/ckpt/5L_w0/step_1000.pt + step:1100 train_loss:4.0116 lr_mul:1.0000 tok/s:761526 + step:1200 train_loss:3.7099 lr_mul:1.0000 tok/s:778163 + step:1300 train_loss:3.7910 lr_mul:1.0000 tok/s:795547 + step:1400 train_loss:3.6154 lr_mul:1.0000 tok/s:809857 +step:1500/9999 val_bpb:1.6030 *BEST* tokens:98M elapsed:120s + step:1500 train_loss:3.7220 lr_mul:1.0000 tok/s:817253 + Checkpoint saved: /workspace/ckpt/5L_w0/step_1500.pt + step:1600 train_loss:3.6895 lr_mul:1.0000 tok/s:829156 + step:1700 train_loss:3.7674 lr_mul:1.0000 tok/s:840452 + step:1800 train_loss:3.7001 lr_mul:1.0000 tok/s:851167 + step:1900 train_loss:3.6441 lr_mul:1.0000 tok/s:859570 +step:2000/9999 val_bpb:1.5770 *BEST* tokens:131M elapsed:152s + step:2000 train_loss:3.6230 lr_mul:1.0000 tok/s:863095 + Checkpoint saved: /workspace/ckpt/5L_w0/step_2000.pt + step:2100 train_loss:3.6547 lr_mul:1.0000 tok/s:870538 + step:2200 train_loss:3.5703 lr_mul:1.0000 tok/s:877577 + step:2300 train_loss:3.7095 lr_mul:1.0000 tok/s:884103 + step:2400 train_loss:3.5973 lr_mul:1.0000 tok/s:890317 +step:2500/9999 val_bpb:1.5635 *BEST* tokens:164M elapsed:184s + step:2500 train_loss:3.5910 lr_mul:1.0000 tok/s:892441 + Checkpoint saved: /workspace/ckpt/5L_w0/step_2500.pt 
+ step:2600 train_loss:3.6734 lr_mul:1.0000 tok/s:898263 + step:2700 train_loss:3.7058 lr_mul:1.0000 tok/s:903749 + step:2800 train_loss:3.6067 lr_mul:1.0000 tok/s:908131 + step:2900 train_loss:3.5201 lr_mul:1.0000 tok/s:912574 +step:3000/9999 val_bpb:1.5506 *BEST* tokens:197M elapsed:215s + step:3000 train_loss:3.6052 lr_mul:1.0000 tok/s:913660 + Checkpoint saved: /workspace/ckpt/5L_w0/step_3000.pt + step:3100 train_loss:3.5736 lr_mul:1.0000 tok/s:917772 + step:3200 train_loss:3.5999 lr_mul:1.0000 tok/s:921451 + step:3300 train_loss:3.4797 lr_mul:1.0000 tok/s:925443 + step:3400 train_loss:3.5712 lr_mul:1.0000 tok/s:927801 +step:3500/9999 val_bpb:1.5410 *BEST* tokens:229M elapsed:247s + step:3500 train_loss:3.5196 lr_mul:1.0000 tok/s:928320 + Checkpoint saved: /workspace/ckpt/5L_w0/step_3500.pt + step:3600 train_loss:3.8080 lr_mul:1.0000 tok/s:931450 + step:3700 train_loss:3.5412 lr_mul:1.0000 tok/s:934924 + step:3800 train_loss:3.6337 lr_mul:1.0000 tok/s:938006 + step:3900 train_loss:3.4850 lr_mul:1.0000 tok/s:940494 +step:4000/9999 val_bpb:1.5362 *BEST* tokens:262M elapsed:279s + step:4000 train_loss:3.6449 lr_mul:1.0000 tok/s:940387 + Checkpoint saved: /workspace/ckpt/5L_w0/step_4000.pt + step:4100 train_loss:3.5095 lr_mul:1.0000 tok/s:942839 + step:4200 train_loss:3.4796 lr_mul:1.0000 tok/s:944927 + step:4300 train_loss:3.5344 lr_mul:1.0000 tok/s:947256 + step:4400 train_loss:3.6523 lr_mul:1.0000 tok/s:949205 +step:4500/9999 val_bpb:1.5284 *BEST* tokens:295M elapsed:311s + step:4500 train_loss:3.7115 lr_mul:1.0000 tok/s:948007 + Checkpoint saved: /workspace/ckpt/5L_w0/step_4500.pt + step:4600 train_loss:3.4383 lr_mul:1.0000 tok/s:950200 + step:4700 train_loss:3.5475 lr_mul:1.0000 tok/s:952502 + step:4800 train_loss:3.7308 lr_mul:1.0000 tok/s:954711 + step:4900 train_loss:3.4339 lr_mul:1.0000 tok/s:956614 +step:5000/9999 val_bpb:1.5236 *BEST* tokens:328M elapsed:343s + step:5000 train_loss:3.3306 lr_mul:1.0000 tok/s:956277 + Checkpoint saved: 
/workspace/ckpt/5L_w0/step_5000.pt + step:5100 train_loss:3.4591 lr_mul:1.0000 tok/s:957787 + step:5200 train_loss:3.4130 lr_mul:1.0000 tok/s:959292 + step:5300 train_loss:3.4176 lr_mul:1.0000 tok/s:960931 + step:5400 train_loss:3.4640 lr_mul:1.0000 tok/s:962474 +step:5500/9999 val_bpb:1.5212 *BEST* tokens:360M elapsed:375s + step:5500 train_loss:3.4291 lr_mul:1.0000 tok/s:962062 + Checkpoint saved: /workspace/ckpt/5L_w0/step_5500.pt +WARNING: starting epoch:2 + step:5600 train_loss:3.4279 lr_mul:1.0000 tok/s:963102 + step:5700 train_loss:3.4938 lr_mul:1.0000 tok/s:964776 + step:5800 train_loss:3.4756 lr_mul:1.0000 tok/s:965818 + step:5900 train_loss:3.4810 lr_mul:1.0000 tok/s:967466 +step:6000/9999 val_bpb:1.5195 *BEST* tokens:393M elapsed:407s + step:6000 train_loss:3.6886 lr_mul:1.0000 tok/s:966937 + Checkpoint saved: /workspace/ckpt/5L_w0/step_6000.pt + step:6100 train_loss:4.4400 lr_mul:1.0000 tok/s:968523 + step:6200 train_loss:3.4238 lr_mul:1.0000 tok/s:969369 + step:6300 train_loss:3.5883 lr_mul:1.0000 tok/s:970578 + step:6400 train_loss:3.4604 lr_mul:1.0000 tok/s:971887 +step:6500/9999 val_bpb:1.5223 tokens:426M elapsed:438s + step:6500 train_loss:3.4117 lr_mul:1.0000 tok/s:971763 + Checkpoint saved: /workspace/ckpt/5L_w0/step_6500.pt + step:6600 train_loss:3.4653 lr_mul:1.0000 tok/s:972885 + step:6700 train_loss:3.3841 lr_mul:1.0000 tok/s:973461 + step:6800 train_loss:3.5882 lr_mul:1.0000 tok/s:974738 + step:6900 train_loss:3.4954 lr_mul:0.9563 tok/s:975487 +step:7000/9999 val_bpb:1.5090 *BEST* tokens:459M elapsed:470s + step:7000 train_loss:3.4969 lr_mul:0.8690 tok/s:974911 + Checkpoint saved: /workspace/ckpt/5L_w0/step_7000.pt + step:7100 train_loss:3.4181 lr_mul:0.7891 tok/s:975569 + step:7200 train_loss:3.3943 lr_mul:0.7090 tok/s:976202 + step:7300 train_loss:3.4688 lr_mul:0.6305 tok/s:977164 + step:7400 train_loss:3.4096 lr_mul:0.5526 tok/s:978024 +step:7500/9999 val_bpb:1.4756 *BEST* tokens:492M elapsed:503s + step:7500 train_loss:3.4055 
lr_mul:0.4665 tok/s:977641 + Checkpoint saved: /workspace/ckpt/5L_w0/step_7500.pt + step:7600 train_loss:3.1793 lr_mul:0.3888 tok/s:978606 + step:7700 train_loss:3.3912 lr_mul:0.3092 tok/s:979299 + step:7800 train_loss:4.0023 lr_mul:0.2292 tok/s:979850 + step:7900 train_loss:3.4398 lr_mul:0.1505 tok/s:980571 + EMA started at step 7999 +step:8000/9999 val_bpb:1.4425 *BEST* [EMA] tokens:524M elapsed:535s + step:8000 train_loss:3.2151 lr_mul:0.0628 tok/s:979862 + Checkpoint saved: /workspace/ckpt/5L_w0/step_8000.pt + Budget 540s reached at step 8079 (540s elapsed) — triggering final eval+save +step:8079/9999 val_bpb:1.4420 *BEST* [EMA] tokens:529M elapsed:541s + +Saved NPZ to /workspace/out/5L_w0.npz +Saved int6+lzma to /workspace/out/5L_w0_int6.lzma (3.0MB) +====================================================================== +FINAL val_bpb: 1.4420 +Total tokens: 0.655B in 542s +Model: 5L d=256 MLP=3x | 4,298,537 params +Throughput: 1209292 tok/s +====================================================================== diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w1_cf.log b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w1_cf.log new file mode 100644 index 0000000000..e9091dd950 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w1_cf.log @@ -0,0 +1,53 @@ +====================================================================== +CF ablation eval + ckpt: /workspace/ckpt/5L_w1/step_5000.pt + model: L=5 d=256 vocab=4096 + eval: N=500 seq=1024 stride=2 rounds=2 +====================================================================== +Loading model class from /tmp/train_cdm_patched_5L_w0.0.py + module constants verified: 5L d=256 vocab=4096 + loading state dict from /workspace/ckpt/5L_w1/step_5000.pt + loaded 56 keys, missing=0, unexpected=0 + params: 
4,298,537 + val tokens: 45,122,403 + +[1] Pure AR baseline (is_causal=True) + AR 100/500 | BPB:1.5922 | 1s + AR 200/500 | BPB:1.5934 | 2s + AR 300/500 | BPB:1.5883 | 2s + AR 400/500 | BPB:1.5841 | 3s + AR 500/500 | BPB:1.5841 | 3s + Pure AR BPB: 1.5841 + +[2] CF eval (stride=2, rounds=2) + CF 100/500 | AR:0.7874 CDM:0.6148 Total:1.4022 | 4s + CF 200/500 | AR:0.7906 CDM:0.6175 Total:1.4081 | 9s + CF 300/500 | AR:0.7875 CDM:0.6148 Total:1.4023 | 14s + CF 400/500 | AR:0.7865 CDM:0.6137 Total:1.4002 | 18s + CF 500/500 | AR:0.7831 CDM:0.6108 Total:1.3939 | 23s + +====================================================================== + Pure AR: 1.5841 + CF AR part: 0.7831 + CF CDM part: 0.6108 + CF Total: 1.3939 + CF vs Pure AR: -12.01% (−0.1902 BPB) +====================================================================== + +JSON: +{ + "ckpt": "/workspace/ckpt/5L_w1/step_5000.pt", + "num_layers": 5, + "model_dim": 256, + "vocab_size": 4096, + "n_params": 4298537, + "n_seqs": 500, + "seq_len": 1024, + "stride": 2, + "rounds": 2, + "pure_ar_bpb": 1.5841135640338342, + "cf_ar_part": 0.7830777644102263, + "cf_cdm_part": 0.6108156998148498, + "cf_total": 1.393893464225076, + "cf_vs_ar_pct": -12.007983778913934 +} diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w1_train.log b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w1_train.log new file mode 100644 index 0000000000..a8abccc4e8 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/5L_w1_train.log @@ -0,0 +1,111 @@ +[ablation runner] patched script written to /tmp/train_cdm_patched_5L_w1.0.py +[ablation runner] config: 5L d=256 vocab=4096 xsa=2 bigram=128 cdm_weight=1.0 +WARNING: effective_batch=65,536 != official 524,288 + grad_accum=1 x microbatch=65536 x 1 GPUs 
+====================================================================== +Shared AR+CDM PyTorch | 5L d=256 MLP=3x | steps=9999 +NUM_HEADS=8 head_dim=32 NUM_KV_HEADS=4 BIGRAM_DIM=128 +Encoder=2 Decoder=3 (U-net skip connections) +Retro alpha=0.3 | XSA last 2 layers | LeakyReLU^2 | BigramHash(2048) | EMA(0.997) +Device: cuda | World size: 1 | Grad accum: 1 +Effective batch: 65,536 tok/step +====================================================================== +Val tokens: 45,121,536 (eval on 999,424) +Model params: 4,298,537 (4.3M) +Estimated size BF16: 8.6MB | int6: 3.2MB +Compiling model with torch.compile()... + Compilation requested (will happen on first forward pass) + Muon params: 32 tensors + Scalar params: 23 tensors + Embed param: tok_emb.weight +EMA starts step 7999 +Time budget: 540s (warmdown in last 80s) +Starting training... +step:0/9999 val_bpb:3.6114 *BEST* tokens:0M elapsed:1s +W0409 03:18:10.713000 123410662843008 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] q2 is not in var_ranges, defaulting to unknown range. +W0409 03:18:10.772000 123410662843008 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] z2 is not in var_ranges, defaulting to unknown range. +W0409 03:18:11.690000 123410662843008 torch/fx/experimental/symbolic_shapes.py:4449] [0/0] x0 is not in var_ranges, defaulting to unknown range. 
+ step:100 train_loss:10.4744 lr_mul:1.0000 tok/s:259719 + step:200 train_loss:10.0276 lr_mul:1.0000 tok/s:376243 + step:300 train_loss:9.1702 lr_mul:1.0000 tok/s:442454 + step:400 train_loss:8.0525 lr_mul:1.0000 tok/s:485201 +step:500/9999 val_bpb:1.8232 *BEST* tokens:33M elapsed:64s + step:500 train_loss:8.4776 lr_mul:1.0000 tok/s:507888 + Checkpoint saved: /workspace/ckpt/5L_w1/step_500.pt + step:600 train_loss:8.2609 lr_mul:1.0000 tok/s:530359 + step:700 train_loss:8.8714 lr_mul:1.0000 tok/s:547539 + step:800 train_loss:7.3681 lr_mul:1.0000 tok/s:561123 + step:900 train_loss:7.9888 lr_mul:1.0000 tok/s:572086 +step:1000/9999 val_bpb:1.7297 *BEST* tokens:66M elapsed:113s + step:1000 train_loss:8.1901 lr_mul:1.0000 tok/s:576803 + Checkpoint saved: /workspace/ckpt/5L_w1/step_1000.pt + step:1100 train_loss:8.2153 lr_mul:1.0000 tok/s:584829 + step:1200 train_loss:8.1841 lr_mul:1.0000 tok/s:591262 + step:1300 train_loss:8.8357 lr_mul:1.0000 tok/s:597471 + step:1400 train_loss:8.1065 lr_mul:1.0000 tok/s:602916 +step:1500/9999 val_bpb:1.6972 *BEST* tokens:98M elapsed:162s + step:1500 train_loss:7.5149 lr_mul:1.0000 tok/s:604507 + Checkpoint saved: /workspace/ckpt/5L_w1/step_1500.pt + step:1600 train_loss:7.1155 lr_mul:1.0000 tok/s:608929 + step:1700 train_loss:7.7209 lr_mul:1.0000 tok/s:612914 + step:1800 train_loss:8.2857 lr_mul:1.0000 tok/s:616554 + step:1900 train_loss:7.6713 lr_mul:1.0000 tok/s:619745 +step:2000/9999 val_bpb:1.6705 *BEST* tokens:131M elapsed:211s + step:2000 train_loss:8.4872 lr_mul:1.0000 tok/s:620255 + Checkpoint saved: /workspace/ckpt/5L_w1/step_2000.pt + step:2100 train_loss:7.6873 lr_mul:1.0000 tok/s:623033 + step:2200 train_loss:7.9321 lr_mul:1.0000 tok/s:625603 + step:2300 train_loss:7.8769 lr_mul:1.0000 tok/s:627626 + step:2400 train_loss:7.4892 lr_mul:1.0000 tok/s:629826 +step:2500/9999 val_bpb:1.6588 *BEST* tokens:164M elapsed:260s + step:2500 train_loss:6.9694 lr_mul:1.0000 tok/s:629808 + Checkpoint saved: 
/workspace/ckpt/5L_w1/step_2500.pt + step:2600 train_loss:8.4293 lr_mul:1.0000 tok/s:631591 + step:2700 train_loss:7.0345 lr_mul:1.0000 tok/s:633053 + step:2800 train_loss:6.7055 lr_mul:1.0000 tok/s:634691 + step:2900 train_loss:7.9959 lr_mul:1.0000 tok/s:636293 +step:3000/9999 val_bpb:1.6460 *BEST* tokens:197M elapsed:309s + step:3000 train_loss:7.2242 lr_mul:1.0000 tok/s:636028 + Checkpoint saved: /workspace/ckpt/5L_w1/step_3000.pt + step:3100 train_loss:7.3618 lr_mul:1.0000 tok/s:637480 + step:3200 train_loss:7.7660 lr_mul:1.0000 tok/s:638911 + step:3300 train_loss:8.2330 lr_mul:1.0000 tok/s:640228 + step:3400 train_loss:7.0444 lr_mul:1.0000 tok/s:641207 +step:3500/9999 val_bpb:1.6368 *BEST* tokens:229M elapsed:358s + step:3500 train_loss:7.9337 lr_mul:1.0000 tok/s:640861 + Checkpoint saved: /workspace/ckpt/5L_w1/step_3500.pt + step:3600 train_loss:8.0702 lr_mul:1.0000 tok/s:642013 + step:3700 train_loss:6.6918 lr_mul:1.0000 tok/s:643132 + step:3800 train_loss:7.4814 lr_mul:1.0000 tok/s:644141 + step:3900 train_loss:6.8128 lr_mul:1.0000 tok/s:645120 +step:4000/9999 val_bpb:1.6309 *BEST* tokens:262M elapsed:406s + step:4000 train_loss:8.5158 lr_mul:1.0000 tok/s:644693 + Checkpoint saved: /workspace/ckpt/5L_w1/step_4000.pt + step:4100 train_loss:6.8397 lr_mul:1.0000 tok/s:645583 + step:4200 train_loss:8.3491 lr_mul:1.0000 tok/s:646488 + step:4300 train_loss:7.5772 lr_mul:1.0000 tok/s:647349 + step:4400 train_loss:7.9396 lr_mul:1.0000 tok/s:648129 +step:4500/9999 val_bpb:1.6218 *BEST* tokens:295M elapsed:455s + step:4500 train_loss:8.7106 lr_mul:1.0000 tok/s:647640 + Checkpoint saved: /workspace/ckpt/5L_w1/step_4500.pt + step:4600 train_loss:7.4152 lr_mul:0.9396 tok/s:648416 + step:4700 train_loss:7.9952 lr_mul:0.8200 tok/s:649152 + step:4800 train_loss:7.9543 lr_mul:0.7005 tok/s:649869 + step:4900 train_loss:6.8295 lr_mul:0.5810 tok/s:650556 +step:5000/9999 val_bpb:1.5800 *BEST* tokens:328M elapsed:504s + step:5000 train_loss:6.9087 lr_mul:0.4516 tok/s:650135 + 
Checkpoint saved: /workspace/ckpt/5L_w1/step_5000.pt + step:5100 train_loss:7.9890 lr_mul:0.3312 tok/s:650763 + step:5200 train_loss:7.2282 lr_mul:0.2115 tok/s:651380 + step:5300 train_loss:7.6120 lr_mul:0.0919 tok/s:651983 + Budget 540s reached at step 5377 (540s elapsed) — triggering final eval+save +step:5377/9999 val_bpb:1.5476 *BEST* tokens:352M elapsed:541s + +Saved NPZ to /workspace/out/5L_w1.npz +Saved int6+lzma to /workspace/out/5L_w1_int6.lzma (3.0MB) +====================================================================== +FINAL val_bpb: 1.5476 +Total tokens: 0.655B in 542s +Model: 5L d=256 MLP=3x | 4,298,537 params +Throughput: 1209261 tok/s +====================================================================== diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/gen_test.log b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/gen_test.log new file mode 100644 index 0000000000..71d43ca92f --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/ablation_logs/gen_test.log @@ -0,0 +1,222 @@ +Val tokens: 45,122,403 + +============================================================================ + 5L_w0 +============================================================================ + model: 5L d=256 vocab=4096 + + [seq 1/4] + PREFIX: your little Johnny G. like a porter who'd just seen a ghost. Crystal had fire in her eyes--this skirt was served over + ORIGINAL: [-easy, just the way like I like 'em. "S] + BIDIR: [ufactata� call Su pay UKtainks reviewsirif Octecutoolay] EM 0/16 + CAUSAL: [.ure.�ingzment.....i ives] EM 0/16 + SUFFIX: ay, Crystal, you're a choice piece of calico. 
What say we split some chopped tuber chips and get to know each other, + + [seq 2/4] + PREFIX: X Tyres services over 60,000 of the Australian motoring public each month through a network of 80 stores along the Eastern Seaboard. JAX Tyres + ORIGINAL: [Sutherland is conveniently located at 686-6] + BIDIR: [ton 1 purp digitalthing flowitespre9 Inter players Billustom 1 during9] EM 0/16 + CAUSAL: [out,01ose c toers,ne2view canysM the] EM 0/16 + SUFFIX: 90 Old Princes Highway and is open from 8am to 5pm Monday to Friday and 8am to 1pm on Saturdays. Phone 9521 73 + + [seq 3/4] + PREFIX: ’s the carrot dangling in front of me, that’s the pressure I am putting on this year. So, Dad, Happy Birthday! + ORIGINAL: [Enjoy your day! And here’s to hoping things will be a] + BIDIR: [rem Rem of yadaients interestingphempt larger Facebook system starting whole] EM 0/16 + CAUSAL: [meindember� theester, to toonesygest,,, with] EM 0/16 + SUFFIX: whole lot different when your birthday rolls around again. Dad also loves peanut butter, so with this recipe comes the conclusion of my week of pe + + [seq 4/4] + PREFIX: financial info (bank account, social security, paypal account, etc). You will find beautiful wine barrel furniture to accent your wine decor. + ORIGINAL: [Even used wine barrels are given a second chance. Wine bar] + BIDIR: [pro Dem perfect rece purp,foreode Qu er Day Trump eng Am Ire] EM 0/16 + CAUSAL: [,vonlyipose a you.icka 1,agementazon to] EM 0/16 + SUFFIX: rel furniture is built, for home or garden, with the old craftsman style in mind that enhances the wonderful wood grain of the oak barre + + TOTAL EM: bidir=0/64 = 0.0%, causal=0/64 = 0.0% + +============================================================================ + 5L_w03 +============================================================================ + model: 5L d=256 vocab=4096 + + [seq 1/4] + PREFIX: your little Johnny G. like a porter who'd just seen a ghost. 
Crystal had fire in her eyes--this skirt was served over + ORIGINAL: [-easy, just the way like I like 'em. "S] + BIDIR: [.,o calls pay C for L, Cir Oct Ex Fool] EM 0/16 + CAUSAL: [asuring.�ingzment for.. forai.as] EM 1/16 + SUFFIX: ay, Crystal, you're a choice piece of calico. What say we split some chopped tuber chips and get to know each other, + + [seq 2/4] + PREFIX: X Tyres services over 60,000 of the Australian motoring public each month through a network of 80 stores along the Eastern Seaboard. JAX Tyres + ORIGINAL: [Sutherland is conveniently located at 686-6] + BIDIR: [out a andoseose flowers of 2 Inter players Bill Cs is 2] EM 0/16 + CAUSAL: [the,000ose c ofers,ne0view andysT the] EM 0/16 + SUFFIX: 90 Old Princes Highway and is open from 8am to 5pm Monday to Friday and 8am to 1pm on Saturdays. Phone 9521 73 + + [seq 3/4] + PREFIX: ’s the carrot dangling in front of me, that’s the pressure I am putting on this year. So, Dad, Happy Birthday! + ORIGINAL: [Enjoy your day! And here’s to hoping things will be a] + BIDIR: [Iainsoveing your peell,, toygerger system is a] EM 2/16 + CAUSAL: [meindemberon theester,, toyygero,, to] EM 0/16 + SUFFIX: whole lot different when your birthday rolls around again. Dad also loves peanut butter, so with this recipe comes the conclusion of my week of pe + + [seq 4/4] + PREFIX: financial info (bank account, social security, paypal account, etc). You will find beautiful wine barrel furniture to accent your wine decor. + ORIGINAL: [Even used wine barrels are given a second chance. 
Wine bar] + BIDIR: [im Dem way rece theiveros qu Qu er Dayan eng Amass I] EM 0/16 + CAUSAL: [,ofocr foripose ast andesta,'agesazing a] EM 0/16 + SUFFIX: rel furniture is built, for home or garden, with the old craftsman style in mind that enhances the wonderful wood grain of the oak barre + + TOTAL EM: bidir=2/64 = 3.1%, causal=1/64 = 1.6% + +============================================================================ + 5L_w1 +============================================================================ + model: 5L d=256 vocab=4096 + + [seq 1/4] + PREFIX: your little Johnny G. like a porter who'd just seen a ghost. Crystal had fire in her eyes--this skirt was served over + ORIGINAL: [-easy, just the way like I like 'em. "S] + BIDIR: [ufactataata call call pay forta wwwiririf Octecut_ay] EM 0/16 + CAUSAL: [.ure.�szment... .i.ives] EM 1/16 + SUFFIX: ay, Crystal, you're a choice piece of calico. What say we split some chopped tuber chips and get to know each other, + + [seq 2/4] + PREFIX: X Tyres services over 60,000 of the Australian motoring public each month through a network of 80 stores along the Eastern Seaboard. JAX Tyres + ORIGINAL: [Sutherland is conveniently located at 686-6] + BIDIR: [mitmit of digitalthingthingites Inter 18 Inter playersiorustom\,9] EM 0/16 + CAUSAL: [the,,000ose,,ers,vers andview,ysU the] EM 0/16 + SUFFIX: 90 Old Princes Highway and is open from 8am to 5pm Monday to Friday and 8am to 1pm on Saturdays. Phone 9521 73 + + [seq 3/4] + PREFIX: ’s the carrot dangling in front of me, that’s the pressure I am putting on this year. So, Dad, Happy Birthday! + ORIGINAL: [Enjoy your day! And here’s to hoping things will be a] + BIDIR: [toindember all yadaada isph exsgergest is is a] EM 1/16 + CAUSAL: [youindember to theester. to tootiveger... to] EM 0/16 + SUFFIX: whole lot different when your birthday rolls around again. 
Dad also loves peanut butter, so with this recipe comes the conclusion of my week of pe + + [seq 4/4] + PREFIX: financial info (bank account, social security, paypal account, etc). You will find beautiful wine barrel furniture to accent your wine decor. + ORIGINAL: [Even used wine barrels are given a second chance. Wine bar] + BIDIR: [pro Dem islyiveiving aodebe er Day Trump Day Am thingre] EM 0/16 + CAUSAL: [,folyipose ast,esta 1,lishazon.] EM 0/16 + SUFFIX: rel furniture is built, for home or garden, with the old craftsman style in mind that enhances the wonderful wood grain of the oak barre + + TOTAL EM: bidir=1/64 = 1.6%, causal=1/64 = 1.6% + +============================================================================ + 11L_w0 +============================================================================ + model: 11L d=512 vocab=4096 + + [seq 1/4] + PREFIX: your little Johnny G. like a porter who'd just seen a ghost. Crystal had fire in her eyes--this skirt was served over + ORIGINAL: [-easy, just the way like I like 'em. "S] + BIDIR: [ufactata� call Su pay UKtainks reviewsirif Octecutoolay] EM 0/16 + CAUSAL: [.ured.�szment ...si 2ives] EM 0/16 + SUFFIX: ay, Crystal, you're a choice piece of calico. What say we split some chopped tuber chips and get to know each other, + + [seq 2/4] + PREFIX: X Tyres services over 60,000 of the Australian motoring public each month through a network of 80 stores along the Eastern Seaboard. JAX Tyres + ORIGINAL: [Sutherland is conveniently located at 686-6] + BIDIR: [mit 1 purp digitalthing flowitespre9 Inter players Billustom\ during9] EM 0/16 + CAUSAL: [out,000ose c isers thefer8act canyss the] EM 0/16 + SUFFIX: 90 Old Princes Highway and is open from 8am to 5pm Monday to Friday and 8am to 1pm on Saturdays. Phone 9521 73 + + [seq 3/4] + PREFIX: ’s the carrot dangling in front of me, that’s the pressure I am putting on this year. So, Dad, Happy Birthday! + ORIGINAL: [Enjoy your day! 
And here’s to hoping things will be a] + BIDIR: [removember of yadaients interestingphempt larger Facebook system starting whole] EM 0/16 + CAUSAL: [youoteemberу theester, to.rygerо.. at] EM 0/16 + SUFFIX: whole lot different when your birthday rolls around again. Dad also loves peanut butter, so with this recipe comes the conclusion of my week of pe + + [seq 4/4] + PREFIX: financial info (bank account, social security, paypal account, etc). You will find beautiful wine barrel furniture to accent your wine decor. + ORIGINAL: [Even used wine barrels are given a second chance. Wine bar] + BIDIR: [pro Dem perfect rece purp isforeode Qu er Day Trump eng Am thingre] EM 0/16 + CAUSAL: [Ifolyipose ast.estutonetagementy I] EM 0/16 + SUFFIX: rel furniture is built, for home or garden, with the old craftsman style in mind that enhances the wonderful wood grain of the oak barre + + TOTAL EM: bidir=0/64 = 0.0%, causal=0/64 = 0.0% + +============================================================================ + 11L_w03 +============================================================================ + model: 11L d=512 vocab=4096 + + [seq 1/4] + PREFIX: your little Johnny G. like a porter who'd just seen a ghost. Crystal had fire in her eyes--this skirt was served over + ORIGINAL: [-easy, just the way like I like 'em. "S] + BIDIR: [a... Su,, for l, Airif, Wk] EM 0/16 + CAUSAL: [.uring.�szing....si.ives] EM 1/16 + SUFFIX: ay, Crystal, you're a choice piece of calico. What say we split some chopped tuber chips and get to know each other, + + [seq 2/4] + PREFIX: X Tyres services over 60,000 of the Australian motoring public each month through a network of 80 stores along the Eastern Seaboard. 
JAX Tyres + ORIGINAL: [Sutherland is conveniently located at 686-6] + BIDIR: [Com 1- digital digitalthings to 900act Custom\ 9] EM 0/16 + CAUSAL: [thete-ose mot,ing andvious0ior,yss the] EM 0/16 + SUFFIX: 90 Old Princes Highway and is open from 8am to 5pm Monday to Friday and 8am to 1pm on Saturdays. Phone 9521 73 + + [seq 3/4] + PREFIX: ’s the carrot dangling in front of me, that’s the pressure I am putting on this year. So, Dad, Happy Birthday! + ORIGINAL: [Enjoy your day! And here’s to hoping things will be a] + BIDIR: [rem remind of the yog and (.ygergery is a] EM 1/16 + CAUSAL: [meind to to thees’,,otygeren., with] EM 0/16 + SUFFIX: whole lot different when your birthday rolls around again. Dad also loves peanut butter, so with this recipe comes the conclusion of my week of pe + + [seq 4/4] + PREFIX: financial info (bank account, social security, paypal account, etc). You will find beautiful wine barrel furniture to accent your wine decor. + ORIGINAL: [Even used wine barrels are given a second chance. Wine bar] + BIDIR: [ProThe the the forivingose c. Questa..lish bar] EM 2/16 + CAUSAL: [.focrlyivingose astcestua,agementazon to] EM 0/16 + SUFFIX: rel furniture is built, for home or garden, with the old craftsman style in mind that enhances the wonderful wood grain of the oak barre + + TOTAL EM: bidir=3/64 = 4.7%, causal=1/64 = 1.6% + +============================================================================ + 11L_w1 +============================================================================ + model: 11L d=512 vocab=4096 + + [seq 1/4] + PREFIX: your little Johnny G. like a porter who'd just seen a ghost. Crystal had fire in her eyes--this skirt was served over + ORIGINAL: [-easy, just the way like I like 'em. "S] + BIDIR: [..if,.s.. l, iri,.ay] EM 0/16 + CAUSAL: [.ure.�sppment....lty.ives] EM 1/16 + SUFFIX: ay, Crystal, you're a choice piece of calico. 
What say we split some chopped tuber chips and get to know each other, + + [seq 2/4] + PREFIX: X Tyres services over 60,000 of the Australian motoring public each month through a network of 80 stores along the Eastern Seaboard. JAX Tyres + ORIGINAL: [Sutherland is conveniently located at 686-6] + BIDIR: [Without 1000 thingll from 9 the0ior Bill\\$] EM 0/16 + CAUSAL: [a is000oseity,ers,p0ior andyerT the] EM 0/16 + SUFFIX: 90 Old Princes Highway and is open from 8am to 5pm Monday to Friday and 8am to 1pm on Saturdays. Phone 9521 73 + + [seq 3/4] + PREFIX: ’s the carrot dangling in front of me, that’s the pressure I am putting on this year. So, Dad, Happy Birthday! + ORIGINAL: [Enjoy your day! And here’s to hoping things will be a] + BIDIR: [I theind all of yred- emphantsger Facebook is a] EM 1/16 + CAUSAL: [meindaint thees, to,iayger... at] EM 0/16 + SUFFIX: whole lot different when your birthday rolls around again. Dad also loves peanut butter, so with this recipe comes the conclusion of my week of pe + + [seq 4/4] + PREFIX: financial info (bank account, social security, paypal account, etc). You will find beautiful wine barrel furniture to accent your wine decor. + ORIGINAL: [Even used wine barrels are given a second chance. Wine bar] + BIDIR: [Al.verolyivingos C. Quer. 
The Theine bar] EM 2/16 + CAUSAL: [,vocrlyivesose ase.esta.,agementazon about] EM 0/16 + SUFFIX: rel furniture is built, for home or garden, with the old craftsman style in mind that enhances the wonderful wood grain of the oak barre + + TOTAL EM: bidir=3/64 = 4.7%, causal=1/64 = 1.6% + +============================================================================ + SUMMARY +============================================================================ +tag bidir EM causal EM + 5L_w0 0/64 (0.0%) 0/64 (0.0%) + 5L_w03 2/64 (3.1%) 1/64 (1.6%) + 5L_w1 1/64 (1.6%) 1/64 (1.6%) + 11L_w0 0/64 (0.0%) 0/64 (0.0%) + 11L_w03 3/64 (4.7%) 1/64 (1.6%) + 11L_w1 3/64 (4.7%) 1/64 (1.6%) diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/bpe_v4096.model b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/bpe_v4096.model new file mode 100644 index 0000000000..ce8c992a61 Binary files /dev/null and b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/bpe_v4096.model differ diff --git a/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/eval_6.sh b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/eval_6.sh new file mode 100644 index 0000000000..36c4ed4b65 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-02_Meadow_TextDiffusion_Retrodiction_TTT_DepthRecurrence/eval_6.sh @@ -0,0 +1,75 @@ +#!/bin/bash +# CF evaluation orchestration for the 6-run matched ablation (§3.2 of README.md). +# Run this after run_6.sh. It uses the patched training scripts that +# train_ablation_runner.py writes to PATCHED_DIR as a side-effect. 
# Run CF evaluation for one ablation run.
# args: tag layers model_dim bigram_dim xsa_last_n
eval_one () {
    local tag=$1 layers=$2 dim=$3 bdim=$4 xsa=$5
    echo "================================================================"
    echo "== EVAL CF: $tag (L=$layers d=$dim)"
    echo "================================================================"

    local ckpt_subdir="$CKPT_DIR/${tag}"
    # Newest checkpoint by step number (sort -V orders step_900 < step_1000).
    # Declare and assign separately so the substitution's exit status is not
    # masked by `local` (ShellCheck SC2155); the pipeline itself tolerates a
    # missing directory via 2>/dev/null.
    local latest
    latest=$(ls -1 "$ckpt_subdir"/step_*.pt 2>/dev/null | sort -V | tail -1)
    if [ -z "$latest" ]; then
        echo "  NO CHECKPOINT FOUND in $ckpt_subdir"
        return
    fi
    echo "  using checkpoint: $latest"

    # The runner writes $PATCHED_DIR/train_cdm_patched_<layers>L_w<weight>.py
    # as a side effect. Pick the one matching this layer count (any weight of
    # the matching scale works for the model class — constants are identical
    # within a scale).
    local patched
    patched=$(ls -1 "$PATCHED_DIR"/train_cdm_patched_${layers}L_w*.py 2>/dev/null | head -1)
    if [ -z "$patched" ]; then
        echo "  NO PATCHED SCRIPT for ${layers}L — run run_6.sh first (it writes them). Skipping."
        return
    fi
    echo "  using patched script: $patched"

    # All stdout/stderr of the eval goes to the .out file; the structured log
    # goes to the .log file via --log_path. Numeric args quoted defensively.
    python3 "$SCRIPT_DIR/eval_cf_ablation.py" \
        --ckpt "$latest" \
        --train_module_path "$patched" \
        --num_layers "$layers" --model_dim "$dim" --vocab_size 4096 \
        --bigram_dim "$bdim" --xsa_last_n "$xsa" \
        --n_seqs 500 --seq_len 1024 --stride 2 --rounds 2 --seed 42 \
        --data_dir "$DATA_DIR" --tokenizer_path "$TOKENIZER" \
        --log_path "$EVAL_DIR/${tag}_cf.log" \
        > "$EVAL_DIR/${tag}_eval.out" 2>&1
    echo "  eval done -> $EVAL_DIR/${tag}_cf.log"
    tail -10 "$EVAL_DIR/${tag}_cf.log"
}
def load_module_from_path(path, name):
    """Import a Python module from an explicit file path and return it.

    The patched training scripts parse ``sys.argv`` at import time, so argv is
    temporarily reduced to just the script path while the module executes and
    is restored afterwards (the original version clobbered ``sys.argv`` for the
    rest of the process).

    Raises ImportError if no import spec can be built for ``path``.
    """
    spec = importlib.util.spec_from_file_location(name, path)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot create import spec for {path!r}")
    mod = importlib.util.module_from_spec(spec)
    # Register before exec_module, as recommended by the importlib docs, so the
    # module can be found under `name` even if it imports itself indirectly.
    sys.modules[name] = mod
    saved_argv = sys.argv
    sys.argv = [path]  # neutral argv so module-level argparse doesn't complain
    try:
        spec.loader.exec_module(mod)
    finally:
        sys.argv = saved_argv
    return mod
class + # The patched module was created by train_ablation_runner.py and has the + # correct module-level constants baked in. + log(f"Loading model class from {args.train_module_path}") + # Temporarily block the module's main() from running + with open(args.train_module_path) as f: + src = f.read() + # Neuter the `if __name__ == "__main__"` block to prevent training from running + # during import + src = src.replace('if __name__ == "__main__"', 'if False') + + mod_globals = {"__name__": "train_cdm_ablation_class_source", "__file__": args.train_module_path} + exec(compile(src, args.train_module_path, "exec"), mod_globals) + + GPTv2 = mod_globals["GPTv2"] + + # Sanity check: the patched module's constants should match our args + assert mod_globals["NUM_LAYERS"] == args.num_layers, f"layer mismatch: mod={mod_globals['NUM_LAYERS']} arg={args.num_layers}" + assert mod_globals["MODEL_DIM"] == args.model_dim, f"dim mismatch" + assert mod_globals["VOCAB_SIZE"] == args.vocab_size, f"vocab mismatch" + log(f" module constants verified: {args.num_layers}L d={args.model_dim} vocab={args.vocab_size}") + + # Instantiate and load state dict + model = GPTv2().to(device) + log(f" loading state dict from {args.ckpt}") + if args.ckpt.endswith(".npz"): + # Final-state save written by train_cdm.py at end of training (np.savez of + # raw_model.state_dict(), with bf16 weights expanded to float32). Load each + # array as a torch tensor; load_state_dict(strict=False) will cast back. 
+ log(" detected .npz final-state checkpoint") + npz = np.load(args.ckpt) + sd = {k: torch.from_numpy(np.array(npz[k])) for k in npz.files} + else: + sd = torch.load(args.ckpt, map_location=device, weights_only=False) + # The checkpoint might be wrapped + if isinstance(sd, dict): + for k in ("model", "state_dict", "raw_model"): + if k in sd and isinstance(sd[k], dict): + sd = sd[k] + break + # Strip common DDP / compile prefixes + clean = {} + for k, v in sd.items(): + k2 = k + if k2.startswith("module."): k2 = k2[7:] + if k2.startswith("_orig_mod."): k2 = k2[10:] + clean[k2] = v + # Cast each tensor to match the model parameter dtype, so loading float32 + # weights from .npz into a bf16 model works without silent precision loss. + model_dtypes = {n: p.dtype for n, p in model.named_parameters()} + model_dtypes.update({n: b.dtype for n, b in model.named_buffers()}) + for k in list(clean.keys()): + if k in model_dtypes and clean[k].dtype != model_dtypes[k]: + clean[k] = clean[k].to(model_dtypes[k]) + missing, unexpected = model.load_state_dict(clean, strict=False) + log(f" loaded {len(clean)} keys, missing={len(missing)}, unexpected={len(unexpected)}") + if len(missing) > 10: + log(f" WARNING: many missing keys, first 5: {list(missing)[:5]}") + model.eval() + n_params = sum(p.numel() for p in model.parameters()) + log(f" params: {n_params:,}") + + # Load tokenizer + val data + sp = spm.SentencePieceProcessor(model_file=args.tokenizer_path) + header = np.fromfile(os.path.join(args.data_dir, "fineweb_val_000000.bin"), dtype="=0,h,0.5*h); return s.proj(h*h) + +class Block(nn.Module): + def __init__(s, li=0, use_xsa=False): + super().__init__() + s.attn=DualModeAttention(use_xsa=use_xsa); s.mlp=MLP() + s.ln_scale=1.0/math.sqrt(li+1) + s.attn_scale=mx.ones((MODEL_DIM,), dtype=mx.float32) + s.mlp_scale=mx.ones((MODEL_DIM,), dtype=mx.float32) + s.resid_mix=mx.array(np.stack((np.ones(MODEL_DIM,dtype=np.float32), + np.zeros(MODEL_DIM,dtype=np.float32)))) + def __call__(s, x, 
class GPTv2(nn.Module):
    """Dual-mode transformer with U-Net-style skip connections.

    The first NUM_LAYERS//2 blocks form an "encoder" half whose activations
    are consumed by the remaining "decoder" blocks in LIFO order through
    learned per-channel skip gates. The token embedding is tied (reused as
    the output projection), and logits are soft-capped with tanh.
    `is_causal` is threaded through every block so the same weights run in
    causal (AR) or bidirectional (CDM) mode.
    """
    def __init__(s):
        super().__init__()
        s.tok_emb=nn.Embedding(VOCAB_SIZE, MODEL_DIM)
        # Auxiliary input features: hashed-bigram embedding and SmearGate
        # (both defined earlier in this file).
        s.bigram=BigramHashEmbedding(); s.smear=SmearGate(MODEL_DIM)
        # Encoder half (ne) / decoder half (nd) split of the layer stack.
        ne=NUM_LAYERS//2; nd=NUM_LAYERS-ne; s.ne=ne; s.nd=nd
        # One learned per-channel gate per encoder->decoder skip connection.
        s.skip_weights=mx.ones((min(ne,nd),MODEL_DIM), dtype=mx.float32)
        # XSA is enabled only on the last XSA_LAST_N blocks.
        s.blocks=[Block(li=i, use_xsa=i>=(NUM_LAYERS-XSA_LAST_N)) for i in range(NUM_LAYERS)]
        # Re-initialize the tied embedding with small normal noise (std 0.005).
        # NOTE: this draws from the global MLX RNG, so statement order relative
        # to mx.random.seed matters for reproducibility — do not reorder.
        s.tok_emb.weight=(mx.random.normal(s.tok_emb.weight.shape)*0.005).astype(COMPUTE_DTYPE)

    def forward_hidden(s, ids, is_causal=True):
        """Return the final RMS-normalized hidden states for token ids `ids`."""
        # Token embedding enriched by the bigram-hash embedding.
        x=s.tok_emb(ids).astype(COMPUTE_DTYPE)+s.bigram(ids).astype(COMPUTE_DTYPE)
        # x0 (the post-smear input stream) is fed to every block as a residual anchor.
        x=rms_norm(x); x=s.smear(x); x0=x; skips=[]
        for i in range(s.ne):
            x=s.blocks[i](x, x0, is_causal=is_causal); skips.append(x)
        for i in range(s.nd):
            # Consume encoder activations in reverse (LIFO) order; the `if skips`
            # guard also keeps the skip_weights index within its min(ne,nd) rows.
            if skips: x=x+s.skip_weights[i].astype(x.dtype)[None,None,:]*skips.pop()
            x=s.blocks[s.ne+i](x, x0, is_causal=is_causal)
        return rms_norm(x)

    def get_logits(s, ids, is_causal=True):
        """Logits via the tied embedding matrix, tanh-soft-capped to +/-LOGIT_SOFTCAP."""
        h = s.forward_hidden(ids, is_causal=is_causal)
        return LOGIT_SOFTCAP * mx.tanh(h @ s.tok_emb.weight.astype(h.dtype).T / LOGIT_SOFTCAP)
def count_bytes(tokens, prev_tokens, bb, hs, ib):
    """UTF-8 byte count of a token span — the denominator for bits-per-byte.

    bb maps token id -> byte length of its piece, hs flags pieces that start
    with a word boundary (leading "▁"), and ib flags invalid (control/unknown/
    unused) tokens; all three come from build_byte_luts. A boundary piece
    costs one extra byte (the space it encodes) unless the preceding token is
    invalid, and every token is charged at least one byte.
    """
    total = 0.0
    for i, tok in enumerate(tokens):
        n_bytes = float(bb[tok])
        if hs[tok] and not ib[prev_tokens[i]]:
            n_bytes += 1.0  # the leading space materializes as a real byte
        total += max(n_bytes, 1.0)
    return total
def eval_coarse_to_fine(model, tokens, sp, n_seqs, seq_len, stride, n_random, cdm_rounds):
    """Coarse-to-fine dual-mode eval: causal AR on a skeleton of positions,
    then bidirectional CDM rounds that fill in the remaining gaps.

    model       : shared dual-mode model exposing get_logits(ids, is_causal)
    tokens      : token-id array (np.ndarray assumed — .copy()/.reshape used)
    sp          : SentencePiece processor, used only to build byte-count LUTs
    n_seqs      : number of random eval windows
    seq_len     : tokens scored per window
    stride      : every stride-th position is scored causally ("L brain")
    n_random    : noise resamples averaged per CDM round
    cdm_rounds  : number of gap-filling rounds ("R brain")

    Returns (ar_bpb, cdm_bpb, total_bpb): bits-per-byte split by mode, all
    over the same byte denominator.
    """
    bb, hs, ib = build_byte_luts(sp)
    log(f"  CF eval (stride={stride}, rounds={cdm_rounds}, {n_seqs} seqs × {seq_len} tok, R={n_random})")

    total_ar_nll = 0.0
    total_cdm_nll = 0.0
    total_bytes = 0.0
    rng = np.random.RandomState(SEED + 1)
    t0 = time.time()

    # The position layout depends only on (seq_len, stride, cdm_rounds), so
    # compute it once instead of per sequence; set membership replaces the
    # O(seq_len^2) `i not in <list>` scan of the original.
    ar_positions = list(range(0, seq_len, stride))
    ar_position_set = set(ar_positions)
    cdm_positions = [i for i in range(seq_len) if i not in ar_position_set]
    round_groups = split_rounds(cdm_positions, cdm_rounds)

    for s in range(n_seqs):
        idx = rng.randint(0, len(tokens) - seq_len - 1)
        seq = tokens[idx:idx + seq_len + 1]
        x = seq[1:]       # targets
        prev = seq[:-1]   # predecessors (for boundary-byte accounting)
        total_bytes += count_bytes(x, prev, bb, hs, ib)

        # === L brain: causal AR over full input, only score skeleton positions ===
        input_ids = seq[:-1]
        input_mx = mx.array(input_ids.reshape(1, -1))
        ar_logits = model.get_logits(input_mx, is_causal=True)  # ← is_causal=True
        ar_lp = ar_logits - mx.logsumexp(ar_logits, axis=-1, keepdims=True)
        mx.eval(ar_lp)
        ar_lp_np = np.array(ar_lp.astype(mx.float32))[0]
        for pos in ar_positions:
            total_ar_nll -= ar_lp_np[pos, int(x[pos])]

        # === R brain: bidirectional CDM, rounds of gap filling ===
        for ridx, current_round in enumerate(round_groups):
            if not current_round:
                continue
            # Positions in this round and every later round are still unknown,
            # so they get replaced with random noise tokens below.
            unresolved = set()
            for g in round_groups[ridx:]:
                unresolved.update(g)

            # NLL per position of this round, averaged over n_random noisings.
            avg_round_nll = np.zeros(len(current_round))
            for r in range(n_random):
                cdm_input = x.copy()
                for pos in unresolved:
                    cdm_input[pos] = rng.randint(0, VOCAB_SIZE)
                cdm_input_mx = mx.array(cdm_input.reshape(1, -1))
                cdm_logits = model.get_logits(cdm_input_mx, is_causal=False)  # ← is_causal=False, SAME model
                cdm_lp = cdm_logits - mx.logsumexp(cdm_logits, axis=-1, keepdims=True)
                mx.eval(cdm_lp)
                cdm_lp_np = np.array(cdm_lp.astype(mx.float32))[0]
                for i, pos in enumerate(current_round):
                    avg_round_nll[i] -= cdm_lp_np[pos, int(x[pos])] / n_random

            total_cdm_nll += avg_round_nll.sum()

        # Periodic progress log (~20 updates over the run).
        if (s + 1) % max(1, n_seqs // 20) == 0:
            elapsed = time.time() - t0
            ar_bpb = total_ar_nll / total_bytes / math.log(2)
            cdm_bpb = total_cdm_nll / total_bytes / math.log(2)
            total_bpb = (total_ar_nll + total_cdm_nll) / total_bytes / math.log(2)
            rate = (s+1) / elapsed if elapsed > 0 else 0
            eta = (n_seqs - (s+1)) / rate if rate > 0 else 0
            log(f"  CF {s+1}/{n_seqs} | AR:{ar_bpb:.4f} + CDM:{cdm_bpb:.4f} = {total_bpb:.4f} | {elapsed:.0f}s ETA:{eta:.0f}s")

    ar_bpb = total_ar_nll / total_bytes / math.log(2)
    cdm_bpb = total_cdm_nll / total_bytes / math.log(2)
    total_bpb = (total_ar_nll + total_cdm_nll) / total_bytes / math.log(2)
    return ar_bpb, cdm_bpb, total_bpb