diff --git a/records/track_non_record_16mb/2026-04-04_TTT_GPTQ_Incompatibility/README.md b/records/track_non_record_16mb/2026-04-04_TTT_GPTQ_Incompatibility/README.md new file mode 100644 index 0000000000..f74eb6c7ba --- /dev/null +++ b/records/track_non_record_16mb/2026-04-04_TTT_GPTQ_Incompatibility/README.md @@ -0,0 +1,103 @@ +## Summary + +**Test-time training (TTT) provides substantial BPB improvement on simple quantization but is fundamentally ineffective on GPTQ-quantized models.** This work aggregates evidence from 4 independent configurations across 3 research groups showing that GPTQ's compensatory weight structure is destroyed by gradient-based adaptation, making TTT and GPTQ mutually exclusive optimization strategies. + +This finding has immediate implications for the competition: teams using GPTQ (the dominant compression method) cannot benefit from TTT at eval time. + +--- + +## Evidence + +| Configuration | TTT Method | Quantization | BPB Delta | Source | +|--------------|-----------|-------------|-----------|--------| +| PR #461 baseline | SGD, 3 epochs, momentum=0.9 | Simple int6 per-row | **-0.0165** | Christopher-Lee-McClendon | +| PR #601 replication | SGD, full model | Full GPTQ int5 | **+0.030 (WORSE)** | Community finding | +| This work | LoRA rank-8 on Q,V | Full GPTQ int6 | -0.0013 | My experiments (1×H100) | +| PR #1326 | Score-first SGD | Full GPTQ int6 | -0.0001 | aryanbhosale | + +The pattern is stark: SGD TTT improves BPB by -0.0165 on simple int6 quantization (PR #461) but provides **zero benefit** on GPTQ-quantized weights. When applied aggressively to GPTQ models, TTT actively *degrades* performance by +0.030 BPB (PR #601). + +My LoRA TTT experiment used rank-8 adapters on Q and V projections of a GPTQ-quantized Clark-architecture model (11L, 512d, sp4096). Even this conservative approach — updating only ~2% of parameters — yielded negligible improvement (-0.0013 BPB). 
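For reference, the BPB figures throughout this report are computed the same way the eval scripts below compute them: mean token NLL in nats, converted to bits, then renormalized from tokens to UTF-8 bytes. A minimal sketch with illustrative numbers (the token and byte counts here are made up for the example, not taken from any PR):

```python
import math

def bits_per_byte(total_nll_nats: float, n_tokens: int, n_bytes: int) -> float:
    """Summed token NLL (nats) -> bits per UTF-8 byte."""
    mean_nll = total_nll_nats / n_tokens        # nats per token
    return (mean_nll / math.log(2)) * (n_tokens / n_bytes)

# Illustrative: 100K tokens at 2.67 nats/token, ~3.5 bytes/token.
bpb = bits_per_byte(2.67 * 100_000, 100_000, 350_000)
print(f"{bpb:.4f}")   # ~1.10, the pre-TTT ballpark cited in this report
```

On this scale, a delta of -0.0165 is roughly a 1.5% relative improvement, while -0.0013 is about 0.1%.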
+ +PR #1326 (aryanbhosale) independently confirmed this: applying score-first TTT to the strongest current architecture (depth recurrence + parallel residuals + GPTQ int6) produced -0.0001 BPB improvement — statistically indistinguishable from zero. + +--- + +## Root Cause: GPTQ's Compensatory Weight Structure + +GPTQ (Frantar et al., 2023) solves a per-layer Hessian-weighted least-squares problem: + +``` +For each column j of weight matrix W: + Quantize w_j, compute error δ_j + Distribute δ_j to remaining columns: W[:,j+1:] -= δ_j * H_inv[j,j+1:] / H_inv[j,j] +``` + +Each quantized weight **compensates for errors in previously quantized weights**. The resulting weight matrix is not independently quantized — it's a globally optimized system where individual weights encode error-correction information for their neighbors. + +SGD updates individual weights based on local gradients, **ignoring the compensatory structure**. After even one SGD step: +- Weight w_j is updated by -lr * ∂L/∂w_j +- But w_j was carrying compensation for w_{j-1}'s quantization error +- This compensation is now destroyed +- The net effect: the SGD update that was supposed to reduce loss instead breaks error cancellation, often increasing loss + +This is why TTT on GPTQ is not merely unhelpful — it can be actively harmful (+0.030 BPB in PR #601). + +--- + +## Implication: Compression vs Adaptation Tradeoff + +The competition has two parallel optimization strategies that **cannot be combined**: + +**Compression path (GPTQ):** +- GPTQ enables fitting more parameters in 16MB +- Every recent record submission uses GPTQ (PRs #1218, #1285, #1296, #1334) +- Gain: ~0.02-0.05 BPB from fitting larger models + +**Adaptation path (TTT):** +- Score-first TTT adapts the model to the evaluation distribution +- Works well on simple quantization: -0.0165 BPB (PR #461) +- But simple int6 produces artifacts too large for 16MB at competitive model sizes + +Teams must choose one. 
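The root-cause claim above can be made concrete with a toy numpy sketch (illustrative only — not competition code, and a deliberate simplification of full GPTQ). Quantize one weight of a two-weight row, fold its rounding error into the neighbor via the least-squares compensation GPTQ performs, then perturb the compensating weight the way a structure-blind update would (here, a nudge back toward its pre-quantization value as a stand-in for one SGD step). The compensated pair beats naive rounding; the perturbed pair gives the error back:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))
X[:, 1] += 0.8 * X[:, 0]                 # correlated inputs, as in real activations
w = np.array([0.37, -0.52])              # full-precision weight row
y_ref = X @ w

# Naive rounding of w[0] to a coarse grid (step 0.25):
q0 = np.round(w[0] / 0.25) * 0.25
naive = np.array([q0, w[1]])

# GPTQ-style compensation: fold w[0]'s rounding error into w[1] so the
# *output* error is minimized in the least-squares sense (H = X^T X):
H = X.T @ X
comp = np.array([q0, w[1] + (w[0] - q0) * H[0, 1] / H[1, 1]])

# A structure-blind update then moves the compensating weight halfway
# back toward its pre-quantization value (stand-in for one SGD step):
stepped = comp.copy()
stepped[1] = 0.5 * (stepped[1] + w[1])

err = lambda v: float(np.mean((X @ v - y_ref) ** 2))
assert err(comp) < err(naive)            # compensation cancels most of the error
assert err(stepped) > err(comp)          # the blind step re-exposes it
print(f"naive={err(naive):.5f} comp={err(comp):.5f} stepped={err(stepped):.5f}")
```

The same mechanism, scaled up to every column of every layer, is why the PR #601 regression is not surprising: each gradient step dismantles error cancellations that GPTQ optimized globally.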
The current leaderboard shows GPTQ winning — but this may change if someone finds a way to bridge the gap. + +--- + +## Proposed Fix Directions + +1. **Quantization-aware TTT:** Maintain full-precision master weights alongside GPTQ weights. Run TTT on masters, re-quantize per chunk. Preserves GPTQ structure while allowing adaptation. Cost: 2× memory + re-quantization overhead. + +2. **Structured TTT:** Constrain SGD updates to respect GPTQ block boundaries. Only update weights in ways that maintain the compensatory structure. Requires understanding GPTQ's column ordering. + +3. **Higher-rank LoRA:** My rank-8 LoRA gave -0.0013. Higher ranks (32, 64) may provide enough adaptation capacity without disturbing GPTQ weights. But higher rank = more parameters = potential artifact overhead. + +4. **Simple int6 + larger model:** Skip GPTQ entirely. Use simple int6 with a model small enough to fit 16MB. TTT then provides -0.0165 BPB. The question: does the GPTQ compression advantage (larger model) outweigh the TTT adaptation advantage (better eval)? + +None of these have been attempted in the competition. + +--- + +## SGD TTT Implementation + +I implemented the full PR #461 TTT protocol: SGD with momentum=0.9, lr=0.002, cosine decay across 32K-token chunks, 3 epochs per chunk, freeze first 2 blocks, grad clip 1.0. Code: `sgd_ttt_eval.py` + +When applied to a GPTQ-quantized Clark 11L model (val_bpb ~1.10 pre-TTT), the result was -0.0013 BPB — consistent with PR #1326's finding of -0.0001 on a similar architecture. + +--- + +## Reproduction + +```bash +# Run SGD TTT on a GPTQ-quantized model: +python3 sgd_ttt_eval.py \ + --model-path final_model.int6.ptz \ + --data-dir ./data/ \ + --ttt-lr 0.002 --ttt-epochs 3 \ + --ttt-chunk-size 32768 --ttt-freeze-blocks 2 +``` + +--- + +## Attribution + +Analysis aggregates findings from PR #461 (Christopher-Lee-McClendon), PR #601 (community), PR #1326 (aryanbhosale), and my own experiments. GPTQ analysis based on Frantar et al. (2023). 
All experiments self-funded. diff --git a/records/track_non_record_16mb/2026-04-04_TTT_GPTQ_Incompatibility/clark_ttt_eval.py b/records/track_non_record_16mb/2026-04-04_TTT_GPTQ_Incompatibility/clark_ttt_eval.py new file mode 100644 index 0000000000..6c6d494e39 --- /dev/null +++ b/records/track_non_record_16mb/2026-04-04_TTT_GPTQ_Incompatibility/clark_ttt_eval.py @@ -0,0 +1,278 @@ +#!/usr/bin/env python3 +""" +Legal Score-First TTT Eval for Clark's Model +============================================== +Loads a trained Clark model, adds LoRA adapters to Q and V, +runs strict score-first TTT, reports BPB. + +PROTOCOL (100% legal, same as PR #549 approved by valerio-oai): + For each chunk: + 1. SCORE: forward pass, compute loss (eval mode, no grad) + 2. Record loss for BPB calculation + 3. TRAIN: gradient update on scored chunk (AFTER scoring) + 4. NEXT: use updated model for next chunk + +USAGE on H100 (after Clark's train_gpt.py has trained a model): + python3 clark_ttt_eval.py + +Requires Clark's train_gpt.py in the same directory (as module). +Loads model checkpoint from final_model.pt or trains briefly for testing. +""" +import sys; sys.stdout.reconfigure(line_buffering=True) +sys.path.insert(0, '/workspace/repo') + +import os, time, math, copy +os.chdir('/workspace/repo') + +import torch +import torch.nn as nn +import torch.nn.functional as F +import numpy as np +from pathlib import Path + +DEVICE = "cuda" if torch.cuda.is_available() else "cpu" + +print(f"Device: {DEVICE}") +print(f"Legal Score-First TTT Eval — {time.strftime('%H:%M:%S')}") + +# ============================================================ +# Load Clark's code as module +# ============================================================ +import train_gpt as tg + +# ============================================================ +# LoRA wrapper +# ============================================================ +class LoRAWrapper(nn.Module): + """Wraps a CastedLinear/Linear with LoRA. 
Only A and B are trainable.""" + def __init__(self, base_linear, rank=8): + super().__init__() + self.base = base_linear + in_f = base_linear.in_features + out_f = base_linear.out_features + self.scale = 1.0 / rank + device = next(base_linear.parameters()).device + self.lora_A = nn.Parameter(torch.randn(in_f, rank, device=device) * 0.01) + self.lora_B = nn.Parameter(torch.zeros(rank, out_f, device=device)) + for p in self.base.parameters(): + p.requires_grad = False + + def forward(self, x): + return self.base(x) + (x @ self.lora_A @ self.lora_B) * self.scale + + @property + def in_features(self): + return self.base.in_features + + @property + def out_features(self): + return self.base.out_features + + @property + def weight(self): + return self.base.weight + + +def add_lora(model, rank=8): + """Add LoRA to Q and V projections in all attention blocks. + Freeze all base params. Returns list of LoRA parameters.""" + for p in model.parameters(): + p.requires_grad = False + + lora_params = [] + for block in model.blocks: + attn = block.attn + # Wrap c_q + lora_q = LoRAWrapper(attn.c_q, rank=rank) + attn.c_q = lora_q + lora_params.extend([lora_q.lora_A, lora_q.lora_B]) + # Wrap c_v + lora_v = LoRAWrapper(attn.c_v, rank=rank) + attn.c_v = lora_v + lora_params.extend([lora_v.lora_A, lora_v.lora_B]) + + n_lora = sum(p.numel() for p in lora_params) + print(f" LoRA: rank={rank}, {n_lora:,} params on Q,V in {len(model.blocks)} layers") + return lora_params + + +# ============================================================ +# Score-First TTT +# ============================================================ +def score_first_ttt(model, val_tokens, lora_params, h, + chunk_size=2048, epochs=3, lr=0.001, + byte_luts=None): + """Strict score-first TTT. 
Score chunk → record loss → train on it → next chunk.""" + optimizer = torch.optim.AdamW(lora_params, lr=lr, betas=(0.9, 0.95)) + + n_tokens = val_tokens.numel() + n_chunks = (n_tokens - 1) // chunk_size + vocab_size = h.vocab_size + + total_nll = 0.0 + total_scored = 0 + total_bytes = 0.0 + t0 = time.time() + + for c in range(n_chunks): + start = c * chunk_size + end = min(start + chunk_size + 1, n_tokens) + chunk = val_tokens[start:end].to(device=DEVICE, dtype=torch.long) + if len(chunk) < 2: + continue + + x = chunk[:-1].unsqueeze(0) + y = chunk[1:].unsqueeze(0) + n_tok = y.numel() + + # === STEP 1: SCORE (eval mode, no gradients) === + model.eval() + with torch.no_grad(): + with torch.autocast("cuda", torch.bfloat16): + logits = model.forward_logits(x) + loss = F.cross_entropy(logits.float().reshape(-1, vocab_size), y.reshape(-1)) + + total_nll += loss.item() * n_tok + total_scored += n_tok + + # Byte counting for BPB + if byte_luts is not None: + base_lut, space_lut, boundary_lut = byte_luts + prev_ids = x.reshape(-1) + tgt_ids = y.reshape(-1) + tb = base_lut[tgt_ids].to(torch.int16) + tb += (space_lut[tgt_ids] & ~boundary_lut[prev_ids]).to(torch.int16) + total_bytes += tb.float().sum().item() + + # === STEP 2: TRAIN on scored chunk (AFTER scoring) === + if c < n_chunks - 1: + model.train() + for ep in range(epochs): + with torch.autocast("cuda", torch.bfloat16): + logits = model.forward_logits(x) + train_loss = F.cross_entropy(logits.float().reshape(-1, vocab_size), y.reshape(-1)) + optimizer.zero_grad() + train_loss.backward() + torch.nn.utils.clip_grad_norm_(lora_params, 1.0) + optimizer.step() + + # Progress + if (c + 1) % 50 == 0 or c == n_chunks - 1: + avg_loss = total_nll / total_scored + if total_bytes > 0: + bpb = (avg_loss / math.log(2)) * (total_scored / total_bytes) + print(f" TTT [{c+1}/{n_chunks}] loss={avg_loss:.4f} bpb={bpb:.4f} ({time.time()-t0:.0f}s)", flush=True) + else: + print(f" TTT [{c+1}/{n_chunks}] loss={avg_loss:.4f} 
({time.time()-t0:.0f}s)", flush=True) + + avg_loss = total_nll / total_scored + if total_bytes > 0: + bpb = (avg_loss / math.log(2)) * (total_scored / total_bytes) + else: + bpb = avg_loss / math.log(2) + + return avg_loss, bpb, time.time() - t0 + + +# ============================================================ +# Main +# ============================================================ +if __name__ == "__main__": + print("\n=== Building model ===") + h = tg.Hyperparameters() + + # Load tokenizer + byte LUTs + import sentencepiece as spm + sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) + byte_luts = tg.build_sentencepiece_luts(sp, h.vocab_size, torch.device(DEVICE)) + + # Load validation tokens — h.val_files is a glob pattern STRING + val_tokens = tg.load_validation_tokens(h.val_files, h.eval_seq_len) + print(f"Val tokens: {val_tokens.numel():,}") + + # Build model + model = tg.GPT(h).to(DEVICE) + n_params = sum(p.numel() for p in model.parameters()) + print(f"Model: {n_params:,} params") + + # Load checkpoint if available, else quick train + ckpt_path = Path("final_model.pt") + if ckpt_path.exists(): + print(f"Loading checkpoint from {ckpt_path}...") + state = torch.load(ckpt_path, map_location=DEVICE, weights_only=True) + model.load_state_dict(state, strict=False) + print("Checkpoint loaded") + else: + print("\n=== No checkpoint — quick training (200 steps) ===") + train_files = sorted(Path(h.datasets_dir).glob("fineweb_train_*.bin")) + if not train_files: + print("ERROR: No training data found") + sys.exit(1) + train_shard = tg.load_data_shard(train_files[0]) + optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4, weight_decay=h.muon_wd) + model.train() + for step in range(200): + start_idx = step * h.train_seq_len * 8 + if start_idx + h.train_seq_len * 8 + 1 > train_shard.numel(): + start_idx = 0 + chunk = train_shard[start_idx:start_idx + h.train_seq_len * 8 + 1].to(DEVICE, torch.long) + x = chunk[:-1].reshape(-1, h.train_seq_len)[:8] + y = 
chunk[1:].reshape(-1, h.train_seq_len)[:8] + with torch.autocast("cuda", torch.bfloat16): + loss = model(x, y) + optimizer.zero_grad(); loss.backward() + torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0) + optimizer.step() + if step % 100 == 0: + print(f" Step {step}: loss={loss.item():.4f}") + + # === Eval WITHOUT TTT === + print("\n=== Eval WITHOUT TTT ===") + model.eval() + n_eval = min(500000, val_tokens.numel() - 1) + chunk_size = h.eval_seq_len + n_chunks = n_eval // chunk_size + + base_lut, space_lut, boundary_lut = byte_luts + total_nll = 0.0; total_tok = 0; total_bytes = 0.0 + + with torch.no_grad(): + for c in range(n_chunks): + s = c * chunk_size + chunk = val_tokens[s:s + chunk_size + 1].to(DEVICE, torch.long) + x = chunk[:-1].unsqueeze(0) + y = chunk[1:].unsqueeze(0) + with torch.autocast("cuda", torch.bfloat16): + logits = model.forward_logits(x) + loss = F.cross_entropy(logits.float().reshape(-1, h.vocab_size), y.reshape(-1)) + total_nll += loss.item() * y.numel() + total_tok += y.numel() + tb = base_lut[y.reshape(-1)].to(torch.int16) + tb += (space_lut[y.reshape(-1)] & ~boundary_lut[x.reshape(-1)]).to(torch.int16) + total_bytes += tb.float().sum().item() + + pre_loss = total_nll / total_tok + pre_bpb = (pre_loss / math.log(2)) * (total_tok / total_bytes) + print(f"Pre-TTT: loss={pre_loss:.4f} bpb={pre_bpb:.4f} ({total_tok:,} tokens)") + + # === Add LoRA + Run TTT === + print("\n=== Score-First TTT (LoRA rank=8) ===") + ttt_model = copy.deepcopy(model) + lora_params = add_lora(ttt_model, rank=8) + + ttt_loss, ttt_bpb, ttt_time = score_first_ttt( + ttt_model, val_tokens[:n_eval + 1], lora_params, h, + chunk_size=chunk_size, epochs=3, lr=0.001, + byte_luts=byte_luts + ) + + # === Results === + improvement = (ttt_bpb - pre_bpb) / pre_bpb * 100 + print(f"\n{'='*60}") + print(f"RESULTS") + print(f"{'='*60}") + print(f"Pre-TTT: loss={pre_loss:.4f} bpb={pre_bpb:.4f}") + print(f"Post-TTT: loss={ttt_loss:.4f} bpb={ttt_bpb:.4f}") + print(f"Change: 
{improvement:+.2f}%") + print(f"TTT time: {ttt_time:.0f}s") + print(f"Tokens: {total_tok:,}") diff --git a/records/track_non_record_16mb/2026-04-04_TTT_GPTQ_Incompatibility/sgd_ttt_eval.py b/records/track_non_record_16mb/2026-04-04_TTT_GPTQ_Incompatibility/sgd_ttt_eval.py new file mode 100644 index 0000000000..4b81d3cdaa --- /dev/null +++ b/records/track_non_record_16mb/2026-04-04_TTT_GPTQ_Incompatibility/sgd_ttt_eval.py @@ -0,0 +1,705 @@ +#!/usr/bin/env python3 +""" +Multi-Epoch SGD Test-Time Training (TTT) Evaluation +==================================================== +Implements the PR #461 approach: SGD with momentum, 3 epochs per 32K chunk, +cosine LR decay across chunks, freeze first 2 blocks. + +Expected improvement: -0.0165 BPB over baseline sliding window eval. + +PROTOCOL (legal score-first TTT): + For each 32K-token chunk: + 1. SCORE: model.eval(), inference_mode, sliding window eval, record NLL + 2. TRAIN: model.train(), SGD(lr=0.002, momentum=0.9), 3 epochs on chunk + 3. Cosine LR decay: lr = base_lr * 0.5 * (1 + cos(pi * chunk_idx / (num_chunks - 1))) + Freeze first 2 blocks during TTT training. + Gradient clipping = 1.0. + +CRITICAL: SGD TTT does NOT work with full GPTQ (+0.03 BPB regression). +It works with simple int6 per-row quantization (quantize -> dequantize -> float). 
+ +USAGE: + # On H100 with trained model: + python3 sgd_ttt_eval.py --model-path final_model.int6.ptz + + # With simple int6 (default, recommended for TTT): + python3 sgd_ttt_eval.py --model-path final_model.int6.ptz --quant simple_int6 + + # Skip TTT (baseline sliding window only): + python3 sgd_ttt_eval.py --model-path final_model.int6.ptz --no-ttt +""" +import sys +sys.stdout.reconfigure(line_buffering=True) + +import argparse +import io +import math +import os +import time +from pathlib import Path + +import numpy as np +import sentencepiece as spm +import torch +import torch.nn as nn +import torch.nn.functional as F +from torch import Tensor + +# --------------------------------------------------------------------------- +# Import model architecture from Clark's train_gpt (record_train_gpt.py) +# --------------------------------------------------------------------------- +# We import lazily to allow the script to be placed anywhere. +# Set REPO_DIR to the directory containing record_train_gpt.py if needed. 
+_SCRIPT_DIR = Path(__file__).resolve().parent +_REPO_DIR = os.environ.get("REPO_DIR", str(_SCRIPT_DIR)) +if _REPO_DIR not in sys.path: + sys.path.insert(0, _REPO_DIR) + +# Try importing from record_train_gpt first, fall back to train_gpt +try: + import record_train_gpt as tg +except ImportError: + import train_gpt as tg + +DEVICE = "cuda" if torch.cuda.is_available() else "cpu" + + +# --------------------------------------------------------------------------- +# Argument parsing +# --------------------------------------------------------------------------- +def parse_args() -> argparse.Namespace: + p = argparse.ArgumentParser(description="SGD TTT Eval (PR #461)") + p.add_argument("--model-path", type=str, default="final_model.int6.ptz", + help="Path to quantized model (.ptz) or raw checkpoint (.pt)") + p.add_argument("--data-dir", type=str, default=os.environ.get("DATA_DIR", "./data/"), + help="Data directory containing datasets/ and tokenizers/") + p.add_argument("--quant", type=str, default="simple_int6", + choices=["simple_int6", "gptq", "none"], + help="Quantization mode: simple_int6 (default, best for TTT), " + "gptq (worse with TTT), none (raw .pt checkpoint)") + + # TTT hyperparameters (PR #461 defaults) + p.add_argument("--no-ttt", action="store_true", help="Skip TTT, baseline eval only") + p.add_argument("--ttt-lr", type=float, default=0.002, help="Base TTT learning rate") + p.add_argument("--ttt-momentum", type=float, default=0.9, help="SGD momentum") + p.add_argument("--ttt-epochs", type=int, default=3, help="Epochs per chunk") + p.add_argument("--ttt-chunk-size", type=int, default=32768, + help="Chunk size in tokens (32K)") + p.add_argument("--ttt-freeze-blocks", type=int, default=2, + help="Number of initial blocks to freeze during TTT") + p.add_argument("--ttt-grad-clip", type=float, default=1.0, help="Gradient clip norm") + p.add_argument("--ttt-batch-seqs", type=int, default=32, + help="Batch size (sequences) for TTT training") + + # Eval 
parameters + p.add_argument("--eval-stride", type=int, default=64, + help="Sliding window stride for scoring") + p.add_argument("--eval-seq-len", type=int, default=2048, + help="Sequence length for evaluation") + p.add_argument("--eval-batch-seqs", type=int, default=32, + help="Batch size (sequences) for sliding window scoring") + + return p.parse_args() + + +# --------------------------------------------------------------------------- +# Model loading +# --------------------------------------------------------------------------- +def load_model(args: argparse.Namespace, h: tg.Hyperparameters) -> nn.Module: + """Load model from quantized or raw checkpoint, return in bfloat16.""" + model_path = Path(args.model_path) + if not model_path.exists(): + # Try in data dir + alt = Path(args.data_dir) / args.model_path + if alt.exists(): + model_path = alt + else: + raise FileNotFoundError( + f"Model not found at {args.model_path} or {alt}") + + model = tg.GPT(h).to(DEVICE).bfloat16() + + # Regenerate frozen FC layers if selective freeze is enabled + if hasattr(tg, '_apply_selective_freeze'): + tg._apply_selective_freeze(model) + if hasattr(tg, 'restore_fp32_params'): + tg.restore_fp32_params(model) + + if args.quant == "none" or model_path.suffix == ".pt": + # Raw checkpoint + print(f"Loading raw checkpoint from {model_path}") + state = torch.load(model_path, map_location=DEVICE, weights_only=True) + model.load_state_dict(state, strict=False) + else: + # Quantized checkpoint (.ptz) - decompress and dequantize + print(f"Loading quantized model from {model_path}") + with open(model_path, "rb") as f: + quant_blob = f.read() + + # Decompress + compressor = getattr(h, 'compressor', 'brotli') + raw_bytes = tg._decompress(quant_blob, compressor) + quant_state = torch.load(io.BytesIO(raw_bytes), map_location="cpu") + + # Dequantize + sd_cpu = {k: v.detach().cpu() for k, v in model.state_dict().items()} + deq_state = tg.dequantize_mixed_int6( + quant_state["w"], quant_state["m"], 
sd_cpu) + + strict = os.environ.get("SELECTIVE_FREEZE", "0") not in ("1", "true") + model.load_state_dict(deq_state, strict=strict) + print(f" Dequantized from int6 to bfloat16") + + n_params = sum(p.numel() for p in model.parameters()) + print(f" Model parameters: {n_params:,}") + return model + + +# --------------------------------------------------------------------------- +# Byte counting LUTs (for BPB computation) +# --------------------------------------------------------------------------- +def load_byte_luts(h: tg.Hyperparameters, device: torch.device): + """Load SentencePiece tokenizer and build byte-counting lookup tables.""" + sp = spm.SentencePieceProcessor(model_file=h.tokenizer_path) + base_lut, space_lut, boundary_lut = tg.build_sentencepiece_luts( + sp, h.vocab_size, device) + return base_lut, space_lut, boundary_lut + + +def count_bytes(prev_ids: Tensor, tgt_ids: Tensor, + base_lut: Tensor, space_lut: Tensor, + boundary_lut: Tensor) -> float: + """Count UTF-8 bytes for BPB computation using SentencePiece LUTs.""" + tb = base_lut[tgt_ids].to(torch.float64) + tb += (space_lut[tgt_ids] & ~boundary_lut[prev_ids]).to(torch.float64) + return tb.sum().item() + + +# --------------------------------------------------------------------------- +# Sliding window scoring (score phase of TTT) +# --------------------------------------------------------------------------- +def sliding_window_score( + model: nn.Module, + tokens: Tensor, + seq_len: int, + stride: int, + vocab_size: int, + batch_seqs: int, + base_lut: Tensor, + space_lut: Tensor, + boundary_lut: Tensor, + device: torch.device, +) -> tuple[float, float, float]: + """ + Score tokens using sliding window evaluation. + Each token is scored with maximum context. + Returns (total_nll, total_tokens_scored, total_bytes). 
+ """ + total_tokens = tokens.numel() - 1 + context_size = seq_len - stride + + # Generate window start positions + window_starts = [ws for ws in range(0, total_tokens, stride) + if ws + context_size < total_tokens] + + total_nll = 0.0 + total_scored = 0 + total_bytes = 0.0 + + with torch.inference_mode(): + for bi in range(0, len(window_starts), batch_seqs): + batch_ws = window_starts[bi:bi + batch_seqs] + bsz = len(batch_ws) + + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens: list[int] = [] + + for i, ws in enumerate(batch_ws): + we = min(ws + seq_len, total_tokens) + wlen = we - ws + wlens.append(wlen) + chunk = tokens[ws:we + 1].to(dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk[:-1] + y_batch[i, :wlen] = chunk[1:] + + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = model.forward_logits(x_batch) + + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), + reduction="none", + ).reshape(bsz, seq_len) + + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + # First window scores from position 0; others from context_size + s = 0 if ws == 0 else context_size + scored_nll = nll[i, s:wlen].to(torch.float64) + total_nll += scored_nll.sum().item() + total_scored += wlen - s + + # Byte counting + tgt = y_batch[i, s:wlen] + prev = x_batch[i, s:wlen] + total_bytes += count_bytes(prev, tgt, base_lut, space_lut, boundary_lut) + + return total_nll, total_scored, total_bytes + + +# --------------------------------------------------------------------------- +# Chunk-aware sliding window scoring +# --------------------------------------------------------------------------- +def assign_windows_to_chunks( + total_tokens: int, + seq_len: int, + stride: int, + chunk_size: int, +) -> list[list[int]]: + """ + Assign each sliding window to the chunk containing its scored region. 
+ A window's scored region starts at max(ws, context_size) for ws > 0. + We assign based on where the scored tokens fall. + Returns a list of lists: chunk_windows[chunk_idx] = [window_start, ...]. + """ + context_size = seq_len - stride + n_chunks = math.ceil(total_tokens / chunk_size) + + # Pre-compute chunk boundaries + chunk_windows: list[list[int]] = [[] for _ in range(n_chunks)] + + window_starts = [ws for ws in range(0, total_tokens, stride) + if ws + context_size < total_tokens] + + for ws in window_starts: + # The scored region of this window + scored_start = ws if ws == 0 else ws + context_size + # Assign to the chunk containing scored_start + chunk_idx = min(scored_start // chunk_size, n_chunks - 1) + chunk_windows[chunk_idx].append(ws) + + return chunk_windows + + +def score_chunk_windows( + model: nn.Module, + tokens: Tensor, + window_starts: list[int], + seq_len: int, + stride: int, + vocab_size: int, + batch_seqs: int, + base_lut: Tensor, + space_lut: Tensor, + boundary_lut: Tensor, + device: torch.device, +) -> tuple[float, int, float]: + """ + Score a specific set of sliding windows. Returns (nll_sum, tokens_scored, bytes). 
+ """ + if not window_starts: + return 0.0, 0, 0.0 + + total_tokens = tokens.numel() - 1 + context_size = seq_len - stride + + nll_sum = 0.0 + tok_scored = 0 + byte_sum = 0.0 + + with torch.inference_mode(): + for bi in range(0, len(window_starts), batch_seqs): + batch_ws = window_starts[bi:bi + batch_seqs] + bsz = len(batch_ws) + + x_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + y_batch = torch.zeros(bsz, seq_len, dtype=torch.int64, device=device) + wlens: list[int] = [] + + for i, ws in enumerate(batch_ws): + we = min(ws + seq_len, total_tokens) + wlen = we - ws + wlens.append(wlen) + chunk = tokens[ws:we + 1].to(dtype=torch.int64, device=device) + x_batch[i, :wlen] = chunk[:-1] + y_batch[i, :wlen] = chunk[1:] + + with torch.autocast(device_type="cuda", dtype=torch.bfloat16): + logits = model.forward_logits(x_batch) + + nll = F.cross_entropy( + logits.reshape(-1, logits.size(-1)).float(), + y_batch.reshape(-1), + reduction="none", + ).reshape(bsz, seq_len) + + for i, ws in enumerate(batch_ws): + wlen = wlens[i] + s = 0 if ws == 0 else context_size + scored_nll = nll[i, s:wlen].to(torch.float64) + nll_sum += scored_nll.sum().item() + tok_scored += wlen - s + + tgt = y_batch[i, s:wlen] + prev = x_batch[i, s:wlen] + byte_sum += count_bytes(prev, tgt, base_lut, space_lut, boundary_lut) + + return nll_sum, tok_scored, byte_sum + + +# --------------------------------------------------------------------------- +# TTT training on a chunk +# --------------------------------------------------------------------------- +def train_on_chunk( + model: nn.Module, + tokens: Tensor, + chunk_start: int, + chunk_end: int, + ttt_params: list[nn.Parameter], + optimizer: torch.optim.Optimizer, + epochs: int, + grad_clip: float, + batch_seqs: int, + seq_len: int, + vocab_size: int, + device: torch.device, +) -> float: + """ + Train on a chunk for multiple epochs using SGD. + Splits chunk into seq_len-sized sequences, batched. 
+    Returns average training loss of final epoch.
+    """
+    model.train()
+
+    # Extract chunk tokens
+    chunk_tokens = tokens[chunk_start:min(chunk_end + 1, tokens.numel())].to(
+        dtype=torch.int64, device=device)
+    chunk_len = chunk_tokens.numel() - 1  # -1 for target offset
+
+    if chunk_len < 2:
+        return 0.0
+
+    # Split chunk into sequences of seq_len
+    n_seqs = chunk_len // seq_len
+    if n_seqs == 0:
+        # Chunk smaller than seq_len: use what we have
+        n_seqs = 1
+        actual_len = chunk_len
+    else:
+        actual_len = seq_len
+
+    # Build x, y tensors for the chunk
+    x_all = []
+    y_all = []
+    for si in range(n_seqs):
+        start = si * seq_len
+        end = start + actual_len
+        if end + 1 > chunk_tokens.numel():
+            break
+        x_all.append(chunk_tokens[start:end])
+        y_all.append(chunk_tokens[start + 1:end + 1])
+
+    if not x_all:
+        return 0.0
+
+    x_all = torch.stack(x_all)  # (n_seqs, seq_len)
+    y_all = torch.stack(y_all)  # (n_seqs, seq_len)
+    n_seqs = x_all.shape[0]
+
+    last_epoch_loss = 0.0
+    for epoch in range(epochs):
+        # Shuffle sequence order each epoch
+        perm = torch.randperm(n_seqs)
+        epoch_loss = 0.0
+        epoch_tokens = 0
+
+        for bi in range(0, n_seqs, batch_seqs):
+            batch_idx = perm[bi:bi + batch_seqs]
+            x_batch = x_all[batch_idx]
+            y_batch = y_all[batch_idx]
+
+            with torch.autocast(device_type="cuda", dtype=torch.bfloat16):
+                logits = model.forward_logits(x_batch)
+                loss = F.cross_entropy(
+                    logits.reshape(-1, logits.size(-1)).float(),
+                    y_batch.reshape(-1),
+                    reduction="mean",
+                )
+
+            optimizer.zero_grad()
+            loss.backward()
+            torch.nn.utils.clip_grad_norm_(ttt_params, grad_clip)
+            optimizer.step()
+
+            batch_tok = y_batch.numel()
+            epoch_loss += loss.item() * batch_tok
+            epoch_tokens += batch_tok
+
+        if epoch_tokens > 0:
+            last_epoch_loss = epoch_loss / epoch_tokens
+
+    return last_epoch_loss
+
+
+# ---------------------------------------------------------------------------
+# Main SGD TTT loop
+# ---------------------------------------------------------------------------
+def sgd_ttt_eval(
+    model: nn.Module,
+    val_tokens: Tensor,
+    args: argparse.Namespace,
+    h: tg.Hyperparameters,
+    base_lut: Tensor,
+    space_lut: Tensor,
+    boundary_lut: Tensor,
+    device: torch.device,
+) -> tuple[float, float]:
+    """
+    Full SGD TTT evaluation following PR #461.
+
+    Algorithm:
+      1. Assign sliding windows to 32K-token chunks based on scored region
+      2. For each chunk:
+         a. SCORE: model.eval(), inference_mode, score the chunk's windows
+         b. TRAIN: model.train(), SGD update, 3 epochs
+         c. Apply cosine LR decay
+      3. Freeze first 2 blocks throughout
+
+    Returns (final_loss, final_bpb).
+    """
+    t0 = time.time()
+
+    # --- Setup: freeze first N blocks ---
+    freeze_n = args.ttt_freeze_blocks
+    frozen_params = set()
+    for i in range(min(freeze_n, len(model.blocks))):
+        for p in model.blocks[i].parameters():
+            p.requires_grad = False
+            frozen_params.add(id(p))
+    print(f"  Froze first {freeze_n} blocks")
+
+    # Collect trainable parameters
+    ttt_params = [p for p in model.parameters() if p.requires_grad]
+    n_trainable = sum(p.numel() for p in ttt_params)
+    n_total = sum(p.numel() for p in model.parameters())
+    print(f"  Trainable: {n_trainable:,} / {n_total:,} parameters")
+
+    # --- Setup: SGD optimizer ---
+    optimizer = torch.optim.SGD(
+        ttt_params,
+        lr=args.ttt_lr,
+        momentum=args.ttt_momentum,
+    )
+
+    # --- Assign windows to chunks ---
+    total_tokens = val_tokens.numel() - 1
+    chunk_size = args.ttt_chunk_size
+    n_chunks = math.ceil(total_tokens / chunk_size)
+
+    chunk_windows = assign_windows_to_chunks(
+        total_tokens, args.eval_seq_len, args.eval_stride, chunk_size)
+
+    # Sanity check: count total windows
+    total_windows = sum(len(cw) for cw in chunk_windows)
+    print(f"  Chunks: {n_chunks}, Total windows: {total_windows}, "
+          f"Chunk size: {chunk_size:,} tokens")
+
+    # --- Main loop ---
+    overall_nll = 0.0
+    overall_scored = 0
+    overall_bytes = 0.0
+    current_lr = args.ttt_lr
+
+    for ci in range(n_chunks):
+        chunk_start = ci * chunk_size
+        chunk_end = min((ci + 1) * chunk_size, total_tokens)
+        windows = chunk_windows[ci]
+
+        # === STEP 1: SCORE (eval mode, inference_mode) ===
+        model.eval()
+        chunk_nll, chunk_scored, chunk_bytes = score_chunk_windows(
+            model, val_tokens, windows,
+            args.eval_seq_len, args.eval_stride, h.vocab_size,
+            args.eval_batch_seqs,
+            base_lut, space_lut, boundary_lut, device,
+        )
+
+        overall_nll += chunk_nll
+        overall_scored += chunk_scored
+        overall_bytes += chunk_bytes
+
+        # === STEP 2: TRAIN on scored chunk (AFTER scoring) ===
+        # Skip training on the last chunk (no future tokens to benefit)
+        if ci < n_chunks - 1:
+            # Cosine LR decay across chunks
+            if n_chunks > 1:
+                cos_decay = 0.5 * (1.0 + math.cos(math.pi * ci / (n_chunks - 1)))
+            else:
+                cos_decay = 1.0
+            current_lr = args.ttt_lr * cos_decay
+
+            for pg in optimizer.param_groups:
+                pg["lr"] = current_lr
+
+            train_loss = train_on_chunk(
+                model, val_tokens,
+                chunk_start, chunk_end,
+                ttt_params, optimizer,
+                args.ttt_epochs, args.ttt_grad_clip,
+                args.ttt_batch_seqs, args.eval_seq_len,
+                h.vocab_size, device,
+            )
+
+        # --- Progress logging ---
+        if (ci + 1) % 5 == 0 or ci == 0 or ci == n_chunks - 1:
+            if overall_scored > 0 and overall_bytes > 0:
+                running_loss = overall_nll / overall_scored
+                running_bpb = (running_loss / math.log(2)) * (overall_scored / overall_bytes)
+                lr_str = f" lr={current_lr:.6f}" if ci < n_chunks - 1 else ""
+                print(f"  TTT [{ci+1}/{n_chunks}] loss={running_loss:.6f} "
+                      f"bpb={running_bpb:.6f} scored={overall_scored:,}{lr_str} "
+                      f"({time.time()-t0:.0f}s)", flush=True)
+
+    # --- Unfreeze (restore requires_grad) ---
+    for i in range(min(freeze_n, len(model.blocks))):
+        for p in model.blocks[i].parameters():
+            p.requires_grad = True
+
+    # --- Final BPB ---
+    if overall_scored == 0 or overall_bytes == 0:
+        raise RuntimeError("No tokens scored during TTT eval")
+
+    final_loss = overall_nll / overall_scored
+    final_bpb = (final_loss / math.log(2)) * (overall_scored / overall_bytes)
+    elapsed = time.time() - t0
+    print(f"\n  TTT complete: loss={final_loss:.6f} bpb={final_bpb:.6f} "
+          f"tokens={overall_scored:,} bytes={overall_bytes:.0f} time={elapsed:.0f}s")
+
+    return final_loss, final_bpb
+
+
+# ---------------------------------------------------------------------------
+# Baseline eval (no TTT, just sliding window)
+# ---------------------------------------------------------------------------
+def baseline_sliding_window_eval(
+    model: nn.Module,
+    val_tokens: Tensor,
+    args: argparse.Namespace,
+    h: tg.Hyperparameters,
+    base_lut: Tensor,
+    space_lut: Tensor,
+    boundary_lut: Tensor,
+    device: torch.device,
+) -> tuple[float, float]:
+    """Baseline sliding window eval without TTT."""
+    print("\n=== Baseline Sliding Window Eval (no TTT) ===")
+    t0 = time.time()
+
+    model.eval()
+    total_nll, total_scored, total_bytes = sliding_window_score(
+        model, val_tokens,
+        args.eval_seq_len, args.eval_stride, h.vocab_size,
+        args.eval_batch_seqs,
+        base_lut, space_lut, boundary_lut, device,
+    )
+
+    if total_scored == 0 or total_bytes == 0:
+        raise RuntimeError("No tokens scored in baseline eval")
+
+    loss = total_nll / total_scored
+    bpb = (loss / math.log(2)) * (total_scored / total_bytes)
+    elapsed = time.time() - t0
+    print(f"  Baseline: loss={loss:.6f} bpb={bpb:.6f} "
+          f"tokens={total_scored:,} bytes={total_bytes:.0f} time={elapsed:.0f}s")
+
+    return loss, bpb
+
+
+# ---------------------------------------------------------------------------
+# Main
+# ---------------------------------------------------------------------------
+def main():
+    args = parse_args()
+
+    print("SGD TTT Eval (PR #461 approach)")
+    print(f"  Device: {DEVICE}")
+    print(f"  Time: {time.strftime('%Y-%m-%d %H:%M:%S')}")
+    print(f"  Model: {args.model_path}")
+    print(f"  Quant: {args.quant}")
+
+    if not args.no_ttt:
+        print(f"  TTT: lr={args.ttt_lr} momentum={args.ttt_momentum} "
+              f"epochs={args.ttt_epochs} chunk={args.ttt_chunk_size}")
+        print(f"  TTT: freeze_blocks={args.ttt_freeze_blocks} "
+              f"grad_clip={args.ttt_grad_clip} batch_seqs={args.ttt_batch_seqs}")
+    print(f"  Eval: stride={args.eval_stride} seq_len={args.eval_seq_len} "
+          f"batch_seqs={args.eval_batch_seqs}")
+    print()
+
+    # --- Build hyperparameters ---
+    h = tg.Hyperparameters()
+    # Override data dir if specified
+    if args.data_dir:
+        h.data_dir = args.data_dir
+        h.datasets_dir = os.path.join(args.data_dir, 'datasets',
+                                      f'fineweb10B_sp{h.vocab_size}')
+        h.val_files = os.path.join(h.datasets_dir, 'fineweb_val_*.bin')
+        h.tokenizer_path = os.path.join(args.data_dir, 'tokenizers',
+                                        f'fineweb_{h.vocab_size}_bpe.model')
+    h.eval_stride = args.eval_stride
+    h.eval_seq_len = args.eval_seq_len
+
+    # --- Load tokenizer + byte LUTs ---
+    print("Loading tokenizer and byte LUTs...")
+    base_lut, space_lut, boundary_lut = load_byte_luts(h, torch.device(DEVICE))
+
+    # --- Load validation tokens ---
+    print("Loading validation tokens...")
+    val_tokens = tg.load_validation_tokens(h.val_files, h.eval_seq_len)
+    print(f"  Validation tokens: {val_tokens.numel():,}")
+
+    # --- Load model ---
+    print("Loading model...")
+    model = load_model(args, h)
+
+    # --- Baseline eval ---
+    baseline_loss, baseline_bpb = baseline_sliding_window_eval(
+        model, val_tokens, args, h,
+        base_lut, space_lut, boundary_lut, torch.device(DEVICE),
+    )
+
+    if args.no_ttt:
+        print(f"\n{'='*60}")
+        print("RESULTS (baseline only)")
+        print(f"{'='*60}")
+        print(f"  Baseline BPB: {baseline_bpb:.6f}")
+        return
+
+    # --- SGD TTT eval ---
+    print("\n=== SGD TTT Eval (PR #461) ===")
+    print(f"  SGD(lr={args.ttt_lr}, momentum={args.ttt_momentum})")
+    print(f"  {args.ttt_epochs} epochs/chunk, {args.ttt_chunk_size:,} tokens/chunk")
+    print(f"  Freeze first {args.ttt_freeze_blocks} blocks, grad_clip={args.ttt_grad_clip}")
+
+    ttt_loss, ttt_bpb = sgd_ttt_eval(
+        model, val_tokens, args, h,
+        base_lut, space_lut, boundary_lut, torch.device(DEVICE),
+    )
+
+    # --- Results ---
+    improvement_bpb = ttt_bpb - baseline_bpb
+    improvement_pct = improvement_bpb / baseline_bpb * 100
+
+    print(f"\n{'='*60}")
+    print("RESULTS")
+    print(f"{'='*60}")
+    print(f"  Baseline: loss={baseline_loss:.6f} bpb={baseline_bpb:.6f}")
+    print(f"  SGD TTT:  loss={ttt_loss:.6f} bpb={ttt_bpb:.6f}")
+    print(f"  Delta BPB: {improvement_bpb:+.6f} ({improvement_pct:+.2f}%)")
+    print(f"  Expected: -0.0165 BPB")
+    print(f"{'='*60}")
+
+
+if __name__ == "__main__":
+    main()
diff --git a/records/track_non_record_16mb/2026-04-04_TTT_GPTQ_Incompatibility/submission.json b/records/track_non_record_16mb/2026-04-04_TTT_GPTQ_Incompatibility/submission.json
new file mode 100644
index 0000000000..56a255465c
--- /dev/null
+++ b/records/track_non_record_16mb/2026-04-04_TTT_GPTQ_Incompatibility/submission.json
@@ -0,0 +1,16 @@
+{
+  "author": "himanshudongre",
+  "github_id": "himanshudongre",
+  "submission_type": "non-record",
+  "title": "TTT and GPTQ Are Fundamentally Incompatible",
+  "description": "Evidence from 4 independent configurations showing GPTQ's compensatory weight structure is destroyed by SGD-based test-time training, making TTT and GPTQ mutually exclusive strategies.",
+  "hardware": "1×H100 80GB (RunPod)",
+  "compute_cost_usd": 5.0,
+  "techniques": [
+    "SGD TTT (PR #461 protocol)",
+    "LoRA TTT rank-8",
+    "GPTQ int6 quantization"
+  ],
+  "key_finding": "SGD TTT gives -0.0165 BPB on simple int6 but -0.0001 to +0.030 on GPTQ",
+  "related_prs": [461, 601, 1326]
+}