openai · Ribin545 · Mar 31, 2026
diff --git a/..._non_record_16mb/2026-04-01_EliteUTv22p8_12StepRecurrence_Windows3090/README.md b/..._non_record_16mb/2026-04-01_EliteUTv22p8_12StepRecurrence_Windows3090/README.md
@@ -0,0 +1,69 @@
+# Parameter Golf — Windows Memory Index (Elite v22.8)
+
+Quick-reference index for all notes in this folder. Updated for the **Elite Universal Transformer v22.8**.
+
+---
+
+## Files
+
+| File | Contents |
+|---|---|
+| [01_bugs_and_fixes.md](notes/01_bugs_and_fixes.md) | 21+ bugs found + exact fixes for Windows/RTX 3090 |
+| [02_windows_setup.md](notes/02_windows_setup.md) | Step-by-step environment setup from scratch |
+| [03_training_guide.md](notes/03_training_guide.md) | Every training command + all env variables |
+| [04_wrapper_internals.md](notes/04_wrapper_internals.md) | How `train_gpt_windows.py` patches scripts at runtime |
+| [05_custom_kernel.md](notes/05_custom_kernel.md) | Triton MLP Fused Megakernel Technical Details |
+| [06_performance_tuning.md](notes/06_performance_tuning.md) | Performance tuning checklist + throughput notes |
+| [07_final_architecture.md](notes/07_final_architecture.md) | The **Elite Universal Transformer (12.2M Unique)** |
+| [08_muon_polar_express.md](notes/08_muon_polar_express.md) | Quintic Minimax (Degree-5) Stability Fix |
+| [09_elite_standard_v20_run.md](notes/09_elite_standard_v20_run.md) | v20 run notes + iteration history |
+| [10_vectorized_dataset_cleaning.md](notes/10_vectorized_dataset_cleaning.md) | Dataset cleaning pipeline notes |
+| [11_elite_standard_v21_high_heat_run.md](notes/11_elite_standard_v21_high_heat_run.md) | History of the High-Heat Stability Pivot |
+| [12_elite_transformer_implementation_summary.md](notes/12_elite_transformer_implementation_summary.md) | **Latest: v22.8 Implementation Summary** |
+
+---
+
+## TL;DR — What We Did (Elite v22.8 Pivot)
+
+1. **Architecture**: Finalized the **12-step Recursive Elite UT**. Reuses a 1024-dim block with **Coordinate (Step) Embeddings** and LoRA.
+2. **Normalization**: Moved to **Strict Pre-Normalization** (v22.8). All RMSNorms removed from the residual path to allow deep state accumulation.
+3. **Stabilization**: Applied a **Universal Gradient Averaging (1/12)** and a **20-step Maturity Ramp** to prevent recursive explosion.
+4. **Optimizers**: Switched to **Muon "Polar Express" (Degree-5)** at **0.009 LR** (Option B) for perfect monotonic convergence.
+5. **Kernels**: Integrated **Fused Triton MLP** for $X \times W + LeakyReLU^2$, significantly boosting Windows throughput.
+6. **Efficiency**: Saturating the RTX 3090 at **524,288 tokens/step** with **12.5GB VRAM** footprint.
+
+---
+
+## The 3-Line Cheat Sheet
+
+```powershell
+# 1. Setup the 'Elite' Environment (One-time)
+.\setup_elite_env.bat
+
+# 2. Launch the 10-Minute Stabilization Test (55 Steps)
+.\limits_test_10m.bat
+
+# 3. Scale for Final Championship Run
+.\final_run_10m.bat
+```
+
+---
+
+## Key Gotchas
+
+- **Never run `train_gpt.py` directly on Windows** → will crash with Flash SDP error. Always use `train_gpt_windows.py` (via .bat).
+- **Iteration Lock**: `train_gpt.py` is now environment-aware. Ensure `ITERATIONS` and `MAX_WALLCLOCK_SECONDS` are set in shell or .bat.
+- **Polar Express steps**: Always set `MUON_BACKEND_STEPS=5` for degree-5 minimax stability on Ampere (RTX 3090).
+- **VRAM Control**: Activation checkpointing is used across the recursive blocks to keep logic depth 12 within 12.5GB.
+
+---
+
+## Current Status v22.8
+
+- [x] **Elite Universal Transformer** (UT v22.8) architecture locked.
+- [x] **Strict Pre-Norm** (Residual-free) verified.
+- [x] **Polar Express** (Degree-5) stability verified at 0.009 LR.
+- [x] **Fused Triton MLP Kernel** verified.
+- [x] **Environment Awareness** (Iteration controls) verified.
+- [x] **3.29 BPB** at Step 60 achieved.
+- [ ] Sub-1.x BPB leaderboard submission...
diff --git a/.../track_non_record_16mb/2026-04-01_EliteUTv22p8_12StepRecurrence_Windows3090/data_utils.py b/.../track_non_record_16mb/2026-04-01_EliteUTv22p8_12StepRecurrence_Windows3090/data_utils.py
@@ -0,0 +1,82 @@
+import glob
+import os
+from pathlib import Path
+import numpy as np
+import torch
+from torch import Tensor
+
+def load_data_shard(file: Path) -> Tensor:
+    header_bytes = 256 * np.dtype("<i4").itemsize
+    token_bytes = np.dtype("<u2").itemsize
+    header = np.fromfile(file, dtype="<i4", count=256)
+    # SHARD HEADER INTS & SHARD_MAGIC
+    num_tokens = int(header[2])
+    expected_size = header_bytes + num_tokens * token_bytes
+    if file.stat().st_size != expected_size:
+        raise ValueError(f"Shard size mismatch for {file}: expected {expected_size} bytes")
+    tokens_np = np.fromfile(file, dtype="<u2", count=num_tokens, offset=header_bytes)
+    return torch.from_numpy(tokens_np.astype(np.uint16, copy=False)).pin_memory()
+
+
+class TokenStream:
+    def __init__(self, pattern: str, rank: int = 0, world_size: int = 1):
+        self.files = [Path(p) for p in sorted(glob.glob(pattern))]
+        if not self.files:
+            raise FileNotFoundError(f"No files found for pattern: {pattern}")
+
+        # ELITE FIX: Distributed Shard Partitioning (Zero Redundancy)
+        # Each rank processes a unique subset of the global shards.
+        self.files = self.files[rank::world_size]
+        if not self.files:
+            raise ValueError(f"Rank {rank}/{world_size} has no shards to process!")
+
+        # ELITE FIX: Global Shard Shuffling (Stable across runs with the same seed)
+        import random
+        rng = random.Random(int(os.environ.get("SEED", 42)))
+        rng.shuffle(self.files)
+
+        self.file_idx = 0
+        self.tokens = load_data_shard(self.files[0])
+
+        # ELITE FIX: Random Start Offset (To destroy boundary artifacts)
+        # Every rank starts at a different, random 'cut' within the first shard.
+        seq_len_hint = int(os.environ.get("TRAIN_SEQ_LEN", 256))
+        self.pos = rng.randint(0, seq_len_hint - 1)
+        print(f"[data] Rank {rank}/{world_size} initialized at file {self.files[0].name}, pos {self.pos}")
+
+    def _advance_file(self) -> None:
+        self.file_idx = (self.file_idx + 1) % len(self.files)
+        self.tokens = load_data_shard(self.files[self.file_idx])
+        self.pos = 0
+
+    def take(self, n: int) -> Tensor:
+        chunks: list[Tensor] = []
+        remaining = n
+        while remaining > 0:
+            avail = self.tokens.numel() - self.pos
+            if avail <= 0:
+                self._advance_file()
+                continue
+            k = min(remaining, avail)
+            chunks.append(self.tokens[self.pos : self.pos + k])
+            self.pos += k
+            remaining -= k
+        return chunks[0] if len(chunks) == 1 else torch.cat(chunks)
+
+
+class DistributedTokenLoader:
+    def __init__(self, pattern: str, rank: int, world_size: int, device: torch.device):
+        self.rank = rank
+        self.world_size = world_size
+        self.device = device
+        self.stream = TokenStream(pattern, rank, world_size)
+
+    def next_batch(self, micro_batch_tokens: int, seq_len: int) -> tuple[Tensor, Tensor]:
+        """Loads a single micro-batch of tokens for the current rank."""
+        per_rank_span = micro_batch_tokens + 1
+        # Each rank takes its own span sequentially from the stream
+        chunk = self.stream.take(per_rank_span)
+        local = chunk.to(dtype=torch.int64)
+        x = local[:-1].reshape(-1, seq_len)
+        y = local[1:].reshape(-1, seq_len)
+        return x.to(self.device, non_blocking=True), y.to(self.device, non_blocking=True)
diff --git a/.../track_non_record_16mb/2026-04-01_EliteUTv22p8_12StepRecurrence_Windows3090/eval_utils.py b/.../track_non_record_16mb/2026-04-01_EliteUTv22p8_12StepRecurrence_Windows3090/eval_utils.py
@@ -0,0 +1,181 @@
+import math
+import glob
+from pathlib import Path
+import numpy as np
+import torch
+from torch import Tensor, nn
+import torch.nn.functional as F
+import torch.distributed as dist
+import sentencepiece as spm
+from data_utils import load_data_shard
+
+def build_sentencepiece_luts(
+    sp: spm.SentencePieceProcessor, vocab_size: int, device: torch.device
+) -> tuple[Tensor, Tensor, Tensor]:
+    sp_vocab_size = int(sp.vocab_size())
+    table_size = max(sp_vocab_size, vocab_size)
+    base_bytes_np = np.zeros((table_size,), dtype=np.int16)
+    has_leading_space_np = np.zeros((table_size,), dtype=np.bool_)
+    is_boundary_token_np = np.ones((table_size,), dtype=np.bool_)
+    for token_id in range(sp_vocab_size):
+        if sp.is_control(token_id) or sp.is_unknown(token_id) or sp.is_unused(token_id):
+            continue
+        is_boundary_token_np[token_id] = False
+        if sp.is_byte(token_id):
+            base_bytes_np[token_id] = 1
+            continue
+        piece = sp.id_to_piece(token_id)
+        if piece.startswith("▁"):
+            has_leading_space_np[token_id] = True
+            piece = piece[1:]
+        base_bytes_np[token_id] = len(piece.encode("utf-8"))
+    return (
+        torch.tensor(base_bytes_np, dtype=torch.int16, device=device),
+        torch.tensor(has_leading_space_np, dtype=torch.bool, device=device),
+        torch.tensor(is_boundary_token_np, dtype=torch.bool, device=device),
+    )
+
+
+def load_validation_tokens(pattern: str, seq_len: int) -> Tensor:
+    files = [Path(p) for p in sorted(glob.glob(pattern))]
+    if not files:
+        raise FileNotFoundError(f"No files found for pattern: {pattern}")
+    tokens = torch.cat([load_data_shard(file) for file in files]).contiguous()
+    usable = ((tokens.numel() - 1) // seq_len) * seq_len
+    if usable <= 0:
+        raise ValueError(f"Validation split is too short for TRAIN_SEQ_LEN={seq_len}")
+    return tokens[: usable + 1]
+
+
+def eval_val(
+    args,
+    model: nn.Module,
+    rank: int,
+    world_size: int,
+    device: torch.device,
+    grad_accum_steps: int,
+    val_tokens: Tensor,
+    base_bytes_lut: Tensor,
+    has_leading_space_lut: Tensor,
+    is_boundary_token_lut: Tensor,
+    max_steps: int | None = None,
+    stride: int | None = None,
+    ttt_lr: float = 0.0,
+) -> tuple[float, float]:
+    """
+    Sliding Window Evaluation:
+    If stride is set, each window overlap by (seq_len - stride).
+    BPB is computed only on the non-overlapping 'tail' of each window to ensure warm context.
+    """
+    seq_len = args.train_seq_len
+    eff_stride = stride if stride is not None else seq_len
+
+    # Calculate how many possible windows we have
+    num_possible_windows = (val_tokens.numel() - seq_len - 1) // eff_stride + 1
+
+    # Partition windows among ranks
+    win_start = (num_possible_windows * rank) // world_size
+    win_end = (num_possible_windows * (rank + 1)) // world_size
+
+    val_loss_sum = torch.zeros((), device=device, dtype=torch.float64)
+    val_token_count = torch.zeros((), device=device, dtype=torch.float64)
+    val_byte_count = torch.zeros((), device=device, dtype=torch.float64)
+
+    # 0. SAVE GRAD STATE
+    orig_grad_state = {n: p.requires_grad for n, p in model.named_parameters()}
+
+    local_batch_tokens = args.val_batch_size // (world_size * grad_accum_steps)
+    local_batch_wins = max(1, local_batch_tokens // seq_len)
+
+    # 1. IDENTIFY LORA PARAMETERS FOR TTT
+    ttt_params = []
+    if ttt_lr > 0:
+        for name, p in model.named_parameters():
+            if "lora_" in name:
+                p.requires_grad = True
+                ttt_params.append(p)
+            else:
+                p.requires_grad = False
+
+        optimizer = torch.optim.SGD(ttt_params, lr=ttt_lr)
+    else:
+        # Standard Eval: No gradients needed anywhere
+        for p in model.parameters():
+            p.requires_grad = False
+
+    model.eval()
+    for batch_idx, b_win_start in enumerate(range(win_start, win_end, local_batch_wins)):
+        if max_steps is not None and batch_idx >= max_steps:
+            break
+
+        b_win_end = min(b_win_start + local_batch_wins, win_end)
+
+        # Pack windows into a batch
+        batch_x, batch_y, batch_prev = [], [], []
+        keep_masks = []
+
+        for win_idx in range(b_win_start, b_win_end):
+            raw_start = win_idx * eff_stride
+            raw_end = raw_start + seq_len + 1
+            local = val_tokens[raw_start:raw_end].to(device=device, dtype=torch.int64, non_blocking=True)
+
+            x = local[:-1]
+            y = local[1:]
+            prev_ids = local[:-1]
+
+            mask = torch.zeros((seq_len,), device=device, dtype=torch.bool)
+            if win_idx == 0:
+                mask[:] = True
+            else:
+                mask[seq_len - eff_stride:] = True
+
+            batch_x.append(x)
+            batch_y.append(y)
+            batch_prev.append(prev_ids)
+            keep_masks.append(mask)
+
+        if not batch_x: continue
+
+        x = torch.stack(batch_x)
+        y = torch.stack(batch_y)
+        prev_ids = torch.stack(batch_prev)
+        mask = torch.stack(keep_masks)
+
+        # 1. SCORE PHASE (Must be no_grad to use compiled inference kernels stably)
+        with torch.no_grad():
+            logits = model.forward_logits(x, use_compiled=True)
+            loss_all = F.cross_entropy(logits.permute(0, 2, 1).float(), y, reduction="none")
+
+            kept_loss = loss_all[mask]
+            val_loss_sum += kept_loss.sum().to(torch.float64)
+            val_token_count += mask.sum().to(torch.float64)
+
+            tgt_ids_kept = y[mask]
+            prev_ids_kept = prev_ids[mask]
+            token_bytes = base_bytes_lut[tgt_ids_kept].to(dtype=torch.int16)
+            token_bytes += (has_leading_space_lut[tgt_ids_kept] & ~is_boundary_token_lut[prev_ids_kept]).to(dtype=torch.int16)
+            val_byte_count += token_bytes.to(torch.float64).sum()
+
+        # 2. ADAPT PHASE (Legal TTT)
+        if ttt_lr > 0:
+            # Eager execution for adaptation to avoid Inductor re-compilation on Windows
+            logits_ttt = model.forward_logits(x, use_compiled=False)
+            loss_for_backprop = F.cross_entropy(logits_ttt.permute(0, 2, 1).float(), y, reduction="none").mean()
+            loss_for_backprop.backward()
+            optimizer.step()
+            optimizer.zero_grad()
+
+    if dist.is_available() and dist.is_initialized():
+        dist.all_reduce(val_loss_sum, op=dist.ReduceOp.SUM)
+        dist.all_reduce(val_token_count, op=dist.ReduceOp.SUM)
+        dist.all_reduce(val_byte_count, op=dist.ReduceOp.SUM)
+
+    # RESTORE GRAD STATE
+    for n, p in model.named_parameters():
+        p.requires_grad = orig_grad_state[n]
+
+    val_loss = val_loss_sum / val_token_count
+    bits_per_token = val_loss.item() / math.log(2.0)
+    tokens_per_byte = val_token_count.item() / val_byte_count.item()
+    model.train()
+    return float(val_loss.item()), float(bits_per_token * tokens_per_byte)
diff --git a/...ck_non_record_16mb/2026-04-01_EliteUTv22p8_12StepRecurrence_Windows3090/final_run_10m.bat b/...ck_non_record_16mb/2026-04-01_EliteUTv22p8_12StepRecurrence_Windows3090/final_run_10m.bat
@@ -0,0 +1,45 @@
+@echo off
+setlocal
+cd /d "%~dp0"
+:: Final 10-Minute Championship Run
+:: Incorporates Polar Express, 256-Seq Curriculum, and Warp Speed TTT
+
+:: --- Training Window (10 Minutes) ---
+set MAX_WALLCLOCK_SECONDS=600
+set ITERATIONS=55
+set WARMUP_STEPS=20
+set VAL_LOSS_EVERY=10
+
+
+:: --- Dataset & Throughput ---
+set DATA_PATH=..\..\..\data\datasets\fineweb10B_sp1024
+set TOKENIZER_PATH=..\..\..\data\tokenizers\fineweb_1024_bpe.model
+set TRAIN_BATCH_TOKENS=524288
+set GRAD_ACCUM_STEPS=16
+set TRAIN_SEQ_LEN=256
+set LORA_RANK=16
+
+:: --- Optimizer (Elite 22.8 "Safe-Stability") ---
+set MUON_BACKEND_STEPS=5
+set MATRIX_LR=0.009
+set GRAD_CLIP_NORM=1.0
+
+:: --- TTT (Test-Time Training) - TTT Cooling ---
+set TTT_ENABLED=1
+set TTT_LR=4e-4
+set EVAL_STRIDE=64
+
+echo =======================================================
+echo Launching Final 10-Minute Championship Run
+echo Mode: Polar Express + Warp Speed TTT
+echo Data: 256-1024 Seq Curriculum (32x Throughput)
+echo Limit: 600 Seconds (10 Minutes)
+echo =======================================================
+
+.\venv\Scripts\python train_gpt_windows.py
+
+echo.
+echo =======================================================
+echo Training Window Complete. Generated final_model.int8.ptz
+echo =======================================================
+pause