File: records/track_10min_16mb/2026-04-01_21L_PRP/README.md (new, +109 lines)
# 21L + PRP SwiGLU + Shared Attention + sp8192 + 2048 context length

While this run demonstrates end-to-end results, the current int8+zlib artifact (17.7 MB) exceeds the 16 MB track limit. Furthermore, due to local hardware constraints (NVIDIA RTX 3080), the model processed only 2.5–5% of the total target tokens required for a full convergence run (defined as a <10-minute training window on an 8xH100 cluster).

This variant moves away from the default 1k-vocab, 1024-context, 9-layer starter and spends the budget on a deeper 8k-vocab model with parameter sharing, PRP-based MLPs, and a longer-horizon LR schedule.

### Changes from Baseline

**1. 8K tokenizer + longer context**

The script switches from `fineweb10B_sp1024` to `fineweb10B_sp8192`, uses the 8k SentencePiece tokenizer, and doubles context length to `TRAIN_SEQ_LEN=2048`. The default model shape becomes 21 layers at width 512 with 8 query heads and 2 KV heads.

**2. PRP SwiGLU MLP instead of the baseline relu^2 MLP**

The dense MLP is replaced with a SwiGLU block built from `ParametrizedRandomProjection` (https://arxiv.org/pdf/2512.13480). Each PRP layer keeps a fixed random projection buffer and trains only low-dimensional controls: input scaling, output scaling, output bias, and a rank-32 LoRA update. The MLP projects once to `2 * hidden`, then splits into gate and value halves. A higher MLP multiplier is used: 4 instead of 2.
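A minimal sketch of the gate/value split described above. `PRPSwiGLU`, `prp_up`, and `prp_down` are illustrative names, and `nn.Linear` stands in for `ParametrizedRandomProjection` only to keep the sketch self-contained; the real block would use the PRP layer from `prp.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PRPSwiGLU(nn.Module):
    """Sketch: SwiGLU MLP with one fused up-projection.

    nn.Linear is a stand-in for ParametrizedRandomProjection here.
    """
    def __init__(self, dim: int, mlp_mult: int = 4):
        super().__init__()
        hidden = mlp_mult * dim
        # Project once to 2*hidden, then split into gate and value halves.
        self.prp_up = nn.Linear(dim, 2 * hidden, bias=False)
        self.prp_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.prp_up(x).chunk(2, dim=-1)
        return self.prp_down(F.silu(gate) * value)
```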

**3. Pairwise shared attention in the transformer body**

Layer 1 and the final layer remain unique. Interior layers share attention modules in adjacent pairs: `(2,3)`, `(4,5)`, `(6,7)`, and so on. Each block still keeps its own norms, residual mixing, scales, and MLP, so depth increases without paying for 21 fully independent attention stacks.

In the logged run, that produces 21 transformer blocks but only 12 unique attention modules.
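The sharing pattern can be sketched as follows (0-indexed; `make_attention` is a hypothetical factory). For 21 layers this yields exactly 12 unique attention modules, matching the logged run.

```python
def build_attention_modules(num_layers, make_attention):
    """Sketch of pairwise attention sharing: the first and last layers get
    unique attention modules; interior layers share one module per
    adjacent pair, with a lone module if an interior layer is left over."""
    modules = [None] * num_layers
    modules[0] = make_attention()
    modules[-1] = make_attention()
    i = 1
    while i < num_layers - 1:
        shared = make_attention()
        modules[i] = shared
        if i + 1 < num_layers - 1:
            modules[i + 1] = shared  # adjacent pair shares one module
        i += 2
    return modules
```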

**4. Separate optimizer path for PRP vector controls**

The script adds `unique_named_parameters` and splits PRP vector parameters (`.alpha`, `.weight`, `.bias`) into their own Adam group with `PRP_LR`. Matrix-shaped parameters still use Muon, while the remaining scalar and control tensors stay on Adam.
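A sketch of the three-way split, operating on `(name, ndim)` pairs instead of real `nn.Parameter` objects to stay dependency-free. The rule that PRP controls are identified by a `prp` substring in the module path is an assumption; the actual script may filter by module type.

```python
def split_param_groups(named_params):
    """Sketch of the optimizer split: PRP vector controls -> Adam @ PRP_LR,
    matrix-shaped parameters -> Muon, everything else -> Adam.

    named_params: iterable of (name, ndim) pairs.
    Assumption: PRP control tensors live under modules whose path
    contains "prp" and end with .alpha / .weight / .bias."""
    prp_suffixes = (".alpha", ".weight", ".bias")
    prp_group, muon_group, adam_group = [], [], []
    for name, ndim in named_params:
        if "prp" in name and name.endswith(prp_suffixes):
            prp_group.append(name)
        elif ndim >= 2:
            muon_group.append(name)
        else:
            adam_group.append(name)
    return prp_group, muon_group, adam_group
```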

**5. Three-phase cosine LR schedule**

The default warmdown schedule is replaced with a longer-run schedule:
- cosine warmup from `LR_INIT_SCALE` to 1.0
- cosine flash drop from 1.0 to `LR_MAIN_SCALE`
- powered cosine tail from `LR_MAIN_SCALE` to `LR_MIN_SCALE`, shaped by `LR_GAMMA`

This will be adjusted for the full schedule run.
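A hedged sketch of the three phases; the exact phase boundaries and the gamma-powered tail shape are assumptions based on the description above, not the script's code.

```python
import math

def lr_scale(step, total, warmup, drop,
             init_scale=0.0, main_scale=0.5, min_scale=0.05, gamma=0.8):
    """Sketch of the three-phase schedule: cosine warmup to 1.0, cosine
    flash drop to main_scale, then a powered cosine tail to min_scale."""
    if step < warmup:
        # cosine warmup from init_scale up to 1.0
        t = step / max(1, warmup)
        return init_scale + (1.0 - init_scale) * 0.5 * (1 - math.cos(math.pi * t))
    if step < warmup + drop:
        # cosine flash drop from 1.0 down to main_scale
        t = (step - warmup) / max(1, drop)
        return main_scale + (1.0 - main_scale) * 0.5 * (1 + math.cos(math.pi * t))
    # powered cosine tail from main_scale to min_scale, shaped by gamma
    t = (step - warmup - drop) / max(1, total - warmup - drop)
    frac = (0.5 * (1 + math.cos(math.pi * t))) ** gamma
    return min_scale + (main_scale - min_scale) * frac
```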

**6. Shared-aware int8 export**

The post-training export remains per-row int8 for 2D tensors and fp16/fp32 passthrough for small or control tensors, but it now deduplicates repeated storage before counting payload bytes. That matters once attention modules or embeddings are shared or tied.
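The deduplicated accounting can be sketched like this. Plain `id()` on serialized byte payloads stands in for a storage-pointer key (for torch tensors one would key on `t.untyped_storage().data_ptr()`); the function name and dict shape are illustrative.

```python
def dedup_payload_bytes(state_tensors):
    """Sketch of shared-aware size accounting: payloads that alias the
    same underlying object (shared attention, tied embeddings) are
    counted once toward the artifact size."""
    seen = set()
    total = 0
    for name, payload in state_tensors.items():
        key = id(payload)  # stand-in for a storage data_ptr key
        if key in seen:
            continue  # already counted via another parameter name
        seen.add(key)
        total += len(payload)
    return total
```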

### Default Configuration in This Record

```bash
DATA_PATH=./data/datasets/fineweb10B_sp8192
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model

VOCAB_SIZE=8192
NUM_LAYERS=21
MODEL_DIM=512
NUM_HEADS=8
NUM_KV_HEADS=2
MLP_MULT=4
TIE_EMBEDDINGS=1

TRAIN_SEQ_LEN=2048
TRAIN_BATCH_TOKENS=65536
VAL_BATCH_SIZE=65536
VAL_FRAC=0.25

ITERATIONS=4000
WARMUP_STEPS=2
LR_MAIN_SCALE=0.5
LR_MIN_SCALE=0.05
LR_INIT_SCALE=0.0
LR_GAMMA=0.8

TIED_EMBED_LR=0.03
MATRIX_LR=0.03
SCALAR_LR=0.03
PRP_LR=0.06
MUON_MOMENTUM=0.95
```

Notes:
- `LR_WARMUP_STEPS` defaults to `int(0.01 * ITERATIONS)`.
- `LR_DROP_STEPS` defaults to `int(0.2 * ITERATIONS)`.
- The file lives under the 10-minute track, but the current defaults are closer to a longer research run than a finalized 10-minute submission.

### Actual Results

| Metric | Value |
|--------|-------|
| Final pre-quant val_bpb | 1.3527 |
| Final pre-quant val_loss | 3.4657 |
| int8+zlib roundtrip val_bpb | 1.3540 |
| int8+zlib roundtrip val_loss | 3.4690 |
| Quantization gap | +0.0013 BPB |
| Total params | 17,171,040 |
| Peak memory allocated | 7,329 MiB |
| Serialized model | 50,783,797 bytes |
| int8+zlib artifact | 17,603,010 bytes |
| Total submission size (artifact + code) | 17,658,504 bytes |


### Validation Trajectory

| Step | val_loss | val_bpb |
|------|----------|---------|
| 0 | 9.0102 | 3.5169 |
| 400 | 4.2170 | 1.6460 |
| 800 | 3.9027 | 1.5233 |
| 1200 | 3.7963 | 1.4818 |
| 1600 | 3.7088 | 1.4476 |
| 2000 | 3.6501 | 1.4247 |
| 2400 | 3.6022 | 1.4060 |
| 2800 | 3.5515 | 1.3862 |
| 3200 | 3.5092 | 1.3697 |
| 3600 | 3.4804 | 1.3585 |
| 4000 | 3.4657 | 1.3527 |

File: records/track_10min_16mb/2026-04-01_21L_PRP/prp.py (new, +97 lines)
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class ParametrizedRandomProjection(nn.Module):
    _layer_counter = 0

    def __init__(
        self,
        in_features,
        out_features,
        projection_type="random",
        lora_rank: int = 32,
        lora_alpha: float = 1.0,
    ):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.projection_type = projection_type
        self._layer_id = ParametrizedRandomProjection._layer_counter
        ParametrizedRandomProjection._layer_counter += 1
        self.seed = hash((in_features, out_features, projection_type, self._layer_id)) % (2 ** 32)

        self.alpha = nn.Parameter(torch.empty(in_features))
        self.weight = nn.Parameter(torch.empty(out_features))
        self.bias = nn.Parameter(torch.empty(out_features))

        std_in = 1 / math.sqrt(in_features)
        nn.init.normal_(self.alpha, mean=1.0, std=std_in)
        nn.init.normal_(self.weight, mean=1.0, std=std_in)
        nn.init.zeros_(self.bias)

        # ========== Fixed random projection (non-learnable) ==========
        g = torch.Generator(device="cpu")
        g.manual_seed(self.seed)
        std_dev = 1 / math.sqrt(self.in_features)
        if self.projection_type == "random":
            proj = torch.randn(self.in_features, self.out_features, generator=g) * std_dev
        elif self.projection_type == "sparse":
            val_scale = math.sqrt(3.0 / self.in_features)
            u = torch.rand(self.in_features, self.out_features, generator=g)
            proj = torch.zeros(self.in_features, self.out_features)
            proj[u < 1 / 6] = -val_scale
            proj[u >= 5 / 6] = val_scale
        elif self.projection_type == "orthogonal":
            proj = torch.randn(self.in_features, self.out_features, generator=g) * std_dev
            nn.init.orthogonal_(proj)
        elif self.projection_type == "uniform":
            proj = (torch.rand(self.in_features, self.out_features, generator=g) * 2 - 1) * std_dev * math.sqrt(3)
        else:
            raise ValueError("Unsupported projection_type")

        # keep both names for compatibility
        self.register_buffer("proj", proj, persistent=False)
        self.register_buffer("fixed_proj", proj, persistent=False)

        # ========== LoRA low-rank adaptation ==========
        self.lora_rank = lora_rank
        self.lora_alpha = lora_alpha
        self.lora_A = nn.Parameter(torch.empty(self.in_features, lora_rank))
        self.lora_B = nn.Parameter(torch.empty(lora_rank, self.out_features))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
        self.lora_scale = float(self.lora_alpha) / max(1, self.lora_rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Move proj to correct device/dtype
        proj = self.proj.to(dtype=x.dtype, device=x.device)

        # Element-wise scaling (no per-input bias)
        x_scaled = x * self.alpha.unsqueeze(0)

        # Fixed projection path
        out_fixed = x_scaled @ proj
        out_fixed = out_fixed * self.weight.unsqueeze(0) + self.bias.unsqueeze(0)

        # LoRA adaptation as separate path
        lora_out = (x @ self.lora_A) @ self.lora_B
        lora_out = lora_out * self.lora_scale

        return out_fixed + lora_out

    def count_trainable_params(self) -> int:
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def __repr__(self):
        return (
            f"{self.__class__.__name__}("
            f"in_features={self.in_features}, "
            f"out_features={self.out_features}, "
            f"projection_type='{self.projection_type}', "
            f"lora_rank={self.lora_rank}, "
            f"lora_alpha={self.lora_alpha}, "
            f"trainable_params={self.count_trainable_params():,})"
        )
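The trainable-parameter budget of one `ParametrizedRandomProjection` layer follows directly from the class above: `alpha` (in), `weight` + `bias` (2 * out), plus the rank-r LoRA factors. A small helper to make the arithmetic explicit (the helper itself is illustrative, not part of the repo):

```python
def prp_trainable_params(in_features, out_features, lora_rank=32):
    """Trainable parameters in one ParametrizedRandomProjection layer:
    alpha (in) + weight and bias (2 * out) + LoRA A (in * r) + LoRA B (r * out).
    The fixed random projection buffer contributes nothing."""
    controls = in_features + 2 * out_features
    lora = in_features * lora_rank + lora_rank * out_features
    return controls + lora
```

For example, the MLP up-projection at width 512 with multiplier 4 (out = 2 * 2048 = 4096) trains 156,160 parameters instead of the 2,097,152 of a dense 512x4096 matrix.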

File: records/track_10min_16mb/2026-04-01_21L_PRP/submission.json (new, +8 lines)
{
  "author": "UniversalComputingResearch",
  "name": "21L + PRP SwiGLU + Shared Attention + sp8192 + 2048 context length",
  "github": "https://github.com/UniversalComputingResearch",
  "val_bpb": 1.3540,
  "description": "Custom architecture experiment for Parameter Golf.",
  "status": "Awaiting compute credits"
}
File: records/track_10min_16mb/2026-04-01_21L_PRP/tokenizer_getter.py (new, +24 lines)
import os
from huggingface_hub import hf_hub_download

repo_id = "sproos/parameter-golf-tokenizers"

# Define your custom paths (matching train_gpt.py expectations)
tok_dir = "./data/tokenizers/"
data_dir = "./data/datasets/fineweb10B_sp8192/"

# 1. Download Tokenizer files (8k version)
tokenizer_files = ["tokenizers/fineweb_8192_bpe.model", "tokenizers/fineweb_8192_bpe.vocab"]
for file in tokenizer_files:
    print(f"Downloading {file}...")
    hf_hub_download(repo_id=repo_id, filename=file, local_dir=tok_dir)

# 2. Download Dataset shards (Train 00-79 + Val) — 8192 vocab version
dataset_subdir = "datasets/fineweb10B_sp8192"
dataset_files = [f"{dataset_subdir}/fineweb_train_{i:06d}.bin" for i in range(80)]
dataset_files.append(f"{dataset_subdir}/fineweb_val_000000.bin")

for file in dataset_files:
    print(f"Downloading {file}...")
    hf_hub_download(repo_id=repo_id, filename=file, local_dir=data_dir)
