File: records/track_10min_16mb/2026-04-01_21L_PRP/README.md (new, +109 lines)
# 21L + PRP SwiGLU + Shared Attention + sp8192 + 2048 context length

While this run demonstrates end-to-end results, the current int8+zlib artifact (17.7 MB) exceeds the 16 MB track limit. Furthermore, due to local hardware constraints (NVIDIA RTX 3080), the model processed only 2.5–5% of the total target tokens required for a full convergence run (defined as a <10-minute training window on an 8xH100 cluster).

This variant moves away from the default 1k-vocab, 1024-context, 9-layer starter and spends the budget on a deeper 8k-vocab model with parameter sharing, PRP-based MLPs, and a longer-horizon LR schedule.

### Changes from Baseline

**1. 8K tokenizer + longer context**

The script switches from `fineweb10B_sp1024` to `fineweb10B_sp8192`, uses the 8k SentencePiece tokenizer, and doubles context length to `TRAIN_SEQ_LEN=2048`. The default model shape becomes 21 layers at width 512 with 8 query heads and 2 KV heads.

**2. PRP SwiGLU MLP instead of the baseline relu^2 MLP**

The dense MLP is replaced with a SwiGLU block built from `ParametrizedRandomProjection` (https://arxiv.org/pdf/2512.13480). Each PRP layer keeps a fixed random projection buffer and trains only low-dimensional controls: input scaling, output scaling, output bias, and a rank-32 LoRA update. The MLP projects once to `2 * hidden`, then splits into gate and value halves. A higher MLP multiplier is used: 4 instead of 2.
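A minimal sketch of the gate/value split described above. `PRPSwiGLU`, `prp_up`, and `prp_down` are illustrative names, and `nn.Linear` stands in for `ParametrizedRandomProjection` only to keep the sketch self-contained; the real block would use the PRP layer from `prp.py`.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PRPSwiGLU(nn.Module):
    """Sketch: SwiGLU MLP with one fused up-projection.

    nn.Linear is a stand-in for ParametrizedRandomProjection here.
    """
    def __init__(self, dim: int, mlp_mult: int = 4):
        super().__init__()
        hidden = mlp_mult * dim
        # Project once to 2*hidden, then split into gate and value halves.
        self.prp_up = nn.Linear(dim, 2 * hidden, bias=False)
        self.prp_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        gate, value = self.prp_up(x).chunk(2, dim=-1)
        return self.prp_down(F.silu(gate) * value)
```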

**3. Pairwise shared attention in the transformer body**

Layer 1 and the final layer remain unique. Interior layers share attention modules in adjacent pairs: `(2,3)`, `(4,5)`, `(6,7)`, and so on. Each block still keeps its own norms, residual mixing, scales, and MLP, so depth increases without paying for 21 fully independent attention stacks.

In the logged run, that produces 21 transformer blocks but only 12 unique attention modules.
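The sharing pattern can be sketched as follows (0-indexed; `make_attention` is a hypothetical factory). For 21 layers this yields exactly 12 unique attention modules, matching the logged run.

```python
def build_attention_modules(num_layers, make_attention):
    """Sketch of pairwise attention sharing: the first and last layers get
    unique attention modules; interior layers share one module per
    adjacent pair, with a lone module if an interior layer is left over."""
    modules = [None] * num_layers
    modules[0] = make_attention()
    modules[-1] = make_attention()
    i = 1
    while i < num_layers - 1:
        shared = make_attention()
        modules[i] = shared
        if i + 1 < num_layers - 1:
            modules[i + 1] = shared  # adjacent pair shares one module
        i += 2
    return modules
```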

**4. Separate optimizer path for PRP vector controls**

The script adds `unique_named_parameters` and splits PRP vector parameters (`.alpha`, `.weight`, `.bias`) into their own Adam group with `PRP_LR`. Matrix-shaped parameters still use Muon, while the remaining scalar and control tensors stay on Adam.
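A sketch of the three-way split, operating on `(name, ndim)` pairs instead of real `nn.Parameter` objects to stay dependency-free. The rule that PRP controls are identified by a `prp` substring in the module path is an assumption; the actual script may filter by module type.

```python
def split_param_groups(named_params):
    """Sketch of the optimizer split: PRP vector controls -> Adam @ PRP_LR,
    matrix-shaped parameters -> Muon, everything else -> Adam.

    named_params: iterable of (name, ndim) pairs.
    Assumption: PRP control tensors live under modules whose path
    contains "prp" and end with .alpha / .weight / .bias."""
    prp_suffixes = (".alpha", ".weight", ".bias")
    prp_group, muon_group, adam_group = [], [], []
    for name, ndim in named_params:
        if "prp" in name and name.endswith(prp_suffixes):
            prp_group.append(name)
        elif ndim >= 2:
            muon_group.append(name)
        else:
            adam_group.append(name)
    return prp_group, muon_group, adam_group
```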

**5. Three-phase cosine LR schedule**

The default warmdown schedule is replaced with a longer-run schedule:
- cosine warmup from `LR_INIT_SCALE` to 1.0
- cosine flash drop from 1.0 to `LR_MAIN_SCALE`
- powered cosine tail from `LR_MAIN_SCALE` to `LR_MIN_SCALE`, shaped by `LR_GAMMA`

This will be adjusted for the full schedule run.
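A hedged sketch of the three phases; the exact phase boundaries and the gamma-powered tail shape are assumptions based on the description above, not the script's code.

```python
import math

def lr_scale(step, total, warmup, drop,
             init_scale=0.0, main_scale=0.5, min_scale=0.05, gamma=0.8):
    """Sketch of the three-phase schedule: cosine warmup to 1.0, cosine
    flash drop to main_scale, then a powered cosine tail to min_scale."""
    if step < warmup:
        # cosine warmup from init_scale up to 1.0
        t = step / max(1, warmup)
        return init_scale + (1.0 - init_scale) * 0.5 * (1 - math.cos(math.pi * t))
    if step < warmup + drop:
        # cosine flash drop from 1.0 down to main_scale
        t = (step - warmup) / max(1, drop)
        return main_scale + (1.0 - main_scale) * 0.5 * (1 + math.cos(math.pi * t))
    # powered cosine tail from main_scale to min_scale, shaped by gamma
    t = (step - warmup - drop) / max(1, total - warmup - drop)
    frac = (0.5 * (1 + math.cos(math.pi * t))) ** gamma
    return min_scale + (main_scale - min_scale) * frac
```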

**6. Shared-aware int8 export**

The post-training export remains per-row int8 for 2D tensors and fp16/fp32 passthrough for small or control tensors, but it now deduplicates repeated storage before counting payload bytes. That matters once attention modules or embeddings are shared or tied.
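The deduplicated accounting can be sketched like this. Plain `id()` on serialized byte payloads stands in for a storage-pointer key (for torch tensors one would key on `t.untyped_storage().data_ptr()`); the function name and dict shape are illustrative.

```python
def dedup_payload_bytes(state_tensors):
    """Sketch of shared-aware size accounting: payloads that alias the
    same underlying object (shared attention, tied embeddings) are
    counted once toward the artifact size."""
    seen = set()
    total = 0
    for name, payload in state_tensors.items():
        key = id(payload)  # stand-in for a storage data_ptr key
        if key in seen:
            continue  # already counted via another parameter name
        seen.add(key)
        total += len(payload)
    return total
```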

### Default Configuration in This Record

```bash
DATA_PATH=./data/datasets/fineweb10B_sp8192
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model

VOCAB_SIZE=8192
NUM_LAYERS=21
MODEL_DIM=512
NUM_HEADS=8
NUM_KV_HEADS=2
MLP_MULT=4
TIE_EMBEDDINGS=1

TRAIN_SEQ_LEN=2048
TRAIN_BATCH_TOKENS=65536
VAL_BATCH_SIZE=65536
VAL_FRAC=0.25

ITERATIONS=4000
WARMUP_STEPS=2
LR_MAIN_SCALE=0.5
LR_MIN_SCALE=0.05
LR_INIT_SCALE=0.0
LR_GAMMA=0.8

TIED_EMBED_LR=0.03
MATRIX_LR=0.03
SCALAR_LR=0.03
PRP_LR=0.06
MUON_MOMENTUM=0.95
```

Notes:
- `LR_WARMUP_STEPS` defaults to `int(0.01 * ITERATIONS)`.
- `LR_DROP_STEPS` defaults to `int(0.2 * ITERATIONS)`.
- The file lives under the 10-minute track, but the current defaults are closer to a longer research run than a finalized 10-minute submission.

### Actual Results

| Metric | Value |
|--------|-------|
| Final pre-quant val_bpb | 1.3527 |
| Final pre-quant val_loss | 3.4657 |
| int8+zlib roundtrip val_bpb | 1.3540 |
| int8+zlib roundtrip val_loss | 3.4690 |
| Quantization gap | +0.0013 BPB |
| Total params | 17,171,040 |
| Peak memory allocated | 7,329 MiB |
| Serialized model | 50,783,797 bytes |
| int8+zlib artifact | 17,603,010 bytes |
| Total submission size (artifact + code) | 17,658,504 bytes |


### Validation Trajectory

| Step | val_loss | val_bpb |
|------|----------|---------|
| 0 | 9.0102 | 3.5169 |
| 400 | 4.2170 | 1.6460 |
| 800 | 3.9027 | 1.5233 |
| 1200 | 3.7963 | 1.4818 |
| 1600 | 3.7088 | 1.4476 |
| 2000 | 3.6501 | 1.4247 |
| 2400 | 3.6022 | 1.4060 |
| 2800 | 3.5515 | 1.3862 |
| 3200 | 3.5092 | 1.3697 |
| 3600 | 3.4804 | 1.3585 |
| 4000 | 3.4657 | 1.3527 |

File: records/track_10min_16mb/2026-04-01_21L_PRP/prp.py (new, +97 lines)
import torch
import torch.nn as nn
import torch.nn.functional as F
import math


class ParametrizedRandomProjection(nn.Module):
    _layer_counter = 0

    def __init__(
        self,
        in_features,
        out_features,
        projection_type="random",
        lora_rank: int = 32,
        lora_alpha: float = 1.0,
    ):
        super().__init__()
        self.in_features = in_features
        self.out_features = out_features
        self.projection_type = projection_type
        self._layer_id = ParametrizedRandomProjection._layer_counter
        ParametrizedRandomProjection._layer_counter += 1
        self.seed = hash((in_features, out_features, projection_type, self._layer_id)) % (2 ** 32)

        self.alpha = nn.Parameter(torch.empty(in_features))
        self.weight = nn.Parameter(torch.empty(out_features))
        self.bias = nn.Parameter(torch.empty(out_features))

        std_in = 1 / math.sqrt(in_features)
        nn.init.normal_(self.alpha, mean=1.0, std=std_in)
        nn.init.normal_(self.weight, mean=1.0, std=std_in)
        nn.init.zeros_(self.bias)

        # ========== Fixed random projection (non-learnable) ==========
        g = torch.Generator(device="cpu")
        g.manual_seed(self.seed)
        std_dev = 1 / math.sqrt(self.in_features)
        if self.projection_type == "random":
            proj = torch.randn(self.in_features, self.out_features, generator=g) * std_dev
        elif self.projection_type == "sparse":
            val_scale = math.sqrt(3.0 / self.in_features)
            u = torch.rand(self.in_features, self.out_features, generator=g)
            proj = torch.zeros(self.in_features, self.out_features)
            proj[u < 1 / 6] = -val_scale
            proj[u >= 5 / 6] = val_scale
        elif self.projection_type == "orthogonal":
            proj = torch.randn(self.in_features, self.out_features, generator=g) * std_dev
            nn.init.orthogonal_(proj)
        elif self.projection_type == "uniform":
            proj = (torch.rand(self.in_features, self.out_features, generator=g) * 2 - 1) * std_dev * math.sqrt(3)
        else:
            raise ValueError("Unsupported projection_type")

        # keep both names for compatibility
        self.register_buffer("proj", proj, persistent=False)
        self.register_buffer("fixed_proj", proj, persistent=False)

        # ========== LoRA low-rank adaptation ==========
        self.lora_rank = lora_rank
        self.lora_alpha = lora_alpha
        self.lora_A = nn.Parameter(torch.empty(self.in_features, lora_rank))
        self.lora_B = nn.Parameter(torch.empty(lora_rank, self.out_features))
        nn.init.kaiming_uniform_(self.lora_A, a=math.sqrt(5))
        nn.init.zeros_(self.lora_B)
        self.lora_scale = float(self.lora_alpha) / max(1, self.lora_rank)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Move proj to correct device/dtype
        proj = self.proj.to(dtype=x.dtype, device=x.device)

        # Element-wise scaling (no per-input bias)
        x_scaled = x * self.alpha.unsqueeze(0)

        # Fixed projection path
        out_fixed = x_scaled @ proj
        out_fixed = out_fixed * self.weight.unsqueeze(0) + self.bias.unsqueeze(0)

        # LoRA adaptation as separate path
        lora_out = (x @ self.lora_A) @ self.lora_B
        lora_out = lora_out * self.lora_scale

        return out_fixed + lora_out

    def count_trainable_params(self) -> int:
        return sum(p.numel() for p in self.parameters() if p.requires_grad)

    def __repr__(self):
        return (
            f"{self.__class__.__name__}("
            f"in_features={self.in_features}, "
            f"out_features={self.out_features}, "
            f"projection_type='{self.projection_type}', "
            f"lora_rank={self.lora_rank}, "
            f"lora_alpha={self.lora_alpha}, "
            f"trainable_params={self.count_trainable_params():,})"
        )
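The trainable-parameter budget of one `ParametrizedRandomProjection` layer follows directly from the class above: `alpha` (in), `weight` + `bias` (2 * out), plus the rank-r LoRA factors. A small helper to make the arithmetic explicit (the helper itself is illustrative, not part of the repo):

```python
def prp_trainable_params(in_features, out_features, lora_rank=32):
    """Trainable parameters in one ParametrizedRandomProjection layer:
    alpha (in) + weight and bias (2 * out) + LoRA A (in * r) + LoRA B (r * out).
    The fixed random projection buffer contributes nothing."""
    controls = in_features + 2 * out_features
    lora = in_features * lora_rank + lora_rank * out_features
    return controls + lora
```

For example, the MLP up-projection at width 512 with multiplier 4 (out = 2 * 2048 = 4096) trains 156,160 parameters instead of the 2,097,152 of a dense 512x4096 matrix.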

File: records/track_10min_16mb/2026-04-01_21L_PRP/submission.json (new, +8 lines)
{
  "author": "UniversalComputingResearch",
  "name": "21L + PRP SwiGLU + Shared Attention + sp8192 + 2048 context length",
  "github": "https://github.com/UniversalComputingResearch",
  "val_bpb": 1.3540,
  "description": "Custom architecture experiment for Parameter Golf.",
  "status": "Awaiting compute credits"
}
File: records/track_10min_16mb/2026-04-01_21L_PRP/tokenizer_getter.py (new, +24 lines)
import os
from huggingface_hub import hf_hub_download

repo_id = "sproos/parameter-golf-tokenizers"

# Define your custom paths (matching train_gpt.py expectations)
tok_dir = "./data/tokenizers/"
data_dir = "./data/datasets/fineweb10B_sp8192/"

# 1. Download Tokenizer files (8k version)
tokenizer_files = ["tokenizers/fineweb_8192_bpe.model", "tokenizers/fineweb_8192_bpe.vocab"]
for file in tokenizer_files:
    print(f"Downloading {file}...")
    hf_hub_download(repo_id=repo_id, filename=file, local_dir=tok_dir)

# 2. Download Dataset shards (Train 00-79 + Val) — 8192 vocab version
dataset_subdir = "datasets/fineweb10B_sp8192"
dataset_files = [f"{dataset_subdir}/fineweb_train_{i:06d}.bin" for i in range(80)]
dataset_files.append(f"{dataset_subdir}/fineweb_val_000000.bin")

for file in dataset_files:
    print(f"Downloading {file}...")
    hf_hub_download(repo_id=repo_id, filename=file, local_dir=data_dir)
