# MuonEq-R + Context-Only SLOT + XSA-all + QK-Gain 5.0

**val_bpb: TBD** (3-seed mean) | **~16.0 MB** | 8xH100 SXM

## Changes from PR #549 (1.1194 BPB)

| Change | Expected Impact | Source |
|--------|----------------|--------|
| **MuonEq-R** | -0.001 BPB | arXiv:2603.28254, PR #1260 |
| **Context-Only SLOT** | -0.006 BPB | PR #1217 |
| **XSA all 11 layers** | -0.001 BPB | PR #1019 |
| **QK_GAIN_INIT=5.0** | -0.001 BPB | PR #1217 sweep |
| **Total expected** | **-0.009 BPB** | **~1.110 BPB** |
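
Worked total: 1.1194 - 0.009 = 1.1104, i.e. the ~1.110 BPB target above.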

## MuonEq-R

MuonEq-R row-normalizes gradient matrices before Newton-Schulz orthogonalization (arXiv:2603.28254), equalizing row norms so the NS iteration operates on a better-conditioned matrix. Zero-byte cost, ~0.001 BPB improvement.
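
A minimal sketch of the idea, assuming the standard quintic Newton-Schulz iteration from the public Muon optimizer; the coefficients and iteration count here are the usual Muon defaults, not necessarily the exact values in PR #1260:

```python
import torch

@torch.no_grad()
def muoneq_r_orthogonalize(grad: torch.Tensor, ns_iters: int = 5, eps: float = 1e-7) -> torch.Tensor:
    """Row-normalize a 2D gradient, then Newton-Schulz orthogonalize it."""
    # MuonEq-R step: rescale every row to unit norm so the NS iteration
    # operates on a better-conditioned matrix.
    X = grad / (grad.norm(dim=1, keepdim=True) + eps)
    # Standard quintic Newton-Schulz, as in the public Muon optimizer.
    a, b, c = 3.4445, -4.7750, 2.0315
    transposed = X.size(0) > X.size(1)
    if transposed:
        X = X.T                   # keep X wide so X @ X.T stays small
    X = X / (X.norm() + eps)      # scale into the NS convergence basin
    for _ in range(ns_iters):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X
```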

## Context-Only SLOT (Causal)

A per-batch additive delta vector (512 dims) is optimized with AdamW during eval. For each sliding window (seq_len=2048, stride=64):

1. Hidden states computed under `torch.no_grad()` — model weights frozen
2. Delta optimized using cross-entropy on **context positions only** (0 to seq_len-stride). The 64 new tokens being scored are excluded from the loss.
3. Final logits computed with optimized delta. NLL recorded for the 64 new positions.

Delta is re-initialized to zeros for each window. Gradients flow only through the final linear projection and softcap, not the transformer blocks; see the sketch after the parameter table below.

| Parameter | Value |
|-----------|-------|
| Delta shape | (1, 1, 512) |
| Optimizer | AdamW |
| Learning rate | 0.005 |
| Steps | 8 |
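
A minimal sketch of the per-window loop under the settings above; `model`, `lm_head`, and the softcap value `cap` are hypothetical stand-ins for the actual pieces in `train_gpt.py`:

```python
import torch
import torch.nn.functional as F

def slot_window_nll(model, lm_head, tokens, stride=64, steps=8, lr=5e-3, cap=15.0):
    """Context-Only SLOT for one sliding window.

    tokens: (1, seq_len); the last `stride` positions are the new tokens
    being scored. Returns their summed NLL.
    """
    seq_len = tokens.size(1)
    # 1. Hidden states under no_grad: model weights stay frozen.
    with torch.no_grad():
        hidden = model(tokens)                       # (1, seq_len, 512)
    # Delta (1, 1, 512) re-initialized to zeros for every window.
    delta = torch.zeros(1, 1, hidden.size(-1), device=hidden.device,
                        requires_grad=True)
    opt = torch.optim.AdamW([delta], lr=lr)          # updates delta only
    targets = tokens[:, 1:]
    ctx = seq_len - stride - 1   # logit positions whose targets are context tokens

    def logits():
        # Gradient reaches delta only through lm_head + softcap.
        return cap * torch.tanh(lm_head(hidden + delta) / cap)

    # 2. Optimize delta on context positions only; new tokens are excluded.
    for _ in range(steps):
        loss = F.cross_entropy(logits()[0, :ctx], targets[0, :ctx])
        opt.zero_grad()
        loss.backward()
        opt.step()
    # 3. Score only the `stride` new positions with the optimized delta.
    with torch.no_grad():
        nll = F.cross_entropy(logits()[0, -stride - 1:-1],
                              targets[0, -stride:], reduction="sum")
    return nll
```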

## Architecture

PR #549 stack with Parallel Muon:

| Component | Setting |
|-----------|---------|
| Layers | 11 (512d, 8H, 4KV) |
| MLP | 3x with LeakyReLU(0.5)^2 |
| BigramHash | 1536 |
| XSA | **All 11 layers** (was last 4) |
| RoPE | Partial (16/64 dims) |
| LN Scale | 1/sqrt(layer+1) |
| VE128 | Layers 9-10 |
| QK Gain | **5.0** (was 1.5) |
| Weight avg | EMA(0.997) + Tight SWA(every 50) |
| Quantization | GPTQ-lite int6 + lzma |
| Optimizer | **MuonEq-R** + Parallel Muon |
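
For reference, a sketch of one common way the QK gain in the table above can enter the attention block, assuming a QK-norm-style RMS normalization of queries and keys with a learnable gain initialized from `QK_GAIN_INIT`; the real wiring is whatever `train_gpt.py` does, so treat this as illustrative only:

```python
import torch
import torch.nn.functional as F

class QKGain(torch.nn.Module):
    """RMS-normalize q (or k) over head_dim, then scale by a learnable gain.

    Illustrative placement only; init=5.0 mirrors QK_GAIN_INIT=5.0
    from the run command.
    """

    def __init__(self, head_dim: int, init: float = 5.0):
        super().__init__()
        self.gain = torch.nn.Parameter(torch.full((head_dim,), init))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, heads, seq, head_dim)
        return F.normalize(x, dim=-1) * self.gain
```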

## Run Command

```bash
NUM_LAYERS=11 BIGRAM_VOCAB_SIZE=1536 XSA_LAST_N=11 \
QK_GAIN_INIT=5.0 \
EMA_ENABLED=1 EMA_DECAY=0.997 SWA_ENABLED=1 SWA_EVERY=50 \
ROPE_DIMS=16 LN_SCALE=1 LATE_QAT=1 LATE_QAT_THRESHOLD=0.15 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
TTT_ENABLED=1 TTT_LR=0.002 TTT_EPOCHS=3 TTT_CHUNK_TOKENS=32768 \
TTT_FREEZE_BLOCKS=0 TTT_MOMENTUM=0.9 TTT_BATCH_SEQS=32 TTT_GRAD_CLIP=1.0 \
SLOT_ENABLED=1 SLOT_STEPS=8 SLOT_LR=0.005 \
MUON_WD=0.04 ADAM_WD=0.04 \
MATRIX_LR=0.025 SCALAR_LR=0.025 TIED_EMBED_LR=0.035 \
MUON_MOMENTUM=0.99 MUON_MOMENTUM_WARMUP_START=0.92 \
MUON_MOMENTUM_WARMUP_STEPS=1500 WARMDOWN_ITERS=3500 \
ITERATIONS=9000 MAX_WALLCLOCK_SECONDS=600 EVAL_STRIDE=64 \
SEED=1337 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

## Results

TBD — pending 3-seed validation on 8xH100.

## Legality

- MuonEq-R: standard optimizer improvement, no rule restriction
- Context-Only SLOT: causal by construction — delta optimized on past tokens only, new tokens excluded from loss
- XSA-all: no new parameters, architectural choice
- QK_GAIN=5.0: hyperparameter choice
- No n-gram cache, no two-pass rescoring, no eval-time GPTQ
- Score-first TTT follows PR #461 legal protocol

## Credits

- **Base model + TTT**: PR #549 (@abaybektursun), PR #414 (@signalrush), PR #461 (@Christopher-Lee-McClendon)
- **MuonEq-R**: arXiv:2603.28254, validated in PR #1260
- **SLOT**: Hu et al. arXiv:2505.12392v2, Context-Only variant from PR #1217 (@dexhunter)
- **QK-Gain sweep**: PR #1217
- **XSA-all**: PR #1019
---
#!/bin/bash
set -euo pipefail

SEED=${SEED:-1337}
NPROC=${NPROC:-8}
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"

echo "=== MuonEq-R + SLOT + XSA-all + QK-Gain 5.0 ==="
echo "Seed: $SEED | GPUs: $NPROC"

export NUM_LAYERS=11
export BIGRAM_VOCAB_SIZE=1536
export XSA_LAST_N=11
export QK_GAIN_INIT=5.0
export EMA_ENABLED=1
export EMA_DECAY=0.997
export SWA_ENABLED=1
export SWA_EVERY=50
export ROPE_DIMS=16
export LN_SCALE=1
export LATE_QAT=1  # enable flag; matches LATE_QAT=1 in the README run command
export LATE_QAT_THRESHOLD=0.15
export VE_ENABLED=1
export VE_DIM=128
export VE_LAYERS=9,10
export TTT_ENABLED=1
export TTT_LR=0.002
export TTT_EPOCHS=3
export TTT_CHUNK_TOKENS=32768
export TTT_FREEZE_BLOCKS=0
export TTT_MOMENTUM=0.9
export TTT_BATCH_SEQS=32
export TTT_GRAD_CLIP=1.0
export SLOT_ENABLED=1
export SLOT_STEPS=8
export SLOT_LR=0.005
export MUON_WD=0.04
export ADAM_WD=0.04
export MATRIX_LR=0.025
export SCALAR_LR=0.025
export TIED_EMBED_LR=0.035
export MUON_MOMENTUM=0.99
export MUON_MOMENTUM_WARMUP_START=0.92
export MUON_MOMENTUM_WARMUP_STEPS=1500
export WARMDOWN_ITERS=3500
export ITERATIONS=9000
export MAX_WALLCLOCK_SECONDS=600
export EVAL_STRIDE=64
export TRAIN_SEQ_LEN=2048
export TRAIN_BATCH_TOKENS=786432
export VAL_LOSS_EVERY=0
export TRAIN_LOG_EVERY=200
export SEED=$SEED

torchrun --standalone --nproc_per_node=$NPROC "$SCRIPT_DIR/train_gpt.py"
---
{
"track": "track_10min_16mb",
"val_bpb": null,
"author": {
"name": "Zorawar Sandhu",
"github": "BiggerDABOSS"
},
"hardware": "8xH100 SXM",
"training_time_seconds": 600,
"artifact_bytes": null,
"base_submission": "2026-03-23_LeakyReLU_LegalTTT_ParallelMuon",
"techniques": [
"MuonEq-R (row-normalize before NS)",
"Context-Only SLOT (causal delta optimization)",
"XSA all 11 layers",
"QK_GAIN_INIT=5.0",
"LeakyReLU(0.5)^2",
"Legal Score-First TTT",
"GPTQ-lite int6 + LZMA",
"EMA(0.997) + SWA(50)",
"Partial RoPE(16/64)",
"LN Scale",
"Value Embedding"
]
}