# Run 009: SP1024 + Looping + TTT 10ep (PR #1487 Tuning)

## Hypothesis

Apply PR #1487's TTT hyperparameter tuning to our SP1024 + Looping architecture.

**Expected gain: ~0.008 BPB** (based on PR #1487's ablation showing -0.0079 BPB from tuning alone)

## Configuration Changes vs Run 007/008

| Parameter | Run 007/008 | Run 009 (PR #1487 tuning) | Expected Impact |
|-----------|-------------|---------------------------|-----------------|
| **TTT Epochs** | 6 | **10** | More adaptation time |
| **TTT LR** | 0.0005 | **0.00045** | More stable fine-tuning |
| **TTT Freeze Blocks** | 2 | **1** | More layers can adapt |
| **QK-Gain** | 5.0 | **5.25** | Sharper attention |
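The tuned pre-quant TTT pass in the table above can be sketched as a short fine-tuning loop. This is a minimal illustration of the schedule and freezing semantics, not the repo's actual API; the function names are placeholders:

```python
import math

def ttt_lr(step: int, total_steps: int, base_lr: float = 0.00045) -> float:
    """Cosine-decayed TTT learning rate (PREQUANT_TTT_COSINE_DECAY=1):
    starts at base_lr, decays to ~0 by the final step."""
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * step / total_steps))

def split_frozen(blocks: list, freeze_blocks: int = 1):
    """PREQUANT_TTT_FREEZE_BLOCKS: the first N blocks stay frozen,
    the remainder receive TTT gradient updates."""
    return blocks[:freeze_blocks], blocks[freeze_blocks:]

# With 11 physical layers and freeze=1, ten blocks remain trainable.
frozen, trainable = split_frozen(list(range(11)), freeze_blocks=1)
```

Dropping freeze from 2 blocks to 1 means one extra early block participates in the 10-epoch adaptation, which is the "more layers can adapt" effect claimed above.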

## Architecture (Unchanged from Run 007/008)

- **Tokenizer**: SP1024 (novel parameter reallocation)
- **Layers**: 11 physical
- **Looping**: 2 loops on layers 4-5, enabled at step 0.5
- **Parallel residuals**: From layer 7+
- **EMA decay**: 0.9965
- **GPTQ int6 + Brotli** compression
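The looping scheme above (2 loops on layers 4-5) reuses the same physical blocks within one forward pass. A minimal control-flow sketch, with placeholder blocks standing in for real transformer layers:

```python
def forward_with_looping(x, blocks, loop_start=4, loop_end=5,
                         num_loops=2, looping_on=True):
    """Run the physical blocks in order, applying blocks[loop_start..loop_end]
    num_loops times each once looping is enabled (here: after 50% of training)."""
    for i, block in enumerate(blocks):
        repeats = num_loops if (looping_on and loop_start <= i <= loop_end) else 1
        for _ in range(repeats):
            x = block(x)
    return x

# Toy blocks that each add 1, so the output counts block applications:
# 11 physical layers + 2 extra looped applications = 13 calls.
blocks = [lambda v: v + 1 for _ in range(11)]
```

So the model has 11 layers' worth of parameters but an effective depth of 13 once `ENABLE_LOOPING_AT=0.5` kicks in.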

## Target Metrics

| Metric | Run 007/008 | Run 009 Target |
|--------|-------------|----------------|
| **val_bpb (3-seed mean)** | 1.07389 | **~1.066** |
| **vs Official SOTA (1.1147)** | -0.041 BPB | **~-0.049 BPB** |
| **Training time** | 588s | ~600s (TTT adds ~40s) |
| **Artifact size** | ~13.87 MB | ~14.0 MB |
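The val_bpb figures above relate to per-token cross-entropy by the usual nats-to-bits conversion, scaled by tokens per byte. A minimal sketch (the exact token/byte accounting used by the harness is assumed):

```python
import math

def loss_to_bpb(val_loss_nats: float, total_tokens: int, total_bytes: int) -> float:
    """Bits-per-byte: nats/token -> bits/token (divide by ln 2),
    then bits/token -> bits/byte (scale by tokens per byte)."""
    return (val_loss_nats / math.log(2)) * (total_tokens / total_bytes)
```

This is why a larger vocab like SP1024 helps the metric: fewer tokens per byte directly scales down the bits-per-byte figure for the same per-token loss.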

## Compliance (Track A)

- Pre-quant TTT trains on validation data BEFORE quantization
- Result baked into artifact — fixed predictor at eval time
- No eval-time adaptation, no SLOT, no n-gram cache
- All artifacts < 16MB
- Training wallclock < 600s
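The two hard budgets in the list above can be expressed as a simple gate; whether the track counts MB as 2^20 or 10^6 bytes is an assumption here (binary MiB shown):

```python
def check_track_a(artifact_bytes: int, wallclock_seconds: float) -> bool:
    """Track A budget gates: artifact under 16 MB (assumed binary MiB)
    and training wallclock under 600 s."""
    return artifact_bytes < 16 * 1024 * 1024 and wallclock_seconds < 600
```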

## Reproduction Command

```bash
export SEED=314 VOCAB_SIZE=1024 NUM_LAYERS=11 MODEL_DIM=512
export NUM_LOOPS=2 LOOP_START=4 LOOP_END=5 ENABLE_LOOPING_AT=0.5
export PARALLEL_START_LAYER=7
export PREQUANT_TTT_ENABLED=1 PREQUANT_TTT_LR=0.00045 PREQUANT_TTT_EPOCHS=10 PREQUANT_TTT_FREEZE_BLOCKS=1
export QK_GAIN_INIT=5.25 EMA_DECAY=0.9965
export EMBED_BITS=8 MATRIX_BITS=6 COMPRESSOR=brotli GPTQ_ENABLED=1
export SLIDING_WINDOW_ENABLED=1 ETLB_ENABLED=1
export TRAIN_SEQ_LEN=2048 MAX_WALLCLOCK_SECONDS=600
export TRAIN_BATCH_TOKENS=786432
torchrun --nproc_per_node=8 train_gpt.py
```

## Credits

- **TTT hyperparameter tuning**: PR #1487 by @ndokutovich
- **SP1024 + Looping baseline**: Our Run 007/008
- **Base architecture**: Parameter Golf community

## Run Log

| Seed | Pre-quant BPB | Post-TTT BPB | Final BPB | Status |
|------|---------------|--------------|-----------|--------|
| 314 | TBD | TBD | TBD | Pending |
| 42 | TBD | TBD | TBD | Pending |
| 999 | TBD | TBD | TBD | Pending |
| **Mean** | - | - | **TBD** | - |
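Once the three seeds finish, the 3-seed mean and sample standard deviation for the record can be filled in with a small helper. The per-seed result shape is assumed, and the numbers below are placeholders, not real results:

```python
import statistics

def aggregate_seeds(seed_results: dict) -> dict:
    """Compute the 3-seed mean and sample std for the record's
    val_bpb / val_bpb_std fields from per-seed final BPB values."""
    bpbs = [r["val_bpb"] for r in seed_results.values()]
    return {
        "val_bpb": statistics.mean(bpbs),
        "val_bpb_std": statistics.stdev(bpbs),
    }

# Placeholder example (not real results):
demo = aggregate_seeds({
    "314": {"val_bpb": 1.066},
    "42":  {"val_bpb": 1.067},
    "999": {"val_bpb": 1.065},
})
```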
---
numpy
tqdm
torch
huggingface-hub
kernels
setuptools
typing-extensions==4.15.0
datasets
tiktoken
sentencepiece
flash-attn>=3.0.0
brotli
---
#!/bin/bash
# Run 009: Apply PR #1487 TTT hyperparameter tuning to our SP1024 + Looping architecture
# Hypothesis: TTT 10ep + lr=0.00045 + freeze=1 + QK=5.25 will gain ~0.008 BPB over Run 007/008
# Expected: val_bpb ~1.066 (vs 1.0739 baseline)

set -e

# Core architecture (same as Run 007/008)
export VOCAB_SIZE=1024 NUM_LAYERS=11 MODEL_DIM=512 NUM_HEADS=8 NUM_KV_HEADS=4 MLP_MULT=4.0
export NUM_LOOPS=2 LOOP_START=4 LOOP_END=5 ENABLE_LOOPING_AT=0.5
export PARALLEL_START_LAYER=7

# TTT hyperparameters (PR #1487 tuning)
export PREQUANT_TTT_ENABLED=1
export PREQUANT_TTT_LR=0.00045 # was 0.0005
export PREQUANT_TTT_EPOCHS=10 # was 6
export PREQUANT_TTT_FREEZE_BLOCKS=1 # was 2
export PREQUANT_TTT_BATCH_SEQS=32
export PREQUANT_TTT_GRAD_CLIP=1.0
export PREQUANT_TTT_COSINE_DECAY=1

# QK-Gain (PR #1487 tuning)
export QK_GAIN_INIT=5.25 # was 5.0

# Other settings (same as Run 007/008)
export EMA_DECAY=0.9965
export EMBED_BITS=8 MATRIX_BITS=6 COMPRESSOR=brotli GPTQ_ENABLED=1
export SLIDING_WINDOW_ENABLED=1 ETLB_ENABLED=1
export TRAIN_SEQ_LEN=2048 MAX_WALLCLOCK_SECONDS=600 WARMDOWN_FRAC=0.667 WARMUP_STEPS=20
export TRAIN_BATCH_TOKENS=786432
export MIN_LR=0.0 EMBED_LR=0.6 HEAD_LR=0.008 TIED_EMBED_LR=0.03 MATRIX_LR=0.04 SCALAR_LR=0.02

# Run 3 seeds for statistical significance
for SEED in 314 42 999; do
  echo "=== Run 009: Seed $SEED ==="
  echo "TTT: 10ep, lr=0.00045, freeze=1 | QK-Gain: 5.25"
  export SEED
  torchrun --nproc_per_node=8 records/track_10min_16mb/2026-04-09_SP1024_Loop45_TTT10ep_QK525/train_gpt.py
done
---
{
"author": "Joshua Martinez",
"github_id": "your-github-id",
"name": "SP1024 + Looping (L4-5) + Pre-Quant TTT (10ep, lr=0.00045, freeze=1) + QK-Gain 5.25",
"blurb": "PR #1487 TTT hyperparameter tuning applied to SP1024 + Looping architecture. TTT: 10 epochs (vs 6), lr=0.00045 (vs 0.0005), freeze 1 block (vs 2), QK-Gain 5.25 (vs 5.0). Expected ~0.008 BPB improvement over 1.07389 baseline.",
"date": "2026-04-09T19:00:00Z",
"val_loss": null,
"val_bpb": null,
"val_loss_std": null,
"val_bpb_std": null,
"seeds": [314, 42, 999],
"seed_results": {},
"pre_quant_val_loss": null,
"pre_quant_val_bpb": null,
"step_stop": null,
"wallclock_seconds": null,
"eval_time_seconds": null,
"bytes_total": null,
"bytes_model_int6_brotli": null,
"bytes_code": null,
"run_notes": "Applying PR #1487 hyperparameter tuning to our SP1024 + Looping baseline (Run 007/008). Hypothesis: ~0.008 BPB improvement from TTT config alone."
}