89 changes: 89 additions & 0 deletions WINNING_RUNBOOK.md
@@ -0,0 +1,89 @@
# Parameter Golf Winning Runbook ($25 Edition)

This runbook is optimized for a low budget and high decision quality.
Goal: make one real SOTA attempt without wasting credits.

## 0) Non-negotiables

- Beat the current SOTA with margin and significance, not one lucky seed.
- Keep the artifact under `16,000,000` bytes (16 MB, decimal).
- Never use validation or training data in rule-violating ways during quantization/eval.
- Prefer cheap filtering first, then expensive confirmation.

## 1) Budget split

- Phase A (cheap filtering): `$8`
- Phase B (1xH100 confirmation): `$9`
- Phase C (final 8xH100 reproducibility): `$8` (or wait for grant)

If you get OpenAI credits, expand Phase C to 3-seed evidence.

## 2) Exact baseline to start from

Use this folder as your starting point:

- `records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/`

Do not start from old baseline scripts.

## 3) Runpod setup (first pod)

1. Launch a cheap single-GPU pod first (L40/4090/5090 class).
2. SSH in and run:
- `cd /workspace`
- `git clone https://github.com/openai/parameter-golf.git`
- `cd parameter-golf`
- `python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1`
3. Copy your working `train_gpt.py` candidate into a new local work folder.

## 4) Experiment matrix (run in this order)

Use `commands/runpod_experiments.sh`.

Design principles:
- Change one high-impact axis at a time.
- Keep all other variables fixed.
- Promote only stable gains.

Priority axes:
- `GPTQ_CALIB_BATCHES`: 192/256/320
- `GPTQ_BLOCK_SIZE`: 128/256
- `BIGRAM_DIM`: 96/112/128 (with `BIGRAM_VOCAB_SIZE=3072`)
- `WARMDOWN_ITERS`: 3500/4000/4500
- `TARGET_MB`: 15.85/15.90
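
The one-axis-at-a-time matrix above can be sketched as a small config generator. This is an illustrative sketch, not part of the repo: the control values mirror `ctrl_seed314` in `commands/runpod_experiments.sh`, and the function name `one_axis_configs` is made up for this example.

```python
# Sketch: enumerate ablation configs that differ from the control in at most
# one axis, matching the "change one high-impact axis at a time" principle.

# Control config mirrors ctrl_seed314 in commands/runpod_experiments.sh.
CONTROL = {
    "GPTQ_CALIB_BATCHES": 256,
    "GPTQ_BLOCK_SIZE": 128,
    "BIGRAM_DIM": 112,
    "WARMDOWN_ITERS": 4000,
    "TARGET_MB": 15.9,
}

# Candidate values per axis, taken from the priority list above
# (control values omitted, since the control run covers them).
AXES = {
    "GPTQ_CALIB_BATCHES": [192, 320],
    "GPTQ_BLOCK_SIZE": [256],
    "BIGRAM_DIM": [96, 128],
    "WARMDOWN_ITERS": [3500, 4500],
    "TARGET_MB": [15.85],
}


def one_axis_configs():
    """Yield (run_id, env_dict) pairs, varying exactly one axis per run."""
    yield "ctrl", dict(CONTROL)
    for axis, values in AXES.items():
        for value in values:
            cfg = dict(CONTROL)
            cfg[axis] = value
            yield f"{axis.lower()}_{value}", cfg
```

Each yielded dict maps directly onto the env vars the shell scripts pass via `env`, so the matrix stays in one place instead of being duplicated across `run_one` calls.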

## 5) Stop/go rules (strict)

- If a run regresses by `>= 0.0015 bpb` vs control: stop that branch.
- If a run improves by `< 0.0007 bpb`: do not promote.
- Promote only if the gain holds across 2 seeds (a cheap pod is fine for this check).
- Spend H100 time only on the top 1-2 configs.
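
The thresholds above can be sketched as a tiny decision helper (the function name `stop_go` is made up for this example; the cutoffs are copied from the rules):

```python
def stop_go(control_bpb: float, candidate_bpb: float) -> str:
    """Apply the strict stop/go rules to one candidate vs the control.

    Lower bpb is better, so delta > 0 means the candidate improved.
    "promote-candidate" still requires the gain to hold on a second seed
    before any H100 spend.
    """
    delta = control_bpb - candidate_bpb
    if delta <= -0.0015:  # regressed by >= 0.0015 bpb: kill the branch
        return "stop"
    if delta < 0.0007:    # gain too small to trust: keep but do not promote
        return "hold"
    return "promote-candidate"
```

Running every candidate through one function like this keeps the cutoffs consistent across the ablation batch instead of eyeballing each log.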

## 6) Promotion checklist before H100

- Script runs clean with no dependency errors.
- Final lines print `val_bpb` and compressed model size.
- Artifact clearly below the 16 MB target.
- No rule-violating data access in quantization/eval path.
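
The mechanical parts of this checklist (metrics printed, artifact under the cap) can be automated with a small pre-flight check. A sketch under assumptions: the `val_bpb:` log format matches the regexes in `commands/summarize_logs.py`, and `preflight` is a hypothetical helper name.

```python
import re
from pathlib import Path

SIZE_LIMIT = 16_000_000  # decimal bytes, the runbook's hard cap


def preflight(log_path: str, artifact_path: str) -> list[str]:
    """Cheap pre-H100 sanity check; an empty return value means go."""
    problems = []

    # The final lines must print val_bpb (same format summarize_logs.py parses).
    text = Path(log_path).read_text(errors="ignore")
    if not re.search(r"val_bpb:[0-9.]+", text):
        problems.append("log has no val_bpb line")

    # The compressed artifact must be clearly below the 16 MB decimal cap.
    size = Path(artifact_path).stat().st_size
    if size >= SIZE_LIMIT:
        problems.append(f"artifact {size} bytes >= {SIZE_LIMIT}")

    return problems
```

The one item this cannot check is rule-violating data access in the quantization/eval path; that still needs a manual read of the diff.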

## 7) Submission checklist

Create a new folder in `records/track_10min_16mb/<date>_<name>/` with:

- `README.md` (what changed, why, exact command)
- `submission.json`
- `train_gpt.py`
- `train.log` (or multiple logs for significance)
- `requirements.txt` only if non-default deps were needed
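
Before opening the PR, the required files can be verified with a few lines (a sketch; `missing_files` is a hypothetical helper, and `requirements.txt` is deliberately excluded since it is optional):

```python
from pathlib import Path

# Required contents of records/track_10min_16mb/<date>_<name>/ per the
# checklist above; requirements.txt is optional, so it is not listed here.
REQUIRED = ["README.md", "submission.json", "train_gpt.py", "train.log"]


def missing_files(record_dir: str) -> list[str]:
    """Return the required submission files missing from record_dir."""
    d = Path(record_dir)
    return [name for name in REQUIRED if not (d / name).is_file()]
```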

## 8) Your daily cadence (copy exactly)

1. 6 cheap ablations (Phase A).
2. Pick top 2 and re-run with new seeds.
3. Move best to 1xH100 (Phase B).
4. If still positive, run final reproducibility pass (Phase C or grant credits).
5. Submit PR the same day while evidence is fresh.

## 9) One hard rule

Do not chase tiny LR/WD decimal tweaks until your quantization + calibration stack is already clearly beating your current best.
53 changes: 53 additions & 0 deletions commands/run_remaining_after_a2_4090.sh
@@ -0,0 +1,53 @@
#!/usr/bin/env bash
# Run experiments b1, c1, c2, d1, d2 after ctrl + a1 + a2 are done (1x RTX 4090, low VRAM).
# Usage: bash commands/run_remaining_after_a2_4090.sh /path/to/train_gpt.py
set -euo pipefail

TRAIN_SCRIPT="${1:-records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/train_gpt.py}"

if [[ ! -f "${TRAIN_SCRIPT}" ]]; then
echo "ERROR: train script not found: ${TRAIN_SCRIPT}"
exit 1
fi

export TRAIN_BATCH_TOKENS="${TRAIN_BATCH_TOKENS:-196608}"
export PYTORCH_CUDA_ALLOC_CONF="${PYTORCH_CUDA_ALLOC_CONF:-expandable_segments:True}"
export DATA_PATH="${DATA_PATH:-./data/datasets/fineweb10B_sp1024/}"
export TOKENIZER_PATH="${TOKENIZER_PATH:-./data/tokenizers/fineweb_1024_bpe.model}"
export VOCAB_SIZE="${VOCAB_SIZE:-1024}"
export MAX_WALLCLOCK_SECONDS="${MAX_WALLCLOCK_SECONDS:-600}"
export TRAIN_SEQ_LEN="${TRAIN_SEQ_LEN:-2048}"
export EVAL_SEQ_LEN="${EVAL_SEQ_LEN:-2048}"
export VAL_LOSS_EVERY="${VAL_LOSS_EVERY:-0}"
export TARGET_MB="${TARGET_MB:-15.9}"
export BIGRAM_VOCAB_SIZE="${BIGRAM_VOCAB_SIZE:-3072}"
export XSA_LAST_N="${XSA_LAST_N:-11}"

mkdir -p logs

run_one() {
local run_id="$1"
shift
echo "==== START ${run_id} ===="
RUN_ID="${run_id}" "$@" torchrun --standalone --nproc_per_node=1 "${TRAIN_SCRIPT}" \
2>&1 | tee "logs/${run_id}.log"
echo "==== END ${run_id} ===="
}

run_one "b1_block256_seed314" env \
SEED=314 WARMDOWN_ITERS=4000 GPTQ_CALIB_BATCHES=256 GPTQ_BLOCK_SIZE=256 BIGRAM_DIM=112

run_one "c1_bigram96_seed314" env \
SEED=314 WARMDOWN_ITERS=4000 GPTQ_CALIB_BATCHES=256 GPTQ_BLOCK_SIZE=128 BIGRAM_DIM=96 TARGET_MB=15.85

run_one "c2_bigram128_seed314" env \
SEED=314 WARMDOWN_ITERS=4000 GPTQ_CALIB_BATCHES=256 GPTQ_BLOCK_SIZE=128 BIGRAM_DIM=128 TARGET_MB=15.90

run_one "d1_warm3500_seed314" env \
SEED=314 WARMDOWN_ITERS=3500 GPTQ_CALIB_BATCHES=256 GPTQ_BLOCK_SIZE=128 BIGRAM_DIM=112

run_one "d2_warm4500_seed314" env \
SEED=314 WARMDOWN_ITERS=4500 GPTQ_CALIB_BATCHES=256 GPTQ_BLOCK_SIZE=128 BIGRAM_DIM=112

echo "All five runs done."
python3 commands/summarize_logs.py logs
73 changes: 73 additions & 0 deletions commands/runpod_experiments.sh
@@ -0,0 +1,73 @@
#!/usr/bin/env bash
set -euo pipefail

# Usage:
# bash commands/runpod_experiments.sh /path/to/train_gpt.py
#
# Example:
# bash commands/runpod_experiments.sh records/track_10min_16mb/2026-03-25_ValCalib_GPTQ_XSA_BigramHash3072/train_gpt.py
#
# This script runs a low-cost ablation matrix on 1 GPU.

TRAIN_SCRIPT="${1:-train_gpt.py}"

if [[ ! -f "${TRAIN_SCRIPT}" ]]; then
echo "ERROR: train script not found: ${TRAIN_SCRIPT}"
exit 1
fi

export DATA_PATH="${DATA_PATH:-./data/datasets/fineweb10B_sp1024/}"
export TOKENIZER_PATH="${TOKENIZER_PATH:-./data/tokenizers/fineweb_1024_bpe.model}"
export VOCAB_SIZE="${VOCAB_SIZE:-1024}"
export MAX_WALLCLOCK_SECONDS="${MAX_WALLCLOCK_SECONDS:-600}"
export TRAIN_BATCH_TOKENS="${TRAIN_BATCH_TOKENS:-786432}"
export TRAIN_SEQ_LEN="${TRAIN_SEQ_LEN:-2048}"
export EVAL_SEQ_LEN="${EVAL_SEQ_LEN:-2048}"
export VAL_LOSS_EVERY="${VAL_LOSS_EVERY:-0}"
export TARGET_MB="${TARGET_MB:-15.9}"
export BIGRAM_VOCAB_SIZE="${BIGRAM_VOCAB_SIZE:-3072}"
export XSA_LAST_N="${XSA_LAST_N:-11}"

mkdir -p logs

run_one() {
local run_id="$1"
shift
echo "==== START ${run_id} ===="
RUN_ID="${run_id}" "$@" torchrun --standalone --nproc_per_node=1 "${TRAIN_SCRIPT}" \
2>&1 | tee "logs/${run_id}.log"
echo "==== END ${run_id} ===="
}

# Control
run_one "ctrl_seed314" env \
SEED=314 WARMDOWN_ITERS=4000 GPTQ_CALIB_BATCHES=256 GPTQ_BLOCK_SIZE=128 BIGRAM_DIM=112

# A1/A2: calibration coverage
run_one "a1_calib192_seed314" env \
SEED=314 WARMDOWN_ITERS=4000 GPTQ_CALIB_BATCHES=192 GPTQ_BLOCK_SIZE=128 BIGRAM_DIM=112

run_one "a2_calib320_seed314" env \
SEED=314 WARMDOWN_ITERS=4000 GPTQ_CALIB_BATCHES=320 GPTQ_BLOCK_SIZE=128 BIGRAM_DIM=112

# B1: GPTQ block size
run_one "b1_block256_seed314" env \
SEED=314 WARMDOWN_ITERS=4000 GPTQ_CALIB_BATCHES=256 GPTQ_BLOCK_SIZE=256 BIGRAM_DIM=112

# C1/C2: bigram dim tradeoff
run_one "c1_bigram96_seed314" env \
SEED=314 WARMDOWN_ITERS=4000 GPTQ_CALIB_BATCHES=256 GPTQ_BLOCK_SIZE=128 BIGRAM_DIM=96 TARGET_MB=15.85

run_one "c2_bigram128_seed314" env \
SEED=314 WARMDOWN_ITERS=4000 GPTQ_CALIB_BATCHES=256 GPTQ_BLOCK_SIZE=128 BIGRAM_DIM=128 TARGET_MB=15.90

# D1/D2: warmdown schedule
run_one "d1_warm3500_seed314" env \
SEED=314 WARMDOWN_ITERS=3500 GPTQ_CALIB_BATCHES=256 GPTQ_BLOCK_SIZE=128 BIGRAM_DIM=112

run_one "d2_warm4500_seed314" env \
SEED=314 WARMDOWN_ITERS=4500 GPTQ_CALIB_BATCHES=256 GPTQ_BLOCK_SIZE=128 BIGRAM_DIM=112

echo "All runs done. Logs are in logs/."
echo "Next: parse results quickly with:"
echo " rg -n \"val_bpb|final_int8_zlib_roundtrip|artifact|compressed\" logs/*.log"
59 changes: 59 additions & 0 deletions commands/summarize_logs.py
@@ -0,0 +1,59 @@
#!/usr/bin/env python3
"""
Summarize Parameter Golf train logs. Prefer leaderboard-style sliding BPB
(final_int6_sliding_window_exact) over plain val_bpb or roundtrip lines.
"""
import re
import sys
from pathlib import Path

# Leaderboard-style metric (see record README "Sliding BPB")
SLIDING_EXACT = re.compile(
r"final_int6_sliding_window_exact val_loss:[0-9.e+-]+ val_bpb:([0-9.]+)"
)
SLIDING_S64_EXACT = re.compile(
r"final_int6_sliding_window_s64_exact val_loss:[0-9.e+-]+ val_bpb:([0-9.]+)"
)
ROUNDTRIP_EXACT = re.compile(
r"final_int6_roundtrip_exact val_loss:[0-9.e+-]+ val_bpb:([0-9.]+)"
)
VAL_ANY = re.compile(r"val_bpb:([0-9.]+)")
ARTIFACT_RE = re.compile(
r"Total submission size int6\+lzma:\s*([0-9]+)\s*bytes?", re.IGNORECASE
)


def parse_file(path: Path):
text = path.read_text(encoding="utf-8", errors="ignore")

val = "NA"
for pattern in (SLIDING_EXACT, SLIDING_S64_EXACT, ROUNDTRIP_EXACT):
m = pattern.search(text)
if m:
val = m.group(1)
break
if val == "NA":
vals = VAL_ANY.findall(text)
val = vals[-1] if vals else "NA"

art_m = ARTIFACT_RE.search(text)
art = art_m.group(1) if art_m else "NA"

return val, art


def main():
log_dir = Path(sys.argv[1]) if len(sys.argv) > 1 else Path("logs")
files = sorted(log_dir.glob("*.log"))
if not files:
print(f"No log files found in {log_dir}")
return

print("run_id,sliding_val_bpb_or_best_available,artifact_bytes_int6_lzma")
for f in files:
val, art = parse_file(f)
print(f"{f.stem},{val},{art}")


if __name__ == "__main__":
main()
@@ -5,17 +5,22 @@
## Run Command

```bash
# Setup (once)
# Setup (once): downloads tokenizer + training/val shards (default 80 train shards)
bash prepare.sh

# Train + evaluate (default seed=42)
bash eval/eval.sh

# With specific seed
SEED=42 bash eval/eval.sh

# Quick smoke test (few steps, small batch; set SWA_ENABLED=0) — not a leaderboard score
bash eval/smoke.sh
```

All parameters are set as defaults in `train_gpt.py`. No env vars needed.
All parameters are set as defaults in `train_gpt.py`. No env vars needed for a full run.

For a smaller download while iterating locally, run `TRAIN_SHARDS=1 bash prepare.sh` from this directory (or pass `--train-shards 1` to `data/cached_challenge_fineweb.py` from the repo root).

## 3-Seed Results

@@ -0,0 +1,13 @@
#!/usr/bin/env bash
# Full leaderboard training run (defaults match submission; ~10 min cap on 1xH100).
set -euo pipefail
RECORD_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
ROOT="$(cd "$RECORD_DIR/../../.." && pwd)"
cd "$ROOT"
export DATA_PATH="${DATA_PATH:-$ROOT/data/datasets/fineweb10B_sp1024}"
export TOKENIZER_PATH="${TOKENIZER_PATH:-$ROOT/data/tokenizers/fineweb_1024_bpe.model}"
export VOCAB_SIZE="${VOCAB_SIZE:-1024}"
export RUN_ID="${RUN_ID:-10L_Int5MLP_MuonWD04_SWA50}"
export SEED="${SEED:-42}"
NPROC="${NPROC:-1}"
exec torchrun --standalone --nproc_per_node="${NPROC}" "${RECORD_DIR}/train_gpt.py"
@@ -0,0 +1,21 @@
#!/usr/bin/env bash
# Tiny run to verify CUDA, data paths, and script wiring (not a score attempt).
set -euo pipefail
RECORD_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")/.." && pwd)"
ROOT="$(cd "$RECORD_DIR/../../.." && pwd)"
cd "$ROOT"
export DATA_PATH="${DATA_PATH:-$ROOT/data/datasets/fineweb10B_sp1024}"
export TOKENIZER_PATH="${TOKENIZER_PATH:-$ROOT/data/tokenizers/fineweb_1024_bpe.model}"
export VOCAB_SIZE="${VOCAB_SIZE:-1024}"
export RUN_ID="${RUN_ID:-smoke_10L_int5}"
export SEED="${SEED:-42}"
export ITERATIONS="${ITERATIONS:-8}"
export WARMUP_STEPS="${WARMUP_STEPS:-2}"
export WARMDOWN_ITERS="${WARMDOWN_ITERS:-4}"
export TRAIN_BATCH_TOKENS="${TRAIN_BATCH_TOKENS:-65536}"
export MAX_WALLCLOCK_SECONDS="${MAX_WALLCLOCK_SECONDS:-0}"
export VAL_LOSS_EVERY="${VAL_LOSS_EVERY:-0}"
export TRAIN_LOG_EVERY="${TRAIN_LOG_EVERY:-1}"
export SWA_ENABLED="${SWA_ENABLED:-0}"
NPROC="${NPROC:-1}"
exec torchrun --standalone --nproc_per_node="${NPROC}" "${RECORD_DIR}/train_gpt.py"
@@ -0,0 +1,7 @@
#!/usr/bin/env bash
# Download FineWeb shards + tokenizer into ./data/ (run from anywhere).
set -euo pipefail
ROOT="$(cd "$(dirname "${BASH_SOURCE[0]}")/../../.." && pwd)"
cd "$ROOT"
TRAIN_SHARDS="${TRAIN_SHARDS:-80}"
exec python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards "${TRAIN_SHARDS}"
@@ -851,8 +851,10 @@ def main() -> None:
from torch.backends.cuda import enable_cudnn_sdp, enable_flash_sdp, enable_math_sdp, enable_mem_efficient_sdp
enable_cudnn_sdp(False)
enable_flash_sdp(True)
enable_mem_efficient_sdp(False)
enable_math_sdp(False)
# Flash attention is unavailable on many Windows CUDA builds; without the
# math/mem-efficient fallbacks, SDPA raises "No available kernel".
sdp_fallback = bool(int(os.environ.get("ENABLE_MATH_SDP", "0"))) or sys.platform == "win32"
enable_mem_efficient_sdp(sdp_fallback)
enable_math_sdp(sdp_fallback)

logfile = None
if master_process: