diff --git a/notes/pr2_update_summary.md b/notes/pr2_update_summary.md
new file mode 100644
index 0000000000..4203893d1f
--- /dev/null
+++ b/notes/pr2_update_summary.md
@@ -0,0 +1,21 @@
+# PR #2 Update Summary
+
+## Chosen environment path
+
+- `runpod_parameter_golf`
+
+## Why chosen
+
+- It is the most concrete runnable path already documented in the repo.
+- It reduces ambiguity around dependencies, dataset placement, and GPU shape.
+
+## What remains before the evidence run
+
+- create or access the Runpod environment
+- clone the repo and check out `exp/eval-first-003`
+- download the published `sp1024` assets
+- execute the fixed baseline and candidate commands
+
+## Review state
+
+- review comments: none observed during this turn
diff --git a/notes/tpi_003_environment_decision.md b/notes/tpi_003_environment_decision.md
new file mode 100644
index 0000000000..fca3f82cd2
--- /dev/null
+++ b/notes/tpi_003_environment_decision.md
@@ -0,0 +1,71 @@
+# TPI-003 Environment Decision
+
+## Chosen environment
+
+- `runpod_parameter_golf`
+
+## Why chosen
+
+- The repository README already defines this path concretely.
+- It is the smallest runnable path with clear dependency assumptions.
+- It is closer to challenge conditions than an unspecified remote machine.
+- It keeps the same monkey-model eval-first branch and only changes environment readiness.
+
+## Rejected alternatives
+
+### `local_repair`
+
+- Rejected as primary path because three blockers stack at once:
+  - `torch` missing
+  - dataset/tokenizer assets missing
+  - GPU access blocked
+
+### `remote_gpu_small`
+
+- Rejected as primary path because it is less specific than the Runpod route.
+- It risks wasting time on ad hoc package, path, and logging setup that the documented Runpod path already solves more cleanly.
+
+## Required assets
+
+- repo checkout at `exp/eval-first-003`
+- Python environment with `torch`, `datasets`, `sentencepiece`
+- `/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/`
+- `/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model`
+- writable `logs/` and `runs/TPI-003/`
+
+## Baseline env vars
+
+- `DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/`
+- `TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model`
+- `VOCAB_SIZE=1024`
+- `TRAIN_SEQ_LEN=1024`
+- `EVAL_STRIDE=1024`
+- `MAX_WALLCLOCK_SECONDS=600`
+- `TRAIN_LOG_EVERY=50`
+- `VAL_LOSS_EVERY=200`
+
+## Candidate env vars
+
+- same as baseline except `EVAL_STRIDE=128`
+- optional second candidate: `EVAL_STRIDE=64`
+
+## Command skeleton
+
+```bash
+RUN_ID= \
+DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/ \
+TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
+VOCAB_SIZE=1024 \
+TRAIN_SEQ_LEN=1024 \
+EVAL_STRIDE= \
+MAX_WALLCLOCK_SECONDS=600 \
+TRAIN_LOG_EVERY=50 \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=1 train_gpt.py
+```
+
+## First command to run next turn
+
+```bash
+python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1
+```
diff --git a/notes/tpi_003_environment_plan.md b/notes/tpi_003_environment_plan.md
new file mode 100644
index 0000000000..ea6290d850
--- /dev/null
+++ b/notes/tpi_003_environment_plan.md
@@ -0,0 +1,44 @@
+# TPI-003 Environment Plan
+
+## Objective
+
+Select the smallest runnable environment path that can produce one real baseline/candidate evidence pair for the existing eval-first monkey-model policy.
+
+## Public-facing name
+
+`MonkeyModel_EvalFirst_MinRunnableEnv`
+
+## Candidate environment paths
+
+1. `local_repair`
+   - install missing dependencies locally
+   - acquire published dataset/tokenizer assets locally
+   - verify GPU/runtime availability locally
+
+2. `remote_gpu_small`
+   - use a minimal remote CUDA environment
+   - run one baseline/candidate pair with the same public-safe branch
+
+3. `runpod_parameter_golf`
+   - use the challenge-aligned Runpod path
+   - fetch the published dataset/tokenizer assets there
+   - execute one baseline/candidate pair under a more official environment shape
+
+## Selection criteria
+
+- shortest path to one real evidence pair
+- command simplicity
+- reproducibility
+- fit with current monkey-model eval-first branch
+- lowest setup overhead that still yields runtime + `val_bpb`
+
+## Current recommendation
+
+Prefer `runpod_parameter_golf` unless an already-usable remote CUDA environment exists. The local path is currently the weakest candidate because `torch`, assets, and GPU availability are all blocked at once.
+
+## Selection outcome for this turn
+
+- Chosen path: `runpod_parameter_golf`
+- Reason: it is the most explicit public-safe path already documented in the repo, with the least ambiguity about dependency readiness and challenge-compatible execution shape.
+- Deferred path: `remote_gpu_small`
+- Rejected primary path: `local_repair`
diff --git a/notes/tpi_003_execution_contract.md b/notes/tpi_003_execution_contract.md
new file mode 100644
index 0000000000..81db30b9f3
--- /dev/null
+++ b/notes/tpi_003_execution_contract.md
@@ -0,0 +1,116 @@
+# TPI-003 Execution Contract
+
+## Objective
+
+Fix one executable baseline/candidate command contract for the existing eval-first monkey-model branch.
+
+## Baseline contract
+
+- branch: `exp/eval-first-003`
+- mode: non-sliding validation behavior
+- effective setting: `EVAL_STRIDE=TRAIN_SEQ_LEN`
+- chosen environment: `runpod_parameter_golf`
+- target host shape for first pass: `1xH100` Runpod pod
+- tokenizer path: `/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model`
+- dataset path: `/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/`
+- logs:
+  - script-native log: `logs/${RUN_ID}.txt`
+  - turn note archive: `runs/TPI-003/`
+- commit SHA capture:
+  - `git rev-parse HEAD > runs/TPI-003/.commit.txt`
+
+## Candidate contract
+
+- branch: `exp/eval-first-003`
+- mode: eval-first sliding validation
+- primary candidate: `EVAL_STRIDE=128`
+- optional secondary candidate: `EVAL_STRIDE=64`
+
+## Baseline command
+
+```bash
+cd /workspace
+git clone https://github.com/gb250e/parameter-golf.git
+cd parameter-golf
+git checkout exp/eval-first-003
+python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1
+mkdir -p runs/TPI-003
+git rev-parse HEAD > runs/TPI-003/tpi003_baseline.commit.txt
+RUN_ID=tpi003_baseline_stride1024 \
+DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/ \
+TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
+VOCAB_SIZE=1024 \
+TRAIN_SEQ_LEN=1024 \
+EVAL_STRIDE=1024 \
+MAX_WALLCLOCK_SECONDS=600 \
+TRAIN_LOG_EVERY=50 \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=1 train_gpt.py | tee runs/TPI-003/tpi003_baseline.stdout.log
+```
+
+## Candidate command
+
+```bash
+cd /workspace/parameter-golf
+git checkout exp/eval-first-003
+mkdir -p runs/TPI-003
+git rev-parse HEAD > runs/TPI-003/tpi003_candidate_128.commit.txt
+RUN_ID=tpi003_candidate_stride128 \
+DATA_PATH=/workspace/parameter-golf/data/datasets/fineweb10B_sp1024/ \
+TOKENIZER_PATH=/workspace/parameter-golf/data/tokenizers/fineweb_1024_bpe.model \
+VOCAB_SIZE=1024 \
+TRAIN_SEQ_LEN=1024 \
+EVAL_STRIDE=128 \
+MAX_WALLCLOCK_SECONDS=600 \
+TRAIN_LOG_EVERY=50 \
+VAL_LOSS_EVERY=200 \
+torchrun --standalone --nproc_per_node=1 train_gpt.py | tee runs/TPI-003/tpi003_candidate_128.stdout.log
+```
+
+## Required env vars for both runs
+
+- `DATA_PATH`
+- `TOKENIZER_PATH`
+- `VOCAB_SIZE=1024`
+- `TRAIN_SEQ_LEN=1024`
+- `EVAL_STRIDE`
+- `MAX_WALLCLOCK_SECONDS=600`
+- `TRAIN_LOG_EVERY=50`
+- `VAL_LOSS_EVERY=200`
+
+## GPU assumption
+
+- first runnable path assumes `1xH100`
+- this is for evidence collection, not final record-track timing
+
+## Log policy
+
+- keep terminal stdout in `runs/TPI-003/*.stdout.log`
+- keep script-native logs in `logs/`
+- summarize runtime and `val_bpb` back into notes after the run
+
+## Next-turn first command
+
+```bash
+python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 1
+```
+
+## Minimum required assets
+
+- Python environment with `torch`, `datasets`, and `sentencepiece`
+- accessible CUDA runtime for `train_gpt.py`
+- published FineWeb cached shards or equivalent challenge-provided dataset path
+- published tokenizer model path
+
+## Minimum required capture
+
+- commit SHA
+- command line
+- runtime notes
+- whether the eval path was reached
+- final runtime summary
+- final `val_bpb` summary
+
+## Environment decision rule
+
+Choose the environment path that can produce one real baseline/candidate pair with the least additional setup while remaining reproducible and public-safe.
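
The log policy in `notes/tpi_003_execution_contract.md` calls for summarizing runtime and `val_bpb` back into the notes after each run. A minimal sketch of that extraction step follows; the log line shape `step <n> ... val_bpb <x>` is an assumption about `train_gpt.py` output and should be checked against a real `runs/TPI-003/*.stdout.log` before relying on it:

```python
import re

# Assumed log line shape: "step <n> ... val_bpb <x>". Adjust the pattern
# once a real train_gpt.py stdout log is available.
VAL_RE = re.compile(r"step\s+(\d+)\b.*?val_bpb\s+([0-9]*\.?[0-9]+)")

def summarize_run(log_text: str) -> dict:
    """Return the step and val_bpb of the last eval line, or Nones if absent."""
    matches = VAL_RE.findall(log_text)
    if not matches:
        return {"step": None, "val_bpb": None}
    step, bpb = matches[-1]  # the final reported eval is the run summary
    return {"step": int(step), "val_bpb": float(bpb)}

# Fabricated example log; a real run would read runs/TPI-003/*.stdout.log.
sample = "step 200 val_bpb 1.4321\nstep 400 val_bpb 1.3987\n"
print(summarize_run(sample))  # → {'step': 400, 'val_bpb': 1.3987}
```

The same function works for both the baseline and candidate logs, which keeps the "final `val_bpb` summary" capture mechanical rather than hand-copied.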