@@ -0,0 +1,78 @@
# Non-record Submission: Distributed 8xH100 Polar STE + QJL KV-cache Baseline

This folder captures the first end-to-end distributed Hopper baseline for the Polar STE + QJL KV-cache stack. The goal is not a leaderboard claim but to prove that the architecture compiles, trains, exports, reloads, and runs the final autoregressive KV evaluation on `8xH100 80GB HBM3` under DDP without deadlocks.

The run used:

- `WORLD_SIZE=8` via `torchrun --standalone --nproc_per_node=8`
- `QAT_SCHEME=polar`
- `WEIGHT_QUANT_SCHEME=polar`
- native `KV_QUANT_BACKEND=qjl`
- `ENABLE_TORCH_COMPILE=0`
- a hard `600s` wallclock with an internal `15s` finalization reserve
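Under `torchrun --standalone --nproc_per_node=8`, each worker discovers its DDP identity from environment variables that torchrun exports per process. As a minimal sketch (the helper name `ddp_env` is illustrative, not taken from `train_gpt.py`):

```python
import os

def ddp_env():
    """Read the per-process variables that torchrun exports for DDP.

    RANK/WORLD_SIZE/LOCAL_RANK are set by torchrun itself; the defaults
    below let the same script fall back to a single-process run.
    """
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return rank, world_size, local_rank
```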

## Result

Single-seed Hopper run (`SEED=314`):

| Metric | Value |
|--------|------:|
| Steps completed | `3382` |
| Teacher-forced final `val_bpb` | `1.4594` |
| Final autoregressive `qjl` `val_bpb` | `2.12830032` |
| Final autoregressive throughput | `93.51 tok/s` |
| Artifact bytes (`polar+zlib`) | `14,751,006` |
| Total submission bytes | `14,875,186` |
| Peak VRAM allocated | `1933 MiB` |
| Peak VRAM reserved | `2080 MiB` |
| Total wallclock | `592.209s` |

The large gap between teacher-forced evaluation (`1.4594`) and final autoregressive KV evaluation (`2.1283`) is the main reason this is submitted as a non-record baseline rather than a record attempt. Training remains stable, but the quantized KV path still injects too much error during free-running decode.
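For reference, `val_bpb` figures like these are conventionally derived from the summed token-level NLL (in nats) normalized by the byte length of the evaluated text. The exact accounting inside `train_gpt.py` is not shown here, so this is only the standard conversion:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Standard NLL-to-bpb conversion: nats -> bits, then per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)
```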

## Engineering Notes

This run found and fixed a real infrastructure bug before submission:

- The first 8xH100 attempt exceeded the wallclock at `601.863s`.
- Root cause: the internal training budget reserved time for export/final-eval, but did not subtract pre-training setup overhead.
- The fix now measures `pre_training_overhead` before entering the main loop and reduces the usable training budget accordingly.
- The successful run logged `pre_training_overhead:6610ms`, `train_budget_after_setup:578390ms`, and finished at `total_wallclock:592209ms`.
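The budgeting arithmetic implied by those log lines can be sketched as follows (the function name is illustrative); with the logged values it reproduces the `578390ms` figure:

```python
def usable_train_budget(max_wallclock_s: float,
                        finalize_reserve_s: float,
                        pre_training_overhead_s: float) -> float:
    """Training budget = hard wallclock limit minus the finalization
    reserve minus the measured pre-training setup overhead."""
    return max_wallclock_s - finalize_reserve_s - pre_training_overhead_s

# With the values from the successful run's log:
# usable_train_budget(600, 15, 6.610) -> 578.390 seconds
```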

This folder therefore serves as both:

- a distributed Hopper baseline for Polar STE + QJL
- a regression test for the DDP-safe final KV evaluation path and wallclock budgeting logic

## Files

- `train_gpt.py`: self-contained training + export + rank-0 KV evaluation script
- `triton_kv_ops.py`: Triton kernels kept alongside the script, even though the Hopper-winning eval backend here is native `qjl`
- `run_h100x8.sh`: exact launcher used for the successful run
- `train_seed314_budgetfix.log`: raw run log from the successful 8xH100 execution

## Run Command

From this folder on the official RunPod Parameter Golf image:

```bash
bash run_h100x8.sh
```

The launcher bakes in the validated Hopper settings:

- `KV_QUANT_BACKEND=qjl`
- `ENABLE_TORCH_COMPILE=0`
- `WARMUP_STEPS=0`
- `LR_WARMUP_STEPS=128`
- `LR_WARMUP_INIT_SCALE=0.1`
- `MAX_WALLCLOCK_SECONDS=600`
- `FINALIZE_BUDGET_SECONDS=15`
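Because the launcher uses the shell's `${VAR:-default}` pattern throughout, every setting can be overridden from the caller's environment while the validated Hopper values remain the defaults. A hedged sketch of a matching read side in Python (the script's actual parsing code is not shown here; `env_int` is a hypothetical helper):

```python
import os

def env_int(name: str, default: int) -> int:
    """Mirror the shell's ${VAR:-default}: env value if set, else default."""
    return int(os.environ.get(name, default))

ENABLE_TORCH_COMPILE = env_int("ENABLE_TORCH_COMPILE", 0)
MAX_WALLCLOCK_SECONDS = env_int("MAX_WALLCLOCK_SECONDS", 600)
FINALIZE_BUDGET_SECONDS = env_int("FINALIZE_BUDGET_SECONDS", 15)
KV_QUANT_BACKEND = os.environ.get("KV_QUANT_BACKEND", "qjl")
```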

## Why This Matters

Even though the quality is not leaderboard-competitive, this submission proves that:

- Polar STE weight training survives the transition from local `1x3090` experimentation to `8xH100` DDP.
- The final autoregressive KV evaluator can run rank-0-only under distributed training without deadlock.
- Hopper prefers native `qjl` over the current Triton decode path on this `batch=1` autoregressive workload.
- The export path stays under the `16MB` artifact limit on real `8xH100` runs.
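The deadlock-free rank-0-only evaluation pattern can be sketched abstractly. In the real DDP script the barrier would be `torch.distributed.barrier()`; `final_kv_eval` and `eval_fn` here are illustrative names, not the script's API:

```python
from typing import Callable, Optional

def final_kv_eval(rank: int,
                  eval_fn: Callable[[], float],
                  barrier: Callable[[], None]) -> Optional[float]:
    """Run the expensive eval on rank 0 only, but make EVERY rank hit
    the same barrier so no process exits early and hangs the others."""
    result = eval_fn() if rank == 0 else None
    barrier()  # torch.distributed.barrier() in the real DDP script
    return result
```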
@@ -0,0 +1 @@
sentencepiece
@@ -0,0 +1,44 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd -- "${SCRIPT_DIR}/../../.." && pwd)"

cd "${SCRIPT_DIR}"

export RUN_ID="${RUN_ID:-record_polar_qjl_h100x8}"
export DATA_PATH="${DATA_PATH:-${REPO_ROOT}/data/datasets/fineweb10B_sp1024}"
export TOKENIZER_PATH="${TOKENIZER_PATH:-${REPO_ROOT}/data/tokenizers/fineweb_1024_bpe.model}"
export VOCAB_SIZE="${VOCAB_SIZE:-1024}"

export QAT="${QAT:-1}"
export QAT_SCHEME="${QAT_SCHEME:-polar}"
export WEIGHT_QUANT_SCHEME="${WEIGHT_QUANT_SCHEME:-polar}"
export POLAR_QAT_BITS_MODE="${POLAR_QAT_BITS_MODE:-quality}"
export POLAR_WEIGHT_BITS_MODE="${POLAR_WEIGHT_BITS_MODE:-quality}"
export POLAR_WEIGHT_ROTATE="${POLAR_WEIGHT_ROTATE:-0}"

export TRAIN_SEQ_LEN="${TRAIN_SEQ_LEN:-256}"
export TRAIN_BATCH_TOKENS="${TRAIN_BATCH_TOKENS:-65536}"
export ITERATIONS="${ITERATIONS:-100000}"
export WARMUP_STEPS="${WARMUP_STEPS:-0}"
export LR_WARMUP_STEPS="${LR_WARMUP_STEPS:-128}"
export LR_WARMUP_INIT_SCALE="${LR_WARMUP_INIT_SCALE:-0.1}"
export TRAIN_LOG_EVERY="${TRAIN_LOG_EVERY:-20}"

export VAL_LOSS_EVERY="${VAL_LOSS_EVERY:-0}"
export VAL_BATCH_SIZE="${VAL_BATCH_SIZE:-131072}"
export VAL_MAX_TOKENS="${VAL_MAX_TOKENS:-4096}"

export EVAL_AUTOREGRESSIVE_KV="${EVAL_AUTOREGRESSIVE_KV:-1}"
export KV_QUANT_BACKEND="${KV_QUANT_BACKEND:-qjl}"
export KV_EVAL_CONTEXT_LEN="${KV_EVAL_CONTEXT_LEN:-256}"
export KV_EVAL_MAX_TOKENS="${KV_EVAL_MAX_TOKENS:-512}"
export KV_BACKEND_SELFTEST="${KV_BACKEND_SELFTEST:-0}"

export ENABLE_TORCH_COMPILE="${ENABLE_TORCH_COMPILE:-0}"
export MAX_WALLCLOCK_SECONDS="${MAX_WALLCLOCK_SECONDS:-600}"
export FINALIZE_BUDGET_SECONDS="${FINALIZE_BUDGET_SECONDS:-15}"
export LOG_SYNC_TO_DISK="${LOG_SYNC_TO_DISK:-1}"

torchrun --standalone --nproc_per_node=8 train_gpt.py
@@ -0,0 +1,28 @@
{
"author": "Lucas Ercolano",
"github_id": "LucasErcolano",
"name": "Distributed 8xH100 Polar STE + QJL KV-cache baseline",
"blurb": "Non-record Hopper baseline validating Polar STE structural-weight QAT plus native QJL KV-cache eval under 8xH100 DDP. The run completed in 592.209s without deadlocks, stayed under 16MB with a 14,751,006-byte polar+zlib artifact, and exposed a large teacher-forced vs autoregressive KV gap (1.4594 vs 2.1283 BPB), so it is submitted as infrastructure validation rather than a leaderboard claim.",
"date": "2026-03-30T00:00:00Z",
"track": "non-record-16mb",
"val_loss": 3.78602759,
"val_bpb": 2.12830032,
"teacher_forced_val_loss": 2.5242,
"teacher_forced_val_bpb": 1.4594,
"step_stop": 3382,
"wallclock_seconds": 592.209,
"training_time_seconds": 578.545,
"pre_training_overhead_seconds": 6.610,
"bytes_total": 14875186,
"bytes_model_polar_zlib": 14751006,
"bytes_code": 124180,
"peak_memory_allocated_mib": 1933,
"peak_memory_reserved_mib": 2080,
"peak_cache_bytes": 387072,
"peak_cache_tensor_bytes": 1271808,
"tokens_per_second": 93.51,
"kv_eval_backend": "qjl",
"gpu": "8xH100 80GB HBM3",
"world_size": 8,
"seed": 314
}