@@ -0,0 +1,78 @@
# Non-record Submission: Distributed 8xH100 Polar STE + QJL KV-cache Baseline

This folder captures the first end-to-end distributed Hopper baseline for the Polar STE + QJL KV-cache stack. The goal is not a leaderboard claim but to prove that the architecture compiles, trains, exports, reloads, and runs the final autoregressive KV evaluation on `8xH100 80GB HBM3` under DDP without deadlocks.

The run used:

- `WORLD_SIZE=8` via `torchrun --standalone --nproc_per_node=8`
- `QAT_SCHEME=polar`
- `WEIGHT_QUANT_SCHEME=polar`
- native `KV_QUANT_BACKEND=qjl`
- `ENABLE_TORCH_COMPILE=0`
- a hard `600s` wallclock with an internal `15s` finalization reserve
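Under `torchrun --standalone --nproc_per_node=8`, each worker discovers its DDP identity from environment variables that torchrun exports per process. As a minimal sketch (the helper name `ddp_env` is illustrative, not taken from `train_gpt.py`):

```python
import os

def ddp_env():
    """Read the per-process variables that torchrun exports for DDP.

    RANK/WORLD_SIZE/LOCAL_RANK are set by torchrun itself; the defaults
    below let the same script fall back to a single-process run.
    """
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    local_rank = int(os.environ.get("LOCAL_RANK", "0"))
    return rank, world_size, local_rank
```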

## Result

Single-seed Hopper run (`SEED=314`):

| Metric | Value |
|--------|------:|
| Steps completed | `3382` |
| Teacher-forced final `val_bpb` | `1.4594` |
| Final autoregressive `qjl` `val_bpb` | `2.12830032` |
| Final autoregressive throughput | `93.51 tok/s` |
| Artifact bytes (`polar+zlib`) | `14,751,006` |
| Total submission bytes | `14,875,186` |
| Peak VRAM allocated | `1933 MiB` |
| Peak VRAM reserved | `2080 MiB` |
| Total wallclock | `592.209s` |

The large gap between teacher-forced evaluation (`1.4594`) and final autoregressive KV evaluation (`2.1283`) is the main reason this is submitted as a non-record baseline rather than a record attempt. Training remains stable, but the quantized KV path still injects too much error during free-running decode.
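For reference, `val_bpb` figures like these are conventionally derived from the summed token-level NLL (in nats) normalized by the byte length of the evaluated text. The exact accounting inside `train_gpt.py` is not shown here, so this is only the standard conversion:

```python
import math

def bits_per_byte(total_nll_nats: float, total_bytes: int) -> float:
    """Standard NLL-to-bpb conversion: nats -> bits, then per byte."""
    return total_nll_nats / (math.log(2) * total_bytes)
```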

## Engineering Notes

This run found and fixed a real infrastructure bug before submission:

- The first 8xH100 attempt exceeded the wallclock at `601.863s`.
- Root cause: the internal training budget reserved time for export/final-eval, but did not subtract pre-training setup overhead.
- The fix now measures `pre_training_overhead` before entering the main loop and reduces the usable training budget accordingly.
- The successful run logged `pre_training_overhead:6610ms`, `train_budget_after_setup:578390ms`, and finished at `total_wallclock:592209ms`.
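The budgeting arithmetic implied by those log lines can be sketched as follows (the function name is illustrative); with the logged values it reproduces the `578390ms` figure:

```python
def usable_train_budget(max_wallclock_s: float,
                        finalize_reserve_s: float,
                        pre_training_overhead_s: float) -> float:
    """Training budget = hard wallclock limit minus the finalization
    reserve minus the measured pre-training setup overhead."""
    return max_wallclock_s - finalize_reserve_s - pre_training_overhead_s

# With the values from the successful run's log:
# usable_train_budget(600, 15, 6.610) -> 578.390 seconds
```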

This folder therefore serves as both:

- a distributed Hopper baseline for Polar STE + QJL
- a regression test for the DDP-safe final KV evaluation path and wallclock budgeting logic

## Files

- `train_gpt.py`: self-contained training + export + rank-0 KV evaluation script
- `triton_kv_ops.py`: Triton kernels kept alongside the script, even though the Hopper-winning eval backend here is native `qjl`
- `run_h100x8.sh`: exact launcher used for the successful run
- `train_seed314_budgetfix.log`: raw run log from the successful 8xH100 execution

## Run Command

From this folder on the official RunPod Parameter Golf image:

```bash
bash run_h100x8.sh
```

The launcher bakes in the validated Hopper settings:

- `KV_QUANT_BACKEND=qjl`
- `ENABLE_TORCH_COMPILE=0`
- `WARMUP_STEPS=0`
- `LR_WARMUP_STEPS=128`
- `LR_WARMUP_INIT_SCALE=0.1`
- `MAX_WALLCLOCK_SECONDS=600`
- `FINALIZE_BUDGET_SECONDS=15`
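Because the launcher uses the shell's `${VAR:-default}` pattern throughout, every setting can be overridden from the caller's environment while the validated Hopper values remain the defaults. A hedged sketch of a matching read side in Python (the script's actual parsing code is not shown here; `env_int` is a hypothetical helper):

```python
import os

def env_int(name: str, default: int) -> int:
    """Mirror the shell's ${VAR:-default}: env value if set, else default."""
    return int(os.environ.get(name, default))

ENABLE_TORCH_COMPILE = env_int("ENABLE_TORCH_COMPILE", 0)
MAX_WALLCLOCK_SECONDS = env_int("MAX_WALLCLOCK_SECONDS", 600)
FINALIZE_BUDGET_SECONDS = env_int("FINALIZE_BUDGET_SECONDS", 15)
KV_QUANT_BACKEND = os.environ.get("KV_QUANT_BACKEND", "qjl")
```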

## Why This Matters

Even though the quality is not leaderboard-competitive, this submission proves that:

- Polar STE weight training survives the transition from local `1x3090` experimentation to `8xH100` DDP.
- The final autoregressive KV evaluator can run rank-0-only under distributed training without deadlock.
- Hopper prefers native `qjl` over the current Triton decode path on this `batch=1` autoregressive workload.
- The export path stays under the `16MB` artifact limit on real `8xH100` runs.
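The deadlock-free rank-0-only evaluation pattern can be sketched abstractly. In the real DDP script the barrier would be `torch.distributed.barrier()`; `final_kv_eval` and `eval_fn` here are illustrative names, not the script's API:

```python
from typing import Callable, Optional

def final_kv_eval(rank: int,
                  eval_fn: Callable[[], float],
                  barrier: Callable[[], None]) -> Optional[float]:
    """Run the expensive eval on rank 0 only, but make EVERY rank hit
    the same barrier so no process exits early and hangs the others."""
    result = eval_fn() if rank == 0 else None
    barrier()  # torch.distributed.barrier() in the real DDP script
    return result
```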
@@ -0,0 +1 @@
sentencepiece
@@ -0,0 +1,44 @@
#!/usr/bin/env bash
set -euo pipefail

SCRIPT_DIR="$(cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
REPO_ROOT="$(cd -- "${SCRIPT_DIR}/../../.." && pwd)"

cd "${SCRIPT_DIR}"

export RUN_ID="${RUN_ID:-record_polar_qjl_h100x8}"
export DATA_PATH="${DATA_PATH:-${REPO_ROOT}/data/datasets/fineweb10B_sp1024}"
export TOKENIZER_PATH="${TOKENIZER_PATH:-${REPO_ROOT}/data/tokenizers/fineweb_1024_bpe.model}"
export VOCAB_SIZE="${VOCAB_SIZE:-1024}"

export QAT="${QAT:-1}"
export QAT_SCHEME="${QAT_SCHEME:-polar}"
export WEIGHT_QUANT_SCHEME="${WEIGHT_QUANT_SCHEME:-polar}"
export POLAR_QAT_BITS_MODE="${POLAR_QAT_BITS_MODE:-quality}"
export POLAR_WEIGHT_BITS_MODE="${POLAR_WEIGHT_BITS_MODE:-quality}"
export POLAR_WEIGHT_ROTATE="${POLAR_WEIGHT_ROTATE:-0}"

export TRAIN_SEQ_LEN="${TRAIN_SEQ_LEN:-256}"
export TRAIN_BATCH_TOKENS="${TRAIN_BATCH_TOKENS:-65536}"
export ITERATIONS="${ITERATIONS:-100000}"
export WARMUP_STEPS="${WARMUP_STEPS:-0}"
export LR_WARMUP_STEPS="${LR_WARMUP_STEPS:-128}"
export LR_WARMUP_INIT_SCALE="${LR_WARMUP_INIT_SCALE:-0.1}"
export TRAIN_LOG_EVERY="${TRAIN_LOG_EVERY:-20}"

export VAL_LOSS_EVERY="${VAL_LOSS_EVERY:-0}"
export VAL_BATCH_SIZE="${VAL_BATCH_SIZE:-131072}"
export VAL_MAX_TOKENS="${VAL_MAX_TOKENS:-4096}"

export EVAL_AUTOREGRESSIVE_KV="${EVAL_AUTOREGRESSIVE_KV:-1}"
export KV_QUANT_BACKEND="${KV_QUANT_BACKEND:-qjl}"
export KV_EVAL_CONTEXT_LEN="${KV_EVAL_CONTEXT_LEN:-256}"
export KV_EVAL_MAX_TOKENS="${KV_EVAL_MAX_TOKENS:-512}"
export KV_BACKEND_SELFTEST="${KV_BACKEND_SELFTEST:-0}"

export ENABLE_TORCH_COMPILE="${ENABLE_TORCH_COMPILE:-0}"
export MAX_WALLCLOCK_SECONDS="${MAX_WALLCLOCK_SECONDS:-600}"
export FINALIZE_BUDGET_SECONDS="${FINALIZE_BUDGET_SECONDS:-15}"
export LOG_SYNC_TO_DISK="${LOG_SYNC_TO_DISK:-1}"

torchrun --standalone --nproc_per_node=8 train_gpt.py
@@ -0,0 +1,28 @@
{
"author": "Lucas Ercolano",
"github_id": "LucasErcolano",
"name": "Distributed 8xH100 Polar STE + QJL KV-cache baseline",
"blurb": "Non-record Hopper baseline validating Polar STE structural-weight QAT plus native QJL KV-cache eval under 8xH100 DDP. The run completed in 592.209s without deadlocks, stayed under 16MB with a 14,751,006-byte polar+zlib artifact, and exposed a large teacher-forced vs autoregressive KV gap (1.4594 vs 2.1283 BPB), so it is submitted as infrastructure validation rather than a leaderboard claim.",
"date": "2026-03-30T00:00:00Z",
"track": "non-record-16mb",
"val_loss": 3.78602759,
"val_bpb": 2.12830032,
"teacher_forced_val_loss": 2.5242,
"teacher_forced_val_bpb": 1.4594,
"step_stop": 3382,
"wallclock_seconds": 592.209,
"training_time_seconds": 578.545,
"pre_training_overhead_seconds": 6.610,
"bytes_total": 14875186,
"bytes_model_polar_zlib": 14751006,
"bytes_code": 124180,
"peak_memory_allocated_mib": 1933,
"peak_memory_reserved_mib": 2080,
"peak_cache_bytes": 387072,
"peak_cache_tensor_bytes": 1271808,
"tokens_per_second": 93.51,
"kv_eval_backend": "qjl",
"gpu": "8xH100 80GB HBM3",
"world_size": 8,
"seed": 314
}