Non-record: Distributed 8xH100 Polar STE + QJL KV-cache baseline#1160

Closed
LucasErcolano wants to merge 1 commit into openai:main from LucasErcolano:codex/non-record-distributed-h100x8-polar-qjl
Conversation

@LucasErcolano
Summary

This PR adds a self-contained non-record submission folder under records/track_non_record_16mb/ for the first successful distributed Hopper baseline of the Polar STE + QJL KV-cache stack.

The point of this submission is infrastructure validation, not a leaderboard claim. It proves that the stack:

  • trains under 8xH100 80GB HBM3 with WORLD_SIZE=8
  • exports a polar+zlib artifact under 16MB
  • reloads that artifact successfully
  • runs final autoregressive KV evaluation on rank 0 without DDP deadlocks
  • stays within the official 600s wallclock limit, after a budgeting bug was found and fixed
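The export-under-16MB / reload contract in the list above can be sketched as follows. This is an illustrative stand-in, not the code in train_gpt.py: `export_polar_zlib` and `reload_polar_zlib` are made-up names, and the polar encoding itself is elided, leaving only the serialize + zlib + size-check skeleton:

```python
import io
import zlib

import numpy as np

SIZE_BUDGET = 16 * 1024 * 1024  # 16 MB artifact limit for the track


def export_polar_zlib(weights):
    """Serialize tensors and zlib-compress them (stand-in for the real
    polar+zlib export; the polar transform itself is omitted here)."""
    buf = io.BytesIO()
    np.savez(buf, **weights)
    return zlib.compress(buf.getvalue(), level=9)


def reload_polar_zlib(blob):
    """Inverse of export_polar_zlib: decompress and reload the tensors."""
    arrays = np.load(io.BytesIO(zlib.decompress(blob)))
    return {name: arrays[name] for name in arrays.files}


weights = {"wte": np.zeros((1024, 64), dtype=np.float32)}
blob = export_polar_zlib(weights)
assert len(blob) < SIZE_BUDGET, f"artifact too large: {len(blob)} bytes"
restored = reload_polar_zlib(blob)
assert np.array_equal(weights["wte"], restored["wte"])
```

The real submission additionally has to round-trip the polar-quantized weights, but the size check and reload round-trip are the two properties this run validates.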

Hopper Result

Single-seed run (SEED=314) on the official RunPod Parameter Golf image:

  • 3382 train steps
  • teacher-forced final val_bpb=1.4594
  • final autoregressive qjl eval val_bpb=2.12830032
  • final autoregressive throughput 93.51 tok/s
  • artifact size 14,751,006 bytes
  • peak VRAM 1933 MiB allocated / 2080 MiB reserved
  • final wallclock 592.209s

The run log is included directly in the folder as train_seed314_budgetfix.log.

Why Non-record

The teacher-forced vs autoregressive gap is still too large for a serious leaderboard attempt:

  • teacher-forced val_bpb=1.4594
  • autoregressive KV val_bpb=2.1283

That gap strongly suggests the quantized KV path is still injecting too much decode-time error, even though optimization itself is stable.
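For intuition on where that decode-time error comes from, here is a minimal NumPy toy of a QJL-style key sketch: project the key with a shared random Gaussian matrix, store only the sign bits plus the key norm, and estimate the attention logit from that. This is illustrative only, not the triton_kv_ops.py implementation; all names and sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 256  # head dim and sketch width (illustrative sizes)

# Shared random Gaussian projection used for both keys and queries.
S = rng.standard_normal((d, m))

q = rng.standard_normal(d)
k = rng.standard_normal(d)

# QJL-style compression: keep only the sign bits of the projected key,
# plus the key's norm as a per-key scale.
k_bits = np.sign(k @ S)
k_norm = np.linalg.norm(k)

# Sign-sketch estimate of the attention logit <q, k>, using
# E[(q.s)*sign(k.s)] = sqrt(2/pi) * <q, k> / ||k|| for Gaussian s.
approx = np.sqrt(np.pi / 2) * k_norm * np.mean((q @ S) * k_bits)
exact = q @ k
print(f"exact={exact:.3f} approx={approx:.3f} err={abs(approx - exact):.3f}")
```

At modest sketch widths the per-logit estimation error is substantial, and during autoregressive decode each step's output feeds the next step's input, so the error compounds in a way teacher forcing never exposes.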

Engineering Note: Wallclock Bug Found and Fixed

The first 8xH100 attempt exposed a real bug in the internal wallclock guard:

  • the initial run finished at 601.863s
  • root cause: the script reserved finalization time, but did not subtract pre-training setup overhead before entering the training loop
  • the fix now measures pre_training_overhead and reduces the usable training budget accordingly
  • the successful run logged pre_training_overhead:6610ms and train_budget_after_setup:578390ms
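The fixed budgeting logic reduces to a small piece of arithmetic. In the sketch below, the function name is illustrative and the 15,000 ms finalization reserve is inferred from the logged numbers (600000 - 6610 - 578390); only the subtraction of the measured setup overhead is the actual fix:

```python
TOTAL_BUDGET_MS = 600_000          # official 600s wallclock
FINALIZATION_RESERVE_MS = 15_000   # inferred from the logged budget numbers


def train_budget_after_setup(pre_training_overhead_ms):
    """Usable training budget once both the finalization reserve and the
    measured pre-training setup overhead are subtracted. The original bug
    subtracted only the reserve, so setup time silently ate into the
    600s wallclock and pushed the first run to 601.863s."""
    return TOTAL_BUDGET_MS - FINALIZATION_RESERVE_MS - pre_training_overhead_ms


# Reproduces the values logged by the successful run.
assert train_budget_after_setup(6_610) == 578_390
```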

Files Added

  • train_gpt.py
  • triton_kv_ops.py
  • run_h100x8.sh
  • train_seed314_budgetfix.log
  • README.md
  • submission.json
  • requirements.txt

Validation

  • Real 8xH100 distributed run completed successfully on RunPod
  • py -3.11 -m py_compile records/track_non_record_16mb/2026-03-30_Distributed_H100x8_PolarSTE_QJL_Baseline/train_gpt.py

@LucasErcolano
Author

Superseded by the Hopper validation update on #1154. Closing this draft to keep the discussion consolidated in the Polar STE PR, which now includes the wallclock-budget fix and the full 8xH100 validation results.
