Non-record: Distributed 8xH100 Polar STE + QJL KV-cache baseline#1160

Closed
LucasErcolano wants to merge 1 commit into openai:main from LucasErcolano:codex/non-record-distributed-h100x8-polar-qjl
Conversation

@LucasErcolano
Summary

This PR adds a self-contained non-record submission folder under records/track_non_record_16mb/ for the first successful distributed Hopper baseline of the Polar STE + QJL KV-cache stack.

The point of this submission is infrastructure validation, not a leaderboard claim. It proves that the stack:

  • trains under 8xH100 80GB HBM3 with WORLD_SIZE=8
  • exports a polar+zlib artifact under 16MB
  • reloads that artifact successfully
  • runs final autoregressive KV evaluation on rank 0 without DDP deadlocks
  • stays within the official 600s wallclock limit, after a budgeting bug was found and fixed
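The export-under-16MB / reload contract in the list above can be sketched as follows. This is an illustrative stand-in, not the code in train_gpt.py: `export_polar_zlib` and `reload_polar_zlib` are made-up names, and the polar encoding itself is elided, leaving only the serialize + zlib + size-check skeleton:

```python
import io
import zlib

import numpy as np

SIZE_BUDGET = 16 * 1024 * 1024  # 16 MB artifact limit for the track


def export_polar_zlib(weights):
    """Serialize tensors and zlib-compress them (stand-in for the real
    polar+zlib export; the polar transform itself is omitted here)."""
    buf = io.BytesIO()
    np.savez(buf, **weights)
    return zlib.compress(buf.getvalue(), level=9)


def reload_polar_zlib(blob):
    """Inverse of export_polar_zlib: decompress and reload the tensors."""
    arrays = np.load(io.BytesIO(zlib.decompress(blob)))
    return {name: arrays[name] for name in arrays.files}


weights = {"wte": np.zeros((1024, 64), dtype=np.float32)}
blob = export_polar_zlib(weights)
assert len(blob) < SIZE_BUDGET, f"artifact too large: {len(blob)} bytes"
restored = reload_polar_zlib(blob)
assert np.array_equal(weights["wte"], restored["wte"])
```

The real submission additionally has to round-trip the polar-quantized weights, but the size check and reload round-trip are the two properties this run validates.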

Hopper Result

Single-seed run (SEED=314) on the official RunPod Parameter Golf image:

  • 3382 train steps
  • teacher-forced final val_bpb=1.4594
  • final autoregressive qjl eval val_bpb=2.12830032
  • final autoregressive throughput 93.51 tok/s
  • artifact size 14,751,006 bytes
  • peak VRAM 1933 MiB allocated / 2080 MiB reserved
  • final wallclock 592.209s

The run log is included directly in the folder as train_seed314_budgetfix.log.

Why Non-record

The teacher-forced vs autoregressive gap is still too large for a serious leaderboard attempt:

  • teacher-forced val_bpb=1.4594
  • autoregressive KV val_bpb=2.1283

That gap strongly suggests the quantized KV path is still injecting too much decode-time error, even though optimization itself is stable.
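For intuition on where that decode-time error comes from, here is a minimal NumPy toy of a QJL-style key sketch: project the key with a shared random Gaussian matrix, store only the sign bits plus the key norm, and estimate the attention logit from that. This is illustrative only, not the triton_kv_ops.py implementation; all names and sizes are made up:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 64, 256  # head dim and sketch width (illustrative sizes)

# Shared random Gaussian projection used for both keys and queries.
S = rng.standard_normal((d, m))

q = rng.standard_normal(d)
k = rng.standard_normal(d)

# QJL-style compression: keep only the sign bits of the projected key,
# plus the key's norm as a per-key scale.
k_bits = np.sign(k @ S)
k_norm = np.linalg.norm(k)

# Sign-sketch estimate of the attention logit <q, k>, using
# E[(q.s)*sign(k.s)] = sqrt(2/pi) * <q, k> / ||k|| for Gaussian s.
approx = np.sqrt(np.pi / 2) * k_norm * np.mean((q @ S) * k_bits)
exact = q @ k
print(f"exact={exact:.3f} approx={approx:.3f} err={abs(approx - exact):.3f}")
```

At modest sketch widths the per-logit estimation error is substantial, and during autoregressive decode each step's output feeds the next step's input, so the error compounds in a way teacher forcing never exposes.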

Engineering Note: Wallclock Bug Found and Fixed

The first 8xH100 attempt exposed a real bug in the internal wallclock guard:

  • the initial run finished at 601.863s
  • root cause: the script reserved finalization time, but did not subtract pre-training setup overhead before entering the training loop
  • the fix now measures pre_training_overhead and reduces the usable training budget accordingly
  • the successful run logged pre_training_overhead:6610ms and train_budget_after_setup:578390ms
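The fixed budgeting logic reduces to a small piece of arithmetic. In the sketch below, the function name is illustrative and the 15,000 ms finalization reserve is inferred from the logged numbers (600000 - 6610 - 578390); only the subtraction of the measured setup overhead is the actual fix:

```python
TOTAL_BUDGET_MS = 600_000          # official 600s wallclock
FINALIZATION_RESERVE_MS = 15_000   # inferred from the logged budget numbers


def train_budget_after_setup(pre_training_overhead_ms):
    """Usable training budget once both the finalization reserve and the
    measured pre-training setup overhead are subtracted. The original bug
    subtracted only the reserve, so setup time silently ate into the
    600s wallclock and pushed the first run to 601.863s."""
    return TOTAL_BUDGET_MS - FINALIZATION_RESERVE_MS - pre_training_overhead_ms


# Reproduces the values logged by the successful run.
assert train_budget_after_setup(6_610) == 578_390
```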

Files Added

  • train_gpt.py
  • triton_kv_ops.py
  • run_h100x8.sh
  • train_seed314_budgetfix.log
  • README.md
  • submission.json
  • requirements.txt

Validation

  • Real 8xH100 distributed run completed successfully on RunPod
  • py -3.11 -m py_compile records/track_non_record_16mb/2026-03-30_Distributed_H100x8_PolarSTE_QJL_Baseline/train_gpt.py

@LucasErcolano
Author

Superseded by the Hopper validation update on #1154. Closing this draft to keep the discussion consolidated in the Polar STE PR, which now includes the wallclock-budget fix and the full 8xH100 validation results.
