Non-record: Distributed 8xH100 Polar STE + QJL KV-cache baseline #1160
Closed
LucasErcolano wants to merge 1 commit into openai:main from
Conversation
Author
Superseded by the Hopper validation update on #1154. Closing this draft to keep the discussion consolidated in the Polar STE PR, which now includes the wallclock-budget fix and the full 8xH100 validation results.
Summary
This PR adds a self-contained non-record submission folder under `records/track_non_record_16mb/` for the first successful distributed Hopper baseline of the Polar STE + QJL KV-cache stack. The point of this submission is infrastructure validation, not a leaderboard claim. It proves that the stack:

- runs on 8xH100 80GB HBM3 with `WORLD_SIZE=8`
- produces a `polar`+`zlib` artifact under 16MB
- fits in the 600s wallclock after a budgeting bug was fixed

Hopper Result
Single-seed run (`SEED=314`) on the official RunPod Parameter Golf image:

- 3382 train steps
- val_bpb=1.4594
- qjl eval val_bpb=2.1283
- 3293.51 tok/s
- 14,751,006 bytes artifact
- 1933 MiB allocated / 2080 MiB reserved
- 592.209s wallclock

The run log is included directly in the folder as `train_seed314_budgetfix.log`.

Why Non-record
The teacher-forced vs autoregressive gap is still too large for a serious leaderboard attempt:
- teacher-forced: val_bpb=1.4594
- autoregressive (qjl eval): val_bpb=2.1283

That gap strongly suggests the quantized KV path is still injecting too much decode-time error, even though optimization itself is stable.
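For scale, the gap can be quantified directly from the two reported numbers (plain arithmetic, nothing beyond the values above):

```python
# Teacher-forced vs autoregressive validation bits-per-byte, as reported above.
tf_bpb = 1.4594   # teacher-forced val_bpb
ar_bpb = 2.1283   # autoregressive (qjl eval) val_bpb

abs_gap = ar_bpb - tf_bpb   # absolute gap in bits per byte
rel_gap = abs_gap / tf_bpb  # relative degradation attributable to decode-time error

print(f"absolute gap: {abs_gap:.4f} bpb")   # absolute gap: 0.6689 bpb
print(f"relative gap: {rel_gap:.1%}")       # relative gap: 45.8%
```

A ~46% relative degradation is what motivates keeping this submission in the non-record track.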
Engineering Note: Wallclock Bug Found and Fixed
The first 8xH100 attempt exposed a real bug in the internal wallclock guard:
- the first attempt finished at 601.863s, over the 600s budget
- the fix measures `pre_training_overhead` and reduces the usable training budget accordingly
- the fixed run logs `pre_training_overhead: 6610ms` and `train_budget_after_setup: 578390ms`

Files Added
- `train_gpt.py`
- `triton_kv_ops.py`
- `run_h100x8.sh`
- `train_seed314_budgetfix.log`
- `README.md`
- `submission.json`
- `requirements.txt`

Validation
- 8xH100 distributed run completed successfully on RunPod
- `py -3.11 -m py_compile records/track_non_record_16mb/2026-03-30_Distributed_H100x8_PolarSTE_QJL_Baseline/train_gpt.py`
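For context on the wallclock fix described in the engineering note, here is a minimal sketch of a budget guard that subtracts pre-training overhead from the usable budget. All names are hypothetical; the actual implementation in `train_gpt.py` may differ (e.g. it appears to reserve an additional margin, since the logged `train_budget_after_setup` is smaller than budget minus overhead):

```python
import time

WALLCLOCK_BUDGET_MS = 600_000  # 600s overall wallclock budget


class WallclockGuard:
    """Sketch of a wallclock guard: setup time before training starts
    counts as pre-training overhead and shrinks the training budget."""

    def __init__(self, budget_ms: int = WALLCLOCK_BUDGET_MS):
        self.budget_ms = budget_ms
        self._start = time.monotonic()
        self.train_budget_ms = budget_ms  # shrinks once training starts

    def mark_training_start(self) -> None:
        # Everything elapsed so far is pre-training overhead.
        overhead_ms = int((time.monotonic() - self._start) * 1000)
        self.train_budget_ms = self.budget_ms - overhead_ms

    def remaining_ms(self) -> int:
        # Time left against the overall budget, measured on a monotonic clock.
        elapsed_ms = int((time.monotonic() - self._start) * 1000)
        return self.budget_ms - elapsed_ms


guard = WallclockGuard()
# ... model / dataloader / distributed setup happens here ...
guard.mark_training_start()
assert guard.train_budget_ms <= guard.budget_ms
```

The original bug, as described above, was the opposite behavior: setup overhead was not deducted, so training could run up to the full 600s after setup and overshoot the wallclock limit (hence the 601.863s first attempt).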