[Record] SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965 — val_bpb 1.0866 #1471
X-Abhishek-X wants to merge 1 commit into openai:main
Conversation
val_bpb 1.0866, 3-seed mean (sliding window, stride=64). Beats merged SOTA (1.1147) by 0.0281 BPB.
SP8192 tokenizer with SDClip quantization (c = k·std), 3-layer depth recurrence (layers 3/4/5), EMA 0.9965, WD=0.095, early recurrence (step 2000), extended warmdown (72%). Zero selective pruning across all seeds.
Seeds: 42 (1.0873), 1337 (1.0866), 2024 (1.0859).
All artifacts under 16 MB. 8×H100 SXM, 590 s training.
Pull request overview
Adds a new Track A (10min / 16MB) record snapshot for the “SP8192 + SDClip + 3-layer depth recurrence + EMA 0.9965” run, including the exact training script, seed logs, and leaderboard metadata for reproducibility.
Changes:
- Add a standalone `train_gpt.py` implementing the SP8192 + SDClip quantization + 3-layer depth recurrence + EMA=0.9965 configuration.
- Add training logs for the reported seeds (42 / 1337 / 2024) plus a canonical `train.log`.
- Add record metadata (`submission.json`) and a human-readable writeup (`README.md`).
Reviewed changes
Copilot reviewed 3 out of 7 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train_gpt.py | Standalone training/eval/quantization script for the record configuration |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train.log | Canonical run log (seed 42) capturing hyperparams, training, GPTQ, and final metrics |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train_seed42.log | Full seed-42 run log for reproducibility |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train_seed1337.log | Full seed-1337 run log for reproducibility |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train_seed2024.log | Full seed-2024 run log for reproducibility |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/submission.json | Leaderboard metadata (mean metrics + artifact size) |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/README.md | Record summary, reported results table, and reproduction command |
```python
def log(msg, console: bool = True) -> None:
    if _logger_hparams is None:
        print(msg)
    if _logger_hparams.is_main_process:
        if console:
            print(msg)
        if _logger_hparams.logfile is not None:
            with open(_logger_hparams.logfile, "a", encoding="utf-8") as f:
                print(msg, file=f)
```
log() prints when _logger_hparams is None but then immediately dereferences _logger_hparams.is_main_process, which would raise AttributeError if log() is ever called before set_logging_hparams() (or if set_logging_hparams() fails). Add an early return after the fallback print, or guard the second if with an else.
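A minimal sketch of the early-return fix the comment suggests, using a hypothetical stand-in for the module's `_logger_hparams` global (here just `None` until configured):

```python
_logger_hparams = None  # set later by set_logging_hparams(); None until then

def log(msg, console: bool = True) -> None:
    # Fallback path: logging is not configured yet, so print and return
    # early; we never dereference a None _logger_hparams below.
    if _logger_hparams is None:
        print(msg)
        return
    if _logger_hparams.is_main_process:
        if console:
            print(msg)
        if _logger_hparams.logfile is not None:
            with open(_logger_hparams.logfile, "a", encoding="utf-8") as f:
                print(msg, file=f)

log("called before configuration")  # falls back to plain print, no AttributeError
```

The same effect can be had with an `else:` around the configured branch; the early return keeps the nesting flatter.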
```python
def serialize(h: Hyperparameters, base_model: torch.nn.Module, code: str) -> int:
    model_bytes = None
    code_bytes = len(code.encode("utf-8"))
    if h.is_main_process:
        torch.save(base_model.state_dict(), h.model_path)
        model_bytes = os.path.getsize(h.model_path)
        log(f"Serialized model: {model_bytes} bytes")
        log(f"Code size: {code_bytes} bytes")
```
serialize() is annotated as returning int but never returns a value. Either return something meaningful (e.g., bytes_total) or change the annotation to -> None to match behavior.
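A hedged sketch of the `return bytes_total` option, with stdlib stand-ins (plain `pickle.dump` in place of `torch.save`, plain arguments in place of `Hyperparameters`) since only the return-value plumbing is at issue:

```python
import os
import pickle
import tempfile

def serialize(state_dict: dict, model_path: str, code: str) -> int:
    """Save the model and return the total artifact size in bytes."""
    code_bytes = len(code.encode("utf-8"))
    with open(model_path, "wb") as f:
        pickle.dump(state_dict, f)  # stand-in for torch.save(...)
    model_bytes = os.path.getsize(model_path)
    print(f"Serialized model: {model_bytes} bytes")
    print(f"Code size: {code_bytes} bytes")
    return model_bytes + code_bytes  # now the -> int annotation holds

with tempfile.TemporaryDirectory() as d:
    total = serialize({"w": [1.0, 2.0]}, os.path.join(d, "model.pkl"), "x = 1")
    assert total > 0
```

In the real script the non-main-process ranks would also need a defined return value (e.g. `0`), since `model_bytes` stays `None` there.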
```python
def train_model(h: Hyperparameters, device: torch.device, val_data: ValidationData) -> None:
    # Set up model
    base_model = GPT(h).to(device).bfloat16()
    restore_fp32_params(base_model)
    compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True)
    if h.distributed:
        model = DDP(compiled_model, device_ids=[h.local_rank], broadcast_buffers=False)
    else:
        model = compiled_model
```
train_model() is annotated as returning None, but it returns (base_model, compiled_model). Update the return annotation (and ideally the docstring/comment) to reflect the actual return type to avoid confusion and make static tooling happier.
```python
) -> tuple[dict[str, Tensor], dict[str, object]]:
    """Mixed quantization using full GPTQ for layers with Hessians, fallback to clip-search."""
    result: dict[str, Tensor] = {}
    meta: dict[str, object] = {}
    gptq_count = 0
    fallback_count = 0

    for name, tensor in state_dict.items():
        t = tensor.detach().cpu().contiguous()
        cat = classify_param(name)

        if not t.is_floating_point() or t.numel() <= 65536:
            result[name] = t.to(torch.float16) if t.is_floating_point() else t
            meta[name] = "passthrough"
            continue

        if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS):
            result[name] = t.float()
            meta[name] = "passthrough_ctrl"
            continue

        if cat in int6_cats and t.ndim == 2:
            if name in hessians:
                q, s = gptq_quantize_weight(t, hessians[name], sdclip_k=sdclip_k)
                gptq_count += 1
                meta[name] = {"type": "int6", "method": "gptq_sdclip"}
            else:
                q, s = quantize_int6_per_row(t, sdclip_k=sdclip_k)
                fallback_count += 1
                meta[name] = {"type": "int6", "method": "sdclip"}
            result[name + ".q"] = q
            result[name + ".scale"] = s
        elif cat in int6_cats and t.ndim >= 1:
            q, s = quantize_int6_per_row(t, sdclip_k=sdclip_k)
            result[name + ".q"] = q
            result[name + ".scale"] = s
            meta[name] = {"type": "int6"}
        else:
            q, s = quantize_float_tensor(t, sdclip_k_embed=sdclip_k_embed)
            result[name + ".q"] = q
            result[name + ".scale"] = s
            meta[name] = {"type": "int8_sdclip"}

    log(f"GPTQ quantization: {gptq_count} layers with full GPTQ, {fallback_count} fallback to clip-search")
    return result, meta
```
The docstring and log message say the non-GPTQ path is a “fallback to clip-search”, but the code actually falls back to SDClip (quantize_int6_per_row(..., sdclip_k=...)). Please update the wording so logs/metadata accurately reflect what happened during quantization.
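For concreteness, a minimal pure-Python sketch of what the SDClip per-row int6 fallback could look like. The clip rule `clip = k · std(row)` comes from the PR description; the symmetric [-31, 31] int6 grid, the one-fp-scale-per-row layout, and the degenerate-row handling are assumptions for illustration:

```python
import math

def sdclip_int6_per_row(rows, k=3.0):
    """Per-row symmetric int6 quantization with an SDClip threshold.

    Each row is clipped to +/- k * std(row), then mapped onto the signed
    int6 grid [-31, 31] with one float scale per row (assumed layout).
    """
    q_rows, scales = [], []
    for row in rows:
        n = len(row)
        mean = sum(row) / n
        std = math.sqrt(sum((x - mean) ** 2 for x in row) / n)
        clip = k * std if std > 0 else 1.0  # guard for constant rows (assumption)
        scale = clip / 31.0
        q = [max(-31, min(31, round(max(-clip, min(clip, x)) / scale))) for x in row]
        q_rows.append(q)
        scales.append(scale)
    return q_rows, scales

def dequant(q_rows, scales):
    # Reconstruct approximate weights from codes and per-row scales.
    return [[v * s for v in q] for q, s in zip(q_rows, scales)]
```

Unlike a multi-percentile clip search, no candidate thresholds are evaluated; the clip is a single closed-form statistic of the row, which is why "SDClip" rather than "clip-search" is the accurate label for this path.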
```python
    _HAS_BROTLI = True
except ImportError:
    _HAS_BROTLI = False
```
_HAS_BROTLI is set in the import guard but never used (compression always does an inline import). Either remove _HAS_BROTLI or use it to raise a clearer error when COMPRESSOR=brotli but the module isn’t installed.
```suggestion
if os.environ.get("COMPRESSOR", "").strip().lower() == "brotli" and not _HAS_BROTLI:
    raise ImportError(
        "COMPRESSOR is set to 'brotli' but the optional 'brotli' module is not installed. "
        "Install 'brotli' or choose a different compressor."
    )
```
|------|---------------|-------------------|---------|----------|
| 42   | 1.0874 | **1.0873** | None | 15,981,300 B |
| 1337 | 1.0865 | **1.0866** | None | 15,978,870 B |
| 2024 | —      | **1.0859** | None | —            |
The seed 2024 row is missing the pre-quant BPB and artifact size, but train_seed2024.log includes both (pre-quantization post-ema val_bpb: 1.08623375 and Total submission size ...: 15,975,819 bytes). Please fill these in (or remove the column) so the README matches the provided logs.
```suggestion
| 2024 | 1.0862 | **1.0859** | None | 15,975,819 B |
```
```shell
DATA_PATH=./data/datasets/fineweb10B_sp8192/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
```
The run command uses DATA_PATH and TOKENIZER_PATH, but this train_gpt.py reads DATA_DIR and constructs datasets_dir/tokenizer_path internally (it does not read DATA_PATH/TOKENIZER_PATH). Update the command to use the env vars the script actually consumes so reproduction works when data isn’t under ./data/.
```suggestion
DATA_DIR=./data/ \
```
```json
"date": "2026-04-08T00:00:00Z",
"val_loss": 2.80668370,
"val_bpb": 1.08655472,
"bytes_total": 15978870
```
This track’s submission.json files consistently include bytes_code alongside bytes_total (e.g., records/track_10min_16mb/2026-03-17_NaiveBaseline/submission.json:10). Consider adding bytes_code here as well to match the established schema and make size breakdowns comparable.
```suggestion
"bytes_total": 15978870,
"bytes_code": 15978870
```
…IP), L05_NORM_PCT still world-novel, flagged L11_DYN_LYAPUNOV vs PR openai#1471 SDClip

Re-audit findings:
- OPT_RIEMANNIAN_GRAM_QKV → comp-novel (Tilde Gram-Space Manifold Muon + arXiv:2603.09697 Mousse, shipped just 14 min before demotion)
- L06_ASYMMETRIC_SKIP_INIT → comp-novel (Nick Ryan's May 2024 blog explicitly tests the 0.5 half-init schedule, 2 years of prior art)
- L05_NORM_PCT_DROPOUT → STILL world-novel (0 hits on norm-percentile feature dropout)

Comp PR audit (last 3 h, openai#1467–openai#1473): PR openai#1471 introduces SDClip — flagged for L11_DYN_LYAPUNOV adjacency review next C180.

Verified world-novels now 4 (down from claimed 8): L05 NORM_PCT_DROPOUT (validated), L09 NGR_LOG_FREQ_INV (shipped), L09 CTX_PARTITIONED_TAB (shipped), L10 CMP_QUANT_VALUE_DEDUP (shipped).

Spend: ~$13.60 / $25 soft cap (NORMAL). Win likelihood: 25% (down from 30%).

LESSON: third time this session conflating "novel sublayer slice" with "world-novel". New rule: the pre-ship audit must demand 0 hits on the UNDERLYING technique, not just the slice.
Refresh the PR cache, reclassify, and publish frontier verdicts on data-touching vs data-free compression (PRs openai#672 / openai#1482 / openai#1477 / openai#1471).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Two of the three comp-frontier wins are env-var bumps with no code change:
- LOOP_START 4 → 3 (with NUM_LOOPS=2 and LOOP_END=5 this gives 3-layer recurrence on layers 3/4/5 instead of 2-layer recurrence on 4/5). PRs openai#1485 / openai#1471 / openai#1437 use this. Expected −0.005 to −0.01 BPB.
- QK_GAIN_INIT 4 → 5. PRs openai#1413, openai#1423, openai#1485, openai#1437, openai#1351, and openai#1408 are at 5; openai#1482 is at 5.25. PR openai#1477's default of 4 is below the leaderboard curve. Expected −0.001 BPB.

C1 (Pre-Quant AdamW TTT) is the bigger win (−0.014 BPB) but requires real code; the agent is researching the PR openai#1485 / openai#1416 / openai#1306 implementations in the background.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
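The LOOP_START / LOOP_END / NUM_LOOPS interaction described above can be sketched as a layer-execution schedule. The exact semantics (inclusive bounds, NUM_LOOPS counting total passes through the block) are assumptions inferred from the "3-layer recurrence on layers 3/4/5" description, not taken from the script:

```python
def layer_schedule(n_layers: int, loop_start: int, loop_end: int, num_loops: int) -> list[int]:
    """Expand a depth-recurrence spec into the per-forward layer order.

    Layers loop_start..loop_end (inclusive, assumed) form a block that is
    executed num_loops times in a row; all other layers run once.
    """
    pre = list(range(loop_start))
    block = list(range(loop_start, loop_end + 1))
    post = list(range(loop_end + 1, n_layers))
    return pre + block * num_loops + post

# LOOP_START=3: 3-layer recurrence on layers 3/4/5
assert layer_schedule(8, 3, 5, 2) == [0, 1, 2, 3, 4, 5, 3, 4, 5, 6, 7]
# LOOP_START=4: the old 2-layer recurrence on layers 4/5
assert layer_schedule(8, 4, 5, 2) == [0, 1, 2, 3, 4, 5, 4, 5, 6, 7]
```

Under this reading the env-var bump adds one extra executed layer pass per forward while leaving parameter count (and thus the 16 MB artifact budget) unchanged.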
Record: SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965 — val_bpb 1.0866
val_bpb: 1.0866 (3-seed mean, std 0.0007) | ~15.98 MB | 8×H100 SXM, 590s
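The headline mean and standard deviation can be reproduced from the per-seed post-quant values with the stdlib:

```python
from statistics import mean, stdev

# Post-quant val_bpb per seed, as reported in the PR description.
seed_bpb = {42: 1.0873, 1337: 1.0866, 2024: 1.0859}
vals = list(seed_bpb.values())

print(f"mean={mean(vals):.4f} std={stdev(vals):.4f}")      # mean=1.0866 std=0.0007
print(f"delta vs merged SOTA: {1.1147 - mean(vals):.4f}")  # 0.0281
```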
3-Seed Results (8×H100 80GB SXM)
Current merged SOTA: 1.1147 (PR #1019). Delta: −0.0281 BPB.
Key Changes (over PR #1445, this author)
SDClip Quantization
Replaces the multi-percentile clip search with `clip = k · std(row)` (PR #1394).
Full Stack
Architecture
Training
Quantization
Run Command
Credits