
[Record] SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965 — val_bpb 1.0866#1471

Open
X-Abhishek-X wants to merge 1 commit into openai:main from X-Abhishek-X:record/v6-sp8192-sdclip-1.0866

Conversation

@X-Abhishek-X

Record: SP8192 + SDClip + 3-Layer Depth Recurrence + EMA 0.9965 — val_bpb 1.0866

val_bpb: 1.0866 (3-seed mean, std 0.0007) | ~15.98 MB | 8×H100 SXM, 590s

3-Seed Results (8×H100 80GB SXM)

| Seed | Pre-quant BPB | Sliding BPB (s64) | Pruning | Artifact |
|------|---------------|-------------------|---------|----------|
| 42 | 1.0874 | **1.0873** | None | 15,981,300 B |
| 1337 | 1.0865 | **1.0866** | None | 15,978,870 B |
| 2024 | | **1.0859** | None | |
| Mean | | **1.0866** (std 0.0007) | Zero | |

Current merged SOTA: 1.1147 (PR #1019). Delta: −0.0281 BPB.

Key Changes (over PR #1445, this author)

| Change | PR #1445 | This | Impact |
|--------|----------|------|--------|
| Tokenizer | SP4096 | SP8192 | Larger vocab, better context |
| Quantization | Percentile search | SDClip (c = k·std) | Zero pruning, better rate-distortion |

SDClip Quantization

Replaces multi-percentile clip search with clip = k · std(row) (PR #1394):

  • k=12.85 for int6 matrices, k=20.0 for int8 embeddings
  • Directly accounts for compressed size, not just reconstruction error
  • One GPTQ pass per matrix instead of 5
  • Result: zero selective pruning — model fits natively under 16MB
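The per-row SDClip rule can be sketched as follows. This is an illustrative reconstruction of the idea (clip each row at k·std, then quantize symmetrically), assuming symmetric signed-integer quantization; the function names are hypothetical and not the ones in train_gpt.py:

```python
import torch


def sdclip_quantize_per_row(w: torch.Tensor, k: float = 12.85, bits: int = 6):
    """Clip each row at k * std(row), then quantize symmetrically to signed ints.

    Hypothetical sketch of the SDClip rule described in the PR; the actual
    train_gpt.py implementation may differ in details.
    """
    qmax = 2 ** (bits - 1) - 1                        # e.g. 31 for int6
    clip = k * w.std(dim=1, keepdim=True)             # per-row clip threshold
    scale = (clip / qmax).clamp_min(1e-12)            # per-row scale factor
    clipped = w.clamp(-clip, clip)
    q = torch.round(clipped / scale).to(torch.int8)   # int6 values stored in int8
    return q, scale


def sdclip_dequantize(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return q.float() * scale
```

Because a single statistic (the row std) replaces the percentile search, only one clip candidate per matrix needs a GPTQ pass, which is where the 5×-to-1 pass reduction above comes from.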

Full Stack

| Parameter | Value | Source |
|-----------|-------|--------|
| Tokenizer | SP8192 | This work |
| SDClip k (matrices/embed) | 12.85 / 20.0 | PR #1394, this work |
| Recurrence layers | 3,4,5 (14 virtual) | PR #1331 |
| Weight decay | 0.095 | PR #1331 |
| Matrix LR | 0.022 | PR #1331 |
| EMA decay | 0.9965 | PR #1421 (this author) |
| Recurrence start step | 2000 | PR #1445 (this author) |
| Warmdown fraction | 0.72 | PR #1445 (this author) |
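One way to read the 0.72 warmdown fraction is a trapezoidal learning-rate schedule: full LR for the first 28% of steps, then linear decay to zero over the final 72%. A hypothetical sketch under that assumption (the actual schedule in train_gpt.py may differ):

```python
def lr_scale(step: int, total_steps: int, warmdown_frac: float = 0.72) -> float:
    """Assumed trapezoidal schedule: constant LR until the warmdown window
    opens, then linear decay to zero over the final `warmdown_frac` of
    training. Illustrative only, not the script's actual schedule."""
    warmdown_start = round(total_steps * (1.0 - warmdown_frac))
    if step < warmdown_start:
        return 1.0
    return max(0.0, (total_steps - step) / (total_steps - warmdown_start))
```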

Architecture

  • 11L, 512-dim, 8 heads (4 KV), depth recurrence (3,4,5), 14 virtual layers
  • Skip gates, parallel residuals from layer 7, QK-Gain 5.0
  • XSA all 11 layers, LeakyReLU(0.5)², VE128 (layers 9,10)
  • Tied embeddings, logit softcap=30.0
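The logit softcap listed above is commonly implemented as a scaled tanh, which bounds logits to (−cap, cap) while staying near-identity for small values. A minimal sketch, assuming the standard tanh formulation rather than the exact code in train_gpt.py:

```python
import torch


def softcap_logits(logits: torch.Tensor, cap: float = 30.0) -> torch.Tensor:
    # Standard tanh soft-capping: smooth, bounded to (-cap, cap),
    # approximately identity when |logits| << cap.
    return cap * torch.tanh(logits / cap)
```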

Training

  • FlashAttention 3, Muon (lr=0.022, WD=0.095), Adam/AdamW (fused=True)
  • Warmdown: 72%, EMA=0.9965, Wallclock: 590s
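The EMA at decay 0.9965 is a standard exponential moving average over the weights. A plain-dict sketch of one update step, for illustration only (the script's actual EMA bookkeeping over tensors may differ):

```python
def ema_update(ema_params: dict, model_params: dict, decay: float = 0.9965) -> dict:
    """One EMA step: ema <- decay * ema + (1 - decay) * current.

    Illustrative sketch over a plain dict of values; real implementations
    typically hold a shadow copy of the model's parameter tensors.
    """
    for name, p in model_params.items():
        ema_params[name] = decay * ema_params[name] + (1.0 - decay) * p
    return ema_params
```

At decay 0.9965 the effective averaging window is roughly 1/(1−0.9965) ≈ 286 steps.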

Quantization

  • Full Hessian GPTQ + Cholesky + actorder
  • SDClip (c = k·std) — int6 matrices, int8 embeddings
  • Brotli compression, zero selective pruning

Run Command

```shell
SEED=42 VOCAB_SIZE=8192 \
DATA_PATH=./data/datasets/fineweb10B_sp8192/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
RECUR_START_STEP=2000 WARMDOWN_FRAC=0.72 \
torchrun --standalone --nproc_per_node=8 train_gpt.py
```

Credits

…l_bpb 1.0866

3-seed mean: 1.0866 BPB (sliding window stride=64)
Beats merged SOTA (1.1147) by 0.0281 BPB.

SP8192 tokenizer with SDClip quantization (c=k*std),
3-layer recurrence (3,4,5), EMA 0.9965, WD=0.095,
early recurrence (step 2000), extended warmdown (72%).
Zero selective pruning across all seeds.

Seeds: 42 (1.0873), 1337 (1.0866), 2024 (1.0859)
All artifacts under 16MB. 8xH100 SXM, 590s training.
Copilot AI review requested due to automatic review settings April 8, 2026 10:20

Copilot AI left a comment


Pull request overview

Adds a new Track A (10min / 16MB) record snapshot for the “SP8192 + SDClip + 3-layer depth recurrence + EMA 0.9965” run, including the exact training script, seed logs, and leaderboard metadata for reproducibility.

Changes:

  • Add a standalone train_gpt.py implementing SP8192 + SDClip quantization + 3-layer depth recurrence + EMA=0.9965 configuration.
  • Add training logs for the reported seeds (42 / 1337 / 2024) plus a canonical train.log.
  • Add record metadata (submission.json) and a human-readable writeup (README.md).

Reviewed changes

Copilot reviewed 3 out of 7 changed files in this pull request and generated 8 comments.

| File | Description |
|------|-------------|
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train_gpt.py | Standalone training/eval/quantization script for the record configuration |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train.log | Canonical run log (seed 42) capturing hyperparams, training, GPTQ, and final metrics |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train_seed42.log | Full seed-42 run log for reproducibility |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train_seed1337.log | Full seed-1337 run log for reproducibility |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/train_seed2024.log | Full seed-2024 run log for reproducibility |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/submission.json | Leaderboard metadata (mean metrics + artifact size) |
| records/track_10min_16mb/2026-04-08_SP8192_SDClip_3LayerRecur_EMA0.9965_1.0866/README.md | Record summary, reported results table, and reproduction command |


Comment on lines +155 to +163
```python
def log(msg, console: bool = True) -> None:
    if _logger_hparams is None:
        print(msg)
    if _logger_hparams.is_main_process:
        if console:
            print(msg)
        if _logger_hparams.logfile is not None:
            with open(_logger_hparams.logfile, "a", encoding="utf-8") as f:
                print(msg, file=f)
```

Copilot AI Apr 8, 2026


log() prints when _logger_hparams is None but then immediately dereferences _logger_hparams.is_main_process, which would raise AttributeError if log() is ever called before set_logging_hparams() (or if set_logging_hparams() fails). Add an early return after the fallback print, or guard the second if with an else.
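The early-return fix the review asks for could look like the following sketch (the `_logger_hparams = None` default is assumed here to make the example self-contained):

```python
_logger_hparams = None  # assumed module-level default before set_logging_hparams()


def log(msg, console: bool = True) -> None:
    # Fallback path: logging not configured yet. The early return prevents
    # the AttributeError from dereferencing _logger_hparams below.
    if _logger_hparams is None:
        print(msg)
        return
    if _logger_hparams.is_main_process:
        if console:
            print(msg)
        if _logger_hparams.logfile is not None:
            with open(_logger_hparams.logfile, "a", encoding="utf-8") as f:
                print(msg, file=f)
```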

Comment on lines +1280 to +1288
```python
def serialize(h: Hyperparameters, base_model: torch.nn.Module, code: str) -> int:
    model_bytes = None
    code_bytes = len(code.encode("utf-8"))
    if h.is_main_process:
        torch.save(base_model.state_dict(), h.model_path)
        model_bytes = os.path.getsize(h.model_path)
        log(f"Serialized model: {model_bytes} bytes")
        log(f"Code size: {code_bytes} bytes")
```


Copilot AI Apr 8, 2026


serialize() is annotated as returning int but never returns a value. Either return something meaningful (e.g., bytes_total) or change the annotation to -> None to match behavior.
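A sketch of the reviewer's first option, returning the combined byte count so the `-> int` annotation matches behavior. The log calls are dropped here to keep the example self-contained, and the `Hyperparameters` fields follow the quoted snippet rather than the full script:

```python
import os
import torch


def serialize(h, base_model: torch.nn.Module, code: str) -> int:
    """Save the model state dict and return model + code bytes total.
    Sketch of the suggested fix; field names follow the quoted snippet."""
    code_bytes = len(code.encode("utf-8"))
    model_bytes = 0
    if h.is_main_process:
        torch.save(base_model.state_dict(), h.model_path)
        model_bytes = os.path.getsize(h.model_path)
    return model_bytes + code_bytes
```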

Comment on lines +1711 to +1719
```python
def train_model(h: Hyperparameters, device: torch.device, val_data: ValidationData) -> None:
    # Set up model
    base_model = GPT(h).to(device).bfloat16()
    restore_fp32_params(base_model)
    compiled_model = torch.compile(base_model, dynamic=False, fullgraph=True)
    if h.distributed:
        model = DDP(compiled_model, device_ids=[h.local_rank], broadcast_buffers=False)
    else:
        model = compiled_model
```

Copilot AI Apr 8, 2026


train_model() is annotated as returning None, but it returns (base_model, compiled_model). Update the return annotation (and ideally the docstring/comment) to reflect the actual return type to avoid confusion and make static tooling happier.

Comment on lines +1124 to +1168
```python
) -> tuple[dict[str, Tensor], dict[str, object]]:
    """Mixed quantization using full GPTQ for layers with Hessians, fallback to clip-search."""
    result: dict[str, Tensor] = {}
    meta: dict[str, object] = {}
    gptq_count = 0
    fallback_count = 0

    for name, tensor in state_dict.items():
        t = tensor.detach().cpu().contiguous()
        cat = classify_param(name)

        if not t.is_floating_point() or t.numel() <= 65536:
            result[name] = t.to(torch.float16) if t.is_floating_point() else t
            meta[name] = "passthrough"
            continue

        if any(p in name for p in CONTROL_TENSOR_NAME_PATTERNS):
            result[name] = t.float()
            meta[name] = "passthrough_ctrl"
            continue

        if cat in int6_cats and t.ndim == 2:
            if name in hessians:
                q, s = gptq_quantize_weight(t, hessians[name], sdclip_k=sdclip_k)
                gptq_count += 1
                meta[name] = {"type": "int6", "method": "gptq_sdclip"}
            else:
                q, s = quantize_int6_per_row(t, sdclip_k=sdclip_k)
                fallback_count += 1
                meta[name] = {"type": "int6", "method": "sdclip"}
            result[name + ".q"] = q
            result[name + ".scale"] = s
        elif cat in int6_cats and t.ndim >= 1:
            q, s = quantize_int6_per_row(t, sdclip_k=sdclip_k)
            result[name + ".q"] = q
            result[name + ".scale"] = s
            meta[name] = {"type": "int6"}
        else:
            q, s = quantize_float_tensor(t, sdclip_k_embed=sdclip_k_embed)
            result[name + ".q"] = q
            result[name + ".scale"] = s
            meta[name] = {"type": "int8_sdclip"}

    log(f"GPTQ quantization: {gptq_count} layers with full GPTQ, {fallback_count} fallback to clip-search")
    return result, meta
```

Copilot AI Apr 8, 2026


The docstring and log message say the non-GPTQ path is a “fallback to clip-search”, but the code actually falls back to SDClip (quantize_int6_per_row(..., sdclip_k=...)). Please update the wording so logs/metadata accurately reflect what happened during quantization.

```python
    _HAS_BROTLI = True
except ImportError:
    _HAS_BROTLI = False
```


Copilot AI Apr 8, 2026


_HAS_BROTLI is set in the import guard but never used (compression always does an inline import). Either remove _HAS_BROTLI or use it to raise a clearer error when COMPRESSOR=brotli but the module isn’t installed.

Suggested change

```diff
+if os.environ.get("COMPRESSOR", "").strip().lower() == "brotli" and not _HAS_BROTLI:
+    raise ImportError(
+        "COMPRESSOR is set to 'brotli' but the optional 'brotli' module is not installed. "
+        "Install 'brotli' or choose a different compressor."
+    )
```

```
|------|---------------|-------------------|---------|----------|
| 42 | 1.0874 | **1.0873** | None | 15,981,300 B |
| 1337 | 1.0865 | **1.0866** | None | 15,978,870 B |
| 2024 | — | **1.0859** | None | — |
```

Copilot AI Apr 8, 2026


The seed 2024 row is missing the pre-quant BPB and artifact size, but train_seed2024.log includes both (pre-quantization post-ema val_bpb: 1.08623375 and Total submission size ...: 15,975,819 bytes). Please fill these in (or remove the column) so the README matches the provided logs.

Suggested change

```diff
-| 2024 | | **1.0859** | None | |
+| 2024 | 1.0862 | **1.0859** | None | 15,975,819 B |
```

Comment on lines +89 to +90
```shell
DATA_PATH=./data/datasets/fineweb10B_sp8192/ \
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
```

Copilot AI Apr 8, 2026


The run command uses DATA_PATH and TOKENIZER_PATH, but this train_gpt.py reads DATA_DIR and constructs datasets_dir/tokenizer_path internally (it does not read DATA_PATH/TOKENIZER_PATH). Update the command to use the env vars the script actually consumes so reproduction works when data isn’t under ./data/.

Suggested change

```diff
-DATA_PATH=./data/datasets/fineweb10B_sp8192/ \
-TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model \
+DATA_DIR=./data/ \
```

```json
"date": "2026-04-08T00:00:00Z",
"val_loss": 2.80668370,
"val_bpb": 1.08655472,
"bytes_total": 15978870
```

Copilot AI Apr 8, 2026


This track’s submission.json files consistently include bytes_code alongside bytes_total (e.g., records/track_10min_16mb/2026-03-17_NaiveBaseline/submission.json:10). Consider adding bytes_code here as well to match the established schema and make size breakdowns comparable.

Suggested change

```diff
-"bytes_total": 15978870
+"bytes_total": 15978870,
+"bytes_code": 15978870
```

taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 8, 2026
…IP), L05_NORM_PCT still world-novel, flagged L11_DYN_LYAPUNOV vs PR openai#1471 SDClip

Re-audit findings:
- OPT_RIEMANNIAN_GRAM_QKV → comp-novel (Tilde Gram-Space Manifold Muon + arXiv:2603.09697 Mousse, just shipped 14 min before demotion)
- L06_ASYMMETRIC_SKIP_INIT → comp-novel (Nick Ryan May 2024 blog explicitly tests 0.5 half-init schedule, 2-yr prior art)
- L05_NORM_PCT_DROPOUT → STILL world-novel (0 hits on norm-percentile feature dropout)

Comp PR audit (last 3h, openai#1467-openai#1473): PR openai#1471 introduces SDClip — flagged for L11 DYN_LYAPUNOV adjacency review next C180.

Verified world-novels now 4 (down from claimed 8): L05 NORM_PCT_DROPOUT (validated), L09 NGR_LOG_FREQ_INV (shipped), L09 CTX_PARTITIONED_TAB (shipped), L10 CMP_QUANT_VALUE_DEDUP (shipped).

Spend: ~$13.60 / $25 soft cap NORMAL.
Win likelihood: 25% (down from 30%).

LESSON: 3rd time this session conflated 'novel sublayer slice' with 'world-novel'. New rule: pre-ship audit must demand 0 hits on UNDERLYING technique, not just the slice.
MatoTeziTanka pushed a commit to MatoTeziTanka/parameter-golf that referenced this pull request Apr 8, 2026
Refresh PR cache, reclassify, publish frontier verdicts on
data-touching vs data-free compression (PR openai#672/openai#1482/openai#1477/openai#1471).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
taka6745 pushed a commit to taka6745/parameter-golf that referenced this pull request Apr 9, 2026
Two of the three comp-frontier wins are env-var bumps with no code change:
- LOOP_START 4 → 3 (with NUM_LOOPS=2 and LOOP_END=5 this gives 3-layer
  recurrence on layers 3/4/5 instead of 2-layer on 4/5). PR openai#1485 / openai#1471 /
  openai#1437 use this. Expected -0.005 to -0.01 BPB.
- QK_GAIN_INIT 4 → 5. PRs openai#1413, openai#1423, openai#1485, openai#1437, openai#1351, openai#1408 are at 5;
  openai#1482 is at 5.25. PR openai#1477's default 4 is below the leaderboard curve.
  Expected -0.001 BPB.

C1 (Pre-Quant AdamW TTT) is the bigger win (-0.014 BPB) but requires real
code — agent is researching PR openai#1485 / openai#1416 / openai#1306 implementations in
background.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>