GatedAttn + ValueResid + XSA6 + HedgeMixer + Legal TTT — val_bpb: 1.08965 (3-seed mean)#824
Conversation
…all seeds <510s eval
TTT implementation note for reviewers: this follows Case 3 as defined in issue #402. Each document is evaluated independently, tokens are scored before any adaptation occurs on them, and adaptation on document J never affects scoring of document K. This is not the token-stream TTT pattern that was flagged in #548. Happy to walk through the specific code if helpful.
TTT implementation follows the score-first pattern from issue #402. For each chunk: (1) score tokens under torch.no_grad() and accumulate loss, (2) adapt on that chunk only, (3) advance to the next chunk. No token is ever scored after the model has trained on it. Documents are evaluated independently across DDP ranks with no cross-document leakage. This matches the legal pattern discussed in #402.
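For illustration, the chunk loop described above can be sketched in isolation. This is a minimal pure-Python sketch, not the actual implementation: `score_fn` and `adapt_fn` are hypothetical stand-ins for the real model's scoring (which would run under `torch.no_grad()`) and adaptation steps.

```python
def evaluate_document_with_ttt(tokens, chunk_size, score_fn, adapt_fn):
    """Chunked score-first TTT: each chunk is scored BEFORE the model
    adapts on it, so no token is ever scored after being trained on."""
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_loss += score_fn(chunk)  # (1) score first (no_grad in practice)
        adapt_fn(chunk)                # (2) adapt on this chunk only
    return total_loss / max(1, len(tokens))  # per-token average
```

The key invariant is visible in the loop order: adaptation on chunk *i* can only influence scoring of chunks *i+1* onward, matching the standard autoregressive dependency graph.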
Combines NewTest (PR openai#841 base) with SOTA experiments that achieved ~1.12 BPB:

- train_seq_len/eval_seq_len: 2048 → 4096 (long context from user's SOTA exps)
- bigram_vocab_size: 3072 → 2048, bigram_dim: 112 → 128 (proven SOTA settings)
- xsa_last_n: 11 → 4 (from user's best experiments)
- gated_attention + value_residual: enabled by default (PR openai#824/838 show ~0.018 BPB improvement)
- Bank QAT: symmetric int6 STE fake-quant on all weight banks during warmdown
- Fix: CastedLinear QAT clip range (-32,31) → (-31,31) to match export format
- Compression: lzma-6 → zstd-22 (PR openai#824/838: 14.9MB vs ~16MB, critical for fitting under limit)
- Fix: target_mb budget uses decimal MB (1e6), not MiB (1024^2), matching competition rules
- Budget-aware ±1 weight pruning retained from NewTest
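The symmetric int6 fake-quant and the clip-range fix above can be sketched as follows. This is a forward-pass-only NumPy sketch under assumed per-tensor max-abs scaling (the real code operates on the weight banks during warmdown, with STE copying gradients straight through the rounding in the backward pass):

```python
import numpy as np

def fake_quant_int6_symmetric(w, q_min=-31, q_max=31):
    """Symmetric int6 fake-quantization (forward pass only).
    Note the symmetric clip (-31, 31) rather than (-32, 31),
    matching the export format per the fix above."""
    max_abs = np.max(np.abs(w))
    scale = max_abs / q_max if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), q_min, q_max)
    return q * scale, scale
```

Using the symmetric range means the quantized grid is mirror-symmetric around zero, so a weight and its negation dequantize to exact negatives of each other, which the asymmetric (-32, 31) range does not guarantee.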
Thanks for your submission! Unfortunately, it's disallowed because it is based on a PR that uses hashed n-gram caches, which do not correctly renormalize/reweight the LM's token distribution and look ahead to the target token when mixing probabilities, thereby leaking eval tokens. Please refer to the long discussion under the Issues tab for more details, and please submit more runs in the future!
XSA6 + BigramHash4K + GatedAttn + ValueResidual on HedgeMixer Stack
Summary
Forked from @agalimova's PR #720 (XSA6 + BigramHash4K + HedgeMixer baseline at 1.1078).
Added two lightweight architectural modifications that improved performance by ~0.018 bpb.
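In essence, the two modifications detailed under Architectural Changes below each add one learned scalar to the block's forward pass. A minimal sketch (the per-head/per-block tensor shapes are collapsed to plain scalars here, and `block_forward` is a hypothetical name; only `attn_gate`, `lambda_v`, and `x0` come from the actual PR):

```python
import numpy as np

def block_forward(x, x0, attn_out, attn_gate, lambda_v):
    """attn_gate (init 1.0) scales the attention output;
    lambda_v (init 0.0) mixes a fraction of the initial token
    embedding x0 back into the residual stream.
    At init, both are strict no-ops vs. the standard residual."""
    return x + attn_gate * attn_out + lambda_v * x0
```

At initialization (`attn_gate=1.0`, `lambda_v=0.0`) this reduces exactly to the baseline residual update `x + attn_out`, which is what makes the changes safe to enable by default.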
Architectural Changes
- **Attention gating** (`attn_gate`): a per-head learned FP32 scalar (init=1.0) multiplied against the attention output, allowing the model to learn head-specific contribution magnitudes.
- **Value residual** (`lambda_v`): a per-block learned FP32 scalar (init=0.0) that injects a fraction of the initial token embedding `x0` directly into the residual stream.

Both initialize as strict no-ops and are registered in `CONTROL_TENSOR_NAME_PATTERNS` to remain in FP32 and bypass GPTQ quantization.

Results
Artifact: 14,917,177 bytes (14.9MB). All seeds evaluated under 600s.
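The decimal-MB budget fix matters for exactly this artifact size: the competition rules count 1 MB as 1e6 bytes, whereas a MiB-based check (1024² bytes) would report a looser ceiling. A small sketch, where the 15 MB target is an assumed budget for illustration:

```python
def fits_budget(size_bytes, target_mb):
    """Competition rules use decimal megabytes (1 MB = 1e6 bytes),
    not MiB (1024**2 bytes)."""
    return size_bytes <= target_mb * 1_000_000

ARTIFACT_BYTES = 14_917_177  # from the run above
```

Under the decimal convention a 15 MB budget is 15,000,000 bytes; a MiB-based check would allow up to 15,728,640 bytes, so an artifact could pass a MiB check while breaking the rules.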
Evaluation
`EVAL_STRIDE=64` (matches official baseline default)

TTT Legality
TTT follows Case 3 (legal) per issue #402: tokens are scored before any adaptation, and documents are evaluated independently with no cross-document leakage. The dependency graph is identical to standard autoregressive eval.