
GatedAttn + ValueResid + XSA6 + HedgeMixer + Legal TTT — val_bpb: 1.08965 (3-seed mean)#824

Closed
sahiee-dev wants to merge 1 commit into openai:main from sahiee-dev:submission/xsa6-gatedattn-valueresid

Conversation

@sahiee-dev

XSA6 + BigramHash4K + GatedAttn + ValueResidual on HedgeMixer Stack

Summary

Forked from @agalimova's PR #720 (XSA6 + BigramHash4K + HedgeMixer baseline at 1.1078).
Added two lightweight architectural modifications that improved performance by ~0.018 bpb.

Architectural Changes

  1. Gated attention (`attn_gate`): a per-head learned FP32 scalar (init = 1.0) multiplied against the attention output, letting the model learn head-specific contribution magnitudes.
  2. Value residual (`lambda_v`): a per-block learned FP32 scalar (init = 0.0) that injects a fraction of the initial token embedding `x0` directly into the residual stream.

Both initialize as strict no-ops and are registered in `CONTROL_TENSOR_NAME_PATTERNS` so they remain in FP32 and bypass GPTQ quantization.
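A minimal PyTorch sketch of the two modifications, assuming a fused attention output of shape `(batch, seq, n_heads, head_dim)`. The module and argument names here are hypothetical reconstructions for illustration, not the PR's actual code; the key property shown is that both parameters are exact no-ops at initialization.

```python
import torch
import torch.nn as nn

class GatedAttnValueResid(nn.Module):
    """Illustrative sketch: per-head attention gates (attn_gate) plus a
    per-block value-residual scalar (lambda_v) mixing in the initial
    embedding x0. Not the submission's actual code."""

    def __init__(self, n_heads: int, head_dim: int):
        super().__init__()
        # Per-head FP32 gate, init = 1.0 -> exact no-op at initialization.
        self.attn_gate = nn.Parameter(torch.ones(n_heads, dtype=torch.float32))
        # Per-block FP32 scalar, init = 0.0 -> no x0 injection at initialization.
        self.lambda_v = nn.Parameter(torch.zeros((), dtype=torch.float32))

    def forward(self, attn_out: torch.Tensor, x0: torch.Tensor) -> torch.Tensor:
        # attn_out: (batch, seq, n_heads, head_dim); x0: (batch, seq, n_heads * head_dim)
        b, t, h, d = attn_out.shape
        gated = attn_out * self.attn_gate.view(1, 1, h, 1)
        out = gated.reshape(b, t, h * d)
        return out + self.lambda_v * x0
```

Because both parameters start at their identity values, swapping this into a trained stack cannot hurt the loss at step zero; the gates and residual weight only move if gradients favor it.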

Results

| Seed | val_bpb    | val_loss   | eval_time |
|------|------------|------------|-----------|
| 42   | 1.08778131 | 1.83666831 | 504 s     |
| 1337 | 1.09024766 | 1.84083264 | 503 s     |
| 2025 | 1.09090710 | 1.84194607 | 506 s     |
| mean | 1.08964536 | 1.83981567 |           |

Artifact: 14,917,177 bytes (14.9 MB). All seeds evaluated in under 600 s.

Evaluation

  • EVAL_STRIDE=64 (matches official baseline default)
  • All runs completed in ~503–506s (under the 600s hard limit)
  • Hardware: 8×H100 SXM 80GB
  • Compression: zstd level-22

TTT Legality

TTT follows Case 3 (legal) per issue #402: tokens are scored before any adaptation, and documents are evaluated independently with no cross-document leakage. The dependency graph is identical to standard autoregressive eval.

@sahiee-dev
Author

TTT implementation note for reviewers: this follows Case 3 as defined in issue #402. Each document is evaluated independently, tokens are scored before any adaptation occurs on them, and adaptation on document J never affects scoring of document K. This is not the token-stream TTT pattern that was flagged in #548. Happy to walk through the specific code if helpful.

@sahiee-dev
Author

TTT implementation follows the score-first pattern from issue #402. For each chunk: (1) score tokens with torch.no_grad() and accumulate loss, (2) adapt on that chunk only, (3) advance to the next chunk. No token is ever scored after the model has trained on it. Documents are evaluated independently across DDP ranks with no cross-document leakage. This matches the legal pattern discussed in #402.
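The three-step loop above can be sketched as follows. This is a hypothetical reconstruction of the score-first pattern, not the PR's code: `model`, `optimizer`, and the weight-reset responsibility of the caller are assumptions, and the loss interface is simplified to a single scalar per chunk.

```python
import torch

def ttt_eval_document(model, optimizer, tokens, chunk_size):
    """Illustrative score-first TTT loop (Case 3 in issue #402).
    Scores each chunk BEFORE adapting on it; the caller resets model
    weights between documents so there is no cross-document leakage."""
    total_loss, total_tokens = 0.0, 0
    for start in range(0, len(tokens) - 1, chunk_size):
        chunk = tokens[start : start + chunk_size + 1]  # inputs plus shifted targets
        inputs, targets = chunk[:-1], chunk[1:]
        # (1) Score the chunk before any adaptation on it.
        with torch.no_grad():
            loss = model(inputs, targets)
        total_loss += loss.item() * len(targets)
        total_tokens += len(targets)
        # (2) Adapt on this chunk only, after its tokens are scored.
        optimizer.zero_grad()
        model(inputs, targets).backward()
        optimizer.step()
        # (3) Advance to the next chunk.
    return total_loss / total_tokens
```

The legality-relevant invariant is that the `no_grad` scoring pass for chunk *i* happens strictly before any `optimizer.step()` that has seen chunk *i*, so the dependency graph matches standard autoregressive evaluation.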

FlashyFlash3011 added a commit to FlashyFlash3011/parameter-golf that referenced this pull request Mar 26, 2026
Combines NewTest (PR openai#841 base) with SOTA experiments that achieved ~1.12 BPB:
- train_seq_len/eval_seq_len: 2048 → 4096 (long context from user's SOTA exps)
- bigram_vocab_size: 3072 → 2048, bigram_dim: 112 → 128 (proven SOTA settings)
- xsa_last_n: 11 → 4 (from user's best experiments)
- gated_attention + value_residual: enabled by default (PR openai#824/838 show ~0.018 BPB improvement)
- Bank QAT: symmetric int6 STE fake-quant on all weight banks during warmdown
- Fix: CastedLinear QAT clip range (-32,31) → (-31,31) to match export format
- Compression: lzma-6 → zstd-22 (PR openai#824/838: 14.9MB vs ~16MB, critical for fitting under limit)
- Fix: target_mb budget uses decimal MB (1e6) not MiB (1024^2) matching competition rules
- Budget-aware ±1 weight pruning retained from NewTest
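The clip-range fix in the commit message above reflects that symmetric int6 quantization uses the range [-31, 31] (i.e. ±(2⁵ − 1)), not [-32, 31]. A minimal straight-through-estimator (STE) fake-quant sketch under that assumption; the function name and per-tensor scaling are hypothetical, not the repository's implementation:

```python
import torch

def fake_quant_int6_symmetric(w: torch.Tensor) -> torch.Tensor:
    """Illustrative STE fake-quant for symmetric int6: codes clip to
    [-31, 31] so the range matches a symmetric export format.
    A sketch, not the repo's actual QAT code."""
    # Per-tensor scale so the largest magnitude maps to code 31.
    scale = w.abs().max().clamp(min=1e-8) / 31.0
    q = torch.clamp(torch.round(w / scale), -31, 31)
    dq = q * scale
    # STE: forward pass uses the quantized value, backward passes
    # gradients through unchanged.
    return w + (dq - w).detach()
```

Using [-32, 31] during training but a symmetric format at export would systematically mis-train weights near the negative extreme, which is presumably why the clip range was aligned.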
@valerio-oai
Contributor

Thanks for your submission! Unfortunately, it's disallowed because it is based on a PR that uses hashed n-gram caches, which do not correctly renormalize/reweight the LM's token distribution and look ahead to the target token when mixing probabilities, thereby leaking eval tokens. Please refer to the long discussion under the Issues tab for more details, and please submit more runs in the future!

