GatedAttn + ValueResid + XSA6 + HedgeMixer + Legal TTT — val_bpb: 1.08965 (3-seed mean)#824
Conversation
…all seeds <510s eval
TTT implementation note for reviewers: this follows Case 3 as defined in issue #402. Each document is evaluated independently, tokens are scored before any adaptation occurs on them, and adaptation on document J never affects scoring of document K. This is not the token-stream TTT pattern that was flagged in #548. Happy to walk through the specific code if helpful.
TTT implementation follows the score-first pattern from issue #402. For each chunk: (1) score tokens under torch.no_grad() and accumulate loss, (2) adapt on that chunk only, (3) advance to the next chunk. No token is ever scored after the model has trained on it. Documents are evaluated independently across DDP ranks with no cross-document leakage. This matches the legal pattern discussed in #402.
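For illustration, the chunk loop described above can be sketched in isolation. This is a minimal pure-Python sketch, not the actual implementation: `score_fn` and `adapt_fn` are hypothetical stand-ins for the real model's scoring (which would run under `torch.no_grad()`) and adaptation steps.

```python
def evaluate_document_with_ttt(tokens, chunk_size, score_fn, adapt_fn):
    """Chunked score-first TTT: each chunk is scored BEFORE the model
    adapts on it, so no token is ever scored after being trained on."""
    total_loss = 0.0
    for start in range(0, len(tokens), chunk_size):
        chunk = tokens[start:start + chunk_size]
        total_loss += score_fn(chunk)  # (1) score first (no_grad in practice)
        adapt_fn(chunk)                # (2) adapt on this chunk only
    return total_loss / max(1, len(tokens))  # per-token average
```

The key invariant is visible in the loop order: adaptation on chunk *i* can only influence scoring of chunks *i+1* onward, matching the standard autoregressive dependency graph.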
Combines NewTest (PR openai#841 base) with SOTA experiments that achieved ~1.12 BPB:

- train_seq_len/eval_seq_len: 2048 → 4096 (long context from user's SOTA exps)
- bigram_vocab_size: 3072 → 2048, bigram_dim: 112 → 128 (proven SOTA settings)
- xsa_last_n: 11 → 4 (from user's best experiments)
- gated_attention + value_residual: enabled by default (PR openai#824/838 show ~0.018 BPB improvement)
- Bank QAT: symmetric int6 STE fake-quant on all weight banks during warmdown
- Fix: CastedLinear QAT clip range (-32,31) → (-31,31) to match export format
- Compression: lzma-6 → zstd-22 (PR openai#824/838: 14.9MB vs ~16MB, critical for fitting under limit)
- Fix: target_mb budget uses decimal MB (1e6), not MiB (1024^2), matching competition rules
- Budget-aware ±1 weight pruning retained from NewTest
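The symmetric int6 fake-quant and the clip-range fix above can be sketched as follows. This is a forward-pass-only NumPy sketch under assumed per-tensor max-abs scaling (the real code operates on the weight banks during warmdown, with STE copying gradients straight through the rounding in the backward pass):

```python
import numpy as np

def fake_quant_int6_symmetric(w, q_min=-31, q_max=31):
    """Symmetric int6 fake-quantization (forward pass only).
    Note the symmetric clip (-31, 31) rather than (-32, 31),
    matching the export format per the fix above."""
    max_abs = np.max(np.abs(w))
    scale = max_abs / q_max if max_abs > 0 else 1.0
    q = np.clip(np.round(w / scale), q_min, q_max)
    return q * scale, scale
```

Using the symmetric range means the quantized grid is mirror-symmetric around zero, so a weight and its negation dequantize to exact negatives of each other, which the asymmetric (-32, 31) range does not guarantee.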
Thanks for your submission! Unfortunately, it's disallowed because it is based on a PR that uses hashed n-gram caches, which do not correctly renormalize/reweight the LM's token distribution and look ahead to the target token when mixing probabilities, thereby leaking eval tokens. Please refer to the long discussion under the Issues tab for more details, and please submit more runs in the future!
XSA6 + BigramHash4K + GatedAttn + ValueResidual on HedgeMixer Stack
Summary
Forked from @agalimova's PR #720 (XSA6 + BigramHash4K + HedgeMixer baseline at 1.1078).
Added two lightweight architectural modifications that improved performance by ~0.018 bpb.
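In essence, the two modifications detailed under Architectural Changes below each add one learned scalar to the block's forward pass. A minimal sketch (the per-head/per-block tensor shapes are collapsed to plain scalars here, and `block_forward` is a hypothetical name; only `attn_gate`, `lambda_v`, and `x0` come from the actual PR):

```python
import numpy as np

def block_forward(x, x0, attn_out, attn_gate, lambda_v):
    """attn_gate (init 1.0) scales the attention output;
    lambda_v (init 0.0) mixes a fraction of the initial token
    embedding x0 back into the residual stream.
    At init, both are strict no-ops vs. the standard residual."""
    return x + attn_gate * attn_out + lambda_v * x0
```

At initialization (`attn_gate=1.0`, `lambda_v=0.0`) this reduces exactly to the baseline residual update `x + attn_out`, which is what makes the changes safe to enable by default.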
Architectural Changes
- **Attention gating** (`attn_gate`): a per-head learned FP32 scalar (init=1.0) multiplied against the attention output, allowing the model to learn head-specific contribution magnitudes.
- **Value residual** (`lambda_v`): a per-block learned FP32 scalar (init=0.0) that injects a fraction of the initial token embedding `x0` directly into the residual stream.

Both initialize as strict no-ops and are registered in `CONTROL_TENSOR_NAME_PATTERNS` to remain in FP32 and bypass GPTQ quantization.

Results
Artifact: 14,917,177 bytes (14.9MB). All seeds evaluated under 600s.
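The decimal-MB budget fix matters for exactly this artifact size: the competition rules count 1 MB as 1e6 bytes, whereas a MiB-based check (1024² bytes) would report a looser ceiling. A small sketch, where the 15 MB target is an assumed budget for illustration:

```python
def fits_budget(size_bytes, target_mb):
    """Competition rules use decimal megabytes (1 MB = 1e6 bytes),
    not MiB (1024**2 bytes)."""
    return size_bytes <= target_mb * 1_000_000

ARTIFACT_BYTES = 14_917_177  # from the run above
```

Under the decimal convention a 15 MB budget is 15,000,000 bytes; a MiB-based check would allow up to 15,728,640 bytes, so an artifact could pass a MiB check while breaking the rules.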
Evaluation
`EVAL_STRIDE=64` (matches official baseline default)

TTT Legality
TTT follows Case 3 (legal) per issue #402: tokens are scored before any adaptation, and documents are evaluated independently with no cross-document leakage. The dependency graph is identical to standard autoregressive eval.