Non-record: LeakyReLU(0.9)² slope study — 1.1001 BPB (SLOT), pending credits for competitive run#1062
## Non-Record: Activation Slope Study, LeakyReLU(0.9)² vs Community Default (0.5)² on Full SOTA Stack

### Update (2026-04-03): 8xH100 Results with FA3 + TTT + SLOT

### Research Question

Does the community-default LeakyReLU slope of 0.5 remain optimal on the full competitive stack (FA3 + Legal TTT + SLOT + INT6 GPTQ)? Early community sweep data (on simpler architectures) showed a monotonic improvement from slope 0.5 to 0.9. We test whether this trend holds when the full SOTA training + eval pipeline is applied.

### Experimental Design

A controlled experiment with two coupled changes: we take the PR #1176 codebase and change only the activation slope (0.5 → 0.9) and QK-Gain (4.0 → 5.0), keeping all other hyperparameters identical.
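For concreteness, here is a minimal sketch of the activation under study, assuming "LeakyReLU(s)²" means squaring the LeakyReLU output elementwise. This is one plausible reading, not the competition's definitive kernel; the exact definition lives in `train_gpt_pr1176.py` and may instead preserve sign on the negative branch.

```python
def leaky_relu(x: float, slope: float) -> float:
    # Standard LeakyReLU: identity for x >= 0, slope * x for x < 0.
    return x if x >= 0.0 else slope * x

def leaky_relu_squared(x: float, slope: float = 0.9) -> float:
    # One plausible reading of LeakyReLU(slope)^2: square the LeakyReLU
    # output. NOTE: squaring discards the sign on the negative branch; the
    # actual competition kernel may use a sign-preserving variant.
    y = leaky_relu(x, slope)
    return y * y
```

Under this reading, slope 0.9 retains 81% of the squared magnitude on the negative branch versus 25% at slope 0.5, which is the gradient-flow / density trade-off the slope sweep is probing.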
### Results
*PR #1176 reported score includes FA3 + TTT + SLOT. Our sliding window score (1.1158) is the most comparable pre-TTT/SLOT metric.

### Key Findings

1. **LeakyReLU(0.9)² underperforms 0.5² on the full SOTA stack by 0.0087 BPB (SLOT final).** This contradicts early community sweep data, which showed 0.9 outperforming 0.5 by 0.013 BPB on simpler architectures. The reversal suggests the optimal activation slope is stack-dependent: interactions between the activation function, GPTQ quantization, TTT, and SLOT shift the optimum.
2. **The ~55 fewer training steps (6745 vs ~6800) account for part, but not all, of the gap.** With FUSED_MLP=0 (required for slope != 0.5), step_avg was 87.49 ms vs the fused kernel's ~85 ms, which cost ~55 steps. At the observed learning rate, 55 steps account for roughly 0.001 to 0.002 BPB, leaving ~0.007 BPB attributable to the activation slope change itself.
3. **The quantization delta is informative.** Pre-quant EMA (1.1356) to post-quant roundtrip (1.1393) gives a quant_delta of +0.0037. This is relatively small, suggesting LeakyReLU(0.9)² does not catastrophically interact with INT6 GPTQ: the performance gap is primarily a training-time effect, not a quantization effect.

### Implications
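As a quick arithmetic check on the decomposition in findings 2 and 3 (the BPB figures are from this run; the step-cost range is the estimate stated there, not a measurement):

```python
slot_gap = 0.0087                  # total deficit vs PR #1176, SLOT-final BPB
step_cost = (0.001, 0.002)         # estimated BPB cost of the ~55 lost steps

# Remainder attributable to the slope change itself.
slope_effect = (slot_gap - step_cost[1], slot_gap - step_cost[0])
print(f"slope effect: {slope_effect[0]:.4f} to {slope_effect[1]:.4f} BPB")

# Quantization roundtrip delta: post-quant score minus pre-quant EMA.
quant_delta = 1.1393 - 1.1356
print(f"quant_delta: {quant_delta:+.4f} BPB")   # small relative to the gap
```

The residual slope effect (0.0067 to 0.0077 BPB) dwarfs the quantization delta, supporting the read that the regression is a training-time effect.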
### Reproduction

```bash
# On 8x H100 SXM (RunPod Secure Cloud, runpod/parameter-golf:latest):
pip install --break-system-packages flash_attn_3 \
  --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
FUSED_MLP=0 LEAKY_SLOPE=0.9 QK_GAIN_INIT=5.0 \
TTT_ENABLED=1 SLOT_ENABLED=1 USE_GPTQ=1 \
SWA_ENABLED=1 SWA_EVERY=50 \
XSA_LAST_N=11 BIGRAM_VOCAB_SIZE=2816 BIGRAM_DIM=112 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
ROPE_DIMS=16 LN_SCALE=1 \
SEED=1337 VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 train_gpt_pr1176.py
```

### Code Change

3 lines modified in
### Acknowledgments

This submission builds on PR #1176 by @bigbag1983. Compute provided by OpenAI Parameter Golf RunPod credits ($25 starter grant). Thanks to the Parameter Golf community for the open research environment.
---

## Non-record: Activation Slope Sweep Under Extreme Quantization
Investigating how LeakyReLU negative slope interacts with INT6 GPTQ quantization in sub-16MB language models.
### Preliminary Result (no FA3)
### Next Steps
### Research Question
Does activation slope affect quantization degradation? The community default of 0.5 has been adopted without formal ablation. A slope closer to 1.0 preserves more gradient flow but produces denser weight distributions; a lower slope increases sparsity, which may help INT6 compression. This PR aims to find the empirical answer.
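To make the sparsity intuition concrete, here is a toy symmetric 6-bit roundtrip using plain absmax scaling. This is a deliberate simplification: real GPTQ additionally performs error-compensated, column-by-column rounding, so treat this only as a probe of how a weight distribution's shape affects naive INT6 error.

```python
import numpy as np

def int6_roundtrip(w: np.ndarray) -> np.ndarray:
    # Symmetric 6-bit grid: integers in [-31, 31], absmax scaling.
    # Simplified stand-in for GPTQ's INT6 path, which also compensates
    # rounding error column by column.
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31)
    return q * scale

rng = np.random.default_rng(1337)
w = rng.normal(size=100_000)
mse = float(np.mean((w - int6_roundtrip(w)) ** 2))
# Exact zeros land on the grid, so sparser tensors incur less total
# roundtrip error under this scheme; denser tensors pay the full
# ~(step^2)/12 per element.
```

The per-element error of a uniform grid is roughly step²/12, so distributions that concentrate mass exactly at zero quantize more cheaply than dense ones; whether that survives GPTQ's error compensation is precisely the open question here.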
### Acknowledgments
Built on PR #1176 by @bigbag1983. LeakyReLU² introduced by @parinzee (PR #493).