
Non-record: LeakyReLU(0.9)² slope study — 1.1001 BPB (SLOT), pending credits for competitive run#1062

Open
yaowubarbara wants to merge 2 commits into openai:main from yaowubarbara:leaky-relu-09-sweep

Conversation


@yaowubarbara yaowubarbara commented Mar 29, 2026

Non-record: Activation Slope Sweep Under Extreme Quantization

Investigating how LeakyReLU negative slope interacts with INT6 GPTQ quantization in sub-16MB language models.

Preliminary Result (no FA3)

| Activation | val_bpb (post-quant) | val_bpb (roundtrip) | val_bpb (pre-quant EMA) | Steps | Artifact |
| --- | --- | --- | --- | --- | --- |
| LeakyReLU(0.5)² | 1.1222 | 1.1456 | 1.1371 | 5787 | 15.98 MB |

Next Steps

  1. Baseline with FA3: LeakyReLU(0.5)² — expect ~1.09-1.10 with full training steps
  2. Slope=0.9 with FA3: LeakyReLU(0.9)² — the core contribution of this PR
  3. Compare quant_delta (post - pre) across slopes to measure quantization-activation interaction

Research Question

Does activation slope affect quantization degradation? The community default of 0.5 has been adopted without formal ablation. A slope closer to 1.0 preserves more gradient flow but produces denser weight distributions; a lower slope increases sparsity which may help INT6 compression. This PR aims to find the empirical answer.
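The sparsity intuition behind the research question can be illustrated with a small numeric sketch (pure Python stand-in for the PyTorch activation; the function name and the 1e-2 "near-zero" threshold are illustrative choices, not from the codebase). A lower negative slope shrinks negative pre-activations harder, so more squared outputs land near zero:

```python
import random

def leaky_relu_squared(x: float, slope: float) -> float:
    """Scalar LeakyReLU followed by squaring (the model uses the
    torch.nn.functional.leaky_relu equivalent on tensors)."""
    y = x if x >= 0 else slope * x
    return y * y

random.seed(0)
xs = [random.gauss(0.0, 1.0) for _ in range(100_000)]

for slope in (0.5, 0.9):
    ys = [leaky_relu_squared(x, slope) for x in xs]
    # Lower slope compresses negative inputs more, so a larger fraction
    # of outputs falls near zero -- a sparser, more INT6-friendly
    # weight/activation distribution downstream.
    frac_small = sum(abs(y) < 1e-2 for y in ys) / len(ys)
    print(f"slope={slope}: fraction |y| < 1e-2 = {frac_small:.3f}")
```

On standard-normal inputs, slope 0.5 yields a noticeably larger near-zero fraction than slope 0.9, which is the sparsity-vs-gradient-flow tradeoff the sweep is probing.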

Acknowledgments

Built on PR #1176 by @bigbag1983. LeakyReLU² introduced by @parinzee (PR #493).

@yaowubarbara yaowubarbara changed the title Non-record: LeakyReLU(0.9)² slope sweep (local validation, compute pending) Non-record: LeakyReLU(0.9)² slope sweep — preliminary baseline 1.1222 BPB Apr 2, 2026
@yaowubarbara
Author

Non-Record: Activation Slope Study, LeakyReLU(0.9)^2 vs Community Default (0.5)^2 on Full SOTA Stack

Update (2026-04-03): 8xH100 Results with FA3 + TTT + SLOT

Research Question

Does the community-default LeakyReLU slope of 0.5 remain optimal on the full competitive stack (FA3 + Legal TTT + SLOT + INT6 GPTQ)?

Early community sweep data (on simpler architectures) showed a monotonic improvement from slope 0.5 to 0.9. We test whether this trend holds when the full SOTA training + eval pipeline is applied.

Experimental Design

Controlled experiment (two coupled changes, not strictly single-variable). We take the PR #1176 codebase and change only the activation slope (0.5 → 0.9) and QK-Gain (4.0 → 5.0), keeping all other hyperparameters identical.

| Parameter | PR #1176 (baseline) | This submission |
| --- | --- | --- |
| Activation | LeakyReLU(0.5)^2 | LeakyReLU(0.9)^2 |
| QK-Gain | 4.0 | 5.0 |
| Architecture | 11L, 512d, 8H/4KV GQA | same |
| Optimizer | Parallel Muon + AdamW | same |
| XSA | All 11 layers | same |
| BigramHash | 2816 × 112 | same |
| Value Embedding | dim=128, layers 9,10 | same |
| EMA / SWA | enabled, SWA every 50 | same |
| Quantization | INT6 Full Hessian GPTQ | same |
| TTT | Legal score-first, Muon-TTT | same |
| SLOT | lr=0.003, steps=5 | same |
| Hardware | 8× H100 SXM | same |
| Seed | 1337 | same |
| FUSED_MLP | 1 (fused kernel) | 0 (pure PyTorch; required because the fused kernel hardcodes the slope) |

Results

| Metric | PR #1176 (slope=0.5, QK=4.0) | Ours (slope=0.9, QK=5.0) | Delta |
| --- | --- | --- | --- |
| Training steps (600 s budget) | ~6800 | 6745 | 55 fewer |
| Step avg | n/a | 87.49 ms | n/a |
| Pre-quant EMA val_bpb | n/a | 1.1356 | n/a |
| Post-quant roundtrip val_bpb | n/a | 1.1393 | n/a |
| Post-quant sliding-window val_bpb | 1.0914* | 1.1158 | +0.0244 |
| Legal TTT val_bpb | n/a | 1.1164 | n/a |
| SLOT val_bpb (final) | 1.0914 | 1.1001 | +0.0087 |
| Artifact size (INT6+LZMA) | ~15.98 MB | 15.98 MB | same |
| Peak GPU memory | n/a | 22,858 MiB | n/a |

*PR #1176 reported score includes FA3 + TTT + SLOT. Our sliding window score (1.1158) is the most comparable pre-TTT/SLOT metric.

Key Findings

1. LeakyReLU(0.9)^2 underperforms 0.5^2 on the full SOTA stack by 0.0087 BPB (SLOT final).

This contradicts early community sweep data which showed 0.9 outperforming 0.5 by 0.013 BPB on simpler architectures. The reversal suggests that optimal activation slope is stack-dependent: the interaction between activation function, GPTQ quantization, TTT, and SLOT changes the optimal slope.

2. The ~55 fewer training steps (6745 vs ~6800) account for part but not all of the gap.

With FUSED_MLP=0 (required for slope != 0.5), step_avg was 87.49ms vs the fused kernel's ~85ms. This cost ~55 steps. At the observed learning rate, 55 steps account for roughly 0.001 to 0.002 BPB, leaving ~0.006 BPB attributable to the activation slope change itself.
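The attribution above is quick arithmetic on the reported numbers; the 0.001-0.002 BPB cost of the missing steps is the text's estimate, not a measured value:

```python
# Decompose the SLOT-final gap (numbers from the results table above).
total_gap = 1.1001 - 1.0914          # ours minus PR #1176 baseline

# Estimated cost of the ~55 missing steps, per the analysis above.
for step_cost in (0.001, 0.002):
    slope_effect = total_gap - step_cost
    print(f"if 55 steps cost {step_cost:.3f} BPB, "
          f"the slope change accounts for ~{slope_effect:.4f} BPB")
```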

3. Quantization delta is informative.

Pre-quant EMA (1.1356) to post-quant roundtrip (1.1393) gives a quant_delta of +0.0037. This is relatively small, suggesting LeakyReLU(0.9)^2 does not catastrophically interact with INT6 GPTQ. The performance gap is primarily a training-time effect, not a quantization effect.
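For clarity, the quant_delta metric is just the post-minus-pre difference in validation BPB (the helper name below is ours, not from the codebase):

```python
def quant_delta(pre_quant_bpb: float, post_quant_bpb: float) -> float:
    """Quantization degradation in BPB; small positive values mean the
    weights survive INT6 GPTQ rounding with little loss."""
    return post_quant_bpb - pre_quant_bpb

# Numbers from the results table: pre-quant EMA vs post-quant roundtrip.
delta = quant_delta(1.1356, 1.1393)
print(f"quant_delta = {delta:+.4f} BPB")
```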

Implications

  • The community default of LeakyReLU(0.5)^2 is justified on the current SOTA stack, though not for the reasons originally assumed.
  • Future activation sweep efforts should test on the full pipeline (including TTT + SLOT), not just training + basic eval. The optimal slope shifts between these regimes.
  • A complete 5-point sweep (slopes 0.0, 0.3, 0.5, 0.7, 0.9) with multi-seed validation would map the full slope-BPB curve under the SOTA stack. This requires additional compute credits.
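The proposed 5-point, multi-seed sweep could be driven by enumerating env-var overrides for the training script; this is a sketch only, and the specific seed list is an assumption (the env-var names come from the Reproduction section below):

```python
import itertools

SLOPES = [0.0, 0.3, 0.5, 0.7, 0.9]   # the proposed 5-point sweep
SEEDS = [1337, 1338, 1339]            # multi-seed validation (seeds assumed)

def sweep_configs():
    """Yield env-var overrides for each run. FUSED_MLP must stay 0 for
    any slope other than 0.5, since the fused kernel hardcodes the slope."""
    for slope, seed in itertools.product(SLOPES, SEEDS):
        yield {"FUSED_MLP": "0", "LEAKY_SLOPE": str(slope), "SEED": str(seed)}

configs = list(sweep_configs())
print(f"{len(configs)} runs planned")   # 5 slopes x 3 seeds
```

Each config dict would be merged into the environment of a `torchrun` invocation like the one in the Reproduction section.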

Reproduction

```shell
# On 8x H100 SXM (RunPod Secure Cloud, runpod/parameter-golf:latest):
pip install --break-system-packages flash_attn_3 \
  --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291

git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80

FUSED_MLP=0 LEAKY_SLOPE=0.9 QK_GAIN_INIT=5.0 \
TTT_ENABLED=1 SLOT_ENABLED=1 USE_GPTQ=1 \
SWA_ENABLED=1 SWA_EVERY=50 \
XSA_LAST_N=11 BIGRAM_VOCAB_SIZE=2816 BIGRAM_DIM=112 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
ROPE_DIMS=16 LN_SCALE=1 \
SEED=1337 VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 train_gpt_pr1176.py
```

Code Change

3 lines modified in `train_gpt_pr1176.py`:

  1. Added `_LEAKY_SLOPE = float(os.environ.get("LEAKY_SLOPE", "0.5"))` at module level
  2. Changed `negative_slope=0.5` to `negative_slope=_LEAKY_SLOPE` in `MLP.forward()` (2 locations)
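The change described above might look like the following sketch; the MLP layout here is illustrative (the real `train_gpt_pr1176.py` structure may differ), but the env-var hook and the activation call match the description:

```python
import os

import torch
import torch.nn as nn
import torch.nn.functional as F

# Read the slope once at import time so every MLP instance agrees;
# defaults to the community-standard 0.5 when LEAKY_SLOPE is unset.
_LEAKY_SLOPE = float(os.environ.get("LEAKY_SLOPE", "0.5"))

class MLP(nn.Module):
    """Illustrative two-layer MLP with the LeakyReLU^2 activation."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.up = nn.Linear(dim, hidden, bias=False)
        self.down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Previously negative_slope=0.5 was hardcoded here (one of the
        # two modified call sites); it now reads the module-level slope.
        h = F.leaky_relu(self.up(x), negative_slope=_LEAKY_SLOPE) ** 2
        return self.down(h)
```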

Acknowledgments

This submission builds on PR #1176 by @bigbag1983. Compute provided by OpenAI Parameter Golf RunPod credits ($25 starter grant). Thanks to the Parameter Golf community for the open research environment.

@yaowubarbara yaowubarbara changed the title Non-record: LeakyReLU(0.9)² slope sweep — preliminary baseline 1.1222 BPB Record: LeakyReLU(0.9)² + SLOT + TTT + QK-Gain 5.0 — 1.1001 BPB Apr 7, 2026
@yaowubarbara yaowubarbara changed the title Record: LeakyReLU(0.9)² + SLOT + TTT + QK-Gain 5.0 — 1.1001 BPB Non-record: LeakyReLU(0.9)² slope study — 1.1001 BPB (SLOT), pending credits for competitive run Apr 7, 2026
