Non-record: LeakyReLU(0.9)² slope study — 1.1001 BPB (SLOT), pending credits for competitive run#1062
## Non-Record: Activation Slope Study, LeakyReLU(0.9)² vs Community Default (0.5)² on Full SOTA Stack

### Update (2026-04-03): 8xH100 Results with FA3 + TTT + SLOT

### Research Question

Does the community-default LeakyReLU slope of 0.5 remain optimal on the full competitive stack (FA3 + Legal TTT + SLOT + INT6 GPTQ)? Early community sweep data (on simpler architectures) showed a monotonic improvement from slope 0.5 to 0.9. We test whether this trend holds when the full SOTA training + eval pipeline is applied.

### Experimental Design

A controlled experiment with two coupled changes: we take the PR #1176 codebase and change only the activation slope (0.5 → 0.9) and QK-Gain (4.0 → 5.0), keeping all other hyperparameters identical.
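For concreteness, here is a minimal sketch of the activation under study, assuming "LeakyReLU(s)²" means squaring the LeakyReLU output elementwise. This is one plausible reading, not the competition's definitive kernel; the exact definition lives in `train_gpt_pr1176.py` and may instead preserve sign on the negative branch.

```python
def leaky_relu(x: float, slope: float) -> float:
    # Standard LeakyReLU: identity for x >= 0, slope * x for x < 0.
    return x if x >= 0.0 else slope * x

def leaky_relu_squared(x: float, slope: float = 0.9) -> float:
    # One plausible reading of LeakyReLU(slope)^2: square the LeakyReLU
    # output. NOTE: squaring discards the sign on the negative branch; the
    # actual competition kernel may use a sign-preserving variant.
    y = leaky_relu(x, slope)
    return y * y
```

Under this reading, slope 0.9 retains 81% of the squared magnitude on the negative branch versus 25% at slope 0.5, which is the gradient-flow / density trade-off the slope sweep is probing.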
### Results
*PR #1176 reported score includes FA3 + TTT + SLOT. Our sliding window score (1.1158) is the most comparable pre-TTT/SLOT metric.

### Key Findings

1. **LeakyReLU(0.9)² underperforms 0.5² on the full SOTA stack by 0.0087 BPB (SLOT final).** This contradicts early community sweep data, which showed 0.9 outperforming 0.5 by 0.013 BPB on simpler architectures. The reversal suggests the optimal activation slope is stack-dependent: interactions between the activation function, GPTQ quantization, TTT, and SLOT shift the optimum.
2. **The ~55 fewer training steps (6745 vs ~6800) account for part, but not all, of the gap.** With FUSED_MLP=0 (required for slope != 0.5), step_avg was 87.49 ms vs the fused kernel's ~85 ms, which cost ~55 steps. At the observed learning rate, 55 steps account for roughly 0.001 to 0.002 BPB, leaving ~0.007 BPB attributable to the activation slope change itself.
3. **The quantization delta is informative.** Pre-quant EMA (1.1356) to post-quant roundtrip (1.1393) gives a quant_delta of +0.0037. This is relatively small, suggesting LeakyReLU(0.9)² does not catastrophically interact with INT6 GPTQ: the performance gap is primarily a training-time effect, not a quantization effect.

### Implications
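As a quick arithmetic check on the decomposition in findings 2 and 3 (the BPB figures are from this run; the step-cost range is the estimate stated there, not a measurement):

```python
slot_gap = 0.0087                  # total deficit vs PR #1176, SLOT-final BPB
step_cost = (0.001, 0.002)         # estimated BPB cost of the ~55 lost steps

# Remainder attributable to the slope change itself.
slope_effect = (slot_gap - step_cost[1], slot_gap - step_cost[0])
print(f"slope effect: {slope_effect[0]:.4f} to {slope_effect[1]:.4f} BPB")

# Quantization roundtrip delta: post-quant score minus pre-quant EMA.
quant_delta = 1.1393 - 1.1356
print(f"quant_delta: {quant_delta:+.4f} BPB")   # small relative to the gap
```

The residual slope effect (0.0067 to 0.0077 BPB) dwarfs the quantization delta, supporting the read that the regression is a training-time effect.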
### Reproduction

```bash
# On 8x H100 SXM (RunPod Secure Cloud, runpod/parameter-golf:latest):
pip install --break-system-packages flash_attn_3 \
  --find-links https://windreamer.github.io/flash-attention3-wheels/cu128_torch291
git clone https://github.com/openai/parameter-golf.git
cd parameter-golf
python3 data/cached_challenge_fineweb.py --variant sp1024 --train-shards 80
FUSED_MLP=0 LEAKY_SLOPE=0.9 QK_GAIN_INIT=5.0 \
TTT_ENABLED=1 SLOT_ENABLED=1 USE_GPTQ=1 \
SWA_ENABLED=1 SWA_EVERY=50 \
XSA_LAST_N=11 BIGRAM_VOCAB_SIZE=2816 BIGRAM_DIM=112 \
VE_ENABLED=1 VE_DIM=128 VE_LAYERS=9,10 \
ROPE_DIMS=16 LN_SCALE=1 \
SEED=1337 VOCAB_SIZE=1024 \
torchrun --standalone --nproc_per_node=8 train_gpt_pr1176.py
```

### Code Change

3 lines modified in
### Acknowledgments

This submission builds on PR #1176 by @bigbag1983. Compute provided by OpenAI Parameter Golf RunPod credits ($25 starter grant). Thanks to the Parameter Golf community for the open research environment.
---

## Non-record: Activation Slope Sweep Under Extreme Quantization
Investigating how LeakyReLU negative slope interacts with INT6 GPTQ quantization in sub-16MB language models.
### Preliminary Result (no FA3)
### Next Steps
### Research Question
Does activation slope affect quantization degradation? The community default of 0.5 has been adopted without formal ablation. A slope closer to 1.0 preserves more gradient flow but produces denser weight distributions; a lower slope increases sparsity, which may help INT6 compression. This PR aims to find the empirical answer.
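To make the sparsity intuition concrete, here is a toy symmetric 6-bit roundtrip using plain absmax scaling. This is a deliberate simplification: real GPTQ additionally performs error-compensated, column-by-column rounding, so treat this only as a probe of how a weight distribution's shape affects naive INT6 error.

```python
import numpy as np

def int6_roundtrip(w: np.ndarray) -> np.ndarray:
    # Symmetric 6-bit grid: integers in [-31, 31], absmax scaling.
    # Simplified stand-in for GPTQ's INT6 path, which also compensates
    # rounding error column by column.
    scale = np.abs(w).max() / 31.0
    q = np.clip(np.round(w / scale), -31, 31)
    return q * scale

rng = np.random.default_rng(1337)
w = rng.normal(size=100_000)
mse = float(np.mean((w - int6_roundtrip(w)) ** 2))
# Exact zeros land on the grid, so sparser tensors incur less total
# roundtrip error under this scheme; denser tensors pay the full
# ~(step^2)/12 per element.
```

The per-element error of a uniform grid is roughly step²/12, so distributions that concentrate mass exactly at zero quantize more cheaply than dense ones; whether that survives GPTQ's error compensation is precisely the open question here.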
### Acknowledgments
Built on PR #1176 by @bigbag1983. LeakyReLU² introduced by @parinzee (PR #493).