LeakyReLU + XSA + PartialRoPE + FA3 submission — val_bpb 1.1991 #1427
Open
kjahan wants to merge 2 commits into openai:main from
LeakyReLU² + Partial RoPE + XSA + FA3
val_bpb: 1.2092 (3-seed mean, std 0.0019, int6+LZMA roundtrip) | ~14.39 MB | 8×H100 SXM
Results (8×H100 80GB SXM, PyTorch 2.9.1+cu128)
The relevant numbers are the final validation BPB at the wallclock stop and the exact post-export final_int6_lzma_roundtrip BPB.
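For context, a minimal sketch of how a bits-per-byte figure is conventionally derived from a token-level cross-entropy loss; the exact accounting in the eval harness may differ, and loss_nats_per_token / total_tokens / total_bytes are illustrative names:

```python
import math

def bits_per_byte(loss_nats_per_token: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean cross-entropy (nats per token) into bits per byte of raw text.

    total_tokens / total_bytes rescales from per-token to per-byte,
    and dividing by ln(2) converts nats to bits.
    """
    return loss_nats_per_token * total_tokens / (total_bytes * math.log(2))
```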
Core Changes vs train_gpt.py

This run differs from the repo baseline train_gpt.py in five material ways:

- MLP activation changed from relu(x)^2 to leaky_relu(x, 0.75)^2 (see the sketch below).
- Partial RoPE: ROPE_DIMS=16, so only 16 of 64 head dimensions receive rotary embedding.
- XSA applied only in the last blocks: XSA_LAST_N=4.
- FlashAttention-3 attention (flash_attn_3_func).
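As a concrete illustration of the activation change, a minimal sketch assuming the baseline MLP squares a ReLU between two linear projections; the module and attribute names (MLP, c_fc, c_proj) are illustrative, not copied from train_gpt.py:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MLP(nn.Module):
    """Feed-forward block with a squared leaky-ReLU activation.

    Baseline: F.relu(x) ** 2. This run: F.leaky_relu(x, 0.75) ** 2, which keeps
    the squared-activation structure but lets negative pre-activations through
    with slope 0.75 before squaring.
    """
    def __init__(self, dim: int, hidden: int, negative_slope: float = 0.75):
        super().__init__()
        self.c_fc = nn.Linear(dim, hidden, bias=False)
        self.c_proj = nn.Linear(hidden, dim, bias=False)
        self.negative_slope = negative_slope

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        x = self.c_fc(x)
        x = F.leaky_relu(x, self.negative_slope) ** 2   # was: F.relu(x) ** 2
        return self.c_proj(x)
```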
Innovations

Key contributions in this run:
- LeakyReLU² replaces relu² in the MLP while keeping the same simple squared-activation structure.
- Partial RoPE (16/64) reduces rotary work per head and helps training throughput relative to full-head rotary application (a short sketch follows below).
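A minimal sketch of what partial rotary application looks like with these settings; the helper name, tensor layout, and precomputed cos/sin tables are assumptions, not the repo's exact implementation:

```python
import torch

ROPE_DIMS = 16  # rotary applied to the first 16 of 64 head dims

def apply_partial_rope(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate only the first ROPE_DIMS channels of each head; pass the rest through.

    x:        (batch, heads, seq, head_dim) query or key tensor
    cos, sin: (seq, ROPE_DIMS // 2) precomputed rotary tables
    """
    x_rot, x_pass = x[..., :ROPE_DIMS], x[..., ROPE_DIMS:]
    x1, x2 = x_rot.chunk(2, dim=-1)              # split the rotary slice in half
    rotated = torch.cat((x1 * cos - x2 * sin,    # standard RoPE rotation on the 16 rotary dims
                         x1 * sin + x2 * cos), dim=-1)
    return torch.cat((rotated, x_pass), dim=-1)  # remaining 48 dims are passed through untouched
```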
Architecture and export settings used in the logs:

- XSA in the last 4 blocks (layers [7, 8, 9, 10]; see the sketch below)
- Partial RoPE (16/64 dims per head)
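A small sketch of how those settings relate; the constant names come from the description above, while n_layer and 0-indexed block ids are assumptions used only to show how the logged layer list follows from XSA_LAST_N:

```python
ROPE_DIMS = 16   # rotary dims per 64-dim head (16/64)
XSA_LAST_N = 4   # apply XSA only in the final 4 transformer blocks

n_layer = 11     # assumed depth (not stated in the PR), chosen so the last 4 blocks are [7, 8, 9, 10]
xsa_layer_ids = list(range(n_layer - XSA_LAST_N, n_layer))
assert xsa_layer_ids == [7, 8, 9, 10]  # matches the layer list reported in the logs
```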
Future work:

- Tune matrix_lr for the Muon optimizer.
- Sweep XSA_LAST_N and measure the BPB / throughput tradeoff.
- Try SiLU as the MLP activation.
- Try LeakyReLU again with negative slope 0.5.
- Try BigramHashEmbedding.

Run Command
Credits