Commit 7ea2371

anthony-maio and claude committed
Swap ReLU² → LeakyReLU(0.5)² in MLP activation
Multiple top PRs (openai#535, openai#549, openai#569) demonstrate -0.0015 to -0.003 bpb from this change. LeakyReLU preserves gradient flow through negative pre-activations while maintaining the sparsity/gating benefits of squaring. At 22M params, dead neurons from hard ReLU are expensive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 01724f3 commit 7ea2371

File tree

1 file changed: +1 −1 lines changed
  • records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT


records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT/train_gpt.py

Lines changed: 1 addition & 1 deletion
@@ -614,7 +614,7 @@ def __init__(self, dim: int, mlp_mult: int):
         self.proj = CastedLinear(hidden, dim, bias=False)
         self.proj._zero_init = True
     def forward(self, x: Tensor) -> Tensor:
-        x = torch.relu(self.fc(x))
+        x = F.leaky_relu(self.fc(x), negative_slope=0.5)
         return self.proj(x.square())
 class Block(nn.Module):
     def __init__(
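The effect of the one-line change above can be illustrated in isolation. Below is a minimal pure-Python sketch (no torch dependency; the function names are illustrative, not from the repo) contrasting the old ReLU² activation with the new LeakyReLU(0.5)²: for negative pre-activations, ReLU² outputs exactly zero with zero gradient (a "dead" region), while LeakyReLU² keeps both the value and the gradient nonzero.

```python
def relu_sq(x: float) -> float:
    # Old activation: ReLU then square. Negative inputs are clamped to 0,
    # so both the output and its derivative vanish there.
    return max(x, 0.0) ** 2

def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    # New activation: LeakyReLU(slope) then square. Negative inputs are
    # scaled by `slope` instead of clamped, so they still contribute.
    y = x if x >= 0.0 else slope * x
    return y ** 2

def grad_leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    # Chain rule: d/dx [leaky(x)^2] = 2 * leaky(x) * leaky'(x).
    if x >= 0.0:
        return 2.0 * x
    return 2.0 * (slope * x) * slope

# Positive inputs behave identically under both activations:
assert relu_sq(2.0) == leaky_relu_sq(2.0) == 4.0
# Negative inputs: ReLU² is dead, LeakyReLU² still passes signal/gradient.
assert relu_sq(-2.0) == 0.0
assert leaky_relu_sq(-2.0) == 1.0         # (0.5 * -2.0) ** 2
assert grad_leaky_relu_sq(-2.0) == -1.0   # 2 * (-1.0) * 0.5
```

Note that positive inputs are untouched (slope only applies below zero), which is why the squaring's gating/sparsity behavior on the positive side is preserved, as the commit message claims.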

0 commit comments