Commit 7ea2371

anthony-maio and claude committed
Swap ReLU² → LeakyReLU(0.5)² in MLP activation
Multiple top PRs (openai#535, openai#549, openai#569) demonstrate -0.0015 to -0.003 bpb from this change. LeakyReLU preserves gradient flow through negative pre-activations while maintaining the sparsity/gating benefits of squaring. At 22M params, dead neurons from hard ReLU are expensive.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
1 parent 01724f3 commit 7ea2371

File tree

1 file changed: +1 −1 lines changed
  • records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT


records/track_10min_16mb/2026-03-23_Reproduce414_LegalTTT/train_gpt.py

Lines changed: 1 addition & 1 deletion
@@ -614,7 +614,7 @@ def __init__(self, dim: int, mlp_mult: int):
         self.proj = CastedLinear(hidden, dim, bias=False)
         self.proj._zero_init = True
     def forward(self, x: Tensor) -> Tensor:
-        x = torch.relu(self.fc(x))
+        x = F.leaky_relu(self.fc(x), negative_slope=0.5)
         return self.proj(x.square())
 class Block(nn.Module):
     def __init__(
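The effect of the one-line change above can be illustrated in isolation. Below is a minimal pure-Python sketch (no torch dependency; the function names are illustrative, not from the repo) contrasting the old ReLU² activation with the new LeakyReLU(0.5)²: for negative pre-activations, ReLU² outputs exactly zero with zero gradient (a "dead" region), while LeakyReLU² keeps both the value and the gradient nonzero.

```python
def relu_sq(x: float) -> float:
    # Old activation: ReLU then square. Negative inputs are clamped to 0,
    # so both the output and its derivative vanish there.
    return max(x, 0.0) ** 2

def leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    # New activation: LeakyReLU(slope) then square. Negative inputs are
    # scaled by `slope` instead of clamped, so they still contribute.
    y = x if x >= 0.0 else slope * x
    return y ** 2

def grad_leaky_relu_sq(x: float, slope: float = 0.5) -> float:
    # Chain rule: d/dx [leaky(x)^2] = 2 * leaky(x) * leaky'(x).
    if x >= 0.0:
        return 2.0 * x
    return 2.0 * (slope * x) * slope

# Positive inputs behave identically under both activations:
assert relu_sq(2.0) == leaky_relu_sq(2.0) == 4.0
# Negative inputs: ReLU² is dead, LeakyReLU² still passes signal/gradient.
assert relu_sq(-2.0) == 0.0
assert leaky_relu_sq(-2.0) == 1.0         # (0.5 * -2.0) ** 2
assert grad_leaky_relu_sq(-2.0) == -1.0   # 2 * (-1.0) * 0.5
```

Note that positive inputs are untouched (slope only applies below zero), which is why the squaring's gating/sparsity behavior on the positive side is preserved, as the commit message claims.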

0 commit comments