
Non-record: Fused Triton relu^2 kernel — negative result (val_bpb=1.1198)#1237

Open
ibarrajo wants to merge 1 commit into openai:main from ibarrajo:approach-f

Conversation


@ibarrajo ibarrajo commented Apr 1, 2026

Summary

  • Fused Triton kernel for the relu² activation (`relu(x).square()`), hand-written with a `torch.compile` fallback
  • Negative result: the Triton kernel yields no speedup when `torch.compile` is active, since the compiler already fuses the same elementwise ops
  • QK-Gain 4.0 included
  • TTT s_0 score: 1.1198
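For reference, the activation being fused is relu squared: the positive part of the input, squared. A minimal pure-Python sketch of the math (the PR's actual kernel is written in Triton and operates on GPU tensors):

```python
def relu_squared(x):
    """relu(x)**2: zero for negative inputs, x*x otherwise."""
    return [max(v, 0.0) ** 2 for v in x]

print(relu_squared([-2.0, 0.0, 3.0]))  # → [0.0, 0.0, 9.0]
```

Because this is just two elementwise ops (a max and a multiply), it is exactly the kind of pattern a fusing compiler handles without help.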

Results

| Metric | Value |
| --- | --- |
| val_bpb (TTT s_0) | 1.1198 |
| val_bpb (base) | 1.1273 |
| Artifact size | 15.1 MB (930 KB headroom) |
| Current SOTA | 1.1147 |

Key Findings

  • Fused Triton kernel does NOT help: torch.compile already fuses relu^2 into an efficient kernel. Hand-written Triton provides zero speedup
  • Lesson: Before writing custom Triton kernels, benchmark against torch.compile — it handles elementwise fusion well
  • Non-record: 1.1198 does not beat SOTA of 1.1147
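The "benchmark first" lesson can be sketched as a small timing harness. This is a hypothetical pure-Python stand-in (both implementations here are assumptions for illustration); the real comparison would time the hand-written Triton kernel against the `torch.compile`-fused op on GPU tensors:

```python
import time

def benchmark(fn, data, iters=100):
    """Return best-of-iters wall time for fn(data), in seconds."""
    best = float("inf")
    for _ in range(iters):
        t0 = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - t0)
    return best

# Two stand-ins for the "hand-written kernel" vs "compiler-fused" paths.
def relu_squared_loop(xs):
    return [max(v, 0.0) * max(v, 0.0) for v in xs]

def relu_squared_fused(xs):
    # Same elementwise math in one pass; in the real setting,
    # torch.compile fuses relu + square into a single kernel,
    # matching what the hand-written Triton kernel does.
    return [v * v if v > 0.0 else 0.0 for v in xs]

data = [float(i - 500) for i in range(1000)]
assert relu_squared_loop(data) == relu_squared_fused(data)
t_loop = benchmark(relu_squared_loop, data)
t_fused = benchmark(relu_squared_fused, data)
print(f"loop: {t_loop:.2e}s  fused: {t_fused:.2e}s")
```

If the two candidates time out the same, as they did here, the custom kernel is not worth its maintenance cost.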

Rule Compliance

  • Training time < 600s
  • Eval time < 600s
  • Artifact < 16MB (15.1MB)
  • No val tokens in artifact
  • Score-first TTT only (s_0 reported)
  • Single-pass evaluation

🤖 Generated with Claude Code

