feat: Add @auto_pipeline Decorator for Advanced Multi-Level Pipelining #327
This PR introduces the @auto_pipeline decorator that enables automatic multi-level pipelining optimization for Triton kernels, achieving up to 2.18x speedup on GEMM operations compared to non-pipelined kernels.
Performance Results (2048x2048x2048 GEMM on NVIDIA)
AutoPipeline vs Default Pipeline: 1.53x faster
Features
Usage
Files Changed
How It Works
- Circular buffer allocation for multi-stage pipelining
- Double-buffering for shared-memory-to-register (S2R) loads
- Async copy instruction insertion
- Synchronization barrier placement
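
To make the circular-buffer idea concrete, here is a minimal pure-Python sketch of a multi-stage software pipeline over K tiles. It is an illustration only, not the code this PR generates: `pipelined_dot`, its parameters, and the `trace` list are hypothetical names, a plain list assignment stands in for an async copy, and a comment marks where a synchronization barrier would sit on real hardware.

```python
def pipelined_dot(a_tiles, b_tiles, num_stages=3):
    """Simulate a num_stages-deep pipeline over K tiles (illustrative only)."""
    assert num_stages >= 2, "pipelining needs at least two buffer slots"
    n = len(a_tiles)
    buffers = [None] * num_stages  # circular buffer standing in for shared memory
    trace = []                     # records (event, tile index, buffer slot)

    # Prologue: prefetch the first num_stages - 1 tiles.
    for t in range(min(num_stages - 1, n)):
        buffers[t % num_stages] = (a_tiles[t], b_tiles[t])
        trace.append(("load", t, t % num_stages))

    acc = 0
    for i in range(n):
        # Issue the load for a future tile; it reuses the slot of tile i - 1,
        # which has already been consumed.
        nxt = i + num_stages - 1
        if nxt < n:
            buffers[nxt % num_stages] = (a_tiles[nxt], b_tiles[nxt])
            trace.append(("load", nxt, nxt % num_stages))

        # On real hardware a barrier would wait here until tile i has landed.
        a, b = buffers[i % num_stages]
        acc += sum(x * y for x, y in zip(a, b))
        trace.append(("compute", i, i % num_stages))

    return acc, trace


if __name__ == "__main__":
    a = [[1, 2], [3, 4], [5, 6], [7, 8]]
    b = [[1, 1], [1, 1], [1, 1], [1, 1]]
    total, events = pipelined_dot(a, b, num_stages=3)
    print(total)
    for event in events:
        print(event)
```

In this sketch, the load for tile `i + num_stages - 1` is always issued before tile `i` is consumed, so up to `num_stages - 1` loads are in flight while one tile is being computed. With `num_stages=2` it reduces to ordinary double-buffering.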
Test Plan
Breaking Changes
None. This is a purely additive feature that doesn't modify existing APIs.
Dependencies