Conversation

@zacliu2023


This PR introduces the @auto_pipeline decorator that enables automatic multi-level pipelining optimization for Triton kernels, achieving up to 2.18x speedup on GEMM operations compared to non-pipelined kernels.

Performance Results (2048x2048x2048 GEMM on NVIDIA)

| Kernel | Time (ms) | TFLOPS | Speedup vs Baseline |
|--------|-----------|--------|---------------------|
| No Pipeline | 0.199 | 86.22 | 1.00x |
| Default Pipeline (num_stages=3) | 0.139 | 123.28 | 1.43x |
| AutoPipeline (FlagTree) | 0.091 | 188.29 | 2.18x |

AutoPipeline vs Default Pipeline: 1.53x faster

Features

  • Global-to-Shared (G2S) Pipelining: Multi-stage async data prefetching from global memory to shared memory with configurable pipeline depth
  • Shared-to-Register (S2R) Pipelining: Double-buffering optimization for shared memory to register transfers, reducing memory access latency
  • Warp Specialization Support: Producer-consumer pattern configuration with dedicated prefetch and compute warps
  • Async Copy: Automatic async copy instruction generation for supported hardware
  • Swizzle Optimization: Memory access pattern optimization to reduce bank conflicts (see the illustrative sketch after this list)
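
As a rough idea of the swizzle in the last bullet: the classic technique XORs the row index into the column's chunk index so that consecutive rows land in different shared-memory banks. Below is a minimal, standalone sketch of that general pattern, not the pass implemented in this PR:

```python
# Illustrative XOR swizzle, the standard trick for spreading shared-memory
# accesses across banks. This sketches the general technique only; it is
# not the code added by this PR.

def swizzle_column(row: int, col: int, row_cols: int = 64, vec: int = 8) -> int:
    """Remap `col` within a row of `row_cols` elements, treating the row as
    `row_cols // vec` chunks of `vec` contiguous elements each."""
    chunks = row_cols // vec
    chunk = col // vec                # which chunk the element sits in
    chunk ^= row % chunks             # XOR with the row index (wrapped)
    return chunk * vec + col % vec    # recombine chunk and offset

# Without swizzling, chunk 0 of every row maps to the same bank group;
# with it, each row permutes its chunks differently.
for row in range(4):
    print([swizzle_column(row, c) for c in range(0, 64, 8)])
```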

Usage


```python
import triton
import triton.language as tl
from triton.language import auto_pipeline, PipelineConfig, WarpSpecConfig

@triton.jit
@auto_pipeline(PipelineConfig(
    global_to_shared_stages=4,      # G2S pipeline depth
    shared_to_register_stages=2,    # S2R double-buffering
    enable_async_copy=True,         # use async copy instructions
    enable_swizzle=True,            # optimize memory access patterns
    enable_warp_specialization=True,
    warp_spec_config=WarpSpecConfig(
        num_producer_warps=1,
        num_consumer_warps=3,
    )
))
def matmul_kernel(A, B, C, M, N, K, ...):
    # Standard GEMM implementation - no manual changes needed!
    ...
```

Files Changed

Python API (4 files, ~1,700 lines)

  • python/triton/language/pipeline.py - Core @auto_pipeline decorator and PipelineConfig class
  • python/triton/language/autotune_config.py - Smart autotuning utilities
  • python/triton/compiler/pipeline_config.py - Compiler integration hook
  • python/triton/language/__init__.py - Public exports

C++ MLIR Passes (8 files, ~4,500 lines)

  • lib/Dialect/TritonGPU/Transforms/AdvancedPipeliner.cpp - Main S2R pipelining pass
  • lib/Dialect/TritonGPU/Transforms/BufferAccessAnalysis.cpp - Memory access pattern analysis
  • lib/Dialect/TritonGPU/Transforms/CircularBufferTransform.cpp - Circular buffer implementation
  • lib/Dialect/TritonGPU/Transforms/MultiBufferFusion.cpp - Buffer fusion optimization
  • lib/Dialect/TritonGPU/Transforms/PipelineOpportunityDetector.cpp - Pipeline opportunity detection
  • lib/Dialect/TritonGPU/Transforms/SynchronizationInsertion.cpp - Barrier insertion
  • lib/Dialect/TritonGPU/Transforms/WarpSpecialization.cpp - Warp specialization transform
  • lib/Dialect/TritonGPU/Transforms/TMASupport.cpp - TMA (Tensor Memory Accelerator) support

Headers (8 files)

  • include/triton/Dialect/TritonGPU/Transforms/*.h - Pass declarations

Tests & Examples (1 file)

  • python/test/benchmark_autopipeline.py - Benchmark demonstrating speedup

How It Works

  1. Decorator Application: @auto_pipeline(config) attaches the pipeline configuration to the kernel function (see the sketch after this list)
  2. Compiler Hook: During compilation, PipelineCompilerHook detects the configuration and injects optimization passes
  3. Pass Execution: The AdvancedPipeliner pass analyzes the kernel IR and applies:
    - Circular buffer allocation for multi-stage pipelining
    - Double-buffering for S2R optimization
    - Async copy instruction insertion
    - Synchronization barrier placement
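
To make steps 1–2 concrete, a decorator of roughly this shape can attach the config as a function attribute for a compile-time hook to read. This is a hedged sketch of the mechanism only; the `_pipeline_config` attribute, the simplified `PipelineConfig` fields, and `compile_with_pipeline` are illustrative assumptions, not the PR's actual pipeline.py:

```python
# Hedged sketch of the decorator mechanism (steps 1-2), not the PR's code.
from dataclasses import dataclass
from typing import Callable, Optional

@dataclass
class PipelineConfig:
    global_to_shared_stages: int = 3
    shared_to_register_stages: int = 2
    enable_async_copy: bool = True
    enable_swizzle: bool = False
    enable_warp_specialization: bool = False

def auto_pipeline(config: PipelineConfig) -> Callable:
    """Attach `config` to the kernel so a compile-time hook can find it."""
    def decorator(fn: Callable) -> Callable:
        fn._pipeline_config = config   # step 1: attach configuration
        return fn                      # the kernel body is left untouched
    return decorator

def compile_with_pipeline(fn: Callable) -> None:
    """Step 2 (sketch): a compiler hook checks for the attached config and
    would schedule the extra MLIR passes (AdvancedPipeliner etc.) if set."""
    cfg: Optional[PipelineConfig] = getattr(fn, "_pipeline_config", None)
    if cfg is not None:
        print(f"would pipeline with {cfg.global_to_shared_stages} G2S stages")
```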

Test Plan

  • Run `python python/test/benchmark_autopipeline.py` to verify the 2.18x speedup
  • Correctness validation via torch.allclose() comparison with a reference implementation (sketched below)
  • Build with `pip install -e . --no-build-isolation`
  • Run the existing unit tests
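
For reference, the torch.allclose() check mentioned above typically looks like the following. Shapes, dtypes, tolerances, and the `triton_matmul` launcher name are assumptions for illustration, not the PR's benchmark script:

```python
# Hedged sketch of the correctness check; shapes, tolerances, and the
# `triton_matmul` launcher wrapping matmul_kernel are assumptions.
import torch

def check_matmul(triton_matmul, M=2048, N=2048, K=2048):
    a = torch.randn((M, K), device="cuda", dtype=torch.float16)
    b = torch.randn((K, N), device="cuda", dtype=torch.float16)
    out = triton_matmul(a, b)          # pipelined Triton kernel under test
    ref = torch.matmul(a, b)           # PyTorch reference result
    # fp16 GEMM accumulates rounding error, so use loose tolerances
    assert torch.allclose(out, ref, atol=1e-2, rtol=1e-2)
```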

Breaking Changes

None. This is a purely additive feature that doesn't modify existing APIs.

Dependencies

  • Triton 3.1.x base

Introduces automatic multi-level pipelining optimization for Triton kernels
with up to 2.19x speedup on GEMM operations.

## Features

- Global-to-Shared (G2S) Pipelining: Multi-stage async data prefetching
- Shared-to-Register (S2R) Pipelining: Double-buffering optimization
- Warp Specialization: Producer-consumer pattern with dedicated warps

## Performance (2048x2048x2048 GEMM on A100)

| Kernel | TFLOPS | Speedup |
|--------|--------|---------|
| No Pipeline | 86.03 | 1.00x |
| Default Pipeline | 141.17 | 1.64x |
| AutoPipeline | 188.02 | 2.19x |

## Usage

```python
from triton.language import auto_pipeline, PipelineConfig

@triton.jit
@auto_pipeline(PipelineConfig(
    global_to_shared_stages=4,
    shared_to_register_stages=2,
    enable_async_copy=True,
))
def matmul_kernel(...):
    ...
```
TLX language extensions are optional and not needed for core
auto_pipeline functionality. Remove TLX to simplify the PR:

- Remove third_party/tlx/language/tlx directory
- Remove TLX symlink from python/triton/language/extra
- Remove TLX imports from code_generator.py
- Remove create_tlx_autotune_configs from public exports

The core @auto_pipeline decorator still works with:
- G2S pipelining (global_to_shared_stages)
- S2R pipelining (shared_to_register_stages)
- Basic warp specialization config (WarpSpecConfig)
@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.

@zacliu2023 zacliu2023 changed the title PR: Add @auto_pipeline Decorator for Advanced Multi-Level Pipelining feat: Add @auto_pipeline Decorator for Advanced Multi-Level Pipelining Jan 28, 2026