Conversation

@kimm240

@kimm240 kimm240 commented Nov 4, 2025

Currently it is not possible to fuse an epilogue operation (e.g., bias addition) into a reduction block's initialization statement. This limitation prevents leveraging hardware-specific instructions that support bias accumulation in vector ISAs, such as MACC (multiply-accumulate with bias) instructions.

This commit implements a new schedule primitive 'fuse_reduction_epilogue' that addresses the problem described in:
https://discuss.tvm.apache.org/t/tir-problem-inlining-addition-into-matmul-block/18066

The primitive transforms the following pattern:

Before:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            with T.init():
                temp[vi, vj] = 0
            temp[vi, vj] = temp[vi, vj] + A[vi, vk] * B[vj, vk]

    for i, j in T.grid(M, N):
        with T.block("bias_add"):
            D[vi, vj] = temp[vi, vj] + C[vi, vj]

After:
    for i, j, k in T.grid(M, N, K):
        with T.block("matmul"):
            T.reads(C[vi, vj], A[vi, vk], B[vj, vk])
            T.writes(D[vi, vj])
            with T.init():
                D[vi, vj] = C[vi, vj]  # Fused epilogue into init
            D[vi, vj] = D[vi, vj] + A[vi, vk] * B[vj, vk]

The transformation removes the intermediate temp buffer and the separate epilogue block, enabling better tensorization opportunities for hardware with bias accumulation support.
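
As a usage sketch (hedged: the primitive name comes from this PR, but the Python-level argument list below is an assumption, and the shapes and block names are illustrative):

    import tvm
    from tvm.script import tir as T

    @tvm.script.ir_module
    class Before:
        @T.prim_func
        def main(
            A: T.Buffer((16, 16), "float32"),
            B: T.Buffer((16, 16), "float32"),
            C: T.Buffer((16, 16), "float32"),
            D: T.Buffer((16, 16), "float32"),
        ):
            temp = T.alloc_buffer((16, 16), "float32")
            for i, j, k in T.grid(16, 16, 16):
                with T.block("matmul"):
                    vi, vj, vk = T.axis.remap("SSR", [i, j, k])
                    with T.init():
                        temp[vi, vj] = T.float32(0)
                    temp[vi, vj] = temp[vi, vj] + A[vi, vk] * B[vj, vk]
            for i, j in T.grid(16, 16):
                with T.block("bias_add"):
                    vi, vj = T.axis.remap("SS", [i, j])
                    D[vi, vj] = temp[vi, vj] + C[vi, vj]

    sch = tvm.tir.Schedule(Before)
    # Assumed call shape: (reduction block, epilogue block).
    sch.fuse_reduction_epilogue(sch.get_block("matmul"), sch.get_block("bias_add"))
    print(sch.mod.script())  # expected to print the fused form shown above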

Implementation:

  • ReductionEpilogueFuser class for pattern validation and IR transformation
    • BodyPatternAllowFusion: Validates epilogue can be fused
    • AnalyzeEpiloguePattern: Detects addition pattern (D = temp + C); see the sketch after this list
    • ExtractEpilogueInfo: Extracts buffer and region information
    • CreateFusedReductionBlock: Creates single block with modified T.init()
  • SingleBlockFusionReplacer: Replaces blocks and removes temp buffer
  • Variable mapping between epilogue and reduction block iter vars
  • Proper buffer and region updates with correct read/write ordering
  • FFI bindings and Python API following TVM conventions
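
For intuition, here is a Python-level sketch of the check AnalyzeEpiloguePattern performs (the actual implementation is C++ in compute_inline.cc; the function name and signature below are illustrative only):

    from tvm import tir

    def analyze_epilogue_pattern(epilogue: tir.Block, temp_buf: tir.Buffer):
        """Return the bias buffer if the body matches D[...] = temp[...] + C[...]."""
        store = epilogue.body
        if not isinstance(store, tir.BufferStore) or not isinstance(store.value, tir.Add):
            return None
        a, b = store.value.a, store.value.b
        if not (isinstance(a, tir.BufferLoad) and isinstance(b, tir.BufferLoad)):
            return None
        # Addition is commutative, so the temp buffer may appear on either side.
        if a.buffer.same_as(temp_buf):
            return b.buffer
        if b.buffer.same_as(temp_buf):
            return a.buffer
        return None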

Changes:

  • src/tir/schedule/primitive/compute_inline.cc: Core implementation (~430 lines)
  • src/tir/schedule/primitive.h: Function declaration
  • include/tvm/tir/schedule/schedule.h: Virtual method in ScheduleNode
  • src/tir/schedule/concrete_schedule.{h,cc}: ConcreteScheduleNode implementation
  • src/tir/schedule/traced_schedule.{h,cc}: TracedScheduleNode implementation
  • src/tir/schedule/schedule.cc: FFI binding registration
  • python/tvm/tir/schedule/schedule.py: Python API with documentation
  • tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py: Comprehensive tests including basic fusion, float32 variant, numerical correctness verification, and trace roundtrip validation

Run tests with:
pytest tests/python/tir-schedule/test_tir_schedule_fuse_reduction_epilogue.py -v
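
For illustration, a minimal sketch of the numerical-correctness style of check used there (assuming a single "main" PrimFunc and an llvm target; the helper name is made up):

    import numpy as np
    import tvm

    def check_numerics(sch: tvm.tir.Schedule, n: int = 16):
        lib = tvm.build(sch.mod, target="llvm")
        a = np.random.rand(n, n).astype("float32")
        b = np.random.rand(n, n).astype("float32")
        c = np.random.rand(n, n).astype("float32")
        d = tvm.nd.array(np.zeros((n, n), dtype="float32"))
        lib(tvm.nd.array(a), tvm.nd.array(b), tvm.nd.array(c), d)
        # B is indexed as B[vj, vk] in the pattern, so the reference is A @ B.T + C.
        np.testing.assert_allclose(d.numpy(), a @ b.T + c, rtol=1e-5)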

Could you please also take a look at #18240? Thanks! :)

@gemini-code-assist
Contributor

Summary of Changes

Hello @kimm240, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a significant optimization to the TVM Tensor IR (TIR) schedule by adding a new primitive, FuseReductionEpilogue. This primitive addresses the current limitation of fusing epilogue operations, such as bias addition, directly into reduction blocks. By transforming the IR to integrate the epilogue into the reduction block's initialization, it eliminates intermediate buffer usage and unlocks opportunities for more efficient hardware utilization, particularly for architectures supporting bias accumulation instructions.

Highlights

  • New Primitive: Introduced FuseReductionEpilogue to the TVM TIR schedule, allowing the fusion of epilogue operations (like bias addition) directly into reduction blocks.
  • Optimization Enablement: This fusion enables leveraging hardware-specific instructions (e.g., MACC) for bias accumulation, leading to improved performance and more efficient hardware utilization.
  • IR Transformation: The primitive transforms a two-block pattern (reduction followed by an epilogue) into a single, optimized reduction block by modifying its initialization statement and removing intermediate buffers.
  • Comprehensive Implementation: Includes C++ backend logic (ReductionEpilogueFuser, SingleBlockFusionReplacer), FFI bindings, a Python API with clear documentation, and extensive test cases to ensure correctness and functionality.


@gemini-code-assist gemini-code-assist bot left a comment


Code Review

This pull request introduces a new and useful schedule primitive, fuse_reduction_epilogue, to fuse an epilogue operation into a reduction block's initialization. The implementation is well-structured, with a clear separation of concerns for analysis and transformation. The changes are consistently applied across the scheduling infrastructure, and the new functionality is well-tested, including checks for numerical correctness and trace round-tripping. I've found one high-severity correctness issue in the pattern matching logic and a couple of medium-severity opportunities to improve code clarity and robustness. Overall, this is a great addition to TVM's scheduling capabilities.

@kimm240 kimm240 force-pushed the feature/fuse-reduction-epilogue-clean branch from 59f14e6 to a1c9681 on November 5, 2025 at 02:40
@kimm240 kimm240 force-pushed the feature/fuse-reduction-epilogue-clean branch from a1c9681 to 0fc40e7 on November 5, 2025 at 03:31
@tlopex
Member

tlopex commented Nov 7, 2025

cc @tqchen @Hzfengsy
