
Introducing task level computation abstraction#235

Merged
ShangkunLi merged 11 commits intocoredac:mainfrom
ShangkunLi:taskflow-to-hyperblock
Jan 14, 2026


@ShangkunLi
Collaborator

This PR:

  1. Redefines the computation of taskflow.task: different tasks are related only by data dependencies (e.g., producer-consumer, RAW, WAR, WAW).
  2. Introduces the taskflow.counter op: all loop control is converted into a counter tree/graph, where each counter represents one loop's control.
  3. Introduces taskflow.hyperblock: the remaining code is wrapped into hyperblocks.

This design enables us to perform three-level optimizations:

  1. Pre-task optimization: before transforming to taskflow, we can perform high-level optimizations to explore the parallelism
  2. Resource-agnostic optimization: perform task fission/fusion to minimize the data dependencies between different tasks
  3. Resource-aware optimization: task legalization and resource binding (e.g., mapping one task onto multiple CGRAs, or fusing multiple tasks to maximize resource utilization)

TODO:
Hyperblock fusion to generate the DFG / integrate it into the neura.kernel op.
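
As a rough illustration of the dependency kinds named in item 1 of the first list, the sketch below classifies the dependencies between two tasks from the buffers each one reads and writes. It is a standalone C++ model, not code from this PR: `TaskAccess`, `DepKind`, and `classifyDeps` are hypothetical names, and buffers are identified by plain strings rather than memref values.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical model of a task: the buffers it reads and the buffers it writes.
struct TaskAccess {
  std::set<std::string> reads;
  std::set<std::string> writes;
};

enum class DepKind { RAW, WAR, WAW };

// True if the two sets share at least one buffer name.
static bool intersects(const std::set<std::string> &a,
                       const std::set<std::string> &b) {
  for (const auto &name : a)
    if (b.count(name) != 0)
      return true;
  return false;
}

// Returns the data dependencies from an earlier task to a later task.
std::vector<DepKind> classifyDeps(const TaskAccess &earlier,
                                  const TaskAccess &later) {
  std::vector<DepKind> deps;
  if (intersects(earlier.writes, later.reads))
    deps.push_back(DepKind::RAW); // later reads what earlier wrote
  if (intersects(earlier.reads, later.writes))
    deps.push_back(DepKind::WAR); // later overwrites what earlier read
  if (intersects(earlier.writes, later.writes))
    deps.push_back(DepKind::WAW); // both write the same buffer
  return deps;
}
```

A producer that writes a buffer and a consumer that reads it yield a single RAW entry; two tasks with disjoint buffer sets yield an empty list, which under this model would mean they can be scheduled independently.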

@tancheng
Contributor

what is the diff between task and hyperblock?

@ShangkunLi
Collaborator Author

what is the diff between task and hyperblock?

A task corresponds to the workload accelerated on one (logical) CGRA. A hyperblock is used to build a temporary representation before transforming to pure counter + neura.kernel. Each hyperblock records a code block between or within loops, together with the loop indices that trigger it.

For example, a task can be:

module {
  func.func @_Z21pureNestedLoopExamplePA8_A6_iPA8_A5_iS4_PA7_iPA9_iPiS9_S9_S9_S9_(%arg0: memref<?x8x6xi32>, %arg1: memref<?x8x5xi32>, %arg2: memref<?x8x5xi32>, %arg3: memref<?x7xi32>, %arg4: memref<?x9xi32>, %arg5: memref<?xi32>, %arg6: memref<?xi32>, %arg7: memref<?xi32>, %arg8: memref<?xi32>, %arg9: memref<?xi32>) -> i32 attributes {llvm.linkage = #llvm.linkage<external>} {
    %memory_outputs:5 = "taskflow.task"(%arg0, %arg1, %arg2, %arg5, %arg6, %arg9, %arg3, %arg4, %arg7, %arg8) <{operandSegmentSizes = array<i32: 10, 0>, resultSegmentSizes = array<i32: 5, 0>, task_name = "Task_0"}> ({
    ^bb0(%arg10: memref<?x8x6xi32>, %arg11: memref<?x8x5xi32>, %arg12: memref<?x8x5xi32>, %arg13: memref<?xi32>, %arg14: memref<?xi32>, %arg15: memref<?xi32>, %arg16: memref<?x7xi32>, %arg17: memref<?x9xi32>, %arg18: memref<?xi32>, %arg19: memref<?xi32>):
      affine.for %arg20 = 0 to 4 {
        affine.for %arg21 = 0 to 8 {
          affine.for %arg22 = 0 to 6 {
            %1 = affine.load %arg10[%arg20, %arg21, %arg22] : memref<?x8x6xi32>
            affine.store %1, %arg13[%arg22] : memref<?xi32>
          }
          affine.for %arg22 = 0 to 5 {
            %1 = affine.load %arg11[%arg20, %arg21, %arg22] : memref<?x8x5xi32>
            %2 = affine.load %arg12[%arg20, %arg21, %arg22] : memref<?x8x5xi32>
            %3 = arith.addi %1, %2 : i32
            affine.store %3, %arg14[%arg22] : memref<?xi32>
          }
          affine.for %arg22 = 0 to 6 {
            %1 = affine.load %arg13[%arg22] : memref<?xi32>
            %2 = affine.load %arg14[%arg22] : memref<?xi32>
            %3 = arith.addi %1, %2 : i32
            %4 = affine.load %arg15[0] : memref<?xi32>
            %5 = arith.addi %4, %3 : i32
            affine.store %5, %arg15[0] : memref<?xi32>
          }
        }
        affine.for %arg21 = 0 to 7 {
          %1 = affine.load %arg16[%arg20, %arg21] : memref<?x7xi32>
          affine.store %1, %arg18[%arg21] : memref<?xi32>
        }
        affine.for %arg21 = 0 to 9 {
          %1 = affine.load %arg17[%arg20, %arg21] : memref<?x9xi32>
          %2 = affine.load %arg18[%arg21] : memref<?xi32>
          %3 = arith.addi %1, %2 : i32
          affine.store %3, %arg19[%arg21] : memref<?xi32>
        }
      }
      "taskflow.yield"(%arg13, %arg14, %arg15, %arg18, %arg19) <{operandSegmentSizes = array<i32: 5, 0>}> : (memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>) -> ()
    }) : (memref<?x8x6xi32>, memref<?x8x5xi32>, memref<?x8x5xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?x7xi32>, memref<?x9xi32>, memref<?xi32>, memref<?xi32>) -> (memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>)
    %0 = affine.load %arg9[0] : memref<?xi32>
    return %0 : i32
  }
}

We can extract each code block to build this temporary representation:

module {
  func.func @_Z21pureNestedLoopExamplePA8_A6_iPA8_A5_iS4_PA7_iPA9_iPiS9_S9_S9_S9_(%arg0: memref<?x8x6xi32>, %arg1: memref<?x8x5xi32>, %arg2: memref<?x8x5xi32>, %arg3: memref<?x7xi32>, %arg4: memref<?x9xi32>, %arg5: memref<?xi32>, %arg6: memref<?xi32>, %arg7: memref<?xi32>, %arg8: memref<?xi32>, %arg9: memref<?xi32>) -> i32 attributes {llvm.linkage = #llvm.linkage<external>} {
    %memory_outputs:5 = "taskflow.task"(%arg0, %arg1, %arg2, %arg5, %arg6, %arg9, %arg3, %arg4, %arg7, %arg8) <{operandSegmentSizes = array<i32: 10, 0>, resultSegmentSizes = array<i32: 5, 0>, task_name = "Task_0"}> ({
    ^bb0(%arg10: memref<?x8x6xi32>, %arg11: memref<?x8x5xi32>, %arg12: memref<?x8x5xi32>, %arg13: memref<?xi32>, %arg14: memref<?xi32>, %arg15: memref<?xi32>, %arg16: memref<?x7xi32>, %arg17: memref<?x9xi32>, %arg18: memref<?xi32>, %arg19: memref<?xi32>):
      %1 = taskflow.counter attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 4 : index} : index
      %2 = taskflow.counter parent(%1 : index) attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 8 : index} : index
      %3 = taskflow.counter parent(%2 : index) attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 6 : index} : index
      %4 = taskflow.counter parent(%2 : index) attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 5 : index} : index
      %5 = taskflow.counter parent(%2 : index) attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 6 : index} : index
      %6 = taskflow.counter parent(%1 : index) attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 7 : index} : index
      %7 = taskflow.counter parent(%1 : index) attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 9 : index} : index
      taskflow.hyperblock indices(%1, %2, %3 : index, index, index) {
      ^bb0(%arg20: index, %arg21: index, %arg22: index):
        %8 = memref.load %arg10[%arg20, %arg21, %arg22] : memref<?x8x6xi32>
        memref.store %8, %arg13[%arg22] : memref<?xi32>
      } -> ()
      taskflow.hyperblock indices(%1, %2, %4 : index, index, index) {
      ^bb0(%arg20: index, %arg21: index, %arg22: index):
        %8 = memref.load %arg11[%arg20, %arg21, %arg22] : memref<?x8x5xi32>
        %9 = memref.load %arg12[%arg20, %arg21, %arg22] : memref<?x8x5xi32>
        %10 = arith.addi %8, %9 : i32
        memref.store %10, %arg14[%arg22] : memref<?xi32>
      } -> ()
      taskflow.hyperblock indices(%5 : index) {
      ^bb0(%arg20: index):
        %8 = memref.load %arg13[%arg20] : memref<?xi32>
        %9 = memref.load %arg14[%arg20] : memref<?xi32>
        %10 = arith.addi %8, %9 : i32
        %c0 = arith.constant 0 : index
        %11 = memref.load %arg15[%c0] : memref<?xi32>
        %12 = arith.addi %11, %10 : i32
        %c0_0 = arith.constant 0 : index
        memref.store %12, %arg15[%c0_0] : memref<?xi32>
      } -> ()
      taskflow.hyperblock indices(%1, %6 : index, index) {
      ^bb0(%arg20: index, %arg21: index):
        %8 = memref.load %arg16[%arg20, %arg21] : memref<?x7xi32>
        memref.store %8, %arg18[%arg21] : memref<?xi32>
      } -> ()
      taskflow.hyperblock indices(%1, %7 : index, index) {
      ^bb0(%arg20: index, %arg21: index):
        %8 = memref.load %arg17[%arg20, %arg21] : memref<?x9xi32>
        %9 = memref.load %arg18[%arg21] : memref<?xi32>
        %10 = arith.addi %8, %9 : i32
        memref.store %10, %arg19[%arg21] : memref<?xi32>
      } -> ()
      "taskflow.yield"(%arg13, %arg14, %arg15, %arg18, %arg19) <{operandSegmentSizes = array<i32: 5, 0>}> : (memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>) -> ()
    }) : (memref<?x8x6xi32>, memref<?x8x5xi32>, memref<?x8x5xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?x7xi32>, memref<?x9xi32>, memref<?xi32>, memref<?xi32>) -> (memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>)
    %0 = affine.load %arg9[0] : memref<?xi32>
    return %0 : i32
  }
}
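
To make the counter tree above more concrete, here is a small standalone C++ model of it. This is only a sketch under assumptions: `Counter`, `tripCount`, `totalFirings`, and `exampleTree` are hypothetical names, and it assumes a counter advances through its whole range once for every value of its parent, mirroring the original loop nest. It is not the taskflow dialect's implementation.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for one taskflow.counter: a half-open range
// [lower, upper) with a step, plus the index of its parent (-1 for the root).
struct Counter {
  int64_t lower;
  int64_t upper;
  int64_t step;
  int parent;
};

// Number of values a single counter produces per activation of its parent.
int64_t tripCount(const Counter &c) {
  assert(c.step > 0 && "this sketch only handles positive steps");
  if (c.upper <= c.lower)
    return 0;
  return (c.upper - c.lower + c.step - 1) / c.step;
}

// Total number of values a counter produces over the whole task: its own
// trip count multiplied by the trip counts of all its ancestors.
int64_t totalFirings(const std::vector<Counter> &tree, int id) {
  int64_t total = 1;
  for (int cur = id; cur != -1; cur = tree[cur].parent)
    total *= tripCount(tree[cur]);
  return total;
}

// The counter tree from the example above (%1..%7 map to indices 0..6).
std::vector<Counter> exampleTree() {
  std::vector<Counter> tree;
  tree.push_back({0, 4, 1, -1}); // %1: root
  tree.push_back({0, 8, 1, 0});  // %2: parent %1
  tree.push_back({0, 6, 1, 1});  // %3: parent %2
  tree.push_back({0, 5, 1, 1});  // %4: parent %2
  tree.push_back({0, 6, 1, 1});  // %5: parent %2
  tree.push_back({0, 7, 1, 0});  // %6: parent %1
  tree.push_back({0, 9, 1, 0});  // %7: parent %1
  return tree;
}
```

Under these assumptions, `totalFirings(exampleTree(), 2)` gives 4 × 8 × 6 = 192 for counter %3, which matches the iteration count of the first innermost affine.for in the original task.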

Based on this temporary representation, we can perform resource binding (affine controller chaining, multi-CGRA binding, etc.). We can then fuse these hyperblocks to get the kernel. The key point of hyperblock fusion is to maintain the memory access order, for which I plan to insert a token type between memory access operations.
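
One possible reading of the token idea, sketched in standalone C++: walk the memory accesses in program order and make each access wait on the token of the previous access to the same buffer. The per-buffer chaining policy, and the names `MemAccess` and `chainTokens`, are assumptions made for illustration; the PR does not specify how tokens will be threaded.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Hypothetical record of one memory access, listed in program order.
struct MemAccess {
  std::string buffer; // e.g. "%arg13"
  bool isStore;       // recorded for readability; unused by this policy
};

// For each access, returns the index of the earlier access whose token it
// must wait on, or -1 if it is the first access to its buffer.
// Conservative assumption: all accesses to one buffer are serialized,
// including load-after-load pairs that would not strictly need ordering.
std::vector<int> chainTokens(const std::vector<MemAccess> &accesses) {
  std::map<std::string, int> lastAccess; // buffer -> index of latest access
  std::vector<int> waitsOn(accesses.size(), -1);
  for (std::size_t i = 0; i < accesses.size(); ++i) {
    const std::string &buffer = accesses[i].buffer;
    std::map<std::string, int>::const_iterator it = lastAccess.find(buffer);
    if (it != lastAccess.end())
      waitsOn[i] = it->second;
    lastAccess[buffer] = static_cast<int>(i);
  }
  return waitsOn;
}
```

For the sequence store %arg13, store %arg14, load %arg13, load %arg14, the two loads would wait on accesses 0 and 1 respectively, so the fused kernel keeps each producer store ahead of its consumer load. A less conservative scheme could let consecutive loads of the same buffer share a token, at the cost of tracking several outstanding readers per buffer.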

@tancheng
Contributor

task

  • Bufferization is performed between task and hyperblock IR? I saw affine becomes memref.
  • hyperblock is region but it doesn't have yield?
  • Why is it called hyperblock? It sounds like just block.

@ShangkunLi
Collaborator Author

  • Bufferization is performed between task and hyperblock IR? I saw affine becomes memref.

When we extract affine.load/store/if operations into a hyperblock, we cannot preserve the affine map relationships in the hyperblock region. So I use

// Creates a taskflow.hyperblock operation from HyperblockInfo.
static TaskflowHyperblockOp createHyperblock(
    OpBuilder &builder, Location loc, const HyperblockInfo &info,
    Block *task_body,
    const DenseMap<affine::AffineForOp, LoopInfo *> &loop_info_map) {
  ...
  ...
  MLIRContext *context = hyperblock_op.getContext();
  RewritePatternSet patterns(context);

  populateAffineToStdConversionPatterns(patterns);
  ConversionTarget target(*context);
  target.addLegalDialect<arith::ArithDialect, memref::MemRefDialect,
                         func::FuncDialect, taskflow::TaskflowDialect>();
  target.addIllegalOp<affine::AffineLoadOp, affine::AffineStoreOp,
                      affine::AffineIfOp>();
  if (failed(applyPartialConversion(hyperblock_op, target, std::move(patterns)))) {
    assert(false && "Affine to Standard conversion failed.");
  }

  return hyperblock_op;
}

to convert affine.load/store/if operations into standard operations (e.g., scf.if, memref.load/store).

  • hyperblock is region but it doesn't have yield?

I use the trait SingleBlockImplicitTerminator<"TaskflowHyperblockYieldOp"> in taskflow.hyperblock. This means that if the hyperblock has an output, the taskflow.hyperblock.yield appears explicitly; otherwise it is implicit.

  • Why is it called hyperblock? It sounds like just block.

Because a hyperblock can contain more than a single basic block: we can also wrap the innermost loop (including its loop control) into the hyperblock, so loop control can still be handled without counters. Furthermore, if a loop body contains if-else conditions, we need to lower them to a CDFG-based IR to build the DFG.

@ShangkunLi ShangkunLi merged commit ff5a1e8 into coredac:main Jan 14, 2026
1 check passed