Introducing task level computation abstraction by ShangkunLi · Pull Request #235 · coredac/neura

ShangkunLi · 2026-01-13T13:35:54Z

This PR:

redefines the computation of taskflow.task: only data dependencies (e.g., producer-consumer, RAW, WAR, WAW) remain between different tasks
introduces the taskflow.counter op: all loop control is converted into a counter tree/graph, where each counter represents the control of one loop
introduces taskflow.hyperblock: the remaining code is wrapped into hyperblocks
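
As a rough illustration of the dependence kinds named above (RAW, WAR, WAW), the sketch below classifies the dependences between two tasks from their read and write sets. It is not code from this PR: `TaskAccess`, `overlaps`, and `classifyDeps` are hypothetical names, and buffers are identified by plain strings for simplicity.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// Hypothetical summary of one task: the buffers it reads and writes.
struct TaskAccess {
  std::string name;
  std::set<std::string> reads;
  std::set<std::string> writes;
};

// Returns true if the two sets share at least one buffer.
static bool overlaps(const std::set<std::string> &a,
                     const std::set<std::string> &b) {
  for (const std::string &x : a)
    if (b.count(x) != 0)
      return true;
  return false;
}

// Lists the dependence kinds from `earlier` to `later`, where `earlier`
// precedes `later` in program order.
std::vector<std::string> classifyDeps(const TaskAccess &earlier,
                                      const TaskAccess &later) {
  assert(earlier.name != later.name && "expects two distinct tasks");
  std::vector<std::string> kinds;
  if (overlaps(earlier.writes, later.reads))
    kinds.push_back("RAW"); // later reads what earlier wrote
  if (overlaps(earlier.reads, later.writes))
    kinds.push_back("WAR"); // later overwrites what earlier read
  if (overlaps(earlier.writes, later.writes))
    kinds.push_back("WAW"); // both write the same buffer
  return kinds;
}
```

An empty result means the two tasks touch disjoint buffers (or only share reads), so nothing in this model forces an order between them.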

This design enables us to perform three-level optimizations:

Pre-task optimization: before transforming to taskflow, we can perform high-level optimizations to explore the parallelism
Resource-agnostic optimization: perform task fission/fusion to minimize the data dependencies between different tasks
Resource-aware optimization: task legalization & resource binding (e.g., one task on multiple CGRAs, or fusing multiple tasks to maximize resource utilization)

TODO:
Hyperblock fusion to generate the DFG/integrate to the neura.kernel op.

tancheng · 2026-01-13T22:59:09Z

what is the diff between task and hyperblock?

ShangkunLi · 2026-01-14T04:17:25Z

what is the diff between task and hyperblock?

A task corresponds to the workload accelerated on one (logical) CGRA. A hyperblock is used to build a temporary representation before transforming to pure counter + neura.kernel. Each hyperblock records a code block between or within loops, together with the loop indices that trigger it.

For example, a task can be:

module {
  func.func @_Z21pureNestedLoopExamplePA8_A6_iPA8_A5_iS4_PA7_iPA9_iPiS9_S9_S9_S9_(%arg0: memref<?x8x6xi32>, %arg1: memref<?x8x5xi32>, %arg2: memref<?x8x5xi32>, %arg3: memref<?x7xi32>, %arg4: memref<?x9xi32>, %arg5: memref<?xi32>, %arg6: memref<?xi32>, %arg7: memref<?xi32>, %arg8: memref<?xi32>, %arg9: memref<?xi32>) -> i32 attributes {llvm.linkage = #llvm.linkage<external>} {
    %memory_outputs:5 = "taskflow.task"(%arg0, %arg1, %arg2, %arg5, %arg6, %arg9, %arg3, %arg4, %arg7, %arg8) <{operandSegmentSizes = array<i32: 10, 0>, resultSegmentSizes = array<i32: 5, 0>, task_name = "Task_0"}> ({
    ^bb0(%arg10: memref<?x8x6xi32>, %arg11: memref<?x8x5xi32>, %arg12: memref<?x8x5xi32>, %arg13: memref<?xi32>, %arg14: memref<?xi32>, %arg15: memref<?xi32>, %arg16: memref<?x7xi32>, %arg17: memref<?x9xi32>, %arg18: memref<?xi32>, %arg19: memref<?xi32>):
      affine.for %arg20 = 0 to 4 {
        affine.for %arg21 = 0 to 8 {
          affine.for %arg22 = 0 to 6 {
            %1 = affine.load %arg10[%arg20, %arg21, %arg22] : memref<?x8x6xi32>
            affine.store %1, %arg13[%arg22] : memref<?xi32>
          }
          affine.for %arg22 = 0 to 5 {
            %1 = affine.load %arg11[%arg20, %arg21, %arg22] : memref<?x8x5xi32>
            %2 = affine.load %arg12[%arg20, %arg21, %arg22] : memref<?x8x5xi32>
            %3 = arith.addi %1, %2 : i32
            affine.store %3, %arg14[%arg22] : memref<?xi32>
          }
          affine.for %arg22 = 0 to 6 {
            %1 = affine.load %arg13[%arg22] : memref<?xi32>
            %2 = affine.load %arg14[%arg22] : memref<?xi32>
            %3 = arith.addi %1, %2 : i32
            %4 = affine.load %arg15[0] : memref<?xi32>
            %5 = arith.addi %4, %3 : i32
            affine.store %5, %arg15[0] : memref<?xi32>
          }
        }
        affine.for %arg21 = 0 to 7 {
          %1 = affine.load %arg16[%arg20, %arg21] : memref<?x7xi32>
          affine.store %1, %arg18[%arg21] : memref<?xi32>
        }
        affine.for %arg21 = 0 to 9 {
          %1 = affine.load %arg17[%arg20, %arg21] : memref<?x9xi32>
          %2 = affine.load %arg18[%arg21] : memref<?xi32>
          %3 = arith.addi %1, %2 : i32
          affine.store %3, %arg19[%arg21] : memref<?xi32>
        }
      }
      "taskflow.yield"(%arg13, %arg14, %arg15, %arg18, %arg19) <{operandSegmentSizes = array<i32: 5, 0>}> : (memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>) -> ()
    }) : (memref<?x8x6xi32>, memref<?x8x5xi32>, memref<?x8x5xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?x7xi32>, memref<?x9xi32>, memref<?xi32>, memref<?xi32>) -> (memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>)
    %0 = affine.load %arg9[0] : memref<?xi32>
    return %0 : i32
  }
}

We can extract each code block to build this temporary representation:

module {
  func.func @_Z21pureNestedLoopExamplePA8_A6_iPA8_A5_iS4_PA7_iPA9_iPiS9_S9_S9_S9_(%arg0: memref<?x8x6xi32>, %arg1: memref<?x8x5xi32>, %arg2: memref<?x8x5xi32>, %arg3: memref<?x7xi32>, %arg4: memref<?x9xi32>, %arg5: memref<?xi32>, %arg6: memref<?xi32>, %arg7: memref<?xi32>, %arg8: memref<?xi32>, %arg9: memref<?xi32>) -> i32 attributes {llvm.linkage = #llvm.linkage<external>} {
    %memory_outputs:5 = "taskflow.task"(%arg0, %arg1, %arg2, %arg5, %arg6, %arg9, %arg3, %arg4, %arg7, %arg8) <{operandSegmentSizes = array<i32: 10, 0>, resultSegmentSizes = array<i32: 5, 0>, task_name = "Task_0"}> ({
    ^bb0(%arg10: memref<?x8x6xi32>, %arg11: memref<?x8x5xi32>, %arg12: memref<?x8x5xi32>, %arg13: memref<?xi32>, %arg14: memref<?xi32>, %arg15: memref<?xi32>, %arg16: memref<?x7xi32>, %arg17: memref<?x9xi32>, %arg18: memref<?xi32>, %arg19: memref<?xi32>):
      %1 = taskflow.counter attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 4 : index} : index
      %2 = taskflow.counter parent(%1 : index) attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 8 : index} : index
      %3 = taskflow.counter parent(%2 : index) attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 6 : index} : index
      %4 = taskflow.counter parent(%2 : index) attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 5 : index} : index
      %5 = taskflow.counter parent(%2 : index) attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 6 : index} : index
      %6 = taskflow.counter parent(%1 : index) attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 7 : index} : index
      %7 = taskflow.counter parent(%1 : index) attributes {lower_bound = 0 : index, step = 1 : index, upper_bound = 9 : index} : index
      taskflow.hyperblock indices(%1, %2, %3 : index, index, index) {
      ^bb0(%arg20: index, %arg21: index, %arg22: index):
        %8 = memref.load %arg10[%arg20, %arg21, %arg22] : memref<?x8x6xi32>
        memref.store %8, %arg13[%arg22] : memref<?xi32>
      } -> ()
      taskflow.hyperblock indices(%1, %2, %4 : index, index, index) {
      ^bb0(%arg20: index, %arg21: index, %arg22: index):
        %8 = memref.load %arg11[%arg20, %arg21, %arg22] : memref<?x8x5xi32>
        %9 = memref.load %arg12[%arg20, %arg21, %arg22] : memref<?x8x5xi32>
        %10 = arith.addi %8, %9 : i32
        memref.store %10, %arg14[%arg22] : memref<?xi32>
      } -> ()
      taskflow.hyperblock indices(%5 : index) {
      ^bb0(%arg20: index):
        %8 = memref.load %arg13[%arg20] : memref<?xi32>
        %9 = memref.load %arg14[%arg20] : memref<?xi32>
        %10 = arith.addi %8, %9 : i32
        %c0 = arith.constant 0 : index
        %11 = memref.load %arg15[%c0] : memref<?xi32>
        %12 = arith.addi %11, %10 : i32
        %c0_0 = arith.constant 0 : index
        memref.store %12, %arg15[%c0_0] : memref<?xi32>
      } -> ()
      taskflow.hyperblock indices(%1, %6 : index, index) {
      ^bb0(%arg20: index, %arg21: index):
        %8 = memref.load %arg16[%arg20, %arg21] : memref<?x7xi32>
        memref.store %8, %arg18[%arg21] : memref<?xi32>
      } -> ()
      taskflow.hyperblock indices(%1, %7 : index, index) {
      ^bb0(%arg20: index, %arg21: index):
        %8 = memref.load %arg17[%arg20, %arg21] : memref<?x9xi32>
        %9 = memref.load %arg18[%arg21] : memref<?xi32>
        %10 = arith.addi %8, %9 : i32
        memref.store %10, %arg19[%arg21] : memref<?xi32>
      } -> ()
      "taskflow.yield"(%arg13, %arg14, %arg15, %arg18, %arg19) <{operandSegmentSizes = array<i32: 5, 0>}> : (memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>) -> ()
    }) : (memref<?x8x6xi32>, memref<?x8x5xi32>, memref<?x8x5xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?x7xi32>, memref<?x9xi32>, memref<?xi32>, memref<?xi32>) -> (memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>, memref<?xi32>)
    %0 = affine.load %arg9[0] : memref<?xi32>
    return %0 : i32
  }
}
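
To make the counter tree in the IR above more concrete, here is a minimal C++ model of it. This is a sketch under assumptions, not the PR's implementation: `Counter`, `tripCount`, `chainFromRoot`, `totalFirings`, and `exampleCounters` are hypothetical names, and it assumes a hyperblock fires once per iteration of its innermost counter for every iteration of that counter's ancestors.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical model of one taskflow.counter.
struct Counter {
  int64_t lower, upper, step;
  int parent; // index of the parent counter, -1 for a root counter
};

// Number of iterations of a single counter (positive steps only).
int64_t tripCount(const Counter &c) {
  assert(c.step > 0 && "sketch only handles positive steps");
  if (c.upper <= c.lower)
    return 0;
  return (c.upper - c.lower + c.step - 1) / c.step;
}

// Counter ids on the path from the root down to `leaf`.
std::vector<int> chainFromRoot(const std::vector<Counter> &cs, int leaf) {
  assert(leaf >= 0 && leaf < static_cast<int>(cs.size()));
  std::vector<int> chain;
  for (int id = leaf; id != -1; id = cs[id].parent)
    chain.insert(chain.begin(), id);
  return chain;
}

// Total firings of a hyperblock whose innermost counter is `leaf`.
int64_t totalFirings(const std::vector<Counter> &cs, int leaf) {
  int64_t n = 1;
  for (int id : chainFromRoot(cs, leaf))
    n *= tripCount(cs[id]);
  return n;
}

// The seven counters from the example IR, in order (%1 .. %7).
std::vector<Counter> exampleCounters() {
  return {
      {0, 4, 1, -1}, // %1: root
      {0, 8, 1, 0},  // %2: parent %1
      {0, 6, 1, 1},  // %3: parent %2
      {0, 5, 1, 1},  // %4: parent %2
      {0, 6, 1, 1},  // %5: parent %2
      {0, 7, 1, 0},  // %6: parent %1
      {0, 9, 1, 0},  // %7: parent %1
  };
}
```

Under that assumption, `totalFirings(exampleCounters(), 2)` multiplies the trip counts along the chain %1, %2, %3, which matches the 4 x 8 x 6 iterations of the first innermost loop in the affine version.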

Based on this temporary representation, we can perform resource binding (affine controller chaining, multi-CGRA binding, etc.) and then fuse these hyperblocks to get the kernel. The key point of hyperblock fusion is to maintain the memory access order. To do that, I plan to insert a token type between memory access operations.
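
One possible way to picture the token idea is sketched below. The PR does not specify the scheme, so everything here is an assumption: `MemOp` and `chainTokens` are hypothetical names, and the sketch conservatively threads a single token chain per buffer through all accesses in program order (which also orders load-after-load pairs that a finer scheme could leave unordered).

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical record of one memory access inside a fused kernel.
struct MemOp {
  std::string buffer; // buffer being accessed, e.g. "arg13"
  bool isStore;       // true for a store, false for a load
  int tokenIn;        // index of the op whose token must arrive first, -1 if none
};

// Threads one token chain per buffer through `ops`, given in program order.
// Each op waits for the token of the previous access to the same buffer.
void chainTokens(std::vector<MemOp> &ops) {
  std::map<std::string, int> last; // buffer -> index of its latest access
  for (int i = 0; i < static_cast<int>(ops.size()); ++i) {
    auto it = last.find(ops[i].buffer);
    ops[i].tokenIn = (it == last.end()) ? -1 : it->second;
    // A token producer always precedes its consumer in program order.
    assert(ops[i].tokenIn < i);
    last[ops[i].buffer] = i;
  }
}
```

With this chaining, a load of a buffer that an earlier hyperblock stored to cannot be scheduled ahead of that store after fusion, while accesses to different buffers stay unordered.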

tancheng · 2026-01-14T04:40:27Z

task

Bufferization is performed between task and hyperblock IR? I saw affine becomes memref.
hyperblock is region but it doesn't have yield?
Why is it called hyperblock? It sounds like just block.

ShangkunLi · 2026-01-14T04:52:10Z

Bufferization is performed between task and hyperblock IR? I saw affine becomes memref.

When we extract affine.load/store/if operations into a hyperblock, we cannot preserve the affine map relationships inside the hyperblock region. So I use

// Creates a taskflow.hyperblock operation from HyperblockInfo.
static TaskflowHyperblockOp createHyperblock(
    OpBuilder &builder, Location loc, const HyperblockInfo &info,
    Block *task_body,
    const DenseMap<affine::AffineForOp, LoopInfo *> &loop_info_map) {
  ...
  ...
  MLIRContext *context = hyperblock_op.getContext();
  RewritePatternSet patterns(context);

  populateAffineToStdConversionPatterns(patterns);
  ConversionTarget target(*context);
  target.addLegalDialect<arith::ArithDialect, memref::MemRefDialect,
                         func::FuncDialect, taskflow::TaskflowDialect>();
  target.addIllegalOp<affine::AffineLoadOp, affine::AffineStoreOp,
                      affine::AffineIfOp>();
  if (failed(applyPartialConversion(hyperblock_op, target, std::move(patterns)))) {
    assert(false && "Affine to Standard conversion failed.");
  }

  return hyperblock_op;
}

to convert the affine.load/store/if operations into standard operations (e.g., scf.if, memref.load/store).

hyperblock is region but it doesn't have yield?

I use the trait SingleBlockImplicitTerminator<"TaskflowHyperblockYieldOp"> on taskflow.hyperblock. This means that if the hyperblock has an output, taskflow.hyperblock.yield appears explicitly; otherwise it is implicit.

Why is it called hyperblock? It sounds like just block.

Because we can also wrap the innermost loop (including its loop control) into the hyperblock, so we can still handle loop control without counters. Furthermore, if a loop body contains if-else conditions, we need to lower them to a CDFG-based IR to build the DFG.

ShangkunLi added 11 commits January 9, 2026 20:07

add taskflow.emit op

5e3d07c

add more general affine tests

fb82c0e

remove debug output

3924f2e

modify the op definition

0300cc6

enable affine-to-taskflow conversion

a14a8b8

introduce intra-task operation definitions

d858e3f

change the format of task & yield op

03fa147

enable counter chain building

406e549

enable hyperblock construction

5bd9732

remove debug output

e45a4a2

add tests

31b5331

ShangkunLi requested review from guosran and tancheng January 13, 2026 13:40

tancheng approved these changes Jan 14, 2026

ShangkunLi merged commit ff5a1e8 into coredac:main Jan 14, 2026
1 check passed

ShangkunLi merged 11 commits into coredac:main from ShangkunLi:taskflow-to-hyperblock
