Add TaskDivisibilityAnalysisPass to Taskflow dialect #308
ShangkunLi wants to merge 8 commits into coredac:main
Conversation
namespace attr {
// Attribute keys on taskflow.task operations produced by the
// TaskDivisibilityAnalysisPass.
constexpr llvm::StringLiteral kDivisibilityInfo = "divisibility_info";
constexpr llvm::StringLiteral kDivisibility = "divisibility";
Can these be named as parallelism = parallel/atomic?
I think the parallelism is a superset of divisibility. For example, a loop can be tiled (data-level parallelism) or unrolled (instruction-level parallelism) for higher parallelism. But we only consider whether a loop can be tiled for higher (spatial) data-level parallelism here.
So for this pass, it would be more precise to keep divisibility. WDYT?
How would parallel tasks be mapped/lowered? i.e., how would we leverage this characteristic?
This parallel information is used for runtime configuration duplication, which enables data-level parallelism. For example:

%dependency_read_out, %dependency_write_out = taskflow.task @Task_0 dependency_read_in(%arg0 : memref<1x64x8x8xf32>) dependency_write_in(%alloc : memref<1x8x8x64xf32>) [original_read_memrefs(%arg0 : memref<1x64x8x8xf32>), original_write_memrefs(%alloc : memref<1x8x8x64xf32>)] {divisibility_info = {divisibility = "divisible", parallel_dims = array<i32: 1, 2, 3>, parallel_space = array<i32: 8, 8, 64>}} : (memref<1x64x8x8xf32>, memref<1x8x8x64xf32>) -> (memref<1x64x8x8xf32>, memref<1x8x8x64xf32>) {
^bb0(%arg1: memref<1x64x8x8xf32>, %arg2: memref<1x8x8x64xf32>):
affine.for %arg3 = 0 to 1 {
affine.for %arg4 = 0 to 8 {
affine.for %arg5 = 0 to 8 {
affine.for %arg6 = 0 to 64 {
%0 = affine.load %arg1[%arg3, %arg6, %arg4, %arg5] : memref<1x64x8x8xf32>
affine.store %0, %arg2[%arg3, %arg4, %arg5, %arg6] : memref<1x8x8x64xf32>
}
}
}
}
taskflow.yield reads(%arg1 : memref<1x64x8x8xf32>) writes(%arg2 : memref<1x8x8x64xf32>)
}

This task may be assigned to one CGRA at first, but it may become the bottleneck when the input tensor size changes. We can then duplicate its configuration to other CGRAs for parallelism.
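To make the duplication idea concrete, here is a minimal, MLIR-free C++ sketch of how a runtime might pick a duplication factor from a task's parallel_space, splitting from the outermost parallel dimension inward (as described below in the thread). The helper chooseDuplication and its greedy divisor search are hypothetical illustrations, not part of this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not in the PR): given the parallel_space of a
// divisible task and a budget of available CGRAs, decide how many copies
// of the task's configuration to create along each parallel dimension.
// We split outer dimensions first, and only by divisors of the trip
// count so every copy gets an equal share of the iteration space.
std::vector<int32_t> chooseDuplication(const std::vector<int32_t> &parallelSpace,
                                       int32_t numCgras) {
  std::vector<int32_t> copies(parallelSpace.size(), 1);
  int32_t remaining = numCgras;
  for (size_t i = 0; i < parallelSpace.size() && remaining > 1; ++i) {
    // Largest divisor of this dimension's trip count that fits the budget.
    for (int32_t d = std::min(parallelSpace[i], remaining); d >= 1; --d) {
      if (parallelSpace[i] % d == 0) {
        copies[i] = d;
        remaining /= d;
        break;
      }
    }
  }
  return copies;
}
```

For the task above (parallel_space = [8, 8, 64]) and 4 CGRAs, this sketch splits only the outermost parallel dimension into 4 copies; with 16 CGRAs it would split the outer dimension 8 ways and the next one 2 ways.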
Be aware that our RTL can also support vectorized operations. So I am wondering how a vectorization pass (tensor -> vector dialect) could be applied in our flow.
I think these two methods are orthogonal for data-level parallelism. This pass explores DLP at task granularity (the memory access pattern can be non-contiguous), while the vectorized method explores DLP at SIMD-instruction granularity (the memory access pattern must be contiguous).
Yes, we will duplicate from outer to inner. |
Introduces a new MLIR pass --task-divisibility-analysis that statically classifies each taskflow.task as either divisible (has exploitable parallel loops for data-level parallelism) or atomic (no parallel loops; must run as a single unit).

The pass walks the affine loop nest of each task, using affine::isLoopParallel() (excluding reduction-parallel loops) and affine::getConstantTripCount() to identify parallel dimensions and their trip counts. Results are attached to each task op as a divisibility_info DictionaryAttr containing three fields: divisibility (string), parallel_dims (i32 array of loop depth indices), and parallel_space (i32 array of corresponding trip counts).

Also adds TaskflowAttributes.h with constexpr StringLiteral constants for all attribute keys and values, following the same pattern as NeuraAttributes.h.
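The classification logic described above can be modeled with a small, self-contained C++ sketch that stands in for the MLIR machinery. LoopInfo mimics what affine::isLoopParallel() and affine::getConstantTripCount() would report for each loop in a task's nest; the filtering of unit-trip-count loops is an assumption inferred from the example attribute (parallel_dims = [1, 2, 3] skips the depth-0 loop with trip count 1), not a claim about the actual pass.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Simplified, MLIR-free model of the pass's per-task classification.
// One entry per loop in the task's affine nest, in nesting order.
struct LoopInfo {
  bool isParallel;                  // models affine::isLoopParallel()
  std::optional<int64_t> tripCount; // models affine::getConstantTripCount()
};

// Models the contents of the divisibility_info DictionaryAttr.
struct DivisibilityInfo {
  std::string divisibility;           // "divisible" or "atomic"
  std::vector<int32_t> parallelDims;  // loop depth indices
  std::vector<int32_t> parallelSpace; // corresponding trip counts
};

DivisibilityInfo analyzeTask(const std::vector<LoopInfo> &nest) {
  DivisibilityInfo info{"atomic", {}, {}};
  for (int32_t depth = 0; depth < (int32_t)nest.size(); ++depth) {
    const LoopInfo &loop = nest[depth];
    // A dimension is exploitable only if the loop is parallel and has a
    // known constant trip count; skipping trip count 1 is an assumption
    // based on the example output in this PR.
    if (loop.isParallel && loop.tripCount && *loop.tripCount > 1) {
      info.parallelDims.push_back(depth);
      info.parallelSpace.push_back((int32_t)*loop.tripCount);
    }
  }
  if (!info.parallelDims.empty())
    info.divisibility = "divisible";
  return info;
}
```

Running this model on the transpose task from the discussion (nest of fully parallel loops with trip counts 1, 8, 8, 64) reproduces the attribute shown there: divisibility = "divisible", parallel_dims = [1, 2, 3], parallel_space = [8, 8, 64].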