Add TaskDivisibilityAnalysisPass to Taskflow dialect #308

Open
ShangkunLi wants to merge 7 commits into coredac:main from ShangkunLi:task-cato
Conversation

@ShangkunLi
Collaborator

Introduces a new MLIR pass --task-divisibility-analysis that statically classifies each taskflow.task as either divisible (has exploitable parallel loops for data-level parallelism) or atomic (no parallel loops, must run as a single unit).

The pass walks the affine loop nest of each task, using affine::isLoopParallel() (excluding reduction-parallel loops) and affine::getConstantTripCount() to identify parallel dimensions and their trip counts. Results are attached to each task op as a divisibility_info DictionaryAttr containing three fields: divisibility (string), parallel_dims (i32 array of loop depth indices), and parallel_space (i32 array of corresponding trip counts).

Also adds TaskflowAttributes.h with constexpr StringLiteral constants for all attribute keys and values, following the same pattern as NeuraAttributes.h.
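Based on the description above, the core of the analysis can be modeled as a small classifier. This is a simplified sketch, not the pass itself: LoopInfo stands in for what affine::isLoopParallel() and affine::getConstantTripCount() would report for each loop, and the trip-count > 1 filter is an assumption inferred from the example later in this thread (the size-1 outer loop does not appear in parallel_dims).

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Simplified model of one loop in a task's affine nest: whether
// affine::isLoopParallel() would report it parallel (excluding
// reduction-parallel loops) and its constant trip count.
struct LoopInfo {
  bool is_parallel;
  int64_t trip_count;
};

// Mirrors the divisibility_info DictionaryAttr attached to each task op.
struct DivisibilityInfo {
  std::string divisibility;             // "divisible" or "atomic"
  std::vector<int32_t> parallel_dims;   // loop depth indices
  std::vector<int32_t> parallel_space;  // corresponding trip counts
};

// Classify a task from its loop nest, outermost loop first.
DivisibilityInfo classifyTask(const std::vector<LoopInfo> &nest) {
  DivisibilityInfo info;
  for (int32_t depth = 0; depth < (int32_t)nest.size(); ++depth) {
    // Assumption: size-1 loops are not exploitable, so they are skipped.
    if (nest[depth].is_parallel && nest[depth].trip_count > 1) {
      info.parallel_dims.push_back(depth);
      info.parallel_space.push_back((int32_t)nest[depth].trip_count);
    }
  }
  info.divisibility = info.parallel_dims.empty() ? "atomic" : "divisible";
  return info;
}
```

Feeding in the 1x8x8x64 transpose nest from the example below reproduces its attribute: parallel_dims = [1, 2, 3] and parallel_space = [8, 8, 64].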

@ShangkunLi ShangkunLi requested review from guosran and tancheng March 31, 2026 13:45
@ShangkunLi ShangkunLi self-assigned this Mar 31, 2026
@ShangkunLi ShangkunLi added the new feature New feature or request label Mar 31, 2026
namespace attr {
// Attribute keys on taskflow.task operations produced by the
// TaskDivisibilityAnalysisPass.
constexpr llvm::StringLiteral kDivisibilityInfo = "divisibility_info";
Contributor

Call this task_info?

// Attribute keys on taskflow.task operations produced by the
// TaskDivisibilityAnalysisPass.
constexpr llvm::StringLiteral kDivisibilityInfo = "divisibility_info";
constexpr llvm::StringLiteral kDivisibility = "divisibility";
Contributor

Can these be named as parallelism = parallel/atomic?

@tancheng
Contributor

How would parallel tasks be mapped/lowered? i.e., how would we leverage this characteristic?

@ShangkunLi
Collaborator Author

How would parallel tasks be mapped/lowered? i.e., how would we leverage this characteristic?

This parallel information enables runtime configuration duplication for data-level parallelism. For example:

%dependency_read_out, %dependency_write_out = taskflow.task @Task_0 dependency_read_in(%arg0 : memref<1x64x8x8xf32>) dependency_write_in(%alloc : memref<1x8x8x64xf32>) [original_read_memrefs(%arg0 : memref<1x64x8x8xf32>), original_write_memrefs(%alloc : memref<1x8x8x64xf32>)] {divisibility_info = {divisibility = "divisible", parallel_dims = array<i32: 1, 2, 3>, parallel_space = array<i32: 8, 8, 64>}} : (memref<1x64x8x8xf32>, memref<1x8x8x64xf32>) -> (memref<1x64x8x8xf32>, memref<1x8x8x64xf32>) {
    ^bb0(%arg1: memref<1x64x8x8xf32>, %arg2: memref<1x8x8x64xf32>):
      affine.for %arg3 = 0 to 1 {
        affine.for %arg4 = 0 to 8 {
          affine.for %arg5 = 0 to 8 {
            affine.for %arg6 = 0 to 64 {
              %0 = affine.load %arg1[%arg3, %arg6, %arg4, %arg5] : memref<1x64x8x8xf32>
              affine.store %0, %arg2[%arg3, %arg4, %arg5, %arg6] : memref<1x8x8x64xf32>
            }
          }
        }
      }
      taskflow.yield reads(%arg1 : memref<1x64x8x8xf32>) writes(%arg2 : memref<1x8x8x64xf32>)
    }

This task may initially be assigned to one CGRA, but it can become the bottleneck when the input tensor size changes. We can then duplicate its configuration to other CGRAs for parallelism. The parallel_dims and parallel_space define how we configure the loop controller for fast runtime configuration duplication.
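One way the attribute could drive duplication, as a hypothetical sketch (splitParallelDim is an invented helper, not part of this PR): choose one entry of parallel_space and partition its trip count into disjoint per-CGRA iteration ranges, so each duplicated configuration's loop controller covers its own slice of the iteration space.

```cpp
#include <cassert>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: split one parallel dimension's trip count into
// per-CGRA [begin, end) iteration ranges. Each duplicated configuration
// would program its loop controller with one of these disjoint slices.
std::vector<std::pair<int64_t, int64_t>>
splitParallelDim(int64_t trip_count, int64_t num_cgras) {
  std::vector<std::pair<int64_t, int64_t>> ranges;
  int64_t base = trip_count / num_cgras;
  int64_t rem = trip_count % num_cgras;
  int64_t begin = 0;
  for (int64_t i = 0; i < num_cgras; ++i) {
    // Spread the remainder over the first `rem` CGRAs so slice
    // sizes differ by at most one iteration.
    int64_t len = base + (i < rem ? 1 : 0);
    ranges.emplace_back(begin, begin + len);
    begin += len;
  }
  return ranges;
}
```

For the example above, splitting the depth-3 dimension (trip count 64 from parallel_space) across 4 CGRAs yields the slices [0,16), [16,32), [32,48), [48,64).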

@tancheng
Contributor

tancheng commented Apr 1, 2026

Be aware that our RTL can also support vectorized operations. So I am wondering how a vectorization pass (tensor -> vector dialect) could be applied in our flow.

@tancheng
Contributor

tancheng commented Apr 1, 2026

Of the 8, 8, 64, which dimension will be duplicated? Should it go from outer to inner?

Labels

new feature New feature or request

2 participants