Add TaskDivisibilityAnalysisPass to Taskflow dialect #308

ShangkunLi wants to merge 7 commits into coredac:main from
Conversation
```cpp
namespace attr {
// Attribute keys on taskflow.task operations produced by the
// TaskDivisibilityAnalysisPass.
constexpr llvm::StringLiteral kDivisibilityInfo = "divisibility_info";
constexpr llvm::StringLiteral kDivisibility = "divisibility";
```
Can these be named as `parallelism = parallel/atomic`?
How would parallel tasks be mapped/lowered? I.e., how would we leverage this characteristic?
This parallel information is used for runtime configuration duplication, which enables data-level parallelism. For example:

```mlir
%dependency_read_out, %dependency_write_out = taskflow.task @Task_0 dependency_read_in(%arg0 : memref<1x64x8x8xf32>) dependency_write_in(%alloc : memref<1x8x8x64xf32>) [original_read_memrefs(%arg0 : memref<1x64x8x8xf32>), original_write_memrefs(%alloc : memref<1x8x8x64xf32>)] {divisibility_info = {divisibility = "divisible", parallel_dims = array<i32: 1, 2, 3>, parallel_space = array<i32: 8, 8, 64>}} : (memref<1x64x8x8xf32>, memref<1x8x8x64xf32>) -> (memref<1x64x8x8xf32>, memref<1x8x8x64xf32>) {
^bb0(%arg1: memref<1x64x8x8xf32>, %arg2: memref<1x8x8x64xf32>):
  affine.for %arg3 = 0 to 1 {
    affine.for %arg4 = 0 to 8 {
      affine.for %arg5 = 0 to 8 {
        affine.for %arg6 = 0 to 64 {
          %0 = affine.load %arg1[%arg3, %arg6, %arg4, %arg5] : memref<1x64x8x8xf32>
          affine.store %0, %arg2[%arg3, %arg4, %arg5, %arg6] : memref<1x8x8x64xf32>
        }
      }
    }
  }
  taskflow.yield reads(%arg1 : memref<1x64x8x8xf32>) writes(%arg2 : memref<1x8x8x64xf32>)
}
```

This task may be assigned to one CGRA at first, but it may become the bottleneck when the input tensor size changes. We can then duplicate its configuration to other CGRAs for parallelism.
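The duplication idea above can be illustrated with a small, self-contained sketch. The even-split policy and the `splitDim` helper are illustrative assumptions, not part of this PR: a `divisible` task's `parallel_space` entry (e.g. the depth-3 loop with trip count 64) is partitioned into per-CGRA iteration slices, one duplicated configuration per slice.

```cpp
#include <cstdint>
#include <vector>

// One duplicated configuration: a half-open [begin, end) slice of a
// parallel dimension assigned to a single CGRA.
struct Slice {
  int64_t begin;
  int64_t end;
};

// Split one parallel dimension across `numCgras` CGRAs as evenly as
// possible; the first (tripCount % numCgras) slices get one extra point.
std::vector<Slice> splitDim(int64_t tripCount, int64_t numCgras) {
  std::vector<Slice> slices;
  int64_t base = tripCount / numCgras;
  int64_t rem = tripCount % numCgras;
  int64_t begin = 0;
  for (int64_t i = 0; i < numCgras; ++i) {
    int64_t len = base + (i < rem ? 1 : 0);
    if (len == 0)
      break;  // more CGRAs than iteration points: stop duplicating
    slices.push_back({begin, begin + len});
    begin += len;
  }
  return slices;
}
```

For the depth-3 dimension of the example task, `splitDim(64, 4)` yields four slices of 16 iterations each, one per duplicated CGRA configuration.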
Be aware that our RTL can also support vectorized operations. So I am wondering how a vectorization pass (tensor -> vector dialect) can be applied in our flow.
Introduces a new MLIR pass `--task-divisibility-analysis` that statically classifies each `taskflow.task` as either `divisible` (has exploitable parallel loops for data-level parallelism) or `atomic` (no parallel loops, must run as a single unit).

The pass walks the affine loop nest of each task, using
`affine::isLoopParallel()` (excluding reduction-parallel loops) and `affine::getConstantTripCount()` to identify parallel dimensions and their trip counts. Results are attached to each task op as a `divisibility_info` `DictionaryAttr` containing three fields: `divisibility` (string), `parallel_dims` (i32 array of loop depth indices), and `parallel_space` (i32 array of corresponding trip counts).

Also adds
`TaskflowAttributes.h` with constexpr `StringLiteral` constants for all attribute keys and values, following the same pattern as `NeuraAttributes.h`.
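The classification rule can be sketched on a toy loop-nest model. This is a minimal, self-contained sketch of the logic, not the pass's actual code: `LoopInfo` stands in for what `affine::isLoopParallel()` and `affine::getConstantTripCount()` would report per loop, and `classifyTask` mirrors the `divisibility_info` fields.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Toy stand-in for one loop in a task's affine loop nest.
struct LoopInfo {
  bool isParallel;    // what affine::isLoopParallel() would report
  int64_t tripCount;  // what affine::getConstantTripCount() would return
};

// Mirror of the divisibility_info attribute the pass attaches.
struct DivisibilityInfo {
  std::string divisibility;            // "divisible" or "atomic"
  std::vector<int32_t> parallelDims;   // loop depth indices
  std::vector<int32_t> parallelSpace;  // corresponding trip counts
};

DivisibilityInfo classifyTask(const std::vector<LoopInfo> &nest) {
  DivisibilityInfo info;
  for (int32_t depth = 0; depth < (int32_t)nest.size(); ++depth) {
    if (nest[depth].isParallel) {
      info.parallelDims.push_back(depth);
      info.parallelSpace.push_back((int32_t)nest[depth].tripCount);
    }
  }
  // A task with no parallel loops must run as a single unit.
  info.divisibility = info.parallelDims.empty() ? "atomic" : "divisible";
  return info;
}
```

On the example task's 1x8x8x64 nest (treating the trip-count-1 depth-0 loop as non-exploitable, matching the attribute in the discussion above), this yields `divisibility = "divisible"`, `parallel_dims = [1, 2, 3]`, and `parallel_space = [8, 8, 64]`.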