Add TaskDivisibilityAnalysisPass to Taskflow dialect #308
ShangkunLi wants to merge 8 commits into coredac:main
Conversation
namespace attr {
// Attribute keys on taskflow.task operations produced by the
// TaskDivisibilityAnalysisPass.
constexpr llvm::StringLiteral kDivisibilityInfo = "divisibility_info";
constexpr llvm::StringLiteral kDivisibility = "divisibility";
Can these be named as parallelism = parallel/atomic?
I think the parallelism is a superset of divisibility. For example, a loop can be tiled (data-level parallelism) or unrolled (instruction-level parallelism) for higher parallelism. But we only consider whether a loop can be tiled for higher (spatial) data-level parallelism here.
So for this pass, it would be more precise to keep divisibility. WDYT?
How would parallel tasks be mapped/lowered? i.e., how would we leverage this characteristic?
This parallel information is used for runtime configuration duplication, which enables data-level parallelism. For example:

%dependency_read_out, %dependency_write_out = taskflow.task @Task_0 dependency_read_in(%arg0 : memref<1x64x8x8xf32>) dependency_write_in(%alloc : memref<1x8x8x64xf32>) [original_read_memrefs(%arg0 : memref<1x64x8x8xf32>), original_write_memrefs(%alloc : memref<1x8x8x64xf32>)] {divisibility_info = {divisibility = "divisible", parallel_dims = array<i32: 1, 2, 3>, parallel_space = array<i32: 8, 8, 64>}} : (memref<1x64x8x8xf32>, memref<1x8x8x64xf32>) -> (memref<1x64x8x8xf32>, memref<1x8x8x64xf32>) {
^bb0(%arg1: memref<1x64x8x8xf32>, %arg2: memref<1x8x8x64xf32>):
affine.for %arg3 = 0 to 1 {
affine.for %arg4 = 0 to 8 {
affine.for %arg5 = 0 to 8 {
affine.for %arg6 = 0 to 64 {
%0 = affine.load %arg1[%arg3, %arg6, %arg4, %arg5] : memref<1x64x8x8xf32>
affine.store %0, %arg2[%arg3, %arg4, %arg5, %arg6] : memref<1x8x8x64xf32>
}
}
}
}
taskflow.yield reads(%arg1 : memref<1x64x8x8xf32>) writes(%arg2 : memref<1x8x8x64xf32>)
}

This task may be assigned to one CGRA at first, but it may become the bottleneck when the input tensor size changes. We can then duplicate its configuration to other CGRAs for parallelism.
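To make the duplication idea concrete, here is a minimal, MLIR-free C++ sketch of how a runtime might pick a duplication factor from a task's parallel_space, splitting from the outermost parallel dimension inward (as described below in the thread). The helper chooseDuplication and its greedy divisor search are hypothetical illustrations, not part of this PR.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical helper (not in the PR): given the parallel_space of a
// divisible task and a budget of available CGRAs, decide how many copies
// of the task's configuration to create along each parallel dimension.
// We split outer dimensions first, and only by divisors of the trip
// count so every copy gets an equal share of the iteration space.
std::vector<int32_t> chooseDuplication(const std::vector<int32_t> &parallelSpace,
                                       int32_t numCgras) {
  std::vector<int32_t> copies(parallelSpace.size(), 1);
  int32_t remaining = numCgras;
  for (size_t i = 0; i < parallelSpace.size() && remaining > 1; ++i) {
    // Largest divisor of this dimension's trip count that fits the budget.
    for (int32_t d = std::min(parallelSpace[i], remaining); d >= 1; --d) {
      if (parallelSpace[i] % d == 0) {
        copies[i] = d;
        remaining /= d;
        break;
      }
    }
  }
  return copies;
}
```

For the task above (parallel_space = [8, 8, 64]) and 4 CGRAs, this sketch splits only the outermost parallel dimension into 4 copies; with 16 CGRAs it would split the outer dimension 8 ways and the next one 2 ways.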
Be aware that our RTL can also support vectorized operations. So I am wondering how a vectorization pass (tensor -> vector dialect) could be applied in our flow.
I think these two methods are orthogonal for data-level parallelism. This pass explores DLP at task granularity (the memory access pattern can be non-contiguous), while the vectorized method explores DLP at SIMD-instruction granularity (the memory access pattern must be contiguous).
Yes, we will duplicate from outer to inner. |
Introduces a new MLIR pass --task-divisibility-analysis that statically classifies each taskflow.task as either divisible (has exploitable parallel loops for data-level parallelism) or atomic (no parallel loops; must run as a single unit).

The pass walks the affine loop nest of each task, using affine::isLoopParallel() (excluding reduction-parallel loops) and affine::getConstantTripCount() to identify parallel dimensions and their trip counts. Results are attached to each task op as a divisibility_info DictionaryAttr containing three fields: divisibility (string), parallel_dims (i32 array of loop depth indices), and parallel_space (i32 array of corresponding trip counts).

Also adds TaskflowAttributes.h with constexpr StringLiteral constants for all attribute keys and values, following the same pattern as NeuraAttributes.h.
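The classification logic described above can be modeled with a small, self-contained C++ sketch that stands in for the MLIR machinery. LoopInfo mimics what affine::isLoopParallel() and affine::getConstantTripCount() would report for each loop in a task's nest; the filtering of unit-trip-count loops is an assumption inferred from the example attribute (parallel_dims = [1, 2, 3] skips the depth-0 loop with trip count 1), not a claim about the actual pass.

```cpp
#include <cassert>
#include <cstdint>
#include <optional>
#include <string>
#include <vector>

// Simplified, MLIR-free model of the pass's per-task classification.
// One entry per loop in the task's affine nest, in nesting order.
struct LoopInfo {
  bool isParallel;                  // models affine::isLoopParallel()
  std::optional<int64_t> tripCount; // models affine::getConstantTripCount()
};

// Models the contents of the divisibility_info DictionaryAttr.
struct DivisibilityInfo {
  std::string divisibility;           // "divisible" or "atomic"
  std::vector<int32_t> parallelDims;  // loop depth indices
  std::vector<int32_t> parallelSpace; // corresponding trip counts
};

DivisibilityInfo analyzeTask(const std::vector<LoopInfo> &nest) {
  DivisibilityInfo info{"atomic", {}, {}};
  for (int32_t depth = 0; depth < (int32_t)nest.size(); ++depth) {
    const LoopInfo &loop = nest[depth];
    // A dimension is exploitable only if the loop is parallel and has a
    // known constant trip count; skipping trip count 1 is an assumption
    // based on the example output in this PR.
    if (loop.isParallel && loop.tripCount && *loop.tripCount > 1) {
      info.parallelDims.push_back(depth);
      info.parallelSpace.push_back((int32_t)*loop.tripCount);
    }
  }
  if (!info.parallelDims.empty())
    info.divisibility = "divisible";
  return info;
}
```

Running this model on the transpose task from the discussion (nest of fully parallel loops with trip counts 1, 8, 8, 64) reproduces the attribute shown there: divisibility = "divisible", parallel_dims = [1, 2, 3], parallel_space = [8, 8, 64].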