[vectorization] Apply tiling only if element types vectorizable #476
Conversation
LGTM. Any idea what is failing?
I've been trying to understand the failure for 30 minutes with no success; it's very mysterious to me.
If this is supported in llvm-aie (peano) Xilinx/llvm-aie#102 then I think this path will work.
I think we should still look into this from our side: why do we have <2 x s32> when we were expecting to make scalar code? Do we know where that happens? If not, @newling could you please produce the IR dump with the standard
I think it's because linalg-to-loops only scalarizes the reduction dimensions of the packed linalg.generic. So insert-loops-for-vectorization, before this PR, was scalarizing (i.e. tiling to size 1) the outer m- and n-dimensions. But with this PR, that does not happen. See below (left is before this PR, right is after this PR).
I take that back: it is being completely scalarized, both before and after. It's just that before, there is an intermediate memref.subview.
Just to summarize my view of the situation:
82facf6 to 0fda88f
@@ -654,13 +669,6 @@ run_matmul_test \
     --acc_type "f32" \
     --m "128" --n "128" --k "2304" \

-run_matmul_test \
-    --name_prefix "packPeel_t_i32" \
-    --pipeline "pack-peel" \
I've removed this test, as there is a test with the same shapes for bf16.
build_tools/ci/run_matmul_test.sh
Outdated
# Note I'm not using the --expect_compile_failure flag here,
# as that would require all developers to use the same verion
# of peano, which we currently don't enforce.
I don't quite understand this statement.
Nit typo: verion -> version
If we use the --expect_compile_failure=1 trick here, then after peano has fixed the issue we'll need to:
- bump peano past the fix on CI, and for all developers
- change to --expect_compile_failure=0
at the same time. But we don't control the version of peano used from within iree-amd-aie (like we do for iree and third-party repos), so this is basically impossible.
(....I think I'll just remove the comment!)
 run_matmul_test \
-    --name_prefix "transpose_int32" \
-    --lhs_rhs_type "i32" \
+    --name_prefix "transpose_i8_i32" \
Is this test successful? Is it only the case where both the input and output types are i32 that failed?
Yup. i8 -> i32 is vectorized.
Vectorization in iree-amd-aie consists of two passes:
1. Tile linalg.generic ops in all leading dimensions, so that batched matmuls, and matmuls which have been packed into higher dimensions with multiple reduction dimensions, get replaced by unbatched 'atomic' matmuls inside scf.for loops.
2. Lower the 'atomic' matmuls to the vector dialect (vector.contract and other vector dialect casts/copies).
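Schematically, the target of pass 2 looks something like the following. This is an illustrative sketch only: the shapes, maps, and element types here are made-up examples, not taken from the actual pass output.

```mlir
// Illustrative sketch: an "atomic" matmul (leading dims already
// tiled away by pass 1) expressed as a vector.contract.
// Shapes and element types are invented for this example.
#map_a = affine_map<(m, n, k) -> (m, k)>
#map_b = affine_map<(m, n, k) -> (k, n)>
#map_c = affine_map<(m, n, k) -> (m, n)>
%acc = vector.contract {
         indexing_maps = [#map_a, #map_b, #map_c],
         iterator_types = ["parallel", "parallel", "reduction"],
         kind = #vector.kind<add>
       } %lhs, %rhs, %init
       : vector<4x8xbf16>, vector<8x4xbf16> into vector<4x4xf32>
```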
Before this PR, pass 1 applied to all linalg.generics, irrespective of their element types. However, pass 2 only applies to linalg.generics whose operand element types have hardware support for vectorization on AIE. linalg.generics with other types, like matmuls with i32 operands and result, are not converted to the vector dialect; they are later unrolled by a pass (linalg-to-loops, or similar).
So before this PR, pass 1 might transform linalg.generics in preparation for vectorization even though pass 2 never vectorizes them. This is not incorrect, but it changes the IR unnecessarily. More importantly, there has been a request to make pass 1 more conservative and leave unvectorizable ops alone, to make a later object-fifo-related pass easier (@yzhang93 @Abhishek-Varma).
So that's what this PR does: it makes the loop tiling apply only when the element types are vectorizable for AIE.
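The gating can be modeled roughly as follows. This is an illustrative sketch, not the actual iree-amd-aie code: the set of supported type combinations and the function name are assumptions based only on the discussion above (i8 -> i32 and bf16 -> f32 vectorize; i32 -> i32 does not).

```python
# Hypothetical model of the check this PR adds: pass 1 (tiling) should
# run only if pass 2 (lowering to vector.contract) would actually
# vectorize these element types on AIE.
# The entries below are assumptions, not an exhaustive list.
AIE_VECTORIZABLE = {
    ("i8", "i8", "i32"),      # per the discussion: i8 -> i32 vectorizes
    ("bf16", "bf16", "f32"),  # bf16 matmuls vectorize
}

def should_tile_for_vectorization(lhs: str, rhs: str, acc: str) -> bool:
    """Return True if the (lhs, rhs, acc) element types are
    vectorizable on AIE, so the tiling pass should apply."""
    return (lhs, rhs, acc) in AIE_VECTORIZABLE

print(should_tile_for_vectorization("i8", "i8", "i32"))    # True
print(should_tile_for_vectorization("i32", "i32", "i32"))  # False
```

With this predicate in place, an i32 x i32 -> i32 matmul is left untouched by the tiling pass and falls through to the scalar (linalg-to-loops) path instead.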