
[vectorization] Apply tiling only if element types vectorizable #476

Merged
merged 4 commits into nod-ai:main on Jun 28, 2024

Conversation

Contributor

@newling newling commented Jun 27, 2024

Vectorization in iree-amd-aie consists of 2 passes:

  1. tile linalg.generic ops in all leading dimensions, so that batched matmuls, and matmuls which have been packed into higher dimensions with multiple reduction dimensions, are replaced by unbatched 'atomic' matmuls inside scf.for loops (a sketch of this follows the list)

  2. lower the 'atomic' matmuls to the vector dialect (vector.contract, plus vector dialect casts and copies).
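For concreteness, here is a minimal sketch of pass 1's effect. The shapes and function names are hypothetical illustrations, not the pass's literal output:

```mlir
// Before: a batched matmul; the leading (batch) dimension is a
// candidate for the tiling in pass 1.
func.func @before(%a: memref<4x32x16xbf16>, %b: memref<4x16x32xbf16>,
                  %c: memref<4x32x32xf32>) {
  linalg.batch_matmul
      ins(%a, %b : memref<4x32x16xbf16>, memref<4x16x32xbf16>)
      outs(%c : memref<4x32x32xf32>)
  return
}

// After: the batch dimension is tiled to size 1, leaving an 'atomic'
// matmul inside an scf.for loop, ready for pass 2.
func.func @after(%a: memref<4x32x16xbf16>, %b: memref<4x16x32xbf16>,
                 %c: memref<4x32x32xf32>) {
  %c0 = arith.constant 0 : index
  %c1 = arith.constant 1 : index
  %c4 = arith.constant 4 : index
  scf.for %i = %c0 to %c4 step %c1 {
    %sa = memref.subview %a[%i, 0, 0] [1, 32, 16] [1, 1, 1]
        : memref<4x32x16xbf16> to memref<32x16xbf16, strided<[16, 1], offset: ?>>
    %sb = memref.subview %b[%i, 0, 0] [1, 16, 32] [1, 1, 1]
        : memref<4x16x32xbf16> to memref<16x32xbf16, strided<[32, 1], offset: ?>>
    %sc = memref.subview %c[%i, 0, 0] [1, 32, 32] [1, 1, 1]
        : memref<4x32x32xf32> to memref<32x32xf32, strided<[32, 1], offset: ?>>
    linalg.matmul
        ins(%sa, %sb : memref<32x16xbf16, strided<[16, 1], offset: ?>>,
                       memref<16x32xbf16, strided<[32, 1], offset: ?>>)
        outs(%sc : memref<32x32xf32, strided<[32, 1], offset: ?>>)
  }
  return
}
```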

Before this PR, pass 1 applied to all linalg.generics, irrespective of their element types. However, pass 2 only applies to linalg.generics whose operand element types have hardware support for vectorization on AIE. linalg.generics with other types, such as matmuls with i32 operands and result, are not converted to the vector dialect; they are later unrolled by the linalg-to-loops pass.
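For a vectorizable element type like bf16, the kind of IR pass 2 produces looks roughly like the following (again a hand-written sketch with hypothetical shapes, not actual pass output); for i32 operands no such rewrite happens and the op is left for linalg-to-loops to unroll:

```mlir
#lhs = affine_map<(m, n, k) -> (m, k)>
#rhs = affine_map<(m, n, k) -> (k, n)>
#out = affine_map<(m, n, k) -> (m, n)>

func.func @atomic_matmul_vectorized(%a: memref<32x16xbf16>,
                                    %b: memref<16x32xbf16>,
                                    %c: memref<32x32xf32>) {
  %c0 = arith.constant 0 : index
  %pad = arith.constant 0.0 : bf16
  %fpad = arith.constant 0.0 : f32
  // Read the operands and the accumulator into vectors.
  %va = vector.transfer_read %a[%c0, %c0], %pad {in_bounds = [true, true]}
      : memref<32x16xbf16>, vector<32x16xbf16>
  %vb = vector.transfer_read %b[%c0, %c0], %pad {in_bounds = [true, true]}
      : memref<16x32xbf16>, vector<16x32xbf16>
  %vc = vector.transfer_read %c[%c0, %c0], %fpad {in_bounds = [true, true]}
      : memref<32x32xf32>, vector<32x32xf32>
  // Promote the bf16 operands, then contract into the f32 accumulator.
  %wa = arith.extf %va : vector<32x16xbf16> to vector<32x16xf32>
  %wb = arith.extf %vb : vector<16x32xbf16> to vector<16x32xf32>
  %r = vector.contract {indexing_maps = [#lhs, #rhs, #out],
                        iterator_types = ["parallel", "parallel", "reduction"],
                        kind = #vector.kind<add>}
      %wa, %wb, %vc : vector<32x16xf32>, vector<16x32xf32> into vector<32x32xf32>
  vector.transfer_write %r, %c[%c0, %c0] {in_bounds = [true, true]}
      : vector<32x32xf32>, memref<32x32xf32>
  return
}
```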

So pass 1, before this PR, might transform linalg.generics in preparation for vectorization even though pass 2 then leaves them unvectorized. This is not incorrect, but it changes the IR unnecessarily. More importantly, there has been a request to make pass 1 more conservative and leave unvectorizable ops alone, to make a later objectFifo-related pass easier (@yzhang93 @Abhishek-Varma).

That's what this PR does: it makes the loop tiling apply only when the element types are vectorizable for AIE.

@newling newling changed the title from "Apply tiling only if vectorization likely" to "[vectorization] Apply tiling only if element types vectorizable" on Jun 27, 2024
Contributor

@yzhang93 yzhang93 left a comment


LGTM. Any idea what is failing?

Contributor Author

newling commented Jun 27, 2024

> LGTM. Any idea what is failing?

I've been trying to understand the failure for 30 minutes with no success; it's very mysterious to me.

Contributor Author

newling commented Jun 28, 2024


If this is supported in llvm-aie (peano, see Xilinx/llvm-aie#102), then I think this path will work.

Contributor

nirvedhmeshram commented Jun 28, 2024

> If this is supported in llvm-aie (peano, see Xilinx/llvm-aie#102), then I think this path will work.

I think we should still look into this from our side: why do we have <2 x s32> when we were expecting to generate scalar code? Do we know where that happens? If not, @newling, could you please produce the IR dump with the standard *-dump-ir-after-all flags? We may also need to pass the --print-after-all flag to opt over here, if it's not our MLIR lowering doing it.

Contributor Author

newling commented Jun 28, 2024

> Why do we have <2 x s32> when we were expecting to generate scalar code? Do we know where that happens?

I think it's because linalg-to-loops only scalarizes the reduction dimensions of the packed linalg.generic.

So insert-loops-for-vectorization, before this PR, was also scalarizing (i.e. tiling to size 1) the outer m and n dimensions. With this PR, that no longer happens. See below (left is before this PR, right is after this PR).

[image: IR dump comparison, before this PR (left) vs after this PR (right)]

Contributor Author

newling commented Jun 28, 2024


I take that back: it is being completely scalarized both before and after this PR. The only difference is that before, there is an intermediate memref.subview.
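To illustrate the difference, a sketch of the two load patterns with hypothetical shapes (not the actual IR from the dumps above):

```mlir
func.func @load_styles(%a: memref<32x16xi32>, %m: index, %k: index) -> (i32, i32) {
  %c0 = arith.constant 0 : index
  // Before this PR (sketch): the extra tiling left a unit-sized subview,
  // and the scalar load went through it.
  %sv = memref.subview %a[%m, %k] [1, 1] [1, 1]
      : memref<32x16xi32> to memref<1x1xi32, strided<[16, 1], offset: ?>>
  %v0 = memref.load %sv[%c0, %c0] : memref<1x1xi32, strided<[16, 1], offset: ?>>
  // After this PR (sketch): the same element is loaded directly.
  %v1 = memref.load %a[%m, %k] : memref<32x16xi32>
  return %v0, %v1 : i32, i32
}
```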

Contributor Author

newling commented Jun 28, 2024

One more data point: this issue only arises with matmul transpose. Transpose vs non-transpose below:

[image: IR dump comparison, matmul transpose vs plain matmul]

(The peano error arises in the matmul-transpose case, not the matmul case).
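For reference, the transpose case differs from the plain case only in the indexing map of the transposed operand. These are the standard matmul maps, written here from memory rather than copied from the dumps above, and whether A or B is the transposed operand depends on the test:

```mlir
// Plain matmul: C[m, n] += A[m, k] * B[k, n]
#lhs = affine_map<(m, n, k) -> (m, k)>
#rhs = affine_map<(m, n, k) -> (k, n)>
#out = affine_map<(m, n, k) -> (m, n)>

// Matmul with B transposed: C[m, n] += A[m, k] * B[n, k]
#rhs_t = affine_map<(m, n, k) -> (n, k)>
```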

Contributor Author

newling commented Jun 28, 2024

Just to summarize my view of the situation:

  • This is all for matmul with i32, which is not our priority (and shouldn't be a priority for peano either)
  • matmul with i32 should not require any vectorization-specific passes to compile end-to-end
  • Disabling --insert-loops-for-vectorization for i32 makes peano compilation fail
  • So by pure luck, the pass --insert-loops-for-vectorization puts the IR (matmul transpose) in a state which peano handles
  • With or without --insert-loops-for-vectorization, the IR is correctly scalarized by --linalg-to-loops

@newling newling force-pushed the only_tile_if_eltype_vectorizable branch from 82facf6 to 0fda88f on June 28, 2024 at 19:33
@@ -654,13 +669,6 @@ run_matmul_test \
   --acc_type "f32" \
   --m "128" --n "128" --k "2304" \

- run_matmul_test \
-   --name_prefix "packPeel_t_i32" \
-   --pipeline "pack-peel" \
Contributor Author


I've removed this test, as there is a test with the same shapes for bf16.

Comment on lines 510 to 512
# Note I'm not using the --expect_compile_failure flag here,
# as that would require all developers to use the same verion
# of peano, which we currently don't enforce.
Contributor


I don't quite understand this statement.
Nit typo: verion -> version

Contributor Author


If we use the --expect_compile_failure=1 trick here, then after peano has fixed the issue we'll need to do

  1. bump peano past the fix on CI, and for all developers
  2. change to --expect_compile_failure=0

at the same time. But we don't control the version of peano used from within iree-amd-aie (the way we do for iree and third-party repos), so this is basically impossible.

(....I think I'll just remove the comment!)

  run_matmul_test \
-   --name_prefix "transpose_int32" \
-   --lhs_rhs_type "i32" \
+   --name_prefix "transpose_i8_i32" \
Contributor


This test is successful? So it's only the case where the input and output types are both i32 that failed?

Contributor Author


Yup. i8 -> i32 is vectorized.

@newling newling merged commit c8d095b into nod-ai:main Jun 28, 2024
2 checks passed
@newling newling deleted the only_tile_if_eltype_vectorizable branch July 13, 2024 20:28