Code generation for packed math instructions #1369
Hi, as you've already discovered, this is due to the scalarization pass. Scalarization is still required in a lot of use cases: without it, register use goes up when subregisters of a vector are dead but the whole vector is not. Scalarizing is usually a benefit, as it leads to lower register use and, as a consequence, higher occupancy. The only way at the moment to get the packed instructions is to disable scalarization.
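For context, here is a minimal LLVM IR sketch of what scalarization does to a packed fma (hypothetical IR, not taken from the shader in question): the <2 x half> intrinsic is split into per-lane scalar calls, so the backend can no longer select v_pk_fma_f16.

```llvm
declare <2 x half> @llvm.fma.v2f16(<2 x half>, <2 x half>, <2 x half>)
declare half @llvm.fma.f16(half, half, half)

; Before scalarization: one packed intrinsic on <2 x half>.
define <2 x half> @fma_packed(<2 x half> %a, <2 x half> %b, <2 x half> %c) {
  %r = call <2 x half> @llvm.fma.v2f16(<2 x half> %a, <2 x half> %b, <2 x half> %c)
  ret <2 x half> %r
}

; After scalarization: each lane gets its own scalar intrinsic, so the
; backend selects v_fma_f16 per lane instead of one v_pk_fma_f16.
define <2 x half> @fma_scalarized(<2 x half> %a, <2 x half> %b, <2 x half> %c) {
  %a0 = extractelement <2 x half> %a, i64 0
  %b0 = extractelement <2 x half> %b, i64 0
  %c0 = extractelement <2 x half> %c, i64 0
  %a1 = extractelement <2 x half> %a, i64 1
  %b1 = extractelement <2 x half> %b, i64 1
  %c1 = extractelement <2 x half> %c, i64 1
  %r0 = call half @llvm.fma.f16(half %a0, half %b0, half %c0)
  %r1 = call half @llvm.fma.f16(half %a1, half %b1, half %c1)
  %v0 = insertelement <2 x half> undef, half %r0, i64 0
  %v1 = insertelement <2 x half> %v0, half %r1, i64 1
  ret <2 x half> %v1
}
```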
I confirmed that disabling the scalarization pass indeed enables packed instructions to be generated. However, enabling packed instructions didn't result in any speed-up: on the simple convolution workload I tested, the runtimes of the original and the modified amdvlk were not meaningfully different.
Are you able to share the SPIR-V for the shader?
Sorry, my comment about the binary size was not correct: I ran auto-tuning on the conv2d workload separately for the two amdvlk versions, so the input SPIR-V differs between the two cases. The attached zip is the SPIR-V used to generate the packed-instruction version. Using the modified amdvlk results in this asm having v_pk_fma_f16 instructions.
Sorry for the slow response, I've had a look at it now. The scalar f16 sequences are being transformed into packed operations (with insert/extract around them), which looks to be in line with what I'd expect. Nearly all the differences are of this form throughout the code (it looks like there's been some unrolling?).
Thank you for taking a look. Yes, currently the inputs to the convolution are scalar fp16 buffers, and I manually added 2-way vectorization in the inner loop to experiment with packed-instruction codegen. I believe we can remove those insert/extract instructions if TVM keeps float16x2 buffer input/output throughout. It is great to learn that packed instructions can be generated by amdvlk with a simple modification.
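To illustrate that point (hypothetical IR, not from the actual shader): with scalar f16 buffers each pair of lanes has to be assembled by hand, whereas with float16x2 buffers end to end the pair is a single vector load and the insert/extract churn disappears.

```llvm
; Scalar f16 buffer: lanes are loaded one at a time and packed by hand.
define <2 x half> @load_pair_from_scalar_buf(half* %p) {
  %p1 = getelementptr half, half* %p, i64 1
  %e0 = load half, half* %p
  %e1 = load half, half* %p1
  %v0 = insertelement <2 x half> undef, half %e0, i64 0
  %v1 = insertelement <2 x half> %v0, half %e1, i64 1
  ret <2 x half> %v1
}

; float16x2 buffer: the same pair is one vector load, no inserts needed.
define <2 x half> @load_pair_from_vector_buf(<2 x half>* %p) {
  %v = load <2 x half>, <2 x half>* %p
  ret <2 x half> %v
}
```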
I'll take a look at some options for letting f16 vectors through scalarization.
I've created an experimental change that disables scalarization, but only for v2half types: https://github.com/dstutt/llvm-project/tree/no-scalarize-v2f16. See if that works for you. It might be possible to upstream something based on this, with target hooks to allow disabling scalarization for specific types (e.g. v2f16 for us).
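If I understand the branch description correctly, the intended effect would be along these lines (this is an assumption about the branch's behavior, not verified against its code): v2f16 operations survive scalarization and can select to v_pk_fma_f16, while other vector types are still split.

```llvm
declare <2 x half>  @llvm.fma.v2f16(<2 x half>, <2 x half>, <2 x half>)
declare <4 x float> @llvm.fma.v4f32(<4 x float>, <4 x float>, <4 x float>)

; Assumed to be left intact by the no-scalarize-v2f16 branch, so the
; backend can select v_pk_fma_f16.
define <2 x half> @kept_packed(<2 x half> %a, <2 x half> %b, <2 x half> %c) {
  %r = call <2 x half> @llvm.fma.v2f16(<2 x half> %a, <2 x half> %b, <2 x half> %c)
  ret <2 x half> %r
}

; Other vector types would presumably still be scalarized into per-lane
; llvm.fma.f32 calls, keeping the register-pressure benefits described above.
define <4 x float> @still_scalarized(<4 x float> %a, <4 x float> %b, <4 x float> %c) {
  %r = call <4 x float> @llvm.fma.v4f32(<4 x float> %a, <4 x float> %b, <4 x float> %c)
  ret <4 x float> %r
}
```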
Yes, I confirmed that this change enables v_pk_fma_f16 generation. For now, I'm happy with this solution, so I'll close this issue.
See GPUOpen-Drivers/AMDVLK#279.
Hi, I'm trying to generate packed math instructions like v_pk_fma_f16 via TVM for DL inference use cases. Using the LLVM AMDGPU backend directly (via rocm), I was able to generate asm containing the v_pk_fma_f16 instruction, like this: https://gist.github.com/masahi/2de1a7dc87e2068ffb50ba6135273f95#file-conv2d_nhwc_float16x2-s-L495-L496. But I couldn't get the equivalent asm if I go through SPIR-V and AMDVLK; I only get v_fma_f16 etc., even though the generated SPIR-V looks good to me: it has instructions that operate on float16x2.

After a quick investigation, I identified that the scalarization pass at llpc/lgc/patch/Patch.cpp (line 247 in 95b2dfd) transforms the llvm.fma.v2f16(...) intrinsic into scalar llvm.fma.f16 instructions.

Is the scalarization pass necessary? If so, is there a way to support packed math instructions in AMDVLK?