[Intel] special handling for transferWithinBlock for boolean values #3599
Conversation
I think `i1` is just a conceptual type that needs to be materialized later, because every HW architecture has only byte-addressable memory. What is the PTX on the NV backend for this case? The code makes no sense to me: it loads 16 x i1, i.e. two bytes, from SLM, but only uses 2 values from it, at indices 0 and 8. Should we just make it 2 x i8 and then convert it to i1 by trunc or cmp.ne?
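A minimal LLVM IR sketch of the suggested alternative (pointer names and the SLM address space are illustrative assumptions): load the two bytes as `i8` and materialize the booleans with `icmp ne` instead of loading a `<16 x i1>` vector.

```llvm
; Load the two packed bytes individually as i8 (hypothetical pointers).
%b0 = load i8, ptr addrspace(3) %slmPtr
%p1 = getelementptr inbounds i8, ptr addrspace(3) %slmPtr, i32 1
%b1 = load i8, ptr addrspace(3) %p1
; Recover the booleans with a compare against zero (the cmp.ne option).
%v0 = icmp ne i8 %b0, 0
%v1 = icmp ne i8 %b1, 0
```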
That is also what the working code does: it loads two bytes from each location, but takes double the number of load instructions to do it. The "broken" code is actually a bit clearer: it loads two bytes each from two locations in SLM and reads the first bit from each byte. What I do not understand is why the broken code is not working; we looked at the IGC shader dumps and didn't see anything obvious.

Working shader: four 1-byte loads (note that they have been optimized to two 2-byte loads, just as we would expect). Broken shader: the loads are 4 bytes each, and there are 4 of them!
Well, that's what the patch does. But I think long term we should figure out why the code as generated is not working. I am working on a unit test to make the IR a little easier to read, and then I can try to get some PTX.
After further investigation, the code generated by Triton without this patch is correct. But the PromoteBools pass within IGC appears to be changing the bit type in the LLVM vector type to bytes, which loads the incorrect data. A ticket has been filed with IGC, but I think we should merge this patch and the test for now, and then revert the changes in common code once IGC resolves the problem.
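A sketch of the suspected PromoteBools effect, assuming the pass widens the `i1` element type to `i8` as described above (the actual IGC transform may differ):

```llvm
; Intended load: <16 x i1> is bit-packed, so this reads 2 bytes of SLM.
%bits  = load <16 x i1>, ptr addrspace(3) %p
; After widening i1 to i8, the same load covers 16 bytes, so lanes 0 and
; 8 no longer address the intended bits (consistent with the four 4-byte
; loads seen in the broken shader dump).
%bytes = load <16 x i8>, ptr addrspace(3) %p
```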
Force-pushed from 2af840a to 212b45e
Given this convert op layout lowering:
We discovered that the generated IR changed after https://github.com/intel/intel-xpu-backend-for-triton/pull/3515/files#diff-fd4c24537e95bcab1b909fd764c84d63e5a844e1aa1ffaf5354510572a7d8bc6
Previously the shared memory load (after transformation) was returned as an `i8` pointer and each element was extracted using `gep` instructions. But using the upstream method of transferring between blocks with linear layout, the extract uses an `i1` vector of length 16, which seems to be causing some trouble in the IGC lowering.
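The actual IR was elided above; a hedged reconstruction of the two lowerings (names are hypothetical) might look like this:

```llvm
; Before #3515: elements come from an i8 pointer via gep, one byte each.
%ptr  = getelementptr inbounds i8, ptr addrspace(3) %base, i32 %offset
%byte = load i8, ptr addrspace(3) %ptr
; With the upstream linear-layout transfer: a <16 x i1> vector is loaded
; and individual bits are taken with extractelement.
%vec = load <16 x i1>, ptr addrspace(3) %base
%bit = extractelement <16 x i1> %vec, i32 0
```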
Inserting the `icmp_ne` instruction (which was previously used in `processReplica` here: https://github.com/intel/intel-xpu-backend-for-triton/pull/3515/files#diff-3fa75fa6b39886d9576a671c306d98b0deb43f81c2fc7873ad08892d190d2622L215) forces us back to the existing method for doing the conversion. We need to figure out whether there is a true hardware limitation here or a bug, and it is possible there are better ways to handle this when converting the layouts. For now, I left the change in common upstream code and am marking this as a draft. But if this is the most expedient way to resolve the regression without side effects, then I think we should move forward.

cc #3570
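A minimal sketch of the round trip this forces, assuming booleans travel through shared memory as bytes as in the old `processReplica` path (names hypothetical):

```llvm
; Store side: widen the boolean to a byte before writing it to SLM.
%wide = zext i1 %val to i8
store i8 %wide, ptr addrspace(3) %slot
; Load side: read the byte back and rematerialize the i1 with icmp ne,
; which keeps the transfer on the byte-based path.
%raw  = load i8, ptr addrspace(3) %slot
%bool = icmp ne i8 %raw, 0
```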