-
-
Notifications
You must be signed in to change notification settings - Fork 4
FIX: Adding build-patch to disable Scalar FMA on SLEEF #62
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
|
We cannot forward this patch over conda-forge as there we used the conda vendored SLEEF, forcing this patch there will be making all general public to accept it (and this is bad) read the last messages of this conversation here conda-forge/numpy_quaddtype-feedstock#11 So maybe soon as they get the native macos ARM runners, we can then port this fix there. (This thing can be documented) |
|
@mhvk you might want to test this on your old machine? maybe directly installing from my fork's branch? |
|
Yes, that works! Thanks!!! |
| if(SLEEF_TARGET_PROCESSOR MATCHES "(x86|AMD64|amd64|^i.86$)") | ||
| set(SLEEF_ARCH_X86 ON CACHE INTERNAL "True for x86 architecture.") | ||
|
|
||
| - set(CLANG_FLAGS_ENABLE_PURECFMA_SCALAR "-mavx2;-mfma") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Unfortunately my cmake is pretty terrible: is there a way to set these flags for x86_64-v3 or newer?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was reading and it seems querying CPUIDs is a way to conditionally set the flags, also not just flags but the entire dispatch mechanism.
I'll try if I somehow managed to get this right, as this unconditional SIMD dispatching is very tightly integrated with SLEEF's FMA code generation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, thanks! It'd be nice to enable better optimizations when people build for themselves but not critical.
|
Also should we merge the #59 ? to get that CI working here for this PR? |
Probably this first, then that PR, otherwise the tests will be failing temporarily. |
Maybe one more, I am not sure if possible but if Meson allows conditional patch-applying then we can detect by ourselves about x86_64-v2 CPUs and apply the patch else don't :) |
|
@ngoldbaum I tested this thing on my fork and it worked, before I make the updates here to overwhelm, here is the workflow if it makes sense then will push the changes here.
|
|
Let me know if it sounds good then I'll cherry-pick the commits here |
|
Sounds reasonable to me. The new build argument will need a mention in the readme, probably with a link to the SLEEF issue you opened and a note that it's a workaround for a SLEEF issue. |
| assert result == True | ||
|
|
||
|
|
||
| def test_sleef_purecfma_symbols(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a test that justs prints the available purecfma functions inside the quaddtype build (if supports) otherwise nothing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Which seems to work as from SandyBridge processor logs (fma is disabled) the output has nothing
tests/test_quaddtype.py::test_sleef_purecfma_symbols PASSEDbut on haswell (fma enabled) output shows some
tests/test_quaddtype.py::test_logical_reduce_on_non_quad_arrays PASSED
tests/test_quaddtype.py::test_sleef_purecfma_symbols
✓ Found 67 PURECFMA symbols (FMA optimizations enabled)
Sample symbols:
000000000014cd90 t sleef_acoshq1_u10purecfma
000000000013eab0 t sleef_acosq1_u10purecfma
0000000000129540 t sleef_addq1_u05purecfma
000000000014a150 t sleef_asinhq1_u10purecfma
000000000013d3b0 t sleef_asinq1_u10purecfma
... and 62 more
PASSED
Can you update the readme and/or docs too? |
Yup on it |
|
The docs include the sections of README so they'll automatically get updated |
ngoldbaum
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome! I didn't spot any issues in the new meson configuration.
|
Thanks, I updated the PR's description and will merge once @mhvk gives a thumbs up by installing from this new workflow |
|
@SwayamInSync - unfortunately, the branch no longer works! Just to be sure I didn't make a mistake before, I checked out the second commit (efba625) and that does work, so something must have gone wrong in the third commit (f574019)... The both statements are true for creating a fresh virtual environment, I also tried what is suggested in the readme now, but that errors: In case it helps, inside |
|
Can you send the installation command you tried? |
|
@mhvk I guess you might be having an old cached subproject folder from previous install, because only then meson skips the new build and tried to use the older directory and fails. Can you please try the following steps This will remove the old cached files and will do a fresh build |
|
you might need |
|
https://git-scm.com/docs/git-clean#Documentation/git-clean.txt---force Git: it has the best UI |
|
I did the above What does help is See https://www.astro.utoronto.ca/~mhvk/build_log_success.txt |
|
@mhvk both the logs are showing What was the issue here? |
|
Installation indeed works in both cases, but for the PR as it stands I get a failure on import: while installing at the second commit allows the import to succeed. |
|
I saw the failure log is saying Maybe the implicit fma check is not working? can you try passing the flag now? Make sure all the old subprojects are cleaned |
|
I made the implicit FMA check to compile and run the code, since @mhvk is getting the error at runtime, so this should be now catching it correct. |
|
Just checked on CI emulating SandyBridge, the detection is now working correct (although emulating build was pretty slow ~40 mins to build, so we are only be using it at runtime) and this is on haswell |
|
Super! I now checked that on the latest branch, and just a plain |
|
Thanks @mhvk for confirming it, your verification helped a lot today 😄 |
|
That definitely was a tricky one... |
|
|
||
| python -m pip uninstall -y numpy_quaddtype | ||
| python -m pip install . -vv 2>&1 | tee build_log.txt | ||
| # pip install . --no-build-isolation -v -Csetup-args=-Ddisable_fma=true 2>&1 | tee build_log.txt |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we still need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah will be easy to test the option on x86 machines
closes #56
This PR adds the patch to disable the quad scalar FMA instruction compile and dispatch, this was the only behaviour which was enabled by default and cannot be controlled by SIMD feature detection. I noticed for enabling/disabling AVX the detection already works as expected (so this scalar FMA was the only issue)
In terms of performance on V3+ CPUs, the scalar computation will use the pure-C implementations instead of FMA, this might effect the currentthe QBLAS performance won't be hurt that much as it uses the vectorized implementations on supported machines.numpy_quaddtypecodebase performance (i.e. ufunc, other slots)NOTE: On non-x86-64 machines, scalar will dispatch the FMA instruction and there will be no performance hurt
UPDATE: With the new workflow, build detects the FMA support and dispatches the corresponding SIMD build flags, so modern x86_64-V3 SIMD CPUs will no longer get performance hurt. Also introduced a new flag
disable_fmaat quaddtype build for user to enable/disable the FMA build#59 is required to be merged before this (I validated the working on my fork)