@SwayamInSync
Member

SwayamInSync commented Jan 21, 2026

closes #56

This PR adds a patch that disables compilation and dispatch of the quad scalar FMA code path. This was the only behaviour that was enabled by default and could not be controlled by SIMD feature detection. I checked that enabling/disabling AVX already works as expected through detection, so the scalar FMA path was the only issue.

In terms of performance on x86-64-v3+ CPUs, scalar computations will use the pure-C implementations instead of FMA. This may affect the performance of the current numpy_quaddtype codebase (i.e. ufuncs and other slots), but QBLAS performance shouldn't suffer much, because it uses the vectorized implementations on supported machines.

As a future note, I was already planning to extend QBLAS and replace all the quaddtype loops with the vectorized loops from QBLAS.

NOTE: On non-x86-64 machines, the scalar path still dispatches the FMA instructions, so there is no performance penalty.

UPDATE: With the new workflow, the build detects FMA support and sets the corresponding SIMD build flags, so modern x86_64-v3 CPUs no longer take a performance hit. I also introduced a new build option, disable_fma, so users can enable/disable the FMA build.

#59 needs to be merged before this (I validated that it works on my fork).

@SwayamInSync
Member Author

SwayamInSync commented Jan 21, 2026

We cannot forward this patch to conda-forge, because there we use the conda-vendored SLEEF; forcing this patch there would make the general public accept it (which is bad). See the last messages of the conversation in conda-forge/numpy_quaddtype-feedstock#11.

Maybe as soon as they get native macOS ARM runners, we can port this fix there. (This can be documented.)

@SwayamInSync
Member Author

@mhvk you might want to test this on your old machine?

maybe directly installing from my fork's branch?

https://github.com/SwayamInSync/numpy-quaddtype/tree/simd-patch

@mhvk
Contributor

mhvk commented Jan 21, 2026

Yes, that works! Thanks!!!

if(SLEEF_TARGET_PROCESSOR MATCHES "(x86|AMD64|amd64|^i.86$)")
set(SLEEF_ARCH_X86 ON CACHE INTERNAL "True for x86 architecture.")

- set(CLANG_FLAGS_ENABLE_PURECFMA_SCALAR "-mavx2;-mfma")
Member

Unfortunately my cmake is pretty terrible: is there a way to set these flags for x86_64-v3 or newer?

Member Author

From what I've read, querying CPUID is one way to set the flags conditionally, and not just the flags but the entire dispatch mechanism.
I'll try to get this right, although this unconditional SIMD dispatching is very tightly integrated with SLEEF's FMA code generation.
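As a rough illustration of conditioning on CPU features rather than on the architecture alone, here is a Linux-only Python sketch that reads the feature flags from /proc/cpuinfo and checks for avx2 and fma, the two features the "-mavx2;-mfma" flags above require. It is a sketch under stated assumptions, not what the PR implements (the PR ends up using a compile/run probe inside SLEEF's meson.build), and all function names are hypothetical.

```python
from pathlib import Path

# Assumption: these are the features needed by the "-mavx2;-mfma" flags in
# SLEEF's CMake snippet; this is not a full x86-64-v3 check.
REQUIRED_FLAGS = {"avx2", "fma"}


def parse_cpu_flags(cpuinfo_text):
    """Return the set of feature flags from /proc/cpuinfo-style text.

    Hypothetical helper: only the first "flags" line is read, assuming all
    cores report the same features.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def supports_purecfma_flags(flags):
    """True if every feature required by -mavx2 -mfma is present."""
    return REQUIRED_FLAGS <= flags


def host_supports_purecfma_flags(path="/proc/cpuinfo"):
    """Linux-only check of the running CPU; returns None if the file is missing."""
    p = Path(path)
    if not p.exists():
        return None
    return supports_purecfma_flags(parse_cpu_flags(p.read_text()))
```

For example, a Sandy Bridge flags line (avx present, but no avx2 or fma) gives False, while a Haswell flags line gives True.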

Member

OK, thanks! It'd be nice to enable better optimizations when people build for themselves but not critical.

@SwayamInSync
Member Author

Also, should we merge #59 first, to get CI working for this PR?

@ngoldbaum
Member

> Also should we merge the #59 ?

Probably this first, then that PR, otherwise the tests will be failing temporarily.

@SwayamInSync
Member Author

SwayamInSync commented Jan 21, 2026

> Unfortunately my cmake is pretty terrible: is there a way to set these flags for x86_64-v3 or newer?

One more idea, though I'm not sure it's possible: if Meson allows applying a patch conditionally, we could detect x86_64-v2 CPUs ourselves and apply the patch only on those :)
This would've been pretty simple if Meson allowed it (I didn't find any information about this online).

@SwayamInSync
Member Author

@ngoldbaum I tested this on my fork and it worked. Before pushing the updates here and overwhelming the PR, here is the workflow; if it makes sense, I'll push the changes here.

  1. For native systems where FMA is not supported: inside SLEEF's meson.build, a C code snippet is compiled, and if that fails, a new custom flag SLEEF_DISABLE_PURECFMA_SCALAR=ON is set (this flag does not exist in SLEEF; our patch adds it), which disables the FMA flags and the FMA code dispatch. If the snippet compiles, the flag is set to OFF: the patch is still applied, but it changes nothing.
  2. For CI (emulation/cross-compilation): here the host machine usually supports FMA while we emulate a device that doesn't, so in these cases a user can install numpy_quaddtype with pip install . --no-build-isolation -v -Csetup-args=-Ddisable_fma=true. This new option then takes the same path as described above.
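The decision described in the two steps above can be modelled in a few lines of Python. This is only a model of the logic, assuming the probe result and the disable_fma option are available as booleans; the real implementation lives in the patched SLEEF meson.build, and the function name here is hypothetical.

```python
def purecfma_scalar_setting(fma_probe_ok, disable_fma=False):
    """Return the value for the patch's SLEEF_DISABLE_PURECFMA_SCALAR flag.

    fma_probe_ok: whether the FMA test snippet succeeded on this machine.
    disable_fma:  the user-facing -Ddisable_fma option (e.g. for emulated CI).
    """
    # The explicit option wins: an emulated target may lack FMA even though
    # the build host passes the probe.
    if disable_fma:
        return "ON"
    # Otherwise follow the probe: no FMA means the PURECFMA scalar path is disabled.
    return "OFF" if fma_probe_ok else "ON"
```

"OFF" corresponds to the case where the patch is applied but leaves SLEEF's default behaviour unchanged.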

@SwayamInSync
Member Author

Let me know if it sounds good then I'll cherry-pick the commits here

@ngoldbaum
Member

Sounds reasonable to me. The new build argument will need a mention in the readme, probably with a link to the SLEEF issue you opened and a note that it's a workaround for a SLEEF issue.

assert result == True


def test_sleef_purecfma_symbols():
Member Author

It's a test that just prints the purecfma functions available in the quaddtype build (if FMA is supported), and prints nothing otherwise.

Member Author

It seems to work: in the Sandy Bridge processor logs (FMA disabled), the output is empty

tests/test_quaddtype.py::test_sleef_purecfma_symbols PASSED

but on Haswell (FMA enabled), the output lists some symbols

tests/test_quaddtype.py::test_logical_reduce_on_non_quad_arrays PASSED
tests/test_quaddtype.py::test_sleef_purecfma_symbols 
✓ Found 67 PURECFMA symbols (FMA optimizations enabled)
  Sample symbols:
    000000000014cd90 t sleef_acoshq1_u10purecfma
    000000000013eab0 t sleef_acosq1_u10purecfma
    0000000000129540 t sleef_addq1_u05purecfma
    000000000014a150 t sleef_asinhq1_u10purecfma
    000000000013d3b0 t sleef_asinq1_u10purecfma
    ... and 62 more
PASSED
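A check like this can be approximated by parsing nm output. The sketch below assumes the three-column nm format shown in the log above (address, type, name) and that PURECFMA symbols end in "purecfma"; it is not the test's actual code, the function names are hypothetical, and running nm on the built extension module is left to the caller.

```python
def find_purecfma_symbols(nm_output):
    """Return symbol names from nm-style output that end in "purecfma".

    Assumes lines of the form "<address> <type> <name>"; lines with a
    different shape (e.g. undefined symbols, which have no address) are skipped.
    """
    symbols = []
    for line in nm_output.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[2].endswith("purecfma"):
            symbols.append(parts[2])
    return symbols


def summarize(symbols, sample=5):
    """Format a short report similar to the test output; empty if no symbols."""
    if not symbols:
        return ""
    lines = [f"Found {len(symbols)} PURECFMA symbols"]
    lines += [f"    {name}" for name in symbols[:sample]]
    if len(symbols) > sample:
        lines.append(f"    ... and {len(symbols) - sample} more")
    return "\n".join(lines)
```

An empty result corresponds to the Sandy Bridge case, where the PURECFMA path was not compiled in.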

@SwayamInSync
Member Author

> @mhvk you might want to test this on your old machine?
>
> maybe directly installing from my fork's branch?
>
> https://github.com/SwayamInSync/numpy-quaddtype/tree/simd-patch

@mhvk just for double checking, can you give another shot?

@ngoldbaum
Member

> The new build argument will need a mention in the readme, probably with a link to the SLEEF issue you opened and a note that it's a workaround for a SLEEF issue.

Can you update the readme and/or docs too?

@SwayamInSync
Member Author

> Can you update the readme and/or docs too?

Yup on it

@SwayamInSync
Member Author

The docs include sections of the README, so they'll get updated automatically.

Member

ngoldbaum left a comment

Awesome! I didn't spot any issues in the new meson configuration.

@SwayamInSync
Member Author

Thanks, I updated the PR description and will merge once @mhvk gives a thumbs up after installing with this new workflow.

@mhvk
Contributor

mhvk commented Jan 21, 2026

@SwayamInSync - unfortunately, the branch no longer works! Just to be sure I didn't make a mistake before, I checked out the second commit (efba625) and that does work, so something must have gone wrong in the third commit (f574019)...

Both statements hold when creating a fresh virtual environment, running git clean -fxd on the repository, and then running reinstall.sh.

I also tried what is suggested in the readme now, but that errors:

pip install . -Csetup-args=-Ddisable_fma=true
Processing /data/mhvk/packages/numpy-quaddtype
  Installing build dependencies ... done
  Getting requirements to build wheel ... done
  Preparing metadata (pyproject.toml) ... error
  error: subprocess-exited-with-error
  
  × Preparing metadata (pyproject.toml) did not run successfully.
  │ exit code: 1
  ╰─> [42 lines of output]
      + meson setup /data/mhvk/packages/numpy-quaddtype /data/mhvk/packages/numpy-quaddtype/.mesonpy-3pzucj5o -Dbuildtype=release -Db_ndebug=if-release -Db_vscrt=md -Ddisable_fma=true --native-file=/data/mhvk/packages/numpy-quaddtype/.mesonpy-3pzucj5o/meson-python-native-file.ini
      The Meson build system
      Version: 1.10.1
      Source dir: /data/mhvk/packages/numpy-quaddtype
      Build dir: /data/mhvk/packages/numpy-quaddtype/.mesonpy-3pzucj5o
      Build type: native build
      WARNING: Project does not target a minimum version but uses feature introduced in '1.1': meson.options file. Use meson_options.txt instead
      Project name: numpy_quaddtype
      Project version: undefined
      C compiler for the host machine: /usr/bin/ccache cc (gcc 15.2.0 "cc (Debian 15.2.0-12) 15.2.0")
      C linker for the host machine: cc ld.bfd 2.45.50.20251209
      C++ compiler for the host machine: /usr/bin/ccache c++ (gcc 15.2.0 "c++ (Debian 15.2.0-12) 15.2.0")
      C++ linker for the host machine: c++ ld.bfd 2.45.50.20251209
      Host machine cpu family: x86_64
      Host machine cpu: x86_64
      Program python found: YES (/data/mhvk/packages/numpy-quaddtype/temp/bin/python3)
      Did not find pkg-config by name 'pkg-config'
      Found pkg-config: NO
      Run-time dependency python found: YES 3.13
      Did not find pkg-config by name 'pkg-config'
      Found pkg-config: NO
      Found CMake: /tmp/user/2500/pip-build-env-6sbgqy4r/overlay/bin/cmake (4.2.1)
      Run-time dependency qblas found: NO (tried pkgconfig and cmake)
      Looking for a fallback subproject for the dependency qblas
      
      Executing subproject qblas
      
      qblas| Project name: qblas
      qblas| Project version: 1.0.0
      qblas| Build targets in project: 0
      qblas| Subproject qblas finished.
      
      Dependency qblas from subproject subprojects/qblas found: YES 1.0.0
      Run-time dependency sleef found: NO (tried pkgconfig and cmake)
      Message: SLEEF FMA disable option: true
      
      Executing subproject sleef
      
      
      ../subprojects/sleef/meson.build:1:0: ERROR: Unknown option: "sleef:disable_fma".
      
      A full log can be found at /data/mhvk/packages/numpy-quaddtype/.mesonpy-3pzucj5o/meson-logs/meson-log.txt
      [end of output]
  
  note: This error originates from a subprocess, and is likely not a problem with pip.
error: metadata-generation-failed

× Encountered error while generating package metadata.
╰─> from file:///data/mhvk/packages/numpy-quaddtype

note: This is an issue with the package mentioned above, not pip.
hint: See above for details.

In case it helps, inside subprojects/sleef,

git status
HEAD detached at 3.9.0
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git restore <file>..." to discard changes in working directory)
        modified:   Configure.cmake
        modified:   src/quad/CMakeLists.txt
        modified:   src/quad/qdispscalar.c.org

Untracked files:
  (use "git add <file>..." to include in what will be committed)
        .meson-subproject-wrap-hash.txt
        fix-purecfma-scalar-x86.patch
        meson.build

no changes added to commit (use "git add" and/or "git commit -a")

@SwayamInSync
Member Author

SwayamInSync commented Jan 21, 2026

Can you send the installation command you tried?

@SwayamInSync
Member Author

@mhvk I guess you might have an old cached subproject folder from a previous install: only in that case does meson skip the new build, try to use the older directory, and fail.

Can you please try the following steps:

git clone -b simd-patch https://github.com/SwayamInSync/numpy-quaddtype.git
cd numpy-quaddtype

# all the env steps if needed
bash reinstall.sh

This will remove the old cached files and do a fresh build.

@ngoldbaum
Member

you might need git clean -ffxd to remove the subprojects

@mhvk
Contributor

mhvk commented Jan 21, 2026

I did the above git clone, etc., in a clean directory -- no luck. Doing git clean -ffxd beforehand also did not help. See
https://www.astro.utoronto.ca/~mhvk/build_log_failure.txt for a build after doing that deep clean.

What does help is

git checkout HEAD~2
reinstall.sh

See https://www.astro.utoronto.ca/~mhvk/build_log_success.txt

@SwayamInSync
Member Author

@mhvk both the logs are showing

Successfully installed numpy_quaddtype-1.0.0

What was the issue here?

@mhvk
Contributor

mhvk commented Jan 21, 2026

Installation indeed works in both cases, but for the PR as it stands I get a failure on import:

python -c "import numpy_quaddtype"
Illegal instruction        (core dumped) python -c "import numpy_quaddtype"

while installing at the second commit allows the import to succeed.

@SwayamInSync
Member Author

SwayamInSync commented Jan 21, 2026

I saw the failure log is saying

  sleef| Checking if "FMA instruction support" compiles: YES
  sleef| Message: FMA supported - enabling PURECFMA scalar code path

Maybe the implicit FMA check is not working? Can you try passing the flag now?

pip install . -v -Csetup-args=-Ddisable_fma=true 2>&1 | tee build_log.txt

Make sure all the old subprojects are cleaned

@SwayamInSync
Member Author

I changed the implicit FMA check to compile and run the code; since @mhvk gets the error at runtime, this should now catch it correctly.
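The difference between "compiles" and "runs" matters because a compiler given -mfma will emit FMA instructions even on a CPU that cannot execute them; only running the result reveals the illegal instruction. Below is a minimal Python sketch of such a compile-and-run probe, assuming a POSIX system with a GCC/Clang-compatible compiler named cc on PATH; the probe source and the helper name are hypothetical, not the code in the patched meson.build.

```python
import os
import shutil
import subprocess
import tempfile

# Hypothetical probe: forces an FMA instruction via the x86 intrinsic.
PROBE_SRC = r"""
#include <immintrin.h>
int main(void) {
    volatile double a = 2.0, b = 3.0, c = 1.0;  /* volatile: no constant folding */
    __m128d r = _mm_fmadd_sd(_mm_set_sd(a), _mm_set_sd(b), _mm_set_sd(c));
    return _mm_cvtsd_f64(r) == 7.0 ? 0 : 1;
}
"""


def fma_probe(cc="cc"):
    """Return (compiles, runs); runs is None if the probe could not be executed."""
    if shutil.which(cc) is None:
        return (False, None)
    with tempfile.TemporaryDirectory() as tmp:
        src = os.path.join(tmp, "probe.c")
        exe = os.path.join(tmp, "probe")
        with open(src, "w") as f:
            f.write(PROBE_SRC)
        build = subprocess.run([cc, "-mfma", src, "-o", exe], capture_output=True)
        if build.returncode != 0:
            # e.g. non-x86 targets, where -mfma is not accepted
            return (False, None)
        try:
            run = subprocess.run([exe], capture_output=True)
        except OSError:
            return (True, None)
        # SIGILL on a CPU without FMA gives a negative return code, i.e. False.
        return (True, run.returncode == 0)
```

On a Sandy Bridge machine this would be expected to report (True, False), matching the "compiles: YES" versus "runs: NO" behaviour seen in this thread.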

@SwayamInSync
Member Author

I just checked on CI emulating Sandy Bridge, and the detection now works correctly (although the emulated build was pretty slow, ~40 minutes, so we'll only use emulation at runtime):

sleef| Checking if "FMA instruction runtime support" runs: NO (1)
sleef| Message: FMA not supported at runtime - disabling PURECFMA scalar code path

and this is on haswell

sleef| Checking if "FMA instruction runtime support" runs: YES
sleef| Message: FMA supported - enabling PURECFMA scalar code path

@mhvk
Copy link
Contributor

mhvk commented Jan 21, 2026

Super! I now checked that on the latest branch, and just a plain pip install . works. Thanks so much!!

@SwayamInSync
Member Author

Thanks @mhvk for confirming it, your verification helped a lot today 😄

@mhvk
Contributor

mhvk commented Jan 21, 2026

That definitely was a tricky one...


python -m pip uninstall -y numpy_quaddtype
python -m pip install . -vv 2>&1 | tee build_log.txt
# pip install . --no-build-isolation -v -Csetup-args=-Ddisable_fma=true 2>&1 | tee build_log.txt
Contributor

Do we still need this?

Member Author

Yeah, it makes it easy to test the option on x86 machines.

SwayamInSync merged commit b396148 into numpy:main Jan 22, 2026
13 checks passed
SwayamInSync deleted the simd-patch branch January 22, 2026 06:17

Development

Successfully merging this pull request may close these issues.

Failure on import (for older processor?)

4 participants