perf report shows most cycles spent in blas_thread_server #5171

Open
nickdesaulniers opened this issue Mar 5, 2025 · 6 comments

@nickdesaulniers

Not sure whether this is an issue with OpenBLAS itself or with users of OpenBLAS (numpy, pytorch).

I'm seeing slow python imports of pytorch; literally import torch is taking multiple seconds on my system.

When I record the python interpreter with linux perf record, perf report shows most cycles are spent in blas_thread_server via BOTH liblapack.so.3 and libcblas.so.3. i.e.

Overhead  Command  Shared Object             Symbol
  40.31%  python   liblapack.so.3            [.] blas_thread_server
  36.85%  python   libcblas.so.3             [.] blas_thread_server

If I annotate either, both hotspots appear near the read of the time stamp counter:

  0.31 │3c:┌─→mov   (%r15),%rax                                                                                      ▒
       │   │  cmp   $0x1,%rax                                                                                        ▒
       │   │↓ ja    b0                                                                                               ▒
       │   │  nop                                                                                                    ▒
       │   │  nop                                                                                                    ▒
       │   │  nop                                                                                                    ▒
       │   │  nop                                                                                                    ▒
       │   │  nop                                                                                                    ▒
       │   │  nop                                                                                                    ▒
  5.29 │   │  nop                                                                                                    ▒
       │   │  nop                                                                                                    ▒
       │   │  rdtsc                                                                                                  ◆
 91.82 │   │  sub   %ecx,%eax                                                                                        ▒
       │   │  cmp   %eax,thread_timeout                                                                              ▒
  2.59 │   └──jae   3c

I'm guessing that corresponds to the code around here.
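For context, here is my own simplified reconstruction of what that loop appears to be doing - a worker spinning on the time stamp counter until work arrives or a timeout expires. This is a sketch, not OpenBLAS's actual source; the names are illustrative, and a monotonic clock stands in for rdtsc:

```python
import time

def spin_for_work(queue, timeout_ns):
    """Sketch of the annotated idle loop: spin, re-reading a cycle/time
    counter (rdtsc in the real code; a monotonic clock here), until
    either work appears in the queue or the timeout elapses and the
    worker goes to sleep. Names are illustrative, not OpenBLAS's."""
    start = time.monotonic_ns()              # stands in for rdtsc
    while len(queue) <= 1:                   # mov (%r15),%rax; cmp $0x1
        if time.monotonic_ns() - start > timeout_ns:
            return False                     # timed out: go to sleep
    return True                              # work arrived

# With an empty queue, the loop burns CPU until the timeout expires:
print(spin_for_work([], 5_000_000))  # → False
```

On an idle thread pool this loop is exactly where the cycles go, which would explain the perf profile above even if no useful BLAS work is happening.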

numpy/numpy#24639 seems like someone else hit this, too, but...https://xkcd.com/979/.

How do I even go about debugging this further? Is it an issue in pytorch? numpy? openblas? PEBKAC?

Importing numpy alone doesn't seem problematic, though I suspect it's part of the dependency chain here. Perhaps it's related to how pytorch is (mis)using numpy, then?

@martin-frbg
Collaborator

That reads like you have threads idling while there is nothing for them to do. What hardware is this, which version of OpenBLAS, and do you (or pytorch) do anything to constrain the number of threads OpenBLAS will use? (By default, it will start as many threads as there are CPU cores.)

@rgommers
Contributor

rgommers commented Mar 6, 2025

It'd be great to get an exact reproducer for this @nickdesaulniers. Without knowing how you installed numpy, pytorch and whatever else is in your environment (venv or conda env?), this isn't reproducible.

@nickdesaulniers
Author

what hardware is this

$ uname -m
x86_64

which version of OpenBLAS

Hard to say, exactly. Numpy can print some info about how it was configured, but I don't see any version info in its output.

https://numpy.org/doc/1.24/reference/generated/numpy.show_config.html

I am using numpy '1.23.5' from an environment managed by conda.

Given the .so files, is it possible to check the version from those?

do you (or pytorch) do anything to constrain the number of threads OpenBLAS will use ?

Unsure. What are some symbols I can grep for to check (in numpy and pytorch sources)?

It'd be great to get an exact reproducer for this @nickdesaulniers.

Yeah, sorry. I know it's not the greatest bug report ever.

Without knowing how you installed numpy, pytorch and whatever else is in your environment (venv or conda env?)

We're definitely using conda to manage our python packages.

Right, I'm not even sure myself whether this is an issue with my env, with conda's packaging/configuration of numpy, or with OpenBLAS itself. Filing this in case it has been brought up before and is familiar, or so future travelers can find a similar thread.

@martin-frbg
Collaborator

martin-frbg commented Mar 7, 2025

what hardware is this

$ uname -m
x86_64

thanks - that's a bit more specific than "a computer", but it might help to know if it has like 4 or 400 cores ?

Hard to say, exactly. Numpy can print some info about how it was configured, but I don't see any version info in its output.

Strange, I'd think Numpy's show_config displays the version information, unless you are using a really ancient version of OpenBLAS that does not yet include the version number in its response to openblas_get_config() - this was added in 0.3.4, a bit over 6 years ago...

Unsure. What are some symbols I can grep for to check (in numpy and pytorch sources)?

See if you can find any calls to openblas_set_num_threads() - or anything that sets the environment variable OPENBLAS_NUM_THREADS.

You could also try to constrain it yourself, by setting OPENBLAS_NUM_THREADS to a small(ish) value before starting your numpy/pytorch environment. Depending on the problem (matrix) sizes you are working with, having a large number of threads may not be beneficial: each thread needs to allocate memory on startup, and most will end up idling, lying in ambush for the occasional 4x4 matrix to stumble by.
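One detail worth spelling out: OpenBLAS reads OPENBLAS_NUM_THREADS once, when the shared library is first loaded, so in Python it has to be set before the first numpy/torch import. A minimal sketch (the value 4 is arbitrary):

```python
import os

# OPENBLAS_NUM_THREADS is read once, when the OpenBLAS shared library
# is loaded -- so it must be set before the first numpy/torch import,
# not after. The value 4 here is arbitrary.
os.environ["OPENBLAS_NUM_THREADS"] = "4"

import numpy as np  # BLAS is initialized here, capped at 4 threads

# Any BLAS-backed operation now uses at most 4 OpenBLAS threads.
print(np.ones((8, 8)).dot(np.ones((8, 8)))[0, 0])  # → 8.0
```

Setting the variable in the shell before launching Python (as shown later in this thread) achieves the same thing.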

@nickdesaulniers
Author

nickdesaulniers commented Mar 7, 2025

thanks - that's a bit more specific than "a computer", but it might help to know if it has like 4 or 400 cores ?

$ lscpu
...
Architecture:            x86_64
  CPU op-mode(s):        32-bit, 64-bit
  Address sizes:         48 bits physical, 48 bits virtual
  Byte Order:            Little Endian
CPU(s):                  256
  On-line CPU(s) list:   0-255
Vendor ID:               AuthenticAMD
  Model name:            AMD EPYC 7763 64-Core Processor
    CPU family:          25
    Model:               1
    Thread(s) per core:  2
    Core(s) per socket:  64
    Socket(s):           2
...

I've definitely seen strange performance from multithreaded code on dual-socket NUMA machines before... wonder if that's related, or just a red herring. IIRC, using man 8 numactl to limit processes to a single socket produced speedups. Maybe I should play with that again here and see if it makes a difference.

Strange, I'd think Numpy's show_config displays the version information, unless you are using a really ancient version of OpenBLAS that does not yet include the version number in its response to openblas_get_config() - this was added in 0.3.4
a bit over 6 years ago...

$ python3
>>> import numpy as np
>>> np.show_config()
blas_info:
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['<redacted>/conda/envs/<redacted>/lib']
    include_dirs = ['<redacted>/conda/envs/<redacted>/include']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
blas_opt_info:
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    libraries = ['cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['<redacted>/conda/envs/<redacted>/lib']
    include_dirs = ['<redacted>/conda/envs/<redacted>/include']
    language = c
lapack_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas']
    library_dirs = ['<redacted>/conda/envs/<redacted>/lib']
    language = f77
lapack_opt_info:
    libraries = ['lapack', 'blas', 'lapack', 'blas', 'cblas', 'blas', 'cblas', 'blas']
    library_dirs = ['<redacted>/conda/envs/<redacted>/lib']
    language = c
    define_macros = [('NO_ATLAS_INFO', 1), ('HAVE_CBLAS', None)]
    include_dirs = ['<redacted>/conda/envs/<redacted>/include']
Supported SIMD extensions in this NumPy install:
    baseline = SSE,SSE2,SSE3
    found = SSSE3,SSE41,POPCNT,SSE42,AVX,F16C,FMA3,AVX2
    not found = AVX512F,AVX512CD,AVX512_KNL,AVX512_KNM,AVX512_SKX,AVX512_CLX,AVX512_CNL,AVX512_ICL

>>> print (np.__file__)
<redacted>/conda/envs/<redacted>/lib/python3.11/site-packages/numpy/__init__.py

$ strings <redacted>/sw/conda/envs/<redacted>/lib/python3.11/site-packages/numpy/core/libcblas.so.3 | grep -i 0\\.
OpenBLAS 0.3.29 DYNAMIC_ARCH NO_CH NO_AFFINITY
...

so I guess I'm running 0.3.29, which at the moment looks like your latest release.

Looks like CNAME is the function that returns that configuration string. A quick grep for CNAME turned up no hits in numpy's sources, so I don't think numpy exposes the exact OpenBLAS version string to users in Python. But I don't know much about Python-to-C bindings, so I may have missed it with my simple grep.

Oh, they do have this but it doesn't seem to work.
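For what it's worth, the version string can be queried directly from the shared library with ctypes, without going through numpy at all. The fallback library name below is a guess - point CDLL at the libcblas.so.3 or liblapack.so.3 in your conda env's lib directory if find_library comes up empty:

```python
import ctypes
import ctypes.util

# openblas_get_config() returns the same "OpenBLAS 0.3.29 ..." string
# that `strings` found above. The fallback name here is a guess; use
# the full path to your env's libcblas.so.3 / liblapack.so.3 if needed.
name = ctypes.util.find_library("openblas") or "libcblas.so.3"
try:
    lib = ctypes.CDLL(name)
    lib.openblas_get_config.restype = ctypes.c_char_p
    print(lib.openblas_get_config().decode())
except OSError:
    print("could not load an OpenBLAS shared library at", name)
```

This sidesteps the question of whether numpy exposes the string at all.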

See if you can find any calls to openblas_set_num_threads() - or anything that sets the environment variable OPENBLAS_NUM_THREADS.

No hits for openblas_set_num_threads or OPENBLAS_NUM_THREADS in numpy sources.

You could also try to constrain it yourself, by setting OPENBLAS_NUM_THREADS to a small(ish) value before starting your numpy/pytorch environment.

$ for i in $(seq 1 30); do python -c "import time; z=time.time(); import torch; print(time.time() - z)" >> /tmp/allthreads.txt; done
$ awk '{s+=$1} END {print "ave:", s/NR}' /tmp/allthreads.txt
ave: 1.71255

$ for i in $(seq 1 30); do OPENBLAS_NUM_THREADS=1 python -c "import time; z=time.time(); import torch; print(time.time() - z)" >> /tmp/onethread.txt; done
$ awk '{s+=$1} END {print "ave:", s/NR}' /tmp/onethread.txt
ave: 1.97009

Didn't seem to make a difference, but there's a lot of variance.

@martin-frbg
Collaborator

Ok, thanks. No idea why show_config() would not be more specific - besides the version number, it should tell us the maximum number of threads the library was built for (which may well be smaller than 256).
The symbol name CNAME gets filled in by the build system, so that is the openblas_get_config() function I mentioned.
NUMA effects might play a role, though probably not already during initialization - if anything, I'd expect it to increase latency once threads start wandering between nodes, or once data is shared between threads on different nodes under load. But startup being actually slower with just a single thread would seem to rule out NUMA (as well as anything else about allocating memory buffers per thread). Maybe the problem is not in OpenBLAS itself, and the idling that is so prominent in your perf record happens only after whatever causes the initial slowness.
BTW, what happens when you try an intermediate number, say OPENBLAS_NUM_THREADS=32?
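A quick way to try that (and a few other values) in one go - sketched in Python with a cheap stand-in workload so it runs anywhere; substitute "import torch" for the actual measurement:

```python
import os
import subprocess
import sys
import time

# Sweep a few thread counts and time a child interpreter running the
# workload. "import json" is a cheap stand-in so this sketch runs
# anywhere; substitute "import torch" to reproduce the timings above.
workload = "import json"
for n in ("1", "32", "128"):
    env = dict(os.environ, OPENBLAS_NUM_THREADS=n)
    start = time.perf_counter()
    subprocess.run([sys.executable, "-c", workload], env=env, check=True)
    print(f"OPENBLAS_NUM_THREADS={n}: {time.perf_counter() - start:.3f}s")
```

Because the variable is passed in the child's environment, each run gets a freshly initialized OpenBLAS with the requested thread count.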
