
Improving performance with WebAssembly #4023

Open
rth opened this issue Apr 27, 2023 · 12 comments
@rth

rth commented Apr 27, 2023

Thanks to the discussions and fixes in #3640 and follow up work by @lesteve and @ogrisel we now have a build of OpenBLAS with emscripten for WebAssembly in Pyodide. It works quite well when used via scipy.

I recently ran some benchmarks for square matrix multiplications (DGEMM) to get an idea of the performance, which can be found here. The good news is that the scipy build with OpenBLAS is around 2-3x faster for DGEMM than with the reference BLAS. The less good news is that it's still around 10x slower than essentially the same OpenBLAS version built natively for a modern x86-64 CPU (single-threaded).

For now, that runtime is constrained to a single thread and no SIMD. (Though we should investigate whether it could optionally be built with SIMD, combined with some browser feature detection.)

I was wondering whether there is anything else we could try to improve the performance of OpenBLAS on the WebAssembly platform?

It's currently built with Emscripten using the following options:

make libs shared CC=emcc HOSTCC=gcc TARGET=RISCV64_GENERIC NOFORTRAN=1 NO_LAPACKE=1 \
        USE_THREAD=0 -O2

Thank you!

@rth
Author

rth commented Apr 27, 2023

And a follow-up question: is there, by chance, anything OpenBLAS-specific we could do to reduce the size of the produced shared library? Currently, it's 1.75 MB compressed, which is non-negligible when loading on a web page.

@martin-frbg
Collaborator

The current level of emscripten support from #3640 was expected to be the bare minimum to get it working at all; if this already provides a 2x speedup over the reference implementation, I consider that quite impressive.

As far as I know, there is work underway to have emscripten understand at least SSE and AVX intrinsics, though not the corresponding assembly instructions themselves, as used in the current NEHALEM and SANDYBRIDGE kernels.

Currently, about the only part of BLAS where intrinsics are used is the DOT kernels, so at least GEMM and probably AXPY would need to be (re)written to achieve some speedup. (Not sure if your 10x figure comes from a native build for such a fairly outdated CPU - definitely do not get your hopes too high by comparing to current AVX2/AVX512 CPUs.)

As for library size, there is nothing obviously unnecessary compiled in by default (like profiling code or a test suite), so the most likely means of size reduction would be dropping support for individual precisions - if, for instance, it were safe to assume that no web code would want to process bfloat16 values (BUILD_BFLOAT16=0) or double-precision complex numbers (BUILD_COMPLEX16=0).

@martin-frbg
Collaborator

(That is unless you are really only interested in BLAS and not LAPACK - building with NO_LAPACK=1 would certainly make a big difference in library size)
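Putting these size-reduction flags together with the build command quoted at the top of the issue, a trimmed-down build might look like the following. This is only a sketch: the BUILD_BFLOAT16, BUILD_COMPLEX16, and NO_LAPACK options are standard OpenBLAS Makefile.rule settings, but this particular combination is untested under emscripten.

```shell
# Hypothetical size-trimmed variant of the Pyodide build command:
# drop bfloat16 and double-precision-complex kernels, and all of LAPACK
# (only viable if consumers, e.g. numpy alone, need just BLAS).
make libs shared CC=emcc HOSTCC=gcc TARGET=RISCV64_GENERIC NOFORTRAN=1 \
        NO_LAPACKE=1 USE_THREAD=0 \
        BUILD_BFLOAT16=0 BUILD_COMPLEX16=0 NO_LAPACK=1
```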

@lesteve

lesteve commented Apr 28, 2023

(That is unless you are really only interested in BLAS and not LAPACK - building with NO_LAPACK=1 would certainly make a big difference in library size)

scipy does need LAPACK, but numpy only needs BLAS, I think. Not sure we can leverage this in Pyodide, though, to load only the BLAS part for numpy and BLAS + LAPACK for scipy ...

@martin-frbg
Collaborator

Hmm. I believe some Linux distros take OpenBLAS apart to produce separate BLAS and LAPACK libs (probably mostly for the "alternatives" mechanism). There are a few functions from Reference-LAPACK, like POTRF/GETRF, that we replace with optimized reimplementations, so I'm not sure it is entirely a good idea, but in principle it should be possible to create a separate liblapack that is itself linked against a BLAS-only libopenblas. Or have a BLAS-only libopenblas and a suitably named "full" one. (OTOH I do not know how common it is for code to depend on both numpy and scipy, in which case you'd need the complete library anyway - at least ISTR that having an individual copy of OpenBLAS built into each of numpy and scipy was seen as a serious problem.)

@brada4
Contributor

brada4 commented Apr 28, 2023

BLAS is optional for NumPy, and the generic BLAS-like macros in numpy may generate code that is easier/faster/better to JIT later.

@rth
Author

rth commented Apr 28, 2023

Thank you for your feedback and ideas!

I agree that this is already significant progress, and I'm not asking for official WASM support. It's already great that the performance is better than with the reference BLAS!

there is work underway to have emscriptem understand at least SSE and AVX intrinsics, though not the corresponding assembly instructions themselves, as used in the current NEHALEM and SANDYBRIDGE kernels.

Yes, the state of SIMD for WASM in emscripten is outlined here. So if we wanted to try building with x86 SSE intrinsics but not the corresponding assembly instructions, is there an existing target we could use in OpenBLAS? Though I guess it could also be limited by which specific instructions are supported or not.

@martin-frbg
Collaborator

Unfortunately none that could be used as-is - such a target (i.e. its intrinsics-based BLAS kernels) would need to be created, as most kernels, especially any predating AVX512, are written in assembly (or with inline assembly parts).

@brada4
Contributor

brada4 commented Apr 28, 2023

The assembly parts are interleaved in such a way that all execution units of a real CPU are kept busy.

Emscripten is evolving. If code is written today for the limited set of "supported" SIMD instructions, it may no longer be optimal tomorrow once some useful instruction gains full support.

In the meantime, generic code - perhaps with some loops unrolled to convince the compiler to emit SIMD - will be compiled better and better as WebAssembly support keeps evolving.

Maybe you could try defining the macro WASM_SIMD_COMPAT_SLOW and point out any hits on emulated instructions; with the best of luck it may point to an OpenBLAS pessimality rather than a compiler/JIT shortcoming.

@martin-frbg
Collaborator

@brada4 as I understand this macro, it will only warn about the use of intrinsics that are poorly supported in wasm translation, not about bottlenecks in code translated from C in general.

@brada4
Contributor

brada4 commented Apr 28, 2023

Indeed, it is just a warning in the intrinsics header.

@martin-frbg
Collaborator

Revisiting this and the current implementation status document for SSE/AVX and NEON instructions in wasm, it seems to me that the best path forward is still to use the compiler's autovectorizer on the generic C code by adding only -msimd128 to the compilation flags (if I understand correctly). It would be possible to create an entire wasm kernel tree and configuration file, but with the currently available kernels it would be at most the DAXPY and ROT functions that might benefit from intrinsics-based code. Does the emscripten project or one of its consumers provide a suitable benchmark, and is anybody aware of any exploratory follow-up work since we last discussed this here, a bit over a year ago?
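One minimal sketch of the autovectorizer route (untested; placing the flags inside CC is an assumption) is to append -msimd128 to the compiler in the build command from the top of the issue, optionally with clang's vectorization remarks and the WASM_SIMD_COMPAT_SLOW macro mentioned earlier in the thread:

```shell
# Hypothetical: reuse the build from the issue, but let emcc emit wasm
# SIMD from the generic C kernels. -Rpass=loop-vectorize reports which
# loops were vectorized; -DWASM_SIMD_COMPAT_SLOW warns about slowly
# emulated intrinsics (only relevant where intrinsics headers are used).
make libs shared HOSTCC=gcc TARGET=RISCV64_GENERIC NOFORTRAN=1 \
        NO_LAPACKE=1 USE_THREAD=0 \
        CC="emcc -msimd128 -Rpass=loop-vectorize -DWASM_SIMD_COMPAT_SLOW"
```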
