
Improving performance with WebAssembly #4023

Open
rth opened this issue Apr 27, 2023 · 12 comments
@rth

rth commented Apr 27, 2023

Thanks to the discussions and fixes in #3640 and follow up work by @lesteve and @ogrisel we now have a build of OpenBLAS with emscripten for WebAssembly in Pyodide. It works quite well when used via scipy.

I recently ran some benchmarks for square matrix multiplications (DGEMM) to get an idea of the performance, which can be found here. The good news is that the scipy build with OpenBLAS is around 2-3x faster for DGEMM than with the reference BLAS. The less good news is that it's still around 10x slower than essentially the same OpenBLAS version built natively for a modern x86-64 CPU (single-threaded).

For now, that runtime is constrained to a single thread and no SIMD. (Though we should investigate whether it could optionally be built with SIMD, combined with some browser feature detection.)

I was wondering whether there is anything else we could try to improve the performance of OpenBLAS on the WebAssembly platform?

It's currently built with Emscripten using the following options:

make libs shared CC=emcc HOSTCC=gcc TARGET=RISCV64_GENERIC NOFORTRAN=1 NO_LAPACKE=1 \
        USE_THREAD=0 -O2

Thank you!

@rth
Author

rth commented Apr 27, 2023

And a follow-up question: is there, by chance, anything OpenBLAS-specific we could do to reduce the size of the produced shared library? Currently, it's 1.75 MB compressed, which is non-negligible when loading on a web page.

@martin-frbg
Collaborator

The current level of emscripten support from #3640 was expected to be the bare minimum to get it working at all; if this already provides a 2x speedup over the reference implementation, I consider that quite impressive.

As far as I know, there is work underway to have emscripten understand at least SSE and AVX intrinsics, though not the corresponding assembly instructions themselves, as used in the current NEHALEM and SANDYBRIDGE kernels.

Currently, about the only part of BLAS where intrinsics are used is the DOT kernels, so at least GEMM and probably AXPY would need to be (re)written to achieve some speedup. (Not sure if your 10x figure comes from a native build for such a fairly outdated CPU - definitely do not get your hopes too high by comparing to current AVX2/AVX512 CPUs.)

As for library size, there is nothing obviously unnecessary compiled in by default (like profiling code or a test suite), so the most likely means of size reduction would be dropping support for individual precisions - if, for instance, it were safe to assume that no web code would want to process bfloat16 values (BUILD_BFLOAT16=0) or double-precision complex numbers (BUILD_COMPLEX16=0).

@martin-frbg
Collaborator

(That is unless you are really only interested in BLAS and not LAPACK - building with NO_LAPACK=1 would certainly make a big difference in library size)
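Putting these size-reduction flags together with the build command quoted at the top of the issue, a trimmed-down build might look like the following. This is only a sketch: the BUILD_BFLOAT16, BUILD_COMPLEX16, and NO_LAPACK options are standard OpenBLAS Makefile.rule settings, but this particular combination is untested under emscripten.

```shell
# Hypothetical size-trimmed variant of the Pyodide build command:
# drop bfloat16 and double-precision-complex kernels, and all of LAPACK
# (only viable if consumers, e.g. numpy alone, need just BLAS).
make libs shared CC=emcc HOSTCC=gcc TARGET=RISCV64_GENERIC NOFORTRAN=1 \
        NO_LAPACKE=1 USE_THREAD=0 \
        BUILD_BFLOAT16=0 BUILD_COMPLEX16=0 NO_LAPACK=1
```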

@lesteve

lesteve commented Apr 28, 2023

(That is unless you are really only interested in BLAS and not LAPACK - building with NO_LAPACK=1 would certainly make a big difference in library size)

scipy does need LAPACK, but numpy only needs BLAS, I think. Not sure we can leverage this in Pyodide, though, to load only the BLAS part for numpy and BLAS + LAPACK for scipy ...

@martin-frbg
Collaborator

Hmm. I believe some Linux distros take OpenBLAS apart to produce separate BLAS and LAPACK libs (probably mostly for the "alternatives" mechanism). There are a few functions from Reference-LAPACK, like POTRF/GETRF, that we replace with optimized reimplementations, so I'm not sure it is entirely a good idea, but in principle it should be possible to create a separate liblapack that is itself linked against a BLAS-only libopenblas. Or have a BLAS-only libopenblas and a suitably named "full" one. (OTOH I do not know how common it is for code to depend on both numpy and scipy, in which case you'd need the complete library anyway - at least ISTR that having an individual copy of OpenBLAS built into each of numpy and scipy was seen as a serious problem.)

@brada4
Contributor

brada4 commented Apr 28, 2023

BLAS is optional for NumPy, and the generic BLAS-like macros in numpy may generate code that is easier/faster/better to JIT later.

@rth
Author

rth commented Apr 28, 2023

Thank you for your feedback and ideas!

I agree that this is already significant progress, and I'm not asking for official WASM support. It's already great that the performance is better than with the reference BLAS!

there is work underway to have emscriptem understand at least SSE and AVX intrinsics, though not the corresponding assembly instructions themselves, as used in the current NEHALEM and SANDYBRIDGE kernels.

Yes, the state of SIMD for WASM in emscripten is outlined here. So if we wanted to try building with x86 SSE intrinsics but not the corresponding assembly instructions, is there an existing target we could use in OpenBLAS? Though I guess it could also be limited by which specific instructions are supported or not.

@martin-frbg
Collaborator

Unfortunately none that could be used as-is - such a target (i.e. its intrinsics-based BLAS kernels) would need to be created, as most kernels, especially any predating AVX512, are written in assembly (or with inline assembly parts).

@brada4
Contributor

brada4 commented Apr 28, 2023

The assembly parts are interleaved in such a way that all execution units of a real CPU are kept busy.

Emscripten is evolving. If code is written today for the limited set of "supported" SIMD instructions, it may no longer be optimal tomorrow once some useful instruction gains full support.

In the meantime, generic code - perhaps with some loops unrolled to convince the compiler to emit SIMD - will be compiled better and better as WebAssembly support keeps evolving.

Maybe you could try defining the macro WASM_SIMD_COMPAT_SLOW and point out any hits on emulated instructions; with the best of luck it may point to an OpenBLAS pessimality rather than a compiler/JIT shortcoming.

@martin-frbg
Collaborator

@brada4 as I understand this macro, it will only warn about the use of intrinsics that are poorly supported in wasm translation, not about bottlenecks in code translated from C in general.

@brada4
Contributor

brada4 commented Apr 28, 2023

Indeed, it is just a warning in the intrinsics header.

@martin-frbg
Collaborator

Revisiting this and the current implementation status document for SSE/AVX and NEON instructions in wasm, it seems to me that the best path forward is still to use the compiler's autovectorizer on the generic C code by adding only -msimd128 to the compilation flags (if I understand correctly). It would be possible to create an entire wasm kernel tree and configuration file, but with the currently available kernels it would be at most the DAXPY and ROT functions that might benefit from intrinsics-based code. Does the emscripten project or one of its consumers provide a suitable benchmark, and is anybody aware of any exploratory follow-up work since we last discussed this here, a bit over a year ago?
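One minimal sketch of the autovectorizer route (untested; placing the flags inside CC is an assumption) is to append -msimd128 to the compiler in the build command from the top of the issue, optionally with clang's vectorization remarks and the WASM_SIMD_COMPAT_SLOW macro mentioned earlier in the thread:

```shell
# Hypothetical: reuse the build from the issue, but let emcc emit wasm
# SIMD from the generic C kernels. -Rpass=loop-vectorize reports which
# loops were vectorized; -DWASM_SIMD_COMPAT_SLOW warns about slowly
# emulated intrinsics (only relevant where intrinsics headers are used).
make libs shared HOSTCC=gcc TARGET=RISCV64_GENERIC NOFORTRAN=1 \
        NO_LAPACKE=1 USE_THREAD=0 \
        CC="emcc -msimd128 -Rpass=loop-vectorize -DWASM_SIMD_COMPAT_SLOW"
```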
