Improving performance with WebAssembly #4023
And a follow-up question: is there, by any chance, anything OpenBLAS-specific we could do to reduce the size of the produced shared library? Currently it is 1.75 MB compressed, which is non-negligible when loading on a web page.
The current level of emscripten support from #3640 was expected to be the bare minimum to get it working at all; if this already provides a 2x speedup over the reference implementation, I consider that quite impressive. As far as I know, there is work underway to have emscripten understand at least SSE and AVX intrinsics, though not the corresponding assembly instructions themselves, as used in the current NEHALEM and SANDYBRIDGE kernels. About the only part of BLAS where intrinsics are currently used is in the DOT kernels, so at least GEMM and probably AXPY would need to be (re)written to achieve some speedup. (Not sure if your 10x figure comes from a native build for such a fairly outdated cpu - definitely do not get your hopes too high by comparing to current AVX2/AVX512 cpus.) As for library size, there is nothing obviously unnecessary compiled in by default (like profiling code or a testsuite).
(That is, unless you are really only interested in BLAS and not LAPACK - building with NO_LAPACK=1 would certainly make a big difference in library size.)
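For reference, a minimal sketch of such a BLAS-only build, assuming the usual make-based workflow (exact flags depend on your toolchain):

```sh
# BLAS-only build: omits all of LAPACK (and the LAPACKE C interface).
# NO_LAPACK, NO_LAPACKE and USE_THREAD are standard OpenBLAS Makefile options.
make NO_LAPACK=1 NO_LAPACKE=1 USE_THREAD=0
```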
scipy does need LAPACK, but numpy only needs BLAS, I think. Not sure we can leverage this in Pyodide, though, to load only the BLAS part for numpy and BLAS + LAPACK for scipy ...
Hmm. I believe some Linux distros take OpenBLAS apart to produce separate BLAS and LAPACK libs (probably mostly for the "alternatives" mechanism). There are a few functions from Reference-LAPACK like POTRF/GETRF that we replace with optimized reimplementations, so I'm not sure it is entirely a good idea, but in principle it should be possible to create a separate liblapack that is itself linked against a BLAS-only libopenblas. Or have a BLAS-only libopenblas and a suitably named "full" one. A rough sketch of that split is below. (OTOH I do not know how common it is to have code depend on both numpy and scipy, in which case you'd need to use the complete library anyway - at least ISTR that having an individual copy of OpenBLAS built into each of numpy and scipy was seen as a serious problem.)
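A rough sketch of how that split could look, assuming OpenBLAS's NO_LAPACK option and Reference-LAPACK's standard CMake switches (paths and directory names here are illustrative):

```sh
# 1. Build and install a BLAS-only libopenblas (no LAPACK compiled in).
make NO_LAPACK=1
make NO_LAPACK=1 PREFIX=/opt/blas-only install

# 2. Build Reference-LAPACK as a separate liblapack linked against it.
#    USE_OPTIMIZED_BLAS and BLAS_LIBRARIES are standard Reference-LAPACK
#    CMake options; the paths are illustrative.
cmake -S lapack -B build \
      -DUSE_OPTIMIZED_BLAS=ON \
      -DBLAS_LIBRARIES=/opt/blas-only/lib/libopenblas.a
cmake --build build
```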
BLAS is optional for NumPy, and the generic BLAS-like macros in numpy may generate code that is easier/faster/better to JIT later.
Thank you for your feedback and ideas! I agree that this is already significant progress, and I'm not asking for official WASM support. It's already great that the performance is better than with reference BLAS!
Yes, the state of SIMD for WASM in emscripten is outlined here. So if we wanted to try to build with x86 SSE intrinsics but not the corresponding assembly instructions, is there an existing target we could use in OpenBLAS? Though I guess it could also be limited by which specific instructions are supported or not.
Unfortunately none that could be used as-is - such a target (i.e. its intrinsics-based BLAS kernels) would need to be created, as most kernels, and especially any predating AVX512, are written in assembly (or with inline assembly parts).
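To illustrate what such an intrinsics-based kernel might look like (a minimal sketch, not an actual OpenBLAS kernel - emscripten can lower SSE intrinsics like these to WASM SIMD when built with -msse -msimd128):

```c
#include <xmmintrin.h>  /* SSE intrinsics; emscripten maps these to WASM SIMD */

/* Minimal sdot sketch using 128-bit SSE intrinsics.
 * Assumes n is a multiple of 4 for brevity; a real kernel
 * would handle the remainder and use deeper unrolling. */
float sdot_sse(int n, const float *x, const float *y) {
    __m128 acc = _mm_setzero_ps();
    for (int i = 0; i < n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);
        __m128 vy = _mm_loadu_ps(y + i);
        acc = _mm_add_ps(acc, _mm_mul_ps(vx, vy));
    }
    float tmp[4];
    _mm_storeu_ps(tmp, acc);
    return tmp[0] + tmp[1] + tmp[2] + tmp[3];
}
```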
The assembly parts are interleaved in a way that keeps all execution units of a real CPU busy. Emscripten is evolving: if code is written today for the limited set of "supported" SIMD instructions, it may no longer be optimal tomorrow once some useful instruction gets proper support. In the meantime, generic code - sometimes with loops unrolled to convince the compiler to vectorize (see the sketch below) - will be compiled better and better as WebAssembly support evolves. Maybe you can set the Emscripten environment variable that warns about emulated instructions and point out any hits; with luck that may reveal an OpenBLAS pessimality rather than a compiler/JIT shortcoming.
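As an example of the "unroll to help the compiler" idea (a generic C sketch, not OpenBLAS code - with -O3 -msimd128, emscripten's autovectorizer can turn a loop like this into WASM SIMD):

```c
/* Generic daxpy-style loop, manually unrolled by four to expose
 * independent operations to the autovectorizer.
 * Assumes n is a multiple of 4 for brevity; a real kernel
 * would handle the tail separately. */
void daxpy_unrolled(int n, double alpha, const double *x, double *y) {
    for (int i = 0; i < n; i += 4) {
        y[i]     += alpha * x[i];
        y[i + 1] += alpha * x[i + 1];
        y[i + 2] += alpha * x[i + 2];
        y[i + 3] += alpha * x[i + 3];
    }
}
```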
@brada4 as I understand this environment variable, it will only warn about the use of intrinsics that are poorly supported in the wasm translation, not about bottlenecks in code translated from C in general.
Indeed, it is just a warning in the intrinsics header.
Revisiting this and the current implementation status document for SSE/AVX and NEON instructions in wasm, it seems to me that the best path forward is still to use the compiler's autovectorizer on the generic C code by adding only the -msimd128 flag.
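A minimal sketch of what that tweak might look like (assuming the standard Emscripten toolchain; the file name is illustrative and this is not a tested OpenBLAS configuration):

```sh
# -msimd128 enables WASM SIMD and lets clang's autovectorizer emit it;
# everything else stays generic C.
emcc -O3 -msimd128 -c dgemm_kernel.c -o dgemm_kernel.o
```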
Thanks to the discussions and fixes in #3640 and follow-up work by @lesteve and @ogrisel, we now have a build of OpenBLAS with emscripten for WebAssembly in Pyodide. It works quite well when used via scipy.
I recently ran some benchmarks for square matrix multiplications (DGEMM) to get an idea of the performance; they can be found here. The good news is that the scipy build with OpenBLAS is around 2-3x faster for DGEMM than with the reference BLAS. The less good news is that it's still around 10x slower than almost the same OpenBLAS version built natively for a modern x86-64 CPU (single-threaded).
For now, the runtime is constrained to a single thread and no SIMD. (Though we should investigate whether it would be possible to optionally build with SIMD and use some browser feature detection.)
I was wondering if there is anything else we could try to improve the performance of OpenBLAS for the WebAssembly platform?
It's currently built with Emscripten using the following options:
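(The exact option list did not survive here; purely for illustration, a generic single-threaded emscripten build of OpenBLAS might look something like the following - the TARGET and flag values are assumptions, not the actual Pyodide recipe.)

```sh
# Illustrative only: a plausible generic-C, single-threaded build.
# CC/HOSTCC, TARGET, NOFORTRAN and USE_THREAD are standard OpenBLAS
# Makefile options; the specific values here are assumptions.
make CC=emcc HOSTCC=gcc TARGET=GENERIC NOFORTRAN=1 USE_THREAD=0 libs
```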
Thank you!