Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Release candidate v1.0.0 #94

Merged
merged 79 commits into from
Jan 31, 2024
Merged

Release candidate v1.0.0 #94

merged 79 commits into from
Jan 31, 2024

Conversation

RUrlus
Copy link
Collaborator

@RUrlus RUrlus commented Jan 26, 2024

Changes

The primary changes are:

  1. Redesigned Python API
  2. New build system
  3. Performance improvements

Python API

This PR introduces a number of significant changes.

  • awesome_cossim_topn has been deprecated and is superseded with sp_matmul_topn

The reasoning here is that the function has little to do with cosine similarity other than being a common use for it.
By disabling of the lower_bound (now threshold) by default and adding support for integers the function is now capable of generic sparse matrix multiplication and top-n selection.
Additionally sp_matmul_topn returns the top-n values per row in the order they would be if you performed normal multiplication, i.e. sp_matmul_topn(A, B, B.shape[1]) == sp_matmul(A, B) == A.dot(B).

Backwards compatibility

Backwards compatibility has been (largely) retained for the deprecated awesome_cossim_topn.
Migrating to sp_matmul_topn is a matter of setting different than default settings for threshold and sort.
Only the return_best_ntop option has been removed as this can be easily determined on the resulting matrix and the need for setting threshold for memory pressure purposes is no longer needed.

Build system

  • Nanobind (C++) bindings
  • CMake + Scikit-build-core build system

The Cython bindings have been a major pain when looking at the issues, this is resolved by switching to Nanobind.
Nanobind also opens the door the future GPU support (if desired) as it uses the DLPACK protocol.
CMake is becoming the de-facto standard for C++ and Scikit-build-core is actively being developed by a core-contributor of Pybind11.

  • Bundled OpenMP for parallelisation

Parallelisation is now handled by OpenMP which significantly reduces the maintenance burden and resolves buffer-overruns hiding in the previous threading implementation.
For ease-of-use we ship the OpenMP binary in the wheels. Note, this is not without risk/issue but given the size of the user-base probably OK for now.
People that encounter issues can recompile from source or download the OpenMP free wheels on the release page.

Performance improvements

Reduced memory complexity

The biggest performance improvement is a significantly reduced memory footprint due to using a Max-Heap of constant size top_n.
The previous approach (for a given row) collected all values above the threshold and than selected the top-n values.
For relatively dense/large matrices or a low threshold, this could result in the collection of many values and significant memory pressure.
Benchmarking shows that the Max-Heap is also faster for top-n values up to at least 100.

Additionally, when threshold=None we pre-compute the number of non-zero values (given top-n) and allocate the right size.
This does incurs a performance penalty but users that favour runtime over memory pressure can avoid this overhead by setting the threshold to np.finfo(A.data.dtype).min or np.iinfor(A.data.dtype) for integers.
For the single threaded path we added the density parameter which is used to determine the pre-allocation size when threshold is not None, n_alloc = ceil(n_rows * top_n * density).
This allows users to reduce the pre-allocated memory when they have a good expectation of the result density, being wrong incurs a copy penalty if they vectors have to resize.
The multithreaded implementation allocates the right amount before copying the results from the thread blocks.

Benchmarking

The new candidate version has been benchmarked to the previous implementation over word-based TF-IDF matrices of American company names from the EDGAR company database.
richbench was used to measure the performance using the matrix product (100_000, 193190) x (193190, 100_000) repeated 30 times.
See bech/ for detailed results and the code.

General findings are that:

  • v1.0 generally outperforms v0.3.6
  • v0.3.6 is generally faster when top_n = 1000
  • performance difference is largest for multithreaded implementation
  • performance converges as the threshold increases

The biggest performance difference is observed for threshold=0.0, top_n=10 and n_threads=8 where 1.0 (0.501s) is on average 2.4 times faster than v0.3.6 (1.215s)
v1.0.0 is on average 1.195 faster (median: 1.06) over all the benchmark settings.

It appears that there is a buffer-overrun hiding in the v0.3.6 multithreaded implementation that is hidden by the large over-allocation for C.
Additionally, the implementation appears to be significantly slower than the OpenMP implementation used in v1.0

Memory profiling is still an open item as memory_profiler does not correctly track the memory allocated by v1.0 and memray requires a symbolized debug build of Python.

Release notes

API

  • ENH: Add support for 32 and 64bit integers

  • ENH: Add sp_matmul parallelised implementation of CSR matrix multiplication

  • BLD: Add support for CPython 3.12, closes Add wheels for python 3.12 #92

  • API: awesome_cossim_topn is superseded with sp_matmul_topn.

  • API: awesome_cossim_topn has been deprecated and will be removed in a future version.

  • API: ntop parameter has been renamed to top_n

  • API: lower_bound parameter has been renamed to threshold

  • API: use_threads and n_jobs parameters have been combined into n_threads

  • API: return_best_ntop parameter has been removed

  • API: test_nnz_max parameter has been removed

  • API: default parameter value for threshold changed from 0.0 to None (disabled)

  • API: default parameter value for sort changed to False

Internal

  • FIX: [C++] Resolve unneeded memory allocation that solved hidden buffer-overrun in multithreaded implementation
  • BLD: Switch to pyproject.toml based setup (scikit-build-core)
  • BLD: [C++] Switch to Nanobind bindings
  • CHG: [C++] Switch to OpenMP for multithreading
  • ENH: [C++] Use MaxHeap to collect top-n results over vector of candidates

@RUrlus
Copy link
Collaborator Author

RUrlus commented Jan 26, 2024

Ha @ymwdalex, @stephanecollot,

As discussed in #92 I've worked on the package which turned into a fairly complete refactor.
I tagged it as v1.0.0rc0 as there a number of API changes.

Keep to hear your thoughts!

@RUrlus RUrlus requested a review from sbrugman January 26, 2024 19:04
Copy link
Collaborator

@mbaak mbaak left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've already reviewed and tested the python and c++ changes in Ralph's fork.

@RUrlus
Copy link
Collaborator Author

RUrlus commented Jan 31, 2024

@ymwdalex @stephanecollot Let us know if you are interested in reviewing but haven't had the time yet.

Otherwise we'll take over maintenance entirely and merge it in. I am off at the end of week and keen to get this merged before then.

@ymwdalex
Copy link
Collaborator

@RUrlus thanks for the work! Sorry I have no time to review. I am fine for you and the ING team to take over the maintenance.

@RUrlus RUrlus merged commit 6fe35c2 into ing-bank:master Jan 31, 2024
32 checks passed
@RUrlus RUrlus deleted the refactor branch January 31, 2024 16:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add wheels for python 3.12
4 participants