Release candidate v1.0.0 #94

RUrlus · 2024-01-26T19:03:26Z

Changes

The primary changes are:

Redesigned Python API
New build system
Performance improvements

Python API

This PR introduces a number of significant changes.

awesome_cossim_topn has been deprecated and is superseded by sp_matmul_topn

The reasoning is that the function has little to do with cosine similarity, other than cosine similarity being a common use for it.
With the lower_bound (now threshold) disabled by default and support for integers added, the function is now capable of generic sparse matrix multiplication and top-n selection.
Additionally, sp_matmul_topn returns the top-n values per row in the order they would have under normal multiplication, i.e. sp_matmul_topn(A, B, B.shape[1]) == sp_matmul(A, B) == A.dot(B).
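
The ordering guarantee can be illustrated with a small pure-Python reference. This is only a sketch: `matmul_topn_reference` and the list-of-dicts row representation are hypothetical and chosen for readability; they are not the library's CSR-based implementation.

```python
def matmul_topn_reference(A, B, top_n):
    """Multiply sparse A by sparse B, keeping the top_n largest values per row.

    A and B are lists of {column: value} dicts, one dict per row
    (an illustrative stand-in for CSR storage).
    """
    result = []
    for a_row in A:
        acc = {}
        for k, a_val in a_row.items():
            for j, b_val in B[k].items():
                acc[j] = acc.get(j, 0) + a_val * b_val
        # Select the top_n largest values, then restore column order so the
        # output matches what a plain multiplication would produce.
        keep = sorted(acc, key=lambda j: acc[j], reverse=True)[:top_n]
        result.append({j: acc[j] for j in sorted(keep)})
    return result


A = [{0: 1, 2: 1}, {1: 3}]
B = [{0: 4, 1: 1}, {1: 2, 2: 5}, {0: 1, 2: 3}]
n_cols = 3

full = matmul_topn_reference(A, B, n_cols)  # top_n == number of columns
top1 = matmul_topn_reference(A, B, 1)
```

With `top_n` equal to the number of columns nothing is dropped, so `full` is the plain product; with `top_n=1` only the largest entry of each row survives.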

Backwards compatibility

Backwards compatibility has been (largely) retained for the deprecated awesome_cossim_topn.
Migrating to sp_matmul_topn is a matter of setting non-default values for threshold and sort.
Only the return_best_ntop option has been removed, as this can easily be determined from the resulting matrix and setting threshold to limit memory pressure is no longer needed.

Build system

Nanobind (C++) bindings
CMake + Scikit-build-core build system

The Cython bindings have been a major source of pain judging by the issues; this is resolved by switching to Nanobind.
Nanobind also opens the door to future GPU support (if desired) as it uses the DLPack protocol.
CMake is becoming the de facto standard for C++ and Scikit-build-core is actively being developed by a core contributor of pybind11.

Bundled OpenMP for parallelisation

Parallelisation is now handled by OpenMP, which significantly reduces the maintenance burden and resolves buffer overruns hiding in the previous threading implementation.
For ease of use we ship the OpenMP binary in the wheels. Note that this is not without risk, but given the size of the user base it is probably OK for now.
People who encounter issues can recompile from source or download the OpenMP-free wheels on the release page.

Performance improvements

Reduced memory complexity

The biggest performance improvement is a significantly reduced memory footprint due to using a Max-Heap of constant size top_n.
The previous approach (for a given row) collected all values above the threshold and then selected the top-n values.
For relatively dense/large matrices or a low threshold, this could result in the collection of many values and significant memory pressure.
Benchmarking shows that the Max-Heap is also faster for top-n values up to at least 100.
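
A minimal sketch of bounded top-n selection, assuming a single dense row of scores for simplicity. `top_n_bounded` is a hypothetical name, and the sketch uses Python's `heapq` (a min-heap whose root is the smallest retained value) rather than the C++ MaxHeap from this PR; the point it shows is only that memory stays at `top_n` entries regardless of how many values exceed the threshold.

```python
import heapq


def top_n_bounded(values, top_n, threshold=None):
    """Return the top_n largest (value, column) pairs using O(top_n) memory."""
    heap = []  # min-heap: heap[0] is the smallest value currently retained
    for col, val in enumerate(values):
        if threshold is not None and val <= threshold:
            continue  # only values above the threshold are candidates
        if len(heap) < top_n:
            heapq.heappush(heap, (val, col))
        elif val > heap[0][0]:
            # The new value beats the weakest retained one: swap it in.
            heapq.heapreplace(heap, (val, col))
    return sorted(heap, reverse=True)


row = [0.1, 0.9, 0.3, 0.7, 0.05, 0.8]
best = top_n_bounded(row, top_n=3)
above = top_n_bounded(row, top_n=3, threshold=0.75)
```

The earlier collect-then-select approach would hold every candidate above the threshold before picking the top-n, which is where the memory pressure came from.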

Additionally, when threshold=None we pre-compute the number of non-zero values (given top-n) and allocate the right size.
This incurs a performance penalty, but users who favour runtime over memory pressure can avoid this overhead by setting the threshold to np.finfo(A.data.dtype).min, or np.iinfo(A.data.dtype).min for integers.
For the single-threaded path we added the density parameter, which is used to determine the pre-allocation size when threshold is not None: n_alloc = ceil(n_rows * top_n * density).
This allows users to reduce the pre-allocated memory when they have a good expectation of the result density; being wrong incurs a copy penalty if the vectors have to resize.
The multithreaded implementation allocates the right amount before copying the results from the thread blocks.
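
The two allocation strategies can be sketched as plain arithmetic. Both helper names are hypothetical, the default `density=1.0` is an assumption, and `exact_nnz` reflects one reading of "pre-compute the number of non-zero values (given top-n)": each result row contributes at most `top_n` entries.

```python
import math


def n_alloc(n_rows, top_n, density=1.0):
    """Pre-allocation size from the formula in the description."""
    return math.ceil(n_rows * top_n * density)


def exact_nnz(row_nnz, top_n):
    """Exact result size when each row keeps at most top_n of its non-zeros.

    row_nnz holds the number of non-zero values per result row
    (assumed to be known from a counting pass).
    """
    return sum(min(nnz, top_n) for nnz in row_nnz)


worst_case = n_alloc(100_000, 10)            # every row filled with 10 values
expected = n_alloc(100_000, 10, density=0.25)
exact = exact_nnz([3, 25, 0, 10], top_n=10)  # rows contribute 3, 10, 0 and 10
```

If the real result turns out denser than `density` suggests, the buffers must grow and be copied, which is the penalty mentioned above.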

Benchmarking

The new candidate version has been benchmarked against the previous implementation over word-based TF-IDF matrices of American company names from the EDGAR company database.
richbench was used to measure the performance using the matrix product (100_000, 193190) x (193190, 100_000) repeated 30 times.
See bench/ for detailed results and the code.

General findings are that:

v1.0 generally outperforms v0.3.6
v0.3.6 is generally faster when top_n = 1000
performance difference is largest for multithreaded implementation
performance converges as the threshold increases

The biggest performance difference is observed for threshold=0.0, top_n=10 and n_threads=8, where v1.0 (0.501s) is on average 2.4 times faster than v0.3.6 (1.215s).
v1.0.0 is on average 1.195 times faster (median: 1.06) over all the benchmark settings.

It appears that there is a buffer overrun in the v0.3.6 multithreaded implementation that is masked by the large over-allocation for C.
Additionally, that implementation appears to be significantly slower than the OpenMP implementation used in v1.0.

Memory profiling is still an open item as memory_profiler does not correctly track the memory allocated by v1.0 and memray requires a symbolized debug build of Python.

Release notes

API

ENH: Add support for 32 and 64bit integers
ENH: Add sp_matmul parallelised implementation of CSR matrix multiplication
BLD: Add support for CPython 3.12, closes #92 (Add wheels for python 3.12)
API: awesome_cossim_topn is superseded by sp_matmul_topn.
API: awesome_cossim_topn has been deprecated and will be removed in a future version.
API: ntop parameter has been renamed to top_n
API: lower_bound parameter has been renamed to threshold
API: use_threads and n_jobs parameters have been combined into n_threads
API: return_best_ntop parameter has been removed
API: test_nnz_max parameter has been removed
API: default parameter value for threshold changed from 0.0 to None (disabled)
API: default parameter value for sort changed to False
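
The renames above can be summarised as a small keyword translation. This is a hedged sketch, not part of the package: `migrate_kwargs` is a hypothetical helper, the old defaults shown (`lower_bound=0.0`, `use_threads=False`, `n_jobs=1`) are assumptions, as is treating `n_threads=None` as the single-threaded setting. `sort=True` follows from the note that migrating requires non-default values for threshold and sort.

```python
def migrate_kwargs(ntop, lower_bound=0.0, use_threads=False, n_jobs=1):
    """Translate awesome_cossim_topn keyword arguments to sp_matmul_topn ones.

    return_best_ntop and test_nnz_max have no counterpart and are dropped.
    """
    return {
        "top_n": ntop,                # ntop -> top_n
        "threshold": lower_bound,     # lower_bound -> threshold (new default: None)
        "sort": True,                 # new default is False; assumed old behaviour
        "n_threads": n_jobs if use_threads else None,  # use_threads + n_jobs
    }


old_default_call = migrate_kwargs(10)
threaded_call = migrate_kwargs(5, lower_bound=0.8, use_threads=True, n_jobs=4)
```

Under these assumptions the resulting dict could be splatted into a call such as `sp_matmul_topn(A, B, **migrate_kwargs(10))`.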

Internal

FIX: [C++] Remove unneeded memory allocation that masked a hidden buffer-overrun in multithreaded implementation
BLD: Switch to pyproject.toml based setup (scikit-build-core)
BLD: [C++] Switch to Nanobind bindings
CHG: [C++] Switch to OpenMP for multithreading
ENH: [C++] Use MaxHeap to collect top-n results over vector of candidates

RUrlus · 2024-01-26T19:04:04Z

Hi @ymwdalex, @stephanecollot,

As discussed in #92 I've worked on the package, which turned into a fairly complete refactor.
I tagged it as v1.0.0rc0 as there are a number of API changes.

Keen to hear your thoughts!

mbaak

I've already reviewed and tested the python and c++ changes in Ralph's fork.

RUrlus · 2024-01-31T11:15:39Z

@ymwdalex @stephanecollot Let us know if you are interested in reviewing but haven't had the time yet.

Otherwise we'll take over maintenance entirely and merge it in. I am off at the end of the week and keen to get this merged before then.

ymwdalex · 2024-01-31T14:11:42Z

@RUrlus thanks for the work! Sorry I have no time to review. I am fine for you and the ING team to take over the maintenance.

RUrlus added 30 commits December 23, 2023 17:56

PKG: Switch to pyrpoject based setup with scikit-core

4f5a935

MAINT: Switch to python and extension directory layouts

0586413

CICD: Update actions to new setup

ed3975c

BREAK: Rename and refactor awesome_cossim_topn to (what it does) sparse_dot_topn

321feb7

STY: [C++] Add style and formatting configs

209e9d0

BLD: [C++] Set up basis for nanobind bindings

6e8f63c

ENH: [C++] Add bindings for sparse_dot_topn

16aa4cd

MAINT: [C++] Rename base func to sp_matmul_topn

db7ba4f

MAINT: Rename base func to sp_matmul_topn

613eb25

BLD: [C++] Add flags to default build

607ad65

CHG: Support disabling the threshold properly

285d324

TST: Add test fixtures

553d896

BLD: [C++] Move CMake modules into subdirectory

d90e045

BLD: Specify directories to include in the sdist

aad472f

ENH: Add ability for users to set the index dtype

820ac34

FIX: Catch case where shapes and storage are transpose compatible

743d7ec

TST: Add tests for sp_matmul_topn

3dc8470

CHG: redirect awsome_comssim_topn to sp_matmul_topn with a deprecation warning

47e963e

ENH: Guard against a too large top_n value

84e4d61

ENH: [C++] Add Min-Max Heap to retain top n scores

8e56c42

CHG: [C++] Switch to more efficient MaxHeap over keeping all candidates over the threshold

4efab54

TST: Maintain insertion order when taking the top n values

ce3371f

BLD: [C++] Search for Homebrew OpenMP on ARM MacOS

52d5e5c

ENH: [C++] Add OpenMP multithreaded implementation

56aeed6

ENH: Add multithreaded implementation

712f1d2

TST: Add tests for threaded variant

254cb9d

CHG: [C++] Allocate elements of C with known size and return

8006bec

ENH: [C++] Enable users to specify the expected density and allocated accordingly

dee6e5e

API: Enable users to specify the expected density

8ca36e3

TST: Add tests for density paramter

e068468

RUrlus added 18 commits January 22, 2024 14:47

CICD: Silence warnings about testing ARM wheels on Intel hardware

cdd0e39

CICD: Add explicit repair wheel commands

f098847

CICD: Vendor OpenMP in the wheels

d2480c7

CICD: Do not set deployment target manually

6eed2b1

BLD: Set default deployment target

8cd8fc8

CICD: Test against all Python versions for PRs

8c3ab36

CICD: Combine non-vendored wheels into single directory

4628e1f

DOC: Move detailed intallation information to a seperate file

07dfeb4

BLD: [C++] Use SIMD detection routines from the Point Cloud Library (PCL)

875e0ed

CHG: [C++] Move out general functionality to common

207942a

FIX: [C++] Correct recounting of columns for row sums

26b4bb1

ENH: [C++] Add sp_matmul and parallel variant sp_matmul

f73db93

ENH: Add sp_matmul and pass through when topn is equal to number of columns

ee58547

DOC: Add benchmark code and instructions

a41657c

BLD: [C++] Fix SIMD checks for Apple Intel

e8eb183

DOC: Update documentation and benchmark results

9b36167

TST: Add tests for sp_matmul

764489a

TST: Add pytest config

0024322

RUrlus requested review from stephanecollot and ymwdalex January 26, 2024 19:03

RUrlus requested a review from sbrugman January 26, 2024 19:04

mbaak approved these changes Jan 26, 2024

sbrugman approved these changes Jan 30, 2024

RUrlus added 2 commits January 31, 2024 15:20

REL: Release v1.0.0

7af66b1

CICD: Switch to manual install of CIBuildwheels 2.16.5

3a9279f

RUrlus merged commit 6fe35c2 into ing-bank:master Jan 31, 2024
32 checks passed

RUrlus deleted the refactor branch January 31, 2024 16:40
