Support Bloom filters #1303

Merged on Aug 2, 2023 (50 commits)

nvdbaranec

@nvdbaranec nvdbaranec commented Jul 28, 2023

Adds support for Spark-style bloom filters via the BloomFilter class. The GPU implementation lives in spark-rapids-jni itself, not in cudf.

This version of the PR uses a different style of interface that encapsulates the entire Spark serialized blob of bloom filter data. It will probably render #1269 obsolete.

Added a benchmark for bloom_filter_put. On an A5000, we're getting 140 GB/s write throughput for bloom filter sizes of 512k, 1MB, 2MB, 4MB and 8MB: 12.5 milliseconds for 150 million rows. So it's not lightning fast, but it's serviceable.

Also fixed several assorted benchmark build errors. The cudf push for always providing null counts and specifying stream/mr broke a few of them.
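
To illustrate what "encapsulating the entire Spark serialized blob" could look like, here is a minimal sketch of packing and unpacking such a blob. It assumes the version 1 layout written by Spark's `BloomFilterImpl` (big-endian: an int32 version, an int32 hash count, an int32 word count, then that many int64 words). The function names are hypothetical and are not taken from this PR.

```python
import struct

# Assumed layout (Spark BloomFilterImpl, version 1), all big-endian:
#   int32 version | int32 num_hashes | int32 num_longs | num_longs x int64 words
HEADER = struct.Struct(">iii")


def pack_bloom_filter(num_hashes, words):
    """Serialize a filter into the assumed V1 layout (hypothetical helper)."""
    return HEADER.pack(1, num_hashes, len(words)) + struct.pack(
        f">{len(words)}q", *words
    )


def unpack_bloom_filter(blob):
    """Return (num_hashes, words) after checking the header against the payload."""
    if len(blob) < HEADER.size:
        raise ValueError("buffer too small to hold a bloom filter header")
    version, num_hashes, num_longs = HEADER.unpack_from(blob, 0)
    if version != 1:
        raise ValueError(f"unexpected bloom filter version: {version}")
    if num_hashes <= 0 or num_longs <= 0:
        raise ValueError("hash count and word count must be positive")
    expected_size = HEADER.size + 8 * num_longs
    if len(blob) != expected_size:
        raise ValueError(
            f"buffer is {len(blob)} bytes, header implies {expected_size}"
        )
    words = list(struct.unpack_from(f">{num_longs}q", blob, HEADER.size))
    return num_hashes, words
```

Keeping the header and the bit words in one buffer means a caller only ever hands around a single opaque blob, and the size check catches truncated or mismatched buffers before any bits are read.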

…mur hash instead of the cudf version. Brought over

cpp and java tests.
…nents instead of an instance. Change BloomFilterInterfaces to take a

BaseDeviceMemoryBuffer instead of a DeviceMemoryBuffer. Handle some exception cases. Reordered some function parameter lists for consistency/cleanliness.
…oomFilter class to be more restrictive about bloom filter bit sizes:

must always be a multiple of 64 bits.
… Handles nulls in the C++ code: build will ignore null input values and probe will return

null for any input value.
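
The two constraints in the commit notes above (bit counts restricted to multiples of 64, and the null handling in build and probe) can be sketched on the CPU as follows. This is only an illustration: the class name is hypothetical, the SHA-256 based hashing is a stand-in and not the murmur hash the PR uses, and it assumes the probe rule means a null result for each null input.

```python
import hashlib


class SimpleBloomFilter:
    """CPU-only illustration; not the spark-rapids-jni implementation."""

    def __init__(self, num_bits, num_hashes):
        # Bits are stored in 64-bit words, so require a whole number of words.
        if num_bits <= 0 or num_bits % 64 != 0:
            raise ValueError("bit count must be a positive multiple of 64")
        if num_hashes <= 0:
            raise ValueError("hash count must be positive")
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.words = [0] * (num_bits // 64)

    def _bit_positions(self, value):
        # Stand-in hashing (assumption): two hashes taken from a SHA-256 digest,
        # combined as h1 + i * h2 to derive num_hashes bit positions.
        digest = hashlib.sha256(str(value).encode()).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big")
        return [(h1 + i * h2) % self.num_bits
                for i in range(1, self.num_hashes + 1)]

    def put(self, values):
        """Build step: null (None) inputs are skipped."""
        for value in values:
            if value is None:
                continue
            for pos in self._bit_positions(value):
                self.words[pos // 64] |= 1 << (pos % 64)

    def probe(self, values):
        """Probe step: None for a null input, otherwise True or False."""
        results = []
        for value in values:
            if value is None:
                results.append(None)
                continue
            hit = all((self.words[pos // 64] >> (pos % 64)) & 1
                      for pos in self._bit_positions(value))
            results.append(hit)
        return results
```

A value that was inserted always probes as True; a value that was never inserted usually probes as False but can be a false positive, which is the normal bloom filter trade-off.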
@nvdbaranec nvdbaranec requested a review from jlowe July 28, 2023 22:38
@nvdbaranec nvdbaranec marked this pull request as draft July 28, 2023 22:38
@nvdbaranec nvdbaranec marked this pull request as ready for review July 29, 2023 00:43
…nterface for probing directly from a buffer. Improve error

checking in unpacking code.
jlowe
jlowe previously approved these changes Jul 31, 2023

@jlowe jlowe left a comment

Tested with NVIDIA/spark-rapids#8775 as well as in-progress BloomFilterAggregate code. Minor nit on better comments for CudfAccessor usage.

@jlowe

jlowe commented Jul 31, 2023

build

@nvdbaranec nvdbaranec merged commit d22259a into NVIDIA:branch-23.08 Aug 2, 2023
1 check passed
@jlowe jlowe changed the title from "Rework BloomFilter interface" to "Support Bloom filters" Aug 2, 2023
2 participants