Move MinHashLSH logic to Rust, we're back #2307

yonromai · 2026-01-09T02:00:53Z

Rebase of #2191

Refs:

ravwojdyla

lgtm, thank you!! We may change some inputs/outputs, but we can do that in a follow-up PR 🙇

rjpower

Looks great! Just one comment about the bench test location.

(Alternatively could change the benchmark code to copy-paste some of the logic, so long as we don't have the Marin -> Dupekit dependency)

rjpower · 2026-01-09T20:35:56Z

lib/dupekit/tests/bench/test_minhash.py

+try:
+    # Use the internal _minhash_lsh function to benchmark the Datasketch logic directly
+    # without the Zephyr Dataset API wrapper overhead or type checks.
+    from marin.processing.classification.deduplication.minhash_lsh import _minhash_lsh


This feels a little weird; maybe this should be in marin instead? I don't like the inverted dependency.

yonromai · 2026-01-12T16:42:52Z

> Looks great! Just one comment about the bench test location.
>
> (Alternatively could change the benchmark code to copy-paste some of the logic, so long as we don't have the Marin -> Dupekit dependency)

This makes sense, I agree! I looked into the two ways you suggested:

Copy the code from Marin to Dupekit: the Python code has a deep dependency chain, since it depends on marin.processing.classification.deduplication.vendor.datasketch.minhash.
Add the bench code to Marin: this could make sense, but it isn't an existing pattern and it would require bringing some extra packages into Marin (which already has a lot):

benchmark = [
    "pytest-benchmark>=5.2.3",
    "pytest-memray>=1.8.0"
]

I think my favorite option for now is just to delete this benchmark. If we happen to actually need it in the future, I volunteer to bring it back. cc: @ravwojdyla

Note: I'm going to eagerly merge this PR to unblock this. @rjpower, if you think of something else I should do, please LMK and I'll open another PR.
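
For reference, the "copy-paste some of the logic" alternative discussed above could look roughly like the sketch below: a dependency-free MinHash + LSH banding baseline that a Dupekit benchmark could compare against without importing marin. This is only an illustration under stated assumptions. It is not Marin's `_minhash_lsh`, not the vendored datasketch code, and not Dupekit's Rust implementation; every function name, the character-shingle tokenization, and the salted `blake2b` hashing are hypothetical choices made for the example.

```python
import hashlib
from collections import defaultdict


def shingles(text: str, k: int = 5) -> set[str]:
    """Character k-grams of the text (hypothetical tokenization choice)."""
    return {text[i:i + k] for i in range(max(1, len(text) - k + 1))}


def minhash_signature(items: set[str], num_perm: int = 32) -> tuple[int, ...]:
    """One minimum hash per seed; a salted blake2b stands in for a random permutation."""
    sig = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        sig.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                    "little",
                )
                for s in items
            )
        )
    return tuple(sig)


def lsh_buckets(signatures: dict[str, tuple[int, ...]], bands: int = 8) -> dict[tuple, list[str]]:
    """Split each signature into bands; documents sharing a band value share a bucket."""
    buckets: dict[tuple, list[str]] = defaultdict(list)
    for doc_id, sig in signatures.items():
        rows = len(sig) // bands
        for b in range(bands):
            buckets[(b, sig[b * rows:(b + 1) * rows])].append(doc_id)
    return buckets


def candidate_pairs(buckets: dict[tuple, list[str]]) -> set[tuple[str, str]]:
    """All pairs of document ids that collide in at least one bucket."""
    pairs: set[tuple[str, str]] = set()
    for ids in buckets.values():
        for i in range(len(ids)):
            for j in range(i + 1, len(ids)):
                pairs.add(tuple(sorted((ids[i], ids[j]))))
    return pairs


docs = {
    "a": "the quick brown fox jumps over the lazy dog",
    "b": "the quick brown fox jumps over the lazy dog",
    "c": "an entirely different sentence about rust bindings",
}
signatures = {doc_id: minhash_signature(shingles(text)) for doc_id, text in docs.items()}
pairs = candidate_pairs(lsh_buckets(signatures))
```

Because documents `a` and `b` are identical, their signatures match in every band and they always end up as a candidate pair. A baseline like this could be timed with the standard-library `timeit` module or with the `benchmark` fixture from pytest-benchmark, though it would measure this toy code rather than the Datasketch logic the removed benchmark targeted.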

yonromai requested a review from ravwojdyla January 9, 2026 02:00

ravwojdyla approved these changes Jan 9, 2026

yonromai marked this pull request as ready for review January 9, 2026 19:52

yonromai force-pushed the romain/minhashlsh-rs-rebase-wereback branch from 2ea4c14 to 1041288 Compare January 9, 2026 19:52

yonromai requested review from dlwh and rjpower January 9, 2026 19:59

rjpower approved these changes Jan 9, 2026

yonromai added 2 commits January 12, 2026 08:15

Move MinHashLSH logic to Rust

b7e78dd

Remove dupekit benchmark which depends on marin lib

6f6f4c2

yonromai force-pushed the romain/minhashlsh-rs-rebase-wereback branch from 1041288 to 6f6f4c2 Compare January 12, 2026 16:36

yonromai merged commit a114e08 into main Jan 12, 2026
9 checks passed

yonromai deleted the romain/minhashlsh-rs-rebase-wereback branch January 12, 2026 16:43
