@yonromai
Contributor

@yonromai yonromai commented Jan 9, 2026

@yonromai yonromai requested a review from ravwojdyla January 9, 2026 02:00
Contributor

@ravwojdyla ravwojdyla left a comment

lgtm, thank you!! We may change some inputs/outputs, but we can do that in a follow-up PR 🙇

@yonromai yonromai marked this pull request as ready for review January 9, 2026 19:52
@yonromai yonromai force-pushed the romain/minhashlsh-rs-rebase-wereback branch from 2ea4c14 to 1041288 Compare January 9, 2026 19:52
@yonromai yonromai requested review from dlwh and rjpower January 9, 2026 19:59
Collaborator

@rjpower rjpower left a comment

Looks great! Just one comment about the bench test location.

(Alternatively could change the benchmark code to copy-paste some of the logic, so long as we don't have the Marin -> Dupekit dependency)

try:
    # Use the internal _minhash_lsh function to benchmark the Datasketch logic directly
    # without the Zephyr Dataset API wrapper overhead or type checks.
    from marin.processing.classification.deduplication.minhash_lsh import _minhash_lsh
Collaborator

This feels a little weird. Maybe this should be in Marin instead? I don't like the inverted dependency.

@yonromai yonromai force-pushed the romain/minhashlsh-rs-rebase-wereback branch from 1041288 to 6f6f4c2 Compare January 12, 2026 16:36
@yonromai
Contributor Author

> Looks great! Just one comment about the bench test location.
>
> (Alternatively could change the benchmark code to copy-paste some of the logic, so long as we don't have the Marin -> Dupekit dependency)

That makes sense, and I agree! I looked into the two options you suggested:

  1. Copying code from Marin to Dupekit: the Python code's dependencies run fairly deep, since it depends on marin.processing.classification.deduplication.vendor.datasketch.minhash.
  2. Adding the bench code to Marin: this could make sense, but it's not an existing pattern, and it would require bringing some extra packages into Marin (which already has a lot):
benchmark = [
    "pytest-benchmark>=5.2.3",
    "pytest-memray>=1.8.0"
]

I think my favorite option for now is just to delete this benchmark. If we happen to actually need it in the future, I volunteer to bring it back. cc: @ravwojdyla

Note: I'm going to eagerly merge this PR to unblock things. @rjpower, if you think of something else I should do, please LMK and I'll open another PR.

@yonromai yonromai merged commit a114e08 into main Jan 12, 2026
9 checks passed
@yonromai yonromai deleted the romain/minhashlsh-rs-rebase-wereback branch January 12, 2026 16:43
4 participants