-
Notifications
You must be signed in to change notification settings - Fork 71
Move MinHashLSH logic to Rust, we're back #2307
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
ravwojdyla
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, thank you!! We may change some inputs/outputs, but we can do that in a follow up PR 🙇
2ea4c14 to
1041288
Compare
rjpower
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Just one comment about the bench test location.
(Alternatively could change the benchmark code to copy-paste some of the logic, so long as we don't have the Marin -> Dupekit dependency)
| try: | ||
| # Use the internal _minhash_lsh function to benchmark the Datasketch logic directly | ||
| # without the Zephyr Dataset API wrapper overhead or type checks. | ||
| from marin.processing.classification.deduplication.minhash_lsh import _minhash_lsh |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This feels a little weird, maybe this should be in marin instead? I don't like the inverted dependency
1041288 to
6f6f4c2
Compare
This makes sense, I agree! I looked into the two ways you suggested:
benchmark = [
"pytest-benchmark>=5.2.3",
"pytest-memray>=1.8.0"
]I think my favorite option for now is just to delete this benchmark. If we happen to actually need it in the future, I volunteer to bring it back. cc: @ravwojdyla Note: I'm going to eager merge this PR, to unblock this. @rjpower If you think of something else I should do please LMK and I'll open another PR. |
Rebase of #2191
Refs: