Question: Why getting a min-hash for a single sample is possible? #181

bdeng3 · 2022-04-06T08:18:13Z

Say we have three documents A, B and C. Each document might contain different words.

According to the documentation of datasketch.MinHash, we can get a min-hash for A with

from datasketch import MinHash

minhash = MinHash(num_perm=128)
minhash.update(A.encode('utf-8'))
vector = minhash.digest()

But don't we need to build a vocabulary consisting of all words from A, B and C before computing the vector?


akuuzii · 2022-04-11T13:41:41Z

I'm also interested in seeing this topic answered.

ekzhu · 2022-06-02T19:20:07Z

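Good question. The idea here is to "cheat" by mapping each token (or "word" in your example) to an integer in the hash space, whose complete vocabulary we already know: all 32-bit integers, from 0 to 2^32 - 1. Because the random permutations are defined over that fixed integer space rather than over the words themselves, each document can be sketched on its own, without ever looking at the others.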
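
To make that concrete, here is a rough standard-library sketch of the idea. It is not datasketch's actual code: the helper names (token_to_int, make_permutations, minhash_signature, estimate_jaccard), the choice of SHA-1 truncated to 4 bytes, and the (a*x + b) mod p "permutations" are all assumptions picked for illustration. The point is only that each signature is computed from one document plus a fixed set of hash parameters, with no shared vocabulary.

```python
import hashlib
import random

MERSENNE_PRIME = (1 << 61) - 1   # modulus for the illustrative hash family
MAX_HASH = (1 << 32) - 1         # keep permuted values in the 32-bit range


def token_to_int(token: str) -> int:
    """Map a token to a 32-bit integer (first 4 bytes of its SHA-1 digest)."""
    digest = hashlib.sha1(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "little")


def make_permutations(num_perm: int, seed: int = 1):
    """Draw (a, b) pairs once; they depend on the seed, not on any document."""
    rng = random.Random(seed)
    return [
        (rng.randrange(1, MERSENNE_PRIME), rng.randrange(0, MERSENNE_PRIME))
        for _ in range(num_perm)
    ]


def minhash_signature(tokens, perms):
    """For each (a, b), keep the smallest permuted hash over the tokens."""
    hashed = [token_to_int(t) for t in tokens]
    return [
        min(((a * h + b) % MERSENNE_PRIME) & MAX_HASH for h in hashed)
        for a, b in perms
    ]


def estimate_jaccard(sig1, sig2):
    """Fraction of positions where the two signatures agree."""
    return sum(x == y for x, y in zip(sig1, sig2)) / len(sig1)


perms = make_permutations(num_perm=128)

doc_a = "the quick brown fox jumps over the lazy dog".split()
doc_b = "the quick brown fox sleeps under the lazy dog".split()

# Each signature is built from a single document; doc_b is never consulted
# while sketching doc_a, and vice versa.
sig_a = minhash_signature(doc_a, perms)
sig_b = minhash_signature(doc_b, perms)

# The exact Jaccard similarity of the two word sets is 6/10 = 0.6;
# the estimate is random and should only be expected to land near that.
print(estimate_jaccard(sig_a, sig_b))
```

The only thing the documents have to share is the list of (a, b) parameters (in datasketch, the same num_perm and seed), which is why a MinHash for A can be computed before B and C even exist.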

ekzhu added the question label Mar 21, 2023
