A simhash module for Python implemented in C++, with support for large fingerprint sizes such as 128 bits.
pip install pysimhash
Or install from GitHub:
git clone https://github.com/skiloop/simhash
cd simhash
python setup.py install
Requirements:
- boost-python
Example:
import pysimhash
import hashlib
document = "google.com hybridtheory.com youtube.com reddit.com"
tokens = [hashlib.md5(s.encode('utf-8')).hexdigest() for s in document.split(" ")]
s2 = pysimhash.SimHash(128, 16) # f=128, hash_bit=16
s2.build(tokens, base=16)
print(s2.hex())
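For readers unfamiliar with simhash, the following is a minimal pure-Python sketch of the general technique: each token votes on every bit of the fingerprint, and two fingerprints are compared by Hamming distance. It is an illustration only, not the pysimhash implementation; the names `simhash`, `hamming_distance` and `fingerprint_of` are hypothetical, and the fingerprints it prints are not guaranteed to match the output of `s2.hex()`.

```python
import hashlib


def simhash(tokens, f=128):
    """Build an f-bit fingerprint from hex-string tokens (illustrative sketch)."""
    mask = (1 << f) - 1
    weights = [0] * f
    for token in tokens:
        value = int(token, 16) & mask  # interpret the token as a base-16 integer
        for bit in range(f):
            # Each token votes +1 for a set bit and -1 for a clear bit.
            weights[bit] += 1 if (value >> bit) & 1 else -1
    fingerprint = 0
    for bit, weight in enumerate(weights):
        if weight > 0:  # the majority of tokens had this bit set
            fingerprint |= 1 << bit
    return fingerprint


def hamming_distance(a, b):
    """Count the bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def fingerprint_of(document):
    """Hash each whitespace-separated token with MD5, then build the simhash."""
    tokens = [hashlib.md5(s.encode("utf-8")).hexdigest() for s in document.split(" ")]
    return simhash(tokens)


a = fingerprint_of("google.com hybridtheory.com youtube.com reddit.com")
b = fingerprint_of("google.com hybridtheory.com youtube.com github.com")
print(format(a, "032x"))       # 128-bit fingerprint as 32 hex digits
print(hamming_distance(a, b))  # smaller distance means more similar documents
```

A small Hamming distance indicates that two documents share most of their tokens. Comparison workloads repeat this XOR and bit count many times, which is the kind of operation measured in the comparison benchmark.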
With 10,000 fingerprint builds and 100,000 comparisons (using benchmark.py) on the same Linux machine, the results are as follows:
implementation | build time | comparison time |
---|---|---|
pure python | 1.73s | 222.99s |
pysimhash | 0.14s | 49.89s |