Python code used for a research paper - Computer Science
The code is divided into 8 parts:
- Data (loading, cleaning, preparing)
- Model Words (extracting model words from product descriptions)
- Binary Vectors (constructing the binary vectors for each product)
- Min-Hashing (constructing the signature matrix)
- Locality-Sensitive Hashing (identifying candidate duplicate pairs)
- Jaccard Similarity
- Bootstrapping training
- Bootstrapping testing
The code is heavily commented to make every step as clear as possible. All code is in a single file, so there is no need to switch between documents.
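To give a rough idea of how the middle steps of the list fit together (model words, binary vectors, min-hashing, LSH, Jaccard similarity), here is a minimal sketch using only the standard library. It is not the code from the paper: the function names, the model-word regex, the hash family and the example product titles are all assumptions made for illustration, and data cleaning and bootstrapping are left out.

```python
import random
import re
from collections import defaultdict
from itertools import combinations

# Assumed definition of a "model word": a token containing both a letter
# and a digit (e.g. "40inch", "1080p"). The paper may use a different rule.
MODEL_WORD = re.compile(r"\b(?=\w*[a-z])(?=\w*\d)\w+\b")
PRIME = 2_147_483_647  # 2**31 - 1, used as the modulus of the hash functions


def model_words(title):
    """Return the set of model words found in a product title."""
    return set(MODEL_WORD.findall(title.lower()))


def binary_vectors(word_sets):
    """Build one 0/1 vector per product over the sorted vocabulary."""
    vocab = sorted(set().union(*word_sets))
    return vocab, [[1 if w in s else 0 for w in vocab] for s in word_sets]


def minhash_signatures(vectors, n_hashes, seed=0):
    """Signature = per hash function, the minimum hash over rows holding a 1."""
    rng = random.Random(seed)
    params = [(rng.randrange(1, PRIME), rng.randrange(0, PRIME))
              for _ in range(n_hashes)]
    signatures = []
    for vec in vectors:
        rows = [r for r, bit in enumerate(vec) if bit]
        # Products without any model word get PRIME as a sentinel value.
        signatures.append([min(((a * r + b) % PRIME for r in rows), default=PRIME)
                           for a, b in params])
    return signatures


def lsh_candidates(signatures, bands, rows):
    """Pairs whose signatures agree on every row of at least one band."""
    assert bands * rows == len(signatures[0])
    candidates = set()
    for band in range(bands):
        buckets = defaultdict(list)
        for idx, sig in enumerate(signatures):
            buckets[tuple(sig[band * rows:(band + 1) * rows])].append(idx)
        for members in buckets.values():
            candidates.update(combinations(members, 2))
    return candidates


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up product titles, for illustration only.
titles = [
    "Samsung 40inch 1080p LED TV UN40H5003",
    "Samsung UN40H5003 40inch LED 1080p HDTV",
    "LG 55inch 4K OLED TV 55EG9600",
]
words = [model_words(t) for t in titles]
vocab, vectors = binary_vectors(words)
signatures = minhash_signatures(vectors, n_hashes=20)
candidates = lsh_candidates(signatures, bands=5, rows=4)
for i, j in sorted(candidates):
    print(i, j, jaccard(words[i], words[j]))
```

In this sketch only the candidate pairs returned by LSH are compared with the Jaccard similarity, which is the point of the LSH step: it avoids comparing every product with every other product. More bands with fewer rows each produce more candidate pairs, fewer bands with more rows produce fewer.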