EntityMatcher

EntityMatcher is a Python package that implements several deep entity matching models proposed by our group. The current version contains only the HierMatcher model, proposed in the IJCAI 2020 paper "Hierarchical Matching Network for Heterogeneous Entity Resolution". More models (MPM, Seq2SeqMatcher, etc.) will be available later.

EntityMatcher is built on the framework of DeepMatcher, which is an easily customizable deep entity matching package.

Environment Setting

Python 3.6
scikit-learn 0.22.2
deepmatcher 0.1.1

Datasets

There are ten datasets of three types in the “data/” directory of this project:

Four public homogeneous datasets, which are originally obtained from here.
Walmart-Amazon₁: "data/walmart_amazon"
Amazon-Google: "data/amazon_google"
DBLP-ACM₁: "data/dblp_acm"
DBLP-Scholar₁: "data/dblp_scholar"
Three public dirty datasets, which are originally obtained from here.
Walmart-Amazon₂: "data/dirty_walmart_amazon"
DBLP-ACM₂: "data/dirty_dblp_acm"
DBLP-Scholar₂: "data/dirty_dblp_scholar"
Three heterogeneous datasets, which are derived from Walmart-Amazon₁ using different attribute merging operations (see more details here).
Walmart-Amazon₃: "data/walmart_amazon_3"
Walmart-Amazon₄: "data/walmart_amazon_4"
Walmart-Amazon₅: "data/walmart_amazon_5"

All of the above datasets have already been converted to the input data format of DeepMatcher, so they can be used directly.
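If you want to add a dataset of your own, a quick header check can catch format problems before training. The sketch below assumes DeepMatcher's default column conventions (an "id" column, a "label" column, and attribute columns prefixed with "left_" and "right_"); the sample row and the function names are hypothetical and not part of this project.

```python
import csv
import io

# Column conventions assumed from DeepMatcher's defaults (an assumption,
# not something this project's code is known to enforce).
ID_COLUMN = "id"
LABEL_COLUMN = "label"
LEFT_PREFIX = "left_"
RIGHT_PREFIX = "right_"


def check_deepmatcher_header(header):
    """Return a list of problems found in a CSV header; an empty list means OK."""
    problems = []
    for required in (ID_COLUMN, LABEL_COLUMN):
        if required not in header:
            problems.append("missing column: " + required)
    # Each record pair needs at least one attribute on each side.
    if not any(col.startswith(LEFT_PREFIX) for col in header):
        problems.append("no columns with prefix: " + LEFT_PREFIX)
    if not any(col.startswith(RIGHT_PREFIX) for col in header):
        problems.append("no columns with prefix: " + RIGHT_PREFIX)
    return problems


def read_header(csv_text):
    """Read only the first row (the header) of CSV text."""
    return next(csv.reader(io.StringIO(csv_text)))


# Hypothetical sample, for illustration only.
SAMPLE = (
    "id,label,left_title,left_price,right_title,right_price\n"
    "0,1,ipod nano,99.0,apple ipod nano,95.5\n"
)
print(check_deepmatcher_header(read_header(SAMPLE)))
```

The check deliberately does not require the left and right sides to share the same attribute names, since the heterogeneous datasets may use different schemas on each side.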

Embedding file

Download the fastText model file trained on English Wikipedia from here. Then unzip it and copy the file named "wiki.en.bin" to the “embedding/” directory of this project.
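Before starting a long run, it can help to confirm that the embedding file is where the instructions above put it. This is a minimal sketch; the helper name `find_embedding` is hypothetical, and only the file name "wiki.en.bin" and the "embedding" directory come from this README.

```python
from pathlib import Path

# File name given in the embedding instructions above.
EMBEDDING_FILE_NAME = "wiki.en.bin"


def find_embedding(embedding_dir="embedding"):
    """Return the path to wiki.en.bin inside embedding_dir, or raise if absent."""
    path = Path(embedding_dir) / EMBEDDING_FILE_NAME
    if not path.is_file():
        raise FileNotFoundError(
            "expected {}; download and unzip the fastText model first".format(path)
        )
    return path
```

Calling `find_embedding()` from the project root either returns the path or fails early with a message that names the missing file.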

Quick start

Run an experiment on a specified dataset and model:

    python run.py -m <model_name>  -d <dataset_dir>  -e <embedding_dir>

For example, to run an experiment on Walmart-Amazon with the HierMatcher model, use:

    python run.py -m "HierMatcher" -d "data/walmart_amazon/" -e "embedding"
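To run the same model on every dataset, the command above can be generated in a loop. The sketch below is a hypothetical helper script, not part of this project: it uses only the `-m`, `-d`, and `-e` options shown above and the dataset directories listed in the Datasets section, and it defaults to a dry run that only prints the commands.

```python
import subprocess
import sys

# Dataset directories as listed in the Datasets section.
DATASET_DIRS = [
    "data/walmart_amazon",
    "data/amazon_google",
    "data/dblp_acm",
    "data/dblp_scholar",
    "data/dirty_walmart_amazon",
    "data/dirty_dblp_acm",
    "data/dirty_dblp_scholar",
    "data/walmart_amazon_3",
    "data/walmart_amazon_4",
    "data/walmart_amazon_5",
]


def build_command(dataset_dir, model_name="HierMatcher", embedding_dir="embedding"):
    """Build the argument list for one run.py invocation."""
    # Keep a trailing slash to mirror the example command; whether run.py
    # requires it is not confirmed here.
    dataset_arg = dataset_dir.rstrip("/") + "/"
    return [sys.executable, "run.py", "-m", model_name,
            "-d", dataset_arg, "-e", embedding_dir]


def run_all(dataset_dirs=DATASET_DIRS, dry_run=True):
    """Print each command, or execute it when dry_run is False."""
    for dataset_dir in dataset_dirs:
        command = build_command(dataset_dir)
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops the loop at the first failing experiment.
            subprocess.run(command, check=True)


run_all()  # dry run: prints the ten commands without executing them
```

Call `run_all(dry_run=False)` from the project root to execute the experiments one after another.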

Citation

Please cite our work if you find it useful or use our code in your projects:

Cheng Fu, Xianpei Han, Jiaming He and Le Sun, Hierarchical Matching Network for Heterogeneous Entity Resolution. IJCAI 2020: 3665-3671

The Team

EntityMatcher is developed by the Chinese Information Processing Laboratory (CIP), Institute of Software, Chinese Academy of Sciences.
If you have any problems running the code, please email [email protected].
