This repository provides labeling functions for seven common entity matching datasets and helper functions to evaluate different labeling models (truth inference methods).
- Read the candidate set of one dataset, e.g. `fodors_zagats`:

```python
import os
import pandas as pd

dataset = "fodors_zagats"
cand = pd.read_csv(os.path.join("data", dataset, "cache_cand.csv"))
```
- Apply the labeling functions (LFs) to the candidate set:

```python
from LFs.fodors_zagats import LF_dict

preds = apply_parallel(cand, LF_dict)
```
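For intuition, a labeling function is a small heuristic that votes on whether a candidate pair is a match. The sketch below is purely illustrative — the actual LFs in `LFs.fodors_zagats` and their column names differ; `phone_l`/`phone_r` and the vote convention (1 = match, 0 = non-match, -1 = abstain) are assumptions here:

```python
def lf_same_phone(row):
    """Hypothetical LF: vote 'match' when both records list the same
    phone number, otherwise abstain. Column names are assumed."""
    if row["phone_l"] == row["phone_r"]:
        return 1   # match
    return -1      # abstain: differing phones alone are weak evidence
```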
- Infer ground-truth labels with a labeling model (truth inference method):

```python
y_pred = majority_vote(preds.values)
```

You can replace `majority_vote` with the model you want to evaluate.
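As a reference point, majority vote can be sketched as follows. This is a minimal illustration, not the repo's `majority_vote`, which may break ties or handle abstains differently; the vote encoding (1 = match, 0 = non-match, -1 = abstain) is assumed:

```python
import numpy as np

def majority_vote_sketch(votes):
    """Minimal majority vote over an (n_pairs, n_lfs) vote matrix.
    Abstains (-1) are ignored; ties and all-abstain rows default to 0."""
    votes = np.asarray(votes)
    pos = (votes == 1).sum(axis=1)  # match votes per pair
    neg = (votes == 0).sum(axis=1)  # non-match votes per pair
    return (pos > neg).astype(int)
```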
- Get the ground-truth labels:

```python
matches = pd.read_csv(os.path.join("data", dataset, "matches.csv"))
cand_pairs = list(map(tuple, cand[["id_l", "id_r"]].values))
y_gt = get_gt(cand_pairs, matches)
```
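Conceptually, `get_gt` labels each candidate pair by membership in the matches table. A simplified sketch (the repo's helper may differ; the `id_l`/`id_r` columns in `matches.csv` are assumed to mirror the candidate set):

```python
import pandas as pd

def get_gt_sketch(cand_pairs, matches):
    """Label a candidate pair 1 if it appears in the matches table,
    else 0. Assumes matches has id_l and id_r columns."""
    match_set = set(map(tuple, matches[["id_l", "id_r"]].values))
    return [1 if pair in match_set else 0 for pair in cand_pairs]
```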
- Evaluate the inferred labels:

```python
scores = eval_score(y_pred, y_gt)
print(scores)
```
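For entity matching, evaluation typically reports precision, recall, and F1 on the positive (match) class. A sketch of such scoring (the repo's `eval_score` may compute additional or different metrics):

```python
def eval_score_sketch(y_pred, y_gt):
    """Precision, recall, and F1 on the match class (label 1)."""
    tp = sum(1 for p, g in zip(y_pred, y_gt) if p == 1 and g == 1)
    fp = sum(1 for p, g in zip(y_pred, y_gt) if p == 1 and g == 0)
    fn = sum(1 for p, g in zip(y_pred, y_gt) if p == 0 and g == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```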
The datasets `dblp_acm`, `dblp_scholar`, `amazon_googleproducts`, and `abt_buy` are from the DB Group Leipzig. The `monitor` and `camera` datasets are from the Alaska Benchmark.