Wesbem: Weak Supervision Benchmark for Entity Matching

This repository provides labeling functions for seven common entity matching datasets and helper functions to evaluate different labeling models (truth inference methods).
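For intuition, each labeling function is a small heuristic that votes on whether a candidate pair is a match. Below is a minimal sketch of what such a function might look like, assuming (hypothetically) that an LF receives one row of the candidate table with attribute columns suffixed _l/_r (e.g., name_l/name_r) and votes 1 (match), 0 (non-match), or -1 (abstain); the exact signature and label convention used by the LFs/ modules in this repository may differ.

def jaccard(a, b):
    """Token-level Jaccard similarity between two strings."""
    sa, sb = set(str(a).lower().split()), set(str(b).lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def lf_name_jaccard(row):
    """Hypothetical LF: vote on a candidate pair based on name similarity."""
    sim = jaccard(row["name_l"], row["name_r"])  # name_l/name_r are assumed column names
    if sim > 0.8:
        return 1   # confident match
    if sim < 0.2:
        return 0   # confident non-match
    return -1      # abstain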

How to use

  1. Read the candidate set of a dataset, e.g., fodors_zagats.
import os
import pandas as pd
dataset = "fodors_zagats"
cand = pd.read_csv(os.path.join("data", dataset, "cache_cand.csv"))
  2. Apply the labeling functions (LFs) to the candidate set.
from LFs.fodors_zagats import LF_dict
preds = apply_parallel(cand, LF_dict)  # apply_parallel is a helper provided by this repository
  3. Infer labels with a labeling model (truth inference method).
y_pred = majority_vote(preds.values)

You can replace majority_vote with the model you want to evaluate.
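For reference, a labeling model takes the matrix of LF votes (one row per candidate pair, one column per LF) and outputs one label per pair. The sketch below shows a minimal majority vote under the assumed convention that 1 = match, 0 = non-match, and -1 = abstain; it illustrates the interface, not this repository's actual majority_vote implementation.

import numpy as np

def majority_vote_sketch(votes):
    """Minimal majority vote over an (n_pairs, n_lfs) vote matrix.

    Assumes 1 = match, 0 = non-match, -1 = abstain; ties and
    all-abstain rows default to non-match.
    """
    pos = (votes == 1).sum(axis=1)  # match votes per pair
    neg = (votes == 0).sum(axis=1)  # non-match votes per pair
    return (pos > neg).astype(int)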

  4. Get the ground-truth labels.
matches = pd.read_csv(os.path.join("data", dataset, "matches.csv"))
cand_pairs = list(map(tuple, cand[["id_l", "id_r"]].values))
y_gt = get_gt(cand_pairs, matches)
  5. Evaluate the inferred labels.
scores = eval_score(y_pred, y_gt)
print(scores)
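Putting the five steps together, a full evaluation run might look like the sketch below. Note that the import path for the helpers (apply_parallel, majority_vote, get_gt, eval_score) is an assumption; adjust it to wherever those functions live in this repository.

import os
import pandas as pd

from LFs.fodors_zagats import LF_dict
from utils import apply_parallel, majority_vote, get_gt, eval_score  # assumed module path

dataset = "fodors_zagats"

# 1. Read the candidate set.
cand = pd.read_csv(os.path.join("data", dataset, "cache_cand.csv"))

# 2. Apply the labeling functions; one column of votes per LF.
preds = apply_parallel(cand, LF_dict)

# 3. Infer labels with a labeling model.
y_pred = majority_vote(preds.values)

# 4. Build ground-truth labels from the known matches.
matches = pd.read_csv(os.path.join("data", dataset, "matches.csv"))
cand_pairs = list(map(tuple, cand[["id_l", "id_r"]].values))
y_gt = get_gt(cand_pairs, matches)

# 5. Score the inferred labels against the ground truth.
scores = eval_score(y_pred, y_gt)
print(scores)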

Acknowledgment

The datasets dblp_acm, dblp_scholar, amazon_googleproducts, and abt_buy are from DB Group Leipzig. The datasets monitor and camera are from the Alaska Benchmark.
