Dataset Geography

This repository walk throughs the approach proposed in our paper: Dataset Geography: Mapping Language Data to Language Users.

Steps

1. Entity recognition and linking

We use mGENRE to perform both entity recogniton and linking for most of the datasets we experimented with. See installation, download and example usage section to get familier with mGENRE. Once you have mGENRE running on your system, you can use it to extract as well as link dataset to wikidata entites. Primarily, you need to install fairseq and download the following files which you can obtain following the mGENRE instructions:

the pre-trained model
the prefix tree (trie) from Wikipedia titles
the dictionary to map the generated strings to Wikidata identifiers
a mention table to restrict the search space to a number of candidates

Then we use entity_le.py (modified version of mGENRE example codes) to extract the entities given the dataset files(format: pickle file containing dataset texts as a list of string). This script is defined to process 200 sentences at once. Provide the correct model file paths before running.

example input dataset format:

import sys
>>> import pickle
>>> datafile="data/sentences/tydiqa/tydiqa-train-english.pickle"
>>> with open(datafile,'rb') as f:
...             sentences =  pickle.load(f)
... 
>>> sentences[:2]
['[START] Is Creole a pidgin of French? [END]', '[START] When was quantum field theory developed? [END]']

command to extract entities for all the files in `data/sentences/tydiqa/` folder:

cd ../code
./mgenre_ner.sh ../data/sentences/tydiqa ../data/entities/tydiqa

example output entity linked dataset format:

>>> import pickle
>>> outfile="data/entities/tydiqa/tydiqa-train-english.pickle"
>>> with open(outfile,'rb') as f:
...     entities=pickle.load(f)
... 
>>> entities[:2]
[[{'id': 'Q33260', 'texts': ['French-based creole languages >> en'], 'scores': tensor([-0.6672]), 'score': tensor(-2.0017)}, {'id': 'Q33831', 'texts': ['Pidgin >> en'], 'scores': tensor([-0.9490]), 'score': tensor(-2.1220)}, {'id': 'Q33289', 'texts': ['Creole language >> en'], 'scores': tensor([-0.9833]), 'score': tensor(-2.4085)}, {'id': 'Q17093549', 'texts': ['Pidgin English >> en'], 'scores': tensor([-1.3735]), 'score': tensor(-3.3644)}, {'id': 'Q150', 'texts': ['French language >> en'], 'scores': tensor([-1.6143]), 'score': tensor(-3.6097)}], [{'id': 'Q54505', 'texts': ['Quantum field theory >> en'], 'scores': tensor([-0.2873]), 'score': tensor(-0.7602)}, {'id': 'Q899444', 'texts': ['De Broglie–Bohm theory >> en'], 'scores': tensor([-0.3818]), 'score': tensor(-1.3227)}, {'id': 'Q3278166', 'texts': ['History of quantum field theory >> en'], 'scores': tensor([-0.7374]), 'score': tensor(-2.0858)}, {'id': 'Q188403', 'texts': ['History of quantum chemistry >> en'], 'scores': tensor([-0.8181]), 'score': tensor(-2.1645)}, {'id': 'Q730675', 'texts': ['Quantitative research >> en'], 'scores': tensor([-1.0298]), 'score': tensor(-2.5224)}]]

2. Dataset Country Mapping

We use plot_geo_tydiqa.py to map dataset entities to geographical entities. Modify input/output file paths and other pointers as you need. We use plotly to create dataset maps. Take a look at the requirements dump for all additional dependencies.

Experiments

Additional detailed experiment pointers including ner-relaxed vs ner-constrain results, region-wise performance measurement will be uploaded.

Citation

If you use Dataset Geography, please cite the "Dataset Geography: Mapping Language Data to Language Users". You can use the following BibTeX entry

@inproceedings{faisal-etal-2022-dataset,
    title = "Dataset Geography: Mapping Language Data to Language Users",
    author = "Faisal, Fahim  and
      Wang, Yinkai  and
      Anastasopoulos, Antonios",
    booktitle = "Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
    month = may,
    year = "2022",
    address = "Dublin, Ireland",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.acl-long.239",
    doi = "10.18653/v1/2022.acl-long.239",
    pages = "3381--3411",
}

We built the entity-linked version of the datasets using mGENRE. Kindly also make sure to cite the original mGENRE paper if you use this specific entity linking approach or the already linked datasets using mGENRE,

@article{de-cao-etal-2022-multilingual,
    title = "Multilingual Autoregressive Entity Linking",
    author = "De Cao, Nicola  and
      Wu, Ledell  and
      Popat, Kashyap  and
      Artetxe, Mikel  and
      Goyal, Naman  and
      Plekhanov, Mikhail  and
      Zettlemoyer, Luke  and
      Cancedda, Nicola  and
      Riedel, Sebastian  and
      Petroni, Fabio",
    journal = "Transactions of the Association for Computational Linguistics",
    volume = "10",
    year = "2022",
    address = "Cambridge, MA",
    publisher = "MIT Press",
    url = "https://aclanthology.org/2022.tacl-1.16",
    doi = "10.1162/tacl_a_00460",
    pages = "274--290",
}

License

Dataset Geography: Apache License 2.0
mGENRE: CC-BY-NC 4.0

Name		Name	Last commit message	Last commit date
Latest commit History 38 Commits
code		code
constrain		constrain
data		data
.gitignore		.gitignore
LICENSE.md		LICENSE.md
README.md		README.md
requirements.txt		requirements.txt

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Dataset Geography

Steps

1. Entity recognition and linking

example input dataset format:

command to extract entities for all the files in `data/sentences/tydiqa/` folder:

example output entity linked dataset format:

2. Dataset Country Mapping

Experiments

Citation

License

About

Releases 1

Packages

Contributors 2

Languages

License

ffaisal93/dataset_geography

Folders and files

Latest commit

History

Repository files navigation

Dataset Geography

Steps

1. Entity recognition and linking

example input dataset format:

command to extract entities for all the files in data/sentences/tydiqa/ folder:

example output entity linked dataset format:

2. Dataset Country Mapping

Experiments

Citation

License

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 2

Languages

command to extract entities for all the files in `data/sentences/tydiqa/` folder:

Packages