akranlu/wsd-data

wsd-data

Standardized datasets in jsonl format for Word Sense Disambiguation. For now, this covers SemCor and the evaluation datasets from Raganato et al., 2017.

Reproducing

Clone the repository using:

git clone git@github.com:akranlu/wsd-json.git

Create the data/ directory:

cd wsd-json
mkdir data

Download the Unified WSD Evaluation Framework files (Raganato et al., 2017), unzip the archive, and copy the following directories into the data/ directory:

ALL # found in WSD_Evaluation_Framework/Evaluation_Datasets/
semeval2007 # found in WSD_Evaluation_Framework/Evaluation_Datasets/
semeval2013 # found in WSD_Evaluation_Framework/Evaluation_Datasets/
semeval2015 # found in WSD_Evaluation_Framework/Evaluation_Datasets/
senseval2 # found in WSD_Evaluation_Framework/Evaluation_Datasets/
senseval3 # found in WSD_Evaluation_Framework/Evaluation_Datasets/
SemCor # found in WSD_Evaluation_Framework/Training_Corpora/
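
Before running the conversion, it can help to confirm that all seven directories ended up in data/. The following is a minimal sketch, not part of this repository: the helper name missing_dataset_dirs is hypothetical, and it only checks that the directories exist, not what they contain.

```python
from pathlib import Path

# Directory names taken from the list above.
EXPECTED_DIRS = [
    "ALL",
    "semeval2007",
    "semeval2013",
    "semeval2015",
    "senseval2",
    "senseval3",
    "SemCor",
]


def missing_dataset_dirs(data_root="data"):
    """Return the expected dataset directories that are absent from data_root.

    Hypothetical helper for illustration; it does not inspect file contents.
    """
    root = Path(data_root)
    return [name for name in EXPECTED_DIRS if not (root / name).is_dir()]


if __name__ == "__main__":
    missing = missing_dataset_dirs()
    if missing:
        print("Missing from data/:", ", ".join(missing))
    else:
        print("All expected dataset directories are present.")
```

Run it from the repository root so that the relative path data/ resolves correctly.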

Then, run the following command to convert the xml and txt files in each of the above datasets into jsonl files, stored in data/jsonl/:

python convert.py

This script converts the datasets into jsonl files with the following format for each entry:

{
    "id": "<instance id as given in the WSD datasets>",
    "word": "<the word>",
    "start": "<start index of the span>",
    "end": "<end index of the span>",
    "sense": "<WordNet sense key>",
    "lemma": "<Lemma of the word>",
    "pos": "<Part of speech tag of the word in the context>",
    "sentence": "<sentence containing the sense-annotated word>"
}
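
As an illustration of how the converted files might be consumed, the sketch below reads a jsonl file line by line and counts how often each sense occurs per lemma and part of speech. It relies only on the field names shown above. The function names and the example path data/jsonl/semeval2007.jsonl are assumptions made for this example, and the code handles "sense" being either a single key or a list of keys because the format above does not say which applies.

```python
import json
from collections import Counter, defaultdict
from pathlib import Path


def read_jsonl(path):
    """Yield one dict per non-empty line of a jsonl file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)


def sense_counts_by_lemma(entries):
    """Count sense keys for each (lemma, pos) pair.

    Assumption: "sense" is either one sense key (str) or a list of keys.
    """
    counts = defaultdict(Counter)
    for entry in entries:
        senses = entry["sense"]
        if isinstance(senses, str):
            senses = [senses]
        for sense in senses:
            counts[(entry["lemma"], entry["pos"])][sense] += 1
    return counts


if __name__ == "__main__":
    # Hypothetical output file name; adjust to whatever convert.py produced.
    path = Path("data/jsonl/semeval2007.jsonl")
    if path.exists():
        counts = sense_counts_by_lemma(read_jsonl(path))
        for (lemma, pos), sense_counter in sorted(counts.items())[:10]:
            print(lemma, pos, dict(sense_counter))
    else:
        print(f"{path} not found; run convert.py first.")
```

The "start" and "end" fields are deliberately left unused here, since whether they are character or token offsets should be checked against the converted data before relying on them.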

Citation

This script only converts the original data provided by Raganato et al., 2017. Please cite the original authors using this bibtex:

@inproceedings{raganato-etal-2017-word,
    title = "Word Sense Disambiguation: A Unified Evaluation Framework and Empirical Comparison",
    author = "Raganato, Alessandro  and
      Camacho-Collados, Jose  and
      Navigli, Roberto",
    booktitle = "Proceedings of the 15th Conference of the {E}uropean Chapter of the Association for Computational Linguistics: Volume 1, Long Papers",
    month = apr,
    year = "2017",
    address = "Valencia, Spain",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/E17-1010",
    pages = "99--110",
    abstract = "Word Sense Disambiguation is a long-standing task in Natural Language Processing, lying at the core of human language understanding. However, the evaluation of automatic systems has been problematic, mainly due to the lack of a reliable evaluation framework. In this paper we develop a unified evaluation framework and analyze the performance of various Word Sense Disambiguation systems in a fair setup. The results show that supervised systems clearly outperform knowledge-based models. Among the supervised systems, a linear classifier trained on conventional local features still proves to be a hard baseline to beat. Nonetheless, recent approaches exploiting neural networks on unlabeled corpora achieve promising results, surpassing this hard baseline in most test sets.",
}
