Skip to content

google-research-datasets/WikipediaHomographData

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

19 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Homograph disambiguation data

This repository provides labeled data for training homograph disambiguation models, as described in:

Gorman, K., Mazovetskiy, G., and Nikolaev, V. (2018). Improving homograph disambiguation with machine learning. In Proceedings of the Eleventh International Conference on Language Resources and Evaluation, pages 1349-1352. Miyazaki, Japan.

If you use this data in a publication, we would appreciate if you cite this paper.

Annotation

Sentences were extracted from English Wikipedia articles. Homograph were initially labeled for the most likely WORDID (as defined below) in context by a team of three annotators. In the case that all three did not agree on the WORDID, a fourth senior annotator resolved the disagreements.

There are now 162 unique homographs and roughly 100 examples per homograph.

Organization

The files in the directories data/train and data/eval are TSV files with the following fields:

  • homograph: the homograph word itself
  • wordid: name of the pronunciation
  • sentence: text of the example
  • start: the first byte---inclusive--of the target homograph in sentence
  • end: the last byte---exclusive---of the target homograph in sentence

These two files represent a suggested 90%/10% train/test split stratified by homograph.

The file data/wordids.tsv is a TSV file which maps from the WORDID field above to information used by the annotator: -a short human-readable description of the WORDID, and a transcription of the WORDID. Note that neither are intended to be authoritative; they are simply to help users distinguish between the various WORDIDs for a homograph. The final two fields have some impressionistic taxonomic information about the nature of the homography itself intended for use during error analysis. The following fields are present:

  • homograph: the homograph word itself
  • wordid: name of the pronunciation
  • label: a short human-readable description of the wordid
  • pronunciation: a phonemic transcription of the wordid in US English.
  • homograph_type: a binary category describing the broad source of homography: morphosyntactic derivations from the same lemma, or lexically distinct terms.
  • fine_homography_type: a more detailed classification of the above.

Authors

This data was collected by Kyle Gorman, Vitaly Nikolaev, and Gleb Mazovetskiy, with help from a team of linguists and annotators.

License

See LICENSE.

Contributing

See CONTRIBUTING.

Mandatory disclaimer

This is not an official Google product.

About

Labeled data for homograph disambiguation

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published