Parts of Speech Datasets

Several datasets are available for part-of-speech (POS) tagging.

Dataset

Directory Structure:

  • LDC: The LDC corpus is available through LDC. Due to license restrictions, we cannot redistribute it here.
    • License: LDC
  • train_data_IITKGP.tsv: The IITKGP POS-tagged corpus is annotated with a 38-tag tagset, developed by Microsoft Research in collaboration with IIT Kharagpur and several institutions in India.
    • License: NA
  • pos_tag_data_bangla_tagged_corpus.tsv: The CRBLP POS-tagged corpus consists of approximately 20K tokens from 1176 sentences.
    • License: NA
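
The file layout of the two `.tsv` files is not described above. The sketch below assumes one token per line with the token in the first tab-separated column and the tag in the last, and a blank line between sentences. Both the layout and the function name `read_tagged_sentences` are assumptions made for illustration; check the files and adjust before relying on it.

```python
from typing import Iterable, List, Tuple

# A sentence is a list of (token, tag) pairs.
Sentence = List[Tuple[str, str]]


def read_tagged_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Group token<TAB>tag lines into sentences.

    Assumed layout (not confirmed by this README): first column is the
    token, last column is the tag, and a blank line ends a sentence.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the current sentence, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            # Skip lines that do not have at least a token and a tag.
            continue
        current.append((parts[0], parts[-1]))
    # Keep the final sentence when the input has no trailing blank line.
    if current:
        sentences.append(current)
    return sentences
```

To use it, open a file with `encoding="utf-8"` and pass the file object to `read_tagged_sentences`. If the files turn out to store one sentence per line instead, only the grouping logic needs to change.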

Licensing

Our work is licensed under CC BY-NC 4.0 (https://creativecommons.org/licenses/by-nc/4.0/). For each individual dataset, see the license information listed with it.

Citation

Please cite the following papers if you are using the data:

@article{alam2021review,
  title={A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models},
  author={Alam, Firoj and Hasan, Md Arid and Alam, Tanvir and Khan, Akib and Tajrin, Janntatul and Khan, Naira and Chowdhury, Shammur Absar},
  journal={arXiv preprint arXiv:2107.03844},
  year={2021}
}

@inproceedings{alam2016bidirectional,
  title={Bidirectional LSTMs—CRFs networks for bangla POS tagging},
  author={Alam, Firoj and Chowdhury, Shammur Absar and Noori, Sheak Rashed Haider},
  booktitle={19th International Conference on Computer and Information Technology (ICCIT), 2016},
  pages={377--382},
  year={2016},
  organization={IEEE}
}

@techreport{iitgpgpostagging,
 Noaddress = {India},
 author = {A Kumaran},
 institution = {Microsoft Research},
 title = {A Part of Speech Tagger for Indian Languages (POS tagger)},
 year = {2007}
}

@techreport{ummi2008developing,
 author = {Ummi, Rabia Sultana and Huda, Fahmina},
 institution = {BRAC University},
 title = {Developing language resources for English machine translation},
 year = {2008}
}