Several datasets are available for part-of-speech (POS) tagging:
- LDC: The LDC corpus is publicly available through the LDC. Because of its license restrictions, we are not able to redistribute it.
- License: LDC
- train_data_IITKGP.tsv: The IITKGP POS Tagged corpus uses a tagset of 38 tags. It was developed by Microsoft Research in collaboration with IIT Kharagpur and several institutions in India.
- License: NA
- pos_tag_data_bangla_tagged_corpus.tsv: The CRBLP POS Tagged corpus consists of approximately 20K tokens from 1,176 sentences.
- License: NA
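
The corpus figures quoted above (sentence, token, and tag counts) can be checked with a short script. The following is a minimal sketch only: the column layout of the `.tsv` files is not documented here, so the code assumes one `token<TAB>tag` row per line with an empty line between sentences. The function names `read_tagged_sentences` and `corpus_stats` are hypothetical helpers, and the sample rows are made-up placeholders rather than real corpus data.

```python
import csv
import io
from collections import Counter


def read_tagged_sentences(handle):
    """Yield sentences as lists of (token, tag) pairs.

    ASSUMED layout (not confirmed by this README): one "token<TAB>tag"
    row per line, with an empty line between sentences.
    """
    sentence = []
    for row in csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE):
        # csv.reader yields an empty list for an empty line.
        if not row or not row[0].strip():
            if sentence:
                yield sentence
                sentence = []
            continue
        if len(row) < 2:
            raise ValueError(f"expected token and tag columns, got: {row!r}")
        sentence.append((row[0], row[-1]))
    if sentence:
        yield sentence


def corpus_stats(sentences):
    """Count sentences, tokens, and distinct tags."""
    sentences = list(sentences)
    tags = Counter(tag for sent in sentences for _, tag in sent)
    return {
        "sentences": len(sentences),
        "tokens": sum(tags.values()),
        "tags": len(tags),
    }


# Placeholder rows standing in for a real file (hypothetical content).
sample = "w1\tNN\nw2\tVB\n\nw3\tNN\n"
stats = corpus_stats(read_tagged_sentences(io.StringIO(sample)))
print(stats)
```

To run this against one of the files listed above, replace the `io.StringIO(sample)` handle with `open("train_data_IITKGP.tsv", encoding="utf-8", newline="")`. If the actual files use a different layout, the parsing loop needs to be adapted accordingly.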
Our work is licensed under https://creativecommons.org/licenses/by-nc/4.0/. For each specific dataset, please see the license information associated with it.
Please cite the following papers if you are using the data:
@article{alam2021review,
title={A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models},
author={Alam, Firoj and Hasan, Md Arid and Alam, Tanvir and Khan, Akib and Tajrin, Janntatul and Khan, Naira and Chowdhury, Shammur Absar},
journal={arXiv preprint arXiv:2107.03844},
year={2021}
}
@inproceedings{alam2016bidirectional,
title={Bidirectional LSTMs—CRFs networks for bangla POS tagging},
author={Alam, Firoj and Chowdhury, Shammur Absar and Noori, Sheak Rashed Haider},
booktitle={19th International Conference on Computer and Information Technology (ICCIT), 2016},
pages={377--382},
year={2016},
organization={IEEE}
}
@techreport{iitgpgpostagging,
Noaddress = {India},
author = {A Kumaran},
institution = {Microsoft Research},
title = {A Part of Speech Tagger for Indian Languages (POS tagger)},
year = {2007}
}
@techreport{ummi2008developing,
author = {Ummi, Rabia Sultana and Huda, Fahmina},
institution = {BRAC University},
title = {Developing language resources for English machine translation},
year = {2008}
}