
Parts of Speech Datasets

There are several datasets available for Parts of Speech tagging.

Dataset

Directory Structure:

LDC: The LDC corpus is publicly available through the Linguistic Data Consortium (LDC). Due to license restrictions, we are not able to redistribute it here.
- License: LDC
train_data_IITKGP.tsv: The IITKGP POS-tagged corpus is annotated with a tagset of 38 tags, developed by Microsoft Research in collaboration with IIT Kharagpur and several other institutions in India.
- License: NA
pos_tag_data_bangla_tagged_corpus.tsv: The CRBLP POS-tagged corpus consists of approximately 20K tokens from 1,176 sentences.
- License: NA
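
The column layout of the `.tsv` files is not documented in this README. As a minimal sketch, the snippet below assumes one token per line in the form `token<TAB>tag`, with a blank line between sentences. Both the layout and the helper names `read_tagged_sentences` and `tag_counts` are assumptions for illustration, so please inspect the files and adjust the parsing before relying on it.

```python
from collections import Counter
from typing import Iterable, List, Tuple

Sentence = List[Tuple[str, str]]


def read_tagged_sentences(lines: Iterable[str]) -> List[Sentence]:
    """Parse lines of the ASSUMED form token<TAB>tag.

    A blank line is treated as a sentence boundary (also an assumption).
    Lines with fewer than two tab-separated fields are skipped.
    """
    sentences: List[Sentence] = []
    current: Sentence = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if not line.strip():
            # Blank line: close the sentence collected so far, if any.
            if current:
                sentences.append(current)
                current = []
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # malformed line under the assumed layout
        current.append((parts[0], parts[1]))
    if current:
        sentences.append(current)
    return sentences


def tag_counts(sentences: List[Sentence]) -> Counter:
    """Count how often each tag occurs across all sentences."""
    return Counter(tag for sentence in sentences for _, tag in sentence)


# Example usage (file name taken from the directory listing above):
# with open("pos_tag_data_bangla_tagged_corpus.tsv", encoding="utf-8") as f:
#     sentences = read_tagged_sentences(f)
# print(len(sentences), tag_counts(sentences).most_common(5))
```

If the files turn out to store one sentence per line or use a different column order, only the body of `read_tagged_sentences` should need to change.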

Licensing

Our work is licensed under https://creativecommons.org/licenses/by-nc/4.0/. For each specific dataset, please see the license information associated with it.

Citation

Please cite the following papers if you are using the data:

@article{alam2021review,
  title={A Review of Bangla Natural Language Processing Tasks and the Utility of Transformer Models},
  author={Alam, Firoj and Hasan, Md Arid and Alam, Tanvir and Khan, Akib and Tajrin, Janntatul and Khan, Naira and Chowdhury, Shammur Absar},
  journal={arXiv preprint arXiv:2107.03844},
  year={2021}
}

@inproceedings{alam2016bidirectional,
  title={Bidirectional LSTMs—CRFs networks for bangla POS tagging},
  author={Alam, Firoj and Chowdhury, Shammur Absar and Noori, Sheak Rashed Haider},
  booktitle={19th International Conference on Computer and Information Technology (ICCIT), 2016},
  pages={377--382},
  year={2016},
  organization={IEEE}
}

@techreport{iitgpgpostagging,
 Noaddress = {India},
 author = {A Kumaran},
 institution = {Microsoft Research},
 title = {A Part of Speech Tagger for Indian Languages (POS tagger)},
 year = {2007}
}

@techreport{ummi2008developing,
 author = {Ummi, Rabia Sultana and Huda, Fahmina},
 institution = {BRAC University},
 title = {Developing language resources for English machine translation},
 year = {2008}
}
