| 1 | +# DATASET-LICENSES.md |
| 2 | + |
| 3 | +This document provides a grouped summary of licenses for all data packages present in the [`nltk_data`](https://github.com/nltk/nltk_data) repository, based on the current `index.xml` file. Each package is listed by its exact `id` and `name`, and grouped by license type as declared in the metadata. |
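
The grouping described above could be reproduced with a short script. The following is a minimal sketch, not the tool used to produce this file: it assumes that `index.xml` contains `<package>` elements carrying `id`, `name`, and (optionally) `license` attributes, and the `SAMPLE` fragment and the `(no license declared)` label are hypothetical stand-ins for illustration.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical fragment for illustration; the real index.xml is much larger
# and its exact layout is an assumption here.
SAMPLE = """<nltk_data><packages>
<package id="vader_lexicon" name="VADER Sentiment Lexicon" license="MIT License"/>
<package id="punkt" name="Punkt Tokenizer Models"/>
<package id="genesis" name="Genesis Corpus" license="Public Domain"/>
</packages></nltk_data>"""


def group_by_license(index_xml):
    """Map each declared license string to a sorted list of (id, name) pairs."""
    root = ET.fromstring(index_xml)
    groups = defaultdict(list)
    for pkg in root.iter("package"):
        # A missing or whitespace-only attribute is bucketed under a placeholder label.
        license_text = (pkg.get("license") or "").strip() or "(no license declared)"
        groups[license_text].append((pkg.get("id", ""), pkg.get("name", "")))
    return {key: sorted(value) for key, value in groups.items()}


if __name__ == "__main__":
    for license_text, packages in sorted(group_by_license(SAMPLE).items()):
        print(license_text)
        for pkg_id, name in packages:
            print(f"  - {pkg_id}: {name}")
```

Grouping on the raw attribute string keeps near-duplicate spellings of the same license in separate buckets, so the output would still need manual review before being merged into the headings used in this file.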
| 4 | + |
| 5 | +> **Disclaimer:** |
| 6 | +> This information is provided as a convenience to users and is not legal advice. |
| 7 | +> **You must verify the license for each dataset with the original source if your use case is sensitive (especially for commercial or redistributive use).** |
| 8 | +> Licenses or terms can change over time; this file may become outdated if not maintained. |
| 9 | + |
| 10 | +--- |
| 11 | + |
| 12 | +## MIT License |
| 13 | + |
| 14 | +- averaged_perceptron_tagger — Averaged Perceptron Tagger |
| 15 | +- averaged_perceptron_tagger_eng — Averaged Perceptron Tagger (JSON) |
| 16 | +- averaged_perceptron_tagger_ru — Averaged Perceptron Tagger (Russian) |
| 17 | +- averaged_perceptron_tagger_rus — Averaged Perceptron Tagger (Russian) |
| 18 | +- vader_lexicon — VADER Sentiment Lexicon |
| 19 | + |
| 20 | +--- |
| 21 | + |
| 22 | +## Creative Commons Licenses |
| 23 | + |
| 24 | +### Creative Commons Attribution 4.0 International |
| 25 | + |
| 26 | +- opinion_lexicon — Opinion Lexicon |
| 27 | +- product_reviews_1 — Product Reviews (5 Products) |
| 28 | +- product_reviews_2 — Product Reviews (9 Products) |
| 29 | +- pros_cons — Pros and Cons |
| 30 | +- subjectivity — Subjectivity Dataset v1.0 |
| 31 | + |
| 32 | +### Creative Commons Attribution 3.0 Unported License |
| 33 | + |
| 34 | +- framenet_v17 — FrameNet 1.7 |
| 35 | + |
| 36 | +### Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States |
| 37 | + |
| 38 | +- universal_treebanks_v20 — Universal Treebanks Version 2.0 |
| 39 | + |
| 40 | +### Creative Commons Attribution 3.0 (unspecified region) |
| 41 | + |
| 42 | +- sentiwordnet — SentiWordNet |
| 43 | + |
| 44 | +### CC0 1.0 Universal |
| 45 | + |
| 46 | +- panlex_swadesh — PanLex Swadesh Corpora |
| 47 | + |
| 48 | +### CC By SA 3.0 (Wiktionary) & UBY 1.0 (UBY) |
| 49 | + |
| 50 | +- extended_omw — Extended Open Multilingual WordNet |
| 51 | + |
| 52 | +--- |
| 53 | + |
| 54 | +## GNU Licenses |
| 55 | + |
| 56 | +### GNU General Public License |
| 57 | + |
| 58 | +- pl196x — Polish language of the XX century sixties |
| 59 | + |
| 60 | +### GNU Free Documentation License |
| 61 | + |
| 62 | +- swadesh — Swadesh Wordlists |
| 63 | +- gazetteers — Gazetteer Lists (note: for some files only; others may be public domain) |
| 64 | + |
| 65 | +### GNU Lesser General Public License |
| 66 | + |
| 67 | +- nonbreaking_prefixes — Non-Breaking Prefixes (Moses Decoder) |
| 68 | + |
| 69 | +--- |
| 70 | + |
| 71 | +## Public Domain |
| 72 | + |
| 73 | +- genesis — Genesis Corpus |
| 74 | +- gutenberg — Project Gutenberg Selections |
| 75 | +- inaugural — C-Span Inaugural Address Corpus |
| 76 | +- shakespeare — Shakespeare XML Corpus Sample |
| 77 | +- udhr — Universal Declaration of Human Rights Corpus |
| 78 | +- udhr2 — Universal Declaration of Human Rights Corpus (Unicode Version) |
| 79 | +- words — Word Lists |
| 80 | + |
| 81 | +--- |
| 82 | + |
| 83 | +## “Distributed with Permission” / “May be used with Permission” / “Freely Redistributable” |
| 84 | + |
| 85 | +> **Warning:** |
| 86 | +> These are not standard open licenses. Terms may prohibit redistribution, modification, or commercial use. |
| 87 | +> **You must consult the upstream source for the actual terms and whether permission applies to your use case.** |
| 88 | + |
| 89 | +- alpino — Alpino Dutch Treebank |
| 90 | +- indian — Indian Language POS-Tagged Corpus |
| 91 | +- lin_thesaurus — Lin's Dependency Thesaurus |
| 92 | +- mac_morpho — MAC-MORPHO: Brazilian Portuguese news text with part-of-speech tags |
| 93 | +- paradigms — Paradigm Corpus |
| 94 | +- nombank.1.0 — NomBank Corpus 1.0 |
| 95 | +- propbank — Proposition Bank Corpus 1.0 |
| 96 | +- senseval — SENSEVAL 2 Corpus: Sense Tagged Text |
| 97 | +- verbnet — VerbNet Lexicon, Version 2.1 |
| 98 | +- verbnet3 — VerbNet Lexicon, Version 3.3 |
| 99 | +- maxent_treebank_pos_tagger — Treebank Part of Speech Tagger (Maximum entropy) |
| 100 | +- maxent_treebank_pos_tagger_tab — Treebank Part of Speech Tagger (Maximum entropy) |
| 101 | +- maxent_ne_chunker — ACE Named Entity Chunker (Maximum entropy) |
| 102 | +- maxent_ne_chunker_tab — ACE Named Entity Chunker (Maximum entropy) |
| 103 | +- pil — The Patient Information Leaflet (PIL) Corpus |
| 104 | +- pe08 — Cross-Framework and Cross-Domain Parser Evaluation Shared Task |
| 105 | +- kimmo — PC-KIMMO Data Files |
| 106 | +- jeita — JEITA Public Morphologically Tagged Corpus |
| 107 | +- knbc — KNB Corpus (Annotated blog corpus) |
| 108 | + |
| 109 | +--- |
| 110 | + |
| 111 | +## “Non-commercial Use Only” / Educational Use |
| 112 | + |
| 113 | +- brown — Brown Corpus |
| 114 | +- brown_tei — Brown Corpus (TEI XML Version) |
| 115 | +- framenet_v15 — FrameNet 1.5 |
| 116 | +- floresta — Portuguese Treebank |
| 117 | +- masc_tagged — MASC Tagged Corpus |
| 118 | +- nps_chat — NPS Chat |
| 119 | + |
| 120 | +--- |
| 121 | + |
| 122 | +## “See LICENSE Files” (Aggregated/Mixed Licensing) |
| 123 | + |
| 124 | +> **Warning:** |
| 125 | +> These packages include files from multiple sources, each with their own license. See LICENSE files inside the package and verify terms for your use case. |
| 126 | + |
| 127 | +- omw — Open Multilingual Wordnet |
| 128 | +- omw-1.4 — Open Multilingual Wordnet |
| 129 | + |
| 130 | +--- |
| 131 | + |
| 132 | +## Special Cases, Custom, or Unique Licenses |
| 133 | + |
| 134 | +- bcp47 — BCP-47 Language Tags ("IETF Trust and Unicode Inc."; custom) |
| 135 | +- wordnet — WordNet ("Permission to use, copy, modify and distribute this software and database and its documentation for any purpose and without fee or royalty") |
| 136 | +- wordnet31 — Wordnet 3.1 (same as above) |
| 137 | +- wordnet2021 / wordnet2022 / english_wordnet — Open English Wordnet (combines WordNet License and Creative Commons Attribution) |
| 138 | +- twitter_samples — Twitter Samples ("Must be used subject to Twitter Developer Agreement") |
| 139 | +- switchboard — Switchboard Corpus Sample ("Permission is granted for use of this material in accordance with the Open Content License") |
| 140 | +- dependency_treebank — Dependency Parsed Treebank (fragment of Penn Treebank; non-commercial, no redistribution) |
| 141 | +- ptb — Penn Treebank (stub for full corpus) |
| 142 | +- treebank — Penn Treebank Sample (fragment; non-commercial, no redistribution) |
| 143 | +- conll2000 — CONLL 2000 Chunking Corpus (research use only) |
| 144 | +- conll2002 — CONLL 2002 Named Entity Recognition Corpus (see website) |
| 145 | +- conll2007 — Dependency Treebanks from CoNLL 2007 (Catalan and Basque Subset, see website) |
| 146 | +- ieer — NIST IE-ER DATA SAMPLE (see website) |
| 147 | +- reuters — Reuters-21578 benchmark corpus, ApteMod version (Reuters Ltd. copyright) |
| 148 | +- timit — TIMIT Corpus Sample (Creative Commons Attribution-NonCommercial-ShareAlike 3.0) |
| 149 | + |
| 150 | +--- |
| 151 | + |
| 152 | +## Unclarified, Unknown, Ambiguous, or Citation-Only |
| 153 | + |
| 154 | +The following packages have: |
| 155 | +- No `license` attribute |
| 156 | +- An empty or ambiguous value |
| 157 | +- A citation request instead of a license |
| 158 | +- Or otherwise ambiguous status |
| 159 | + |
| 160 | +> **Warning:** |
| 161 | +> These packages lack open, standard, or clearly documented licenses. |
| 162 | +> Citation requests do **not** constitute a license. |
| 163 | +> Despite long-standing and ongoing efforts (see [nltk_data issue #241](https://github.com/nltk/nltk_data/issues/241) and related discussions), clarification has not been possible for these cases. |
| 164 | +> **If you need to use any of these for commercial or redistributive purposes, consult a qualified legal professional.** |
| 165 | + |
| 166 | +- abc — Australian Broadcasting Commission 2006 |
| 167 | +- basque_grammars — Grammars for Basque |
| 168 | +- biocreative_ppi — BioCreAtIvE (Critical Assessment of Information Extraction Systems in Biology) |
| 169 | +- bllip_wsj_no_aux — BLLIP Parser: WSJ Model |
| 170 | +- book_grammars — Grammars from NLTK Book |
| 171 | +- cess_cat — CESS-CAT Treebank (citation requested, not a license) |
| 172 | +- cess_esp — CESS-ESP Treebank (citation requested, not a license) |
| 173 | +- chat80 — Chat-80 Data Files |
| 174 | +- city_database — City Database |
| 175 | +- cmudict — The Carnegie Mellon Pronouncing Dictionary (0.6) |
| 176 | +- comparative_sentences — Comparative Sentence Dataset (ambiguous license) |
| 177 | +- comtrans — ComTrans Corpus Sample |
| 178 | +- dolch — Dolch Word List |
| 179 | +- europarl_raw — Sample European Parliament Proceedings Parallel Corpus |
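| 180 | +- framenet_v15 — FrameNet 1.5 (non-commercial use only; also listed under “Non-commercial Use Only” / Educational Use) |
| 181 | +- gazetteers — Gazetteer Lists (mixed per-file; also listed under GNU Free Documentation License) |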
| 182 | +- large_grammars — Large context-free and feature-based grammars |
| 183 | +- machado — Machado de Assis -- Obra Completa ("Public Domain", verify at source) |
| 184 | +- moses_sample — Moses Sample Models |
| 185 | +- mwa_ppdb — Monolingual word aligner (subset of Paraphrase Database) |
| 186 | +- names — Names Corpus, Version 1.3 (1994-03-29) |
| 187 | +- nonbreaking_prefixes — Non-Breaking Prefixes (empty license field; also listed under GNU Lesser General Public License) |
| 188 | +- punkt — Punkt Tokenizer Models (no license attribute) |
| 189 | +- punkt_tab — Punkt Tokenizer Models (no license attribute) |
| 190 | +- porter_test — Porter Stemmer Test Files |
| 191 | +- ppattach — Prepositional Phrase Attachment Corpus |
| 192 | +- problem_reports — Problem Report Corpus |
| 193 | +- qc — Experimental Data for Question Classification |
| 194 | +- rslp — RSLP Stemmer (Removedor de Sufixos da Lingua Portuguesa) |
| 195 | +- rte — PASCAL RTE Challenges 1, 2, and 3 |
| 196 | +- sample_grammars — Sample Grammars |
| 197 | +- semcor — SemCor 3.0 |
| 198 | +- sentence_polarity — Sentence Polarity Dataset v1.0 (ambiguous license) |
| 199 | +- smultron — SMULTRON Corpus Sample |
| 200 | +- snowball_data — Snowball Data |
| 201 | +- spanish_grammars — Grammars for Spanish |
| 202 | +- state_union — C-Span State of the Union Address Corpus |
| 203 | +- stopwords — Stopwords Corpus |
| 204 | +- tagsets — Help on Tagsets |
| 205 | +- tagsets_json — Help on Tagsets (JSON) |
| 206 | +- toolbox — Toolbox Sample Files |
| 207 | +- unicode_samples — Unicode Samples |
| 208 | +- webtext — Web Text Corpus |
| 209 | +- wmt15_eval — Evaluation data from WMT15 |
| 210 | +- word2vec_sample — Word2Vec Sample |
| 211 | +- wordnet_ic — WordNet-InfoContent |
| 212 | +- ycoe — York-Toronto-Helsinki Parsed Corpus of Old English Prose |
| 213 | + |
| 214 | +--- |
| 215 | + |
| 216 | +## Packages with Citation Requests Instead of Licenses |
| 217 | + |
| 218 | +> **Note:** |
| 219 | +> These packages specifically request citation for use, but do not provide a license. Citation requests are not a license. |
| 220 | + |
| 221 | +- cess_cat — CESS-CAT Treebank |
| 222 | +- cess_esp — CESS-ESP Treebank |
| 223 | + |
| 224 | +--- |
| 225 | + |
| 226 | +## Packages Citing Source Website or “See Website” for Terms |
| 227 | + |
| 228 | +> **Note:** |
| 229 | +> These packages refer users to an external website for their licensing terms. |
| 230 | + |
| 231 | +- conll2002 — CONLL 2002 Named Entity Recognition Corpus |
| 232 | +- conll2007 — Dependency Treebanks from CoNLL 2007 (Catalan and Basque Subset) |
| 233 | +- ieer — NIST IE-ER DATA SAMPLE |
| 234 | +- reuters — The Reuters-21578 benchmark corpus, ApteMod version |
| 235 | + |
| 236 | +--- |
| 237 | + |
| 238 | +## Maintenance |
| 239 | + |
| 240 | +**If you add, update, or remove any data packages, update this file accordingly to ensure continued transparency and compliance.** |
| 241 | +If you find omissions, errors, or outdated information, please open an issue or pull request. |
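
One way to catch drift between this file and `index.xml` is an automated coverage check. The sketch below is only an illustration under stated assumptions: it assumes `<package>` elements with an `id` attribute, and it assumes the bullet convention used in this file (`- id`, an em dash, then the name, with `/` separating several ids on one line). The function names `listed_ids`, `index_ids`, and `coverage_gaps` are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Matches bullets of the form "- <ids> <em dash> <name>"; \u2014 is the em dash
# used as the separator in this file's lists.
BULLET = re.compile(r"^- (.+?) \u2014 ")


def listed_ids(markdown):
    """Collect package ids from bullet lines; 'a / b' bullets yield both ids."""
    ids = set()
    for line in markdown.splitlines():
        match = BULLET.match(line.strip())
        if match:
            ids.update(part.strip() for part in match.group(1).split("/"))
    return ids


def index_ids(index_xml):
    """Collect the id attribute of every <package> element (assumed layout)."""
    root = ET.fromstring(index_xml)
    return {pkg.get("id") for pkg in root.iter("package") if pkg.get("id")}


def coverage_gaps(index_xml, markdown):
    """Return (ids missing from the file, ids listed but absent from the index)."""
    declared = index_ids(index_xml)
    listed = listed_ids(markdown)
    return sorted(declared - listed), sorted(listed - declared)
```

Bullets without the dash separator, such as the criteria list under the "Unclarified" heading, are ignored by the pattern. A check like this could run in CI and fail when either returned list is non-empty.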
| 242 | + |
| 243 | +--- |