Commit b2f5e5f

Merge pull request #242 from ekaf/hotfix-241
Clarify licensing, add CONTRIBUTING.md, and update README.md
2 parents 66f9f16 + 1bb484f commit b2f5e5f

6 files changed

+607
-7
lines changed

CONTRIBUTING.md

Lines changed: 80 additions & 0 deletions
@@ -0,0 +1,80 @@
# Contributing to nltk_data

Thank you for your interest in contributing to [`nltk_data`](https://github.com/nltk/nltk_data)! This guide will help you add new data packages (corpora, taggers, models, etc.) and contribute improvements to existing ones.

## Adding a New Data Package

The `nltk_data` repository contains datasets and resources that can be downloaded by `nltk.downloader`. To add a new dataset or resource, please follow these steps:

### 1. Fork and Clone the Repository

First, fork the [`nltk_data`](https://github.com/nltk/nltk_data) repository to your own GitHub account. For help with forking, see the [GitHub documentation on forking a repository](https://docs.github.com/en/get-started/quickstart/fork-a-repo).

Then, clone your fork locally:

```bash
git clone https://github.com/<your-github-username>/nltk_data.git
cd nltk_data
```

### 2. Create a New Branch

Create a branch for your dataset:

```bash
git checkout -b add-my-dataset
```

### 3. Add Your Data Package

- Place your dataset in the appropriate directory (`corpora/`, `models/`, `tokenizers/`, etc.). If you are unsure, check the existing structure or open an issue for clarification.
- If your dataset has a license, include the license file in the same directory. If the license is unknown or separate from the repository, please add a note in a `README` or `LICENSE` file within the dataset’s folder, and document this in your pull request.

**Whenever you add a new data package, you must update [`DATASET-LICENSES.md`](DATASET-LICENSES.md) with the license information for your package.**

You only need to update [`LICENSE-OVERVIEW.md`](LICENSE-OVERVIEW.md) if you are making changes to the repository’s overall licensing structure or guidance.
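
As a rough illustration of the licensing check in this step, the Python sketch below tests whether a package folder contains a `LICENSE` or `README` file before you commit. The helper `has_license_note` and the accepted file names are hypothetical; they are not part of any `nltk_data` tooling.

```python
from pathlib import Path

# Hypothetical set of accepted names; the guide only asks for a README or
# LICENSE file inside the dataset's folder.
LICENSE_NAMES = {
    "license", "license.txt", "license.md",
    "readme", "readme.txt", "readme.md",
}


def has_license_note(package_dir):
    """Return True if package_dir directly contains a LICENSE or README file."""
    return any(
        entry.is_file() and entry.name.lower() in LICENSE_NAMES
        for entry in Path(package_dir).iterdir()
    )
```

A call such as `has_license_note("corpora/my_dataset")` (a placeholder path) would return `False` until one of those files is added.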

### 4. Update Index Files

- If required, update the index and metadata files (for example, `index.xml`) so that NLTK’s downloader can discover the new dataset. Follow the format of the existing entries.
- Provide a short README or metadata file describing the package, its origin, and its license.
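
To make the metadata step more concrete, here is a minimal Python sketch that builds one metadata entry with the standard library. The `id`, `name`, and `license` attributes are the ones that `DATASET-LICENSES.md` in this commit refers to; the `<package>` tag name, the helper `make_package_entry`, and the example values are assumptions, so copy the format of an existing entry rather than relying on this output.

```python
import xml.etree.ElementTree as ET


def make_package_entry(pkg_id, name, license_text):
    """Build a metadata element for one data package (illustrative only)."""
    # The tag name "package" is an assumption about the index format.
    entry = ET.Element("package")
    entry.set("id", pkg_id)
    entry.set("name", name)
    entry.set("license", license_text)
    return entry


# Hypothetical package used purely as an example.
entry = make_package_entry("my_dataset", "My Example Dataset", "CC0 1.0 Universal")
print(ET.tostring(entry, encoding="unicode"))
```

Real entries may carry further attributes that this sketch does not set.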

### 5. Commit and Push Your Changes

```bash
git add <your new files>
git commit -m "Add <name> dataset to nltk_data"
git push origin add-my-dataset
```

### 6. Create a Pull Request

Open a pull request from your branch to the `master` branch of `nltk/nltk_data`. For help, see the [GitHub documentation on creating a pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request).

In your pull request, please include:
- A description of the dataset and its purpose.
- Any relevant licensing information or restrictions.
- Instructions for any special installation or usage requirements.

### 7. Respond to Feedback

- Be responsive to comments and requested changes.
- If your dataset cannot be accepted (e.g., due to licensing issues), we will let you know in the pull request.

## General Guidelines

- **Licensing**: Please ensure you have the right to redistribute any data you submit, and document the license clearly. If the license is unknown, state this explicitly in your pull request.
- **No Large Files**: If your package is extremely large, consider hosting it elsewhere and providing an index/manifest, or open an issue to discuss options.
- **No Executable Files**: Only data, not code, should be included unless a script is essential for using the dataset.
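
The size and executable guidelines above can be checked locally before opening a pull request. The following Python sketch is one possible approach, assuming a POSIX file system; the helper `audit_package` and the 50 MB threshold are hypothetical, since the guide does not define what counts as "extremely large".

```python
import stat
from pathlib import Path

# Hypothetical threshold; adjust to whatever the maintainers advise.
MAX_BYTES = 50 * 1024 * 1024


def audit_package(package_dir, max_bytes=MAX_BYTES):
    """Return (large_files, executable_files) found under package_dir."""
    large, executable = [], []
    for path in sorted(Path(package_dir).rglob("*")):
        if not path.is_file():
            continue
        info = path.stat()
        if info.st_size > max_bytes:
            large.append(path)
        # Any execute bit (owner, group, or other) counts as executable.
        if info.st_mode & (stat.S_IXUSR | stat.S_IXGRP | stat.S_IXOTH):
            executable.append(path)
    return large, executable
```

Both returned lists being empty suggests the package follows these two guidelines; a non-empty list is only a prompt to look closer, because a script may still be essential for using the dataset.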

## Additional Resources

- [GitHub Docs: Fork a repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo)
- [GitHub Docs: Branches](https://docs.github.com/en/get-started/quickstart/github-glossary#branch)
- [GitHub Docs: Pull Requests](https://docs.github.com/en/pull-requests)

If you have questions or need help, please open an issue or join the [nltk-dev mailing list](https://groups.google.com/forum/#!forum/nltk-dev).

---

Thank you for helping improve NLTK’s data resources!

DATASET-LICENSES.md

Lines changed: 243 additions & 0 deletions
@@ -0,0 +1,243 @@
# DATASET-LICENSES.md

This document provides a grouped summary of licenses for all data packages present in the [`nltk_data`](https://github.com/nltk/nltk_data) repository, based on the current `index.xml` file. Each package is listed by its exact `id` and `name`, and grouped by license type as declared in the metadata.

> **Disclaimer:**
> This information is provided as a convenience to users and is not legal advice.
> **You must verify the license for each dataset with the original source if your use case is sensitive (especially for commercial or redistributive use).**
> Licenses or terms can change over time; this file may become outdated if not maintained.
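
A grouping like the one described above could be derived from the index with a few lines of Python. This is only a sketch: the sample XML is a hypothetical three-entry excerpt, the `<package>` element layout and the exact license strings are assumptions, and `group_by_license` is not an existing NLTK function.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical excerpt; the real index.xml is much longer and its exact
# layout (one <package> element per entry) is assumed here.
SAMPLE_INDEX = """
<nltk_data>
  <packages>
    <package id="vader_lexicon" name="VADER Sentiment Lexicon" license="MIT License" />
    <package id="panlex_swadesh" name="PanLex Swadesh Corpora" license="CC0 1.0 Universal" />
    <package id="punkt" name="Punkt Tokenizer Models" />
  </packages>
</nltk_data>
"""


def group_by_license(index_xml):
    """Map each declared license string to a list of (id, name) pairs."""
    groups = defaultdict(list)
    for pkg in ET.fromstring(index_xml.strip()).iter("package"):
        # Missing and blank values are pooled under one placeholder key.
        license_text = (pkg.get("license") or "").strip() or "(no license attribute)"
        groups[license_text].append((pkg.get("id"), pkg.get("name")))
    return dict(groups)


for license_text, packages in sorted(group_by_license(SAMPLE_INDEX).items()):
    print(license_text, [pkg_id for pkg_id, _ in packages])
```

Grouping on the raw string keeps near-duplicate license names apart, which is presumably why several Creative Commons variants appear under separate headings in this file.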

---

## MIT License

- averaged_perceptron_tagger — Averaged Perceptron Tagger
- averaged_perceptron_tagger_eng — Averaged Perceptron Tagger (JSON)
- averaged_perceptron_tagger_ru — Averaged Perceptron Tagger (Russian)
- averaged_perceptron_tagger_rus — Averaged Perceptron Tagger (Russian)
- vader_lexicon — VADER Sentiment Lexicon

---

## Creative Commons Licenses

### Creative Commons Attribution 4.0 International

- opinion_lexicon — Opinion Lexicon
- product_reviews_1 — Product Reviews (5 Products)
- product_reviews_2 — Product Reviews (9 Products)
- pros_cons — Pros and Cons
- subjectivity — Subjectivity Dataset v1.0

### Creative Commons Attribution 3.0 Unported License

- framenet_v17 — FrameNet 1.7

### Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States

- universal_treebanks_v20 — Universal Treebanks Version 2.0

### Creative Commons Attribution 3.0 (unspecified region)

- sentiwordnet — SentiWordNet

### CC0 1.0 Universal

- panlex_swadesh — PanLex Swadesh Corpora

### CC By SA 3.0 (Wiktionary) & UBY 1.0 (UBY)

- extended_omw — Extended Open Multilingual WordNet

---

## GNU Licenses

### GNU General Public License

- pl196x — Polish language of the XX century sixties

### GNU Free Documentation License

- swadesh — Swadesh Wordlists
- gazetteers — Gazetteer Lists (note: for some files only; others may be public domain)

### GNU Lesser General Public License

- nonbreaking_prefixes — Non-Breaking Prefixes (Moses Decoder)

---

## Public Domain

- genesis — Genesis Corpus
- gutenberg — Project Gutenberg Selections
- inaugural — C-Span Inaugural Address Corpus
- shakespeare — Shakespeare XML Corpus Sample
- udhr — Universal Declaration of Human Rights Corpus
- udhr2 — Universal Declaration of Human Rights Corpus (Unicode Version)
- words — Word Lists

---

## “Distributed with Permission” / “May be used with Permission” / “Freely Redistributable”

> **Warning:**
> These are not standard open licenses. Terms may prohibit redistribution, modification, or commercial use.
> **You must consult the upstream source for the actual terms and whether permission applies to your use case.**

- alpino — Alpino Dutch Treebank
- indian — Indian Language POS-Tagged Corpus
- lin_thesaurus — Lin's Dependency Thesaurus
- mac_morpho — MAC-MORPHO: Brazilian Portuguese news text with part-of-speech tags
- paradigms — Paradigm Corpus
- nombank.1.0 — NomBank Corpus 1.0
- propbank — Proposition Bank Corpus 1.0
- senseval — SENSEVAL 2 Corpus: Sense Tagged Text
- verbnet — VerbNet Lexicon, Version 2.1
- verbnet3 — VerbNet Lexicon, Version 3.3
- maxent_treebank_pos_tagger — Treebank Part of Speech Tagger (Maximum entropy)
- maxent_treebank_pos_tagger_tab — Treebank Part of Speech Tagger (Maximum entropy)
- maxent_ne_chunker — ACE Named Entity Chunker (Maximum entropy)
- maxent_ne_chunker_tab — ACE Named Entity Chunker (Maximum entropy)
- pil — The Patient Information Leaflet (PIL) Corpus
- pe08 — Cross-Framework and Cross-Domain Parser Evaluation Shared Task
- kimmo — PC-KIMMO Data Files
- jeita — JEITA Public Morphologically Tagged Corpus
- knbc — KNB Corpus (Annotated blog corpus)

---

## “Non-commercial Use Only” / Educational Use

- brown — Brown Corpus
- brown_tei — Brown Corpus (TEI XML Version)
- framenet_v15 — FrameNet 1.5
- floresta — Portuguese Treebank
- masc_tagged — MASC Tagged Corpus
- nps_chat — NPS Chat

---

## “See LICENSE Files” (Aggregated/Mixed Licensing)

> **Warning:**
> These packages include files from multiple sources, each with its own license. See the LICENSE files inside the package and verify the terms for your use case.

- omw — Open Multilingual Wordnet
- omw-1.4 — Open Multilingual Wordnet

---

## Special Cases, Custom, or Unique Licenses

- bcp47 — BCP-47 Language Tags ("IETF Trust and Unicode Inc."; custom)
- wordnet — WordNet ("Permission to use, copy, modify and distribute this software and database and its documentation for any purpose and without fee or royalty")
- wordnet31 — Wordnet 3.1 (same as above)
- wordnet2021 / wordnet2022 / english_wordnet — Open English Wordnet (combines WordNet License and Creative Commons Attribution)
- twitter_samples — Twitter Samples ("Must be used subject to Twitter Developer Agreement")
- switchboard — Switchboard Corpus Sample ("Permission is granted for use of this material in accordance with the Open Content License")
- dependency_treebank — Dependency Parsed Treebank (fragment of Penn Treebank; non-commercial, no redistribution)
- ptb — Penn Treebank (stub for full corpus)
- treebank — Penn Treebank Sample (fragment; non-commercial, no redistribution)
- conll2000 — CONLL 2000 Chunking Corpus (research use only)
- conll2002 — CONLL 2002 Named Entity Recognition Corpus (see website)
- conll2007 — Dependency Treebanks from CoNLL 2007 (Catalan and Basque Subset, see website)
- ieer — NIST IE-ER DATA SAMPLE (see website)
- reuters — Reuters-21578 benchmark corpus, ApteMod version (Reuters Ltd. copyright)
- timit — TIMIT Corpus Sample (Creative Commons Attribution-NonCommercial-ShareAlike 3.0)

---

## Unclarified, Unknown, Ambiguous, or Citation-Only

Each of the following packages has at least one of these problems:
- No `license` attribute
- An empty or ambiguous `license` value
- A citation request instead of a license
- An otherwise ambiguous licensing status

> **Warning:**
> These packages lack open, standard, or clearly documented licenses.
> Citation requests do **not** constitute a license.
> Despite long-standing and ongoing efforts (see [nltk_data issue #241](https://github.com/nltk/nltk_data/issues/241) and related discussions), clarification has not been possible for these cases.
> **If you need to use any of these for commercial or redistributive purposes, consult a qualified legal professional.**

- abc — Australian Broadcasting Commission 2006
- basque_grammars — Grammars for Basque
- biocreative_ppi — BioCreAtIvE (Critical Assessment of Information Extraction Systems in Biology)
- bllip_wsj_no_aux — BLLIP Parser: WSJ Model
- book_grammars — Grammars from NLTK Book
- cess_cat — CESS-CAT Treebank (citation requested, not a license)
- cess_esp — CESS-ESP Treebank (citation requested, not a license)
- chat80 — Chat-80 Data Files
- city_database — City Database
- cmudict — The Carnegie Mellon Pronouncing Dictionary (0.6)
- comparative_sentences — Comparative Sentence Dataset (ambiguous license)
- comtrans — ComTrans Corpus Sample
- dolch — Dolch Word List
- europarl_raw — Sample European Parliament Proceedings Parallel Corpus
- framenet_v15 — FrameNet 1.5 (non-commercial use only)
- gazetteers — Gazetteer Lists (mixed per-file)
- large_grammars — Large context-free and feature-based grammars
- machado — Machado de Assis -- Obra Completa ("Public Domain", verify at source)
- moses_sample — Moses Sample Models
- mwa_ppdb — Monolingual word aligner (subset of Paraphrase Database)
- names — Names Corpus, Version 1.3 (1994-03-29)
- nonbreaking_prefixes — Non-Breaking Prefixes (empty license field)
- punkt — Punkt Tokenizer Models (no license attribute)
- punkt_tab — Punkt Tokenizer Models (no license attribute)
- porter_test — Porter Stemmer Test Files
- ppattach — Prepositional Phrase Attachment Corpus
- problem_reports — Problem Report Corpus
- qc — Experimental Data for Question Classification
- rslp — RSLP Stemmer (Removedor de Sufixos da Lingua Portuguesa)
- rte — PASCAL RTE Challenges 1, 2, and 3
- sample_grammars — Sample Grammars
- semcor — SemCor 3.0
- sentence_polarity — Sentence Polarity Dataset v1.0 (ambiguous license)
- smultron — SMULTRON Corpus Sample
- snowball_data — Snowball Data
- spanish_grammars — Grammars for Spanish
- state_union — C-Span State of the Union Address Corpus
- stopwords — Stopwords Corpus
- tagsets — Help on Tagsets
- tagsets_json — Help on Tagsets (JSON)
- toolbox — Toolbox Sample Files
- unicode_samples — Unicode Samples
- webtext — Web Text Corpus
- wmt15_eval — Evaluation data from WMT15
- word2vec_sample — Word2Vec Sample
- wordnet_ic — WordNet-InfoContent
- ycoe — York-Toronto-Helsinki Parsed Corpus of Old English Prose
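
The criteria at the top of this section (missing attribute, empty value, citation request) can be approximated mechanically. The Python sketch below is one way to flag such entries, under stated assumptions: the `<package>` layout is assumed, the sample license strings are invented for illustration, and the keyword test for citation requests is a crude heuristic that will produce false matches.

```python
import xml.etree.ElementTree as ET


def license_problem(pkg):
    """Return a short reason if the license metadata looks unusable, else None."""
    value = pkg.get("license")
    if value is None:
        return "no license attribute"
    if not value.strip():
        return "empty license value"
    # Crude heuristic (an assumption): wording that asks for a citation.
    lowered = value.lower()
    if "cite" in lowered or "citation" in lowered:
        return "citation request, not a license"
    return None


def flag_packages(index_xml):
    """Map package id to the reason it was flagged."""
    root = ET.fromstring(index_xml.strip())
    flagged = {}
    for pkg in root.iter("package"):
        reason = license_problem(pkg)
        if reason:
            flagged[pkg.get("id")] = reason
    return flagged


# Hypothetical entries; the license strings are made up for this example.
SAMPLE = """
<packages>
  <package id="punkt" name="Punkt Tokenizer Models" />
  <package id="nonbreaking_prefixes" name="Non-Breaking Prefixes" license="" />
  <package id="cess_esp" name="CESS-ESP Treebank" license="Please cite the corpus authors." />
  <package id="words" name="Word Lists" license="Public Domain" />
</packages>
"""
print(flag_packages(SAMPLE))
```

A check like this cannot detect an "ambiguous" license; those cases still need a human to read the upstream terms.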

---

## Packages with Citation Requests Instead of Licenses

> **Note:**
> These packages specifically request citation for use, but do not provide a license. Citation requests are not a license.

- cess_cat — CESS-CAT Treebank
- cess_esp — CESS-ESP Treebank

---

## Packages Citing Source Website or “See Website” for Terms

> **Note:**
> These packages refer users to an external website for their licensing terms.

- conll2002 — CONLL 2002 Named Entity Recognition Corpus
- conll2007 — Dependency Treebanks from CoNLL 2007 (Catalan and Basque Subset)
- ieer — NIST IE-ER DATA SAMPLE
- reuters — The Reuters-21578 benchmark corpus, ApteMod version

---

## Maintenance

**If you add, update, or remove any data packages, update this file accordingly to ensure continued transparency and compliance.**
If you find omissions, errors, or outdated information, please open an issue or pull request.
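
One way to notice when this file has drifted from the index is to compare package ids on both sides. The Python sketch below assumes that list items follow the `- id (dash) name` pattern used throughout this file and that the index holds one `<package>` element per entry; the function names are hypothetical, and no such script is known to exist in the repository.

```python
import re
import xml.etree.ElementTree as ET

# A list item starts with "- ", then the package id, then a U+2014 dash
# surrounded by spaces. The id is everything up to the first whitespace.
ITEM = re.compile(r"^- (\S+) \u2014 ", re.MULTILINE)


def documented_ids(markdown_text):
    """Collect package ids that appear as list items in the Markdown text."""
    return set(ITEM.findall(markdown_text))


def index_ids(index_xml):
    """Collect package ids declared in the index (layout is assumed)."""
    root = ET.fromstring(index_xml.strip())
    return {pkg.get("id") for pkg in root.iter("package")}


def compare(index_xml, markdown_text):
    """Return (ids missing from the document, ids no longer in the index)."""
    in_index, in_doc = index_ids(index_xml), documented_ids(markdown_text)
    return sorted(in_index - in_doc), sorted(in_doc - in_index)
```

Items that list several ids on one line, such as the `wordnet2021 / wordnet2022 / english_wordnet` entry, do not match this pattern and would need separate handling.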

---
