How to contribute to Datasets?

Fork the repository by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.

Clone your fork to your local disk, and add the base repository as a remote:

```
git clone git@github.com:<your Github handle>/datasets.git
cd datasets
git remote add upstream https://github.com/huggingface/datasets.git
```

Create a new branch to hold your development changes:
```
git checkout -b a-descriptive-name-for-my-changes
```
Do not work on the master branch.
Set up a development environment by running the following command in a virtual environment:
```
pip install -e ".[dev]"
```
(If datasets was already installed in the virtual environment, remove it with pip uninstall datasets before reinstalling it in editable mode with the -e flag.)
Develop the features on your branch. If you want to add a dataset, see the more detailed instructions in the section How-To-Add a dataset. Alternatively, you can follow the steps to add a dataset and share a dataset in the documentation.
Format your code. Run black and isort so that your newly added files look nice with the following command:
```
make style
```
Once you're happy with your dataset script file, add your changes and make a commit to record your changes locally:
```
git add datasets/<your_dataset_name>
git commit
```
It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes:
```
git fetch upstream
git rebase upstream/master
```
Push the changes to your account using:
```
git push -u origin a-descriptive-name-for-my-changes
```
Once you are satisfied, go to the webpage of your fork on GitHub. Click on "Pull request" to send your changes to the project maintainers for review.

How-To-Add a dataset

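Make sure you have completed the first steps of the section How to contribute to Datasets? above: fork and clone the repository, create a branch, and set up the development environment.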
Create your dataset folder under datasets/<your_dataset_name> and create your dataset script under datasets/<your_dataset_name>/<your_dataset_name>.py. You can check out other dataset scripts under datasets for some inspiration. Note on naming: the dataset class should be camel case, while the dataset name is its snake case equivalent (ex: class BookCorpus(datasets.GeneratorBasedBuilder) for the dataset book_corpus).
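The naming convention above can be illustrated with a small conversion helper. This is only a minimal sketch: class_name_to_dataset_name is a hypothetical function written for this guide, not part of the library, and it assumes a simple camel-case name in which every capital letter starts a new word.

```python
import re


def class_name_to_dataset_name(class_name: str) -> str:
    """Convert a camel-case class name to its snake-case equivalent.

    Hypothetical helper for illustration only; it assumes every capital
    letter starts a new word, so acronyms are not handled specially.
    """
    # Insert an underscore before each capital letter except the first one,
    # then lowercase the whole name.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", class_name).lower()


print(class_name_to_dataset_name("BookCorpus"))  # book_corpus
```

If the name printed for your class does not match your dataset folder name, one of the two probably needs renaming.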
Make sure you run all of the following commands from the root of your datasets git clone. To check that your dataset works correctly and to create its dataset_infos.json file, run the command:
```
python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
```
If the command was successful, you should now create some dummy data. Use the following command to get detailed instructions on how to create the dummy data:
```
python datasets-cli dummy_data datasets/<your-dataset-folder>
```
There is a tool that automatically generates dummy data for you. At the moment it supports data files in the following formats: txt, csv, tsv, jsonl, json, xml. If the extensions of the raw data files of your dataset are in this list, then you can automatically generate your dummy data with:
```
python datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
```
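Before trying the automatic generation, it can help to check which of your raw data files fall outside the supported list. The sketch below is an illustration under stated assumptions: unsupported_files is a hypothetical helper, the file names are made up, and the check only looks at the final extension of each name, which may not match exactly how the tool itself inspects files.

```python
from pathlib import Path

# Extensions listed above as supported by automatic dummy data generation.
SUPPORTED_EXTENSIONS = {"txt", "csv", "tsv", "jsonl", "json", "xml"}


def unsupported_files(filenames):
    """Return the file names whose final extension is not in the supported list.

    Hypothetical helper for illustration; not part of datasets-cli.
    """
    return [
        name
        for name in filenames
        if Path(name).suffix.lstrip(".").lower() not in SUPPORTED_EXTENSIONS
    ]


raw_files = ["train.csv", "dev.jsonl", "images.tar.gz"]  # assumed example names
print(unsupported_files(raw_files))  # ['images.tar.gz']
```

An empty result suggests the automatic generation is worth trying; otherwise you will likely need to create the dummy data manually.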

Now test that both the real data and the dummy data work correctly using the following commands:

For the real data:

```
RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_real_dataset_<your-dataset-name>
```

For the dummy data:

```
RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_dataset_all_configs_<your-dataset-name>
```

If all tests pass, your dataset works correctly. Awesome! You can now format your code, commit and push your changes, and open a pull request, as described in the section How to contribute to Datasets? above. If you experience problems with the dummy data tests, take a look at the section Help for dummy data tests below.

Help for dummy data tests

Follow these steps if the dummy data tests keep failing:

Verify that all filenames are spelled correctly. Rerun the command
```
python datasets-cli dummy_data datasets/<your-dataset-folder>
```
and make sure you follow the exact instructions it prints.
Your dataset script might require a complex dummy data structure. In this case, make sure you fully understand the data folder logic created by the function _split_generators(...) and expected by the function _generate_examples(...) of your dataset script. Also take a look at tests/README.md, which lists different possible cases of how the dummy data should be created.
If the dummy data tests still fail, open a PR in the repo anyway and note in the description that you need help creating the dummy data.
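
To make the folder logic mentioned above more concrete, here is a minimal sketch of the contract between the two functions. It deliberately does not use the library: split_generators and generate_examples are plain-Python stand-ins for _split_generators(...) and _generate_examples(...), and the file names and the two-column CSV layout are assumptions chosen for the example. The point is that the dummy data folder must contain every file the first function hands to the second.

```python
import os
import tempfile


def split_generators(data_dir):
    """Stand-in for _split_generators: map each split to the file it reads.

    The file names are assumptions made for this example.
    """
    return {
        "train": os.path.join(data_dir, "train.csv"),
        "test": os.path.join(data_dir, "test.csv"),
    }


def generate_examples(filepath):
    """Stand-in for _generate_examples: yield (key, example) pairs from one file."""
    with open(filepath, encoding="utf-8") as f:
        for idx, line in enumerate(f):
            text, label = line.rstrip("\n").split(",")
            yield idx, {"text": text, "label": label}


def missing_dummy_files(data_dir):
    """List the splits whose expected file is absent from the dummy data folder."""
    return [
        split
        for split, path in split_generators(data_dir).items()
        if not os.path.isfile(path)
    ]


# Build a dummy data folder that contains only the train file.
with tempfile.TemporaryDirectory() as dummy_dir:
    with open(os.path.join(dummy_dir, "train.csv"), "w", encoding="utf-8") as f:
        f.write("hello,0\nworld,1\n")

    print(missing_dummy_files(dummy_dir))  # ['test']
    train_path = split_generators(dummy_dir)["train"]
    print(list(generate_examples(train_path)))
```

In this sketch the missing test file is exactly the kind of mismatch that makes a dummy data test fail: the script expects a path that the dummy data folder does not provide.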

If you're looking for more details about creating dataset scripts, please refer to the documentation.