-
Fork the repository by clicking on the 'Fork' button on the repository's page. This creates a copy of the code under your GitHub user account.
-
Clone your fork to your local disk, and add the base repository as a remote:
git clone [email protected]:<your Github handle>/datasets.git cd datasets git remote add upstream https://github.com/huggingface/datasets.git
-
Create a new branch to hold your development changes:
git checkout -b a-descriptive-name-for-my-changes
do not work on the
master
branch. -
Set up a development environment by running the following command in a virtual environment:
pip install -e ".[dev]"
(If datasets was already installed in the virtual environment, remove it with
pip uninstall datasets
before reinstalling it in editable mode with the-e
flag.) -
Develop the features on your branch. If you want to add a dataset see more in-detail intsructions in the section How to add a dataset. Alternatively, you can follow the steps to add a dataset and share a dataset in the documentation.
-
Format your code. Run black and isort so that your newly added files look nice with the following command:
make style
-
Once you're happy with your dataset script file, add your changes and make a commit to record your changes locally:
git add datasets/<your_dataset_name> git commit
It is a good idea to sync your copy of the code with the original repository regularly. This way you can quickly account for changes:
git fetch upstream git rebase upstream/master
Push the changes to your account using:
git push -u origin a-descriptive-name-for-my-changes
-
Once you are satisfied, go the webpage of your fork on GitHub. Click on "Pull request" to send your to the project maintainers for review.
-
Make sure you followed steps 1-4 of the section How to contribute to datasets?.
-
Create your dataset folder under
datasets/<your_dataset_name>
and create your dataset script underdatasets/<your_dataset_name>/<your_dataset_name>.py
. You can check out other dataset scripts underdatasets
for some inspiration. Note on naming: the dataset class should be camel case, while the dataset name is its snake case equivalent (ex:class BookCorpus(datasets.GeneratorBasedBuilder)
for the datasetbook_corpus
). -
Make sure you run all of the following commands from the root of your
datasets
git clone. To check that your dataset works correctly and to create itsdataset_infos.json
file run the command:python datasets-cli test datasets/<your-dataset-folder> --save_infos --all_configs
-
If the command was succesful, you should now create some dummy data. Use the following command to get in-detail instructions on how to create the dummy data:
python datasets-cli dummy_data datasets/<your-dataset-folder>
There is a tool that automatically generates dummy data for you. At the moment it supports data files in the following format: txt, csv, tsv, jsonl, json, xml. If the extensions of the raw data files of your dataset are in this list, then you can automatically generate your dummy data with:
python datasets-cli dummy_data datasets/<your-dataset-folder> --auto_generate
-
Now test that both the real data and the dummy data work correctly using the following commands:
For the real data:
RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_real_dataset_<your-dataset-name>
and
For the dummy data:
RUN_SLOW=1 pytest tests/test_dataset_common.py::LocalDatasetTest::test_load_dataset_all_configs_<your-dataset-name>
-
If all tests pass, your dataset works correctly. Awesome! You can now follow steps 6, 7 and 8 of the section How to contribute to 🤗Datasets?. If you experience problems with the dummy data tests, you might want to take a look at the section Help for dummy data tests below.
Follow these steps in case the dummy data test keeps failing:
-
Verify that all filenames are spelled correctly. Rerun the command
python datasets-cli dummy_data datasets/<your-dataset-folder>
and make sure you follow the exact instructions provided by the command of step 5).
-
Your datascript might require a difficult dummy data structure. In this case make sure you fully understand the data folder logit created by the function
_split_generators(...)
and expected by the function_generate_examples(...)
of your dataset script. Also take a look attests/README.md
which lists different possible cases of how the dummy data should be created. -
If the dummy data tests still fail, open a PR in the repo anyways and make a remark in the description that you need help creating the dummy data.
If you're looking for more details about dataset scripts creation, please refer to the documentation.