
Unable to use load_from_disk function in pretraining #74

Open
ngupta-slb opened this issue Jun 17, 2024 · 2 comments
Labels
bug Something isn't working

Comments


ngupta-slb commented Jun 17, 2024

I am trying to run the pretraining scripts and am encountering the following error while loading the datasets from disk.

```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[2024-06-17 21:27:11,765][datasets][INFO] - PyTorch version 2.3.1 available.
[2024-06-17 21:27:11,767][datasets][INFO] - JAX version 0.4.29 available.
Error executing job with overrides: ['run_name=first_run', 'model=moirai_small', 'data=lotsa_v1_unweighted']
Traceback (most recent call last):
  File "/naveen/uni2ts/cli/train.py", line 130, in main
    train_dataset: Dataset = instantiate(cfg.data).load_dataset(
  File "/naveen/uni2ts/src/uni2ts/data/builder/_base.py", line 53, in load_dataset
    [builder.load_dataset(transform_map) for builder in self.builders]
  File "/naveen/uni2ts/src/uni2ts/data/builder/_base.py", line 53, in <listcomp>
    [builder.load_dataset(transform_map) for builder in self.builders]
  File "/naveen/uni2ts/src/uni2ts/data/builder/lotsa_v1/_base.py", line 58, in load_dataset
    datasets = [
  File "/naveen/uni2ts/src/uni2ts/data/builder/lotsa_v1/_base.py", line 61, in <listcomp>
    load_from_disk(self.storage_path / dataset), uniform=self.uniform
  File "/naveen/uni2ts/venv/lib/python3.10/site-packages/datasets/load.py", line 2663, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /uni2ts/lotsa_data/cmip6_1855 is neither a Dataset directory nor a DatasetDict directory.
```

Reproduce the error

1. Downloaded only a small fraction of the data using the following command:

   ```
   huggingface-cli download Salesforce/lotsa_data cmip6_1855/data-00001-of-00096.arrow cmip6_1850/data-00001-of-00096.arrow --repo-type=dataset --local-dir /naveen/uni2ts/lotsa_data
   ```

2. Modified the yaml file at uni2ts/cli/conf/pretrain/data/lotsa_v1_unweighted.yaml to include only this dataset.

3. Ran the training command:

   ```
   python3 -m cli.train -cp conf/pretrain run_name=first_run model=moirai_small data=lotsa_v1_unweighted
   ```

Python version - 3.10.14

Could you please suggest why it is unable to load the data? The Hugging Face load_from_disk API states that it loads datasets previously saved with save_to_disk; however, I do not see save_to_disk being called anywhere before load_from_disk. Please advise how to fix this issue.

Thank you very much!
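For reference, the check behind this FileNotFoundError can be sketched as follows (an illustration of the `datasets` on-disk layout, not the library's actual code: `save_to_disk` writes a `state.json` marker for a `Dataset` and a `dataset_dict.json` for a `DatasetDict`, so a folder containing only a lone `.arrow` shard has neither):

```python
from pathlib import Path


def classify_dataset_dir(path) -> str:
    """Sketch of the directory check performed by datasets.load_from_disk.

    A Dataset directory is marked by state.json, a DatasetDict directory by
    dataset_dict.json; a directory holding only raw .arrow shards has neither
    marker and is rejected with the FileNotFoundError seen in the traceback.
    """
    p = Path(path)
    if (p / "state.json").is_file():
        return "Dataset"
    if (p / "dataset_dict.json").is_file():
        return "DatasetDict"
    raise FileNotFoundError(
        f"Directory {p} is neither a Dataset directory nor a DatasetDict directory."
    )
```

This is why downloading a single arrow shard into `lotsa_data/cmip6_1855/` is not enough: the metadata files written by `save_to_disk` never arrive.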

@ngupta-slb ngupta-slb added the bug Something isn't working label Jun 17, 2024
ngupta-slb (Author) commented

@gorold @liu-jc
Could you please look into this issue?

gorold (Contributor) commented Jun 27, 2024

Hey, it might be due to the data being in the wrong directory. The recommended approach is:

```
huggingface-cli download Salesforce/lotsa_data --repo-type=dataset --local-dir PATH_TO_SAVE
```

which downloads the data into lotsa_data/cmip6_1855/data-00001-of-00096.arrow and so on. You'll need to arrange the data files in this layout. Also, I'm not sure whether you can partially load these files; you may want to try it out on a smaller dataset with only a single arrow file.
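If cloning the full repository is too large, one alternative worth noting: `huggingface_hub`'s `snapshot_download` accepts `allow_patterns` globs, so a few complete dataset directories (metadata plus every shard) can be fetched rather than single arrow files. A sketch, where `lotsa_subset_patterns` is a hypothetical helper (not part of uni2ts or huggingface_hub):

```python
def lotsa_subset_patterns(dataset_names):
    """Build glob patterns selecting whole dataset directories, so each
    download includes the metadata files alongside every arrow shard."""
    return [f"{name}/*" for name in dataset_names]


# Hypothetical usage (network download, shown as a comment):
#   from huggingface_hub import snapshot_download
#   snapshot_download(
#       repo_id="Salesforce/lotsa_data",
#       repo_type="dataset",
#       allow_patterns=lotsa_subset_patterns(["cmip6_1850", "cmip6_1855"]),
#       local_dir="lotsa_data",
#   )
```

This keeps each dataset directory complete, which is what `load_from_disk` requires.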
