
Unable to use load_from_disk function in pretraining #74

Open
ngupta-slb opened this issue Jun 17, 2024 · 2 comments
Labels
bug Something isn't working

Comments


ngupta-slb commented Jun 17, 2024

I am trying to run the pretraining scripts and am encountering the following error while loading the datasets from disk.

```
GPU available: True (cuda), used: True
TPU available: False, using: 0 TPU cores
HPU available: False, using: 0 HPUs
[2024-06-17 21:27:11,765][datasets][INFO] - PyTorch version 2.3.1 available.
[2024-06-17 21:27:11,767][datasets][INFO] - JAX version 0.4.29 available.
Error executing job with overrides: ['run_name=first_run', 'model=moirai_small', 'data=lotsa_v1_unweighted']
Traceback (most recent call last):
  File "/naveen/uni2ts/cli/train.py", line 130, in main
    train_dataset: Dataset = instantiate(cfg.data).load_dataset(
  File "/naveen/uni2ts/src/uni2ts/data/builder/_base.py", line 53, in load_dataset
    [builder.load_dataset(transform_map) for builder in self.builders]
  File "/naveen/uni2ts/src/uni2ts/data/builder/_base.py", line 53, in <listcomp>
    [builder.load_dataset(transform_map) for builder in self.builders]
  File "/naveen/uni2ts/src/uni2ts/data/builder/lotsa_v1/_base.py", line 58, in load_dataset
    datasets = [
  File "/naveen/uni2ts/src/uni2ts/data/builder/lotsa_v1/_base.py", line 61, in <listcomp>
    load_from_disk(self.storage_path / dataset), uniform=self.uniform
  File "/naveen/uni2ts/venv/lib/python3.10/site-packages/datasets/load.py", line 2663, in load_from_disk
    raise FileNotFoundError(
FileNotFoundError: Directory /uni2ts/lotsa_data/cmip6_1855 is neither a Dataset directory nor a DatasetDict directory.
```

Reproduce the error

1. Downloaded only a small fraction of the data using the following command:

   ```
   huggingface-cli download Salesforce/lotsa_data cmip6_1855/data-00001-of-00096.arrow cmip6_1850/data-00001-of-00096.arrow --repo-type=dataset --local-dir /naveen/uni2ts/lotsa_data
   ```

2. Modified the yaml file at uni2ts/cli/conf/pretrain/data/lotsa_v1_unweighted.yaml to include only this dataset.

3. Ran the training command:

   ```
   python3 -m cli.train -cp conf/pretrain run_name=first_run model=moirai_small data=lotsa_v1_unweighted
   ```

Python version - 3.10.14

Could you please suggest why it is unable to load the data? The Hugging Face load_from_disk API states that it loads datasets previously saved with save_to_disk; however, I do not see save_to_disk being called anywhere before load_from_disk. Please advise how to fix this issue.

Thank you very much!
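For reference, the check behind this FileNotFoundError can be sketched as follows (an illustration of the `datasets` on-disk layout, not the library's actual code: `save_to_disk` writes a `state.json` marker for a `Dataset` and a `dataset_dict.json` for a `DatasetDict`, so a folder containing only a lone `.arrow` shard has neither):

```python
from pathlib import Path


def classify_dataset_dir(path) -> str:
    """Sketch of the directory check performed by datasets.load_from_disk.

    A Dataset directory is marked by state.json, a DatasetDict directory by
    dataset_dict.json; a directory holding only raw .arrow shards has neither
    marker and is rejected with the FileNotFoundError seen in the traceback.
    """
    p = Path(path)
    if (p / "state.json").is_file():
        return "Dataset"
    if (p / "dataset_dict.json").is_file():
        return "DatasetDict"
    raise FileNotFoundError(
        f"Directory {p} is neither a Dataset directory nor a DatasetDict directory."
    )
```

This is why downloading a single arrow shard into `lotsa_data/cmip6_1855/` is not enough: the metadata files written by `save_to_disk` never arrive.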

@ngupta-slb ngupta-slb added the bug Something isn't working label Jun 17, 2024
ngupta-slb (Author) commented

@gorold @liu-jc
Could you please look into this issue?

gorold (Contributor) commented Jun 27, 2024

Hey, it might be due to the data being in the wrong directory. The recommended approach is:

```
huggingface-cli download Salesforce/lotsa_data --repo-type=dataset --local-dir PATH_TO_SAVE
```

which downloads the data into lotsa_data/cmip6_1855/data-00001-of-00096.arrow and so on. You'll need to arrange the data files in this layout. Also, I'm not sure whether you can partially load these files; you may want to try it out on a smaller dataset with only a single arrow file.
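If cloning the full repository is too large, one alternative worth noting: `huggingface_hub`'s `snapshot_download` accepts `allow_patterns` globs, so a few complete dataset directories (metadata plus every shard) can be fetched rather than single arrow files. A sketch, where `lotsa_subset_patterns` is a hypothetical helper (not part of uni2ts or huggingface_hub):

```python
def lotsa_subset_patterns(dataset_names):
    """Build glob patterns selecting whole dataset directories, so each
    download includes the metadata files alongside every arrow shard."""
    return [f"{name}/*" for name in dataset_names]


# Hypothetical usage (network download, shown as a comment):
#   from huggingface_hub import snapshot_download
#   snapshot_download(
#       repo_id="Salesforce/lotsa_data",
#       repo_type="dataset",
#       allow_patterns=lotsa_subset_patterns(["cmip6_1850", "cmip6_1855"]),
#       local_dir="lotsa_data",
#   )
```

This keeps each dataset directory complete, which is what `load_from_disk` requires.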
