FileNotFoundError: [Errno 2] No such file or directory when processing a 32k model length with ulysses #53

@aishoot

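When processing a 32k model length in multi-threaded mode with ulysses, I get the error below. Other lengths, such as 4k, do not trigger it. (Note: the job is launched with torchrun.)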

@HaoshengZou @cizhenshi @gom168 any guidance would be much appreciated.

Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 69397 examples [00:21, 3298.92 examples/s]
Generating train split: 69397 examples [00:21, 3190.59 examples/s]
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 1885, in _prepare_split_single
    num_examples, num_bytes = writer.finalize()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/arrow_writer.py", line 602, in finalize
    self.stream.close()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/fsspec/implementations/local.py", line 440, in close
    return self.f.close()
FileNotFoundError: [Errno 2] No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 889, in incomplete_dir
    yield tmp_dir
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 999, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 1740, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 1896, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "360-LLaMA-Factory/src/train.py", line 28, in <module>
    main()
  File "360-LLaMA-Factory/src/train.py", line 19, in main
    run_exp()
  File "360-LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "360-LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 47, in run_sft
    dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
  File "360-LLaMA-Factory/src/llamafactory/data/loader.py", line 295, in get_dataset
    dataset = _get_merged_dataset(data_args.dataset, model_args, data_args, training_args, stage)
  File "360-LLaMA-Factory/src/llamafactory/data/loader.py", line 170, in _get_merged_dataset
    datasets.append(_load_single_dataset(dataset_attr, model_args, data_args, training_args))
  File "360-LLaMA-Factory/src/llamafactory/data/loader.py", line 121, in _load_single_dataset
    dataset = load_dataset(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/load.py", line 2096, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 935, in download_and_prepare
    self._save_info()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 896, in incomplete_dir
    shutil.rmtree(tmp_dir)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/shutil.py", line 731, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/shutil.py", line 729, in rmtree
    os.rmdir(path)
FileNotFoundError: [Errno 2] No such file or directory: '360-LLaMA-Factory/.cache/mydata/default/0.0.0/812eb925cd599c82.incomplete'
[2025-05-06 00:42:55,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1731 closing signal SIGTERM
[2025-05-06 00:42:55,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1732 closing signal SIGTERM
[2025-05-06 00:42:55,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1733 closing signal SIGTERM
[2025-05-06 00:42:55,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1734 closing signal SIGTERM
[2025-05-06 00:42:55,533] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1735 closing signal SIGTERM
[2025-05-06 00:42:55,533] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1736 closing signal SIGTERM
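
The traceback alone does not show the root cause, and nothing in it explains why only the 32k setting is affected. One possible reading, which is an assumption and not confirmed in this thread, is that several torchrun workers build the same dataset cache directory at the same time, so one worker removes or renames the `.incomplete` directory while another is still writing to it. The sketch below illustrates the general idea of letting a single process build a cache directory while the others wait. It uses only the Python standard library; `exclusive_lock`, `prepare_once`, and the `build` callback are hypothetical names for illustration and are not part of `datasets` or 360-LLaMA-Factory.

```python
import os
import time
from contextlib import contextmanager


@contextmanager
def exclusive_lock(lock_path, poll_seconds=0.05, timeout_seconds=600.0):
    """Hold a lock represented by a file that is created atomically with O_EXCL."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        try:
            # O_CREAT | O_EXCL fails if the file already exists, so only one
            # caller at a time can get past this line.
            fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
            break
        except FileExistsError:
            if time.monotonic() > deadline:
                raise TimeoutError(f"could not acquire {lock_path}")
            time.sleep(poll_seconds)
    try:
        yield
    finally:
        os.close(fd)
        os.remove(lock_path)


def prepare_once(cache_dir, build):
    """Call build(tmp_dir) in at most one caller; later callers reuse cache_dir."""
    cache_dir = os.path.abspath(cache_dir)
    os.makedirs(os.path.dirname(cache_dir), exist_ok=True)
    with exclusive_lock(cache_dir + ".lock"):
        if not os.path.isdir(cache_dir):
            # Write into a temporary directory first, then rename it, so a
            # finished cache_dir never contains partial output.
            tmp_dir = cache_dir + ".incomplete"
            os.makedirs(tmp_dir, exist_ok=True)
            build(tmp_dir)
            os.rename(tmp_dir, cache_dir)
    return cache_dir
```

Every worker would call `prepare_once(path, build)` with the same `path`; the first one to take the lock runs `build`, and the rest find the finished directory and skip the work. This is only a sketch of the pattern: a lock file left behind by a crashed process has to be removed by hand, and whether the failure in this issue is such a race still needs to be confirmed.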
