forked from hiyouga/LlamaFactory
Closed
Description
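When processing a 32k model length with multiple threads in ulysses mode, the error below occurs. Other lengths, such as 4k, do not trigger it. (Note: the job is launched with torchrun.)
@HaoshengZou @cizhenshi @gom168 could you please take a look and advise?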
Generating train split: 0 examples [00:00, ? examples/s]
Generating train split: 69397 examples [00:21, 3298.92 examples/s]
Generating train split: 69397 examples [00:21, 3190.59 examples/s]
Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 1885, in _prepare_split_single
    num_examples, num_bytes = writer.finalize()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/arrow_writer.py", line 602, in finalize
    self.stream.close()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/fsspec/implementations/local.py", line 440, in close
    return self.f.close()
FileNotFoundError: [Errno 2] No such file or directory

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 889, in incomplete_dir
    yield tmp_dir
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 924, in download_and_prepare
    self._download_and_prepare(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 999, in _download_and_prepare
    self._prepare_split(split_generator, **prepare_split_kwargs)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 1740, in _prepare_split
    for job_id, done, content in self._prepare_split_single(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 1896, in _prepare_split_single
    raise DatasetGenerationError("An error occurred while generating the dataset") from e
datasets.exceptions.DatasetGenerationError: An error occurred while generating the dataset

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "360-LLaMA-Factory/src/train.py", line 28, in <module>
    main()
  File "360-LLaMA-Factory/src/train.py", line 19, in main
    run_exp()
  File "360-LLaMA-Factory/src/llamafactory/train/tuner.py", line 50, in run_exp
    run_sft(model_args, data_args, training_args, finetuning_args, generating_args, callbacks)
  File "360-LLaMA-Factory/src/llamafactory/train/sft/workflow.py", line 47, in run_sft
    dataset_module = get_dataset(template, model_args, data_args, training_args, stage="sft", **tokenizer_module)
  File "360-LLaMA-Factory/src/llamafactory/data/loader.py", line 295, in get_dataset
    dataset = _get_merged_dataset(data_args.dataset, model_args, data_args, training_args, stage)
  File "360-LLaMA-Factory/src/llamafactory/data/loader.py", line 170, in _get_merged_dataset
    datasets.append(_load_single_dataset(dataset_attr, model_args, data_args, training_args))
  File "360-LLaMA-Factory/src/llamafactory/data/loader.py", line 121, in _load_single_dataset
    dataset = load_dataset(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/load.py", line 2096, in load_dataset
    builder_instance.download_and_prepare(
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 935, in download_and_prepare
    self._save_info()
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/contextlib.py", line 137, in __exit__
    self.gen.throw(typ, value, traceback)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/site-packages/datasets/builder.py", line 896, in incomplete_dir
    shutil.rmtree(tmp_dir)
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/shutil.py", line 731, in rmtree
    onerror(os.rmdir, path, sys.exc_info())
  File "/home/ma-user/anaconda3/envs/PyTorch-2.1.0/lib/python3.9/shutil.py", line 729, in rmtree
    os.rmdir(path)
FileNotFoundError: [Errno 2] No such file or directory: '360-LLaMA-Factory/.cache/mydata/default/0.0.0/812eb925cd599c82.incomplete'
[2025-05-06 00:42:55,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1731 closing signal SIGTERM
[2025-05-06 00:42:55,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1732 closing signal SIGTERM
[2025-05-06 00:42:55,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1733 closing signal SIGTERM
[2025-05-06 00:42:55,532] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1734 closing signal SIGTERM
[2025-05-06 00:42:55,533] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1735 closing signal SIGTERM
[2025-05-06 00:42:55,533] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 1736 closing signal SIGTERM