Hi, I am trying to replicate the training, but I encountered an error that I cannot solve. I replaced the model name in mono_ft.sh with your model name, haoranxu/ALMA-7B-Pretrain. Then I ran the script and saw this error:
pytorch_model.bin.index.json: 100%|█████████████████████████████████████████████████████████████████████████████████████| 26.8k/26.8k [00:00<00:00, 203MB/s]
Downloading shards: 0%| | 0/3 [00:00<?, ?it/s][INFO|modeling_utils.py:3257] 2024-03-27 20:40:50,775 >> loading weights file pytorch_model.bin from cache at .cache/models/models--haoranxu--ALMA-7B-Pretrain/snapshots/a00b4a7a96c38117ac6a4e3e7228e7b06ba992ff/pytorch_model.bin.index.json
pytorch_model-00001-of-00003.bin: 100%|████████████████████████████████████████████████████████████████████████████████| 9.88G/9.88G [04:49<00:00, 34.1MB/s]
Downloading shards: 0%| | 0/3 [04:50<?, ?it/s]
Traceback (most recent call last):
File "/s/mlsc/hxue3/alma_experiments/ALMA/run_llmmt.py", line 226, in <module>
File "/s/mlsc/hxue3/alma_experiments/ALMA/run_llmmt.py", line 155, in main
File "/s/mlsc/hxue3/alma_experiments/ALMA/utils/utils.py", line 350, in load_model
File "/s/hxue3/.local/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 561, in from_pretrained
return model_class.from_pretrained(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/s/hxue3/.local/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3264, in from_pretrained
resolved_archive_file, sharded_metadata = get_checkpoint_shard_files(
^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/s/hxue3/.local/lib/python3.11/site-packages/transformers/utils/hub.py", line 1038, in get_checkpoint_shard_files
cached_filename = cached_file(
^^^^^^^^^^^^
File "/s/hxue3/.local/lib/python3.11/site-packages/transformers/utils/hub.py", line 398, in cached_file
resolved_file = hf_hub_download(
^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/utils/_validators.py", line 119, in _inner_fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py", line 1504, in hf_hub_download
_chmod_and_replace(temp_file.name, blob_path)
File "/usr/local/lib/python3.11/dist-packages/huggingface_hub/file_download.py", line 1724, in _chmod_and_replace
os.chmod(src, stat.S_IMODE(cache_dir_mode))
FileNotFoundError: [Errno 2] No such file or directory: '/s/mlsc/hxue3/alma_experiments/ALMA/.cache/models/tmprvwvokuy'
[2024-03-27 20:45:42,972] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 581 closing signal SIGTERM9M/9.88G [00:01<02:29, 65.5MB/s]
[2024-03-27 20:45:42,972] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 582 closing signal SIGTERM
[2024-03-27 20:45:42,973] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 583 closing signal SIGTERM
[2024-03-27 20:45:43,614] torch.distributed.elastic.multiprocessing.api: [ERROR] failed (exitcode: 1) local_rank: 3 (pid: 584) of binary: /usr/bin/python3
Any idea why?
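One thing that may be worth trying, assuming (this is not confirmed) that the error comes from several ranks downloading the same shard into the shared cache at once: fetch the checkpoint once in a single process before launching the distributed run. The sketch below uses `snapshot_download` from `huggingface_hub`; the helper name `prefetch_model` and its `download` parameter are hypothetical and exist only for illustration.

```python
def prefetch_model(repo_id, cache_dir, download=None):
    """Download a full model snapshot into cache_dir from a single process.

    `download` is a hypothetical hook so the function can be exercised
    without network access; by default it is huggingface_hub.snapshot_download.
    """
    if download is None:
        # Imported lazily so the module loads even without huggingface_hub.
        from huggingface_hub import snapshot_download as download
    # Returns the local path of the downloaded snapshot folder.
    return download(repo_id=repo_id, cache_dir=cache_dir)
```

Running `prefetch_model("haoranxu/ALMA-7B-Pretrain", ".cache/models")` once from the ALMA directory, and only then starting mono_ft.sh, should let every rank read the shards from the cache instead of downloading them. If the error persists with a fully populated cache, the cause is likely something else.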