The following error occurs during multi-node, multi-GPU distributed training with a deepspeed hostfile:

```
Traceback (most recent call last):
worker0:   File "finetune_XrayGLM.py", line 173, in <module>
worker0:     args = get_args(args_list)
worker0:   File "/home/sfz/soft/miniconda3/envs/test/lib/python3.8/site-packages/sat/arguments.py", line 360, in get_args
worker0:     raise ValueError(
worker0: ValueError: LOCAL_RANK (default 0) and args.device inconsistent. This can only happens in inference mode. Please use CUDA_VISIBLE_DEVICES=x for single-GPU training.
worker0: [2023-12-14 14:49:37,662] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 9305
worker0: [2023-12-14 14:49:37,663] [ERROR] [launch.py:321:sigkill_handler] ['/home/sfz/soft/miniconda3/envs/test/bin/python', '-u', 'finetune_XrayGLM.py', '--local_rank=0', '--experiment-name', 'finetune-CityGLM', '--model-parallel-size', '2', '--mode', 'finetune', '--train-iters', '10000', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--pre_seq_len', '4', '--train-data', './data/changjing9/data.json', '--valid-data', './data/changjing9/data.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '2000', '--eval-interval', '2000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '4', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '6', '--skip-init', '--fp16', '--use_lora'] exits with return code = 1
```
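For context, here is a minimal sketch of the check that appears to be firing, reconstructed purely from the error text. This is not sat's actual source at `sat/arguments.py:360`; the function name, variable names, and control flow are assumptions for illustration only:

```python
import os

# Illustrative reconstruction (NOT sat's real code) of the consistency
# check raised in get_args. The deepspeed launcher exports LOCAL_RANK
# for every worker process; sat also resolves a device index for the
# process, and the two are expected to agree during training.
def check_device_consistency(args):
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    # If the resolved device disagrees with LOCAL_RANK, sat only
    # tolerates that in inference mode and raises otherwise.
    if getattr(args, "device", local_rank) != local_rank:
        raise ValueError(
            "LOCAL_RANK (default 0) and args.device inconsistent. "
            "This can only happens in inference mode. "
            "Please use CUDA_VISIBLE_DEVICES=x for single-GPU training."
        )
```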
Questions related to XrayGLM need to be resolved in the XrayGLM repository, since we are not familiar with how its code is written…