sat ValueError "inconsistent" during DeepSpeed distributed training #149

Open
elesun2018 opened this issue Dec 14, 2023 · 1 comment

Comments

@elesun2018

The following error occurs during multi-node, multi-GPU distributed training launched with a DeepSpeed hostfile:
Traceback (most recent call last):
worker0: File "finetune_XrayGLM.py", line 173, in <module>
worker0: args = get_args(args_list)
worker0: File "/home/sfz/soft/miniconda3/envs/test/lib/python3.8/site-packages/sat/arguments.py", line 360, in get_args
worker0: raise ValueError(
worker0: ValueError: LOCAL_RANK (default 0) and args.device inconsistent. This can only happens in inference mode. Please use CUDA_VISIBLE_DEVICES=x for single-GPU training.
worker0: [2023-12-14 14:49:37,662] [INFO] [launch.py:315:sigkill_handler] Killing subprocess 9305
worker0: [2023-12-14 14:49:37,663] [ERROR] [launch.py:321:sigkill_handler] ['/home/sfz/soft/miniconda3/envs/test/bin/python', '-u', 'finetune_XrayGLM.py', '--local_rank=0', '--experiment-name', 'finetune-CityGLM', '--model-parallel-size', '2', '--mode', 'finetune', '--train-iters', '10000', '--resume-dataloader', '--max_source_length', '64', '--max_target_length', '256', '--lora_rank', '10', '--pre_seq_len', '4', '--train-data', './data/changjing9/data.json', '--valid-data', './data/changjing9/data.json', '--distributed-backend', 'nccl', '--lr-decay-style', 'cosine', '--warmup', '.02', '--checkpoint-activations', '--save-interval', '2000', '--eval-interval', '2000', '--save', './checkpoints', '--split', '1', '--eval-iters', '10', '--eval-batch-size', '4', '--zero-stage', '1', '--lr', '0.0001', '--batch-size', '6', '--skip-init', '--fp16', '--use_lora'] exits with return code = 1
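
The error text suggests that, outside inference mode, the local rank and `args.device` are expected to agree. The snippet below is only a rough diagnostic sketch under that assumption, not sat's actual code: `describe_rank_env` and `rank_device_consistent` are hypothetical helper names, and the rule in the second function is inferred from the message alone. It might help to run it on each worker to see which launcher variables are set.

```python
import os


def describe_rank_env(env=None):
    """Collect launcher-related variables that usually drive rank/device selection."""
    env = os.environ if env is None else env
    keys = ("LOCAL_RANK", "RANK", "WORLD_SIZE", "CUDA_VISIBLE_DEVICES")
    # Missing variables are reported as None rather than raising KeyError.
    return {key: env.get(key) for key in keys}


def rank_device_consistent(local_rank, device, mode):
    """Hypothetical stand-in for the check implied by the error message:
    outside inference mode, the local rank and the device index must match."""
    return mode == "inference" or local_rank == device


if __name__ == "__main__":
    # Print what the launcher exported to this process.
    for key, value in describe_rank_env().items():
        print(f"{key}={value!r}")
```

If `LOCAL_RANK` or `CUDA_VISIBLE_DEVICES` differ between workers in an unexpected way, that may point to where the launcher's `--local_rank=0` and the device setting diverge.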

@1049451037
Member

Issues related to XrayGLM need to be resolved in the XrayGLM repository, because we don't really know how its code is written...
