“No backend type associated with device type cpu” when running cli_demo_sat.py #173

Open
yileld opened this issue Mar 5, 2024 · 5 comments

Comments

@yileld

yileld commented Mar 5, 2024

Traceback (most recent call last):
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 164, in <module>
    main()
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 36, in main
    model, model_args = AutoModel.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/sat/model/base_model.py", line 367, in from_pretrained
    mp_split_model_receive(model, use_node_group=use_node_group)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 91, in mp_split_model_receive
    iter_repartition(model)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 90, in iter_repartition
    iter_repartition(sub_module)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 90, in iter_repartition
    iter_repartition(sub_module)
  File "/usr/local/lib/python3.10/dist-packages/sat/mpu/operation.py", line 84, in iter_repartition
    torch.distributed.recv(sub_module.weight.data, src)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/c10d_logger.py", line 47, in wrapper
    return func(*args, **kwargs)
  File "/usr/local/lib/python3.10/dist-packages/torch/distributed/distributed_c10d.py", line 1640, in recv
    pg.recv([tensor], src, tag).wait()
RuntimeError: No backend type associated with device type cpu
[W ProcessGroupNCCL.cpp:1856] Warning: 0NCCL_AVOID_RECORD_STREAMS=1 has no effect for point-to-point collectives. (function operator())
[2024-03-05 14:32:43,744] torch.distributed.elastic.multiprocessing.api: [WARNING] Sending process 50878 closing signal SIGTERM

This used to run fine, but now it fails again. Has sat been updated again?
Current versions: torch=2.1.2, sat=0.4.11, transformers=4.38.2

For context on the traceback: sat's model-parallel weight split calls torch.distributed.recv on tensors that are still on the CPU, while the process group here only has the NCCL backend. Below is a minimal, hypothetical sketch (not the repo code; it assumes a two-process torchrun launch) that reproduces the same message.
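```python
# Minimal sketch (assumptions, not the repo code): a NCCL-only process group is
# asked to send/recv a CPU tensor, which NCCL cannot serve.
# Launch with: torchrun --nproc_per_node=2 this_file.py
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")  # NCCL handles CUDA tensors only
    t = torch.zeros(4)                       # CPU tensor
    if dist.get_rank() == 0:
        dist.send(t, dst=1)                  # -> "No backend type associated with device type cpu"
    else:
        dist.recv(t, src=0)
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```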

@1049451037
Member

If you want to run on CPU, please make sure CUDA_VISIBLE_DEVICES is empty.

For example (a sketch under the assumption that, with no visible GPUs, torch/sat will not pick the NCCL backend for the CPU run):
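```python
# Hide all GPUs so the whole run stays on CPU.
# (Assumption: set this before any CUDA call; safest is before importing torch,
#  e.g. via `CUDA_VISIBLE_DEVICES= python cli_demo_sat.py ...`.)
import os
os.environ["CUDA_VISIBLE_DEVICES"] = ""

import torch
print(torch.cuda.is_available())  # expected: False
```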

@yileld
Author

yileld commented Mar 5, 2024

> If you want to run on CPU, please make sure CUDA_VISIBLE_DEVICES is empty.

I actually want to run on GPU, but with quant 8 the AutoModel.from_pretrained() call starts out on the CPU, right?

@1049451037
Member

quant 8 does not currently support overwrite_args={'model_parallel_size'}.

@yileld
Author

yileld commented Mar 5, 2024

> quant 8 does not currently support overwrite_args={'model_parallel_size'}.

Then maybe I misremembered... So for now quant does not support multi-GPU inference, right?

Also, when I switch to bf16 I get this error:

Traceback (most recent call last):
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 164, in <module>
    main()
  File "/ssd/ylying/CogVLM/basic_demo/infer_dataset.py", line 36, in main
    model, model_args = AutoModel.from_pretrained(
  File "/usr/local/lib/python3.10/dist-packages/sat/model/base_model.py", line 368, in from_pretrained
    reset_random_seed(6)
  File "/usr/local/lib/python3.10/dist-packages/sat/arguments.py", line 572, in reset_random_seed
    assert _GLOBAL_RANDOM_SEED is not None, "You have not set random seed. No need to reset it."
AssertionError: You have not set random seed. No need to reset it.

@1049451037
Member

Yes, because I don't know how to split the quantized state evenly across different GPUs... it depends on the quantization algorithm.

To illustrate the difficulty (a conceptual sketch, not sat's actual quantization code): an INT8 layer carries both an int8 weight and its quantization scales, and any model-parallel split has to shard them consistently, which depends on how the quantization algorithm laid them out.
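```python
# Conceptual sketch (assumptions, not sat's code): an int8 weight with
# per-output-channel scales must be sharded in lockstep for model parallelism.
import torch

out_features, in_features, world_size = 8, 4, 2
w_int8 = torch.randint(-128, 128, (out_features, in_features), dtype=torch.int8)
scale = torch.rand(out_features)              # one scale per output channel

# Row-wise (output-channel) split: weight and scale are sliced together.
w_shards = w_int8.chunk(world_size, dim=0)
s_shards = scale.chunk(world_size, dim=0)
for rank, (w, s) in enumerate(zip(w_shards, s_shards)):
    dequant = w.float() * s[:, None]          # each rank can dequantize its own shard
    print(rank, dequant.shape)                # torch.Size([4, 4]) per rank
```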
