[Bug]: Distributed inference with vLLM and MInference: the example fails at startup when tensor_parallel_size is set greater than 1 #63
Labels
bug
Describe the bug
(VllmWorkerProcess pid=13977) Process VllmWorkerProcess:
(VllmWorkerProcess pid=13977) Traceback (most recent call last):
(VllmWorkerProcess pid=13977) File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
(VllmWorkerProcess pid=13977) self.run()
(VllmWorkerProcess pid=13977) File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=13977) self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=13977) File "/opt/conda/lib/python3.8/site-packages/vllm/executor/multiproc_worker_utils.py", line 211, in _run_worker_process
(VllmWorkerProcess pid=13977) worker = worker_factory()
(VllmWorkerProcess pid=13977) File "/opt/conda/lib/python3.8/site-packages/vllm/executor/gpu_executor.py", line 68, in _create_worker
(VllmWorkerProcess pid=13977) wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
(VllmWorkerProcess pid=13977) File "/opt/conda/lib/python3.8/site-packages/vllm/worker/worker_base.py", line 311, in init_worker
(VllmWorkerProcess pid=13977) self.worker = worker_class(*args, **kwargs)
(VllmWorkerProcess pid=13977) File "/opt/conda/lib/python3.8/site-packages/vllm/worker/worker.py", line 87, in init
(VllmWorkerProcess pid=13977) self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
(VllmWorkerProcess pid=13977) File "/opt/conda/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 196, in init
(VllmWorkerProcess pid=13977) self.attn_backend = get_attn_backend(
(VllmWorkerProcess pid=13977) File "/opt/conda/lib/python3.8/site-packages/vllm/attention/selector.py", line 51, in get_attn_backend
(VllmWorkerProcess pid=13977) backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
(VllmWorkerProcess pid=13977) File "/opt/conda/lib/python3.8/site-packages/vllm/attention/selector.py", line 158, in which_attn_to_use
(VllmWorkerProcess pid=13977) if torch.cuda.get_device_capability()[0] < 8:
(VllmWorkerProcess pid=13977) File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 430, in get_device_capability
(VllmWorkerProcess pid=13977) prop = get_device_properties(device)
(VllmWorkerProcess pid=13977) File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 444, in get_device_properties
(VllmWorkerProcess pid=13977) _lazy_init() # will define _get_device_properties
(VllmWorkerProcess pid=13977) File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 279, in _lazy_init
(VllmWorkerProcess pid=13977) raise RuntimeError(
(VllmWorkerProcess pid=13977) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=13978) Process VllmWorkerProcess:
(VllmWorkerProcess pid=13978) Traceback (most recent call last):
(VllmWorkerProcess pid=13978) File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
(VllmWorkerProcess pid=13978) self.run()
(VllmWorkerProcess pid=13978) File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=13978) self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=13978) File "/opt/conda/lib/python3.8/site-packages/vllm/executor/multiproc_worker_utils.py", line 211, in _run_worker_process
(VllmWorkerProcess pid=13978) worker = worker_factory()
(VllmWorkerProcess pid=13978) File "/opt/conda/lib/python3.8/site-packages/vllm/executor/gpu_executor.py", line 68, in _create_worker
(VllmWorkerProcess pid=13978) wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
(VllmWorkerProcess pid=13978) File "/opt/conda/lib/python3.8/site-packages/vllm/worker/worker_base.py", line 311, in init_worker
(VllmWorkerProcess pid=13978) self.worker = worker_class(*args, **kwargs)
(VllmWorkerProcess pid=13978) File "/opt/conda/lib/python3.8/site-packages/vllm/worker/worker.py", line 87, in init
(VllmWorkerProcess pid=13978) self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
(VllmWorkerProcess pid=13978) File "/opt/conda/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 196, in init
(VllmWorkerProcess pid=13978) self.attn_backend = get_attn_backend(
(VllmWorkerProcess pid=13978) File "/opt/conda/lib/python3.8/site-packages/vllm/attention/selector.py", line 51, in get_attn_backend
(VllmWorkerProcess pid=13978) backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
(VllmWorkerProcess pid=13978) File "/opt/conda/lib/python3.8/site-packages/vllm/attention/selector.py", line 158, in which_attn_to_use
(VllmWorkerProcess pid=13978) if torch.cuda.get_device_capability()[0] < 8:
(VllmWorkerProcess pid=13978) File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 430, in get_device_capability
(VllmWorkerProcess pid=13978) prop = get_device_properties(device)
(VllmWorkerProcess pid=13978) File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 444, in get_device_properties
(VllmWorkerProcess pid=13978) _lazy_init() # will define _get_device_properties
(VllmWorkerProcess pid=13978) File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 279, in _lazy_init
(VllmWorkerProcess pid=13978) raise RuntimeError(
(VllmWorkerProcess pid=13978) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
(VllmWorkerProcess pid=13979) Process VllmWorkerProcess:
(VllmWorkerProcess pid=13979) Traceback (most recent call last):
(VllmWorkerProcess pid=13979) File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
(VllmWorkerProcess pid=13979) self.run()
(VllmWorkerProcess pid=13979) File "/opt/conda/lib/python3.8/multiprocessing/process.py", line 108, in run
(VllmWorkerProcess pid=13979) self._target(*self._args, **self._kwargs)
(VllmWorkerProcess pid=13979) File "/opt/conda/lib/python3.8/site-packages/vllm/executor/multiproc_worker_utils.py", line 211, in _run_worker_process
(VllmWorkerProcess pid=13979) worker = worker_factory()
(VllmWorkerProcess pid=13979) File "/opt/conda/lib/python3.8/site-packages/vllm/executor/gpu_executor.py", line 68, in _create_worker
(VllmWorkerProcess pid=13979) wrapper.init_worker(**self._get_worker_kwargs(local_rank, rank,
(VllmWorkerProcess pid=13979) File "/opt/conda/lib/python3.8/site-packages/vllm/worker/worker_base.py", line 311, in init_worker
(VllmWorkerProcess pid=13979) self.worker = worker_class(*args, **kwargs)
(VllmWorkerProcess pid=13979) File "/opt/conda/lib/python3.8/site-packages/vllm/worker/worker.py", line 87, in init
(VllmWorkerProcess pid=13979) self.model_runner: GPUModelRunnerBase = ModelRunnerClass(
(VllmWorkerProcess pid=13979) File "/opt/conda/lib/python3.8/site-packages/vllm/worker/model_runner.py", line 196, in init
(VllmWorkerProcess pid=13979) self.attn_backend = get_attn_backend(
(VllmWorkerProcess pid=13979) File "/opt/conda/lib/python3.8/site-packages/vllm/attention/selector.py", line 51, in get_attn_backend
(VllmWorkerProcess pid=13979) backend = which_attn_to_use(num_heads, head_size, num_kv_heads,
(VllmWorkerProcess pid=13979) File "/opt/conda/lib/python3.8/site-packages/vllm/attention/selector.py", line 158, in which_attn_to_use
(VllmWorkerProcess pid=13979) if torch.cuda.get_device_capability()[0] < 8:
(VllmWorkerProcess pid=13979) File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 430, in get_device_capability
(VllmWorkerProcess pid=13979) prop = get_device_properties(device)
(VllmWorkerProcess pid=13979) File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 444, in get_device_properties
(VllmWorkerProcess pid=13979) _lazy_init() # will define _get_device_properties
(VllmWorkerProcess pid=13979) File "/opt/conda/lib/python3.8/site-packages/torch/cuda/init.py", line 279, in _lazy_init
(VllmWorkerProcess pid=13979) raise RuntimeError(
(VllmWorkerProcess pid=13979) RuntimeError: Cannot re-initialize CUDA in forked subprocess. To use CUDA with multiprocessing, you must use the 'spawn' start method
Traceback (most recent call last):
File "/opt/conda/lib/python3.8/site-packages/vllm/executor/multiproc_worker_utils.py", line 170, in _enqueue_task
self._task_queue.put((task_id, method, args, kwargs))
File "/opt/conda/lib/python3.8/multiprocessing/queues.py", line 82, in put
raise ValueError(f"Queue {self!r} is closed")
ValueError: Queue <multiprocessing.queues.Queue object at 0x7feceed0fbb0> is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File "/ossfs/workspace/inter_vllm_long_length.py", line 94, in
test_model(path)
File "/ossfs/workspace/inter_vllm_long_length.py", line 64, in test_model
llm = LLM(model=model_path, **kwargs_launcher)
File "/opt/conda/lib/python3.8/site-packages/vllm/entrypoints/llm.py", line 144, in init
self.llm_engine = LLMEngine.from_engine_args(
File "/opt/conda/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 409, in from_engine_args
engine = cls(
File "/opt/conda/lib/python3.8/site-packages/vllm/engine/llm_engine.py", line 242, in init
self.model_executor = executor_class(
File "/opt/conda/lib/python3.8/site-packages/vllm/executor/distributed_gpu_executor.py", line 25, in init
super().init(*args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/vllm/executor/executor_base.py", line 41, in init
self._init_executor()
File "/opt/conda/lib/python3.8/site-packages/vllm/executor/multiproc_gpu_executor.py", line 70, in _init_executor
self._run_workers("init_device")
File "/opt/conda/lib/python3.8/site-packages/vllm/executor/multiproc_gpu_executor.py", line 112, in _run_workers
worker_outputs = [
File "/opt/conda/lib/python3.8/site-packages/vllm/executor/multiproc_gpu_executor.py", line 113, in
worker.execute_method(method, *args, **kwargs)
File "/opt/conda/lib/python3.8/site-packages/vllm/executor/multiproc_worker_utils.py", line 177, in execute_method
self._enqueue_task(future, method, args, kwargs)
File "/opt/conda/lib/python3.8/site-packages/vllm/executor/multiproc_worker_utils.py", line 173, in _enqueue_task
raise ChildProcessError("worker died") from e
ChildProcessError: worker died
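For reference, every worker traceback above ends with the same RuntimeError: CUDA cannot be re-initialized in a forked subprocess, and PyTorch suggests switching the multiprocessing start method to 'spawn'. This usually means something in the parent process initialized CUDA (possibly the MInference patching or any other early torch.cuda call) before vLLM forked its tensor-parallel workers. The snippet below is only a minimal sketch of a possible workaround, not a confirmed fix; it assumes the installed vLLM version honors the VLLM_WORKER_MULTIPROC_METHOD environment variable, and the model path and sampling parameters are placeholders.

import os

# Assumption: this environment variable is read by the installed vLLM version;
# it must be set before the LLM object (and its worker processes) is created.
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams

def main():
    # Placeholder model path and settings; tensor_parallel_size > 1 is what
    # triggers the multiprocessing GPU executor shown in the traceback above.
    llm = LLM(model="/path/to/model", tensor_parallel_size=2, trust_remote_code=True)
    outputs = llm.generate(["Hello"], SamplingParams(max_tokens=16))
    print(outputs[0].outputs[0].text)

if __name__ == "__main__":
    # 'spawn' re-imports this script in every worker process, so the
    # entry point must be guarded.
    main()

Alternatively, the default 'fork' start method may still work if nothing initializes CUDA in the parent process before the LLM object is constructed.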
Steps to reproduce
No response
Expected Behavior
No response
Logs
No response
Additional Information
No response