
[Bug]: vllm executor.driver_worker. 'RayWorkerWrapper' object has no attribute 'model_runner' #67

Open
TPLink32 opened this issue Aug 8, 2024 · 10 comments

TPLink32 commented Aug 8, 2024

Describe the bug

I encountered the following error when using MInference with vLLM in Python:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/code/minference/test_minference.py", line 31, in <module>
[rank0]:     llm = minference_patch(llm)
[rank0]:   File "/data/miniconda3/envs/moe/lib/python3.10/site-packages/minference/models_patch.py", line 39, in __call__
[rank0]:     return self.patch_model(model)
[rank0]:   File "/data/miniconda3/envs/moe/lib/python3.10/site-packages/minference/models_patch.py", line 102, in patch_model
[rank0]:     model = minference_patch_vllm(model, self.config.config_path)
[rank0]:   File "/data/miniconda3/envs/moe/lib/python3.10/site-packages/minference/patch.py", line 1111, in minference_patch_vllm
[rank0]:     llm.llm_engine.model_executor.driver_worker.model_runner.model.apply(update_module)
[rank0]: AttributeError: 'RayWorkerWrapper' object has no attribute 'model_runner'

Environment:
Python 3.10
minference 0.1.5
triton 2.1.0
torch 2.3.0
CUDA 12.1
vllm 0.4.3
flash-attn 2.5.8

Steps to reproduce

No response

Expected Behavior

No response

Logs

No response

Additional Information

No response

TPLink32 added the bug label Aug 8, 2024
TPLink32 (Author) commented Aug 8, 2024

llm = LLM(
    model_name,
    max_num_seqs=2,
    # tensor_parallel_size=4,  # adding this option triggers the error; see the fuller sketch below
    trust_remote_code=True,
    max_model_len=12000,
)
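
For reference, a fuller repro sketch along these lines (the model name, prompt, and MInference patching call are assumptions pieced together from the traceback above, not the reporter's actual script):

# Hypothetical reproduction sketch; model name and prompt are placeholders.
from vllm import LLM, SamplingParams
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder long-context model

llm = LLM(
    model_name,
    max_num_seqs=2,
    tensor_parallel_size=4,  # TP > 1 is what triggers the AttributeError
    trust_remote_code=True,
    max_model_len=12000,
)

# Patch vLLM with MInference (this is the call that fails in the traceback).
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)  # AttributeError on minference 0.1.5 + vLLM 0.4.3 with TP > 1

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)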

iofu728 self-assigned this Aug 12, 2024
iofu728 (Contributor) commented Aug 13, 2024

Hi @TPLink32, thank you for your feedback.

Apologies, but I wasn't able to reproduce the bug in your environment. From the vLLM source code, it appears that version 0.4.3 does include this attribute: test_fp8.py#L21.

TPLink32 (Author) commented

tensor_parallel_size = 1 works fine, but this error occurs when it is changed to 4.

susu1210 commented Sep 5, 2024

Hi @iofu728,
I think that when tensor_parallel_size > 1, the driver_worker (of type RayWorkerWrapper) should instead be patched with the following call (see the sketch after this snippet):
llm.llm_engine.model_executor.driver_worker.worker.model_runner.model.apply(update_module)
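
A defensive version of that patch could look like the sketch below; the helper name and the attribute unwrapping are assumptions based on the paths discussed in this thread, not code from the MInference repository:

def apply_patch_to_model_runner(llm, update_module):
    """Apply update_module to the vLLM model for both executor layouts.

    With tensor_parallel_size == 1 the driver_worker is a plain Worker that
    exposes model_runner directly; with TP > 1 (Ray executor) it is a
    RayWorkerWrapper whose underlying Worker sits in its .worker attribute.
    """
    driver = llm.llm_engine.model_executor.driver_worker
    worker = getattr(driver, "worker", driver)  # unwrap RayWorkerWrapper if present
    worker.model_runner.model.apply(update_module)

Note that this only reaches the driver worker; with TP the remaining Ray workers would still need the same patch applied in their own processes, which is the broadcasting difficulty mentioned in the reply below.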

iofu728 (Contributor) commented Sep 6, 2024

Hi @susu1210 and @TPLink32,

Thank you for your patience. We now have a working version with decent speedup. However, due to the difficulty of broadcasting the patch function in the main process, this version requires manual modification of the vLLM code.

We won’t be merging this branch for now until we find a more suitable solution.

Here’s how to run it:

  1. Switch to the hjiang/support_vllm_tp branch.
  2. Run pip install -e .
  3. Copy minference_patch_vllm_tp and minference_patch_vllm_executor from minference/patch.py to the end of the Worker class in vllm/worker/worker.py. Make sure to indent minference_patch_vllm_tp.
  4. When calling vLLM, make sure enable_chunked_prefill=False is set (see the Python-side sketch after the script below).
  5. Refer to the script in https://github.com/microsoft/MInference/blob/hjiang/support_vllm_tp/experiments/benchmarks/run_e2e_vllm_tp.sh:
wget https://raw.githubusercontent.com/FranxYao/chain-of-thought-hub/main/gsm8k/lib_prompt/prompt_hardest.txt

VLLM_WORKER_MULTIPROC_METHOD=spawn python experiments/benchmarks/benchmark_e2e_vllm_tp.py \
    --attn_type minference \
    --context_window 500_000 --tensor_parallel_size 4
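
For reference, a rough Python-side counterpart to that benchmark script might look like this (the model name and prompt handling are placeholders, and whether the standard MInference("vllm", ...) call is still the right entry point on that branch is an assumption, so treat this as a sketch only):

import os

# Must be set before vLLM spawns its workers (mirrors the shell command above).
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder long-context model

llm = LLM(
    model_name,
    tensor_parallel_size=4,
    enable_chunked_prefill=False,  # required per step 4 above
    max_model_len=500_000,         # matches the --context_window in the example
    trust_remote_code=True,
)
llm = MInference("vllm", model_name)(llm)

prompt = open("prompt_hardest.txt").read()  # downloaded by the wget command above
print(llm.generate([prompt], SamplingParams(max_tokens=32))[0].outputs[0].text)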

susu1210 commented Sep 6, 2024

@iofu728 Thanks a lot for your reply.
It works for me.

susu1210 commented Sep 6, 2024

@iofu728 One more question: is there any code for offline pattern searching with tensor parallelism?
I found that vLLM + TP is only used at the online (inference) stage.

iofu728 (Contributor) commented Sep 9, 2024

> @iofu728 One more question: is there any code for offline pattern searching with tensor parallelism? I found that vLLM + TP is only used at the online (inference) stage.

Yes, currently the search phase can only be used within HuggingFace and doesn't support TP.

leoyuppieqnew commented

> @iofu728 One more question: is there any code for offline pattern searching with tensor parallelism? I found that vLLM + TP is only used at the online (inference) stage.

> Yes, currently the search phase can only be used within HuggingFace and doesn't support TP.

You mean that when I use the "vllm" mode of run_infinitebench.py, it does not output the best_pattern file, and this stage happens online during inference, right? Have you evaluated the impact of this on performance: vLLM+TP vs. vLLM+MInference+TP?

iofu728 (Contributor) commented Sep 19, 2024

> You mean that when I use the "vllm" mode of run_infinitebench.py, it does not output the best_pattern file, and this stage happens online during inference, right? Have you evaluated the impact of this on performance: vLLM+TP vs. vLLM+MInference+TP?

Hi @leoyuppieqnew, the head pattern search is performed offline, and once the configuration is obtained, it is applied to all contexts. During online processing, dynamic approximation and calculations are carried out based on the searched configuration.

The current version does not support head pattern search in vLLM, either with or without TP. This feature is only supported in HF at the moment.

MInference online computation, however, is supported across all platforms: HF, vLLM w/o TP, and vLLM w/ TP.
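
As a rough illustration of that offline-then-online split (the config path is a placeholder, and the config_path keyword is inferred from the self.config.config_path reference in the traceback at the top of this issue, so the exact argument name is an assumption):

from vllm import LLM
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder

# Offline (HF only, no TP): the head pattern search produces a config file.
searched_config = "path/to/best_pattern.json"  # placeholder path

# Online (HF, vLLM w/o TP, or vLLM w/ TP): the searched configuration is loaded
# and the dynamic sparse approximation runs against it at inference time.
llm = LLM(model_name, tensor_parallel_size=4, trust_remote_code=True)
llm = MInference("vllm", model_name, config_path=searched_config)(llm)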
