
[Bug]: vllm executor.driver_worker. 'RayWorkerWrapper' object has no attribute 'model_runner' #67

Open
TPLink32 opened this issue Aug 8, 2024 · 10 comments

TPLink32 commented Aug 8, 2024

Describe the bug

I encountered the following error when using MInference with vLLM in Python:
[rank0]: Traceback (most recent call last):
[rank0]:   File "/data/code/minference/test_minference.py", line 31, in <module>
[rank0]:     llm = minference_patch(llm)
[rank0]:   File "/data/miniconda3/envs/moe/lib/python3.10/site-packages/minference/models_patch.py", line 39, in __call__
[rank0]:     return self.patch_model(model)
[rank0]:   File "/data/miniconda3/envs/moe/lib/python3.10/site-packages/minference/models_patch.py", line 102, in patch_model
[rank0]:     model = minference_patch_vllm(model, self.config.config_path)
[rank0]:   File "/data/miniconda3/envs/moe/lib/python3.10/site-packages/minference/patch.py", line 1111, in minference_patch_vllm
[rank0]:     llm.llm_engine.model_executor.driver_worker.model_runner.model.apply(update_module)
[rank0]: AttributeError: 'RayWorkerWrapper' object has no attribute 'model_runner'

Environment:
Python 3.10
minference 0.1.5
triton 2.1.0
torch 2.3.0
CUDA 12.1
vllm 0.4.3
flash-attn 2.5.8

Steps to reproduce

No response

Expected Behavior

No response

Logs

No response

Additional Information

No response

TPLink32 added the bug label Aug 8, 2024
TPLink32 (Author) commented Aug 8, 2024

llm = LLM(
    model_name,
    max_num_seqs=2,
    # tensor_parallel_size=4,  # adding this option triggers the error; see the fuller sketch below
    trust_remote_code=True,
    max_model_len=12000,
)
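
For reference, a fuller repro sketch along these lines (the model name, prompt, and MInference patching call are assumptions pieced together from the traceback above, not the reporter's actual script):

# Hypothetical reproduction sketch; model name and prompt are placeholders.
from vllm import LLM, SamplingParams
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder long-context model

llm = LLM(
    model_name,
    max_num_seqs=2,
    tensor_parallel_size=4,  # TP > 1 is what triggers the AttributeError
    trust_remote_code=True,
    max_model_len=12000,
)

# Patch vLLM with MInference (this is the call that fails in the traceback).
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)  # AttributeError on minference 0.1.5 + vLLM 0.4.3 with TP > 1

outputs = llm.generate(["Hello, world!"], SamplingParams(max_tokens=16))
print(outputs[0].outputs[0].text)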

iofu728 self-assigned this Aug 12, 2024
iofu728 (Contributor) commented Aug 13, 2024

Hi @TPLink32, thank you for your feedback.

Apologies, but I wasn't able to reproduce the bug in your environment. From the vLLM source code, it appears that version 0.4.3 does include this attribute: test_fp8.py#L21.

TPLink32 (Author) commented

tensor_parallel_size = 1 works fine, but this error occurs when it is changed to 4.

susu1210 commented Sep 5, 2024

Hi @iofu728,
I think that when tensor_parallel_size > 1, the driver_worker (of type RayWorkerWrapper) should instead be patched with the following call (see the sketch after this snippet):
llm.llm_engine.model_executor.driver_worker.worker.model_runner.model.apply(update_module)
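
A defensive version of that patch could look like the sketch below; the helper name and the attribute unwrapping are assumptions based on the paths discussed in this thread, not code from the MInference repository:

def apply_patch_to_model_runner(llm, update_module):
    """Apply update_module to the vLLM model for both executor layouts.

    With tensor_parallel_size == 1 the driver_worker is a plain Worker that
    exposes model_runner directly; with TP > 1 (Ray executor) it is a
    RayWorkerWrapper whose underlying Worker sits in its .worker attribute.
    """
    driver = llm.llm_engine.model_executor.driver_worker
    worker = getattr(driver, "worker", driver)  # unwrap RayWorkerWrapper if present
    worker.model_runner.model.apply(update_module)

Note that this only reaches the driver worker; with TP the remaining Ray workers would still need the same patch applied in their own processes, which is the broadcasting difficulty mentioned in the reply below.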

iofu728 (Contributor) commented Sep 6, 2024

Hi @susu1210 and @TPLink32,

Thank you for your patience. We now have a working version with decent speedup. However, due to the difficulty of broadcasting the patch function in the main process, this version requires manual modification of the vLLM code.

We won’t be merging this branch for now until we find a more suitable solution.

Here’s how to run it:

  1. Switch to the hjiang/support_vllm_tp branch.
  2. Run pip install -e .
  3. Copy minference_patch_vllm_tp and minference_patch_vllm_executor from minference/patch.py to the end of the Worker class in vllm/worker/worker.py. Make sure to indent minference_patch_vllm_tp.
  4. When calling vLLM, make sure enable_chunked_prefill=False is set (see the Python-side sketch after the script below).
  5. Refer to the script in https://github.com/microsoft/MInference/blob/hjiang/support_vllm_tp/experiments/benchmarks/run_e2e_vllm_tp.sh:
wget https://raw.githubusercontent.com/FranxYao/chain-of-thought-hub/main/gsm8k/lib_prompt/prompt_hardest.txt

VLLM_WORKER_MULTIPROC_METHOD=spawn python experiments/benchmarks/benchmark_e2e_vllm_tp.py \
    --attn_type minference \
    --context_window 500_000 --tensor_parallel_size 4
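
For reference, a rough Python-side counterpart to that benchmark script might look like this (the model name and prompt handling are placeholders, and whether the standard MInference("vllm", ...) call is still the right entry point on that branch is an assumption, so treat this as a sketch only):

import os

# Must be set before vLLM spawns its workers (mirrors the shell command above).
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"

from vllm import LLM, SamplingParams
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder long-context model

llm = LLM(
    model_name,
    tensor_parallel_size=4,
    enable_chunked_prefill=False,  # required per step 4 above
    max_model_len=500_000,         # matches the --context_window in the example
    trust_remote_code=True,
)
llm = MInference("vllm", model_name)(llm)

prompt = open("prompt_hardest.txt").read()  # downloaded by the wget command above
print(llm.generate([prompt], SamplingParams(max_tokens=32))[0].outputs[0].text)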

susu1210 commented Sep 6, 2024

@iofu728 Thanks a lot for your reply.
It works for me.

susu1210 commented Sep 6, 2024

@iofu728 One more question: is there any code for offline pattern searching with tensor parallelism?
I found that vLLM + TP is only used at the online (inference) stage.

iofu728 (Contributor) commented Sep 9, 2024

> @iofu728 One more question: is there any code for offline pattern searching with tensor parallelism? I found that vLLM + TP is only used at the online (inference) stage.

Yes, currently the search phase can only be used within HuggingFace and doesn't support TP.

leoyuppieqnew commented

> @iofu728 One more question: is there any code for offline pattern searching with tensor parallelism? I found that vLLM + TP is only used at the online (inference) stage.

> Yes, currently the search phase can only be used within HuggingFace and doesn't support TP.

You mean that when I use the "vllm" mode of run_infinitebench.py, it does not output the best_pattern file, and this stage happens online during inference, right? Have you evaluated the impact of this on performance: vLLM+TP vs. vLLM+MInference+TP?

iofu728 (Contributor) commented Sep 19, 2024

> You mean that when I use the "vllm" mode of run_infinitebench.py, it does not output the best_pattern file, and this stage happens online during inference, right? Have you evaluated the impact of this on performance: vLLM+TP vs. vLLM+MInference+TP?

Hi @leoyuppieqnew, the head pattern search is performed offline, and once the configuration is obtained, it is applied to all contexts. During online processing, dynamic approximation and calculations are carried out based on the searched configuration.

The current version does not support head pattern search in vLLM, either with or without TP. This feature is only supported in HF at the moment.

MInference online computation, however, is supported across all platforms: HF, vLLM w/o TP, and vLLM w/ TP.
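
As a rough illustration of that offline-then-online split (the config path is a placeholder, and the config_path keyword is inferred from the self.config.config_path reference in the traceback at the top of this issue, so the exact argument name is an assumption):

from vllm import LLM
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # placeholder

# Offline (HF only, no TP): the head pattern search produces a config file.
searched_config = "path/to/best_pattern.json"  # placeholder path

# Online (HF, vLLM w/o TP, or vLLM w/ TP): the searched configuration is loaded
# and the dynamic sparse approximation runs against it at inference time.
llm = LLM(model_name, tensor_parallel_size=4, trust_remote_code=True)
llm = MInference("vllm", model_name, config_path=searched_config)(llm)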
