[Bug]: vllm executor.driver_worker. 'RayWorkerWrapper' object has no attribute 'model_runner' #67
Comments
llm = LLM(
Hi @TPLink32, thank you for your feedback. Apologies, but I wasn't able to reproduce the bug based on your environment. From the vllm source code, it appears that version 0.4.3 does include this object: test_fp8.py#L21.
tensor_parallel_size = 1 works fine, but this error occurs when it is changed to 4.
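For what it's worth, the discrepancy between TP = 1 and TP = 4 is consistent with vLLM switching to its Ray executor for multi-GPU runs: there, driver_worker is a RayWorkerWrapper around the real worker, so model_runner lives one level deeper. Below is a minimal defensive lookup sketch, which assumes (for vLLM 0.4.3) that the wrapper exposes the underlying worker as .worker:

def get_model_runner(llm):
    # TP = 1: driver_worker is a plain Worker exposing model_runner directly.
    driver = llm.llm_engine.model_executor.driver_worker
    if hasattr(driver, "model_runner"):
        return driver.model_runner
    # TP > 1 (Ray backend): RayWorkerWrapper holds the real worker; the
    # `.worker` attribute here is an assumption about vLLM 0.4.3 internals.
    if hasattr(driver, "worker") and driver.worker is not None:
        return driver.worker.model_runner
    raise AttributeError("could not locate model_runner on the driver worker")

Even with such a lookup, only the driver process would be patched; the remote Ray workers would still run the unpatched model, which is the broadcasting difficulty mentioned below.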
Hi @iofu728
Thank you for your patience. We now have a working version with decent speedup. However, due to the difficulty of broadcasting the patch function in the main process, this version requires manual modification of the vLLM code. We won’t be merging this branch for now until we find a more suitable solution. Here’s how to run it:
wget https://raw.githubusercontent.com/FranxYao/chain-of-thought-hub/main/gsm8k/lib_prompt/prompt_hardest.txt
VLLM_WORKER_MULTIPROC_METHOD=spawn python experiments/benchmarks/benchmark_e2e_vllm_tp.py \
--attn_type minference \
--context_window 500_000 --tensor_parallel_size 4
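Roughly the same run can also be set up from Python. A sketch, assuming MInference's documented patch entry point and using an illustrative model name; note the environment variable must be set before vLLM creates its workers:

import os
os.environ["VLLM_WORKER_MULTIPROC_METHOD"] = "spawn"  # must precede worker creation

from vllm import LLM
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # illustrative
llm = LLM(model_name, tensor_parallel_size=4, max_model_len=500_000)
minference_patch = MInference("vllm", model_name)
llm = minference_patch(llm)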
@iofu728 Thanks a lot for your reply.
@iofu728 One more question: is there any code for offline pattern searching with tensor parallelism?
Currently the search phase can only be run within HuggingFace; it doesn't support TP.
You mean that when I use the "vllm" mode of run_infinitebench.py, it does not output the best_pattern file, and this stage instead happens online during the inference phase, right? Have you evaluated the impact of this on performance: vllm+tp vs. vllm+minference+tp?
Hi @leoyuppieqnew, the head pattern search is performed offline, and once the configuration is obtained, it is applied to all contexts. During online processing, dynamic approximation and calculations are carried out based on the searched configuration. The current version does not support head pattern search in vLLM, either with or without TP. This feature is only supported in HF at the moment. MInference online computation, however, is supported across all platforms: HF, vLLM w/o TP, and vLLM w/ TP. |
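For the HF path, applying a searched config looks roughly like this (a sketch assuming MInference's documented entry point; the model name is illustrative, and a config_path is only needed for a custom head-pattern file):

import torch
from transformers import AutoModelForCausalLM
from minference import MInference

model_name = "gradientai/Llama-3-8B-Instruct-262k"  # illustrative
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)
# Patch the HF model; the searched head-pattern config is loaded from
# MInference's bundled configs, or from config_path if one is given.
minference_patch = MInference("minference", model_name)
model = minference_patch(model)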
Describe the bug
I encountered some issues when using minference in Python.
[rank0]: Traceback (most recent call last):
[rank0]: File "/data/code/minference/test_minference.py", line 31, in <module>
[rank0]: llm = minference_patch(llm)
[rank0]: File "/data/miniconda3/envs/moe/lib/python3.10/site-packages/minference/models_patch.py", line 39, in __call__
[rank0]: return self.patch_model(model)
[rank0]: File "/data/miniconda3/envs/moe/lib/python3.10/site-packages/minference/models_patch.py", line 102, in patch_model
[rank0]: model = minference_patch_vllm(model, self.config.config_path)
[rank0]: File "/data/miniconda3/envs/moe/lib/python3.10/site-packages/minference/patch.py", line 1111, in minference_patch_vllm
[rank0]: llm.llm_engine.model_executor.driver_worker.model_runner.model.apply(update_module)
[rank0]: AttributeError: 'RayWorkerWrapper' object has no attribute 'model_runner'
Environment:
Python 3.10
minference 0.1.5
triton 2.1.0
torch 2.3.0
CUDA 12.1
vllm 0.4.3
flash-attn 2.5.8
Steps to reproduce
No response
Expected Behavior
No response
Logs
No response
Additional Information
No response