[VLLM integration][90% completed] Add Ovis2 to VLLM #70
base: main
Conversation
It needs to be said that this port is intended for vLLM 0.7.2.
@alibaba-oss Help on discovering the numerical difference is welcome; after that, the implementation will be completed.
@mlinmg The numerical difference in
Thank you for your diligent effort. I will review this code, attempt to run it, and strive to identify the cause of the precision inconsistency later next week (due to being tied up with work reports in the coming days).
Cool, I'll check on that.
Thank You for the OVIS PR
Thank you very much for your work on porting OVIS to VLLM! I see you've addressed the needs from multiple community issues, which will greatly help improve OVIS inference efficiency. I've successfully run VLLM with OVIS locally based on your code, with results matching expectations. Here's the code I used to test it:

# Import necessary modules
from PIL import Image
from vllm import LLM, SamplingParams
from vllm import ModelRegistry
from ovis.vllm.ovis_modeling import OvisForConditionalGeneration
ModelRegistry.register_model("Ovis", OvisForConditionalGeneration)
llm = LLM(model="path-to-model/Ovis2-2B", trust_remote_code=True)
from ovis.vllm.processing_ovis import OvisProcessor
processor = OvisProcessor.from_pretrained('/mnt/workspace/cv_multimodal/daxiao/models/Ovis2-2B')
image = Image.open("ovis2_ocr1.jpg")
# Set sampling parameters for generation
greedy_params = SamplingParams(temperature=0.0, max_tokens=250)
# Format the conversation using the processor
output_from_processor = processor.tokenizer.apply_chat_template(
    add_generation_prompt=True,
    conversation=[
        {
            "role": "user",
            "content": [
                {"type": "image", "image": ''},
                {"type": "text", "text": "Describe the image."},
            ],
        }
    ],
    tokenize=False
)
# Generate the caption
output = llm.generate(
    {
        "prompt": output_from_processor,
        "multi_modal_data": {"image": image},
    },
    greedy_params
)
# Print the generated caption
print(output[0].outputs[0].text)

Questions about next steps:
Thanks again for your contribution to the OVIS ecosystem. Looking forward to working together to improve the model's accessibility and performance!
Addresses @abhiaagarwal's comment.
Nice to hear it!
1. HF and GitHub modifications:
2. VLLM integration:
3. Technical questions:
In fact, you can expose extra multimodal processor kwargs in vLLM, just like qwen2.5-vl:
You can refer to https://github.com/vllm-project/vllm/blob/6909a762012ce665931ff6d482dce17cf927108a/vllm/model_executor/models/qwen2_vl.py#L754-L800 about how to expose processor kwargs.
If you need, I can help upstream this implementation to vLLM.
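For reference, here is a minimal sketch of what exposing such a kwarg could look like for Ovis, mirroring the min_pixels/max_pixels handling in the qwen2_vl.py lines linked above. The class name OvisProcessingInfo, the max_partition kwarg, and the exact base-class interface are assumptions about this PR and about vLLM 0.7.x internals, not confirmed code.

from typing import Any, Optional

from ovis.vllm.processing_ovis import OvisProcessor  # processor added by this PR


class OvisProcessingInfo:
    # Sketch only: in vLLM this would subclass BaseProcessingInfo, which is
    # what provides self.ctx (the input processing context).
    def get_hf_processor(self, *, max_partition: Optional[int] = None,
                         **kwargs: Any):
        # Forward the override so mm_processor_kwargs={"max_partition": ...}
        # actually reaches OvisProcessor instead of being dropped.
        if max_partition is not None:
            kwargs["max_partition"] = max_partition
        return self.ctx.get_hf_processor(OvisProcessor, **kwargs)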
@mlinmg @Isotr0py I can confirm that I've modified the OVIS processor to support configuring max_partition:

llm = LLM(model="/mnt/workspace/cv_multimodal/daxiao/models/Ovis2-2B",
          device="cuda",
          mm_processor_kwargs={"max_partition": 12},
          trust_remote_code=True)

Currently, this approach allows setting the maximum number of partitions globally for the LLM instance. However, it doesn't yet support configuring different max_partition values per request.
I think you can actually pass them as mm_processor_kwargs in the chat call API.
It would be awesome; I'll open it later today.
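As a rough sketch of that idea (reusing the llm instance, processor output, image, and sampling params from the test script earlier in this thread): vLLM's prompt dict also accepts a per-request mm_processor_kwargs entry, so a different max_partition could in principle be set per call once the processor honours the kwarg. Treat this as an assumption about the intended behaviour, not tested PR code.

output = llm.generate(
    {
        "prompt": output_from_processor,
        "multi_modal_data": {"image": image},
        # Per-request override; only effective if OvisProcessor accepts
        # max_partition as a processor kwarg (assumption for this PR).
        "mm_processor_kwargs": {"max_partition": 9},
    },
    greedy_params,
)
print(output[0].outputs[0].text)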
Greetings,
Since I needed to use Ovis in an efficient way, and since there are multiple requests to do it (#57, #50, vllm-project/vllm#13441, vllm-project/vllm#13251, vllm-project/vllm#14115, vllm-project/vllm#8972), I've decided to port it to VLLM.
There will be some new files that need to be added to the HF repos of the OVIS models:
These are the things that need to be done to have a fully functional implementation:
Adapt the HF implementation to have a correct tokenizer, MM processor and config file:
- Added a processing_ovis.py file, which removed the need to do the preprocessing inside the ovis modeling file.
- Added num_hidden_layers, AutoProcessor, vocab_size and num_attention_heads, and switched model_type to chameleon since it has the same image token placeholder; if you make a PR to vllm you can add the ovis arch to the list of models that share that image token.
Ensure identical numerical values:
- Done up until the LLM part, i.e. where the decoding block of Qwen2 for the OG VLLM implementation seems to yield different values from the transformers one (but maybe I'm missing something); see the debugging sketch after this list.
Check how it handles non-uniform batches; this needs numerical identity first, however.
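Debugging sketch for the numerical-identity item above: one low-tech way to localize where the port diverges is to dump per-layer hidden states from the transformers side and diff them layer by layer against activations saved from the vLLM modeling code. The checkpoint path below is a placeholder for whatever Qwen2-style backbone is being compared, and the .pt-dump workflow is an assumption for illustration, not part of this PR.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder path: point this at the Qwen2-style LLM backbone under comparison.
model_path = "path-to-qwen2-backbone"
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, trust_remote_code=True, torch_dtype=torch.bfloat16
)

# output_hidden_states=True returns one tensor per decoder layer (plus the
# embedding output); diffing these against tensors saved from the vLLM
# implementation shows the first layer where the two runs diverge.
inputs = tokenizer("Describe the image.", return_tensors="pt")
with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

for i, hidden in enumerate(out.hidden_states):
    torch.save(hidden.cpu(), f"hf_hidden_{i}.pt")  # compare against vLLM-side dumps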