
Conversation

mlinmg

@mlinmg mlinmg commented Mar 19, 2025

Greetings,
Since I had the need to use Ovis in an efficient way, and since there are multiple requests for it ( #57 #50 vllm-project/vllm#13441 vllm-project/vllm#13251 vllm-project/vllm#14115 vllm-project/vllm#8972 ), I've decided to port it to vLLM.

There will be some new files that need to be added to the HF repos of the Ovis models:

  • chat_template.json
  • processing_ovis.py

These are the things that need to be done to have a fully functional implementation:

  • Adapt the HF implementation to have a correct tokenizer, MM processor and config file

    • I've added a processing_ovis.py file, which removes the need to do the preprocessing inside the Ovis modeling file
    • I've fixed the tokenizer so that there is no need to use the QwenConversationFormatter class; in particular, I've added the chat template and special tokens so that they work correctly via:
    from transformers import AutoTokenizer

    text_tokenizer = AutoTokenizer.from_pretrained(
        "AIDC-AI/Ovis2-2B",
        extra_special_tokens={
            "image_token": "<image>",
            "image_atom": "<image_atom>",
            "image_start": "<img>",
            "image_prefix": "<pre>",
            "image_col_sep": "<col>",
            "image_row_sep": "<row>",
            "image_end": "</img>",
            "image_pad": "<image_pad>",
        },
    )
    text_tokenizer.chat_template="{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}<image>\n{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}{% endif %}<|im_end|>\n{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
    text_tokenizer.push_to_hub('AIDC-AI/Ovis2-2B')
    • With the new processing_ovis.py in place, preprocessing works correctly via the default HF pipeline:
    # Requires the new processing_ovis.py; image_url, images and max_partition
    # are assumed to be defined by the caller.
    processor = OvisProcessor.from_pretrained('mlinmg/ovis_new')
    output_from_processor = processor.apply_chat_template(
        add_generation_prompt=True,
        conversation=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": f'{image_url}',
                    },
                    {"type": "text", "text": "Describe the image."},
                ],
            }
        ],
    )
    inputs = processor(
        text=[output_from_processor],
        images=images,
        padding=True,
        return_tensors="pt",
        max_partition=max_partition,
    )
    • Lastly, I've modified the config file (both the .py and the .json) to expose num_hidden_layers, AutoProcessor, vocab_size, and num_attention_heads, and switched model_type to chameleon since it has the same image token placeholder; if you make a PR to vLLM, you can add the Ovis arch to the list of models that share that image token (a config sketch is given after this list)
  • Ensure identical numerical values
    Done up until the LLM part, i.e.:

    hidden_states = self.llm(
        input_ids=input_ids,
        positions=positions,
        kv_caches=kv_caches,
        attn_metadata=attn_metadata,
        intermediate_tensors=intermediate_tensors,
        inputs_embeds=inputs_embeds,
    )

    where the decoder blocks of Qwen2 in the original vLLM implementation seem to yield different values from the Transformers ones (but maybe I'm missing something)

  • Check how it handles non-uniform batches; this still requires numerical identity, however.
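
For reference, a minimal sketch of what the exposed fields in config.json could end up looking like (the layer/head/vocab numbers are assumptions based on the Qwen2.5-1.5B backbone of Ovis2-2B, and the auto_map entry is illustrative; verify against the actual model files):

    {
      "model_type": "chameleon",
      "num_hidden_layers": 28,
      "num_attention_heads": 12,
      "vocab_size": 151936,
      "auto_map": {
        "AutoConfig": "configuration_ovis.OvisConfig",
        "AutoProcessor": "processing_ovis.OvisProcessor"
      }
    }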

@mlinmg
Author

mlinmg commented Mar 19, 2025

It should be noted that this port targets vLLM 0.7.2.

@mlinmg
Copy link
Author

mlinmg commented Mar 19, 2025

@alibaba-oss Help with discovering the numerical difference is welcome; after that, the implementation will be complete.

@Isotr0py

Isotr0py commented Mar 19, 2025

@mlinmg The numerical difference in the Qwen2 LLM backbone might come from RMSNorm and RotaryEmbedding, because they use faster custom operators with approximate results. You can add .forward_native() at their call sites to use the native implementation and see if the numerical difference still exists.
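
For reference, a minimal sketch of such a comparison for RMSNorm (this assumes the vLLM 0.7.x layout, where RMSNorm lives in vllm.model_executor.layers.layernorm and, being a CustomOp, exposes forward_native; the hidden size and batch shape are arbitrary, and a CUDA device is needed for the fused path):

import torch
from vllm.model_executor.layers.layernorm import RMSNorm

# Compare the fused custom operator with the native PyTorch reference path.
norm = RMSNorm(1536).to(device="cuda", dtype=torch.bfloat16)
x = torch.randn(4, 1536, dtype=torch.bfloat16, device="cuda")

fused = norm(x)                  # dispatches to the custom/fused kernel on CUDA
native = norm.forward_native(x)  # pure-PyTorch implementation
print((fused - native).abs().max())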

@runninglsy
Collaborator

Thank you for your diligent effort. I will review this code, attempt to run it, and strive to identify the cause of the precision inconsistency later next week (due to being tied up with work reports in the coming days).

@mlinmg
Author

mlinmg commented Mar 20, 2025

@mlinmg The numerical difference in the Qwen2 LLM backbone might come from RMSNorm and RotaryEmbedding, because they use faster custom operators with approximate results. You can add .forward_native() at their call sites to use the native implementation and see if the numerical difference still exists.

Cool, I'll check on that.

@JumpingRain
Collaborator

@mlinmg

Thank you for the Ovis PR

Thank you very much for your work on porting Ovis to vLLM! I see you've addressed the needs from multiple community issues, which will greatly help improve Ovis inference efficiency. I've successfully run vLLM with Ovis locally based on your code, with results matching expectations.

Here's the code I used to test it:

# Import necessary modules
from PIL import Image
from vllm import LLM, SamplingParams
from vllm import ModelRegistry

from ovis.vllm.ovis_modeling import OvisForConditionalGeneration
ModelRegistry.register_model("Ovis", OvisForConditionalGeneration)
llm = LLM(model="path-to-model/Ovis2-2B", trust_remote_code=True)

from ovis.vllm.processing_ovis import OvisProcessor
processor = OvisProcessor.from_pretrained('/mnt/workspace/cv_multimodal/daxiao/models/Ovis2-2B')

image = Image.open("ovis2_ocr1.jpg")

# Set sampling parameters for generation
greedy_params = SamplingParams(temperature=0.0, max_tokens=250)

# Format the conversation using the processor
output_from_processor = processor.tokenizer.apply_chat_template(
    add_generation_prompt=True,
    conversation=[
        {
            "role": "user",
            "content": [
                {"type": "image", "image": ""},
                {"type": "text", "text": "Describe the image."},
            ],
        }
    ],
    tokenize=False,
)

# Generate the caption
output = llm.generate(
    {
        "prompt": output_from_processor,
        "multi_modal_data": {"image": image},
    },
    greedy_params
)

# Print the generated caption
print(output[0].outputs[0].text)

Questions about next steps:

  1. HF and GitHub modifications:

    • What specific files do we need to add to the HF repositories? Is it just chat_template.json and processing_ovis.py?
    • Are there any other changes needed to our existing model files?
  2. VLLM integration:

    • What's the best approach to merge OVIS architecture into the VLLM main branch?
    • Would you be willing to create a PR to the VLLM repository, or should we handle that?
  3. Technical questions:

    • How can we specify the number of sub-images/partitions at runtime?
    • How do we support passing pixel values directly instead of loading images from paths?
    • Is there support for multi-image input in your implementation?

Thanks again for your contribution to the OVIS ecosystem. Looking forward to working together to improve the model's accessibility and performance!

@mlinmg
Author

mlinmg commented Mar 28, 2025

@JumpingRain

Nice to hear it!
To address your questions in order:

1. HF and GitHub modifications:

  • You'll also need to modify the tokenizer to add the special tokens that were hardcoded in your code. For instance, here's the added_tokens.json file required to make the processor work correctly:

    {
      "</img>": 151671,
      "</tool_call>": 151658,
      "<col>": 151669,
      "<image>": 151665,
      "<image_atom>": 151666,
      "<image_pad>": 151672,
      "<img>": 151667,
      "<pre>": 151668,
      "<row>": 151670,
      "<tool_call>": 151657,
      "<|box_end|>": 151649,
      "<|box_start|>": 151648,
      "<|endoftext|>": 151643,
      "<|file_sep|>": 151664,
      "<|fim_middle|>": 151660,
      "<|fim_pad|>": 151662,
      "<|fim_prefix|>": 151659,
      "<|fim_suffix|>": 151661,
      "<|im_end|>": 151645,
      "<|im_start|>": 151644,
      "<|image_pad|>": 151655,
      "<|object_ref_end|>": 151647,
      "<|object_ref_start|>": 151646,
      "<|quad_end|>": 151651,
      "<|quad_start|>": 151650,
      "<|repo_name|>": 151663,
      "<|video_pad|>": 151656,
      "<|vision_end|>": 151653,
      "<|vision_pad|>": 151654,
      "<|vision_start|>": 151652
    }
  • You’ll need to modify the modeling file for Ovis, since we no longer need the internal image pre-processing logic.

  • Update the config files (both .py and .json) to expose:

    • num_hidden_layers
    • AutoProcessor
    • vocab_size
    • num_attention_heads
    • And switch model_type to "chameleon", since it uses the same image token placeholder.
    • If you make a PR to vLLM, you can add the Ovis architecture to the list of models that share that image token.

2. VLLM integration:


3. Technical questions:

  • How can we specify the number of sub-images/partitions at runtime?
    This is a tricky question. AFAIK, vLLM doesn’t currently expose extra multimodal processor kwargs in their API—but I might be wrong.
    @Isotr0py, do you think this is possible?

  • Can we bypass the HF processor and pass a pre-processed image?
    I’m not sure. The standard vLLM pipeline sends a multimodal image as either a file path or a base64-encoded item. I don’t think you can bypass the HF processor, but again, I could be wrong.

  • Is there support for multi-image input in your implementation?
    Yes; a short sketch of multi-image usage with vLLM is given after this list.
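
For illustration, a minimal sketch of how multi-image input would be passed through vLLM's standard multimodal API with this port (the model path and image file names are placeholders, and limit_mm_per_prompt plus the two-image conversation are assumptions, not something verified against this implementation):

from PIL import Image
from vllm import LLM, ModelRegistry, SamplingParams

from ovis.vllm.ovis_modeling import OvisForConditionalGeneration
from ovis.vllm.processing_ovis import OvisProcessor

# Register the out-of-tree model, as in the test snippet earlier in this thread.
ModelRegistry.register_model("Ovis", OvisForConditionalGeneration)

# limit_mm_per_prompt raises vLLM's default of one image per prompt.
llm = LLM(
    model="path-to-model/Ovis2-2B",
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 2},
)
processor = OvisProcessor.from_pretrained("path-to-model/Ovis2-2B")

images = [Image.open("page1.jpg"), Image.open("page2.jpg")]
prompt = processor.tokenizer.apply_chat_template(
    add_generation_prompt=True,
    conversation=[{
        "role": "user",
        "content": [
            {"type": "image", "image": ""},
            {"type": "image", "image": ""},
            {"type": "text", "text": "Compare the two images."},
        ],
    }],
    tokenize=False,
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": images}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)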

@Isotr0py

This is a tricky question. AFAIK, vLLM doesn’t currently expose extra multimodal processor kwargs in their API—but I might be wrong.

In fact, you can expose extra multimodal processor kwargs in vLLM, just like qwen2.5-vl:

# Fragment from vLLM's Qwen2-VL example; the surrounding LLM(...) call is added
# here for context (model name illustrative):
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    # Note - mm_processor_kwargs can also be passed to generate/chat calls
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
    },
)

You can refer to https://github.com/vllm-project/vllm/blob/6909a762012ce665931ff6d482dce17cf927108a/vllm/model_executor/models/qwen2_vl.py#L754-L800 about how to expose processor kwargs.

I’d prefer to offload the PR work to you since I’m currently very busy.

If you need, I can help with upstreaming this implementation to vLLM.

@JumpingRain
Collaborator

@mlinmg @Isotr0py
Thank you for your responses. I've made further progress on implementing the max_partition parameter for Ovis in vLLM.

I can confirm that I've modified the Ovis processor to support configuring max_partition at initialization time. This can be done using the mm_processor_kwargs parameter when initializing the LLM:

llm = LLM(model="/mnt/workspace/cv_multimodal/daxiao/models/Ovis2-2B", 
          device="cuda",
          mm_processor_kwargs={"max_partition": 12},
          trust_remote_code=True)

Currently, this approach allows setting the maximum number of partitions globally for the LLM instance. However, it doesn't yet support configuring different max_partition values for individual requests.

@mlinmg
Author

mlinmg commented Mar 31, 2025

@JumpingRain

Currently, this approach allows setting the maximum number of partitions globally for the LLM instance. However, it doesn't yet support configuring different max_partition values for individual requests.

I think you can actually pass them as mm_processor_kwargs in the chat call API; see the sketch below.
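
A minimal sketch of what a per-request override could look like (this assumes the per-request prompt dict accepts a "mm_processor_kwargs" key, in line with the Qwen2-VL note quoted above; model registration as in the earlier snippets is assumed, and the model path, image file, and prompt string are placeholders):

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="path-to-model/Ovis2-2B",
    trust_remote_code=True,
    mm_processor_kwargs={"max_partition": 12},  # global default
)

# In practice the prompt would come from the Ovis chat template shown earlier;
# a hard-coded placeholder with the <image> token is used here for brevity.
prompt = "<|im_start|>user\n<image>\nDescribe the image.<|im_end|>\n<|im_start|>assistant\n"

output = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open("example.jpg")},
        # Assumed per-request override of the processor kwargs:
        "mm_processor_kwargs": {"max_partition": 4},
    },
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(output[0].outputs[0].text)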

@Isotr0py

If you need, I can help with upstreaming this implementation to vLLM.

That would be awesome; I'll open it later today.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
