
Conversation

mlinmg

@mlinmg mlinmg commented Mar 19, 2025

Greetings,
Since I had the need to use Ovis in an efficient way, and since there are multiple requests for it ( #57 #50 vllm-project/vllm#13441 vllm-project/vllm#13251 vllm-project/vllm#14115 vllm-project/vllm#8972 ), I've decided to port it to vLLM.

There will be some new files that need to be added to the HF repos of the Ovis models:

  • chat_template.json
  • processing_ovis.py

These are the things that need to be done to have a fully functional implementation:

  • Adapt the HF implementation to have a correct tokenizer, MM processor and config file

    • I've added a processing_ovis.py file, which removes the need to do the preprocessing inside the Ovis modeling file
    • I've fixed the tokenizer so that there is no need to use the QwenConversationFormatter class; in particular, I've added the chat template and special tokens so that they work correctly via:
    from transformers import AutoTokenizer

    text_tokenizer = AutoTokenizer.from_pretrained(
        "AIDC-AI/Ovis2-2B",
        extra_special_tokens={
            "image_token": "<image>",
            "image_atom": "<image_atom>",
            "image_start": "<img>",
            "image_prefix": "<pre>",
            "image_col_sep": "<col>",
            "image_row_sep": "<row>",
            "image_end": "</img>",
            "image_pad": "<image_pad>",
        },
    )
    text_tokenizer.chat_template="{% for message in messages %}{% if loop.first and message['role'] != 'system' %}<|im_start|>system\nYou are a helpful assistant.<|im_end|>\n{% endif %}<|im_start|>{{ message['role'] }}\n{% if message['content'] is string %}{{ message['content'] }}{% else %}{% for content in message['content'] %}{% if content['type'] == 'image' or 'image' in content or 'image_url' in content %}<image>\n{% elif 'text' in content %}{{ content['text'] }}{% endif %}{% endfor %}{% endif %}<|im_end|>\n{% endfor %}{% if add_generation_prompt %}<|im_start|>assistant\n{% endif %}"
    text_tokenizer.push_to_hub('AIDC-AI/Ovis2-2B')
    • With the new processing_ovis.py in place, preprocessing works correctly via the default HF pipeline:
    # Requires the new processing_ovis.py; image_url, images and max_partition
    # are assumed to be defined by the caller.
    processor = OvisProcessor.from_pretrained('mlinmg/ovis_new')
    output_from_processor = processor.apply_chat_template(
        add_generation_prompt=True,
        conversation=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "image",
                        "image": f'{image_url}',
                    },
                    {"type": "text", "text": "Describe the image."},
                ],
            }
        ],
    )
    inputs = processor(
        text=[output_from_processor],
        images=images,
        padding=True,
        return_tensors="pt",
        max_partition=max_partition,
    )
    • Lastly, I've modified the config file (both the .py and the .json) to expose num_hidden_layers, AutoProcessor, vocab_size, and num_attention_heads, and switched model_type to chameleon since it has the same image token placeholder; if you make a PR to vLLM, you can add the Ovis arch to the list of models that share that image token (a config sketch is given after this list)
  • Ensure identical numerical values
    Done up until the LLM part, i.e.:

    hidden_states = self.llm(
        input_ids=input_ids,
        positions=positions,
        kv_caches=kv_caches,
        attn_metadata=attn_metadata,
        intermediate_tensors=intermediate_tensors,
        inputs_embeds=inputs_embeds,
    )

    where the decoder blocks of Qwen2 in the original vLLM implementation seem to yield different values from the Transformers ones (but maybe I'm missing something)

  • Check how it handles non-uniform batches; this still requires numerical identity, however.
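
For reference, a minimal sketch of what the exposed fields in config.json could end up looking like (the layer/head/vocab numbers are assumptions based on the Qwen2.5-1.5B backbone of Ovis2-2B, and the auto_map entry is illustrative; verify against the actual model files):

    {
      "model_type": "chameleon",
      "num_hidden_layers": 28,
      "num_attention_heads": 12,
      "vocab_size": 151936,
      "auto_map": {
        "AutoConfig": "configuration_ovis.OvisConfig",
        "AutoProcessor": "processing_ovis.OvisProcessor"
      }
    }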

@mlinmg
Author

mlinmg commented Mar 19, 2025

It should be noted that this port targets vLLM 0.7.2.

@mlinmg
Copy link
Author

mlinmg commented Mar 19, 2025

@alibaba-oss Help with discovering the numerical difference is welcome; after that, the implementation will be complete.

@Isotr0py

Isotr0py commented Mar 19, 2025

@mlinmg The numerical difference in the Qwen2 LLM backbone might come from RMSNorm and RotaryEmbedding, because they use faster custom operators with approximate results. You can add .forward_native() at their call sites to use the native implementation and see if the numerical difference still exists.
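
For reference, a minimal sketch of such a comparison for RMSNorm (this assumes the vLLM 0.7.x layout, where RMSNorm lives in vllm.model_executor.layers.layernorm and, being a CustomOp, exposes forward_native; the hidden size and batch shape are arbitrary, and a CUDA device is needed for the fused path):

import torch
from vllm.model_executor.layers.layernorm import RMSNorm

# Compare the fused custom operator with the native PyTorch reference path.
norm = RMSNorm(1536).to(device="cuda", dtype=torch.bfloat16)
x = torch.randn(4, 1536, dtype=torch.bfloat16, device="cuda")

fused = norm(x)                  # dispatches to the custom/fused kernel on CUDA
native = norm.forward_native(x)  # pure-PyTorch implementation
print((fused - native).abs().max())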

@runninglsy
Collaborator

Thank you for your diligent effort. I will review this code, attempt to run it, and strive to identify the cause of the precision inconsistency later next week (due to being tied up with work reports in the coming days).

@mlinmg
Author

mlinmg commented Mar 20, 2025

@mlinmg The numerical difference in the Qwen2 LLM backbone might come from RMSNorm and RotaryEmbedding, because they use faster custom operators with approximate results. You can add .forward_native() at their call sites to use the native implementation and see if the numerical difference still exists.

Cool, I'll check on that.

@JumpingRain
Collaborator

@mlinmg

Thank you for the Ovis PR

Thank you very much for your work on porting Ovis to vLLM! I see you've addressed the needs from multiple community issues, which will greatly help improve Ovis inference efficiency. I've successfully run vLLM with Ovis locally based on your code, with results matching expectations.

Here's the code I used to test it:

# Import necessary modules
from PIL import Image
from vllm import LLM, SamplingParams
from vllm import ModelRegistry

from ovis.vllm.ovis_modeling import OvisForConditionalGeneration
ModelRegistry.register_model("Ovis", OvisForConditionalGeneration)
llm = LLM(model="path-to-model/Ovis2-2B", trust_remote_code=True)

from ovis.vllm.processing_ovis import OvisProcessor
processor = OvisProcessor.from_pretrained('/mnt/workspace/cv_multimodal/daxiao/models/Ovis2-2B')

image = Image.open("ovis2_ocr1.jpg")

# Set sampling parameters for generation
greedy_params = SamplingParams(temperature=0.0, max_tokens=250)

# Format the conversation using the processor
output_from_processor = processor.tokenizer.apply_chat_template(
    add_generation_prompt=True,
    conversation=[
        {
            "role": "user",
            "content": [
                {"type": "image", "image": ""},
                {"type": "text", "text": "Describe the image."},
            ],
        }
    ],
    tokenize=False,
)

# Generate the caption
output = llm.generate(
    {
        "prompt": output_from_processor,
        "multi_modal_data": {"image": image},
    },
    greedy_params
)

# Print the generated caption
print(output[0].outputs[0].text)

Questions about next steps:

  1. HF and GitHub modifications:

    • What specific files do we need to add to the HF repositories? Is it just chat_template.json and processing_ovis.py?
    • Are there any other changes needed to our existing model files?
  2. VLLM integration:

    • What's the best approach to merge OVIS architecture into the VLLM main branch?
    • Would you be willing to create a PR to the VLLM repository, or should we handle that?
  3. Technical questions:

    • How can we specify the number of sub-images/partitions at runtime?
    • How do we support passing pixel values directly instead of loading images from paths?
    • Is there support for multi-image input in your implementation?

Thanks again for your contribution to the OVIS ecosystem. Looking forward to working together to improve the model's accessibility and performance!

@mlinmg
Author

mlinmg commented Mar 28, 2025

@JumpingRain

Nice to hear it!
To address your questions in order:

1. HF and GitHub modifications:

  • You'll also need to modify the tokenizer to add the special tokens that were hardcoded in your code. For instance, here's the added_tokens.json file required to make the processor work correctly:

    {
      "</img>": 151671,
      "</tool_call>": 151658,
      "<col>": 151669,
      "<image>": 151665,
      "<image_atom>": 151666,
      "<image_pad>": 151672,
      "<img>": 151667,
      "<pre>": 151668,
      "<row>": 151670,
      "<tool_call>": 151657,
      "<|box_end|>": 151649,
      "<|box_start|>": 151648,
      "<|endoftext|>": 151643,
      "<|file_sep|>": 151664,
      "<|fim_middle|>": 151660,
      "<|fim_pad|>": 151662,
      "<|fim_prefix|>": 151659,
      "<|fim_suffix|>": 151661,
      "<|im_end|>": 151645,
      "<|im_start|>": 151644,
      "<|image_pad|>": 151655,
      "<|object_ref_end|>": 151647,
      "<|object_ref_start|>": 151646,
      "<|quad_end|>": 151651,
      "<|quad_start|>": 151650,
      "<|repo_name|>": 151663,
      "<|video_pad|>": 151656,
      "<|vision_end|>": 151653,
      "<|vision_pad|>": 151654,
      "<|vision_start|>": 151652
    }
  • You’ll need to modify the modeling file for Ovis, since we no longer need the internal image pre-processing logic.

  • Update the config files (both .py and .json) to expose:

    • num_hidden_layers
    • AutoProcessor
    • vocab_size
    • num_attention_heads
    • And switch model_type to "chameleon", since it uses the same image token placeholder.
    • If you make a PR to vLLM, you can add the Ovis architecture to the list of models that share that image token.

2. VLLM integration:


3. Technical questions:

  • How can we specify the number of sub-images/partitions at runtime?
    This is a tricky question. AFAIK, vLLM doesn’t currently expose extra multimodal processor kwargs in their API—but I might be wrong.
    @Isotr0py, do you think this is possible?

  • Can we bypass the HF processor and pass a pre-processed image?
    I’m not sure. The standard vLLM pipeline sends a multimodal image as either a file path or a base64-encoded item. I don’t think you can bypass the HF processor, but again, I could be wrong.

  • Is there support for multi-image input in your implementation?
    Yes; a short sketch of multi-image usage with vLLM is given after this list.
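
For illustration, a minimal sketch of how multi-image input would be passed through vLLM's standard multimodal API with this port (the model path and image file names are placeholders, and limit_mm_per_prompt plus the two-image conversation are assumptions, not something verified against this implementation):

from PIL import Image
from vllm import LLM, ModelRegistry, SamplingParams

from ovis.vllm.ovis_modeling import OvisForConditionalGeneration
from ovis.vllm.processing_ovis import OvisProcessor

# Register the out-of-tree model, as in the test snippet earlier in this thread.
ModelRegistry.register_model("Ovis", OvisForConditionalGeneration)

# limit_mm_per_prompt raises vLLM's default of one image per prompt.
llm = LLM(
    model="path-to-model/Ovis2-2B",
    trust_remote_code=True,
    limit_mm_per_prompt={"image": 2},
)
processor = OvisProcessor.from_pretrained("path-to-model/Ovis2-2B")

images = [Image.open("page1.jpg"), Image.open("page2.jpg")]
prompt = processor.tokenizer.apply_chat_template(
    add_generation_prompt=True,
    conversation=[{
        "role": "user",
        "content": [
            {"type": "image", "image": ""},
            {"type": "image", "image": ""},
            {"type": "text", "text": "Compare the two images."},
        ],
    }],
    tokenize=False,
)

outputs = llm.generate(
    {"prompt": prompt, "multi_modal_data": {"image": images}},
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(outputs[0].outputs[0].text)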

@Isotr0py

This is a tricky question. AFAIK, vLLM doesn’t currently expose extra multimodal processor kwargs in their API—but I might be wrong.

In fact, you can expose extra multimodal processor kwargs in vLLM, just like qwen2.5-vl:

# Fragment from vLLM's Qwen2-VL example; the surrounding LLM(...) call is added
# here for context (model name illustrative):
llm = LLM(
    model="Qwen/Qwen2.5-VL-3B-Instruct",
    # Note - mm_processor_kwargs can also be passed to generate/chat calls
    mm_processor_kwargs={
        "min_pixels": 28 * 28,
        "max_pixels": 1280 * 28 * 28,
    },
)

You can refer to https://github.com/vllm-project/vllm/blob/6909a762012ce665931ff6d482dce17cf927108a/vllm/model_executor/models/qwen2_vl.py#L754-L800 about how to expose processor kwargs.

I’d prefer to offload the PR work to you since I’m currently very busy.

If you need, I can help with upstreaming this implementation to vLLM.

@JumpingRain
Collaborator

@mlinmg @Isotr0py
Thank you for your responses. I've made further progress on implementing the max_partition parameter for Ovis in vLLM.

I can confirm that I've modified the Ovis processor to support configuring max_partition at initialization time. This can be done using the mm_processor_kwargs parameter when initializing the LLM:

llm = LLM(model="/mnt/workspace/cv_multimodal/daxiao/models/Ovis2-2B", 
          device="cuda",
          mm_processor_kwargs={"max_partition": 12},
          trust_remote_code=True)

Currently, this approach allows setting the maximum number of partitions globally for the LLM instance. However, it doesn't yet support configuring different max_partition values for individual requests.

@mlinmg
Author

mlinmg commented Mar 31, 2025

@JumpingRain

Currently, this approach allows setting the maximum number of partitions globally for the LLM instance. However, it doesn't yet support configuring different max_partition values for individual requests.

I think you can actually pass them as mm_processor_kwargs in the chat call API; see the sketch below.
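
A minimal sketch of what a per-request override could look like (this assumes the per-request prompt dict accepts a "mm_processor_kwargs" key, in line with the Qwen2-VL note quoted above; model registration as in the earlier snippets is assumed, and the model path, image file, and prompt string are placeholders):

from PIL import Image
from vllm import LLM, SamplingParams

llm = LLM(
    model="path-to-model/Ovis2-2B",
    trust_remote_code=True,
    mm_processor_kwargs={"max_partition": 12},  # global default
)

# In practice the prompt would come from the Ovis chat template shown earlier;
# a hard-coded placeholder with the <image> token is used here for brevity.
prompt = "<|im_start|>user\n<image>\nDescribe the image.<|im_end|>\n<|im_start|>assistant\n"

output = llm.generate(
    {
        "prompt": prompt,
        "multi_modal_data": {"image": Image.open("example.jpg")},
        # Assumed per-request override of the processor kwargs:
        "mm_processor_kwargs": {"max_partition": 4},
    },
    SamplingParams(temperature=0.0, max_tokens=128),
)
print(output[0].outputs[0].text)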

@Isotr0py

If you need, I can help with upstreaming this implementation to vLLM.

That would be awesome; I'll open it later today.

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.
You have signed the CLA already but the status is still pending? Let us recheck it.
