@souvikchand

This PR fixes a runtime error in the vLLM multimodal pipeline when running HunyuanOCR.
Issue #35 was caused by sending images with the wrong message schema, which led vLLM to misinterpret the image input and produce a tensor with an invalid shape:

ValueError: image_grid_thw has rank 3 but expected 2.
Expected shape: ('ni', 3), but got torch.Size([2, 1, 3])

What I changed

  1. Updated the request format from
{ "type": "image_url", "image_url": { "url": "data:image/jpeg;base64,..." } }

to

{
    "type": "image_url",
    "image_url": f"data:{mime};base64,{encode_image(image_path)}"
 },
  2. Added automatic MIME-type detection to ensure images are sent with the correct format (png/jpeg/webp/etc.)
  3. Ensured image_url is a string, not a nested object, which aligns with vLLM’s expected schema for HuggingFace vision models.
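
For illustration, the changes listed above could be sketched roughly as follows. This is a minimal sketch, not the exact code in the PR: `encode_image` is named in the snippet above, but its body here is assumed, and `guess_mime` and `build_image_part` are hypothetical helper names introduced only for this example. MIME detection is assumed to be extension-based via the standard `mimetypes` module, falling back to `image/jpeg`.

```python
import base64
import mimetypes


def encode_image(image_path: str) -> str:
    """Read the image file and return its bytes as a base64 string (assumed body)."""
    with open(image_path, "rb") as f:
        return base64.b64encode(f.read()).decode("ascii")


def guess_mime(image_path: str, default: str = "image/jpeg") -> str:
    """Hypothetical helper: guess the MIME type from the file extension."""
    mime, _ = mimetypes.guess_type(image_path)
    # Fall back to the default when the extension is unknown or not an image type.
    return mime if mime and mime.startswith("image/") else default


def build_image_part(image_path: str) -> dict:
    """Hypothetical helper: build the message part in the format this PR describes,
    with image_url as a plain data-URL string rather than a nested object."""
    mime = guess_mime(image_path)
    return {
        "type": "image_url",
        "image_url": f"data:{mime};base64,{encode_image(image_path)}",
    }
```

The returned dict would then be placed in the `content` list of the user message alongside any text parts.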

@diegocarturan-debug

Uploading ia-prime-historia-completa-scene-4.png…

@souvikchand

@diegocarturan-debug
Sorry, but can you explain your comment above? The link just redirects to this same page.
