[Model] Initial support for BLIP-2 #5920

Draft: wants to merge 214 commits into base: main

Commits (214)
34bfa79
Introduce a higher level `INPUT_REGISTRY`
DarkLight1337 Jun 3, 2024
df2aa19
Move dummy data generation to input registry
DarkLight1337 Jun 3, 2024
c72d2b3
Update docs
DarkLight1337 Jun 3, 2024
d8c6488
Rename `process_input` to `map_input`
DarkLight1337 Jun 3, 2024
f18de48
Reorder arguments
DarkLight1337 Jun 3, 2024
653537d
Apply input processor
DarkLight1337 Jun 3, 2024
a2f5a3c
Remove `VisionLanguageConfig` from input mapper
DarkLight1337 Jun 3, 2024
378ad80
Fix bad use of `functools.partial`
DarkLight1337 Jun 3, 2024
7aa3778
Use default input processor
DarkLight1337 Jun 3, 2024
c774168
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 4, 2024
532f863
Fix wrong arguments
DarkLight1337 Jun 4, 2024
080d40c
Use pillow image instead of tensor to avoid bypassing the processor b…
DarkLight1337 Jun 5, 2024
662693a
Update interface of dummy data factory and input processor
DarkLight1337 Jun 5, 2024
9bc5fcc
Use `InputContext` to handle checked type cast of config types
DarkLight1337 Jun 5, 2024
911cac7
Add input processor for injecting image tokens; fix docs
DarkLight1337 Jun 5, 2024
a38b347
Add new documentation pages
DarkLight1337 Jun 5, 2024
29c3bb3
Fix LLaVA-NeXT input processor and cleanup code
DarkLight1337 Jun 5, 2024
9cfbcce
Fix LLaVA-NeXT input processor and cleanup code
DarkLight1337 Jun 5, 2024
7bb6cbf
Add sanity check
DarkLight1337 Jun 6, 2024
ccf49c4
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 6, 2024
3482d32
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 6, 2024
8ea8468
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 8, 2024
be3d64f
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 8, 2024
2ff5be6
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 10, 2024
8e2ff86
Update LLaVA-NeXT
DarkLight1337 Jun 11, 2024
553f684
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 11, 2024
b134dfc
Update name
DarkLight1337 Jun 11, 2024
1efa480
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 11, 2024
1a08444
Update LLaVA-NeXT
DarkLight1337 Jun 11, 2024
7e33706
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 11, 2024
cfc31fd
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 11, 2024
3fb622c
Remove `MULTIMODAL` convenience property as it was causing some (impo…
DarkLight1337 Jun 11, 2024
da85ab2
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 11, 2024
383bea1
Update docs
DarkLight1337 Jun 11, 2024
80a09f2
Remove double processing of image tokens
DarkLight1337 Jun 12, 2024
6a70e4f
Add docs
DarkLight1337 Jun 12, 2024
8322ecb
Add docs
DarkLight1337 Jun 12, 2024
52a0116
Add docs
DarkLight1337 Jun 12, 2024
c1733dd
Add docs
DarkLight1337 Jun 12, 2024
b7a8683
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 12, 2024
9fb5e72
Remove more instances of double processing; update docs
DarkLight1337 Jun 13, 2024
25f9949
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 13, 2024
03c7e65
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 13, 2024
3932b3f
Remove xfail
DarkLight1337 Jun 13, 2024
7fa877a
Fix missing image token in OpenAI API serving
DarkLight1337 Jun 13, 2024
092e550
Fix LLaVA-NeXT test
DarkLight1337 Jun 14, 2024
7a19862
Remove duplicate processing in async engine
DarkLight1337 Jun 14, 2024
fd7d954
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 15, 2024
49dac3e
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 15, 2024
b2c6832
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 15, 2024
0104218
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 18, 2024
18cc7e0
Set up dummy data factory for phi3v
DarkLight1337 Jun 18, 2024
2291617
Move dummy data factories to model files
DarkLight1337 Jun 18, 2024
adf5503
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 18, 2024
e5a94e4
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 18, 2024
9b0386d
Move input processors to model files
DarkLight1337 Jun 18, 2024
4e656e7
Set up input processor for phi3v
DarkLight1337 Jun 18, 2024
fecf1f0
Fix wrong feature size
DarkLight1337 Jun 18, 2024
086e0fe
Fix wrong feature size
DarkLight1337 Jun 18, 2024
8c26a18
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 19, 2024
81522fe
Fix wrong feature size
DarkLight1337 Jun 19, 2024
c036b86
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 24, 2024
f75e1ab
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 24, 2024
b24e8d9
Update validation
DarkLight1337 Jun 24, 2024
8569d35
Fix image feature calculation for phi3v
DarkLight1337 Jun 24, 2024
bfa5aa9
Remove redundant code
DarkLight1337 Jun 24, 2024
dc34121
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 24, 2024
07e695d
Apply isort
DarkLight1337 Jun 24, 2024
8a43a77
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 24, 2024
825401d
Apply yapf
DarkLight1337 Jun 24, 2024
4a0d4d1
Reduce `max_tokens` so that test still passes
DarkLight1337 Jun 25, 2024
8d22fe0
Fix vllm to hf output (+ rename)
DarkLight1337 Jun 25, 2024
2e1ee2f
Fix wrong arguments
DarkLight1337 Jun 25, 2024
7229b07
Move `DummyImageDataFactories` into CLIP model file
DarkLight1337 Jun 25, 2024
17800fd
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 25, 2024
50f994b
Move `input_processor_for_clip` into CLIP
DarkLight1337 Jun 25, 2024
838aa9b
Remove some magic numbers
DarkLight1337 Jun 25, 2024
e7a5564
Test multiscale inputs for LLaVA-NeXT
DarkLight1337 Jun 25, 2024
36e8001
Handle multiscale inputs (different number of patches per batch) in L…
DarkLight1337 Jun 25, 2024
39e6d42
Fix wrong feature size
DarkLight1337 Jun 26, 2024
0d7f18f
Apply formatter
DarkLight1337 Jun 26, 2024
8e5dc7c
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
d9a4150
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 26, 2024
6849236
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
6d02491
Revert max_tokens
DarkLight1337 Jun 26, 2024
76ddea4
Add more tests for input mapper
DarkLight1337 Jun 26, 2024
4b20e66
Sanity check: Also test multiscale inputs for LLaVA-1.5
DarkLight1337 Jun 26, 2024
784af1a
Do not auto-convert image dtype to model's dtype
DarkLight1337 Jun 26, 2024
8e5fb12
Update prompts
DarkLight1337 Jun 26, 2024
4b947ad
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 26, 2024
e7397ee
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
865be7a
Fix mapper tests w.r.t. dtype change
DarkLight1337 Jun 26, 2024
9e82a26
Clarify docs and add todo
DarkLight1337 Jun 26, 2024
46391de
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
a4733f9
Remove TODO since vision config will be removed soon
DarkLight1337 Jun 26, 2024
6b19e6c
Expand docs
DarkLight1337 Jun 26, 2024
be326f2
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
f451668
Add ref
DarkLight1337 Jun 26, 2024
5c0c8cf
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
3d7b795
Update docs
DarkLight1337 Jun 26, 2024
1abb8a7
Add docs
DarkLight1337 Jun 26, 2024
428d420
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
698830f
Fix name
DarkLight1337 Jun 26, 2024
ac9ea9a
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
334b1a9
Add `MultiModalInputs` to docs
DarkLight1337 Jun 26, 2024
36ab12d
Fix and add links
DarkLight1337 Jun 26, 2024
af01e97
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 26, 2024
c303421
Fix `is_multiscale` not provided anymore
DarkLight1337 Jun 26, 2024
0a0c0e3
Also test multiscale input for phi3v
DarkLight1337 Jun 26, 2024
60517a7
Revert max_tokens for phi3v as numerical error still persists
DarkLight1337 Jun 26, 2024
57df434
Improve error message
DarkLight1337 Jun 26, 2024
ffe0675
Log the full output for easier reference
DarkLight1337 Jun 26, 2024
c7a2a66
Update xfail to be more efficient
DarkLight1337 Jun 26, 2024
598e0e3
Also xfail llava test
DarkLight1337 Jun 26, 2024
f84d87a
Update comment
DarkLight1337 Jun 27, 2024
5dfb6fc
Update docs
DarkLight1337 Jun 27, 2024
bbeff03
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 27, 2024
bf3281c
modify llava_next
ywang96 Jun 27, 2024
56e2d3b
Update comment
DarkLight1337 Jun 27, 2024
d2f8c6d
Update docs
DarkLight1337 Jun 27, 2024
7c197d2
Use dynamic image feature size calculation
DarkLight1337 Jun 27, 2024
f5ffd3e
Fix phi3v not handling `image_sizes` correctly
DarkLight1337 Jun 27, 2024
66aad21
Apply formatter
DarkLight1337 Jun 27, 2024
d1c68c0
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 27, 2024
5f32d53
Add see also
DarkLight1337 Jun 27, 2024
15df4ef
Update examples prompt format
DarkLight1337 Jun 27, 2024
f2e4633
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 27, 2024
095e008
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 27, 2024
a6e3162
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 27, 2024
28922af
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 27, 2024
ce06541
Fix config
DarkLight1337 Jun 27, 2024
cdcc2d4
Fix config
DarkLight1337 Jun 27, 2024
4212abf
Update docs
DarkLight1337 Jun 27, 2024
07c08e3
Update docs
DarkLight1337 Jun 27, 2024
f3f5854
Fix `MultiModalInputs` not working in Python 3.8
DarkLight1337 Jun 27, 2024
bebf9e7
Fix `_ImageAssets` not working in Python 3.8
DarkLight1337 Jun 27, 2024
3f4f4bf
Merge branch 'upstream' into blip-2
DarkLight1337 Jun 27, 2024
e948275
Inital impl. BLIP-2
DarkLight1337 Jun 27, 2024
06d2339
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 27, 2024
fc83d0c
Update BLIP-2 using new API
DarkLight1337 Jun 27, 2024
eb33485
Enable test to run
DarkLight1337 Jun 27, 2024
d68c462
Fix input processor
DarkLight1337 Jun 27, 2024
b909c60
Fix wrong lm_head
DarkLight1337 Jun 27, 2024
9b85e60
Fix wrong output of vision tower
DarkLight1337 Jun 28, 2024
e864427
Fix BLIP-2 repeating output and output conversion
DarkLight1337 Jun 28, 2024
7e80ecc
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 28, 2024
487d742
Merge branch 'upstream' into mm-image-tokenizer
DarkLight1337 Jun 28, 2024
36f72b6
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 28, 2024
e354d2b
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 28, 2024
43350b8
update example
ywang96 Jun 28, 2024
57791de
update doc
ywang96 Jun 28, 2024
b2b1e11
Merge branch 'mm-image-tokenizer' into mm-image-tokenizer-2
DarkLight1337 Jun 28, 2024
5757821
Apply formatter
DarkLight1337 Jun 28, 2024
f36e099
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 28, 2024
fbc5f70
Update docs
DarkLight1337 Jun 28, 2024
dbeee10
Support `eos_token_id` from `config.json`
DarkLight1337 Jun 28, 2024
161585a
Add `trust_remote_code`
DarkLight1337 Jun 28, 2024
261dc71
Merge branch 'eos-from-config' into blip-2
DarkLight1337 Jun 28, 2024
36de6f4
Use updated EOS token detection code
DarkLight1337 Jun 28, 2024
4292ccb
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 28, 2024
6a24360
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 28, 2024
5d23a96
Apply formatter
DarkLight1337 Jun 28, 2024
9c91649
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 28, 2024
78d2e10
Apply formatter
DarkLight1337 Jun 28, 2024
78064e0
Fix OpenAI server not working for phi3v
DarkLight1337 Jun 28, 2024
4cb809c
Preemptively handle upcoming models
DarkLight1337 Jun 28, 2024
754e238
Add more models
DarkLight1337 Jun 28, 2024
9edb53c
Update feature size for dummy data
DarkLight1337 Jun 28, 2024
2795b16
Use a less strict check
DarkLight1337 Jun 29, 2024
86ffd60
Fix phi3v test
DarkLight1337 Jun 29, 2024
f339dd1
Update default length as the dummy image feature size is increased
DarkLight1337 Jun 29, 2024
59a7a4c
Raise full error if output is completely different
DarkLight1337 Jun 29, 2024
62952e1
Fix phi3v not using input processor
DarkLight1337 Jun 29, 2024
49f39d6
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 29, 2024
5810335
Update BLIP-2 test according to merged changes
DarkLight1337 Jun 29, 2024
0ce3ecb
Move size factors outside
DarkLight1337 Jun 29, 2024
379b99a
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 29, 2024
b43e8c3
Apply formatter
DarkLight1337 Jun 29, 2024
ac25521
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 29, 2024
9e2e69a
Move size factors outside
DarkLight1337 Jun 29, 2024
44dec19
Implement BLIPVisionModel in vLLM
DarkLight1337 Jun 29, 2024
9023794
Fix some outputs not being checked
DarkLight1337 Jun 29, 2024
453e144
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 29, 2024
3da4a00
Fix some outputs not being checked
DarkLight1337 Jun 29, 2024
fc5549c
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 30, 2024
f6c8061
Also test no image
DarkLight1337 Jun 30, 2024
e4f99e6
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 30, 2024
191c671
Also test no image
DarkLight1337 Jun 30, 2024
15cc847
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 30, 2024
235c8a9
Batch by size factors
DarkLight1337 Jun 30, 2024
b98d924
Factor out xfail code
DarkLight1337 Jun 30, 2024
a3eb4fe
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 30, 2024
90c60f3
Apply merge changes to BLIP-2 test
DarkLight1337 Jun 30, 2024
2c2558b
Fix unused args
DarkLight1337 Jun 30, 2024
b0d868c
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 30, 2024
ec28eca
Check logprobs instead of xfailing
DarkLight1337 Jun 30, 2024
5a337f5
Merge branch 'upstream' into mm-image-tokenizer-2
DarkLight1337 Jun 30, 2024
05db877
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 30, 2024
213e1e4
Apply merge changes to BLIP-2 tests
DarkLight1337 Jun 30, 2024
2eb3490
Fix different scales not being in the same batch
DarkLight1337 Jun 30, 2024
9b42589
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 30, 2024
e671249
Apply merge changes to BLIP-2 tests
DarkLight1337 Jun 30, 2024
6301a52
Apply suggestions from code review
DarkLight1337 Jun 30, 2024
14f10fc
Add link
DarkLight1337 Jun 30, 2024
7c335c3
Use `self.multi_modal_projector` directly
DarkLight1337 Jun 30, 2024
33c860e
Allow users to send image token formatted prompt directly
DarkLight1337 Jun 30, 2024
e03bc57
Factor out the code for placeholder token IDs
DarkLight1337 Jun 30, 2024
a9da5cf
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 30, 2024
b270ac3
Remove `-rx` flag
DarkLight1337 Jun 30, 2024
36095e1
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 30, 2024
3161221
Fix distributed tests
DarkLight1337 Jun 30, 2024
85d108a
Fix string mismatch warning
DarkLight1337 Jun 30, 2024
d648e32
Relax phi3v test; add TODO for llava tests
DarkLight1337 Jun 30, 2024
37fa0e7
Merge branch 'mm-image-tokenizer-2' into blip-2
DarkLight1337 Jun 30, 2024
2 changes: 1 addition & 1 deletion docs/source/dev/input_processing/model_inputs_index.rst
@@ -8,7 +8,7 @@ Input Processing
vLLM provides a mechanism for defining input processors for each model so that the inputs are processed
in :class:`~vllm.LLMEngine` before they are passed to model executors.

Currently, this mechanism is only utilized in **multi-modal models** for preprocessing multi-modal input
Currently, this mechanism is only utilized in :ref:`multi-modal models <multi_modality>` for preprocessing multi-modal input
data in addition to input prompt, but it can be extended to text-only language models when needed.

Guides
124 changes: 124 additions & 0 deletions docs/source/dev/multimodal/adding_multimodal_model.rst
@@ -0,0 +1,124 @@
.. _adding_a_new_multimodal_model:

Adding a New Multimodal Model
=============================

This document provides a high-level guide on integrating a :ref:`multi-modal model <multi_modality>` into vLLM.

.. note::
The complexity of adding a new model depends heavily on the model's architecture.
The process is considerably more straightforward if the model shares a similar architecture with an existing model in vLLM.
However, for models that include new operators (e.g., a new attention mechanism), the process can be a bit more complex.

.. tip::
If you are encountering issues while integrating your model into vLLM, feel free to open an issue on our `GitHub <https://github.com/vllm-project/vllm/issues>`_ repository.
We will be happy to help you out!


1. Set up the base vLLM model
-----------------------------

As usual, follow :ref:`these steps <adding_a_new_model>` to implement the model in vLLM, but note the following:

- You should additionally implement the :class:`~vllm.model_executor.models.interfaces.SupportsVision` interface.

.. code-block:: diff

    + from vllm.model_executor.models.interfaces import SupportsVision

    - class YourModelForImage2Seq(nn.Module):
    + class YourModelForImage2Seq(nn.Module, SupportsVision):

.. note::
The model class does not have to be named :code:`*ForCausalLM`.
Check out `the HuggingFace Transformers documentation <https://huggingface.co/docs/transformers/model_doc/auto#multimodal>`__ for some examples.

- While implementing the :meth:`~torch.nn.Module.forward` method, reserve a keyword parameter
for each input tensor that corresponds to a multi-modal input, as shown in the following example:

.. code-block:: diff

      def forward(
          self,
          input_ids: torch.Tensor,
          positions: torch.Tensor,
          kv_caches: List[torch.Tensor],
          attn_metadata: AttentionMetadata,
    +     pixel_values: torch.Tensor,
      ) -> SamplerOutput:

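Putting this step together, a minimal skeleton could look like the sketch below. It is illustrative only: the ``vision_tower``/``language_model`` attributes and the ``Optional`` default for ``pixel_values`` are assumptions, and the import paths should be checked against the vLLM source.

.. code-block:: python

    from typing import List, Optional

    import torch
    from torch import nn

    from vllm.attention import AttentionMetadata
    from vllm.model_executor.models.interfaces import SupportsVision


    class YourModelForImage2Seq(nn.Module, SupportsVision):

        def __init__(self, config) -> None:
            super().__init__()
            self.config = config
            # Hypothetical submodules: a vision encoder and a decoder-only LM.
            self.vision_tower = ...
            self.language_model = ...

        def forward(
            self,
            input_ids: torch.Tensor,
            positions: torch.Tensor,
            kv_caches: List[torch.Tensor],
            attn_metadata: AttentionMetadata,
            pixel_values: Optional[torch.Tensor] = None,
        ):
            # If an image is present, encode it and merge the resulting embeddings
            # into the text embeddings before running the language model.
            ...
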
2. Register input mappers
-------------------------

For each modality type to support, decorate the model class with :meth:`MULTIMODAL_REGISTRY.register_input_mapper <vllm.multimodal.MultiModalRegistry.register_input_mapper>`.
This decorator accepts a function that maps multi-modal inputs to the keyword arguments you have previously defined in :meth:`~torch.nn.Module.forward`.

.. code-block:: diff

      from vllm.model_executor.models.interfaces import SupportsVision
    + from vllm.multimodal import MULTIMODAL_REGISTRY

    + @MULTIMODAL_REGISTRY.register_image_feature_input_mapper()
    + @MULTIMODAL_REGISTRY.register_image_pixel_input_mapper()
      class YourModelForImage2Seq(nn.Module, SupportsVision):

A default mapper is available for each modality in the core vLLM library. This input mapper will be used if you do not provide your own function.

.. seealso::
:ref:`input_processing_pipeline`
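
For illustration, a custom mapper might look roughly like the sketch below, converting a PIL image into the ``pixel_values`` tensor reserved in ``forward``. The signature, the import paths, and the inline preprocessing (a stand-in for the model's real HuggingFace image processor) are assumptions to verify against the registry code.

.. code-block:: python

    import numpy as np
    import torch
    from PIL import Image

    from vllm.inputs.registry import InputContext
    from vllm.multimodal import MultiModalInputs


    def your_pixel_input_mapper(ctx: InputContext, data: object) -> MultiModalInputs:
        # ``ctx`` exposes the model configs (e.g. for image-size checks);
        # ``data`` is the object passed via ``multi_modal_data``.
        if isinstance(data, Image.Image):
            # Stand-in preprocessing: float NCHW tensor in [0, 1], no resize/normalize.
            array = np.asarray(data.convert("RGB"), dtype=np.float32) / 255.0
            pixel_values = torch.from_numpy(array).permute(2, 0, 1).unsqueeze(0)
            # The keys must match the keyword parameters reserved in ``forward``.
            return MultiModalInputs({"pixel_values": pixel_values})
        raise TypeError(f"Unsupported image input: {type(data)}")

    # The function is then passed to the decorator shown above, e.g.
    # ``@MULTIMODAL_REGISTRY.register_image_pixel_input_mapper(your_pixel_input_mapper)``.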


3. (Optional) Register dummy data
---------------------------------

During startup, dummy data is passed to the vLLM model to allocate memory. This only consists of text input by default, which may not be applicable to multi-modal models.
In such cases, you can define your own dummy data by registering a factory method via :meth:`INPUT_REGISTRY.register_dummy_data <vllm.inputs.registry.InputRegistry.register_dummy_data>`.

.. code-block:: diff

      from vllm.inputs import INPUT_REGISTRY
      from vllm.model_executor.models.interfaces import SupportsVision
      from vllm.multimodal import MULTIMODAL_REGISTRY

      @MULTIMODAL_REGISTRY.register_image_feature_input_mapper()
      @MULTIMODAL_REGISTRY.register_image_pixel_input_mapper()
    + @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
      class YourModelForImage2Seq(nn.Module, SupportsVision):

Here are some examples:

- Image inputs (static feature size): `LLaVA-1.5 Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py>`__
- Image inputs (dynamic feature size): `LLaVA-NeXT Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py>`__

.. seealso::
:ref:`input_processing_pipeline`
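
For illustration only, a dummy data factory for an image model might be sketched as below. The factory signature, the ``(SequenceData, ImagePixelData)`` return shape, and all literal token IDs and sizes are assumptions; the LLaVA examples linked above show the actual interface.

.. code-block:: python

    from PIL import Image

    from vllm.inputs.registry import InputContext
    from vllm.multimodal.image import ImagePixelData
    from vllm.sequence import SequenceData

    # Placeholder values; a real model derives these from its HF config.
    IMAGE_TOKEN_ID = 32000
    IMAGE_FEATURE_SIZE = 576
    IMAGE_SIZE = 336


    def dummy_data_for_your_model(ctx: InputContext, seq_len: int):
        # One placeholder token per image feature, padded to ``seq_len`` so the
        # profiling run sees the worst-case memory footprint.
        token_ids = [IMAGE_TOKEN_ID] * IMAGE_FEATURE_SIZE
        token_ids += [0] * (seq_len - IMAGE_FEATURE_SIZE)
        seq_data = SequenceData(token_ids)

        # A dummy image at the resolution expected by the vision encoder.
        dummy_image = Image.new("RGB", (IMAGE_SIZE, IMAGE_SIZE), color=0)
        return seq_data, ImagePixelData(dummy_image)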


4. (Optional) Register input processor
--------------------------------------

Sometimes, there is a need to process inputs at the :class:`~vllm.LLMEngine` level before they are passed to the model executor.
This is often because, unlike in HuggingFace Transformers implementations, the reshaping and/or expansion of multi-modal embeddings needs to take place outside the model's :meth:`~torch.nn.Module.forward` call.
You can register input processors via :meth:`INPUT_REGISTRY.register_input_processor <vllm.inputs.registry.InputRegistry.register_input_processor>`.

.. code-block:: diff

      from vllm.inputs import INPUT_REGISTRY
      from vllm.model_executor.models.interfaces import SupportsVision
      from vllm.multimodal import MULTIMODAL_REGISTRY

      @MULTIMODAL_REGISTRY.register_image_feature_input_mapper()
      @MULTIMODAL_REGISTRY.register_image_pixel_input_mapper()
      @INPUT_REGISTRY.register_dummy_data(<your_dummy_data_factory>)
    + @INPUT_REGISTRY.register_input_processor(<your_input_processor>)
      class YourModelForImage2Seq(nn.Module, SupportsVision):

A common use case of input processors is inserting placeholder tokens to leverage the vLLM framework for attention mask generation.
Here are some examples:

- Insert static number of image tokens: `LLaVA-1.5 Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava.py>`__
- Insert dynamic number of image tokens: `LLaVA-NeXT Model <https://github.com/vllm-project/vllm/blob/main/vllm/model_executor/models/llava_next.py>`__

.. seealso::
:ref:`input_processing_pipeline`
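
As a sketch of the placeholder-token use case (not taken from the diff above), the processor below expands a single image token in the tokenized prompt into the number of positions expected by the vision encoder; the ``LLMInputs`` handling and the literal token ID and feature size are assumptions.

.. code-block:: python

    from vllm.inputs import LLMInputs
    from vllm.inputs.registry import InputContext

    # Placeholder values; a real model derives these from its HF config.
    IMAGE_TOKEN_ID = 32000
    IMAGE_FEATURE_SIZE = 576


    def input_processor_for_your_model(ctx: InputContext,
                                       llm_inputs: LLMInputs) -> LLMInputs:
        multi_modal_data = llm_inputs.get("multi_modal_data")
        if multi_modal_data is None:
            # Text-only prompt: nothing to expand.
            return llm_inputs

        # Repeat the single image placeholder so that the prompt reserves one
        # position per image feature, letting vLLM build the attention mask.
        new_token_ids = []
        for token_id in llm_inputs["prompt_token_ids"]:
            if token_id == IMAGE_TOKEN_ID:
                new_token_ids.extend([IMAGE_TOKEN_ID] * IMAGE_FEATURE_SIZE)
            else:
                new_token_ids.append(token_id)

        return LLMInputs(prompt_token_ids=new_token_ids,
                         prompt=llm_inputs.get("prompt"),
                         multi_modal_data=multi_modal_data)
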
18 changes: 15 additions & 3 deletions docs/source/dev/multimodal/multimodal_index.rst
@@ -1,3 +1,5 @@
.. _multi_modality:

Multi-Modality
==============

@@ -8,9 +10,15 @@ vLLM provides experimental support for multi-modal models through the :mod:`vllm
:class:`vllm.inputs.PromptStrictInputs` accepts an additional attribute ``multi_modal_data``
which allows you to pass in multi-modal input alongside text and token prompts.

By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model,
you must decorate the model class with :meth:`MULTIMODAL_REGISTRY.register_dummy_data <MultiModalRegistry.register_dummy_data>`,
as well as :meth:`MULTIMODAL_REGISTRY.register_input <MultiModalRegistry.register_input>` for each modality type to support.
By default, vLLM models do not support multi-modal inputs. To enable multi-modal support for a model, please follow :ref:`the guide for adding a new multimodal model <adding_a_new_multimodal_model>`.

Guides
++++++

.. toctree::
:maxdepth: 1

adding_multimodal_model

Module Contents
+++++++++++++++
@@ -33,6 +41,10 @@ Base Classes
:members:
:show-inheritance:

.. autoclass:: vllm.multimodal.MultiModalInputs
:members:
:show-inheritance:

.. autoclass:: vllm.multimodal.MultiModalPlugin
:members:
:show-inheritance:
11 changes: 3 additions & 8 deletions docs/source/models/vlm.rst
@@ -23,7 +23,7 @@ The following :ref:`engine arguments <engine_args>` are specific to VLMs:
Currently, the support for vision language models on vLLM has the following limitations:

* Only single image input is supported per text prompt.
* Dynamic ``image_input_shape`` is not supported: the input image will be resized to the static ``image_input_shape``. This means our LLaVA-NeXT output may not exactly match the huggingface implementation.

We are continuously improving user & developer experience for VLMs. Please `open an issue on GitHub <https://github.com/vllm-project/vllm/issues/new/choose>`_ if you have any feedback or feature requests.

@@ -48,13 +47,12 @@ To initialize a VLM, the aforementioned arguments must be passed to the ``LLM``

To pass an image to the model, note the following in :class:`vllm.inputs.PromptStrictInputs`:

* ``prompt``: The prompt should have a number of ``<image>`` tokens equal to ``image_feature_size``.
* ``prompt``: The prompt should follow the same format as that for the HuggingFace version of the model.
* ``multi_modal_data``: This should be an instance of :class:`~vllm.multimodal.image.ImagePixelData` or :class:`~vllm.multimodal.image.ImageFeatureData`.

.. code-block:: python
prompt = "<image>" * 576 + (
"\nUSER: What is the content of this image?\nASSISTANT:")
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"
# Load the image using PIL.Image
image = ...
@@ -70,8 +68,6 @@ To pass an image to the model, note the following in :class:`vllm.inputs.PromptS
A code example can be found in `examples/llava_example.py <https://github.com/vllm-project/vllm/blob/main/examples/llava_example.py>`_.
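
Putting these pieces together, a minimal end-to-end sketch might look like the following; the engine arguments follow the VLM engine arguments and example scripts shown above, while the model name, image URL, and sampling settings are purely illustrative.

.. code-block:: python

    from io import BytesIO

    import requests
    from PIL import Image

    from vllm import LLM, SamplingParams
    from vllm.multimodal.image import ImagePixelData

    llm = LLM(
        model="llava-hf/llava-1.5-7b-hf",
        image_input_type="pixel_values",
        image_token_id=32000,
        image_input_shape="1,3,336,336",
        image_feature_size=576,
    )

    prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

    # Any RGB image works; this URL is taken from the bundled examples.
    url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
    image = Image.open(BytesIO(requests.get(url).content))

    outputs = llm.generate(
        {
            "prompt": prompt,
            "multi_modal_data": ImagePixelData(image),
        },
        sampling_params=SamplingParams(temperature=0, max_tokens=64),
    )

    for o in outputs:
        print(o.outputs[0].text)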

.. important::
We will remove the need to format image tokens in a future release. Afterwards, the input text will follow the same format as that for the original HuggingFace model.

Online OpenAI Vision API Compatible Inference
----------------------------------------------
@@ -141,5 +137,4 @@ A full code example can be found in `examples/openai_vision_api_client.py <https
export VLLM_IMAGE_FETCH_TIMEOUT=<timeout>
.. note::
The prompt formatting with the image token ``<image>`` is not needed when serving VLMs with the API server since the prompt will be
processed automatically by the server.
There is no need to format the prompt in the API request since it will be handled by the server.
6 changes: 2 additions & 4 deletions examples/llava_example.py
@@ -22,8 +22,7 @@ def run_llava_pixel_values(*, disable_image_processor: bool = False):
disable_image_processor=disable_image_processor,
)

prompt = "<image>" * 576 + (
"\nUSER: What is the content of this image?\nASSISTANT:")
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

if disable_image_processor:
image = torch.load("images/stop_sign_pixel_values.pt")
@@ -49,8 +48,7 @@ def run_llava_image_features():
image_feature_size=576,
)

prompt = "<image>" * 576 + (
"\nUSER: What is the content of this image?\nASSISTANT:")
prompt = "USER: <image>\nWhat is the content of this image?\nASSISTANT:"

image: torch.Tensor = torch.load("images/stop_sign_image_features.pt")

2 changes: 1 addition & 1 deletion examples/llava_next_example.py
@@ -19,7 +19,7 @@
image_feature_size=1176,
)

prompt = "[INST] " + "<image>" * 1176 + "\nWhat is shown in this image? [/INST]"
prompt = "[INST] <image>\nWhat is shown in this image? [/INST]"
url = "https://h2o-release.s3.amazonaws.com/h2ogpt/bigben.jpg"
image = Image.open(BytesIO(requests.get(url).content))
sampling_params = SamplingParams(temperature=0.8, top_p=0.95, max_tokens=100)
2 changes: 0 additions & 2 deletions examples/phi3v_example.py
@@ -28,8 +28,6 @@ def run_phi3v():

# single-image prompt
prompt = "<|user|>\n<|image_1|>\nWhat is the season?<|end|>\n<|assistant|>\n" # noqa: E501
prompt = prompt.replace("<|image_1|>", "<|image|>" * 1921 + "<s>")

sampling_params = SamplingParams(temperature=0, max_tokens=64)

outputs = llm.generate(