
Commit b8c0a94

Begin reworking docs to start fitting in response parsing
1 parent b22008b · commit b8c0a94

2 files changed: +51 -44 lines changed


docs/source/en/chat_templating_multimodal.md

Lines changed: 6 additions & 42 deletions
@@ -18,51 +18,12 @@ rendered properly in your Markdown viewer.

Multimodal chat models accept inputs like images, audio or video, in addition to text. The `content` key in a multimodal chat history is a list containing multiple items of different types. This is unlike text-only chat models whose `content` key is a single string.
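
For example, the same user turn in both formats, using the example image from the snippets below:

```py
# Text-only chat models: "content" is a single string
text_only_message = {"role": "user", "content": "What are these?"}

# Multimodal chat models: "content" is a list of typed items
multimodal_message = {
    "role": "user",
    "content": [
        {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
        {"type": "text", "text": "What are these?"},
    ],
}
```
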

-
In the same way the [Tokenizer](./fast_tokenizer) class handles chat templates and tokenization for text-only models,
the [Processor](./processors) class handles preprocessing, tokenization and chat templates for multimodal models. Their [`~ProcessorMixin.apply_chat_template`] methods are almost identical.

-This guide will show you how to chat with multimodal models with the high-level [`ImageTextToTextPipeline`] and at a lower level using the [`~ProcessorMixin.apply_chat_template`] and [`~GenerationMixin.generate`] methods.
-
-## ImageTextToTextPipeline
-
-[`ImageTextToTextPipeline`] is a high-level image and text generation class with a “chat mode”. Chat mode is enabled when a conversational model is detected and the chat prompt is [properly formatted](./llm_tutorial#wrong-prompt-format).
-
-Add image and text blocks to the `content` key in the chat history.
-
-```py
-messages = [
-    {
-        "role": "system",
-        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
-    },
-    {
-        "role": "user",
-        "content": [
-            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
-            {"type": "text", "text": "What are these?"},
-        ],
-    },
-]
-```
-
-Create an [`ImageTextToTextPipeline`] and pass the chat to it. For large models, setting [device_map=“auto”](./models#big-model-inference) helps load the model quicker and automatically places it on the fastest device available. Setting the data type to [auto](./models#model-data-type) also helps save memory and improve speed.
-
-```python
-import torch
-from transformers import pipeline
-
-pipe = pipeline("image-text-to-text", model="Qwen/Qwen2.5-VL-3B-Instruct", device_map="auto", dtype="auto")
-out = pipe(text=messages, max_new_tokens=128)
-print(out[0]['generated_text'][-1]['content'])
-```
-
-
-```
-Ahoy, me hearty! These be two feline friends, likely some tabby cats, taking a siesta on a cozy pink blanket. They're resting near remote controls, perhaps after watching some TV or just enjoying some quiet time together. Cats sure know how to find comfort and relaxation, don't they?
-```
-
-Aside from the gradual descent from pirate-speak into modern American English (it **is** only a 3B model, after all), this is correct!
+This guide covers chats with image and video models at a lower level using the [`~ProcessorMixin.apply_chat_template`] and [`~GenerationMixin.generate`] methods, and
+is intended for more advanced users. If you just want to quickly chat with a VLM, you can use the [`ImageTextToTextPipeline`] class, which is covered in the
+[chat basics](./conversations) guide.

## Using `apply_chat_template`

@@ -113,6 +74,9 @@ print(processor.decode(out[0]))

The decoded output contains the full conversation so far, including the user message and the placeholder tokens that contain the image information. You may need to trim the previous conversation from the output before displaying it to the user.
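
A minimal sketch of that trimming, assuming `inputs` is the dict returned by `apply_chat_template` that was passed to `generate` (the variable names follow the snippet above and are illustrative):

```py
# Keep only the tokens generated after the prompt, then decode just those.
prompt_length = inputs["input_ids"].shape[-1]
new_tokens = out[0][prompt_length:]
print(processor.decode(new_tokens, skip_special_tokens=True))
```
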

+## Response parsing
+
+TODO section on response parsing with a processor here

## Video inputs

docs/source/en/conversations.md

Lines changed: 45 additions & 2 deletions
@@ -49,7 +49,7 @@ transformers chat -h
The chat is implemented on top of the [AutoClass](./model_doc/auto), using tooling from [text generation](./llm_tutorial) and [chat](./chat_templating). It uses the `transformers serve` CLI under the hood ([docs](./serving.md#serve-cli)).


-## TextGenerationPipeline
+## Using pipelines to chat

[`TextGenerationPipeline`] is a high-level text generation class with a "chat mode". Chat mode is enabled when a conversational model is detected and the chat prompt is [properly formatted](./llm_tutorial#wrong-prompt-format).
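
As a quick illustration of that chat mode (the checkpoint name is only an example), you can pass a list of messages straight to the pipeline:

```py
from transformers import pipeline

chat = [
    {"role": "user", "content": "Tell me a joke about pirates."},
]

# A list of role/content dicts triggers chat mode; a plain string prompt would not.
pipe = pipeline("text-generation", model="HuggingFaceTB/SmolLM2-1.7B-Instruct", dtype="auto", device_map="auto")
response = pipe(chat, max_new_tokens=128)
print(response[0]["generated_text"][-1]["content"])
```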

@@ -93,9 +93,52 @@ print(response[0]["generated_text"][-1]["content"])
By repeating this process, you can continue the conversation as long as you like, at least until the model runs out of context window
or you run out of memory.

+## Including images in chats
+
+Some models, known as vision-language models (VLMs), can accept images as part of the chat input. To chat with a VLM, use the
+`ImageTextToTextPipeline`, which you can create by setting the `task` argument of `pipeline` to `image-text-to-text`. It works very similarly to
+the `TextGenerationPipeline` above, but we can add `image` blocks to our messages:
+
+```py
+messages = [
+    {
+        "role": "system",
+        "content": [{"type": "text", "text": "You are a friendly chatbot who always responds in the style of a pirate"}],
+    },
+    {
+        "role": "user",
+        "content": [
+            {"type": "image", "url": "http://images.cocodataset.org/val2017/000000039769.jpg"},
+            {"type": "text", "text": "What are these?"},
+        ],
+    },
+]
+```
+
+Now we just create our pipeline and get a response as before, using the `image-text-to-text` task:
+
+```py
+import torch
+from transformers import pipeline
+
+pipe = pipeline("image-text-to-text", model="Qwen/Qwen2.5-VL-3B-Instruct", device_map="auto", dtype="auto")
+out = pipe(text=messages, max_new_tokens=128)
+print(out[0]['generated_text'][-1]['content'])
+```
+
+And as above, you can continue the conversation by appending the model's reply and your next message to the `messages` list. It's okay for
+some messages to VLMs to be text-only - you don't need to include an image every time!
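
For example, one more turn might look like this (a sketch reusing `pipe`, `messages` and `out` from the snippet above):

```py
# Add the assistant's reply to the history, then ask a text-only follow-up.
messages.append(out[0]["generated_text"][-1])
messages.append({"role": "user", "content": [{"type": "text", "text": "What breed do you think they are?"}]})

out = pipe(text=messages, max_new_tokens=128)
print(out[0]["generated_text"][-1]["content"])
```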
+
+## Chatting with "reasoning" models
+
+Since late 2024, we have started to see the appearance of "reasoning" models, also known as "chain of thought" models.
+These models write a step-by-step reasoning process before their final answer.
+
+TODO example and show response parsing
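
Many reasoning checkpoints wrap their chain of thought in special tags such as `<think>...</think>` (the exact tags vary by model); a rough, model-agnostic sketch of splitting the reasoning out of a reply:

```py
# Assumes `response` holds the assistant's reply text, e.g.
# response = out[0]["generated_text"][-1]["content"]
reasoning, separator, answer = response.partition("</think>")
if separator:  # the model emitted a reasoning block
    reasoning = reasoning.removeprefix("<think>").strip()
else:  # no tags found, treat the whole reply as the answer
    reasoning, answer = "", reasoning
print("Reasoning:", reasoning)
print("Answer:", answer.strip())
```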
+
## Performance and memory usage

-Transformers load models in full `float32` precision by default, and for an 8B model, this requires ~32GB of memory! Use the `torch_dtype="auto"` argument, which generally uses `bfloat16` for models that were trained with it, to reduce your memory usage.
+Transformers load models in full `float32` precision by default, and for an 8B model, this requires ~32GB of memory! Use the `dtype="auto"` argument, which generally uses `bfloat16` for models that were trained with it, to reduce your memory usage.
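
For instance (the checkpoint name is only an example):

```py
from transformers import AutoModelForCausalLM

# dtype="auto" loads weights in the dtype they were saved in (usually bfloat16
# for recent checkpoints) instead of upcasting everything to float32.
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM2-1.7B-Instruct", dtype="auto", device_map="auto")
```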

> [!TIP]
> Refer to the [Quantization](./quantization/overview) docs for more information about the different quantization backends available.
