Fix qwen2 audio
codingl2k1 committed Jan 25, 2025
1 parent c469846 commit e4d9778
Showing 7 changed files with 146 additions and 82 deletions.
2 changes: 1 addition & 1 deletion doc/source/models/builtin/llm/index.rst
@@ -437,7 +437,7 @@ The following is a list of built-in LLM in Xinference:
- Qwen1.5-MoE is a transformer-based MoE decoder-only language model pretrained on a large amount of data.

* - :ref:`qwen2-audio <models_llm_qwen2-audio>`
- chat, audio
- generate, audio
- 32768
- Qwen2-Audio: A large-scale audio-language model which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.

2 changes: 1 addition & 1 deletion doc/source/models/builtin/llm/qwen2-audio.rst
@@ -7,7 +7,7 @@ qwen2-audio
- **Context Length:** 32768
- **Model Name:** qwen2-audio
- **Languages:** en, zh
- **Abilities:** chat, audio
- **Abilities:** generate, audio
- **Description:** Qwen2-Audio: A large-scale audio-language model which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.

Specifications
72 changes: 67 additions & 5 deletions doc/source/models/model_abilities/multimodal.rst
@@ -1,13 +1,13 @@
.. _multimodal:

=====================
Vision
Multimodal
=====================

Learn how to process images and audio with LLMs.


Introduction
Vision
============

With the ``vision`` ability you can have your model take in images and answer questions about them.
@@ -37,13 +37,13 @@ The ``vision`` ability is supported with the following models in Xinference:


Quickstart
====================
----------------------

Images are made available to the model in two main ways: by passing a link to the image or by passing the
base64 encoded image directly in the request.

Example using OpenAI Client
-------------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python
@@ -74,7 +74,7 @@ Example using OpenAI Client
Uploading base 64 encoded images
------------------------------------
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

.. code-block:: python
@@ -125,4 +125,66 @@ You can find more examples of ``vision`` ability in the tutorial notebook:

Learn the vision ability from an example using qwen-vl-chat

Audio
============

With the ``audio`` ability you can have your model take in audio and perform audio analysis or give direct
textual responses to speech instructions.
Within Xinference, this means that certain models can process audio inputs when conducting
dialogues via the Chat API.

Supported models
----------------------

The ``audio`` ability is supported with the following models in Xinference:

* :ref:`qwen2-audio-instruct <models_llm_qwen2-audio-instruct>`

Quickstart
----------------------

Audio is made available to the model by passing a URL to the audio file directly in the chat request.


Chat with audio
~~~~~~~~~~~~~~~

Each audio segment is passed as a content item with type ``audio`` and an ``audio_url`` field pointing to the audio file:

.. code-block:: python

    import openai

    client = openai.Client(
        api_key="cannot be empty",
        base_url=f"http://<XINFERENCE_HOST>:<XINFERENCE_PORT>/v1"
    )
    response = client.chat.completions.create(
        model="<MODEL_UID>",
        messages=[
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio",
                        "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3",
                    },
                    {"type": "text", "text": "What's that sound?"},
                ],
            }
        ],
    )
    print(response.choices[0])
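
Multi-turn conversations work the same way: assistant replies and further user turns that mix ``audio`` and ``text`` segments can be appended to the ``messages`` list.
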
74 changes: 0 additions & 74 deletions xinference/core/tests/test_restful_api.py
@@ -1332,77 +1332,3 @@ def test_launch_model_by_version(setup):
    # delete again
    url = f"{endpoint}/v1/models/test_qwen15"
    requests.delete(url)


@pytest.mark.skip(reason="Cost too many resources.")
def test_restful_api_for_qwen_audio(setup):
    model_name = "qwen2-audio-instruct"

    endpoint, _ = setup
    url = f"{endpoint}/v1/models"

    # list
    response = requests.get(url)
    response_data = response.json()
    assert len(response_data["data"]) == 0

    # launch
    payload = {
        "model_uid": "test_audio",
        "model_name": model_name,
        "model_engine": "transformers",
        "model_size_in_billions": 7,
        "model_format": "pytorch",
        "quantization": "none",
    }

    response = requests.post(url, json=payload)
    response_data = response.json()
    model_uid_res = response_data["model_uid"]
    assert model_uid_res == "test_audio"

    response = requests.get(url)
    response_data = response.json()
    assert len(response_data["data"]) == 1

    url = f"{endpoint}/v1/chat/completions"
    payload = {
        "model": model_uid_res,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio",
                        "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3",
                    },
                    {"type": "text", "text": "What's that sound?"},
                ],
            },
            {"role": "assistant", "content": "It is the sound of glass shattering."},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What can you do when you hear that?"},
                ],
            },
            {
                "role": "assistant",
                "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property.",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio",
                        "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac",
                    },
                    {"type": "text", "text": "What does the person say?"},
                ],
            },
        ],
    }
    response = requests.post(url, json=payload)
    completion = response.json()
    assert len(completion["choices"][0]["message"]) > 0
2 changes: 1 addition & 1 deletion xinference/model/llm/llm_family.json
@@ -7212,7 +7212,7 @@
"zh"
],
"model_ability":[
"chat",
"generate",
"audio"
],
"model_description":"Qwen2-Audio: A large-scale audio-language model which is capable of accepting various audio signal inputs and performing audio analysis or direct textual responses with regard to speech instructions.",
74 changes: 74 additions & 0 deletions xinference/model/llm/tests/test_multimodal.py
@@ -318,3 +318,77 @@ def test_restful_api_for_deepseek_vl(setup, model_format, quantization):
        ],
    )
    assert any(count in completion.choices[0].message.content for count in ["两条", "四条"])


@pytest.mark.skip(reason="Cost too many resources.")
def test_restful_api_for_qwen_audio(setup):
    model_name = "qwen2-audio-instruct"

    endpoint, _ = setup
    url = f"{endpoint}/v1/models"

    # list
    response = requests.get(url)
    response_data = response.json()
    assert len(response_data["data"]) == 0

    # launch
    payload = {
        "model_uid": "test_audio",
        "model_name": model_name,
        "model_engine": "transformers",
        "model_size_in_billions": 7,
        "model_format": "pytorch",
        "quantization": "none",
    }

    response = requests.post(url, json=payload)
    response_data = response.json()
    model_uid_res = response_data["model_uid"]
    assert model_uid_res == "test_audio"

    response = requests.get(url)
    response_data = response.json()
    assert len(response_data["data"]) == 1

    url = f"{endpoint}/v1/chat/completions"
    payload = {
        "model": model_uid_res,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio",
                        "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/glass-breaking-151256.mp3",
                    },
                    {"type": "text", "text": "What's that sound?"},
                ],
            },
            {"role": "assistant", "content": "It is the sound of glass shattering."},
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": "What can you do when you hear that?"},
                ],
            },
            {
                "role": "assistant",
                "content": "Stay alert and cautious, and check if anyone is hurt or if there is any damage to property.",
            },
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio",
                        "audio_url": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen2-Audio/audio/1272-128104-0000.flac",
                    },
                    {"type": "text", "text": "What does the person say?"},
                ],
            },
        ],
    }
    response = requests.post(url, json=payload)
    completion = response.json()
    assert len(completion["choices"][0]["message"]) > 0
2 changes: 2 additions & 0 deletions xinference/model/llm/transformers/qwen2_audio.py
@@ -105,6 +105,8 @@ def chat(
        inputs = self._processor(
            text=text, audios=audios, return_tensors="pt", padding=True
        )
        # Make sure that the inputs and the model are on the same device.
        inputs.data = {k: v.to(self._device) for k, v in inputs.data.items()}
        inputs.input_ids = inputs.input_ids.to(self._device)
        generate_config = generate_config if generate_config else {}
        stream = generate_config.get("stream", False) if generate_config else False
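
For context on this fix: the processor returns a batch containing several tensors, not just input_ids (the audio feature tensors and their attention masks are part of it), so moving only input_ids leaves the rest on CPU when the model runs on a GPU. The added line moves every tensor in inputs.data to self._device. Below is a minimal, self-contained sketch of the same pattern; the tensor names and shapes are illustrative, not taken from this commit:

    import torch

    # Stand-in for the processor output: a dict of tensors created on CPU.
    # (Key names are illustrative; Qwen2-Audio's processor also emits audio features.)
    inputs = {
        "input_ids": torch.ones(1, 8, dtype=torch.long),
        "attention_mask": torch.ones(1, 8, dtype=torch.long),
        "input_features": torch.randn(1, 128, 3000),
    }

    device = "cuda" if torch.cuda.is_available() else "cpu"

    # Old behaviour: only input_ids is moved, so devices end up mixed on GPU machines.
    inputs["input_ids"] = inputs["input_ids"].to(device)
    print({k: v.device for k, v in inputs.items()})

    # The fix: move every tensor in the batch, as the new line in chat() does.
    inputs = {k: v.to(device) for k, v in inputs.items()}
    print({k: v.device for k, v in inputs.items()})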