Commit 57d85a4

Authored by WenjiaoYue, pre-commit-ci[bot], and letonghan
Refine MultimodalQnA Readme (opea-project#2104)
Signed-off-by: WenjiaoYue <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Letong Han <[email protected]>
1 parent ec7132e commit 57d85a4

File tree: 7 files changed (+820, −1180 lines)


MultimodalQnA/README.md

Lines changed: 14 additions & 179 deletions
@@ -1,14 +1,22 @@
# MultimodalQnA Application

-Suppose you possess a set of videos, images, audio files, PDFs, or some combination thereof and wish to perform question-answering to extract insights from these documents. To respond to your questions, the system needs to comprehend a mix of textual, visual, and audio facts drawn from the document contents. The MultimodalQnA framework offers an optimal solution for this purpose.
+Multimodal question answering is the process of extracting insights from documents that contain a mix of text, images, videos, audio, and PDFs. It involves reasoning over both textual and non-textual content to answer user queries.

-`MultimodalQnA` addresses your questions by dynamically fetching the most pertinent multimodal information (e.g. images, transcripts, and captions) from your collection of video, image, audio, and PDF files. For this purpose, MultimodalQnA utilizes [BridgeTower model](https://huggingface.co/BridgeTower/bridgetower-large-itm-mlm-gaudi), a multimodal encoding transformer model which merges visual and textual data into a unified semantic space. During the ingestion phase, the BridgeTower model embeds both visual cues and auditory facts as texts, and those embeddings are then stored in a vector database. When it comes to answering a question, the MultimodalQnA will fetch its most relevant multimodal content from the vector store and feed it into a downstream Large Vision-Language Model (LVM) as input context to generate a response for the user, which can be text or audio.
+The MultimodalQnA framework enables this by leveraging the BridgeTower model, which encodes visual and textual data into a shared semantic space. During ingestion, it processes content and stores embeddings in a vector database. At query time, relevant multimodal segments are retrieved and passed to a vision-language model to generate responses in text or audio form.

-The MultimodalQnA architecture shows below:
+## Table of Contents

+1. [Architecture](#architecture)
+2. [Deployment Options](#deployment-options)
+3. [Monitoring and Tracing](./README_miscellaneous.md)

+## Architecture

+The MultimodalQnA application is an end-to-end workflow designed for multimodal question answering across video, image, audio, and PDF inputs. The architecture is illustrated below:

![architecture](./assets/img/MultimodalQnA.png)

-MultimodalQnA is implemented on top of [GenAIComps](https://github.com/opea-project/GenAIComps), the MultimodalQnA Flow Chart shows below:
+The MultimodalQnA example is implemented using the component-level microservices defined in [GenAIComps](https://github.com/opea-project/GenAIComps); the MultimodalQnA flow chart is shown below:

```mermaid
---
@@ -86,182 +94,9 @@ flowchart LR

This MultimodalQnA use case performs Multimodal-RAG using LangChain, Redis VectorDB and Text Generation Inference on [Intel Gaudi2](https://www.intel.com/content/www/us/en/products/details/processors/ai-accelerators/gaudi-overview.html) and [Intel Xeon Scalable Processors](https://www.intel.com/content/www/us/en/products/details/processors/xeon.html), and we invite contributions from other hardware vendors to expand the example.

-The [Whisper Service](https://github.com/opea-project/GenAIComps/blob/main/comps/asr/src/README.md)
-is used by MultimodalQnA for converting audio queries to text. If a spoken response is requested, the
-[SpeechT5 Service](https://github.com/opea-project/GenAIComps/blob/main/comps/tts/src/README.md) translates the text
-response from the LVM to a speech audio file.

-The Intel Gaudi2 accelerator supports both training and inference for deep learning models in particular for LLMs. Visit [Habana AI products](https://habana.ai/products) for more details.

-In the below, we provide a table that describes for each microservice component in the MultimodalQnA architecture, the default configuration of the open source project, hardware, port, and endpoint.

-<details>
-<summary><b>Gaudi and Xeon default compose.yaml settings</b></summary>

-| MicroService | Open Source Project | HW | Port | Endpoint |
-| ------------ | ----------------------- | ----- | ---- | ----------------------------------------------------------- |
-| Dataprep | Redis, Langchain, TGI | Xeon | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest |
-| Embedding | Langchain | Xeon | 6000 | /v1/embeddings |
-| LVM | Langchain, Transformers | Xeon | 9399 | /v1/lvm |
-| Retriever | Langchain, Redis | Xeon | 7000 | /v1/retrieval |
-| SpeechT5 | Transformers | Xeon | 7055 | /v1/tts |
-| Whisper | Transformers | Xeon | 7066 | /v1/asr |
-| Dataprep | Redis, Langchain, TGI | Gaudi | 6007 | /v1/generate_transcripts, /v1/generate_captions, /v1/ingest |
-| Embedding | Langchain | Gaudi | 6000 | /v1/embeddings |
-| LVM | Langchain, TGI | Gaudi | 9399 | /v1/lvm |
-| Retriever | Langchain, Redis | Gaudi | 7000 | /v1/retrieval |
-| SpeechT5 | Transformers | Gaudi | 7055 | /v1/tts |
-| Whisper | Transformers | Gaudi | 7066 | /v1/asr |

-</details>

-## Required Models

-By default, the embedding and LVM models are set to a default value as listed below:

-| Service | HW | Model |
-| --------- | ----- | ----------------------------------------- |
-| embedding | Xeon | BridgeTower/bridgetower-large-itm-mlm-itc |
-| LVM | Xeon | llava-hf/llava-1.5-7b-hf |
-| SpeechT5 | Xeon | microsoft/speecht5_tts |
-| Whisper | Xeon | openai/whisper-small |
-| embedding | Gaudi | BridgeTower/bridgetower-large-itm-mlm-itc |
-| LVM | Gaudi | llava-hf/llava-v1.6-vicuna-13b-hf |
-| SpeechT5 | Gaudi | microsoft/speecht5_tts |
-| Whisper | Gaudi | openai/whisper-small |

-You can choose other LVM models, such as `llava-hf/llava-1.5-7b-hf ` and `llava-hf/llava-1.5-13b-hf`, as needed.

-## Deploy MultimodalQnA Service

-The MultimodalQnA service can be effortlessly deployed on either Intel Gaudi2 or Intel XEON Scalable Processors.

-Currently we support deploying MultimodalQnA services with docker compose. The [`docker_compose`](docker_compose)
-directory has folders which include `compose.yaml` files for different hardware types:

-```
-📂 docker_compose
-├── 📂 amd
-│   └── 📂 gpu
-│   └── 📂 rocm
-│   ├── 📄 compose.yaml
-│   └── ...
-└── 📂 intel
-├── 📂 cpu
-│   └── 📂 xeon
-│   ├── 📄 compose.yaml
-│   └── ...
-└── 📂 hpu
-└── 📂 gaudi
-├── 📄 compose.yaml
-└── ...
-```

-### Setup Environment Variables

-To set up environment variables for deploying MultimodalQnA services, follow these steps:

-1. Set the required environment variables:

-```bash
-# Example: export host_ip=$(hostname -I | awk '{print $1}')
-export host_ip="External_Public_IP"

-# Append the host_ip to the no_proxy list to allow container communication
-# Example: no_proxy="localhost, 127.0.0.1, 192.168.1.1"
-export no_proxy="${no_proxy},${host_ip}"
-```

-2. If you are in a proxy environment, also set the proxy-related environment variables:

-```bash
-export http_proxy="Your_HTTP_Proxy"
-export https_proxy="Your_HTTPs_Proxy"
-```

-3. Set up other environment variables:

-> Choose **one** command below to set env vars according to your hardware. Otherwise, the port numbers may be set incorrectly.

-```bash
-# on Gaudi
-cd docker_compose/intel/hpu/gaudi
-source ./set_env.sh

-# on Xeon
-cd docker_compose/intel/cpu/xeon
-source ./set_env.sh
-```

-### Deploy MultimodalQnA on Gaudi

-Refer to the [Gaudi Guide](./docker_compose/intel/hpu/gaudi/README.md) if you would like to build docker images from
-source, otherwise images will be pulled from Docker Hub.

-Find the corresponding [compose.yaml](./docker_compose/intel/hpu/gaudi/compose.yaml).

-```bash
-# While still in the docker_compose/intel/hpu/gaudi directory, use docker compose to bring up the services
-docker compose -f compose.yaml up -d
-```

-> Notice: Currently only the **Habana Driver 1.18.x** is supported for Gaudi.

-### Deploy MultimodalQnA on Xeon

-Refer to the [Xeon Guide](./docker_compose/intel/cpu/xeon/README.md) if you would like to build docker images from
-source, otherwise images will be pulled from Docker Hub.

-Find the corresponding [compose.yaml](./docker_compose/intel/cpu/xeon/compose.yaml).

-```bash
-# While still in the docker_compose/intel/cpu/xeon directory, use docker compose to bring up the services
-docker compose -f compose.yaml up -d
-```

-## MultimodalQnA Demo on Gaudi2

-### Multimodal QnA UI

-![MultimodalQnA-ui-screenshot](./assets/img/mmqna-ui.png)

-### Video Ingestion

-![MultimodalQnA-ingest-video-screenshot](./assets/img/video-ingestion.png)

-### Text Query following the ingestion of a Video

-![MultimodalQnA-video-query-screenshot](./assets/img/video-query.png)

-### Image Ingestion

-![MultimodalQnA-ingest-image-screenshot](./assets/img/image-ingestion.png)

-### Text Query following the ingestion of an image

-![MultimodalQnA-video-query-screenshot](./assets/img/image-query-text.png)

-### Text Query following the ingestion of an image using text-to-speech

-![MultimodalQnA-video-query-screenshot](./assets/img/image-query-tts.png)

-### Audio Ingestion

-![MultimodalQnA-audio-ingestion-screenshot](./assets/img/audio-ingestion.png)

-### Text Query following the ingestion of an Audio Podcast

-![MultimodalQnA-audio-query-screenshot](./assets/img/audio-query.png)

-### PDF Ingestion

-![MultimodalQnA-upload-pdf-screenshot](./assets/img/pdf-ingestion.png)

-### Text query following the ingestion of a PDF

-![MultimodalQnA-pdf-query-example-screenshot](./assets/img/pdf-query.png)
+## Deployment Options

-### View, Refresh, and Delete ingested media in the Vector Store
+The table below lists currently available deployment options. They outline in detail the implementation of this example on selected hardware.

![MultimodalQnA-pdf-query-example-screenshot](./assets/img/vector-store.png)

Lines changed: 136 additions & 0 deletions
@@ -0,0 +1,136 @@
# MultimodalQnA Docker Image Build

## Table of Contents

1. [Build MegaService Docker Image](#build-megaservice-docker-image)
2. [Build UI Docker Image](#build-ui-docker-image)
3. [Generate a HuggingFace Access Token](#generate-a-huggingface-access-token)
4. [Troubleshooting](#troubleshooting)
5. [Monitoring OPEA Services with Prometheus and Grafana Dashboard](#monitoring-opea-services-with-prometheus-and-grafana-dashboard)
6. [Tracing with OpenTelemetry and Jaeger](#tracing-with-opentelemetry-and-jaeger)
7. [Demo Screenshots](#demo-screenshots)

## Build MegaService Docker Image

To construct the MegaService of MultimodalQnA, the [GenAIExamples](https://github.com/opea-project/GenAIExamples.git) repository is used. Build the MegaService Docker image with the command below:

```bash
git clone https://github.com/opea-project/GenAIExamples.git
cd GenAIExamples/MultimodalQnA
docker build --no-cache -t opea/multimodalqna:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f Dockerfile .
```

## Build UI Docker Image

Build the frontend Docker image with the command below:

```bash
cd GenAIExamples/MultimodalQnA/ui
docker build -t opea/multimodalqna-ui:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f ./docker/Dockerfile .
```
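
After both builds finish, a quick way to confirm the images are available locally (matching the tags used above):

```bash
# List the MultimodalQnA images built in the previous steps
docker images | grep multimodalqna
```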

## Generate a HuggingFace Access Token

Some HuggingFace resources, such as some models, are only accessible if the developer has an access token. In the absence of a HuggingFace access token, the developer can create one by first creating an account by following the steps provided at [HuggingFace](https://huggingface.co/) and then generating a [user access token](https://huggingface.co/docs/transformers.js/en/guides/private#step-1-generating-a-user-access-token).
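
Once the token is generated, export it in the shell you will use for deployment so it can be passed into the containers. The variable name below is an assumption for illustration; check the `set_env.sh` or `compose.yaml` for your hardware to confirm the exact name this example expects.

```bash
# Hypothetical variable name; confirm against set_env.sh / compose.yaml before deploying
export HF_TOKEN="<your-huggingface-access-token>"
```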

## Troubleshooting

1. If you get errors like "Access Denied", [validate the microservices](https://github.com/opea-project/GenAIExamples/blob/main/MultimodalQnA/docker_compose/intel/cpu/xeon/README.md#validate-microservices) first. A simple example:

   ```bash
   http_proxy=""
   curl http://${host_ip}:8399/generate \
     -X POST \
     -H 'Content-Type: application/json' \
     -d '{
       "prompt": "Describe the image please.",
       "img_b64_str": [
         "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mP8/5+hnoEIwDiqkL4KAcT9GO0U4BxoAAAAAElFTkSuQmCC",
         "iVBORw0KGgoAAAANSUhEUgAAAAoAAAAKCAYAAACNMs+9AAAAFUlEQVR42mNkYPhfz0AEYBxVSF+FAP5FDvcfRYWgAAAAAElFTkSuQmCC"
       ]
     }'
   ```

2. (Docker only) If all microservices work well, check port 8888 on ${host_ip}; the port may already be allocated by another user. If so, modify the port mapping in `compose.yaml` (see the checks below).
3. (Docker only) If you get errors like "The container name is in use", change the container name in `compose.yaml`.
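
Before editing `compose.yaml`, the following quick checks can confirm whether the port is already taken or a container name clashes (a sketch; it assumes `ss` and the Docker CLI are available on the host):

```bash
# Check whether anything is already listening on port 8888 on this host
ss -tlnp | grep ':8888' || echo "port 8888 is free"

# List running containers with their names and published ports to spot clashes
docker ps --format '{{.Names}}\t{{.Ports}}'
```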
## Monitoring OPEA Services with Prometheus and Grafana Dashboard

OPEA microservice deployment can easily be monitored through Grafana dashboards using data collected via Prometheus. Follow the [README](https://github.com/opea-project/GenAIEval/blob/main/evals/benchmark/grafana/README.md) to set up Prometheus and Grafana servers and import dashboards to monitor the OPEA services.

![example dashboards](./assets/img/example_dashboards.png)
![tgi dashboard](./assets/img/tgi_dashboard.png)
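
Before wiring up dashboards, a quick sanity check can confirm that metrics are being exposed; many OPEA microservices publish a Prometheus scrape endpoint. The `/metrics` path and the port below are assumptions for illustration; confirm the actual ports and paths in the monitoring README and your `compose.yaml`.

```bash
# Hypothetical check: replace 8888 with the port of the service you want to scrape
curl -s http://${host_ip}:8888/metrics | head -n 20
```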

## Tracing with OpenTelemetry and Jaeger

> NOTE: This feature is disabled by default. Please use the compose.telemetry.yaml file to enable this feature.

OPEA microservices and [TGI](https://huggingface.co/docs/text-generation-inference/en/index)/[TEI](https://huggingface.co/docs/text-embeddings-inference/en/index) serving can easily be traced through [Jaeger](https://www.jaegertracing.io/) dashboards in conjunction with the [OpenTelemetry](https://opentelemetry.io/) tracing feature. Follow the [README](https://github.com/opea-project/GenAIComps/tree/main/comps/cores/telemetry#tracing) to trace additional functions if needed.

Tracing data is exported to http://{EXTERNAL_IP}:4318/v1/traces via Jaeger. Users can also get the external IP via the command below.

```bash
ip route get 8.8.8.8 | grep -oP 'src \K[^ ]+'
```

Access the Jaeger dashboard UI at http://{EXTERNAL_IP}:16686.
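
To confirm that the OTLP collector endpoint (4318) and the Jaeger UI (16686) referenced above are reachable, a minimal check, assuming `EXTERNAL_IP` is set in your shell (for example, to the address returned by the `ip route` command above):

```bash
# An HTTP status is printed for each endpoint; 000 indicates the port is not reachable
curl -s -o /dev/null -w 'OTLP collector (4318): HTTP %{http_code}\n' http://${EXTERNAL_IP}:4318/v1/traces
curl -s -o /dev/null -w 'Jaeger UI (16686): HTTP %{http_code}\n' http://${EXTERNAL_IP}:16686
```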

For TGI serving on Gaudi, users can see different services such as opea, TEI, and TGI.

![Screenshot from 2024-12-27 11-58-18](https://github.com/user-attachments/assets/6126fa70-e830-4780-bd3f-83cb6eff064e)

Here is a screenshot of a trace for one TGI serving request.

![Screenshot from 2024-12-27 11-26-25](https://github.com/user-attachments/assets/3a7c51c6-f422-41eb-8e82-c3df52cd48b8)

There are also OPEA-related tracings. Users can understand the time breakdown of each service request by looking into each opea:schedule operation.

![image](https://github.com/user-attachments/assets/6137068b-b374-4ff8-b345-993343c0c25f)

There may be asynchronous functions such as `llm/MicroService_asyn_generate`, and users need to check the trace of the asynchronous function in another operation such as opea:llm_generate_stream.

![image](https://github.com/user-attachments/assets/a973d283-198f-4ce2-a7eb-58515b77503e)

## Demo Screenshots

### Multimodal QnA UI

![MultimodalQnA-ui-screenshot](./assets/img/mmqna-ui.png)

### Video Ingestion

![MultimodalQnA-ingest-video-screenshot](./assets/img/video-ingestion.png)

### Text Query following the ingestion of a Video

![MultimodalQnA-video-query-screenshot](./assets/img/video-query.png)

### Image Ingestion

![MultimodalQnA-ingest-image-screenshot](./assets/img/image-ingestion.png)

### Text Query following the ingestion of an Image

![MultimodalQnA-image-query-screenshot](./assets/img/image-query-text.png)

### Text Query following the ingestion of an Image using text-to-speech

![MultimodalQnA-image-query-tts-screenshot](./assets/img/image-query-tts.png)

### Audio Ingestion

![MultimodalQnA-audio-ingestion-screenshot](./assets/img/audio-ingestion.png)

### Text Query following the ingestion of an Audio Podcast

![MultimodalQnA-audio-query-screenshot](./assets/img/audio-query.png)

### PDF Ingestion

![MultimodalQnA-upload-pdf-screenshot](./assets/img/pdf-ingestion.png)

### Text Query following the ingestion of a PDF

![MultimodalQnA-pdf-query-example-screenshot](./assets/img/pdf-query.png)

### View, Refresh, and Delete ingested media in the Vector Store

![MultimodalQnA-vector-store-screenshot](./assets/img/vector-store.png)
(Two binary image assets also changed in this commit: 100 KB and 414 KB; contents not shown.)
