Merged
35 commits
a330569
Adds an endpoint for image ingestion
mhbuehler Oct 14, 2024
ac82cec
Combined image and video endpoint
mhbuehler Oct 21, 2024
83225e6
Add test and update README
mhbuehler Oct 21, 2024
eb46ff9
Merge branch 'main' of github.com:mhbuehler/GenAIComps into melanie/m…
dmsuehir Oct 22, 2024
6324418
fixed variable name for embedding model (#1)
okhleif-10 Oct 22, 2024
3320b48
Fixed test script
mhbuehler Oct 22, 2024
007aba8
Remove redundant function
mhbuehler Oct 23, 2024
7abee3e
get_videos, delete_videos --> get_files, delete_files (#3)
okhleif-10 Oct 23, 2024
313b344
Updates test per review feedback
mhbuehler Oct 24, 2024
0df7c08
Merge branch 'melanie/mm-rag-enhanced' into melanie/combined_image_vi…
mhbuehler Oct 24, 2024
f620193
Fixed test
mhbuehler Oct 24, 2024
89203cf
Merge branch 'melanie/combined_image_video_ingestion' of github.com:m…
mhbuehler Oct 24, 2024
a51f0aa
Merge pull request #2 from mhbuehler/melanie/combined_image_video_ing…
mhbuehler Oct 24, 2024
d9cb3cf
Merge branch 'main' of github.com:mhbuehler/GenAIComps into melanie/m…
dmsuehir Oct 24, 2024
3684a5f
Merge branch 'melanie/mm-rag-enhanced' of github.com:mhbuehler/GenAIC…
dmsuehir Oct 24, 2024
476591b
Add support for audio files multimodal data ingestion (#4)
dmsuehir Oct 28, 2024
689e9d8
Change videos_with_transcripts to ingest_with_text
mhbuehler Oct 24, 2024
01dc2e7
Add image support to video ingestion with transcript functionality
mhbuehler Oct 24, 2024
7769107
Update test and README
mhbuehler Oct 28, 2024
f1028ca
Merge branch 'main' of github.com:mhbuehler/GenAIComps into melanie/m…
dmsuehir Oct 29, 2024
c49f5d7
Updated for review suggestions
mhbuehler Oct 30, 2024
2f9688c
Merge pull request #5 from mhbuehler/melanie/images_and_text
mhbuehler Oct 30, 2024
4b253ee
Add two tests for ingest_with_text
mhbuehler Nov 4, 2024
47c77d5
Merge pull request #6 from mhbuehler/melanie/negative_test
mhbuehler Nov 4, 2024
929aade
LVM TGI Gaudi update for prompts without images (#7)
dmsuehir Nov 4, 2024
56c4d09
Merge branch 'main' of github.com:mhbuehler/GenAIComps into melanie/m…
dmsuehir Nov 4, 2024
ccc4be2
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Nov 5, 2024
808700d
Merge branch 'main' into melanie/mm-rag-enhanced
ashahba Nov 5, 2024
6eea8cc
Merge branch 'main' into melanie/mm-rag-enhanced
ashahba Nov 6, 2024
97f26bf
Change dummy image to be b64 encoded instead of the url (#9)
dmsuehir Nov 6, 2024
8e4f145
Merge branch 'main' of github.com:mhbuehler/GenAIComps into melanie/m…
dmsuehir Nov 6, 2024
a9b4d0f
Updates based on review feedback (#10)
dmsuehir Nov 6, 2024
c2d1ebe
Test fix (#11)
dmsuehir Nov 6, 2024
c817274
Merge branch 'main' into melanie/mm-rag-enhanced
ashahba Nov 7, 2024
e7b43b2
Merge branch 'main' into melanie/mm-rag-enhanced
mhbuehler Nov 7, 2024
70 changes: 49 additions & 21 deletions comps/dataprep/multimodal/redis/langchain/README.md
@@ -1,6 +1,10 @@
# Dataprep Microservice for Multimodal Data with Redis

This `dataprep` microservice accepts videos (mp4 files) and their transcripts (optional) from the user and ingests them into Redis vectorstore.
This `dataprep` microservice accepts the following from the user and ingests them into a Redis vector store:

- Videos (mp4 files) and their transcripts (optional)
- Images (gif, jpg, jpeg, and png files) and their captions (optional)
- Audio (wav files)

## 🚀1. Start Microservice with Python (Option 1)

@@ -107,18 +111,18 @@ docker container logs -f dataprep-multimodal-redis

## 🚀4. Consume Microservice

Once this dataprep microservice is started, user can use the below commands to invoke the microservice to convert videos and their transcripts (optional) to embeddings and save to the Redis vector store.
Once this dataprep microservice is started, users can use the commands below to invoke the microservice, which converts images and videos (and their optional transcripts) to embeddings and saves them to the Redis vector store.

This mircroservice has provided 3 different ways for users to ingest videos into Redis vector store corresponding to the 3 use cases.
This microservice provides 3 different ways for users to ingest files into the Redis vector store, corresponding to the 3 use cases below.

### 4.1 Consume _videos_with_transcripts_ API
### 4.1 Consume _ingest_with_text_ API

**Use case:** This API is used when a transcript file (under `.vtt` format) is available for each video.
**Use case:** This API is used when videos are accompanied by transcript files (`.vtt` format) or images are accompanied by text caption files (`.txt` format).

**Important notes:**

- Make sure the file paths after `files=@` are correct.
- Every transcript file's name must be identical with its corresponding video file's name (except their extension .vtt and .mp4). For example, `video1.mp4` and `video1.vtt`. Otherwise, if `video1.vtt` is not included correctly in this API call, this microservice will return error `No captions file video1.vtt found for video1.mp4`.
- Every transcript or caption file's name must be identical to its corresponding video or image file's name (except the extension: .vtt pairs with .mp4, and .txt pairs with .jpg, .jpeg, .png, or .gif). For example, `video1.mp4` and `video1.vtt`. Otherwise, if `video1.vtt` is not included correctly in the API call, the microservice will return the error `No captions file video1.vtt found for video1.mp4`.

#### Single video-transcript pair upload

@@ -127,10 +131,20 @@ curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./video1.mp4" \
-F "files=@./video1.vtt" \
http://localhost:6007/v1/videos_with_transcripts
http://localhost:6007/v1/ingest_with_text
```

#### Single image-caption pair upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./image.jpg" \
-F "files=@./image.txt" \
http://localhost:6007/v1/ingest_with_text
```

#### Multiple video-transcript pair upload
#### Multiple file pair upload

```bash
curl -X POST \
@@ -139,16 +153,20 @@ curl -X POST \
-F "files=@./video1.vtt" \
-F "files=@./video2.mp4" \
-F "files=@./video2.vtt" \
http://localhost:6007/v1/videos_with_transcripts
-F "files=@./image1.png" \
-F "files=@./image1.txt" \
-F "files=@./image2.jpg" \
-F "files=@./image2.txt" \
http://localhost:6007/v1/ingest_with_text
```

### 4.2 Consume _generate_transcripts_ API

**Use case:** This API should be used when a video has meaningful audio or recognizable speech but its transcript file is not available.
**Use case:** This API should be used when a video has meaningful audio or recognizable speech but no transcript file is available, or when ingesting audio files that contain speech.

In this use case, this microservice will use [`whisper`](https://openai.com/index/whisper/) model to generate the `.vtt` transcript for the video.
In this use case, the microservice will use the [`whisper`](https://openai.com/index/whisper/) model to generate `.vtt` transcripts for the video or audio files.
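For reference, here is a minimal sketch of the `.vtt` format involved. The file name and caption text are made up; the parsing call matches the `webvtt.read` usage in `multimodal_utils.py`:

```python
import webvtt

# A minimal, made-up WebVTT transcript (whisper output is saved in this format).
sample = """WEBVTT

00:00:00.000 --> 00:00:04.000
A person places a board on the workbench.

00:00:04.000 --> 00:00:09.500
They tighten the screws on the table legs.
"""

with open("video1.vtt", "w") as f:  # hypothetical file name
    f.write(sample)

# Same parsing call the dataprep utilities use.
for caption in webvtt.read("video1.vtt"):
    print(caption.start, caption.end, caption.text)
```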

#### Single video upload
#### Single file upload

```bash
curl -X POST \
@@ -157,21 +175,22 @@ curl -X POST \
http://localhost:6007/v1/generate_transcripts
```

#### Multiple video upload
#### Multiple file upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./video1.mp4" \
-F "files=@./video2.mp4" \
-F "files=@./audio1.wav" \
http://localhost:6007/v1/generate_transcripts
```

### 4.3 Consume _generate_captions_ API

**Use case:** This API should be used when a video does not have meaningful audio or does not have audio.
**Use case:** This API should be used when uploading an image, or when uploading a video that has no audio or whose audio is not meaningful.

In this use case, transcript either does not provide any meaningful information or does not exist. Thus, it is preferred to leverage a LVM microservice to summarize the video frames.
In this use case, there is no meaningful language transcription. Thus, it is preferred to leverage an LVM microservice to summarize the frames.

- Single video upload

@@ -192,22 +211,31 @@ curl -X POST \
http://localhost:6007/v1/generate_captions
```

### 4.4 Consume get_videos API
- Single image upload

```bash
curl -X POST \
-H "Content-Type: multipart/form-data" \
-F "files=@./image.jpg" \
http://localhost:6007/v1/generate_captions
```

### 4.4 Consume get_files API

To get names of uploaded videos, use the following command.
To get the names of uploaded files, use the following command.

```bash
curl -X POST \
-H "Content-Type: application/json" \
http://localhost:6007/v1/dataprep/get_videos
http://localhost:6007/v1/dataprep/get_files
```

### 4.5 Consume delete_videos API
### 4.5 Consume delete_files API

To delete uploaded videos and clear the database, use the following command.
To delete uploaded files and clear the database, use the following command.

```bash
curl -X POST \
-H "Content-Type: application/json" \
http://localhost:6007/v1/dataprep/delete_videos
http://localhost:6007/v1/dataprep/delete_files
```
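Putting the renamed endpoints together, here is a hedged end-to-end sketch in Python using `requests`; it assumes the service listens on localhost:6007, mirroring the curl calls above:

```python
import requests

BASE = "http://localhost:6007/v1"  # assumed host/port, as in the curl examples

# Ingest a video together with its transcript (multipart form upload).
with open("video1.mp4", "rb") as v, open("video1.vtt", "rb") as t:
    files = [("files", ("video1.mp4", v)), ("files", ("video1.vtt", t))]
    resp = requests.post(f"{BASE}/ingest_with_text", files=files)
print(resp.status_code, resp.text)

# List the names of everything ingested, then clear the database.
print(requests.post(f"{BASE}/dataprep/get_files").text)
print(requests.post(f"{BASE}/dataprep/delete_files").text)
```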
2 changes: 1 addition & 1 deletion comps/dataprep/multimodal/redis/langchain/config.py
@@ -4,7 +4,7 @@
import os

# Models
EMBED_MODEL = os.getenv("EMBED_MODEL", "BridgeTower/bridgetower-large-itm-mlm-itc")
EMBED_MODEL = os.getenv("EMBEDDING_MODEL_ID", "BridgeTower/bridgetower-large-itm-mlm-itc")
WHISPER_MODEL = os.getenv("WHISPER_MODEL", "small")

# Redis Connection Information
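A short sketch of what the rename above means at runtime; the environment value shown is just the model's own default, and the lookup mirrors the `os.getenv` line in `config.py`:

```python
import os

# The service now reads EMBEDDING_MODEL_ID (previously EMBED_MODEL) and falls
# back to the BridgeTower default when the variable is unset.
os.environ["EMBEDDING_MODEL_ID"] = "BridgeTower/bridgetower-large-itm-mlm-itc"
EMBED_MODEL = os.getenv("EMBEDDING_MODEL_ID", "BridgeTower/bridgetower-large-itm-mlm-itc")
print(EMBED_MODEL)

# Caveat: a deployment that still exports only EMBED_MODEL now silently gets
# the default, since the old variable name is no longer read.
```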
71 changes: 61 additions & 10 deletions comps/dataprep/multimodal/redis/langchain/multimodal_utils.py
@@ -39,8 +39,8 @@ def clear_upload_folder(upload_path):
os.rmdir(dir_path)


def generate_video_id():
"""Generates a unique identifier for a video file."""
def generate_id():
"""Generates a unique identifier for a file."""
return str(uuid.uuid4())
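For illustration, the renamed helper simply wraps `uuid.uuid4`, so a generated file ID looks like this (the output is random on each call):

```python
import uuid

print(str(uuid.uuid4()))  # e.g. '9f1b2c3d-4e5f-4a6b-8c7d-0e1f2a3b4c5d'
```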


@@ -128,8 +128,49 @@ def convert_img_to_base64(image):
return encoded_string.decode()


def generate_annotations_from_transcript(file_id: str, file_path: str, vtt_path: str, output_dir: str):
"""Generates an annotations.json from the transcript file."""

# Set up location to store frames and annotations
os.makedirs(output_dir, exist_ok=True)

# read captions file
captions = webvtt.read(vtt_path)

annotations = []
for idx, caption in enumerate(captions):
start_time = str2time(caption.start)
end_time = str2time(caption.end)
mid_time = (end_time + start_time) / 2
mid_time_ms = mid_time * 1000
text = caption.text.replace("\n", " ")

# Create annotations for frame from transcripts with an empty image
annotations.append(
{
"video_id": file_id,
"video_name": os.path.basename(file_path),
"b64_img_str": "",
"caption": text,
"time": mid_time_ms,
"frame_no": 0,
"sub_video_id": idx,
}
)

# Save transcript annotations as json file for further processing
with open(os.path.join(output_dir, "annotations.json"), "w") as f:
json.dump(annotations, f)

return annotations
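A hedged usage sketch for the new helper above. The import path assumes the repo's `comps` package is importable; file names and directory layout are made up:

```python
import json

# Hypothetical import path; adjust to however the package is installed.
from comps.dataprep.multimodal.redis.langchain.multimodal_utils import (
    generate_annotations_from_transcript,
    generate_id,
)

file_id = generate_id()
annotations = generate_annotations_from_transcript(
    file_id=file_id,
    file_path="video1.mp4",           # made-up media file
    vtt_path="video1.vtt",            # its transcript (see the WebVTT sketch above)
    output_dir=f"uploads/{file_id}",  # made-up layout; annotations.json lands here
)

# Each entry carries an empty b64_img_str, so downstream processing treats it
# as caption text plus timing metadata, with no frame image attached.
print(json.dumps(annotations[0], indent=2))
```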


def extract_frames_and_annotations_from_transcripts(video_id: str, video_path: str, vtt_path: str, output_dir: str):
"""Extract frames (.png) and annotations (.json) from video file (.mp4) and captions file (.vtt)"""
"""Extract frames (.png) and annotations (.json) from media-text file pairs.

File pairs can be a video file (.mp4) and transcript file (.vtt), or an image file (.png, .jpg, .jpeg, .gif) and caption file (.txt)
"""
# Set up location to store frames and annotations
os.makedirs(output_dir, exist_ok=True)
os.makedirs(os.path.join(output_dir, "frames"), exist_ok=True)
@@ -139,18 +180,28 @@ def extract_frames_and_annotations_from_transcripts(video_id: str, video_path: s
fps = vidcap.get(cv2.CAP_PROP_FPS)

# read captions file
captions = webvtt.read(vtt_path)
if os.path.splitext(vtt_path)[-1] == ".vtt":
captions = webvtt.read(vtt_path)
else:
with open(vtt_path, "r") as f:
captions = f.read()

annotations = []
for idx, caption in enumerate(captions):
start_time = str2time(caption.start)
end_time = str2time(caption.end)
if os.path.splitext(vtt_path)[-1] == ".vtt":
start_time = str2time(caption.start)
end_time = str2time(caption.end)

mid_time = (end_time + start_time) / 2
text = caption.text.replace("\n", " ")
mid_time = (end_time + start_time) / 2
text = caption.text.replace("\n", " ")

frame_no = time_to_frame(mid_time, fps)
mid_time_ms = mid_time * 1000
else:
frame_no = 0
mid_time_ms = 0
text = captions.replace("\n", " ")

frame_no = time_to_frame(mid_time, fps)
mid_time_ms = mid_time * 1000
vidcap.set(cv2.CAP_PROP_POS_MSEC, mid_time_ms)
success, frame = vidcap.read()

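To make the timing arithmetic in the loop above concrete, here is a small self-contained sketch; `str2time` and `time_to_frame` are simplified stand-ins for the module's own helpers, not the actual implementations:

```python
def str2time(ts: str) -> float:
    """Stand-in: parse 'HH:MM:SS.mmm' into seconds."""
    h, m, s = ts.split(":")
    return int(h) * 3600 + int(m) * 60 + float(s)


def time_to_frame(seconds: float, fps: float) -> int:
    """Stand-in: map a timestamp to a frame index."""
    return int(seconds * fps)


fps = 30.0
start = str2time("00:00:10.000")
end = str2time("00:00:14.000")
mid_time = (end + start) / 2          # 12.0 s, the caption's midpoint
print(time_to_frame(mid_time, fps))   # 360 -> frame sampled for this caption
print(mid_time * 1000)                # 12000.0 ms, fed to cv2.CAP_PROP_POS_MSEC

# For a .txt caption (image ingestion), the loop instead pins frame_no = 0 and
# mid_time_ms = 0, since there is no timeline to sample from.
```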