@mhbuehler mhbuehler commented Nov 5, 2024

Description

This PR adds the following new features as specified in "Phase 1" of the MultimodalQnA Image & Audio Support RFC (listed under Issues below). The affected components include dataprep-multimodal-redis, embedding-multimodal-bridgetower, and lvm-llava. The related GenAIExamples PR is opea-project/GenAIExamples#1071; this GenAIComps PR needs to be merged first.

Data prep and ingestion enhancements:

  • Accept image only
  • Accept image and text
  • Accept speech audio only
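
To make the three ingestion cases above concrete, here is a minimal sketch of how uploads could be grouped and routed. It is not the code from this PR: the extension sets, the function names `classify_upload` and `plan_ingestion`, and the idea of pairing an image with a text file by shared file stem are all assumptions made for illustration.

```python
from pathlib import Path
from typing import Dict, List, Set

# Hypothetical extension sets; the real dataprep service may accept others.
IMAGE_EXTS = {".png", ".jpg", ".jpeg", ".gif"}
AUDIO_EXTS = {".wav"}
TEXT_EXTS = {".txt"}


def classify_upload(filename: str) -> str:
    """Return 'image', 'audio', or 'text' based on the file extension."""
    ext = Path(filename).suffix.lower()
    if ext in IMAGE_EXTS:
        return "image"
    if ext in AUDIO_EXTS:
        return "audio"
    if ext in TEXT_EXTS:
        return "text"
    raise ValueError(f"Unsupported file type: {filename}")


def plan_ingestion(filenames: List[str]) -> Dict[str, str]:
    """Group uploads by file stem and pick an ingestion mode for each group."""
    kinds_by_stem: Dict[str, Set[str]] = {}
    for name in filenames:
        kinds_by_stem.setdefault(Path(name).stem, set()).add(classify_upload(name))

    plan: Dict[str, str] = {}
    for stem, kinds in kinds_by_stem.items():
        if kinds == {"image", "text"}:
            plan[stem] = "image_with_text"  # image plus user-supplied text
        elif kinds == {"image"}:
            plan[stem] = "image_only"       # a caption would have to be generated
        elif kinds == {"audio"}:
            plan[stem] = "audio_only"       # speech would have to be transcribed
        else:
            plan[stem] = "unsupported"
    return plan


if __name__ == "__main__":
    print(plan_ingestion(["cat.png", "cat.txt", "dog.jpg", "talk.wav"]))
```

Grouping by stem is only one possible pairing rule; an explicit mapping supplied by the client would work just as well.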

Other enhancements:

  • Allow the user to choose the embedding model and LVM when starting the services
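
One common way to let the user pick models at startup is to read them from environment variables with defaults. The sketch below assumes that approach; the variable names `EMBEDDING_MODEL_ID` and `LVM_MODEL_ID`, the default model IDs, and the `ModelConfig` helper are assumptions for illustration, so check the component READMEs for the names the services actually use.

```python
import os
from dataclasses import dataclass
from typing import Mapping, Optional

# Assumed defaults, for illustration only.
DEFAULT_EMBEDDING_MODEL = "BridgeTower/bridgetower-large-itm-mlm-itc"
DEFAULT_LVM_MODEL = "llava-hf/llava-1.5-7b-hf"


@dataclass(frozen=True)
class ModelConfig:
    embedding_model_id: str
    lvm_model_id: str


def load_model_config(env: Optional[Mapping[str, str]] = None) -> ModelConfig:
    """Read model choices from the environment, falling back to the defaults."""
    env = os.environ if env is None else env
    return ModelConfig(
        embedding_model_id=env.get("EMBEDDING_MODEL_ID", DEFAULT_EMBEDDING_MODEL),
        lvm_model_id=env.get("LVM_MODEL_ID", DEFAULT_LVM_MODEL),
    )


if __name__ == "__main__":
    # Passing a dict instead of os.environ keeps the example deterministic.
    print(load_model_config({"LVM_MODEL_ID": "my-org/my-lvm"}))
```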

Note that the planned query enhancement "Accept speech audio only" has been moved to Phase 2 and a PR for that phase will be submitted for the next release.

Issues

MultimodalQnA Image & Audio Support RFC

Type of change

  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)
  • Others (enhancement, documentation, validation, etc.)

Dependencies

No new dependencies

Tests

Updated the individual microservices' test scripts and the GenAIExamples MultimodalQnA test scripts, and manually tested the UI and the documented curl commands.

mhbuehler and others added 26 commits October 14, 2024 16:33
…hbuehler/GenAIComps into melanie/combined_image_video_ingestion
* Add support for audio files multimodal data ingestion

Signed-off-by: dmsuehir <[email protected]>

* Update function name

Signed-off-by: dmsuehir <[email protected]>

---------

Signed-off-by: dmsuehir <[email protected]>
Add two tests for ingest_with_text
* LVM Gaudi TGI update for prompts without images

Signed-off-by: dmsuehir <[email protected]>

* Wording

Signed-off-by: dmsuehir <[email protected]>

* Add a test

Signed-off-by: dmsuehir <[email protected]>

---------

Signed-off-by: dmsuehir <[email protected]>
@ashahba ashahba added the r1.1 label Nov 5, 2024
@ashahba ashahba added this to the v1.1 milestone Nov 5, 2024
@ashahba ashahba left a comment

Thanks @mhbuehler and @dmsuehir for this PR!
Only a few minor comments.

@ashahba ashahba left a comment

LGTM!

@lvliang-intel lvliang-intel requested a review from Spycsh November 8, 2024 01:53

Spycsh commented Nov 8, 2024

I notice that you use the LVM to generate a caption for each second of the video. Won't that take too long? For example, if I have a 1-minute video and key_frame_per_second is set to 1, I would get 60 inferences through the LVM with the prompt "Provide a short description for this scene", right? This is actually just an image captioning task, which some smaller image-captioning models can solve extremely fast. Maybe implement that as an improvement if the dataprep process turns out to be slow?

Others look good to me now.
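
The arithmetic behind the concern above can be written out as a small sketch. The function names and the per-inference latency are hypothetical, and rounding a fractional frame count up is an assumption; only `key_frame_per_second` and the 1-minute example come from the comment.

```python
import math


def caption_inference_count(duration_s: float, key_frame_per_second: float) -> int:
    """Number of LVM captioning calls: one per extracted key frame."""
    # Assumption: a fractional frame count is rounded up.
    return math.ceil(duration_s * key_frame_per_second)


def estimated_caption_time_s(
    duration_s: float, key_frame_per_second: float, seconds_per_inference: float
) -> float:
    """Total captioning time if the calls run one after another."""
    return caption_inference_count(duration_s, key_frame_per_second) * seconds_per_inference


if __name__ == "__main__":
    # The example from the comment: a 1-minute video at 1 key frame per second.
    print(caption_inference_count(60, 1))
    # seconds_per_inference=2.0 is a made-up latency, not a measurement.
    print(estimated_caption_time_s(60, 1, seconds_per_inference=2.0))
```

Because the total scales linearly with both video length and frame rate, a faster captioning model or a lower key-frame rate reduces ingestion time proportionally.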

@Spycsh Spycsh merged commit 29ef642 into opea-project:main Nov 8, 2024
@joshuayao joshuayao linked an issue Nov 8, 2024 that may be closed by this pull request
madison-evans pushed a commit to SAPD-Intel/GenAIComps that referenced this pull request May 12, 2025
* Adds an endpoint for image ingestion

Signed-off-by: Melanie Buehler <[email protected]>

* Combined image and video endpoint

Signed-off-by: Melanie Buehler <[email protected]>

* Add test and update README

Signed-off-by: Melanie Buehler <[email protected]>

* fixed variable name for embedding model (opea-project#1)

Signed-off-by: okhleif-IL <[email protected]>

* Fixed test script

Signed-off-by: Melanie Buehler <[email protected]>

* Remove redundant function

Signed-off-by: Melanie Buehler <[email protected]>

* get_videos, delete_videos --> get_files, delete_files (opea-project#3)

Signed-off-by: okhleif-IL <[email protected]>

* Updates test per review feedback

Signed-off-by: Melanie Buehler <[email protected]>

* Fixed test

Signed-off-by: Melanie Buehler <[email protected]>

* Add support for audio files multimodal data ingestion (opea-project#4)

* Add support for audio files multimodal data ingestion

Signed-off-by: dmsuehir <[email protected]>

* Update function name

Signed-off-by: dmsuehir <[email protected]>

---------

Signed-off-by: dmsuehir <[email protected]>

* Change videos_with_transcripts to ingest_with_text

Signed-off-by: Melanie Buehler <[email protected]>

* Add image support to video ingestion with transcript functionality

Signed-off-by: Melanie Buehler <[email protected]>

* Update test and README

Signed-off-by: Melanie Buehler <[email protected]>

* Updated for review suggestions

Signed-off-by: Melanie Buehler <[email protected]>

* Add two tests for ingest_with_text

Signed-off-by: Melanie Buehler <[email protected]>

* LVM TGI Gaudi update for prompts without images (opea-project#7)

* LVM Gaudi TGI update for prompts without images

Signed-off-by: dmsuehir <[email protected]>

* Wording

Signed-off-by: dmsuehir <[email protected]>

* Add a test

Signed-off-by: dmsuehir <[email protected]>

---------

Signed-off-by: dmsuehir <[email protected]>

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Change dummy image to be b64 encoded instead of the url (opea-project#9)

Signed-off-by: dmsuehir <[email protected]>

* Updates based on review feedback (opea-project#10)

Signed-off-by: dmsuehir <[email protected]>

* Test fix (opea-project#11)

Signed-off-by: dmsuehir <[email protected]>

---------

Signed-off-by: Melanie Buehler <[email protected]>
Signed-off-by: okhleif-IL <[email protected]>
Signed-off-by: dmsuehir <[email protected]>
Co-authored-by: dmsuehir <[email protected]>
Co-authored-by: Omar Khleif <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Abolfazl Shahbazi <[email protected]>
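
The commit "Change dummy image to be b64 encoded instead of the url" in the message above can be illustrated with a standard-library sketch that builds a tiny placeholder PNG in memory and base64-encodes it, so no URL has to be fetched. This is not the code from the PR: the helper names, the 1x1 white image, and the assumption that the placeholder is used for prompts without images are all illustrative.

```python
import base64
import struct
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def _png_chunk(chunk_type: bytes, data: bytes) -> bytes:
    """Build one PNG chunk: length, type, data, CRC over type + data."""
    crc = zlib.crc32(chunk_type + data) & 0xFFFFFFFF
    return struct.pack(">I", len(data)) + chunk_type + data + struct.pack(">I", crc)


def make_dummy_png(width: int = 1, height: int = 1, rgb=(255, 255, 255)) -> bytes:
    """Return the bytes of a solid-color 8-bit RGB PNG."""
    # IHDR: width, height, bit depth 8, color type 2 (RGB), default methods.
    ihdr = struct.pack(">IIBBBBB", width, height, 8, 2, 0, 0, 0)
    # Each scanline starts with filter byte 0, followed by the RGB pixels.
    row = b"\x00" + bytes(rgb) * width
    idat = zlib.compress(row * height)
    return (
        PNG_SIGNATURE
        + _png_chunk(b"IHDR", ihdr)
        + _png_chunk(b"IDAT", idat)
        + _png_chunk(b"IEND", b"")
    )


def dummy_image_b64() -> str:
    """Base64 string that can be embedded directly in a JSON request body."""
    return base64.b64encode(make_dummy_png()).decode("ascii")


if __name__ == "__main__":
    print(dummy_image_b64())
```

Embedding the image bytes this way removes the dependency on network access to an external URL at test or inference time.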

Development

Successfully merging this pull request may close these issues.

Image and Audio Support for MultimodalityQnA

6 participants