- We'll create a question-answering (QA) system that understands both text and images
- We'll build this system using Vertex AI
- Focus on Fundamentals: We will start with the essential design pattern of Retrieval Augmented Generation (RAG), a way to find relevant information and use it to answer questions.
- Work with Text and Images: We will extend RAG to handle both the text and the images found in PDF documents.
- Use Vertex AI: We will use the Vertex AI Embeddings API and the Vertex AI Gemini API.
By the end, we will have a solid foundation in building multimodal QA systems.
- Gemini is a family of generative AI models designed for multimodal use cases.
- The Vertex AI Gemini API gives us access to the following models (a minimal usage sketch follows this list):
- Gemini 1.0 Pro Model
- Gemini 1.0 Pro Vision Model
- Gemini 1.5 Pro Model
- Gemini 1.5 Flash Model
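
For instance, a Gemini model can be loaded through the Vertex AI SDK for Python. Here is a minimal sketch, assuming an existing Google Cloud project; the project ID and region below are placeholders:

```python
import vertexai
from vertexai.generative_models import GenerativeModel

# Placeholder project and region; replace with your own values.
vertexai.init(project="your-project-id", location="us-central1")

# Any of the Gemini models listed above can be named here.
model = GenerativeModel("gemini-1.5-flash")
response = model.generate_content("Summarize Retrieval Augmented Generation in one sentence.")
print(response.text)
```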
Multimodal RAG (mRAG) offers several advantages over text-based RAG:
- Enhanced knowledge access: mRAG can access and process both textual and visual information, providing a more comprehensive knowledge base for the LLM.
- Improved reasoning capabilities: By incorporating visual cues, mRAG can make better-informed inferences across different data modalities (see the sketch after this list).
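
To illustrate the second point, the sketch below (illustrative, not the notebook's code) passes an image and a text question to Gemini in a single request. The Cloud Storage URI is hypothetical, and it assumes `vertexai.init()` has been called as in the earlier sketch:

```python
from vertexai.generative_models import GenerativeModel, Part

model = GenerativeModel("gemini-1.5-flash")
response = model.generate_content([
    # Hypothetical chart in Cloud Storage; any accessible image URI works.
    Part.from_uri("gs://your-bucket/revenue_chart.png", mime_type="image/png"),
    "What trend does this chart show, and what might explain it?",
])
print(response.text)
```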
To build a document search engine, we will implement RAG using:
- the Vertex AI Gemini API
- the Vertex AI Embeddings API, for both text embeddings and multimodal embeddings (a minimal sketch follows this list)
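
As a rough sketch of how these two kinds of embeddings can be generated with the Vertex AI SDK (the model versions shown are commonly published ones, so confirm which your project can use; the image filename is a placeholder):

```python
from vertexai.language_models import TextEmbeddingModel
from vertexai.vision_models import Image, MultiModalEmbeddingModel

# Text embeddings for document chunks.
text_model = TextEmbeddingModel.from_pretrained("text-embedding-004")
text_embedding = text_model.get_embeddings(["What is multimodal RAG?"])[0].values

# Multimodal embeddings place images and text in a shared vector space,
# so an image can be retrieved with a text query and vice versa.
mm_model = MultiModalEmbeddingModel.from_pretrained("multimodalembedding@001")
mm_embeddings = mm_model.get_embeddings(
    image=Image.load_from_file("page_1_figure.png"),  # placeholder file
    contextual_text="Architecture diagram from the PDF",
)
image_embedding = mm_embeddings.image_embedding
```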
This notebook provides a step-by-step guide to building a document search engine with mRAG:
- Extract and store metadata of documents containing both text and images, and generate embeddings for the documents
- Search the metadata with text queries to find similar text or images (a retrieval sketch follows this list)
- Search the metadata with image queries to find similar images
- Using a text query as input, search for contextual answers using both text and images
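
As a minimal sketch of the search steps, assuming embeddings have already been generated and stored in memory: retrieval can be as simple as ranking stored vectors by cosine similarity against the query embedding. The helper names below are illustrative, not from the notebook:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a), np.asarray(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def top_k(query_embedding, doc_embeddings, doc_metadata, k=3):
    """Return the k stored chunks (text or image metadata) closest to the query."""
    scores = [cosine_similarity(query_embedding, e) for e in doc_embeddings]
    ranked = sorted(zip(scores, doc_metadata), key=lambda pair: pair[0], reverse=True)
    return ranked[:k]
```

The top-ranked text chunks and images can then be passed to Gemini as context, as in the multimodal request sketched earlier, to produce a grounded answer.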
- Vertex AI pricing
- Pricing Calculator (generates a cost estimate)