Note
This component is part of GPT-RAG.
The diagram below provides an overview of the document ingestion pipeline, which handles various document types, preparing them for indexing and retrieval.
Document Ingestion Pipeline
Workflow
- The `ragindex-indexer-chunk-documents` indexer reads new documents from the `documents` blob container.
- For each document, it calls the `document-chunking` function app to segment the content into chunks and generate embeddings using an embedding model, as sketched below.
- Finally, each chunk is indexed in the AI Search Index.
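As a rough illustration of the hand-off between the indexer and the chunking step, the sketch below posts a document reference to a chunking endpoint and reads back the resulting chunks. This is a minimal sketch only: the route, payload fields, and response shape are hypothetical placeholders, not the actual contract of the `document-chunking` function app.

```python
import requests

def chunk_document(endpoint: str, blob_url: str) -> list[dict]:
    """Ask a chunking endpoint to split one document and return its chunks.

    Hypothetical contract: the real document-chunking function app may use
    different routes, field names, and authentication.
    """
    payload = {
        "documentUrl": blob_url,                  # blob in the documents container
        "documentContentType": "application/pdf"  # assumed content type
    }
    response = requests.post(endpoint, json=payload, timeout=300)
    response.raise_for_status()
    # Assume each chunk carries its text plus an embedding vector.
    return response.json().get("chunks", [])
```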
The `document_chunking` function breaks documents into smaller segments called chunks. When a document is submitted, the system identifies its file type and selects the appropriate chunker to divide it into chunks suitable for that specific type.
- For document files such as `.pdf`, the system uses the `DocumentAnalysisChunker` with the Document Intelligence API, which extracts structured elements, like tables and sections, and converts them into Markdown. LangChain splitters then segment the content based on sections. When Document Intelligence API 4.0 is enabled, `.docx` and `.pptx` files are processed with this chunker as well.
- For image files such as `.bmp`, `.png`, `.jpeg`, and `.tiff`, the `DocumentAnalysisChunker` performs Optical Character Recognition (OCR) to extract text before chunking.
- For specialized formats, specific chunkers are applied: `.vtt` files (video transcriptions) are handled by the `TranscriptionChunker`, which chunks content by time codes, and `.xlsx` files (spreadsheets) are processed by the `SpreadsheetChunker`, which chunks by rows or sheets.
- For text-based files like `.txt`, `.md`, `.json`, and `.csv`, the `LangChainChunker` uses LangChain splitters to divide the content by paragraphs or sections.
This setup ensures each document is processed by the most suitable chunker, leading to efficient and accurate chunking.
Important
The file extension determines the choice of chunker as outlined above.
Tip
The chunking process is customizable: you can modify existing chunkers or create new ones to meet specific data processing needs and optimize the pipeline.
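To make the selection rules concrete, here is a minimal sketch of extension-based chunker dispatch. The mapping mirrors the list above, but the registry, the fallback choice, and the function name are illustrative, not the repository's actual dispatch code.

```python
import os

# Illustrative mapping from file extension to chunker, mirroring the rules above.
CHUNKER_BY_EXTENSION = {
    ".pdf": "DocumentAnalysisChunker",
    ".bmp": "DocumentAnalysisChunker",
    ".png": "DocumentAnalysisChunker",
    ".jpeg": "DocumentAnalysisChunker",
    ".tiff": "DocumentAnalysisChunker",
    ".docx": "DocumentAnalysisChunker",  # requires Document Intelligence API 4.0
    ".pptx": "DocumentAnalysisChunker",  # requires Document Intelligence API 4.0
    ".vtt": "TranscriptionChunker",
    ".xlsx": "SpreadsheetChunker",
    ".txt": "LangChainChunker",
    ".md": "LangChainChunker",
    ".json": "LangChainChunker",
    ".csv": "LangChainChunker",
}

def pick_chunker(filename: str) -> str:
    """Return the chunker name for a file; the fallback here is an assumption."""
    extension = os.path.splitext(filename)[1].lower()
    return CHUNKER_BY_EXTENSION.get(extension, "LangChainChunker")

print(pick_chunker("quarterly-report.pdf"))  # DocumentAnalysisChunker
```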
This repository supports image ingestion for a multimodal RAG scenario. For an overview of how multimodality is implemented in GPT-RAG, see Multimodal RAG Overview.
To enable multimodal ingestion, set the `MULTIMODAL` environment variable to `true` before starting to index your data.
When `MULTIMODAL` is set to `true`, the data ingestion pipeline extends its capabilities to handle both text and images within your source documents, using the `MultimodalChunker`. Below is an overview of how this multimodal ingestion process works, including image extraction, captioning, and cleanup.
- Thresholded Image Extraction
  - The system uses Document Intelligence to parse each document, detecting text elements as well as embedded images. This approach extends the standard `DocumentAnalysisChunker` by adding image extraction steps on top of the usual text-based process.
  - To avoid clutter and maintain relevance, an area threshold is applied so that only images exceeding a certain percentage of the page size are ingested (see the sketch after this list). This ensures very small or irrelevant images are skipped.
  - Any images meeting or exceeding this threshold are then extracted for further processing.
- Image Storage in Blob Container
  - Detected images are downloaded and placed in a dedicated Blob Storage container (by default `documents-images`).
  - Each image is assigned a blob name and a URL, enabling the ingestion pipeline (and later queries) to reference where the image is stored.
- Textual Content and Captions
  - Alongside normal text chunking (paragraphs, sections, etc.), each extracted image is captioned to generate a concise textual description of its contents.
  - These captions are combined with the surrounding text, allowing chunks to contain both plain text and image references (with descriptive captions).
- Unified Embeddings and Indexing
  - The ingestion pipeline produces embeddings for both text chunks and the generated image captions, storing them in the AI Search Index.
  - The index is adapted to include fields for `contentVector` (text embeddings) and `captionVector` (image caption embeddings), as well as references to any related images in the `documents-images` container.
  - This architecture allows multimodal retrieval, where queries can match either the main text or the descriptive captions.
- Image Cleanup Routine
  - A dedicated purging process periodically checks the `documents-images` container and removes any images no longer referenced in the AI Search Index.
  - This ensures storage is kept in sync with ingested content, avoiding orphaned or stale images that are no longer needed.
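As referenced in the extraction step above, here is a minimal sketch of the area-threshold check: an image is kept only if it covers at least a configurable fraction of the page. The 5% default, the units, and the function name are illustrative assumptions.

```python
def should_ingest_image(image_width: float, image_height: float,
                        page_width: float, page_height: float,
                        min_area_ratio: float = 0.05) -> bool:
    """Keep an image only if it covers at least min_area_ratio of the page area."""
    page_area = page_width * page_height
    if page_area == 0:
        return False
    return (image_width * image_height) / page_area >= min_area_ratio

# Example on a letter-size page (8.5 x 11 inches): a small logo is skipped,
# while a larger figure is extracted for captioning and storage.
print(should_ingest_image(1.0, 0.5, 8.5, 11.0))  # False
print(should_ingest_image(4.0, 3.0, 8.5, 11.0))  # True
```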
By activating `MULTIMODAL`, your ingestion process captures both text and visuals in a single workflow, providing a richer knowledge base for Retrieval-Augmented Generation scenarios. Queries can match not just textual content but also relevant image captions, retrieving valuable visual context stored in `documents-images`.
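To make the index shape concrete, below is a hedged sketch of what a single multimodal chunk might look like when pushed to the AI Search Index. Only `contentVector`, `captionVector`, and the `documents-images` container come from the description above; the remaining field names and values are illustrative assumptions.

```python
# Illustrative multimodal chunk document; field names other than contentVector
# and captionVector are assumptions, and the vectors are truncated for brevity.
example_chunk = {
    "id": "annual-report-pdf-chunk-0007",
    "content": "Figure 3 compares quarterly revenue across regions...",
    "contentVector": [0.012, -0.034, 0.101],   # text embedding (truncated)
    "caption": "Bar chart of quarterly revenue by region.",
    "captionVector": [0.051, 0.002, -0.087],   # caption embedding (truncated)
    "relatedImages": [
        "https://<storage-account>.blob.core.windows.net/documents-images/annual-report-pdf/figure-3.png"
    ],
}
```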
If you are using the few-shot or few-shot scaled NL2SQL strategies in your orchestration component, you may want to index NL2SQL content for use during the retrieval step. The idea is that this content will aid in SQL query creation with these strategies. More details about these NL2SQL strategies can be found in the orchestrator repository.
Note
This also applies to the agentic strategy for Fabric. For Fabric, the query can be written in either DAX or SQL, depending on the type of data source (Semantic Model or SQL Endpoint, respectively) defined in the orchestration configuration.
The NL2SQL Ingestion Process indexes three content types:
- query: Examples of queries for both few-shot and few-shot scaled strategies.
- table: Descriptions of tables for the few-shot scaled scenario.
- column: Descriptions of columns for the few-shot scaled scenario.
Note
If you are using the few-shot strategy, you will only need to index queries.
Each item, whether a query, table, or column, is represented in a JSON file containing information specific to that item type.
Here’s an example of a query file:
```json
{
  "datasource": "adventureworks",
  "question": "What are the top 5 most expensive products currently available for sale?",
  "query": "SELECT TOP 5 ProductID, Name, ListPrice FROM SalesLT.Product WHERE SellEndDate IS NULL ORDER BY ListPrice DESC",
  "selected_tables": ["SalesLT.Product"],
  "selected_columns": [
    "SalesLT.Product-ProductID",
    "SalesLT.Product-Name",
    "SalesLT.Product-ListPrice",
    "SalesLT.Product-SellEndDate"
  ],
  "reasoning": "This query retrieves the top 5 products with the highest selling prices that are currently available for sale. It uses the SalesLT.Product table, selects relevant columns, and filters out products that are no longer available by checking that SellEndDate is NULL."
}
```
In the `nl2sql` directory of this repository, you can find additional examples of queries, tables, and columns for the following Adventure Works sample SQL Database tables.
Sample Adventure Works Database Tables
Note
You can deploy this sample database in your Azure SQL Database.
The diagram below illustrates the NL2SQL data ingestion pipeline.
NL2SQL Ingestion Pipeline
Workflow
This outlines the ingestion workflow for query elements.
Note
The workflow for tables and columns is similar; just replace queries with tables or columns in the steps below.
- The AI Search `queries-indexer` scans for new query files (each containing a single query) within the `queries` folder in the `nl2sql` storage container.
Note
Files are stored in the `queries` folder, not in the root of the `nl2sql` container. This setup also applies to the `tables` and `columns` folders.
- The `queries-indexer` then uses the `#Microsoft.Skills.Text.AzureOpenAIEmbeddingSkill` to create a vectorized representation of the question text using the Azure OpenAI Embeddings model.
Note
For query items, the question itself is vectorized. For tables and columns, their descriptions are vectorized.
- Finally, the indexed content is added to the `nl2sql-queries` index.
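As a sketch of how a new query example could be placed where the `queries-indexer` will pick it up, the snippet below uploads a query JSON file into the `queries` folder of the `nl2sql` container with the Azure Blob Storage SDK. The storage account URL and blob name are placeholders, and authentication will depend on your environment.

```python
import json
from azure.identity import DefaultAzureCredential
from azure.storage.blob import BlobServiceClient

blob_service = BlobServiceClient(
    account_url="https://<storage-account>.blob.core.windows.net",  # placeholder
    credential=DefaultAzureCredential(),
)

query_item = {
    "datasource": "adventureworks",
    "question": "What are the top 5 most expensive products currently available for sale?",
    "query": "SELECT TOP 5 ProductID, Name, ListPrice FROM SalesLT.Product "
             "WHERE SellEndDate IS NULL ORDER BY ListPrice DESC",
}

# Note the queries/ prefix: files go in the queries folder, not the container root.
blob_client = blob_service.get_blob_client(
    container="nl2sql", blob="queries/top5-expensive-products.json"
)
blob_client.upload_blob(json.dumps(query_item, indent=2), overwrite=True)
```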
The SharePoint connector operates through two primary processes, each running in a separate function within the Data Ingestion Function App:
- Indexing SharePoint Files: the `sharepoint_index_files` function retrieves files from SharePoint, processes them, and indexes their content into the Azure AI Search Index (`ragindex`).
- Purging Deleted Files: the `sharepoint_purge_deleted_files` function identifies and removes files that have been deleted from SharePoint to keep the search index up to date.
Both processes are managed by scheduled Azure Functions that run at regular intervals, using configuration settings to determine their behavior. The diagram below illustrates the SharePoint indexing workflow.
SharePoint Indexing Workflow
Workflow
- Indexing Process (`sharepoint_index_files`)
  1.1. List files from the SharePoint site, directory, and file types configured in the settings.
  1.2. Check whether the document already exists in the AI Search Index. If it exists, compare the `metadata_storage_last_modified` field to determine whether the file has been updated (see the sketch after this list).
  1.3. Use the Microsoft Graph API to download the file if it is new or has been updated.
  1.4. Process the file content using the regular document chunking process. For specific formats, like PDFs, use Document Intelligence.
  1.5. Use Azure OpenAI to generate embeddings for the document chunks.
  1.6. Upload the processed document chunks, metadata, and embeddings into the Azure AI Search Index.
- Purging Deleted Files (`sharepoint_purge_deleted_files`)
  2.1. Connect to the Azure AI Search Index to identify indexed documents.
  2.2. Query the Microsoft Graph API to verify the existence of corresponding files in SharePoint.
  2.3. Remove entries in the Azure AI Search Index for files that no longer exist.
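Referenced from step 1.2, here is a minimal sketch of the freshness check: the file's SharePoint last-modified timestamp is compared against the `metadata_storage_last_modified` value already stored in the index. The function name and timestamp handling are illustrative.

```python
from datetime import datetime

def needs_reindexing(sharepoint_last_modified: str,
                     indexed_last_modified: str | None) -> bool:
    """Return True if the SharePoint file is new or newer than the indexed copy.

    Timestamps are assumed to be ISO 8601 strings such as "2024-05-01T12:30:00Z".
    """
    if indexed_last_modified is None:
        return True  # not in the index yet, so treat it as new
    current = datetime.fromisoformat(sharepoint_last_modified.replace("Z", "+00:00"))
    indexed = datetime.fromisoformat(indexed_last_modified.replace("Z", "+00:00"))
    return current > indexed

print(needs_reindexing("2024-05-02T08:00:00Z", "2024-05-01T12:30:00Z"))  # True
```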
Azure Function triggers automate the indexing and purging processes. Indexing runs at regular intervals to ingest updated SharePoint files, while purging removes deleted files to maintain an accurate search index. By default, both processes run every 10 minutes when enabled.
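As a minimal sketch of this scheduling, the snippet below declares the two functions with 10-minute timer triggers using the Azure Functions Python v2 programming model; the function bodies are placeholders, and the actual app may wire its triggers differently.

```python
import azure.functions as func

app = func.FunctionApp()

# NCRONTAB expression: second minute hour day month day-of-week (every 10 minutes).
@app.timer_trigger(schedule="0 */10 * * * *", arg_name="timer")
def sharepoint_index_files(timer: func.TimerRequest) -> None:
    # Placeholder: list, download, chunk, embed, and index SharePoint files.
    pass

@app.timer_trigger(schedule="0 */10 * * * *", arg_name="timer")
def sharepoint_purge_deleted_files(timer: func.TimerRequest) -> None:
    # Placeholder: remove index entries for files deleted from SharePoint.
    pass
```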
If you'd like to learn how to set up the SharePoint connector, check out SharePoint Connector Setup.
- Provision the infrastructure and deploy the solution using the GPT-RAG template.
- Redeployment Steps:
  - Prerequisites:
    - Azure Developer CLI: Download azd for Windows, Other OS's.
    - PowerShell 7+ (Windows only): PowerShell.
    - Git: Download Git.
    - Python 3.11: Download Python.
  - Redeployment Commands:
    ```
    azd auth login
    azd env refresh
    azd deploy
    ```
Important
Use the same environment name, subscription, and region as the initial deployment when running `azd env refresh`. However, if your resource group is in the `eastus` region but your function app is in the `eastasia` region, make sure `AZURE_LOCATION` is set to `eastasia`.
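If the location needs to change, it can be set on the azd environment before deploying, for example with `azd env set AZURE_LOCATION eastasia`.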
For more information on how to test the Data Ingestion component locally using VS Code, refer to the Local Deployment guide.
For more information on how to configure the SharePoint connector, refer to the GPT-RAG SharePoint Setup guide.
For more information on uploading documents for ingestion, refer to the GPT-RAG Admin & User Guide: Uploading Documents for Ingestion section.
For more information on reindexing documents in AI Search, refer to the GPT-RAG Admin & User Guide: Reindexing Documents in AI Search section.
Here are the formats supported by each chunker. The file extension determines which chunker is used.
DocumentAnalysisChunker (Document Intelligence based)
| Extension | Doc Int API Version |
|---|---|
| pdf | 3.1, 4.0 |
| bmp | 3.1, 4.0 |
| jpeg | 3.1, 4.0 |
| png | 3.1, 4.0 |
| tiff | 3.1, 4.0 |
| xlsx | 4.0 |
| docx | 4.0 |
| pptx | 4.0 |
LangChainChunker
| Extension | Format |
|---|---|
| md | Markdown document |
| txt | Plain text file |
| html | HTML document |
| shtml | Server-side HTML document |
| htm | HTML document |
| py | Python script |
| json | JSON data file |
| csv | Comma-separated values file |
| xml | XML data file |
TranscriptionChunker
| Extension | Format |
|---|---|
| vtt | Video transcription |
SpreadsheetChunker
| Extension | Format |
|---|---|
| xlsx | Spreadsheet |