Releases: NVIDIA-AI-Blueprints/rag
v2.2.1
This minor patch release updates to the latest nv-ingest client version, 25.6.3, to fix breaking changes introduced by pypdfium.
Details here:
https://github.com/NVIDIA/nv-ingest/releases/tag/25.6.3
All existing prebuilt containers should work.
The corresponding PyPI package has been released:
https://pypi.org/project/nvidia-rag/2.2.1/
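To pin a deployment to this release, the matching package can be installed and verified; the version pin comes from the PyPI link above.

```python
# Install the release that matches this tag:
#   pip install nvidia-rag==2.2.1
from importlib.metadata import version

# Confirm the installed distribution matches the release notes.
print(version("nvidia-rag"))  # expected: 2.2.1
```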
v2.2.0
This release adds B200 platform support, a native Python API, and major enhancements for multimodal and metadata features. It also improves deployment flexibility and customization across the RAG blueprint.
Added
- Support for deploying the blueprint on the B200 platform.
- Support for a native Python API (see the sketch after this list)
- Refactored code and directory structure to support the Python API
- Better modularization for easier customization
- Moved to `uv` as the package manager for this project
- Added support for configurable vector store consistency levels (Bounded/Strong/Session) to optimize retrieval performance vs accuracy trade-offs.
- Capability to add custom metadata to files, plus metadata-based filtering
- Documentation for using Multi-Instance GPU (MIG), which reduces the minimum GPU requirement for helm charts to 3xH100.
- Multi-collection retrieval support
- Support for audio files (.mp3 and .wav)
- Support for Vision Language Model (VLM) based generation for charts and images
- Support for generating summaries of uploaded files
- Sample user interface enhancements
- Support for non-blocking file upload
- More efficient error reporting for ingestion failures
- Prompt customization support without rebuilding images
- Added support for enabling infographic extraction, which improves accuracy for documents containing text in image format.
  - See this guide for details
- New customizations
  - How to support non-nv-ingest-based ingestion and retrieval
  - How to enable CPU-based Milvus
  - How to enable nemoretriever-parse as an alternate PDF parser
  - How to use the standalone nv-ingest Python client for ingestion
- NVIDIA AI Workbench support
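For a feel of the new native Python API together with custom metadata and multi-collection retrieval, a minimal sketch follows. The import path, class names, method names, and every parameter are illustrative assumptions; consult the published nvidia-rag package documentation for the actual surface.

```python
# Hypothetical sketch of the native Python API; names and parameters below
# are illustrative assumptions, not the confirmed API surface.
from nvidia_rag import NvidiaRAG, NvidiaRAGIngestor  # assumed import path

ingestor = NvidiaRAGIngestor()
# Custom file metadata (new in this release); the payload shape is assumed.
ingestor.upload_documents(
    collection_name="product_docs",
    filepaths=["./manuals/setup.pdf"],
    custom_metadata=[{"filename": "setup.pdf", "metadata": {"team": "platform"}}],
)

rag = NvidiaRAG()
# Multi-collection retrieval (new in this release): query several collections,
# optionally narrowed by a metadata filter (filter syntax is assumed).
response = rag.generate(
    query="How do I set the vector store consistency level?",
    collection_names=["product_docs", "faq"],
)
print(response)
```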
Changed
- Changed API schema to support newly added features
  - POST /collections is deprecated in favour of POST /collection for the ingestor-server
  - A new endpoint, GET /summary, has been added to the rag-server (see the example after this list)
  - Metadata information is now available as part of the GET /collections and GET /documents APIs
  - Check out the migration guide for detailed changes at the API level
- Optimized batch-mode ingestion to improve performance for concurrent multi-user file uploads.
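For developers adjusting clients to the new schema, here is a sketch using Python's `requests`. The paths (`POST /collection`, `GET /summary`, `GET /collections`) come from the notes above; the hosts, ports, payload, and query-parameter names are assumptions.

```python
import requests

INGESTOR = "http://localhost:8082"  # assumed ingestor-server address
RAG = "http://localhost:8081"       # assumed rag-server address

# POST /collection replaces the deprecated POST /collections on the ingestor-server.
requests.post(f"{INGESTOR}/collection", json={"collection_name": "product_docs"})

# Metadata is now part of the GET /collections response.
print(requests.get(f"{INGESTOR}/collections").json())

# New GET /summary endpoint on the rag-server; query parameters are assumed.
summary = requests.get(
    f"{RAG}/summary",
    params={"collection_name": "product_docs", "file_name": "setup.pdf"},
)
print(summary.json())
```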
Known Issues
Check out this section to understand the known issues present for this release.
v2.1.0
This release reduces the overall GPU requirement for deploying the blueprint. It also improves performance and stability for both Docker- and Helm-based deployments.
Added
- Added non-blocking async support to the upload documents API (see the example after this list)
  - Added a new field `blocking: bool` to control this behaviour from the client side. Default is set to `true`.
  - Added a new API `/status` to monitor the state or completion status of uploaded documents.
- The Helm chart is published on the NGC public registry.
- Helm chart customization guide is now available for all optional features under documentation.
- Issues with very large file uploads have been fixed.
- Security enhancements and stability improvements.
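The `blocking` field and `/status` API added above map naturally onto an upload-then-poll loop. In the sketch below, only `blocking` and `/status` come from these notes; the upload endpoint, multipart layout, task-id field, state values, and port are assumptions.

```python
import time
import requests

INGESTOR = "http://localhost:8082"  # assumed ingestor-server address

# Kick off a non-blocking upload; `blocking` defaults to true per the notes.
with open("report.pdf", "rb") as f:
    resp = requests.post(
        f"{INGESTOR}/documents",          # upload endpoint path is assumed
        files={"documents": ("report.pdf", f, "application/pdf")},
        data={"blocking": "false"},
    )
task_id = resp.json().get("task_id")      # response field name is an assumption

# Poll the new /status API until the ingestion finishes or fails.
while True:
    status = requests.get(f"{INGESTOR}/status", params={"task_id": task_id}).json()
    if status.get("state") in ("FINISHED", "FAILED"):  # state values assumed
        break
    time.sleep(5)
print(status)
```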
Changed
- Overall GPU requirement reduced to 2xH100/3xA100.
- Changed the default LLM model to llama-3_3-nemotron-super-49b-v1. This reduces the GPUs needed to deploy the LLM to 1xH100/2xA100.
- Changed the default GPUs needed for all other NIMs (ingestion and reranker NIMs) to 1xH100/1xA100.
- Changed the default chunk size to 512 to reduce the LLM context size and, in turn, RAG server response latency.
- Exposed a config to split PDFs post-chunking, controlled using the `APP_NVINGEST_ENABLEPDFSPLITTER` environment variable in the ingestor-server. Default value is set to `True`.
- Added batch-based ingestion, which can help manage memory usage of the `ingestor-server` more effectively. Controlled using the `ENABLE_NV_INGEST_BATCH_MODE` and `NV_INGEST_FILES_PER_BATCH` variables. Default values are `True` and `100` respectively (see the sketch after this list).
- Removed `extract_options` from the API level of the `ingestor-server`.
- Resolved an issue during bulk ingestion where the whole job failed if ingestion of a single file failed.
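The three toggles above can be collected in one place before starting the ingestor-server. The variable names and defaults are exactly those stated in the notes; how they are injected (compose file, helm values, shell) is deployment-specific, so this is only a sketch.

```python
import os

# Toggles named in the notes above, shown with their stated defaults.
os.environ["APP_NVINGEST_ENABLEPDFSPLITTER"] = "True"  # split PDFs post-chunking
os.environ["ENABLE_NV_INGEST_BATCH_MODE"] = "True"     # batch-based ingestion
os.environ["NV_INGEST_FILES_PER_BATCH"] = "100"        # files per ingestion batch
```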
Known Issues
- The `rag-playground` container needs to be rebuilt if the `APP_LLM_MODELNAME`, `APP_EMBEDDINGS_MODELNAME` or `APP_RANKING_MODELNAME` environment variable values are changed.
- While uploading multiple files at the same time, there may be a timeout error `Error uploading documents: [Error: aborted] { code: 'ECONNRESET' }`. Developers are encouraged to use the APIs directly for bulk uploading instead of the sample rag-playground; the default timeout on the UI side is set to 1 hour for uploads.
- In case of failure while uploading files, error messages may not be shown in the rag-playground user interface. Developers are encouraged to check the `ingestor-server` logs for details.
A detailed guide is available here to ease the developer experience when migrating from older versions.
v2.0.0
[2.0.0] - 2025-03-18
This release adds support for multimodal documents using NVIDIA Ingest, including parsing of PDF, Word, and PowerPoint documents. It also significantly improves accuracy and performance by refactoring the APIs and architecture, and adds a new developer-friendly UI.
Added
- Integration with nv-ingest for the ingestion pipeline; the unstructured.io-based pipeline is now deprecated.
- OTEL-compatible observability and telemetry support.
- API refactoring. Updated schemas here.
- Support for runtime configuration of all common parameters (see the sketch after this list).
- Multimodal citation support.
- New dedicated endpoints for deleting collections, creating collections, and reingesting documents
- New React + Node.js based UI showcasing runtime configurations
- Added optional features to improve accuracy and reliability of the pipeline, turned off by default. Best practices here
- Brev.dev-compatible notebook
- Security enhancements and stability improvements
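As a rough illustration of runtime configuration, the sketch below sends per-request overrides to the rag-server. The `/generate` path and all field names are assumptions modeled on an OpenAI-style payload; the notes only state that common parameters are runtime-configurable.

```python
import requests

# Endpoint path and field names are assumptions for illustration only.
payload = {
    "messages": [{"role": "user", "content": "Summarize the uploaded report."}],
    "use_knowledge_base": True,  # assumed toggle name
    "temperature": 0.2,          # example of a per-request common parameter
    "reranker_top_k": 4,         # assumed field name
}
resp = requests.post("http://localhost:8081/generate", json=payload)
print(resp.json())
```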
Changed
- In RAG v1.0.0, a single server managed both ingestion and retrieval/generation APIs. In RAG v2.0.0, the architecture has evolved to utilize two separate microservices.
- Helm charts are now modularized; separate helm charts are provided for each distinct microservice.
- Default settings are configured to balance accuracy and performance.
- The default flow uses on-prem models, with the option to switch to API catalog endpoints for the docker-based flow.
- Query rewriting uses a smaller llama3.1-8b-instruct model and is turned off by default.
- Support for using conversation history during retrieval for low-latency multiturn support.
Known Issues
- The `rag-playground` container needs to be rebuilt if the `APP_LLM_MODELNAME`, `APP_EMBEDDINGS_MODELNAME` or `APP_RANKING_MODELNAME` environment variable values are changed.
- The optional features reflection, nemoguardrails and image captioning are not available in helm-based deployments.
- Uploading large files with a .txt extension may fail during ingestion; we recommend splitting such files into smaller parts to avoid this issue.
A detailed guide is available here to ease the developer experience when migrating from older versions.
v1.0.0
This is the first release of the NVIDIA AI RAG blueprint, which serves as a reference solution for a foundational Retrieval Augmented Generation (RAG) pipeline. This blueprint demonstrates how to set up a RAG solution that uses NVIDIA NIM and GPU-accelerated components.
By default, this blueprint leverages the NVIDIA-hosted models available in the NVIDIA API Catalog.
However, you can replace these models with your own locally-deployed NIMs to meet specific data governance and latency requirements.
For more details, check out the readme.
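Because NIM microservices expose OpenAI-compatible endpoints, swapping the NVIDIA-hosted models for locally deployed ones is, in a minimal sketch, a base-URL change. The model id and the local port below are placeholders rather than blueprint defaults.

```python
import os
from openai import OpenAI

# Hosted: NVIDIA API Catalog (the default for this blueprint).
hosted = OpenAI(
    base_url="https://integrate.api.nvidia.com/v1",
    api_key=os.environ["NVIDIA_API_KEY"],
)
# Local: a self-deployed NIM; port and auth are deployment-specific placeholders.
local = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

client = local  # or `hosted`, per data-governance and latency requirements
resp = client.chat.completions.create(
    model="meta/llama3-70b-instruct",  # placeholder model id
    messages=[{"role": "user", "content": "Hello"}],
)
print(resp.choices[0].message.content)
```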