Releases: opea-project/Enterprise-RAG
2.0.1: Intel® AI for Enterprise RAG
Getting Started
To deploy the Intel® AI for Enterprise RAG application, follow the instructions.
Highlights:
- Late Chunking Enhancements: Improved document ingestion performance in late chunking mode by up to 6×, ensured chunk alignment with original text for better accuracy, and added full telemetry for the late chunking microservice and its logs in Grafana.
- Two new publications on Enterprise RAG:
- vLLM CPU: Updated to v0.11.2
Detailed changes
AI / Development
- Improved late chunking ingestion performance by reducing TorchServe serialization overhead, resulting in up to 6× faster document ingestion.
- Updated chunk extraction in late chunking mode to pull text directly from the original document, improving accuracy and consistency.
- Upgraded vLLM CPU to v0.11.2 and added high-concurrency handling by increasing connection limits and keep-alive duration for long-running requests.
Deployment
- Improved file upload speed by:
- Dynamically calculating Celery BATCH_SIZE.
- Increasing resources for the extractor pod.
- Upgradability Improvements:
- Implemented version tracking for all ERAG components and enabled UI to display deployment version.
- Added a post-upgrade integrity check to verify data retention. (Currently requires manual execution).
- Added a pre-upgrade health check to ensure upgrades occur on healthy deployments. (Currently requires manual execution).
User Interface
ChatQnA
Chat
- Enhanced chat conversation feed UX:
- Chat feed no longer auto-scrolls when a historical chat item is selected; the conversation now starts from the beginning. A Scroll to Bottom button is available for quick navigation.
- When a user sends a new message, the chat scrolls down instantly, ensuring the message is visible at the top of the feed. Remaining space is preserved for streamed responses.
- Chat feed no longer scrolls automatically during response streaming.
- These changes also resolve issues where users could not scroll up during long streamed answers.
- Fixed an issue with Chat History streaming across different chats. All history items and related data (messages, sources, user input, etc.) are now stored separately, preventing previous conflicts.
Admin Panel
- Fixed an issue where the vLLM service node in the Admin Panel’s Control Plane tab was sporadically marked red. StatefulSet state is now interpreted correctly.
- Fixed an issue where email addresses and URLs enclosed in angle brackets (e.g., [email protected]) were removed from UI output.
Telemetry
- Extended Redis telemetry probe timeouts to improve stability.
- Added late chunking microservice metrics and logs, visible in Grafana.
- Extended the Accuracy Evaluator with configurable paths for setup configuration and cluster credentials, allowing non-default locations.
Known issues
- Late Chunking with similarity_search_with_siblings may exceed context. Using late chunking with search_type="similarity_search_with_siblings" may cause context overflow because neighboring chunks are added to the prompt. It is recommended to use late chunking with the default search type, which does not include neighboring chunks; see the sketch below.
- Empty references and source indexes. The chatbot may randomly provide answers with empty references and source indexes when the casperhansen/llama-3-8b-instruct-awq LLM model is used.
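Regarding the similarity_search_with_siblings issue above, here is a minimal sketch of the recommended configuration, assuming the retriever accepts a search_type field in its request payload; the endpoint URL, port, and field names are illustrative placeholders, not the documented API:

```python
import requests

payload = {
    "text": "What does the policy say about data retention?",
    # Assumed default search type: returns only the matched chunks, so the
    # prompt stays within the context window even with late chunking enabled.
    "search_type": "similarity",
}
resp = requests.post("http://retriever-svc:6620/v1/retrieval", json=payload, timeout=30)
resp.raise_for_status()
print(resp.json())
```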
2.0.0: Intel® AI for Enterprise RAG
Getting Started
To deploy your Intel® AI for Enterprise RAG application, please follow the instructions.
Highlights:
- New use case added! You can now use Intel® AI for Enterprise RAG Document Summarization, with a separate pipeline and UI for text- and file-based summaries.
- Replaced Bitnami images with custom Helm charts for Redis, MongoDB, Postgres, Apisix, and Keycloak to limit third-party dependencies.
- Added automated balloon sizing and reboot-survivability features (Istio streamlining, RAG refresh CronJob) to maximize hardware utilization and improve automatic recovery.
- Added Active Directory support for enterprise authentication.
- Enabled external inference endpoint support for flexible hybrid deployments with remote LLM services.
- Introduced PLLuM models with Polish prompt templates.
Detailed changes
AI / Development
- Document Summarization pipeline integrated
- Added Active Directory support for seamless integration with enterprise applications
- Added support for an external inference endpoint for vLLM
- PLLuM models were integrated into the pipeline, together with automatic support for Polish prompt templates
- [preview] Introduced Late Chunking as a preview feature, an advanced text-processing technique that improves embedding quality by preserving more semantic context across chunk boundaries (a conceptual sketch follows this list)
- Added a fallback option for generating presigned URLs if the storage endpoint is not configured or is not capable of token credential validation
- Parallelized LoadPdf in the Text Extractor
- Made HF_TOKEN optional - if the model is not gated/restricted, you no longer need to pass an HF_TOKEN
- Aligned the LLM microservice with the OpenAI API – the LLM microservice can now be easily used in third-party chains and pipelines (see the usage example after this list)
- vLLM HPU updated to v0.9.0.1+Gaudi-1.22.0
- Added `docs/accuracy_tuning_tips.md` with guidance for tuning accuracy with Late Chunking and other techniques
- Added `src/comps/vectorstores/CONTRIBUTING.md` with instructions on how to enable a new vector database in the pipeline
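For intuition, here is a minimal late-chunking sketch. It is an illustration only, not the Enterprise RAG implementation; the model name is a placeholder, and a fast tokenizer (for offset mappings) is assumed:

```python
import torch
from transformers import AutoModel, AutoTokenizer

MODEL = "sentence-transformers/all-MiniLM-L6-v2"  # placeholder embedding model

tokenizer = AutoTokenizer.from_pretrained(MODEL)
model = AutoModel.from_pretrained(MODEL)

def late_chunk_embeddings(text: str, chunk_char_spans):
    """Embed the whole document once, then pool per chunk ('late' chunking)."""
    enc = tokenizer(text, return_tensors="pt",
                    return_offsets_mapping=True, truncation=True)
    offsets = enc.pop("offset_mapping")[0]  # (seq_len, 2): char span of each token
    with torch.no_grad():
        # Every token embedding here has seen the full document as context.
        token_embs = model(**enc).last_hidden_state[0]
    chunk_vecs = []
    for start, end in chunk_char_spans:
        mask = (offsets[:, 0] >= start) & (offsets[:, 1] <= end)
        if mask.any():  # skip chunks that fell outside the truncated window
            chunk_vecs.append(token_embs[mask].mean(dim=0))
    return chunk_vecs
```

Because pooling happens after the full-document encoding, each chunk vector retains semantic context from beyond its own boundaries, which is what conventional chunk-then-embed pipelines lose.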
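And a short usage sketch for the OpenAI-aligned LLM microservice; the base_url, port, model name, and api_key below are placeholders for your deployment, not documented values:

```python
from openai import OpenAI

# Point this at the in-cluster address of your LLM microservice (placeholder URL).
client = OpenAI(base_url="http://llm-svc:9000/v1", api_key="unused")

resp = client.chat.completions.create(
    model="placeholder-model",
    messages=[{"role": "user", "content": "Summarize what Enterprise RAG does."}],
)
print(resp.choices[0].message.content)
```

Because the endpoint speaks the Chat Completions dialect, the same client object drops into any OpenAI-compatible chain or pipeline tooling.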
Deployment
- Replaced Bitnami images and Helm charts with custom solutions for:
- Redis (vdb)
- Fingerprint and Chat history (MongoDB)
- EDP (PostgreSQL)
- Apisix
- Keycloak
- Added automated calculation of balloon sizes.
- Added balloons for torchserve-embedding component
- The TDX with One TD approach has been promoted to a production-ready feature
- [preview] Created an `installer.sh` script that allows deploying the entire solution on pre-configured software
- A series of features have been added so the pipeline survives a cluster reboot:
- Istio streamlined – Istio is now applied at the beginning of deployment
- Added a CronJob `rag-watcher` to refresh RAG services after node reboot, ensuring clean startup and operation
- Upgradability:
- Metadata pre-upgrade verification implemented – compares the metadata available in the deployed pipeline with the metadata shipped with an upgrade
- Data consistency report added – reports the volume of user data in the components of the deployed pipeline
User Interface
- Document Summarization UI added
- Users can summarize plain text or content from a document file (supported file extensions: DOC, DOCX, PDF, MD).
- Generated summaries are stored in client-side history (retained until the page is refreshed or the session ends).
- Admin Panel Tabs:
- Control Plane – Displays pipeline status.
- Telemetry & Authentication – Provides links to Grafana and Keycloak.
- ChatQnA UI - Admin Panel: Added support for filtering and sorting columns in data tables within the Data Ingestion tab.
Telemetry
- Introduced a new `enabled` flag for telemetry traces, allowing users to control whether traces are deployed (default: false)
- Migrated the OpenTelemetry Collector base image from Ubuntu to Debian
- Upgraded telemetry components, including Grafana and associated Helm charts
- Updated instructions and behavior for accessing logs in Grafana's Explore view, reflecting changes in newer Grafana versions
- Added new monitor for the Docsum pipeline
Known issues
- A performance regression was observed during data ingestion in the Enhanced Dataprep Pipeline. The pipeline is currently optimized for chat, which can slow down file uploads. If you have a large amount of data to upload, consider this workaround: install the pipeline with balloons.enabled:False so that HPA scales the embeddings; after uploading the files, perform an install-on-install with balloons.enabled:True for best chat performance.
- It was observed that telemetry tracing might fail sporadically during deployment; tracing is therefore disabled for the moment.
- When telemetry tracing is enabled, only one component's spans are visible in Tempo. Expected behavior is to see spans for all eRAG microservices in the distributed trace.
- During late chunking, text decoding performed by the tokenizer introduces formatting changes compared to the original source (e.g., lowercase conversion, added separators). As a result, retrieved chunks may not fully match the original document.
- In the Admin Panel's Control Plane tab, the ChatQnA pipeline's vLLM service node may sporadically be colored red, as a Not Ready state is read from the API for its StatefulSet
- Document Summarization drag-and-drop file upload doesn't work. Please use Browse Files instead.
1.5.0: Intel® AI for Enterprise RAG
Getting Started
To deploy your Chat Q&A RAG application, please follow the instructions.
Highlights:
- Added EDP PostgreSQL migration strategy (default-enabled) for smoother upgrades
- Included source chunk text in guardrail / LLM output payloads for better traceability
- Simplified guardrails: system prompt template removed; only user prompt validated by default
- Implemented automatic MinIO–Keycloak OIDC self-healing cron job
- Added TorchServe balloon policies and Gaudi performance optimizations (incl. reranker pinning & auto vLLM scaling)
- Replaced TEI reranker with TorchServe reranker for improved efficiency
- Added Terraform scripts for AWS deployment plus configurable vector DB type & dimensions
- Enhanced Chat UI: source chunk dialog, stable history saving, Firefox interrupt fix
Detailed changes
AI / Development
- Implemented EDP database (PostgreSQL) migration strategy (enabled by default) to simplify upgrades
- Included chunk text in source metadata (LLM / output guard responses now return chunk content)
- Removed system prompt template from guardrails (only user prompt checked; reranked_docs and past answers still optional via Dataprep / output guardrails when enabled)
- Implemented cron job to auto-verify and reconfigure MinIO OIDC linkage with Keycloak (fixes stale presigned URL issues without admin action)
- Integrated latest GenAIComps core changes to accelerate microservice prototyping
Deployment
- Implemented balloon policies for TorchServe on Gaudi
- Implemented performance optimizations:
- Replaced TEI reranker with TorchServe reranker
- Added CPU pinning for TorchServe reranker
- Enabled automatic scaling of vLLM instances
- Added Terraform scripts to deploy ERAG on AWS
- Added configuration options for vector database type and vector dimensions to streamline embedding / reranker model changes
User Interface
Chat
- Added clickable source buttons that open a dialog showing retrieved chunks used to generate the answer
- Moved file download / external link actions to dialog footer (contextual buttons)
- Fixed Firefox error handling when interrupting streamed responses
- Set chat rename character limit to 250 (aligned with API constraint)
- Refactored chat history saving: background /save call now avoids unnecessary UI refresh and screen blinking unless a non-guardrails error occurs
Admin Panel
Control Plane
- Fixed sentiment scanner threshold argument range
- Added input validation and tooltip for Code Scanner supported languages
- Removed "Edit Service Arguments" button; "Confirm Changes" and "Cancel" now remain disabled until a modification is made
Data Ingestion
- Updated Processing Time column to display "N/A" for Uploaded state or zero start time
- Added UI performance optimizations to reduce unnecessary re-renders and screen blinking on data refetch
Telemetry
- Renamed GMC router metrics prefix from "llm" to "router" for clarity
- Added Grafana dashboards: E2E Time to First Token, E2E Pipeline Latency, Pre-LLM Pipeline Latency
- Fixed log visibility issue in Grafana when deploying pipeline via Kubespray
Known issues
- A user can ask a question exceeding the word limit, resulting in a general error
- Random issue of the chatbot not providing a context-sensitive answer to a specific prompt although relevant content was provided
- Post-install Gaudi operator installation fails in slow network conditions
- Grafana Logs Drilldown fails with grafana-lokiexplore-app plugin version 1.0.27: Opening the view in Grafana via "Explore → Logs → Show Logs" may crash with the error `Error: Minified React error #130 ...`. This occurs with grafana-lokiexplore-app v1.0.27 (released 2025-09-17). As a workaround, downgrade the plugin to v1.0.26: edit the `telemetry-grafana` ConfigMap to pin version 1.0.26, then restart the `monitoring/telemetry-grafana-xxx-xxx` k8s pod for the change to take effect. To verify, go to Grafana → Administration → Plugins, search for "Grafana Logs Drilldown", and confirm that the installed version is 1.0.26.
1.4.0: Intel® AI for Enterprise RAG
Getting Started
To deploy your Chat Q&A RAG application, please follow the instructions.
Highlights:
Major new features and improvements:
- Chat History: Users can now save, rename, export, and delete chats.
- Source Attribution in UI: RAG sources used in responses are now visible and downloadable.
- Accuracy Evaluation: Integrated GenAIEvals scripts for RAG performance testing.
- Multi-node Deployment Support: Includes node discovery and NUMA-aware vLLM sizing.
- Velero Backup Integration: Automated backup/restore is now an optional part of the cluster lifecycle (if enabled in config.yaml).
- Detailed Ingestion Timing: Users can inspect time breakdowns for each ingestion stage.
- Large File Deletion Bug Fixed: Files with >10,000 chunks now fully deleted.
Detailed changes
AI/Development
- Introduced Chat History: Endpoint details in src/comps/chat_history.
- Ported Accuracy Evaluation scripts from OPEA's GenAIEvals to Enterprise RAG (src/tests/e2e/evals/evaluation/rag_eval).
- RAG Source Attribution: UI now displays which ingested documents contributed to answers; files are downloadable.
- Detailed EDP Timing: Clicking ingestion time reveals breakdown (text extraction, splitting, etc.).
- Translation Pipeline (Preview): API-accessible, not yet in UI. Details in deployment/README.md#additional-pipelines.
- Large File Deletion Fix: Files with >10,000 chunks now properly deleted.
Deployment
- Added multi-node deployment support.
- Introduced node discovery mechanism.
- Created balloons policy and HPA support for torchserve-reranker.
- Enabled NUMA-aware vLLM sizing and inventory-based configuration.
- Moved PCV section and model definitions to inventory.
- Automated NFS server installation in infrastructure.yaml post-install tasks.
- Added automated backup/restore playbooks.
- Moved Velero installation to infrastructure.
- Added Terraform deployment for Gaudi 3 node on IBM Cloud.
- Changed default to use HPA with balloons policy.
User Interface
Chat
- Chats saved in left panel; users can rename, export (JSON), or delete.
- If ingested data was used, sources appear below responses:
- Links open in new tab.
- Files are downloaded directly.
Admin Panel
Control Plane
- Configurable services marked with cog icon; only these are clickable.
Data Ingestion
- Clicking Processing Time shows stage durations:
- Standard: 00:00:06.239
- Compact: 6s 239ms
- Auto-refresh every 10s until final status (Error, Ingested, etc.); toggleable in settings.
- Bulk ingestion via .txt file: URLs separated by commas, spaces, or new lines.
- Bucket Synchronization Dialog: Review and sync S3 discrepancies via UI.
Known issues
- [API-only] Deleting >70 documents at once may result in incomplete deletion.
- [input guards] After enabling input guards, using a forbidden word will cause the next three consecutive user queries to be blocked due to chat history enforcement (N+3)
- [vllm-gaudi] When running Enterprise RAG on Gaudi with the default Mixtral 8x7B model, only a single HPU device will be utilized
1.3.2: Intel® AI for Enterprise RAG - patch release
Release Notes
Detailed Changes
AI/Development
- Fix for Header/Footer stripper in TextCompressor microservice
- Enhanced documentation for Performance Tuning Tips
Known issues
- For Qwen models, artifacts may appear in the response.
1.3.1: Intel® AI for Enterprise RAG - patch release
Release Notes
Highlights:
- Enhanced model support with six additional LLMs including Meta-Llama-3.1, Qwen3, and Mistral variants
- Upgraded vLLM version to 0.9.2
- Expanded testing capabilities with PubMed dataset support and fixes for e2e performance tests
Publications:
Detailed Changes
AI/Development
- Added support for the following models:
- hugging-quants/Meta-Llama-3.1-8B-Instruct-AWQ-INT4
- meta-llama/Llama-3.1-8B-Instruct
- Qwen/Qwen3-14B-AWQ
- Qwen/Qwen3-14B
- solidrust/Mistral-7B-Instruct-v0.3-AWQ
- mistralai/Mistral-7B-Instruct-v0.3
- Upgraded vLLM version to 0.9.2
- Updated default resources for standard Redis and the text-splitter microservice to avoid OOM errors
- Added support for custom templates in resources-model-cpu.yaml
- Added support for the PubMed dataset and fixed input token length in e2e performance tests
- Added a "Performance Tuning Guide" for Xeon deployment
Known issues
- For Qwen models, artifacts may appear in the response.
1.3.0: Intel® AI for Enterprise RAG
Getting Started
To deploy your Chat Q&A RAG application, please follow the instructions.
Highlights:
- Retriever RBAC support: Document filtering based on user's access privileges to underlying S3 storage, enhancing security and data access control.
- Enhanced text extraction: Improved extraction for PDF, DOC, DOCX, and images including better hyperlink, table, and image text processing.
- Microservice architecture improvements: Split Dataprep into separate TextExtractor and TextSplitter services with new TextCompression microservice for cleaner document processing.
- Advanced retrieval algorithms: Added similarity_search_with_siblings algorithm to improve response accuracy by including adjacent chunks.
- Improved Redis implementation: Migrated to standalone namespace with Helm chart support for both single node and cluster setups for better performance.
- Backup/restore functionality: Added Velero-based backup and restore capabilities for Keycloak, EDP, and vector store database.
- UI Accessibility: Enhanced accessibility with React ARIA components and added syntax highlighting for code snippets.
Detailed changes
AI/Development
- Added Retriever RBAC support - document filtering based on user's access privileges to underlying S3 storage.
- Enhanced text extraction for PDF, DOC, DOCX, and images - improved hyperlink extraction, table text extraction, and image text extraction.
- Migrated text extraction from custom loader classes to Markitdown for ADOC, TXT, JSON, JSONL, CSV, XLSX, XLS, HTML, MD, XML, and YAML file formats.
- Introduced MarkdownSplitter for ADOC, MD, and HTML files to split text by sections and add this information to metadata.
- Added filename/URL and Section information to prompt template, improving responses to questions about document names.
- Split Dataprep microservice into separate TextExtractor and TextSplitter services.
- Introduced TextCompression microservice between TextExtractor and TextSplitter to clean and compress document text. More details here.
- Added the similarity_search_with_siblings algorithm to the retriever, configurable in the Admin Panel, which improves response accuracy by including adjacent chunks (a conceptual sketch follows this list).
- Enabled semantic chunking in Ansible and as a debug feature, with fixes for large files.
- Introduced Hierarchical Indexing for PDF files as an experimental feature, configurable via `config.yaml`. Learn more here.
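For intuition, a hedged sketch of what a sibling-aware search does; the vector-store API and metadata fields below are illustrative placeholders, not the Enterprise RAG retriever interface:

```python
def similarity_search_with_siblings(vector_store, query: str, k: int = 4):
    """Return the top-k hits plus each hit's neighboring chunks."""
    hits = vector_store.similarity_search(query, k=k)
    expanded = []
    for doc in hits:
        doc_id = doc.metadata["doc_id"]    # placeholder metadata fields
        idx = doc.metadata["chunk_index"]
        # Pull the previous, current, and next chunk of the same document.
        for sibling_idx in (idx - 1, idx, idx + 1):
            sibling = vector_store.get_chunk(doc_id, sibling_idx)  # placeholder API
            if sibling is not None and sibling not in expanded:
                expanded.append(sibling)
    return expanded
```

Adjacent chunks often carry the sentence that completes a matched passage, which is why including them can improve answer accuracy at the cost of a longer prompt.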
User Interface
- Improved accessibility by refactoring UI components with React ARIA.
- Added syntax highlighting for code snippets in Chat.
- Implemented automatic scaling of ChatQnA pipeline graph size in Admin Panel - Control Plane.
Deployment
- Migrated Redis vector database from ChatQnA pipeline to standalone namespace.
- Deployed Redis via Helm chart - supporting both single node Redis and Redis-cluster for improved performance.
- Implemented balloons policy as an alternative method of pinning vLLM resources.
- Created backup/restore functionality using Velero for Keycloak, EDP, and the vector store database. Installation, update, and restore procedures are described in the documentation.
- Added support for deployment under user-defined domain names.
- Created Ansible scripts for simplified Kubernetes deployment.
- Added Ansible scripts for deploying Gaudi via operator.
Security
- Removed non-functional scanners from guardrails.
- Enabled remaining input guardrails in UI.
- Fixed and enhanced guardrails end-to-end tests.
- Enabled fingerprint capability for dataprep guardrail.
- Upgraded LLM Guard package to version 3.16.
Known issues
- When using Redis as a vector database, the default resource settings are not optimized, causing Redis to start with configurations that are unsuitable for production environments or intensive testing. To address this, remove the existing resource and persistence node configurations from here and update them with the following settings:

```yaml
redis:
  (...)
  master:
    persistence:
      enabled: true
      size: "10Gi"
    resources:
      requests:
        cpu: 2
        memory: 4Gi
      limits:
        cpu: 16
        memory: 16Gi
  replica:
    persistence:
      enabled: true
      size: "10Gi"
    resources:
      requests:
        cpu: 2
        memory: 4Gi
      limits:
        cpu: 16
        memory: 16Gi
```

Note: The resource configuration for redis-cluster is not affected and is correctly set up by default.
1.2.1: Intel® AI for Enterprise RAG - patch release
Release Notes
Highlights:
- Enhanced Performance: Improved hardware support with Habana Gaudi 1.21.0 and implemented core pinning for vLLM pods, resulting in better inference performance.
- Optimized Model Deployment: Added pre-configured optimizations for LLM models and set a default quantized model (`llama-3-8b-instruct-awq`) for efficient CPU inference.
- Improved Infrastructure Flexibility: Added support for user-defined domain names and S3-compatible storage backends, with smarter resource management that prevents unnecessary MinIO service activation.
- Enhanced Data Processing: Improved Dataprep capabilities with extended link parsing for supported file types and added safeguards to prevent service hangs.
- Extended Hardware Support: Added TDX support in deployment scripts and fixed installation paths for Gaudi-based deployments.
Detailed Changes
AI/Development
- Update Habana Gaudi to 1.21.0
- Dataprep - Enable parsing links that target files (only extensions that are already supported), not only HTML
- Fix parsing no_proxy parameter in EDP
- Add timeout to Dataprep microservices to avoid indefinite hangs
- Fix sticky session for the generic connector in LLM microservice to enable load balancing for multiple replicas
Deployment
- Created file with optimized configurations for running LLM models
- Set `casperhansen/llama-3-8b-instruct-awq` as the default quantized model for CPU inference
- Implemented core pinning mechanism for vLLM pods to improve performance
- Enabled user-defined domain name configuration
- Added support for TDX in Ansible deployment scripts
- Documentation update - added detailed instructions on setting up S3 or S3-compatible storage as a backend in EDP
- MinIO service is no longer started when a different storage backend (e.g., S3 or S3-compatible) is configured in EDP, preventing unnecessary resource usage
- Resolved issue with incorrect file paths in `install_chatqna.sh` for Gaudi-based installations - the script now uses "hpu" as expected
Known issues
- GMC can update variables passed in config maps or as environment variables. Scripts cannot update changes that don't apply to other objects.
1.2.0: Intel® AI for Enterprise RAG
Getting Started
To deploy your Chat Q&A RAG application, please follow the instructions.
Highlights:
- Ansible-based installation: Migrated installation from Bash scripts to Ansible-based deployment, improving ease of use and providing more flexibility in configuring the deployment.
- Images on External Registry: Enterprise RAG container images are now available on Docker Hub.
- Connectivity to custom S3 storage: Enhanced support for custom S3 storage options, building on existing support for AWS S3 storage and S3 API-compatible storage.
- Data synchronization: Added support for manual and scheduled S3 object synchronization.
- ChatQnA: Support for follow-up questions in ChatQnA.
- HPA: Implemented Horizontal Pod Autoscaling
- Scale-out: Added scale-out capability on Xeon infrastructure
- Telemetry: Grafana dashboard enhancements including HPA metrics and microservice response time and traffic metrics
AI/Development
- Adjusted microservices and the API to accept follow-up questions and implemented a naive approach to passing conversation history to the LLM (a sketch follows this list)
- The Enterprise RAG pipeline was rewritten to accept the OpenAI Chat Completions API
- Added a manual and scheduled S3 object synchronization option
- Added a way to disable certificate verification for internal and external S3 endpoints
- Introduced rerank_score_threshold - a parameter to filter out low-quality outputs from the retriever (illustrated in the second sketch below)
- Added support in the Enterprise RAG Application for the AsciiDoc extension
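A minimal sketch of that naive history passing in Chat Completions form; the turn structure is an assumption for illustration, not the exact internal format:

```python
def build_messages(history, new_question):
    """Flatten prior turns into the messages list sent to the LLM."""
    messages = []
    for turn in history:  # assumed shape: {"question": ..., "answer": ...}
        messages.append({"role": "user", "content": turn["question"]})
        messages.append({"role": "assistant", "content": turn["answer"]})
    messages.append({"role": "user", "content": new_question})
    return messages
```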
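And a one-line illustration of the intent behind rerank_score_threshold; the document structure and score field are assumptions:

```python
def apply_rerank_threshold(reranked_docs, threshold: float):
    # Keep only documents whose reranker score clears the configured threshold.
    return [doc for doc in reranked_docs if doc["score"] >= threshold]
```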
UI
- Support for ChatQnA follow-up questions feature
- Refactored parsing and styling of text returned as Markdown
- Prompt Template service configuration in Control Plane aligned to support OpenAI Chat Completions API approach
- API requests definitions refactored to use Redux Toolkit Query
Deployment
- Migrated ERAG deployment to Ansible scripts (Bash scripts are now deprecated)
- Centralized control of all settings through inventory/test-config/config.yaml
- Pushed container images to an external public registry (no need to build images manually)
- Implemented horizontal pod autoscaling (HPA) mechanism
- Enabled configuration updates by modifying files under deployment/pipelines/resources or /deployment/components/gmc/microservices-connector/helm/values.yaml
Telemetry
- New Grafana Dashboard for Horizontal Pod Autoscaler (HPA) metrics
- Improvements to the Grafana dashboard `EnterpriseRAG / Services / Details` - it now displays the services' average response time over the interval window instead of from the start, shows 'No requests detected' when there is no traffic, and organizes the panels into separate rows for a better layout.
- Fixed instrumentation patch for torchserve-embedding
- Added monitoring for out-guard-usvc
Security (Guardrails/Istio)
- Documentation on Istio was added to docs/istio.md
Known issues
- The prebuilt external Docker image for the vLLM model server may encounter compatibility issues on Intel® Sierra Forest and Intel® Ice Lake platforms. To ensure proper functionality, it is recommended to build the image manually on the target system.
- GMC is able to update variables that are passed in ConfigMaps or as environment variables. Scripts are not able to update changes that don't apply to other objects.
- In EDP, the no_proxy parameter is not parsed - this might become an issue for S3-compatible storage
- Requests in the Dataprep microservice don't have timeouts and might hang indefinitely if an unhandled error occurs.
- Ingestion of non-HTML content from a provided URL is not supported. Ingesting non-HTML content (e.g., a PDF link without a file extension) will result in an EDP hang. In such cases, restarting the EDP pods is required to restore functionality.
- LLM input guardrails issues: ban_code, ban_topics.
- Enabling Gibberish can result in awkward and unexpected ChatQnA prompt responses.
- The `install_chatqna.sh` script currently contains incorrect file paths for Gaudi-based installations. Specifically, the term "gaudi" should be replaced with "hpu" on lines 166 and 169 to enable proper installation. This issue is scheduled to be resolved in the upcoming minor release. Please note that this Bash script deployment method is deprecated and will be removed starting with version 1.3.0. Going forward, the recommended installation method is the Ansible-based deployment.
1.1.3: Intel® AI for Enterprise RAG - bug fix release
This release fixes the following issues:
Documentation
- Documentation updates for TDX deployment.