Release Notes: Intel® AI for Enterprise Inference – Version 1.2.0
Overview
Intel® AI for Enterprise Inference streamlines the deployment and management of AI inference services on Intel hardware. Focused on Kubernetes orchestration, it automates LLM deployment, compute provisioning, and hardware configuration for fast, scalable, and secure inference, both on-premises and in cloud-native settings. It is compatible with OpenAI-standard APIs, making it easy to integrate with enterprise applications.
System Requirements
| Category | Details |
|---|---|
| Operating System | Ubuntu 22.04 |
| Hardware Platforms | 3rd, 4th, 5th, 6th Gen Intel® Xeon® Scalable processors; Intel® Gaudi® 2 & 3 AI Accelerators |
| Gaudi Firmware | 1.21.0 |
| Network | Internet access for deployment; open ports for Kubernetes and the container registry |
| Storage | Sized to the deployed models and observability tools; at least 30 GB recommended for monitoring data |
| Other | SSH key pair, SSL/TLS certificates, Hugging Face token |
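Before starting a deployment, it can help to confirm that the Hugging Face token you plan to use is valid. The sketch below is illustrative only; it assumes the huggingface_hub package is installed and that the token is exported as HF_TOKEN (the variable name is an assumption, not a requirement of the product).

```python
# Illustrative pre-deployment check: verify the Hugging Face token.
# Assumption: the token is exported as HF_TOKEN; adjust to your workflow.
import os

from huggingface_hub import HfApi

token = os.environ.get("HF_TOKEN")
try:
    identity = HfApi(token=token).whoami()
    print(f"Hugging Face token is valid for: {identity['name']}")
except Exception as exc:
    raise SystemExit(f"Hugging Face token check failed: {exc}")
```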
Cluster Deployment Modes in Enterprise Inference
Enterprise inference workloads can be deployed in different cluster configurations depending on scale, performance needs, and availability requirements. The supported modes are described below, followed by a short sketch for inspecting an existing cluster.
Single Node Cluster — Quick start for testing or lightweight workloads:
Best for: Quick testing, proofs of concept (POCs), and lightweight workloads.
Purpose: Ideal for fast bring-up and low-latency inference in small-scale scenarios.
Setup: Runs entirely on a single Gaudi 3 node, which handles both control-plane and data-plane functions.
Benefits:
- Minimal orchestration overhead
- Fastest deployment time
- Suitable for single-user scenarios or serving a limited number of users
Single Master, Multiple Workers:
Best for: Medium-scale deployments requiring higher throughput.
Purpose: Separates Kubernetes infrastructure management from model execution to improve performance.
Setup:
- A master node (e.g., Xeon CPU) runs the Kubernetes control plane and infrastructure pods (for example, the Habana Operator)
- Multiple Gaudi 3 worker nodes are dedicated to running inference workloads
Benefits:
- Maximizes compute utilization
- Supports batch inference and concurrent model execution
- Reduces resource contention by isolating infrastructure from model workloads
Multi-Master, Multiple Workers — Enterprise-ready HA cluster:
Best for: Enterprises or production-grade deployments requiring high availability.
Purpose: Ensures fault tolerance and scalability for mission-critical inference workloads.
Setup:
- Multiple master nodes manage the Kubernetes control plane with automatic failover
- Gaudi 3 worker nodes scale horizontally to handle complex models and high user concurrency
Benefits:
- High availability and resilience
- Optimized for load balancing and SLA-driven deployments
- Supports sustained throughput and enterprise-grade reliability
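Regardless of the mode chosen, it can be useful to confirm how a cluster is laid out before deploying models. The sketch below is a minimal example assuming kubectl access via the kubernetes Python client; the habana.ai/gaudi resource name is the one typically registered by the Habana device plugin and may differ in your environment.

```python
# Illustrative cluster inspection: summarize node roles and Gaudi capacity.
# Assumptions: a reachable kubeconfig and the habana.ai/gaudi extended resource.
from kubernetes import client, config

config.load_kube_config()  # use config.load_incluster_config() when running in-cluster
v1 = client.CoreV1Api()

for node in v1.list_node().items:
    labels = node.metadata.labels or {}
    role = "control-plane" if "node-role.kubernetes.io/control-plane" in labels else "worker"
    gaudi = (node.status.allocatable or {}).get("habana.ai/gaudi", "0")
    print(f"{node.metadata.name}: role={role}, allocatable Gaudi cards={gaudi}")
```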
Key Features
Integrated GenAI Gateway
- Integrated GenAI Gateway with LiteLLM and Langfuse for advanced AI model management and observability (see the sketch below).
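Since the platform exposes OpenAI-standard APIs (see Overview), existing OpenAI clients can typically talk to the gateway directly. The sketch below is illustrative only; the base URL, API key, and model name are placeholders for values from your own deployment.

```python
# Illustrative request through an OpenAI-compatible gateway endpoint.
# All connection details below are placeholders, not shipped defaults.
from openai import OpenAI

client = OpenAI(
    base_url="https://gateway.example.com/v1",  # placeholder gateway endpoint
    api_key="YOUR_GATEWAY_KEY",                 # placeholder key
)

response = client.chat.completions.create(
    model="your-deployed-model",                # placeholder model name
    messages=[{"role": "user", "content": "Hello from Enterprise Inference."}],
)
print(response.choices[0].message.content)
```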
Xeon Optimization
- Optimized performance for Intel Xeon CPUs.
- Read the detailed CPU optimization guide.
- Dynamic memory allocation for efficient resource usage.
- Automatic topology detection for improved deployment flexibility.
Integrated Ceph and Istio
- Seamless integration with Ceph storage and Istio service mesh for enhanced scalability and resilience.
Enhanced Observability
- Integration with Grafana Loki for advanced log management (see the sketch below).
- AWS S3 and MinIO support for log storage and retrieval.
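As an illustration of pulling logs out of Loki programmatically, the sketch below queries Loki's query_range HTTP API. The endpoint URL and label selector are assumptions; in a cluster they would usually be reached through a port-forward or the configured ingress.

```python
# Illustrative Loki log query over the last hour.
# Assumptions: Loki reachable at localhost:3100 and a {namespace="default"} selector.
import time

import requests

LOKI_URL = "http://localhost:3100"   # placeholder; e.g. via a kubectl port-forward
QUERY = '{namespace="default"}'      # placeholder label selector

end_ns = int(time.time() * 1e9)
start_ns = end_ns - 3600 * 10**9     # one hour ago, in nanoseconds

resp = requests.get(
    f"{LOKI_URL}/loki/api/v1/query_range",
    params={"query": QUERY, "start": start_ns, "end": end_ns, "limit": 50},
    timeout=30,
)
resp.raise_for_status()

for stream in resp.json()["data"]["result"]:
    for _ts, line in stream["values"]:
        print(line)
```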
Documentation & Workflow Updates
- Refactored and expanded documentation for a better developer experience.
- Added Vault secret management for secure workflows.
- Various workflow enhancements for stability and usability.
IBM Cloud Multi-Node Architecture
- Added support for multi-node deployment with the IBM Cloud deployable architecture (DA).
- Added GenAI Gateway integration with the IBM Cloud DA.
Getting Started
Refer to the Quick Start Guide and Cluster Setup documentation to get started.
Post-Deployment
- Access deployed models via OpenAI-compatible API endpoints (see the sketch below).
- Use built-in observability dashboards for monitoring and troubleshooting.
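As a quick post-deployment check, the sketch below lists the models served by the OpenAI-compatible endpoint. The base URL and API key are placeholders for your deployment.

```python
# Illustrative post-deployment check: list models on the OpenAI-compatible endpoint.
# The connection details are placeholders, not shipped defaults.
from openai import OpenAI

client = OpenAI(
    base_url="https://inference.example.com/v1",  # placeholder endpoint
    api_key="YOUR_API_KEY",                       # placeholder key
)

for model in client.models.list():
    print(model.id)
```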
Supported Models
- View the Supported Model List.
- Deploy custom LLMs directly from Hugging Face.
License
- Licensed under the Apache License 2.0.
Thank you for using Intel® AI for Enterprise Inference!