Releases: oracle-quickstart/oci-hpc-oke
OKE RDMA Quickstart Resource Manager template v25.10.0
- Kubernetes upgrade: Added support for Kubernetes v1.34
- Documentation: New guide, "Deploying Prometheus & Grafana Stack with Dashboards and Alerts manually"
- Health checks:
  - Added RCCL tests
  - Added ROCm Validation Suite (RVS) gst_single for AMD validation
- Grafana access link: Default domain updated to endpoint.oci-hpc.ai, configurable for custom domains
- Component updates: Refreshed dependencies and minor fixes across the stack
 
Full Changelog: v25.9.0...v25.10.0
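The release tags on this page do not sort correctly as plain strings, because "v25.10.0" compares lower than "v25.9.0" character by character. The following is a minimal Python sketch of ordering them numerically. The `tag_key` helper and the rule that a suffix such as "-beta" sorts before the final release are assumptions made for illustration, not part of the project's tooling.

```python
import re

# Release tags copied from this page.
TAGS = [
    "v25.10.0", "v25.9.0", "v25.5.1", "v25.5.0", "v25.4.0", "v25.3.1",
    "v25.3.0", "v25.3.0-beta", "v25.2.0", "v24.10.0",
]

_TAG_RE = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:-(.+))?$")


def tag_key(tag):
    """Return a sortable key for a tag (hypothetical helper).

    Assumption: a pre-release suffix such as "-beta" sorts before the
    final release with the same numbers.
    """
    match = _TAG_RE.match(tag)
    if match is None:
        raise ValueError(f"unrecognized tag: {tag!r}")
    major, minor, patch, suffix = match.groups()
    # (1, "") for final releases, (0, suffix) for pre-releases.
    stage = (1, "") if suffix is None else (0, suffix)
    return (int(major), int(minor), int(patch), stage)


newest_first = sorted(TAGS, key=tag_key, reverse=True)
print(newest_first[0])
```

With this key, the sorted order matches the order in which the releases are listed here, whereas a plain string sort would place v25.9.0 ahead of v25.10.0.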
OKE RDMA Quickstart Resource Manager template v25.9.0
- Option to provision a shared Lustre file system and a PV backed by the Lustre file system
 - Fully private clusters using Resource Manager Private Endpoint for deployment
 - Same dashboards and notifications as the Slurm stack
 - Option to use Oracle Linux for non-RDMA pools
 - Component updates
 
OKE RDMA Quickstart Resource Manager template v25.5.1
This hotfix release addresses a breaking change in the Helm provider.
More info about the change here: hashicorp/terraform-provider-helm#1637
OKE RDMA Quickstart Resource Manager template v25.5.0
- Added AMD Device Metrics Exporter
 - Added AMD dashboards
 
OKE RDMA Quickstart Resource Manager template v25.4.0
- Added support for Kubernetes v1.32
 - Changed the default number of maximum pods per node to 110
 
OKE RDMA Quickstart Resource Manager template v25.3.1
- OKE AMD GPU device plugin is enabled for the BM.GPU.MI300X.8 shape
 - OKE DCGM Exporter is disabled (upstream DCGM Exporter is deployed)
 - Helm fix for Grafana load balancer not being deleted properly on Terraform destroy
 - Updated the health checks for Node Problem Detector
 - Updated Grafana dashboards
 - Added the required policies for Oracle Cloud Agent GPU/RDMA monitoring
 
OKE RDMA Quickstart Resource Manager template v25.3.0
- VCN-native pod networking is now the default option for pod networking instead of Flannel.
 - Node Problem Detector is now deployed as part of the stack and integrated with the Prometheus/Grafana stack for alerting.
 - Switched to using the upstream OKE Terraform module.
 
OKE RDMA Quickstart Resource Manager template v25.3.0-beta
- VCN-native pod networking is now the default option for pod networking instead of Flannel.
 - Node Problem Detector is now deployed as part of the stack.
 - Fixed a Node Exporter issue preventing metrics from being streamed from bare metal GPU nodes.
 
OKE RDMA Quickstart Resource Manager template v25.2.0
- The OKE GPU Device plugin is now enabled by default.
 - Added support for Kubernetes versions 1.30 and 1.31.
 
OKE RDMA Quickstart Resource Manager template v24.10.0
Important
This release is a breaking change because the stack moved to Terraform v1.5. Do not deploy this stack into your existing OKE clusters; use it only to deploy new clusters.
- Updated to Terraform v1.5. The same templates can now be used for both OCI Resource Manager and regular Terraform.
 - The bastion and operator nodes now use Ubuntu.
 - Added an option to deploy the Prometheus/Grafana stack with DCGM Exporter.
 - Added an option to create a RAID 0 array using the local NVMe drives on the nodes and configure Kubernetes to use it for container storage.
 - Added options to create storage classes for FSS (File Storage Service) and high performance block volumes.
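To illustrate the RAID 0 option above, here is a minimal Python sketch that builds an `mdadm` command line for striping local NVMe drives into a single array. These notes do not say how the stack itself creates the array, so everything here is an assumption for illustration: the `raid0_create_command` helper, the device names, the `/dev/md0` array path, and the two-device minimum. The sketch only constructs the command and does not run it.

```python
import shlex


def raid0_create_command(devices, array="/dev/md0"):
    """Build the mdadm argument list that stripes `devices` into one RAID 0 array.

    Hypothetical helper: it returns the command without executing it.
    """
    if len(devices) < 2:
        # Design choice for this sketch: striping is only useful with 2+ drives.
        raise ValueError("RAID 0 striping needs at least two devices")
    return [
        "mdadm", "--create", array,
        "--level=0",
        f"--raid-devices={len(devices)}",
        *devices,
    ]


# Hypothetical device names; real nodes may enumerate NVMe drives differently.
cmd = raid0_create_command(["/dev/nvme0n1", "/dev/nvme1n1"])
print(shlex.join(cmd))
```

RAID 0 provides no redundancy, which is generally acceptable for container storage because image layers and scratch data can be recreated.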