Releases: NVIDIA/deepops
20.08.1
NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.
If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
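As a rough illustration of the override described above, the sketch below sets a `slurm_version` line in a block of YAML-style configuration text, replacing an existing line or appending one. The helper name `set_slurm_version` is hypothetical and not part of DeepOps. It assumes the setting is a plain top-level `key: value` line; which file holds it in your deployment is not specified here.

```python
import re

# Versions named in the note above as supported replacements.
SUPPORTED_SLURM_VERSIONS = ("20.02.7", "20.11.7")


def set_slurm_version(config_text: str, version: str) -> str:
    """Return config_text with a top-level slurm_version line set to version.

    Hypothetical helper: assumes a simple top-level `key: value` YAML layout.
    """
    if version not in SUPPORTED_SLURM_VERSIONS:
        raise ValueError(f"{version} is not one of {SUPPORTED_SLURM_VERSIONS}")
    new_line = f"slurm_version: {version}"
    pattern = re.compile(r"^slurm_version:.*$", flags=re.MULTILINE)
    if pattern.search(config_text):
        # Replace the existing setting in place.
        return pattern.sub(new_line, config_text, count=1)
    # Otherwise append the setting, keeping a single trailing newline.
    prefix = config_text.rstrip("\n")
    return (prefix + "\n" if prefix else "") + new_line + "\n"
```

For example, calling `set_slurm_version(text, "20.02.7")` on text containing `slurm_version: 20.02.4` rewrites only that line and leaves the rest untouched.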
DeepOps 20.08.1 Release Notes
NOTE: Use this release instead of 20.08.
Changes
- Fix Slurm deployment on CentOS
- Fix hardcoded paths/variables across K8S/Lmod/Slurm/Pyxis deployment issues
- Fix Slurm deployment with existing ssh-keys
- Fix K8S deployment with GPU plugin/Operator on multiple mgmt nodes
- Fix K8S dashboard script and add testing
- Fix Kubeflow istio_dex manifest and add testing
20.08
DeepOps 20.08 Release Notes
NOTE: Use the 20.08.1 release instead of this one; it includes various bug fixes.
What's New
- DGX A100 support
- NVIDIA HPC SDK
- Spack package manager
- HPL Burn-in test
- MPI Operator
Changes
- Slurm 20.02.4, Pyxis v0.8.0, Enroot v3.1.1
- Kubernetes v1.17.9 (Kubespray v2.13.3), Helm 3, GPU Operator v0.6.0
- Kubeflow v1.1.0 w/ MPI Operator (kfctl -> v1.1.0, istio_dex -> v1.0.2, istio -> v1.1.0)
- DGX OS 4.5
- DGX role updated to current versions/packages
- K8S DCGM Exporter 1.7.2 (port switch from 9101 to 9400)
- Bug fixes and enhancements
- Default nfs configurations have changed
Bugs/Enhancements
- General Kubeflow installation and polling improvements (along with Jenkins tests)
- Kubeflow deletion now actually deletes Kubeflow along with Istio, cert-manager, etc.
- Kubeflow installation now automatically installs the MPI Operator
- DCGM/Grafana dashboard updates
- General cleanup and version pinning in K8S monitoring deployment script
- Improved Jenkins testing (new tests: spack, kubeflow, centos tests; additional debugging/scale-tests/fixes)
- Peg Rook/Ceph versions
- Updated/improved/spell-checked documentation (slurm-perf, kubeflow, kubernetes, Lmod, Spack, EasyBuild)
- Slurm MPI now defaults to pmix if available
- golang galaxy role bumped to 2.4.0
- Improved Trident usability
- New default config variables (install_chrony, ...)
- General reorg of Slurm role and slurm-cluster.yml
- Dedicated lmod playbook
- Replaced a few helm repos with stable version
- gpu plugin now uses helm install
Upgrade Steps
If you are upgrading to this version of DeepOps from a previous release, you will need to follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, the `setup.sh` script must be re-run, and any new variables in the `config.example` files should be added to the existing `config`. For a full diff from release 20.06, run `git diff 20.08 20.06 -- config.example/`.
It is also necessary to upgrade helm on your provisioner node. This can be done manually using `./scripts/install_helm.sh` as a reference.
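To complement the git diff, one way to spot variables that still need to be carried over is to compare the top-level keys of an example file against those of your existing file. The following is a minimal sketch under stated assumptions: the helper names are hypothetical, the sample values are made up, and it only looks at uncommented top-level `key:` lines rather than parsing YAML fully.

```python
import re

# Matches an uncommented, unindented YAML key such as `install_chrony:`.
_TOP_LEVEL_KEY = re.compile(r"^([A-Za-z_][A-Za-z0-9_]*)\s*:", flags=re.MULTILINE)


def top_level_keys(yaml_text: str) -> set[str]:
    """Collect the names of top-level keys in a block of YAML-style text."""
    return set(_TOP_LEVEL_KEY.findall(yaml_text))


def missing_variables(example_text: str, existing_text: str) -> list[str]:
    """Return keys present in the example text but absent from the existing text."""
    return sorted(top_level_keys(example_text) - top_level_keys(existing_text))


if __name__ == "__main__":
    # Illustrative contents only; real files will differ.
    example = "install_chrony: true\nslurm_version: 20.02.4\n"
    existing = "slurm_version: 20.02.4\n"
    print(missing_variables(example, existing))
```

Commented-out and nested keys are ignored on purpose, so the result is a starting point for manual review rather than a complete diff.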
20.06.1
NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.
If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
DeepOps 20.06.1 Release Notes
NOTE: Use this release instead of 20.06.
Changes
20.06
20.02.1
NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.
If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
DeepOps 20.02.1 Release Notes
NOTE: Use this release instead of 20.02.
Changes
- Fixed broken ansible-galaxy roles
- Fix for GPU device plugin in RHEL
- Fix for CentOS missing python-openshift
- Fix for docker repo on RH distros
- Upgraded to use Kubespray docker install
20.02
DeepOps 20.02 Release Notes
NOTE: Use the 20.02.1 release instead of this one; it includes various bug fixes.
What's New
- NVIDIA EGX stack
- NVIDIA Kubernetes GPU Operator
- RoCE in Kubernetes
- Proxy support
Changes
- Upgraded Kubeflow to v0.7.1
- Various bug fixes and enhancements
Software versions
(Unchanged since 19.10)
Software | Version |
---|---|
Ansible | 2.7.11 |
Kubespray | v2.11.0 |
Kubernetes | v1.15.3 |
Helm | 2.14.3 |
Docker | 18.09.7 |
Rook | v1.1.1 |
Ceph | v14.2 |
Slurm | 19.05 |
19.10
NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.
If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
DeepOps 19.10 Release Notes
What's New
- Kubeflow 0.6.2
- Slurm + Pyxis + Enroot deployment for highly performant multi-node clusters
Changes
- Upgraded Kubernetes from v1.14 to v1.15 (see notes below)
- Various bug fixes and enhancements
Software versions
Software | Version |
---|---|
Ansible | 2.7.11 |
Kubespray | v2.11.0 |
Kubernetes | v1.15.3 |
Helm | 2.14.3 |
Docker | 18.09.7 |
Rook | v1.1.1 |
Ceph | v14.2 |
Slurm | 19.05 |
19.07
NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.
If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
DeepOps 19.07 Release Notes
What's New
- Beta support for air-gapped installations on RHEL/CentOS
- Ansible role for official RHEL/CentOS install on DGX-1/DGX-2
- Updated and customized Kubeflow deployment with NGC container support
Changes
- Upgraded Kubernetes from v1.12 to v1.14 (see notes below)
- Upgraded Slurm build with per-GPU scheduling by default
- Bug fixes and enhancements
Kubernetes upgrade notes
Upgrading Kubernetes can be complicated; for test or empty clusters, it may be easier to start from scratch with DeepOps 19.07. Upgrading from DeepOps 19.03 (Kubernetes v1.12) to DeepOps 19.07 (Kubernetes v1.14) requires first upgrading to Kubernetes v1.13, and then v1.14. See the Kubespray docs for information on upgrading Kubernetes.
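The one-minor-version-at-a-time rule described above can be sketched as a small function that lists the intermediate releases to step through. This is an illustration only: the name `minor_upgrade_path` is hypothetical, and it assumes version strings of the form `v1.12` or `v1.12.5` within the same major version.

```python
def minor_upgrade_path(current: str, target: str) -> list[str]:
    """List each minor release to upgrade through, in order, ending at target."""
    cur_major, cur_minor = (int(p) for p in current.lstrip("v").split(".")[:2])
    tgt_major, tgt_minor = (int(p) for p in target.lstrip("v").split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError(f"cannot plan an upgrade from {current} to {target}")
    # Step through every minor version after the current one, up to the target.
    return [f"v{cur_major}.{m}" for m in range(cur_minor + 1, tgt_minor + 1)]


if __name__ == "__main__":
    print(minor_upgrade_path("v1.12", "v1.14"))
```

For the 19.03 to 19.07 case, the path contains v1.13 followed by v1.14, matching the two-step upgrade described in the note.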
Software versions
Software | Version |
---|---|
Ansible | 2.7.11 |
Kubespray | v2.10.4 |
Kubernetes | v1.14.3 |
Docker | 18.09.6 |
Rook | v1.0.2 |
Ceph | v13 (v13.2.6-20190604) |
Slurm | 19.05 |
19.03
NOTE: As a result of CVE-2021-31215, SchedMD has un-published the version of Slurm used by default in this release. If deploying Slurm, it is recommended to upgrade to the latest release of DeepOps to get a supported Slurm version.
If it isn’t possible to update to the latest release of DeepOps immediately, update instead to a supported Slurm version by setting slurm_version: 20.02.7 or slurm_version: 20.11.7 in the DeepOps configuration. However, please note that this workflow has not been tested with all past releases.
DeepOps 19.03 Release Notes
What's New
- Support for RHEL/CentOS
- Standalone virtual deployment option for testing and development
- Scripts for simplified service deployment
- New Services
  - Kubernetes Dashboard
  - Ceph Dashboard
  - JupyterHub
  - Kubeflow
- Examples for HPC and DL jobs
  - Slurm MPI job
  - Kubernetes/Slurm Dask+RAPIDS
- Role to install cuDNN and NCCL libraries
- Load Balancer option in Kubernetes
Changes
- Simplified, more modular code base
- Documentation cleanup and organization for ease of use
Software versions
Software | Version |
---|---|
Kubernetes | 1.12.5 |
Slurm | 18.08.5-2 |