22.08
DeepOps 22.08 Release Notes
Known Issues
- Kubeflow deployment is currently broken due to an incompatibility between the current Kubeflow release and Kubernetes 1.22. The Kubeflow deployment will be updated to add support once Kubeflow 1.6 is released.
General
- Rework of a large portion of the documentation
- Updates to NCCL tests
- Various bug fixes
Slurm
- Update to Slurm 22.05.2
- Add Alertmanager integration
- Option to share Slurm configuration among nodes via NFS
- Enhancements to Slurm re-install/re-build tasks
Kubernetes
- Update to Kubernetes 1.24.4
- Update to GPU Operator 1.11.1 (GPU driver branch 515)
Changes
Bugs/Enhancements
- Update NVIDIA driver role (#1216)
- Update Kubespray submodule URL (#1200)
- Add Alertmanager to Slurm cluster deployment (#1198)
- Fix Slurm configuration GRES syntax (#1196)
- Update Pyxis image cache size (#1191)
- Updates to documentation (#1188)
- Fix Slurm reinstall/rebuild tasks (#1187)
- Update MetalLB helm repo (#1185)
- Update EPEL GPG key (#1184)
- Add option to share Slurm configuration among nodes (#1182)
- Update NCCL tests (#1180, #1209)
- Fix PATH for NetApp Trident (#1176)
- Update default Slurm version to 21.08.8 (#1169, #1171)
- Update NVIDIA signing key (#1166, #1167)
- Update Ansible (#1165)
Upgrade Steps
If you are upgrading to this version of DeepOps from a previous release, follow the upgrade section of the Slurm or Kubernetes Deployment Guides. In addition, re-run the ./scripts/setup.sh script and add any new variables from the config.example files to your existing config. To see a full diff from release 22.04, run git diff 22.04 22.08 -- config.example/ from the repository root. If you encounter problems, please open a GitHub issue. See the update guide for additional guidance.
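The config review step can be sketched as a small shell helper. This is only an illustration: it assumes the current directory is a git clone of DeepOps with both release tags available locally, and the function name changed_example_files is hypothetical, not part of DeepOps. It uses git diff --name-only to list which config.example files differ between two tags, which narrows down the files in your own config that need attention.

```shell
#!/bin/sh
# Sketch only: assumes the current directory is the root of a DeepOps git
# clone and that both release tags exist locally.

# Hypothetical helper (not part of DeepOps): print the paths of the
# config.example files that differ between two release tags.
changed_example_files() {
    git diff --name-only "$1" "$2" -- config.example/
}

# Possible usage during an upgrade. Left commented out because these commands
# change the working tree; checking out the tag is an assumption about how
# the new release is obtained.
#   git checkout 22.08
#   ./scripts/setup.sh
#   changed_example_files 22.04 22.08
#   git diff 22.04 22.08 -- config.example/
```

Listing the file names first keeps the review manageable. The full git diff can then be read one file at a time while editing the matching file in the existing config.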