22.08

@dholt dholt released this 24 Aug 16:26
· 78 commits to master since this release
5fdde40

DeepOps 22.08 Release Notes

Known Issues

  • Kubeflow deployment is currently broken because the current Kubeflow release is incompatible with Kubernetes 1.22. The Kubeflow deployment will be updated to add support once Kubeflow 1.6 is released.

General

  • Rework of a large portion of the documentation
  • Updates to NCCL tests
  • Various bug fixes

Slurm

  • Update to Slurm 22.05.2
  • Add Alertmanager integration
  • Option to share Slurm configuration among nodes via NFS
  • Enhancements to Slurm re-install/re-build tasks

Kubernetes

  • Update to Kubernetes 1.24.4
  • Update to GPU Operator 1.11.1 (GPU driver branch 515)

Changes

Bugs/Enhancements

  • Update NVIDIA driver role (#1216)
  • Update Kubespray submodule URL (#1200)
  • Add Alertmanager to Slurm cluster deployment (#1198)
  • Fix Slurm configuration GRES syntax (#1196)
  • Update Pyxis image cache size (#1191)
  • Updates to documentation (#1188)
  • Fix Slurm reinstall/rebuild tasks (#1187)
  • Update MetalLB helm repo (#1185)
  • Update EPEL GPG key (#1184)
  • Add option to share Slurm configuration among nodes (#1182)
  • Update NCCL tests (#1180, #1209)
  • Fix PATH for NetApp Trident (#1176)
  • Update default Slurm version to 21.08.8 (#1169, #1171)
  • Update NVIDIA signing key (#1166, #1167)
  • Update Ansible (#1165)

Upgrade Steps

If you are upgrading to this version of DeepOps from a previous release, follow the upgrade section of the Slurm or Kubernetes Deployment Guide. In addition, re-run the ./scripts/setup.sh script and add any new variables from the config.example files to your existing config. For a full diff from release 22.04, run git diff 22.04 22.08 -- config.example/. If you encounter a problem, please open a GitHub issue. See the update guide for additional guidance.
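
The steps above could be scripted roughly as follows. This is a minimal sketch, not part of DeepOps: the function names config_example_diff and upgrade_to_release are hypothetical, and it assumes you are working in an existing git checkout of DeepOps and move to the new release with git fetch and git checkout. Only ./scripts/setup.sh and the git diff command come from the release notes themselves.

```shell
#!/usr/bin/env bash
# Sketch of the upgrade steps described above. Run from the root of an
# existing DeepOps checkout. The function names are hypothetical helpers.

# Print the changes made to config.example/ between two release tags.
config_example_diff() {
    local old_tag="$1"
    local new_tag="$2"
    git diff "$old_tag" "$new_tag" -- config.example/
}

# Assumed sequence: fetch tags, check out the new release, re-run setup,
# then show which example variables changed so they can be merged by hand.
upgrade_to_release() {
    local old_tag="$1"
    local new_tag="$2"
    git fetch --tags &&
        git checkout "$new_tag" &&
        ./scripts/setup.sh &&
        config_example_diff "$old_tag" "$new_tag"
}

# Example (not executed here): upgrade_to_release 22.04 22.08
```

The diff is only printed; merging new variables into your existing config is left as a manual step, since local overrides differ from site to site. The Slurm or Kubernetes upgrade procedure from the deployment guides still has to be followed separately.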

Notes