Merge pull request #1093 from dholt/release-22.01
Release 22.01
dholt authored Jan 19, 2022
2 parents 21c039d + aaedef8 commit 009bdeb
Showing 41 changed files with 457 additions and 57 deletions.
33 changes: 33 additions & 0 deletions .github/workflows/molecule.yml
@@ -0,0 +1,33 @@
---
name: test ansible roles with molecule
on:
  - push
  - pull_request
jobs:
  build:
    runs-on: ubuntu-20.04
    strategy:
      max-parallel: 4
      matrix:
        deepops-role:
          - singularity_wrapper
    steps:
      - name: check out repo
        uses: actions/checkout@v2
        with:
          path: "${{ github.repository }}"
      - name: set up python
        uses: actions/setup-python@v2
        with:
          python-version: "3.9"
      - name: install dependencies
        run: |
          python3 -m pip install --upgrade pip
          python3 -m pip install molecule[docker] docker ansible
      - name: run molecule test
        run: |
          cd "${{ github.repository }}/roles"
          ansible-galaxy role install --force -r ./requirements.yml
          ansible-galaxy collection install --force -r ./requirements.yml
          cd "${{ matrix.deepops-role }}"
          molecule test
2 changes: 1 addition & 1 deletion README.md
@@ -16,7 +16,7 @@ Check out the [video tutorial](https://drive.google.com/file/d/1RNLQYlgJqE8JMv0n

## Releases

Latest release: [DeepOps 21.09 Release](https://github.com/NVIDIA/deepops/releases/tag/21.09)
Latest release: [DeepOps 22.01 Release](https://github.com/NVIDIA/deepops/releases/tag/22.01)

It is recommended to use the latest release branch for stable code (linked above). All development takes place on the master branch, which is generally [functional](docs/deepops/testing.md) but may change significantly between releases.

6 changes: 6 additions & 0 deletions config.example/env.sh
@@ -0,0 +1,6 @@
# This file acts as a location to override the default configurations of deepops/scripts/*
# Many of the scripts in this directory define global variables and set reasonable defaults
# Global variables (in all caps) that are defined here will be automatically sourced and used in all scripts
# See deepops/scripts/common.sh for implementation details

DEEPOPS_EXAMPLE_VAR=""
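
The comments above say that upper-case globals defined in `config/env.sh` are sourced by the scripts, with the details living in `deepops/scripts/common.sh`, which is not part of this diff. The following is a minimal sketch of how such an override pattern commonly looks in shell. The helper name `load_deepops_env`, the `DEEPOPS_CONFIG_DIR` variable, and the value `default-value` are assumptions made for illustration and may not match the actual DeepOps implementation.

```shell
#!/usr/bin/env bash
# Sketch only: the real logic lives in deepops/scripts/common.sh (not shown in this diff).

# Hypothetical helper: source an override file if it exists, otherwise do nothing.
load_deepops_env() {
    env_file="$1"
    if [ -f "${env_file}" ]; then
        # shellcheck disable=SC1090
        . "${env_file}"
    fi
}

# DEEPOPS_CONFIG_DIR is an assumed variable name; fall back to ./config.
load_deepops_env "${DEEPOPS_CONFIG_DIR:-./config}/env.sh"

# A script keeps its own default only when env.sh did not set a non-empty value.
DEEPOPS_EXAMPLE_VAR="${DEEPOPS_EXAMPLE_VAR:-default-value}"
echo "DEEPOPS_EXAMPLE_VAR=${DEEPOPS_EXAMPLE_VAR}"
```

Because `${VAR:-default}` treats an empty string the same as an unset variable, the empty `DEEPOPS_EXAMPLE_VAR=""` in the example file would leave the script default in place under this sketch.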
2 changes: 1 addition & 1 deletion config.example/group_vars/all.yml
@@ -122,7 +122,7 @@ sftp_chroot: false
################################################################################
# NVIDIA GPU configuration
# Playbook: nvidia-cuda
cuda_version: cuda-toolkit-11-4
cuda_version: cuda-toolkit-11-5

# DGX-specific vars may be used to target specific models,
# because available versions for DGX may differ from the generic repo
3 changes: 0 additions & 3 deletions config.example/group_vars/k8s-cluster.yml
@@ -36,9 +36,6 @@ dashboard_image_repo: "kubernetesui/dashboard"
dashboard_metrics_scrape_tagr: "v1.0.4"
dashboard_metrics_scraper_repo: "kubernetesui/metrics-scraper"

# Override the Helm version installed by Kubespray
helm_version: "v3.5.4"

# Ensure hosts file generation only runs across k8s cluster
hosts_add_ansible_managed_hosts_groups: ["k8s-cluster"]

8 changes: 4 additions & 4 deletions config.example/group_vars/slurm-cluster.yml
@@ -3,7 +3,7 @@
################################################################################
# Slurm job scheduler configuration
# Playbook: slurm, slurm-cluster, slurm-perf, slurm-perf-cluster, slurm-validation
slurm_version: 21.08.1
slurm_version: 21.08.5
slurm_install_prefix: /usr/local
pmix_install_prefix: /opt/deepops/pmix
hwloc_install_prefix: /opt/deepops/hwloc
@@ -117,9 +117,9 @@ sm_install_host: "slurm-master[0]"
slurm_install_hpcsdk: true

# Select the version of HPC SDK to download
hpcsdk_major_version: "21"
hpcsdk_minor_version: "9"
hpcsdk_file_cuda: "11.4"
hpcsdk_major_version: "22"
hpcsdk_minor_version: "1"
hpcsdk_file_cuda: "11.5"
hpcsdk_arch: "x86_64"

# In a Slurm cluster, default to setting up HPC SDK as modules rather than in
1 change: 1 addition & 0 deletions docs/deepops/configuration.md
@@ -12,6 +12,7 @@ In particular, this directory includes:
- `config/group_vars/all.yml`: An Ansible [variables file](https://docs.ansible.com/ansible/latest/user_guide/playbooks_variables.html) that contains variables we expect to work for all hosts
- `config/group_vars/k8s-cluster.yml`: Variables specific to deploying Kubernetes clusters
- `config/group_vars/slurm-cluster.yml`: Variables specific to deploying Slurm clusters
- `config/env.sh`: Global variables that override default variable values for all `sh` files in `scripts/*`.
- `config/requirements.yml`: An Ansible Galaxy [requirements file](https://docs.ansible.com/ansible/latest/galaxy/user_guide.html#installing-roles-and-collections-from-the-same-requirements-yml-file) that contains a list of custom Collections and Roles to install. Collections and Roles required by DeepOps are stored in a separate `roles/requirements.yml` file, which should not be modified.

It's expected that most DeepOps deployments will make changes to these files!
76 changes: 74 additions & 2 deletions docs/deepops/testing.md
@@ -1,12 +1,13 @@
# DeepOps Testing, CI/CD, and Validation

## DeepOps Continuous Integration Testing

## DeepOps end-to-end testing

The DeepOps project leverages a private Jenkins server to run continuous integration tests. Testing is done using the [virtual](../../virtual) deployment mechanism. Several Vagrant VMs are created, the cluster is deployed, tests are executed, and then the VMs are destroyed.

The goal of the DeepOps CI is to prevent bugs from being introduced into the code base and to identify when changes in 3rd party platforms have occurred or impacted the DeepOps deployment mechanisms. In general, K8s and Slurm deployment issues are detected and resolved with urgency. Many components of DeepOps are 3rd party open source tools that may silently fail or suddenly change without notice. The team will make a best-effort to resolve these issues and include regression tests, however there may be times where a fix is unavailable. Historically, this has been an issue with Rook-Ceph and Kubeflow, and those GitHub communities are best equipped to help with resolutions.

### Testing Methodi
### Testing Method

DeepOps CI contains two types of automated tests:

@@ -63,6 +64,77 @@ A short description of the nightly testing is outlined below. The full suit of t
| MIG configuration | | | | No testing support


## DeepOps Ansible role testing

A subset of the Ansible roles in DeepOps have tests defined using [Ansible Molecule](https://molecule.readthedocs.io/en/latest/).
This testing mechanism allows the roles to be tested individually, providing additional test signal to identify issues which do not appear in the end-to-end tests.
These tests run automatically on each push and pull request using [GitHub Actions](https://github.com/NVIDIA/deepops/actions).

Molecule testing runs the Ansible role in question inside a Docker container.
As such, not all roles will be easy to test with this mechanism.
Roles which mostly involve installing software, configuring services, or executing scripts should generally be possible to test.
Roles which rely on the presence of specific hardware (such as GPUs), which reboot the nodes they act on, or which make changes to kernel configuration are going to be harder to test with Molecule.

### Defining Molecule tests for a new role

To add Molecule tests to a new role, the following procedure can be used.

1. Ensure you have Docker installed in your development environment

2. Install Ansible Molecule in your development environment

```
$ python3 -m pip install "molecule[docker,lint]"
```

3. Initialize Molecule in your new role

```
$ cd deepops/roles/<your-role>
$ molecule init scenario -r <your-role> --driver docker
```

4. In the file `molecule/default/molecule.yml`, define the list of platforms to be tested.
DeepOps currently supports operating systems based on Ubuntu 18.04, Ubuntu 20.04, EL7, and EL8.
To test these stacks, the following `platforms` stanza can be used.

```
platforms:
  - name: ubuntu-1804
    image: geerlingguy/docker-ubuntu1804-ansible
    pre_build_image: true
  - name: ubuntu-2004
    image: geerlingguy/docker-ubuntu2004-ansible
    pre_build_image: true
  - name: centos-7
    image: geerlingguy/docker-centos7-ansible
    pre_build_image: true
  - name: centos-8
    image: geerlingguy/docker-centos8-ansible
    pre_build_image: true
```

5. If you haven't already, define your role's metadata in the file `meta/main.yml`.
A sample `meta/main.yml` is shown here:

```
galaxy_info:
  role_name: <your-role>
  namespace: deepops
  author: DeepOps Team
  company: NVIDIA
  description: <your-description>
  license: 3-Clause BSD
  min_ansible_version: 2.9
```

6. Once this is done, verify that your role executes successfully in the Molecule environment by running `molecule test`. If you run into any issues, consult the [Molecule documentation](https://molecule.readthedocs.io/en/latest/index.html) for help resolving them.

7. (optional) In addition to testing successful execution, you can define further tests in the file `molecule/default/verify.yml`; these run after your role completes. This file is an Ansible playbook that runs in the same environment your role ran in. For a simple example of such a verify playbook, see the [Enroot role](https://github.com/NVIDIA/ansible-role-enroot/blob/master/molecule/default/verify.yml).

8. Once you're confident that your new tests are all passing, add your role to the `deepops-role` section in the `.github/workflows/molecule.yml` file.
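
Before adding a role to the workflow matrix, it can help to reproduce the CI steps locally. The sketch below wraps the same commands used in the "run molecule test" step of `.github/workflows/molecule.yml` in a shell function. The function name `run_role_molecule_test` and its two arguments are hypothetical, and the sketch assumes Docker, Ansible, and Molecule are already installed.

```shell
#!/usr/bin/env bash
# Hypothetical helper mirroring the "run molecule test" step of the workflow.
# Usage: run_role_molecule_test <path-to-deepops-checkout> <role-name>
run_role_molecule_test() {
    repo_root="$1"
    role="$2"
    (
        # Run in a subshell so the caller's working directory is unchanged.
        # Each step must succeed before the next one starts.
        cd "${repo_root}/roles" &&
        ansible-galaxy role install --force -r ./requirements.yml &&
        ansible-galaxy collection install --force -r ./requirements.yml &&
        cd "${role}" &&
        molecule test
    )
}

# Example invocation, left commented out so sourcing this file has no side effects:
# run_role_molecule_test "$HOME/deepops" singularity_wrapper
```

The function returns a non-zero status if any step fails, for example when the role directory does not exist, so it can be chained with other checks in a local pre-push script.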


## DeepOps Deployment Validation

The Slurm and Kubernetes deployment guides both document cluster verification steps. These should be run during the installation process to validate a GPU workload can be executed on the cluster.
7 changes: 1 addition & 6 deletions playbooks/container/singularity.yml
@@ -1,10 +1,5 @@
---
- hosts: all
become: yes
pre_tasks:
- name: create a folder for go
file:
path: "{{ golang_install_dir }}"
recurse: yes
roles:
- lecorguille.singularity
- singularity_wrapper
1 change: 1 addition & 0 deletions playbooks/slurm-cluster/files/cve_2021_44228.options
@@ -0,0 +1 @@
-Dlog4j2.formatMsgNoLookups=true
72 changes: 64 additions & 8 deletions playbooks/slurm-cluster/logging.yml
@@ -3,21 +3,77 @@
become: true
vars:
elasticsearch_network_host: 0.0.0.0
logstash_listen_port_beats: 5000
pre_tasks:
- name: debian - ensure apt cache updated
apt:
update_cache: true
when: ansible_os_family == "Debian"
roles:
- geerlingguy.java
- geerlingguy.elasticsearch
- geerlingguy.logstash
- geerlingguy.kibana
- robertdebock.java
- robertdebock.elastic_repo
- robertdebock.elasticsearch
- robertdebock.logstash
- robertdebock.kibana

- hosts: slurm-master[0]
become: true
vars:
filebeat_port: "5000"
tasks:
- name: configure logstash to accept logs from filebeat
template:
src: "filebeat.conf"
dest: "/etc/logstash/conf.d/filebeat.conf"
owner: "root"
group: "root"
mode: "0644"

# Mitigation for CVE-2021-44228 impacting Log4j2
# https://discuss.elastic.co/t/apache-log4j2-remote-code-execution-rce-vulnerability-cve-2021-44228-esa-2021-31/291476
- hosts: slurm-master[0]
become: yes
tasks:
- name: fix bug in logstash role
command: /usr/share/logstash/bin/logstash-plugin install logstash-filter-multiline
- name: configure elasticsearch to mitigate CVE-2021-44228
copy:
src: "cve_2021_44228.options"
dest: "/etc/elasticsearch/jvm.options.d/cve_2021_44228.options"
owner: "root"
group: "root"
mode: "0644"
notify:
- restart-elasticsearch
- name: check for relevant class in logstash
shell: unzip -l /usr/share/logstash/logstash-core/lib/jars/log4j-core-2.* | grep JndiLookup.class
register: logstash_jndi
changed_when: logstash_jndi.rc == 0
failed_when: logstash_jndi.rc == 2
- name: configure logstash to mitigate CVE-2021-44228
shell: zip -q -d /usr/share/logstash/logstash-core/lib/jars/log4j-core-2.* org/apache/logging/log4j/core/lookup/JndiLookup.class
notify:
- restart-logstash
when: logstash_jndi.changed
- name: manually stop logstash as restart is not consistently working later
service:
name: logstash
state: stopped
notify:
- restart-logstash
when: logstash_jndi.changed
handlers:
- name: restart-elasticsearch
service:
name: elasticsearch
state: restarted
- name: restart-logstash
service:
name: logstash
state: restarted

- hosts: slurm-cluster
become: true
vars:
filebeat_create_config: true
filebeat_prospectors:
filebeat_inputs:
- input_type: log
paths:
- "/var/log/*.log"
12 changes: 12 additions & 0 deletions playbooks/slurm-cluster/templates/filebeat.conf
@@ -0,0 +1,12 @@
input {
  beats {
    port => {{ filebeat_port }}
  }
}

output {
  elasticsearch {
    hosts => ["http://localhost:9200"]
    index => "%{[@metadata][beat]}-%{[@metadata][version]}"
  }
}
4 changes: 2 additions & 2 deletions roles/dns-config/tasks/main.yml
@@ -16,12 +16,12 @@
- systemd-resolved
when: ansible_distribution == 'Ubuntu' and ansible_distribution_major_version == '16'

- name: disable services (bionic)
- name: disable services (bionic, focal)
service:
name: systemd-resolved
state: stopped
enabled: no
when: ansible_distribution == 'Ubuntu' and ansible_distribution_major_version == '18'
when: ansible_distribution == 'Ubuntu' and (ansible_distribution_major_version in ['18', '20'])

- name: install /etc/resolv.conf
template:
2 changes: 1 addition & 1 deletion roles/nvidia-cuda/defaults/main.yml
@@ -1,6 +1,6 @@
---
# 'cuda' is the generic package and will pull the latest version
cuda_version: "cuda-toolkit-11-3"
cuda_version: "cuda-toolkit-11-5"

# DGX-specific vars may be used to target specific models,
# because available versions for DGX may differ from the generic repo
8 changes: 4 additions & 4 deletions roles/nvidia-hpc-sdk/defaults/main.yml
@@ -15,15 +15,15 @@
# See https://developer.nvidia.com/nvidia-hpc-sdk-downloads for more detail on available downloads.

# Version strings used to construct download URL
hpcsdk_major_version: "21"
hpcsdk_minor_version: "9"
hpcsdk_file_cuda: "11.4"
hpcsdk_major_version: "22"
hpcsdk_minor_version: "1"
hpcsdk_file_cuda: "11.5"
hpcsdk_arch: "x86_64"

# We need to specify the default CUDA toolkit to use during installation.
# This should usually be the latest CUDA included in the HPC SDK you are
# installing.
hpcsdk_default_cuda: "11.4"
hpcsdk_default_cuda: "11.5"

# Add HPC SDK modules to the MODULEPATH?
hpcsdk_install_as_modules: false
27 changes: 15 additions & 12 deletions roles/requirements.yml
@@ -36,19 +36,22 @@ roles:
version: "v0.5.0"

- src: geerlingguy.filebeat
version: "2.0.1"
version: "3.3.0"

- src: geerlingguy.logstash
version: "4.0.0"
- src: robertdebock.java
version: "4.1.1"

- src: geerlingguy.elasticsearch
version: "3.0.1"
- src: robertdebock.elastic_repo
version: "1.0.3"

- src: geerlingguy.java
version: "1.9.5"
- src: robertdebock.logstash
version: "1.1.1"

- src: geerlingguy.kibana
version: "3.2.1"
- src: robertdebock.elasticsearch
version: "1.1.3"

- src: robertdebock.kibana
version: "1.2.4"

- src: https://github.com/DeepOps/ansible-maas.git
name: ansible-maas
@@ -61,8 +64,8 @@ roles:
- src: https://github.com/OSC/ood-ansible.git
version: 'v2.0.3'

- src: abims_sbr.singularity
version: 3.7.1-1

- src: gantsign.golang
version: 2.4.0

- src: lecorguille.singularity
version: 1.2.0