Add nvidia_gpu_exporter
Signed-off-by: Zhang, Chaoyue (Jack) <[email protected]>
Zhang, Chaoyue (Jack) committed Jun 7, 2024
1 parent 384d8c7 commit 718b7d5
Showing 21 changed files with 854 additions and 0 deletions.
84 changes: 84 additions & 0 deletions roles/nvidia_gpu_exporter/README.md
@@ -0,0 +1,84 @@
<p><img src="https://www.circonus.com/wp-content/uploads/2015/03/sol-icon-itOps.png" alt="graph logo" title="graph" align="right" height="60" /></p>

# Ansible Role: Nvidia GPU exporter

## Description

Deploy the Prometheus [Nvidia GPU exporter](https://github.com/utkuozdemir/nvidia_gpu_exporter) using Ansible.

## Requirements

- Ansible >= 2.9 (It might work on previous versions, but we cannot guarantee it)
- gnu-tar on Mac deployer host (`brew install gnu-tar`)
- Passlib is required when using the basic authentication feature (`pip install passlib[bcrypt]`)

## Role Variables

All variables that can be overridden are stored in the [defaults/main.yml](defaults/main.yml) file as well as in [meta/argument_specs.yml](meta/argument_specs.yml).
Please refer to the [collection docs](https://prometheus-community.github.io/ansible/branch/main/nvidia_gpu_exporter_role.html) for descriptions and default values of the variables.
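
As a rough illustration of overriding these defaults, the sketch below sets a few of the variables documented in `meta/argument_specs.yml` at play level. The values are examples only, not recommendations.

```yaml
- hosts: all
  roles:
    - prometheus.prometheus.nvidia_gpu_exporter
  vars:
    # Example values only; pick ones that suit your environment.
    nvidia_gpu_exporter_version: 1.2.0
    nvidia_gpu_exporter_web_listen_address: "127.0.0.1:9835"
    nvidia_gpu_exporter_web_telemetry_path: "/metrics"
```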

## Example

### Playbook

Use it in a playbook as follows:

```yaml
- hosts: all
  roles:
    - prometheus.prometheus.nvidia_gpu_exporter
```

### TLS config

Before running the nvidia_gpu_exporter role, the user needs to provision their own certificate and key.

```yaml
- hosts: all
  pre_tasks:
    - name: Create nvidia_gpu_exporter cert dir
      file:
        path: "/etc/nvidia_gpu_exporter"
        state: directory
        owner: root
        group: root

    - name: Create cert and key
      openssl_certificate:
        path: /etc/nvidia_gpu_exporter/tls.cert
        csr_path: /etc/nvidia_gpu_exporter/tls.csr
        privatekey_path: /etc/nvidia_gpu_exporter/tls.key
        provider: selfsigned
  roles:
    - prometheus.prometheus.nvidia_gpu_exporter
  vars:
    nvidia_gpu_exporter_tls_server_config:
      cert_file: /etc/nvidia_gpu_exporter/tls.cert
      key_file: /etc/nvidia_gpu_exporter/tls.key
    nvidia_gpu_exporter_basic_auth_users:
      randomuser: examplepassword
```
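
On the Prometheus side, a scrape job for an exporter configured this way would need matching HTTPS and basic-auth settings. The following is a minimal sketch and not part of this role: the job name and the target host `gpu-node.example.com` are hypothetical, port 9835 assumes the listen address documented in `meta/argument_specs.yml`, and `insecure_skip_verify` is used only because the example certificate is self-signed.

```yaml
scrape_configs:
  - job_name: nvidia_gpu            # hypothetical job name
    scheme: https
    metrics_path: /metrics
    basic_auth:
      username: randomuser          # same user as in nvidia_gpu_exporter_basic_auth_users
      password: examplepassword
    tls_config:
      insecure_skip_verify: true    # assumption: self-signed certificate
    static_configs:
      - targets:
          - "gpu-node.example.com:9835"  # hypothetical host
```

With a certificate issued by a trusted CA, `tls_config.ca_file` would likely be a better choice than skipping verification.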

### Demo site

We provide an example site that demonstrates a full monitoring solution based on Prometheus and Grafana. The repository with code and links to running instances is [available on GitHub](https://github.com/prometheus/demo-site) and the site is hosted on [DigitalOcean](https://digitalocean.com).

## Local Testing

The preferred way of locally testing the role is to use Docker and [molecule](https://github.com/ansible-community/molecule) (v3.x). You will have to install Docker on your system. See "Get started" for a Docker package suitable for your system. Running your tests is as simple as executing `molecule test`.

## Continuous Integration

Combining molecule and CircleCI allows us to test how new PRs will behave when used with multiple Ansible versions and multiple operating systems. This also allows us to create test scenarios for different role configurations. As a result, we have quite a large test matrix which can take more time than local testing, so please be patient.

## Contributing

See [contributor guideline](CONTRIBUTING.md).

## Troubleshooting

See [troubleshooting](TROUBLESHOOTING.md).

## License

This project is licensed under MIT License. See [LICENSE](/LICENSE) for more details.
19 changes: 19 additions & 0 deletions roles/nvidia_gpu_exporter/defaults/main.yml
@@ -0,0 +1,19 @@
---
nvidia_gpu_exporter_version: 1.2.0

nvidia_gpu_exporter_binary_url: "https://github.com/{{ _nvidia_gpu_exporter_repo }}/releases/download/v{{ nvidia_gpu_exporter_version }}/\
                                 nvidia_gpu_exporter_{{ nvidia_gpu_exporter_version }}.linux-{{ go_arch }}.tar.gz"
nvidia_gpu_exporter_checksums_url: "https://github.com/{{ _nvidia_gpu_exporter_repo }}/releases/download/v{{ nvidia_gpu_exporter_version }}/checksums.txt"

nvidia_gpu_exporter_skip_install: false

nvidia_gpu_exporter_web_disable_exporter_metrics: false
nvidia_gpu_exporter_web_listen_address: "0.0.0.0:9835"
nvidia_gpu_exporter_web_telemetry_path: "/metrics"

nvidia_gpu_exporter_binary_install_dir: "/usr/local/bin"
nvidia_gpu_exporter_system_group: "nvidia-gpu-exp"
nvidia_gpu_exporter_system_user: "{{ nvidia_gpu_exporter_system_group }}"

# Local path to stash the archive and its extraction
nvidia_gpu_exporter_archive_path: /tmp
10 changes: 10 additions & 0 deletions roles/nvidia_gpu_exporter/handlers/main.yml
@@ -0,0 +1,10 @@
---
- name: Restart nvidia_gpu_exporter
  listen: "restart nvidia_gpu_exporter"
  become: true
  ansible.builtin.systemd:
    daemon_reload: true
    name: nvidia_gpu_exporter
    state: restarted
  when:
    - not ansible_check_mode
65 changes: 65 additions & 0 deletions roles/nvidia_gpu_exporter/meta/argument_specs.yml
@@ -0,0 +1,65 @@
---
# yamllint disable rule:line-length
argument_specs:
  main:
    short_description: "Prometheus Nvidia GPU Exporter"
    description:
      - "Deploy prometheus L(Nvidia GPU exporter,https://github.com/utkuozdemir/nvidia_gpu_exporter) using ansible"
    author:
      - "Prometheus Community"
    options:
      nvidia_gpu_exporter_version:
        description: "Nvidia GPU exporter package version. Also accepts latest as parameter."
        default: "1.2.0"
      nvidia_gpu_exporter_skip_install:
        description: "Nvidia GPU exporter installation tasks get skipped when set to true."
        type: bool
        default: false
      nvidia_gpu_exporter_binary_local_dir:
        description:
          - "Enables the use of local packages instead of those distributed on github."
          - "The parameter may be set to a directory where the C(nvidia_gpu_exporter) binary is stored on the host where ansible is run."
          - "This overrides the I(nvidia_gpu_exporter_version) parameter"
      nvidia_gpu_exporter_binary_url:
        description: "URL of the Nvidia GPU exporter binaries .tar.gz file"
        default: "https://github.com/{{ _nvidia_gpu_exporter_repo }}/releases/download/v{{ nvidia_gpu_exporter_version }}/nvidia_gpu_exporter_{{ nvidia_gpu_exporter_version }}.linux-{{ go_arch }}.tar.gz"
      nvidia_gpu_exporter_checksums_url:
        description: "URL of the Nvidia GPU exporter checksums file"
        default: "https://github.com/{{ _nvidia_gpu_exporter_repo }}/releases/download/v{{ nvidia_gpu_exporter_version }}/checksums.txt"
      nvidia_gpu_exporter_web_listen_address:
        description: "Address on which Nvidia GPU exporter will listen"
        default: "0.0.0.0:9835"
      nvidia_gpu_exporter_web_telemetry_path:
        description: "Path under which to expose metrics"
        default: "/metrics"
      nvidia_gpu_exporter_tls_server_config:
        description:
          - "Configuration for TLS authentication."
          - "Keys and values are the same as in L(nvidia_gpu_exporter docs,https://prometheus.io/docs/prometheus/latest/configuration/https/)."
        type: "dict"
      nvidia_gpu_exporter_http_server_config:
        description:
          - "Config for HTTP/2 support."
          - "Keys and values are the same as in L(nvidia_gpu_exporter docs,https://prometheus.io/docs/prometheus/latest/configuration/https/)."
        type: "dict"
      nvidia_gpu_exporter_basic_auth_users:
        description: "Dictionary of users and passwords for basic authentication. Passwords are automatically hashed with bcrypt."
        type: "dict"
      nvidia_gpu_exporter_binary_install_dir:
        description:
          - "I(Advanced)"
          - "Directory to install nvidia_gpu_exporter binary"
        default: "/usr/local/bin"
      nvidia_gpu_exporter_system_group:
        description:
          - "I(Advanced)"
          - "System group for Nvidia GPU exporter"
        default: "nvidia-gpu-exp"
      nvidia_gpu_exporter_system_user:
        description:
          - "I(Advanced)"
          - "Nvidia GPU exporter user"
        default: "nvidia-gpu-exp"
      nvidia_gpu_exporter_archive_path:
        description: 'Local path to stash the archive and its extraction'
        default: "/tmp"
30 changes: 30 additions & 0 deletions roles/nvidia_gpu_exporter/meta/main.yml
@@ -0,0 +1,30 @@
---
galaxy_info:
  author: "Prometheus Community"
  description: "Nvidia GPU exporter"
  license: "Apache"
  min_ansible_version: "2.9"
  platforms:
    - name: "Ubuntu"
      versions:
        - "focal"
        - "jammy"
    - name: "Debian"
      versions:
        - "bullseye"
        - "buster"
    - name: "EL"
      versions:
        - "7"
        - "8"
        - "9"
    - name: "Fedora"
      versions:
        - "37"
        - "38"
  galaxy_tags:
    - "monitoring"
    - "prometheus"
    - "exporter"
    - "metrics"
    - "system"
18 changes: 18 additions & 0 deletions roles/nvidia_gpu_exporter/molecule/alternative/molecule.yml
@@ -0,0 +1,18 @@
---
provisioner:
  inventory:
    group_vars:
      all:
        nvidia_gpu_exporter_binary_local_dir: "/tmp/nvidia_gpu_exporter-linux-amd64"
        nvidia_gpu_exporter_web_listen_address:
          - '127.0.0.1:9835'

        nvidia_gpu_exporter_tls_server_config:
          cert_file: /etc/nvidia_gpu_exporter/tls.cert
          key_file: /etc/nvidia_gpu_exporter/tls.key
        nvidia_gpu_exporter_http_server_config:
          http2: true
        nvidia_gpu_exporter_basic_auth_users:
          randomuser: examplepassword
        go_arch: amd64
        nvidia_gpu_exporter_version: 1.2.0
78 changes: 78 additions & 0 deletions roles/nvidia_gpu_exporter/molecule/alternative/prepare.yml
@@ -0,0 +1,78 @@
---
- name: Run local preparation
  hosts: localhost
  gather_facts: false
  tasks:
    - name: Download nvidia_gpu_exporter binary to local folder
      become: false
      ansible.builtin.get_url:
        url: "https://github.com/utkuozdemir/nvidia_gpu_exporter/releases/download/v{{\
          \ nvidia_gpu_exporter_version }}/nvidia_gpu_exporter-{{ nvidia_gpu_exporter_version }}.linux-{{\
          \ go_arch }}.tar.gz"
        dest: "/tmp/nvidia_gpu_exporter-{{ nvidia_gpu_exporter_version }}.linux-{{ go_arch }}.tar.gz"
        mode: "0644"
      register: _download_binary
      until: _download_binary is succeeded
      retries: 5
      delay: 2
      check_mode: false

    - name: Unpack nvidia_gpu_exporter binary
      become: false
      ansible.builtin.unarchive:
        src: "/tmp/nvidia_gpu_exporter-{{ nvidia_gpu_exporter_version }}.linux-{{ go_arch }}.tar.gz"
        dest: "/tmp"
        creates: "/tmp/nvidia_gpu_exporter-{{ nvidia_gpu_exporter_version }}.linux-{{ go_arch\
          \ }}/nvidia_gpu_exporter"
      check_mode: false

    - name: Link to nvidia_gpu_exporter binaries directory
      become: false
      ansible.builtin.file:
        src: "/tmp/nvidia_gpu_exporter-{{ nvidia_gpu_exporter_version }}.linux-amd64"
        dest: "/tmp/nvidia_gpu_exporter-linux-amd64"
        state: link
      check_mode: false

    - name: Install pyOpenSSL for certificate generation
      ansible.builtin.pip:
        name: "pyOpenSSL"

    - name: Create private key
      community.crypto.openssl_privatekey:
        path: "/tmp/tls.key"

    - name: Create CSR
      community.crypto.openssl_csr:
        path: "/tmp/tls.csr"
        privatekey_path: "/tmp/tls.key"

    - name: Create certificate
      community.crypto.x509_certificate:
        path: "/tmp/tls.cert"
        csr_path: "/tmp/tls.csr"
        privatekey_path: "/tmp/tls.key"
        provider: selfsigned

- name: Run target preparation
  hosts: all
  any_errors_fatal: true
  tasks:
    - name: Create nvidia_gpu_exporter cert dir
      ansible.builtin.file:
        path: "{{ nvidia_gpu_exporter_tls_server_config.cert_file | dirname }}"
        state: directory
        owner: root
        group: root
        mode: u+rwX,g+rwX,o=rX

    - name: Copy cert and key
      ansible.builtin.copy:
        src: "{{ item.src }}"
        dest: "{{ item.dest }}"
        mode: "{{ item.mode | default('0644') }}"
      loop:
        - src: "/tmp/tls.cert"
          dest: "{{ nvidia_gpu_exporter_tls_server_config.cert_file }}"
        - src: "/tmp/tls.key"
          dest: "{{ nvidia_gpu_exporter_tls_server_config.key_file }}"
@@ -0,0 +1,44 @@
from __future__ import (absolute_import, division, print_function)
__metaclass__ = type

import os
import testinfra.utils.ansible_runner
import pytest

testinfra_hosts = testinfra.utils.ansible_runner.AnsibleRunner(
    os.environ['MOLECULE_INVENTORY_FILE']).get_hosts('all')


def test_directories(host):
    # Directories that this scenario expects NOT to exist on the target
    dirs = [
        "/var/lib/nvidia_gpu_exporter"
    ]
    for dir in dirs:
        d = host.file(dir)
        assert not d.exists


def test_service(host):
    s = host.service("nvidia_gpu_exporter")
    try:
        assert s.is_running
    except AssertionError:
        # Capture service logs
        journal_output = host.run('journalctl -u nvidia_gpu_exporter --since "1 hour ago"')
        print("\n==== journalctl -u nvidia_gpu_exporter Output ====\n")
        print(journal_output.stdout)
        print("\n============================================\n")
        raise  # Re-raise the original assertion error


def test_protecthome_property(host):
    s = host.service("nvidia_gpu_exporter")
    p = s.systemd_properties
    assert p.get("ProtectHome") == "yes"


# The address must match nvidia_gpu_exporter_web_listen_address in molecule.yml
@pytest.mark.parametrize("sockets", [
    "tcp://127.0.0.1:9835",
])
def test_socket(host, sockets):
    assert host.socket(sockets).is_listening
6 changes: 6 additions & 0 deletions roles/nvidia_gpu_exporter/molecule/default/molecule.yml
@@ -0,0 +1,6 @@
---
provisioner:
  inventory:
    group_vars:
      all:
        nvidia_gpu_exporter_web_listen_address: "127.0.0.1:9835"