[Feature Request] Add hostNetwork mode for dcgmExporter #1086

Open
jslouisyou opened this issue Oct 30, 2024 · 1 comment

@jslouisyou

jslouisyou commented Oct 30, 2024

Hello, NVIDIA Team.

I'm facing an issue while configuring dcgm-exporter through gpu-operator. I have two Kubernetes clusters: one runs GPU jobs, and the other is used to manage the first. Prometheus is not installed on the GPU cluster, to keep CPU and memory usage there as low as possible, so I'm trying to collect metrics with Prometheus from the other cluster.

I would like to enable hostNetwork for dcgm-exporter so that metrics can be scraped from each node, but I can't find where to set it in the gpu-operator Helm chart. (As I recall, this is useful when Prometheus is deployed outside the Kubernetes cluster.)
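
To illustrate the goal: with hostNetwork enabled, a Prometheus instance outside the cluster could scrape each GPU node directly. This is a minimal sketch, assuming placeholder node IPs, a made-up job name, and that the exporter listens on port 9400 on each node:

```yaml
# prometheus.yml on the management cluster (sketch only).
# The job name and node IPs are placeholders, not values from a real setup.
scrape_configs:
  - job_name: dcgm-exporter-gpu-cluster   # hypothetical job name
    scrape_interval: 15s
    metrics_path: /metrics
    static_configs:
      - targets:
          - 10.0.0.11:9400   # GPU node 1 (placeholder IP)
          - 10.0.0.12:9400   # GPU node 2 (placeholder IP)
```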

I found that hostNetwork is configurable in the dcgm-exporter chart, for example:

    spec:
      {{- if .Values.runtimeClassName }}
      runtimeClassName: {{ .Values.runtimeClassName }}
      {{- end }}
      priorityClassName: {{ .Values.priorityClassName | default "system-node-critical" }}
      {{- if .Values.hostNetwork }}
      hostNetwork: {{ .Values.hostNetwork }}

https://github.com/NVIDIA/dcgm-exporter/blob/4cc1d199cd3b13b6edee96af5339708f9747f499/deployment/templates/daemonset.yaml#L53

But in gpu-operator, only the values below are configurable, and the Service can't be modified here:

    dcgmExporter:
      enabled: true
      repository: nvcr.io/nvidia/k8s
      image: dcgm-exporter
      version: 3.3.8-3.6.0-ubuntu22.04
      imagePullPolicy: IfNotPresent
      env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
      resources: {}
      serviceMonitor:
        enabled: false
        interval: 15s
        honorLabels: false
        additionalLabels: {}
        relabelings: []

https://github.com/NVIDIA/gpu-operator/blob/752e8aed73c8c6141b545f56a0ed23e2a2b637a7/deployments/gpu-operator/values.yaml#L309C1-L328C20

Besides, there is no configurable section for this in the DaemonSet:
https://github.com/NVIDIA/gpu-operator/blob/752e8aed73c8c6141b545f56a0ed23e2a2b637a7/assets/state-dcgm-exporter/0900_daemonset.yaml

So, could you please add a hostNetwork option to the dcgmExporter section?
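
Roughly, the option I have in mind might look like the following. This is only a sketch: the `hostNetwork` key under `dcgmExporter` is hypothetical and does not exist in the current gpu-operator chart.

```yaml
dcgmExporter:
  enabled: true
  # Hypothetical key (not in the current gpu-operator chart): would run the
  # exporter pods in the node's network namespace, so that the listen
  # address below is reachable on the node IP.
  hostNetwork: true
  env:
    - name: DCGM_EXPORTER_LISTEN
      value: ":9400"
```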

Thanks.

@tariq1890
Contributor

Instead of enabling hostNetwork, would making the dcgm-exporter service a NodePort unblock you? If so, we can look into making the dcgm service configurable.
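
For reference, a NodePort Service for the exporter could look roughly like this. It is a sketch, not what the operator deploys today: the Service name, namespace, pod label, and `nodePort` value are assumptions for illustration, and only port 9400 comes from the values shown above.

```yaml
apiVersion: v1
kind: Service
metadata:
  name: nvidia-dcgm-exporter-nodeport   # hypothetical name
  namespace: gpu-operator               # assumed namespace
spec:
  type: NodePort
  selector:
    app: nvidia-dcgm-exporter           # assumed pod label
  ports:
    - name: metrics
      port: 9400        # matches DCGM_EXPORTER_LISTEN ":9400"
      targetPort: 9400
      nodePort: 30940   # placeholder; must fall in the cluster's NodePort range (30000-32767 by default)
```

An external Prometheus would then scrape `<node-ip>:30940` rather than port 9400 on the host network.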
