[Feature Request] Add hostNetwork mode for dcgmExporter #1086

jslouisyou · 2024-10-30T11:27:23Z

Hello, NVIDIA Team.

I'm running into a problem while configuring dcgm-exporter through gpu-operator. I have two Kubernetes clusters: one runs GPU jobs, and the other is used to manage the first. To keep CPU and memory usage on the GPU cluster as low as possible, Prometheus is not installed there, so I'm trying to collect metrics with the Prometheus instance running in the management cluster.

I would like to run dcgm-exporter with hostNetwork so that metrics can be scraped from each node directly, but I can't find where to set this in the gpu-operator Helm chart. (As I recall, hostNetwork is useful when Prometheus is deployed outside the Kubernetes cluster.)

I found that hostNetwork is configurable in the dcgm-exporter chart itself, for example:

    spec:
      {{- if .Values.runtimeClassName }}
      runtimeClassName: {{ .Values.runtimeClassName }}
      {{- end }}
      priorityClassName: {{ .Values.priorityClassName | default "system-node-critical" }}
      {{- if .Values.hostNetwork }}
      hostNetwork: {{ .Values.hostNetwork }}

https://github.com/NVIDIA/dcgm-exporter/blob/4cc1d199cd3b13b6edee96af5339708f9747f499/deployment/templates/daemonset.yaml#L53

In gpu-operator, however, only the values below are configurable, and the Service cannot be modified here:

    dcgmExporter:
      enabled: true
      repository: nvcr.io/nvidia/k8s
      image: dcgm-exporter
      version: 3.3.8-3.6.0-ubuntu22.04
      imagePullPolicy: IfNotPresent
      env:
        - name: DCGM_EXPORTER_LISTEN
          value: ":9400"
        - name: DCGM_EXPORTER_KUBERNETES
          value: "true"
        - name: DCGM_EXPORTER_COLLECTORS
          value: "/etc/dcgm-exporter/dcp-metrics-included.csv"
      resources: {}
      serviceMonitor:
        enabled: false
        interval: 15s
        honorLabels: false
        additionalLabels: {}
        relabelings: []

https://github.com/NVIDIA/gpu-operator/blob/752e8aed73c8c6141b545f56a0ed23e2a2b637a7/deployments/gpu-operator/values.yaml#L309C1-L328C20

Also, the DaemonSet has no configurable section for this:
https://github.com/NVIDIA/gpu-operator/blob/752e8aed73c8c6141b545f56a0ed23e2a2b637a7/assets/state-dcgm-exporter/0900_daemonset.yaml

Could you please add a hostNetwork option to the dcgmExporter section?
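To make the request concrete, here is a rough sketch of what such an option might look like in the gpu-operator values. The `hostNetwork` key under `dcgmExporter` is hypothetical: it does not exist in the chart referenced above and is shown only to illustrate the shape of the request.

```yaml
dcgmExporter:
  enabled: true
  # Hypothetical key, not an existing gpu-operator value: when true, the
  # operator would set spec.template.spec.hostNetwork on the dcgm-exporter
  # DaemonSet so the exporter listens on each node's own IP.
  hostNetwork: true
  env:
    - name: DCGM_EXPORTER_LISTEN
      value: ":9400"   # with hostNetwork, this port would be bound on the node itself
```

With something like this in place, an external Prometheus could scrape each node at its node IP on port 9400, assuming that port is reachable from the other cluster.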

Thanks.

tariq1890 · 2024-11-20T17:37:13Z

Instead of enabling hostNetwork, would making the dcgm-exporter service a NodePort unblock you? If so, we can look into making the dcgm service configurable.
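As a point of reference, a NodePort Service for the exporter could be sketched roughly as follows. The Service name, namespace, pod label, and `nodePort` value are all assumptions for illustration and should be checked against the actual deployment; this is not a manifest that gpu-operator currently generates.

```yaml
# Minimal sketch of a NodePort Service in front of dcgm-exporter.
apiVersion: v1
kind: Service
metadata:
  name: dcgm-exporter-nodeport     # hypothetical name
  namespace: gpu-operator          # assumed install namespace
spec:
  type: NodePort
  # Local keeps traffic on the node that received it, so scraping a given
  # node returns metrics from the exporter pod running on that node.
  externalTrafficPolicy: Local
  selector:
    app: nvidia-dcgm-exporter      # assumed pod label; verify against the DaemonSet
  ports:
    - name: metrics
      port: 9400                   # matches DCGM_EXPORTER_LISTEN ":9400"
      targetPort: 9400
      nodePort: 30940              # assumed; any free port in the default 30000-32767 range
```

Without `externalTrafficPolicy: Local`, a NodePort request may be forwarded to an exporter pod on a different node, which would make per-node scraping from an external Prometheus unreliable.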
