Failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API #644

Open
saad946 opened this issue Feb 27, 2024 · 5 comments
Labels: kind/bug, triage/accepted

saad946 commented Feb 27, 2024

What happened?:
I am constantly hitting this with one of my services whose HPA is configured to scale on a custom metric. Sometimes the HPA reports AbleToScale True and can fetch the custom metric, but most of the time it cannot. Because of that, the HPA is not able to scale the pods down.

This is the HPA description for the affected service:

  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    SucceededGetScale    the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetPodsMetric  the HPA was unable to compute the replica count: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API
  ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range

The other service uses the same HPA configuration but does not show this error when its HPA is described. Below is the HPA description from that service.

Working service HPA:
The behaviour is random in both services: sometimes the custom metric is collected, sometimes not.

  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric DCGM_FI_DEV_FB_USED_AVG

What did you expect to happen?:
I expect prometheus-adapter and the HPA to behave the same for both services, since both use the same configuration.

Please provide the prometheus-adapter config:

prometheus-adapter config
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local
  port: 9090
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"

rules:
  default: false
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="", exported_container!="", exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_GPU_UTIL_AVG"
      resources:
        overrides:
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
          exported_container: {resource: "pod"}
      metricsQuery: 'avg by (exported_namespace, exported_pod) (round(avg_over_time(DCGM_FI_DEV_GPU_UTIL{exported_pod!="",exported_container!=""}[1m])))'
    - seriesQuery: 'DCGM_FI_DEV_FB_USED{exported_namespace!="", exported_container!="", exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_FB_USED_AVG"
      resources:
        overrides:
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
          exported_container: {resource: "pod"}
      metricsQuery: 'avg by (exported_namespace, exported_pod) (round(avg_over_time(DCGM_FI_DEV_FB_USED{exported_pod!="",exported_container!=""}[1m])))'
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_container!="",exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_GPU_UTIL_MIN"
      resources:
        overrides:
          exported_container: {resource: "service"}
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
      metricsQuery: min by (exported_namespace, exported_container) (round(min_over_time(<<.Series>>[1m])))
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_container!="",exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_GPU_UTIL_MAX"
      resources:
        overrides:
          exported_container: {resource: "service"}
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
      metricsQuery: max by (exported_namespace, exported_container) (round(max_over_time(<<.Series>>[1m])))
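
For reference, prometheus-adapter rules can also template the query with <<.Series>>, <<.LabelMatchers>> and <<.GroupBy>> so the adapter injects label matchers for the pods the HPA actually asks about; a sketch of that documented form for the FB_USED rule, shown only for comparison with the hardcoded selectors above (not the configuration we are running):

      metricsQuery: 'avg by (<<.GroupBy>>) (round(avg_over_time(<<.Series>>{<<.LabelMatchers>>}[1m])))'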

When checking whether the metric exists, I get this response:


 kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_FB_USED_AVG
      "name": "pods/DCGM_FI_DEV_FB_USED_AVG",
      "name": "namespaces/DCGM_FI_DEV_FB_USED_AVG",

Please provide the HPA resource used for autoscaling:

HPA yaml

The HPA YAML for both services is below:

Not Working one:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: serviceA-memory-utilization-hpa
  namespace: development
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name:  serviceA
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_FB_USED_AVG
      target:
        type: AverageValue
        averageValue: 20000

Working One:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: serviceB-memory-utilization-hpa
  namespace: common-service-development 
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name:  serviceB
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_FB_USED_AVG
      target:
        type: AverageValue
        averageValue: 20000

Please provide the HPA status:

We observe these events in both services from time to time. ServiceB is sometimes able to collect the metric, while serviceA fails most of the time.

Events:
  Type     Reason                        Age                 From                       Message
  ----     ------                        ----                ----                       -------
  Warning  FailedComputeMetricsReplicas  26m (x12 over 30m)  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API
  Warning  FailedGetPodsMetric           22s (x74 over 30m)  horizontal-pod-autoscaler  unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API

And this is the HPA status. It seems the memory utilization value is being retrieved, but when we describe the HPA we see the issues stated earlier: the HPA is unable to collect the metric and does not trigger any scaling activity.

NAME                                 REFERENCE                TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
serviceA-memory-utilization-hpa      Deployment/serviceA      19675/20k   1         2         1          14m
serviceB-memory-utilization-hpa      Deployment/serviceB      19675/20k   1         2         2          11m

Please provide the prometheus-adapter logs with -v=6 around the time the issue happened:

prometheus-adapter logs

Anything else we need to know?:

Environment:

  • prometheus-adapter version: prometheus-adapter-3.2.2 v0.9.1

  • prometheus version: kube-prometheus-stack-56.6.2 v0.71.2

  • Kubernetes version (use kubectl version): Client Version: v1.28.3-eks-e71965b
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.26.12-eks-5e0fdde

  • Cloud provider or hardware configuration: AWS EKS

  • Other info:

@saad946 saad946 added the kind/bug Categorizes issue or PR as related to a bug. label Feb 27, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 27, 2024
@saad946 saad946 changed the title invalid metrics (1 invalid out of 1), first error is: failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API Failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API Feb 27, 2024

dashpole commented Mar 7, 2024

/cc @CatherineF-dev
/assign @dgrisonnet
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 7, 2024
aurifolia commented

[screenshot]
Starting from v0.11.0, the file corresponding to this link does not exist, which may be the cause.


saad946 commented May 16, 2024

> Starting from v0.11.0, the file corresponding to this link does not exist, which may be the cause.

That is not the only issue. prometheus-adapter fails to get the GPU metric, so the HPA cannot scale the Kubernetes deployment up or down, and `kubectl describe hpa` shows the error "Failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API", with the status unknown or ScalingActive becoming False.


dvp34 commented Jun 13, 2024

Just curious, what does the raw data for DCGM_FI_DEV_GPU_UTIL{} from Prometheus look like?


mayyyyying commented Jul 2, 2024

Something like this, @dvp34:

{
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "535.171.04",
                    "Hostname": "qxzg-l4server",
                    "UUID": "GPU-557dbd17-5aa5-ade0-c563-d44fee17f8bc",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "nvidia-dcgm-exporter",
                    "device": "nvidia1",
                    "endpoint": "gpu-metrics",
                    "exported_container": "triton",
                    "exported_namespace": "llm",
                    "exported_pod": "qwen-1gpu-75455d6c96-7jcxq",
                    "gpu": "1",
                    "instance": "10.42.0.213:9400",
                    "job": "nvidia-dcgm-exporter",
                    "modelName": "NVIDIA L4",
                    "namespace": "gpu-operator",
                    "pod": "nvidia-dcgm-exporter-rlhcx",
                    "service": "nvidia-dcgm-exporter"
                },
                "value": [
                    1719909159.405,
                    "0"
                ]
            },
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "535.171.04",
                    "Hostname": "qxzg-l4server",
                    "UUID": "GPU-557dbd17-5aa5-ade0-c563-d44fee17f8bc",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "triton",
                    "device": "nvidia1",
                    "gpu": "1",
                    "instance": "10.42.0.213:9400",
                    "job": "gpu-metrics",
                    "kubernetes_node": "qxzg-l4server",
                    "modelName": "NVIDIA L4",
                    "namespace": "llm",
                    "pod": "qwen-1gpu-75455d6c96-7jcxq"
                },
                "value": [
                    1719909159.405,
                    "0"
                ]
            },
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "535.171.04",
                    "Hostname": "qxzg-l4server",
                    "UUID": "GPU-ec1a0983-4e27-c5c1-16f7-534319ffb62c",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "nvidia-dcgm-exporter",
                    "device": "nvidia0",
                    "endpoint": "gpu-metrics",
                    "exported_container": "triton",
                    "exported_namespace": "llm",
                    "exported_pod": "qwen2-d63eff62-2f6a-427d-b231-e7693a1c2915-747c599cb6-4xjlb",
                    "gpu": "0",
                    "instance": "10.42.0.213:9400",
                    "job": "nvidia-dcgm-exporter",
                    "modelName": "NVIDIA L4",
                    "namespace": "gpu-operator",
                    "pod": "nvidia-dcgm-exporter-rlhcx",
                    "service": "nvidia-dcgm-exporter"
                },
                "value": [
                    1719909159.405,
                    "0"
                ]
            },
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "535.171.04",
                    "Hostname": "qxzg-l4server",
                    "UUID": "GPU-ec1a0983-4e27-c5c1-16f7-534319ffb62c",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "triton",
                    "device": "nvidia0",
                    "gpu": "0",
                    "instance": "10.42.0.213:9400",
                    "job": "gpu-metrics",
                    "kubernetes_node": "qxzg-l4server",
                    "modelName": "NVIDIA L4",
                    "namespace": "llm",
                    "pod": "qwen2-d63eff62-2f6a-427d-b231-e7693a1c2915-747c599cb6-4xjlb"
                },
                "value": [
                    1719909159.405,
                    "0"
                ]
            }
        ]
    }
}
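
For what it's worth, the adapter rule above collapses series like these per pod using the aggregation from its metricsQuery; a sketch of that query run directly against Prometheus, using the exported_* labels shown in the output (illustration only, same expression as the DCGM_FI_DEV_GPU_UTIL_AVG rule):

    avg by (exported_namespace, exported_pod) (
      round(avg_over_time(DCGM_FI_DEV_GPU_UTIL{exported_pod!="",exported_container!=""}[1m]))
    )

Note that only the series scraped by the nvidia-dcgm-exporter job carry the exported_* labels, so the duplicate series from the gpu-metrics job are dropped by this selector.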
