Failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API #644

Open
saad946 opened this issue Feb 27, 2024 · 5 comments
Labels: kind/bug, triage/accepted

saad946 commented Feb 27, 2024

What happened?:
I am constantly hitting this with one of my services whose HPA is configured to scale on a custom metric. Sometimes the HPA reports AbleToScale True and can fetch the custom metric, but most of the time it cannot. Because of that, the HPA is not able to scale the pods down.

This is the HPA description for the affected service:

  Type            Status  Reason               Message
  ----            ------  ------               -------
  AbleToScale     True    SucceededGetScale    the HPA controller was able to get the target's current scale
  ScalingActive   False   FailedGetPodsMetric  the HPA was unable to compute the replica count: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API
  ScalingLimited  False   DesiredWithinRange   the desired count is within the acceptable range

The other service uses the same HPA configuration but does not show this error when its HPA is described. Below is the HPA description from that service.

Working service HPA:
The behaviour is random in both services: sometimes the custom metric is collected, sometimes not.

  Type            Status  Reason              Message
  ----            ------  ------              -------
  AbleToScale     True    ReadyForNewScale    recommended size matches current size
  ScalingActive   True    ValidMetricFound    the HPA was able to successfully calculate a replica count from pods metric DCGM_FI_DEV_FB_USED_AVG

What did you expect to happen?:
I expect prometheus-adapter and the HPA to behave the same for both services, since both use the same configuration.

Please provide the prometheus-adapter config:

prometheus-adapter config
prometheus:
  url: http://prometheus-kube-prometheus-prometheus.monitoring.svc.cluster.local
  port: 9090
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "1"
    memory: "1Gi"

rules:
  default: false
  custom:
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="", exported_container!="", exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_GPU_UTIL_AVG"
      resources:
        overrides:
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
          exported_container: {resource: "pod"}
      metricsQuery: 'avg by (exported_namespace, exported_pod) (round(avg_over_time(DCGM_FI_DEV_GPU_UTIL{exported_pod!="",exported_container!=""}[1m])))'
    - seriesQuery: 'DCGM_FI_DEV_FB_USED{exported_namespace!="", exported_container!="", exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_FB_USED_AVG"
      resources:
        overrides:
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
          exported_container: {resource: "pod"}
      metricsQuery: 'avg by (exported_namespace, exported_pod) (round(avg_over_time(DCGM_FI_DEV_FB_USED{exported_pod!="",exported_container!=""}[1m])))'
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_container!="",exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_GPU_UTIL_MIN"
      resources:
        overrides:
          exported_container: {resource: "service"}
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
      metricsQuery: min by (exported_namespace, exported_container) (round(min_over_time(<<.Series>>[1m])))
    - seriesQuery: 'DCGM_FI_DEV_GPU_UTIL{exported_namespace!="",exported_container!="",exported_pod!=""}'
      name:
        as: "DCGM_FI_DEV_GPU_UTIL_MAX"
      resources:
        overrides:
          exported_container: {resource: "service"}
          exported_namespace: {resource: "namespace"}
          exported_pod: {resource: "pod"}
      metricsQuery: max by (exported_namespace, exported_container) (round(max_over_time(<<.Series>>[1m])))
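
For reference, prometheus-adapter rules can also template the query with <<.Series>>, <<.LabelMatchers>> and <<.GroupBy>> so the adapter injects label matchers for the pods the HPA actually asks about; a sketch of that documented form for the FB_USED rule, shown only for comparison with the hardcoded selectors above (not the configuration we are running):

      metricsQuery: 'avg by (<<.GroupBy>>) (round(avg_over_time(<<.Series>>{<<.LabelMatchers>>}[1m])))'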

When checking whether the metric exists, I get this response:


 kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1 | jq -r . | grep DCGM_FI_DEV_FB_USED_AVG
      "name": "pods/DCGM_FI_DEV_FB_USED_AVG",
      "name": "namespaces/DCGM_FI_DEV_FB_USED_AVG",

Please provide the HPA resource used for autoscaling:

HPA yaml

The HPA YAML for both services is below:

Not Working one:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: serviceA-memory-utilization-hpa
  namespace: development
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name:  serviceA
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_FB_USED_AVG
      target:
        type: AverageValue
        averageValue: 20000

Working One:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: serviceB-memory-utilization-hpa
  namespace: common-service-development 
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name:  serviceB
  minReplicas: 1
  maxReplicas: 2
  metrics:
  - type: Pods
    pods:
      metric:
        name: DCGM_FI_DEV_FB_USED_AVG
      target:
        type: AverageValue
        averageValue: 20000

Please provide the HPA status:

We observe these events in both services from time to time. ServiceB is sometimes able to collect the metric, while serviceA fails most of the time.

Events:
  Type     Reason                        Age                 From                       Message
  ----     ------                        ----                ----                       -------
  Warning  FailedComputeMetricsReplicas  26m (x12 over 30m)  horizontal-pod-autoscaler  invalid metrics (1 invalid out of 1), first error is: failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API
  Warning  FailedGetPodsMetric           22s (x74 over 30m)  horizontal-pod-autoscaler  unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API

And this is the HPA status. It seems the memory utilization value is being retrieved, but when we describe the HPA we see the issues stated earlier: the HPA is unable to collect the metric and does not trigger any scaling activity.

NAME                                 REFERENCE                TARGETS     MINPODS   MAXPODS   REPLICAS   AGE
serviceA-memory-utilization-hpa      Deployment/serviceA      19675/20k   1         2         1          14m
serviceB-memory-utilization-hpa      Deployment/serviceB      19675/20k   1         2         2          11m

Please provide the prometheus-adapter logs with -v=6 around the time the issue happened:

prometheus-adapter logs

Anything else we need to know?:

Environment:

  • prometheus-adapter version: prometheus-adapter-3.2.2 v0.9.1

  • prometheus version: kube-prometheus-stack-56.6.2 v0.71.2

  • Kubernetes version (use kubectl version): Client Version: v1.28.3-eks-e71965b
    Kustomize Version: v5.0.4-0.20230601165947-6ce0bf390ce3
    Server Version: v1.26.12-eks-5e0fdde

  • Cloud provider or hardware configuration: AWS EKS

  • Other info:

@saad946 saad946 added the kind/bug Categorizes issue or PR as related to a bug. label Feb 27, 2024
@k8s-ci-robot k8s-ci-robot added the needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. label Feb 27, 2024
@saad946 saad946 changed the title invalid metrics (1 invalid out of 1), first error is: failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API Failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API Feb 27, 2024

dashpole commented Mar 7, 2024

/cc @CatherineF-dev
/assign @dgrisonnet
/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Mar 7, 2024
aurifolia commented

[screenshot]
Starting from v0.11.0, the file corresponding to this link does not exist, which may be the cause.


saad946 commented May 16, 2024

> Starting from v0.11.0, the file corresponding to this link does not exist, which may be the cause.

That is not the only issue. prometheus-adapter fails to get the GPU metric, so the HPA cannot scale the Kubernetes deployment up or down, and `kubectl describe hpa` shows the error "Failed to get pods metric value: unable to get metric DCGM_FI_DEV_FB_USED_AVG: no metrics returned from custom metrics API", with the status unknown or ScalingActive becoming False.


dvp34 commented Jun 13, 2024

Just curious, what does the raw data for DCGM_FI_DEV_GPU_UTIL{} from Prometheus look like?


mayyyyying commented Jul 2, 2024

Something like this, @dvp34:

{
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "535.171.04",
                    "Hostname": "qxzg-l4server",
                    "UUID": "GPU-557dbd17-5aa5-ade0-c563-d44fee17f8bc",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "nvidia-dcgm-exporter",
                    "device": "nvidia1",
                    "endpoint": "gpu-metrics",
                    "exported_container": "triton",
                    "exported_namespace": "llm",
                    "exported_pod": "qwen-1gpu-75455d6c96-7jcxq",
                    "gpu": "1",
                    "instance": "10.42.0.213:9400",
                    "job": "nvidia-dcgm-exporter",
                    "modelName": "NVIDIA L4",
                    "namespace": "gpu-operator",
                    "pod": "nvidia-dcgm-exporter-rlhcx",
                    "service": "nvidia-dcgm-exporter"
                },
                "value": [
                    1719909159.405,
                    "0"
                ]
            },
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "535.171.04",
                    "Hostname": "qxzg-l4server",
                    "UUID": "GPU-557dbd17-5aa5-ade0-c563-d44fee17f8bc",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "triton",
                    "device": "nvidia1",
                    "gpu": "1",
                    "instance": "10.42.0.213:9400",
                    "job": "gpu-metrics",
                    "kubernetes_node": "qxzg-l4server",
                    "modelName": "NVIDIA L4",
                    "namespace": "llm",
                    "pod": "qwen-1gpu-75455d6c96-7jcxq"
                },
                "value": [
                    1719909159.405,
                    "0"
                ]
            },
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "535.171.04",
                    "Hostname": "qxzg-l4server",
                    "UUID": "GPU-ec1a0983-4e27-c5c1-16f7-534319ffb62c",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "nvidia-dcgm-exporter",
                    "device": "nvidia0",
                    "endpoint": "gpu-metrics",
                    "exported_container": "triton",
                    "exported_namespace": "llm",
                    "exported_pod": "qwen2-d63eff62-2f6a-427d-b231-e7693a1c2915-747c599cb6-4xjlb",
                    "gpu": "0",
                    "instance": "10.42.0.213:9400",
                    "job": "nvidia-dcgm-exporter",
                    "modelName": "NVIDIA L4",
                    "namespace": "gpu-operator",
                    "pod": "nvidia-dcgm-exporter-rlhcx",
                    "service": "nvidia-dcgm-exporter"
                },
                "value": [
                    1719909159.405,
                    "0"
                ]
            },
            {
                "metric": {
                    "DCGM_FI_DRIVER_VERSION": "535.171.04",
                    "Hostname": "qxzg-l4server",
                    "UUID": "GPU-ec1a0983-4e27-c5c1-16f7-534319ffb62c",
                    "__name__": "DCGM_FI_DEV_GPU_UTIL",
                    "container": "triton",
                    "device": "nvidia0",
                    "gpu": "0",
                    "instance": "10.42.0.213:9400",
                    "job": "gpu-metrics",
                    "kubernetes_node": "qxzg-l4server",
                    "modelName": "NVIDIA L4",
                    "namespace": "llm",
                    "pod": "qwen2-d63eff62-2f6a-427d-b231-e7693a1c2915-747c599cb6-4xjlb"
                },
                "value": [
                    1719909159.405,
                    "0"
                ]
            }
        ]
    }
}
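
For what it's worth, the adapter rule above collapses series like these per pod using the aggregation from its metricsQuery; a sketch of that query run directly against Prometheus, using the exported_* labels shown in the output (illustration only, same expression as the DCGM_FI_DEV_GPU_UTIL_AVG rule):

    avg by (exported_namespace, exported_pod) (
      round(avg_over_time(DCGM_FI_DEV_GPU_UTIL{exported_pod!="",exported_container!=""}[1m]))
    )

Note that only the series scraped by the nvidia-dcgm-exporter job carry the exported_* labels, so the duplicate series from the gpu-metrics job are dropped by this selector.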
