bug: apisix report duplicate metrics #11934

FeiYing9 · 2025-01-22T09:02:36Z

Current Behavior

We run APISIX in several k8s clusters, but only one of them (the prod cluster) has this problem: APISIX reports a large number of duplicate metrics.

For example:

apisix_http_status{code="200",route="3fb9d6c2",matched_uri="/api/v1/*",matched_host="xxx",service="",consumer="",node="10.244.19.254",host="xxx",upstream_addr="10.244.19.254:8080",upstream_status="200",uri="/api/v1/cluster_metric/list_task_dimension",method="POST"} 96
apisix_http_status{code="200",route="3fb9d6c2",matched_uri="/api/v1/*",matched_host="xxx",service="",consumer="",node="10.244.19.254",host="xxx",upstream_addr="10.244.19.254:8080",upstream_status="200",uri="/api/v1/cluster_metric/list_task_dimension",method="POST"} 96
...
apisix_http_status{code="200",route="3fb9d6c2",matched_uri="/api/v1/*",matched_host="xxx",service="",consumer="",node="10.244.19.254",host="xxx",upstream_addr="10.244.19.254:8080",upstream_status="200",uri="/api/v1/file/upload",method="POST"} 3188
apisix_http_status{code="200",route="3fb9d6c2",matched_uri="/api/v1/*",matched_host="xxx",service="",consumer="",node="10.244.19.254",host="xxx",upstream_addr="10.244.19.254:8080",upstream_status="200",uri="/api/v1/file/upload",method="POST"} 3188
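
One way to confirm the duplicates in a saved scrape is to count how often each series (metric name plus label set) appears. The following is a minimal Python sketch, not part of APISIX; the function name `find_duplicate_series` is hypothetical, and it assumes each sample line has the form `name{labels} value` with no trailing timestamp, as in the output above.

```python
from collections import Counter


def find_duplicate_series(exposition_text: str) -> dict[str, int]:
    """Return {series: occurrences} for every series that appears more than once."""
    counts = Counter()
    for line in exposition_text.splitlines():
        line = line.strip()
        # Skip blank lines and "# HELP" / "# TYPE" comment lines.
        if not line or line.startswith("#"):
            continue
        # Assumption: the sample value is the last space-separated token,
        # so everything before it identifies the series.
        series, _, _value = line.rpartition(" ")
        if series:
            counts[series] += 1
    return {series: n for series, n in counts.items() if n > 1}
```

Passing the body returned by the metrics endpoint to `find_duplicate_series` gives a dictionary of the repeated series and how many times each one occurs; an empty dictionary means no duplicates were found.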

As a result, Prometheus logs many "Duplicate sample for timestamp" messages:

ts=2025-01-22T08:51:08.867Z caller=scrape.go:1793 level=debug component="scrape manager" scrape_pool=serviceMonitor/apisix/apisix/0 target=http://10.244.5.32:9091/apisix/prometheus/metrics msg="Duplicate sample for timestamp" series="apisix_http_latency_bucket{type=\"apisix\",route=\"3fb9d6c2\",service=\"\",consumer=\"\",node=\"10.244.10.60\",host=\"xxx\",upstream_addr=\"10.244.10.60:8080\",upstream_status=\"200\",uri=\"/api/v1/user/routes/ws-f4d69b29-e0a5-44e6-bd92-acf4de9990f0\",method=\"GET\",le=\"100\"}"

The metrics output is also very large: we run 6 APISIX pod instances, and when I curl the metrics URL of just one of them, I get about 100 MB of results.

Expected Behavior

No response

Error Logs

All of the error logs are about the shdict:

2025/01/22 15:07:27 [error] 534#534: *2088505577 [lua] prometheus_resty_counter.lua:39: increasing counter in shdict: lru eviction: key=http_latency_bucket{type="request",route="3fb9d6c2",service="",consumer="",node="10.244.11.36",host="xxx",upstream_addr="10.244.11.36:8080",upstream_status="200",uri="/api/v1/notebook/7eb9852a-be8d-4fac-a593-31f5f7d864b0",method="GET",le="30000.0"}, context: ngx.timer
...
2025/01/22 16:53:00 [error] 499#499: *2098016584 [lua] prometheus.lua:973: log_error(): Shared dictionary used for prometheus metrics is full. REPORTED METRIC DATA MIGHT BE INCOMPLETE. Please increase the size of the dictionary or decrease metric cardinality.; key index: add key: idx=__ngx_prom__key_115158, key=http_latency_bucket{type="request",route="3fb9d6c2",service="",consumer="",node="10.244.11.36",host="xxx",upstream_addr="10.244.11.36:8080",upstream_status="200",uri="/api/v1/project/project-cc83c686-1515-454e-870b-202a20a67727",method="GET",le="Inf"} while logging request, client: 10.245.13.201, server: _, request: "GET /api/v1/project/project-cc83c686-1515-454e-870b-202a20a67727 HTTP/2.0", upstream: "http://10.244.11.36:8080/api/v1/project/project-cc83c686-1515-454e-870b-202a20a67727", host: "qz.sii.edu.cn", referrer: "https://xxx/jobs/distributedTraining?spaceId=ws-f4d69b29-e0a5-44e6-bd92-acf4de9990f0"

We accept that the shared dict is running out of memory; we just want to know why APISIX reports duplicate metrics.

Steps to Reproduce

No idea how to reproduce it.

APISIX config:

    nginx_config:    # config for render the template to generate nginx.conf
        lua_shared_dict:
          prometheus-metrics: 200m            # yes, it's 200m
...
    plugin_attr:
      opentelemetry:
        set_ngx_var: true
      prometheus:
        expire: 16
        export_addr:
          ip: 0.0.0.0
          port: 9091
        export_uri: /apisix/prometheus/metrics
        metric_prefix: apisix_
        metrics:
          bandwidth:
            extra_labels:
            - host: $host
            - upstream_addr: $upstream_addr
            - upstream_status: $upstream_status
            - uri: $uri
            - method: $request_method
          http_latency:
            extra_labels:
            - host: $host
            - upstream_addr: $upstream_addr
            - upstream_status: $upstream_status
            - uri: $uri
            - method: $request_method
          http_status:
            extra_labels:
            - host: $host
            - upstream_addr: $upstream_addr
            - upstream_status: $upstream_status
            - uri: $uri
            - method: $request_method
        prefer_name: true
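
The shdict error above asks to "decrease metric cardinality", and the config adds `uri: $uri` as an extra label while the logged URIs contain per-object IDs, so the `uri` label is a plausible main driver of the series count. A rough way to check this against a saved scrape is to count distinct values per label. This is a minimal Python sketch under that assumption; `label_cardinality` is a hypothetical helper, not an APISIX or Prometheus API.

```python
import re
from collections import defaultdict

# Matches name="value" pairs, allowing backslash-escaped characters in the value.
LABEL_RE = re.compile(r'(\w+)="((?:[^"\\]|\\.)*)"')


def label_cardinality(exposition_text: str) -> dict[str, int]:
    """Map each label name to the number of distinct values seen in the scrape."""
    values = defaultdict(set)
    for line in exposition_text.splitlines():
        # Skip comments and samples without a label set.
        if line.startswith("#") or "{" not in line or "}" not in line:
            continue
        # Take the text between the first "{" and the last "}".
        labels = line[line.index("{") + 1 : line.rindex("}")]
        for name, value in LABEL_RE.findall(labels):
            values[name].add(value)
    return {name: len(vals) for name, vals in values.items()}
```

Sorting the result by count shows which label contributes the most distinct values; if `uri` dominates, dropping it from `extra_labels` would be the first thing to try.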

Environment

APISIX version (run apisix version): 3.7.0 (helm version: 2.5.0)
Operating system (run uname -a): Linux cpu-001 5.4.0-192-generic #212-Ubuntu SMP Fri Jul 5 09:47:39 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
OpenResty / Nginx version (run openresty -V or nginx -V): openresty/1.21.4.2
k8s version:

$ kubectl version
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.16", GitCommit:"cbb86e0d7f4a049666fac0551e8b02ef3d6c3d9a", GitTreeState:"clean", BuildDate:"2024-07-17T01:53:56Z", GoVersion:"go1.22.5", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.16", GitCommit:"cbb86e0d7f4a049666fac0551e8b02ef3d6c3d9a", GitTreeState:"clean", BuildDate:"2024-07-17T01:44:26Z", GoVersion:"go1.22.5", Compiler:"gc", Platform:"linux/amd64"}


yurkovoznyak · 2025-01-22T12:35:32Z

I believe the issue is with the metric expiration logic.

I was able to reproduce it locally when I set the expiration time to a low value (like expire: 10) and had more than 1 worker process (I tested with 6; the more workers, the easier it is to reproduce).

Removing the metrics expiration configuration resolves the duplicates.
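
To watch for the duplicates while reproducing, the metrics endpoint can be polled and the repeated lines counted on each scrape. This is only a sketch: the port and path come from the config in the original report, while the `127.0.0.1` host, the polling interval, and the helper names `count_duplicates` and `watch` are assumptions for a local setup.

```python
import time
import urllib.request
from collections import Counter

# Assumption: APISIX runs locally with the export_addr port and export_uri
# shown in the original report.
METRICS_URL = "http://127.0.0.1:9091/apisix/prometheus/metrics"


def count_duplicates(text: str) -> int:
    """Count sample lines whose series already appeared earlier in the same scrape."""
    series = [
        line.rpartition(" ")[0]  # drop the trailing sample value
        for line in text.splitlines()
        if line and not line.startswith("#")
    ]
    return sum(n - 1 for n in Counter(series).values() if n > 1)


def watch(url: str = METRICS_URL, interval: float = 5.0, rounds: int = 12) -> None:
    """Scrape the endpoint `rounds` times and print the duplicate count each time."""
    for _ in range(rounds):
        with urllib.request.urlopen(url, timeout=10) as resp:
            body = resp.read().decode("utf-8", errors="replace")
        print(time.strftime("%H:%M:%S"), "duplicate lines:", count_duplicates(body))
        time.sleep(interval)
```

Calling `watch()` while sending traffic through a route, once with `expire` set to a low value and once with it removed, should make the difference visible if the expiration logic is indeed the cause.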

Environment

APISIX version (run apisix version): 3.11.0
Operating system (run uname -a): Linux 82435988518b 6.10.14-linuxkit #1 SMP Fri Nov 29 17:22:03 UTC 2024 aarch64 aarch64 aarch64 GNU/Linux
OpenResty / Nginx version (run openresty -V or nginx -V): openresty/1.25.3.2

github-project-automation bot added this to Apache APISIX backlog Jan 22, 2025

github-project-automation bot moved this to 📋 Backlog in Apache APISIX backlog Jan 22, 2025

dosubot bot added the bug Something isn't working label Jan 22, 2025
