
bug: lua_shared_dict prometheus-metrics overflow #11948

Open
DrJSAnD opened this issue Feb 5, 2025 · 2 comments
Labels
bug Something isn't working

Comments

DrJSAnD commented Feb 5, 2025

Current Behavior

We deploy APISIX in a Kubernetes cluster and have a problem with Prometheus metrics.
We noticed that the lua_shared_dict prometheus-metrics overflows; once that happens, the apisix_nginx_metric_errors_total counter starts to grow and all metrics stop being reported correctly.


We tried increasing the prometheus-metrics size to 40m in the ConfigMap (config.yaml), but after 2 months this lua_shared_dict was full on all pods and the errors started to occur again.


nginx_config:    # config used to render the template that generates nginx.conf
  error_log: "/dev/stderr"
  error_log_level: "warn"    # warn,error
  worker_processes: "auto"
  enable_cpu_affinity: true
  worker_rlimit_nofile: 20480  # the number of files a worker process can open, should be larger than worker_connections
  event:
    worker_connections: 10620
  http:
    enable_access_log: true
    access_log: "/dev/stdout"
    access_log_format: '$remote_addr - $remote_user [$time_local] $http_host \"$request\" $status $body_bytes_sent $request_time \"$http_referer\" \"$http_user_agent\" $upstream_addr $upstream_status $upstream_response_time \"$upstream_scheme://$upstream_host$upstream_uri\"'
    access_log_format_escape: default
    keepalive_timeout: "60s"
    client_header_timeout: 60s     # timeout for reading client request header, then 408 (Request Time-out) error is returned to the client
    client_body_timeout: 60s       # timeout for reading client request body, then 408 (Request Time-out) error is returned to the client
    send_timeout: 10s              # timeout for transmitting a response to the client; after that the connection is closed
    underscores_in_headers: "on"   # default enables the use of underscores in client request header fields
    real_ip_header: "X-Real-IP"    # http://nginx.org/en/docs/http/ngx_http_realip_module.html#real_ip_header
    real_ip_from:                  # http://nginx.org/en/docs/http/ngx_http_realip_module.html#set_real_ip_from
      - 127.0.0.1
      - 'unix:'
    lua_shared_dict:
      prometheus-metrics: 40m
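
For reference, a quick way to confirm that the enlarged dict actually made it into the rendered nginx config; a minimal sketch, assuming the default /usr/local/apisix prefix inside the pod (the pod name is a placeholder, adjust the path to your image layout):

kubectl exec <apisix-pod> -- \
  grep -n 'lua_shared_dict prometheus-metrics' /usr/local/apisix/conf/nginx.conf
# Expected output, matching the value set in config.yaml above:
#   lua_shared_dict prometheus-metrics 40m;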

Current APISIX state

  • Deployment via Helm chart: https://github.com/apache/apisix-helm-chart
  • Helm Chart version: 2.10.0
  • K8s pods: 3
  • Pod CPU limits: 15 cores (usage 4%)
  • Pod memory limits: 60 GiB (usage 35 GiB)
  • Total requests per second: 2500 - 3000
  • Active connections: 2000+
  • Upstreams: 100+
  • Routes: 120+
  • Consumers: 60+
  • Plugins: basic-auth and kafka-logger on all routes

Expected Behavior

No response

Error Logs

No response

Steps to Reproduce

  1. Run APISIX with the default lua_shared_dict: prometheus-metrics size
  2. After 2-3 weeks prometheus-metrics overflows, the apisix_nginx_metric_errors_total counter starts to grow, and all metrics stop displaying correctly (both counters can be watched as sketched after this list)
  3. Change lua_shared_dict: prometheus-metrics to 40m
  4. After 2-3 months the lua_shared_dict overflows again and we get the same problem with displaying metrics
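
Both symptoms are visible on the plugin's export endpoint before the dashboards break; a minimal sketch for watching them, assuming the default 9091 export port (adjust host/port to your deployment):

curl -s http://127.0.0.1:9091/apisix/prometheus/metrics \
  | grep -e 'shared_dict_free_space_bytes' -e 'nginx_metric_errors_total'

When apisix_shared_dict_free_space_bytes{name="prometheus-metrics"} trends toward 0, the error counter is about to start climbing.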

Environment

  • APISIX version (run apisix version): 3.10.0
  • Operating system (run uname -a): Linux apisix-69cfdc5fbf-m7k27 5.14.0-362.13.1.el9_3.x86_64 SMP PREEMPT_DYNAMIC Fri Nov 24 01:57:57 EST 2023 x86_64 GNU/Linux
  • OpenResty / Nginx version (run openresty -V or nginx -V): openresty/1.25.3.2
  • etcd version, if relevant (run curl http://127.0.0.1:9090/v1/server_info): 3.5.0
  • APISIX Dashboard version, if relevant: 3.0.0
  • Plugin runner version, for issues related to plugin runners:
  • LuaRocks version, for installation issues (run luarocks --version):
github-project-automation bot moved this to 📋 Backlog in Apache APISIX backlog on Feb 5, 2025
dosubot bot added the bug (Something isn't working) label on Feb 5, 2025
qizhendong1 (Contributor) commented

You can use curl 127.0.0.1:9091/apisix/prometheus/metrics to test whether the prometheus plugin endpoint works properly.

DrJSAnD commented Feb 6, 2025

The Prometheus plugin works on all pods and returns 69000+ rows from each pod.
We display the metrics from Prometheus in Grafana (https://github.com/apache/apisix/blob/master/docs/assets/other/json/apisix-grafana-dashboard.json).
When apisix_shared_dict_free_space_bytes{name="prometheus-metrics"} reaches 0, apisix_nginx_metric_errors_total starts to grow and all APISIX metrics show incorrect values.
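
For what it's worth, a quick way to break those 69000+ rows down by metric family (high-cardinality series are what typically fill the dict); a sketch, assuming the same 9091 export endpoint:

curl -s http://127.0.0.1:9091/apisix/prometheus/metrics \
  | grep -v '^#' \
  | awk -F'[{ ]' '{print $1}' \
  | sort | uniq -c | sort -rn | head

The per-family counts help tell whether the dict is simply undersized or whether one metric family keeps growing without bound.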
