Chunk cache too many timeout errors #8004

kothariroshni8 · 2024-03-12T21:55:50Z

Describe the bug

The chunks cache produces too many timeout errors.

To Reproduce

Steps to reproduce the behavior:

Start Mimir (SHA or version): 1.10
Perform operations (read/write/others): run a query for more than 2 days.

Expected behavior

No timeouts, and faster responses from the store-gateway.

Environment

Infrastructure: Kubernetes
Deployment tool: helm

Additional Context

We observed many timeout errors for chunks. The chunks-cache hit ratio is
also below 10%.
We tried scaling up Memcached and the store-gateway, but saw no
improvement. We also tried increasing the timeout to 4s and saw some
improvement.

Memory limit: 32 GiB
Memory used: 6 GiB
Max connection limit: 16k
Total connections: ~1k
Requests: 2.5k ops/s
Latency: 500ms (99th percentile), 255ms (average)
Timeout: 450ms
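
A rough check on the figures above, assuming the latency percentiles and the timeout apply to the same requests (the post does not say where the latency is measured): a 450ms timeout sits below the 500ms 99th percentile, so roughly 1% of requests or more would be expected to exceed it. The function name below is hypothetical and the result is only a lower bound, not a measurement.

```python
# Figures reported in the post; the full latency distribution is unknown.
TIMEOUT_MS = 450
P99_LATENCY_MS = 500
REQUESTS_PER_SEC = 2500


def min_timeouts_per_sec(timeout_ms, p99_ms, requests_per_sec):
    """Lower bound on requests per second that exceed the timeout.

    About 1% of requests are at or above the 99th percentile latency.
    If the timeout is below that percentile, all of those requests
    exceed it. If the timeout is at or above the percentile, this
    bound says nothing, so return 0.
    """
    if timeout_ms >= p99_ms:
        return 0
    return requests_per_sec // 100  # 1% of the request rate


print(min_timeouts_per_sec(TIMEOUT_MS, P99_LATENCY_MS, REQUESTS_PER_SEC))
print(min_timeouts_per_sec(4000, P99_LATENCY_MS, REQUESTS_PER_SEC))
```

With the reported numbers this gives a floor of 25 timed-out requests per second at 450ms, and no guaranteed floor at 4s, which is consistent with the improvement seen after raising the timeout.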

Store-gateway logs:

ts=2024-03-12T21:22:41.851574875Z caller=memcached_client.go:462 level=debug name=chunks-cache msg="failed to get multiple items from memcached" err="memcache: connect timeout to 10.42.59.250:11211" 
ts=2024-03-12T21:22:41.851604078Z caller=client.go:144 level=debug msg="failed to store item to cache because the async buffer is full" err="the async queue is full" size=50000 
ts=2024-03-12T21:22:41.851618398Z caller=client.go:144 level=debug msg="failed to store item to cache because the async buffer is full" err="the async queue is full" size=50000 
ts=2024-03-12T21:22:41.851730404Z caller=client.go:144 level=debug msg="failed to store item to cache because the async buffer is full" err="the async queue is full" size=50000 
ts=2024-03-12T21:22:41.852071054Z caller=memcached_client.go:462 level=debug name=chunks-cache msg="failed to get multiple items from memcached" err="read tcp 10.42.60.246:45834->10.42.59.250:11211: i/o timeout" 
ts=2024-03-12T21:22:41.852253683Z caller=client.go:144 level=debug msg="failed to store item to cache because the async buffer is full" err="the async queue is full" size=50000 
ts=2024-03-12T21:22:41.852281985Z caller=client.go:144 level=debug msg="failed to store item to cache because the async buffer is full" err="the async queue is full" size=50000 
ts=2024-03-12T21:22:41.8523116Z caller=client.go:144 level=debug msg="failed to store item to cache because the async buffer is full" err="the async queue is full" size=50000

Memcached exporter logs:

ts=2024-03-12T18:42:10.408Z caller=exporter.go:758 level=error msg="Failed to collect stats from memcached" err="memcache: connect timeout to 127.0.0.1:11211"
ts=2024-03-12T18:42:11.419Z caller=exporter.go:763 level=error msg="Could not query stats settings" err="read tcp 127.0.0.1:58960->127.0.0.1:11211: i/o timeout" 
ts=2024-03-12T18:46:11.409Z caller=exporter.go:763 level=error msg="Could not query stats settings" err="read tcp 127.0.0.1:32916->127.0.0.1:11211: i/o timeout" 
ts=2024-03-12T18:46:40.407Z caller=exporter.go:758 level=error msg="Failed to collect stats from memcached" err="memcache: connect timeout to 127.0.0.1:11211" 
ts=2024-03-12T18:46:41.408Z caller=exporter.go:763 level=error msg="Could not query stats settings" err="memcache: connect timeout to 127.0.0.1:11211" 
ts=2024-03-12T18:47:10.407Z caller=exporter.go:758 level=error msg="Failed to collect stats from memcached" err="memcache: connect timeout to 127.0.0.1:11211" 
ts=2024-03-12T18:47:11.471Z caller=exporter.go:763 level=error msg="Could not query stats settings" err="read tcp 127.0.0.1:57638->127.0.0.1:11211: i/o timeout" 
ts=2024-03-12T18:47:40.411Z caller=exporter.go:758 level=error msg="Failed to collect stats from memcached" err="read tcp 127.0.0.1:40606->127.0.0.1:11211: i/o timeout" 
ts=2024-03-12T18:47:41.412Z caller=exporter.go:763 level=error msg="Could not query stats settings" err="memcache: connect timeout to 127.0.0.1:11211" 
ts=2024-03-12T18:56:10.407Z caller=exporter.go:758 level=error msg="Failed to collect stats from memcached" err="memcache: connect timeout to 127.0.0.1:11211" 
ts=2024-03-12T18:56:11.407Z caller=exporter.go:763 level=error msg="Could not query stats settings" err="memcache: connect timeout to 127.0.0.1:11211" 
ts=2024-03-12T18:56:40.406Z caller=exporter.go:758 level=error msg="Failed to collect stats from memcached" err="memcache: connect timeout to 127.0.0.1:11211" 
ts=2024-03-12T18:56:41.407Z caller=exporter.go:763 level=error msg="Could not query stats settings" err="memcache: connect timeout to 127.0.0.1:11211" 
ts=2024-03-12T18:57:10.406Z caller=exporter.go:758 level=error msg="Failed to collect stats from memcached" err="memcache: connect timeout to 127.0.0.1:11211" 
ts=2024-03-12T18:57:11.407Z caller=exporter.go:763 level=error msg="Could not query stats settings" err="memcache: connect timeout to 127.0.0.1:11211" 
ts=2024-03-12T18:57:40.767Z caller=exporter.go:758 level=error msg="Failed to collect stats from memcached" err="read tcp 127.0.0.1:60124->127.0.0.1:11211: i/o timeout"

chunks-cache:
  <<: *sharedCacheConfigs
  # -- Specifies whether memcached based chunks-cache should be enabled
  enabled: true
  # -- Total number of chunks-cache replicas
  replicas: 4
  # -- Port of the chunks-cache service
  port: 11211
  # -- Amount of memory allocated to chunks-cache for object storage (in MB).
  allocatedMemory: 8192
  # -- Maximum item memory for chunks-cache (in MB).
  maxItemMemory: 1

56quarters · 2024-04-29T19:54:02Z

Maintainer

The exporter not being able to connect to Memcached from within the same pod indicates some kind of network configuration error. You can verify this by attempting to connect to Memcached from any other pod in your infrastructure using telnet.

Example:

kubectl run cache-test --rm=true -n <namespace> -ti --restart=Never --image ubuntu:latest

From within that pod, use telnet to connect to Memcached and run the stats command:

telnet cache-host.example 11211
stats
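
If telnet is not available in the test pod, the same check can be sketched in Python. This is a minimal sketch, assuming the Memcached text protocol (`stats` answered with `STAT <name> <value>` lines and a final `END`); the function names are hypothetical, and `cache-host.example` is the placeholder host from the telnet example.

```python
import socket


def parse_stats(raw: bytes) -> dict:
    """Parse a Memcached text-protocol stats response into a dict."""
    stats = {}
    for line in raw.decode("ascii", errors="replace").splitlines():
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host: str, port: int = 11211, timeout: float = 2.0) -> dict:
    """Connect, send `stats`, and read until the END marker or close."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:  # server closed the connection
                break
            buf += chunk
    return parse_stats(buf)


def hit_ratio(stats: dict) -> float:
    """Hit ratio from the get_hits and get_misses counters."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    total = hits + misses
    return hits / total if total else 0.0


# Example usage (needs network access to the cache):
#   stats = fetch_stats("cache-host.example")
#   print(stats.get("curr_connections"), hit_ratio(stats))
```

A connect or read timeout raised by `fetch_stats` from another pod would point to the same connectivity problem the exporter logs show.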

majidkhoram · 2024-12-14T21:34:20Z

Hi Everyone,

I have the same issue: lots of warnings in the store-gateway logs, "failed to fetch items from Memcached", with i/o timeouts on TCP reads from <store-gateway_IP>:port to <chunks-cache_IP>:11211.

And in the chunks-cache logs:
Failed to write, and not due to blocking: Broken pipe

All Mimir pods are deployed in the Mimir namespace, so there is no network policy that could mistakenly block traffic between pods.

I would be very grateful if anyone has found a solution.
