[bug] centrifugo node freezes periodically #925
Hello @z7zmey AWS ElastiCache operates using a Redis fork – so it may have some differences which should be addressed in a special way. Please try using the latest Centrifugo v5.4.8 – it looks like your image is cached from some previous Centrifugo version which depended on rueidis v1.0.48. v5.4.8 depends on rueidis v1.0.51 – there were some fixes which may be relevant. If the latest version does not help with the issue, let me know and we can continue investigating, and probably try to address this in the rueidis repo. But we need to make sure the issue reproduces with the latest version of rueidis first.
@z7zmey please also post your Centrifugo config.
I have updated it to v5.4.8. It could take a few days to confirm whether it helps. The application is configured using these environment variables.
I took a closer look at the goroutine profiles, and it seems the update to v5.4.8 won't help much here. I was able to reproduce this case. It happens when a connection has several server-side subscriptions with recovery on, and there is a deadlock due to recovery sync with PUB/SUB. It deadlocks with a goroutine profile very similar to the one provided. It's indirectly related to the rueidis client, but the main problem is in Centrifugo code which acquires locks in an order that blocks the rueidis reader loop.

I have a fix idea and already did some experiments with it before (though for different reasons) – processing PUB/SUB messages using an additional queue on the Centrifugo node. Looks like it's time to apply it, because it should effectively fix this problem and prevent similar ones in the future. And while this approach may introduce some additional latency in PUB/SUB message processing, it may help keep latencies smaller for other Redis operations, like subscribe/unsubscribe, etc. I'll start working on the fix, but still let me know if you notice the problem reproducing with v5.4.8 – that will give more confidence that I am addressing the right scenario.
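For illustration, here is a minimal Go sketch of the queue idea described above: the Redis reader callback only enqueues messages, and a separate goroutine applies them to node state, so the reader loop never waits on node-level locks. The types and names here (pubSubMessage, node, processQueue) are illustrative, not the actual Centrifugo/centrifuge internals.

```go
// Minimal sketch of the "additional queue" idea: the Redis PUB/SUB callback only
// enqueues messages, and a separate goroutine applies them to node state. The
// reader loop therefore never blocks on node-level locks. Names are illustrative,
// not the actual Centrifugo internals.
package main

import (
	"fmt"
	"sync"
)

type pubSubMessage struct {
	Channel string
	Data    []byte
}

type node struct {
	mu    sync.Mutex
	state map[string]int // per-channel counters guarded by mu

	queue chan pubSubMessage
}

func newNode() *node {
	n := &node{
		state: make(map[string]int),
		queue: make(chan pubSubMessage, 1024), // buffered so the reader rarely blocks
	}
	go n.processQueue()
	return n
}

// onRedisMessage is what the Redis client's reader loop would call. It must be
// cheap and must not take n.mu, otherwise the reader loop can deadlock with
// code paths that hold n.mu while waiting on Redis (the scenario in this issue).
func (n *node) onRedisMessage(msg pubSubMessage) {
	n.queue <- msg
}

// processQueue applies messages to node state in a separate goroutine, so lock
// acquisition here can never stall the Redis reader loop.
func (n *node) processQueue() {
	for msg := range n.queue {
		n.mu.Lock()
		n.state[msg.Channel]++
		n.mu.Unlock()
	}
}

func main() {
	n := newNode()
	n.onRedisMessage(pubSubMessage{Channel: "news", Data: []byte("hello")})
	fmt.Println("enqueued without touching node locks in the reader path")
}
```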
Thanks. I'll perform fresh profiling when I detect an incident.
New profiling for v5.4.8:
One more thing I forgot to mention: during incidents, the troublesome node is no longer visible to the healthy Centrifugo nodes. On the other hand, the problematic node cannot obtain the node list from Redis and continues to show the same number of nodes in its metrics. With 4 Centrifugo nodes, I can detect incidents when the healthy nodes see 3 nodes while the unhealthy node still reports 4.
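As a side note, this split view (healthy nodes seeing 3 while the stuck one reports 4) can be detected automatically. Below is a hedged Go sketch that polls each instance for its self-reported node count and compares them; the /api/info endpoint, X-API-Key header, response shape and instance URLs are assumptions about the Centrifugo v5 server HTTP API and this deployment, so adjust them to what is actually exposed.

```go
// Hedged sketch: poll each Centrifugo instance for the number of nodes it sees
// and flag a split view. The /api/info endpoint, X-API-Key header and response
// shape are assumptions based on the Centrifugo v5 server HTTP API; adjust to
// the deployed version.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type infoResult struct {
	Result struct {
		Nodes []json.RawMessage `json:"nodes"`
	} `json:"result"`
}

func nodeCount(apiURL, apiKey string) (int, error) {
	req, err := http.NewRequest("POST", apiURL, bytes.NewBufferString("{}"))
	if err != nil {
		return 0, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-API-Key", apiKey) // assumed auth header, see note above

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var out infoResult
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return len(out.Result.Nodes), nil
}

func main() {
	// Placeholder instance addresses – replace with the real ECS task endpoints.
	instances := []string{
		"http://centrifugo-1:8000/api/info",
		"http://centrifugo-2:8000/api/info",
	}
	counts := map[string]int{}
	for _, u := range instances {
		c, err := nodeCount(u, "API_KEY")
		if err != nil {
			fmt.Println("instance unreachable:", u, err)
			continue
		}
		counts[u] = c
	}
	// If instances disagree on the node count, one of them is likely stuck.
	fmt.Println("node counts per instance:", counts)
}
```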
Please try https://github.com/centrifugal/centrifugo/releases/tag/v5.4.9 – I believe it should fix the issue 🤞
I have updated to v5.4.9.
We are experiencing periodic connection issues between one of the Centrifugo instances (AWS ECS running docker image centrifugo/centrifugo:v5.4) and the ElastiCache OSS cluster Redis:7.1.0. Several times a week, one of the running Centrifugo instances starts to log a “context deadline exceeded” error with the message "error adding subscription". At that time the metrics showed a memory leak, so I did profiling during the incident; the full files are attached below.
It looks like there is no timeout either in the rueidis dedicatedClusterClient or in the centrifuge RedisBroker, so a missing Redis response blocks the channel forever.
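To illustrate the failure mode in general terms – this is a generic Go sketch, not rueidis or centrifuge code – a goroutine that waits on a reply channel without a deadline hangs forever when the reply never arrives, while bounding the wait with a context turns it into the observed “context deadline exceeded” error:

```go
// Generic illustration of the failure mode described above: without a timeout,
// waiting for a reply on a channel blocks forever if the reply never comes;
// bounding the wait with a context deadline yields an error instead of a hung
// goroutine. Simplified sketch, not rueidis or centrifuge internals.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// awaitReply waits for a reply but gives up when the context expires.
func awaitReply(ctx context.Context, reply <-chan string) (string, error) {
	select {
	case r := <-reply:
		return r, nil
	case <-ctx.Done():
		return "", ctx.Err() // e.g. context.DeadlineExceeded
	}
}

func main() {
	reply := make(chan string) // the reply never arrives in this example

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	if _, err := awaitReply(ctx, reply); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("error adding subscription:", err) // bounded wait instead of a frozen goroutine
	}
}
```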
Centrifugo version is 5.4
Client library connects with uni-grpc
The environment is AWS ECS running docker image centrifugo/centrifugo:v5.4 (cluster) and AWS ElastiCache OSS cluster Redis:7.1.0 as the Redis engine.
goroutine1.txt
goroutine2.txt