[bug] centrifugo node freezes periodically #925

Open
z7zmey opened this issue Dec 17, 2024 · 9 comments

z7zmey commented Dec 17, 2024

We are experiencing periodic connection issues between one of our Centrifugo instances (AWS ECS running the docker image centrifugo/centrifugo:v5.4) and an ElastiCache OSS cluster running Redis 7.1.0. Several times a week, one of the running Centrifugo instances starts logging a “context deadline exceeded” error with the message "error adding subscription". At that time the metrics showed a memory leak, so I profiled the node during the incident; the full profiles are attached below.

It looks like there is no timeout either in the rueidis dedicatedClusterClient or in the centrifuge RedisBroker, so a missing Redis response blocks the channel receive forever.

goroutine 1690572 [chan receive, 43 minutes]:
github.com/redis/rueidis.(*pipe).Do(0xc000a6d200, {0x21a3b48, 0x2fa1900}, {0xc00882bca0, 0x3000, 0x4000})
	/home/runner/go/pkg/mod/github.com/redis/[email protected]/pipe.go:853 +0x588
github.com/redis/rueidis.(*dedicatedClusterClient).Do(0xc0001d1b60, {0x21a3b48, 0x2fa1900}, {0xc00882bca0, 0x3000, 0x4000})
	/home/runner/go/pkg/mod/github.com/redis/[email protected]/cluster.go:1219 +0xfa
github.com/centrifugal/centrifuge.(*RedisBroker).subscribe(0xc0003e2c00, 0xc0000333b0, {0xc008c00000, 0x25})
	/home/runner/go/pkg/mod/github.com/centrifugal/[email protected]/broker_redis.go:862 +0x471
github.com/centrifugal/centrifuge.(*RedisBroker).Subscribe(0xc000ab2008?, {0xc008c00000?, 0x25?})
	/home/runner/go/pkg/mod/github.com/centrifugal/[email protected]/broker_redis.go:837 +0x85
github.com/centrifugal/centrifuge.(*Node).addSubscription(0xc000508008, {0xc008c00000, 0x25}, {0xc00a203e00?, {0x0?, 0x0?}})
	/home/runner/go/pkg/mod/github.com/centrifugal/[email protected]/node.go:1015 +0x409
github.com/centrifugal/centrifuge.(*Client).subscribeCmd(_, _, {{0x0, {0x0, 0x0, 0x0}, 0x0, 0x0, 0x0, 0x0, ...}, ...}, ...)
	/home/runner/go/pkg/mod/github.com/centrifugal/[email protected]/client.go:2854 +0x5cc
github.com/centrifugal/centrifuge.(*Client).connectCmd.func1({0xc008c00000, 0x25}, {0x0, {0x0, 0x0, 0x0}, 0x0, 0x0, 0x0, 0x0, ...})
	/home/runner/go/pkg/mod/github.com/centrifugal/[email protected]/client.go:2441 +0x20f
created by github.com/centrifugal/centrifuge.(*Client).connectCmd in goroutine 1690577
	/home/runner/go/pkg/mod/github.com/centrifugal/[email protected]/client.go:2431 +0x13a5
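
To illustrate the concern (this is a generic sketch, not rueidis or Centrifugo code): if the blocking call were wrapped so that the caller waits with a deadline, a lost Redis reply would surface as an error instead of parking the goroutine forever. The doWithTimeout helper and errSubscribeTimeout names below are hypothetical.

package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

var errSubscribeTimeout = errors.New("subscribe timed out")

// doWithTimeout runs fn in its own goroutine and stops waiting once the
// context is done, so the caller is never blocked forever. Note that the
// goroutine itself may still be stuck inside fn; the timeout only unblocks
// the caller, it does not cancel the underlying operation.
func doWithTimeout(ctx context.Context, fn func() error) error {
	done := make(chan error, 1)
	go func() { done <- fn() }()
	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return errSubscribeTimeout
	}
}

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// Simulate a SUBSCRIBE whose reply never arrives.
	err := doWithTimeout(ctx, func() error {
		time.Sleep(time.Hour)
		return nil
	})
	fmt.Println(err) // prints: subscribe timed out
}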

Centrifugo version is 5.4.
Clients connect using the uni_grpc transport.
The environment is AWS ECS running the docker image centrifugo/centrifugo:v5.4 (a cluster of nodes) and AWS ElastiCache OSS cluster Redis 7.1.0 as the Redis engine.

goroutine1.txt
goroutine2.txt

FZambia (Member) commented Dec 17, 2024

Hello @z7zmey

AWS ElastiCache runs a Redis fork, so it may have some differences that need to be addressed in a special way.

Please try the latest Centrifugo v5.4.8 – it looks like your image is cached at a previous Centrifugo version that depended on rueidis v1.0.48. v5.4.8 depends on rueidis v1.0.51, which includes fixes that may be relevant.

If the latest version does not help with the issue, let me know and we can continue investigating, and probably bring this to the rueidis repo. But first we need to make sure the issue reproduces with the latest version of rueidis.

FZambia (Member) commented Dec 17, 2024

@z7zmey please also post your Centrifugo config.

z7zmey (Author) commented Dec 17, 2024

I have updated to v5.4.8. It could take a few days to confirm whether it helps.

The application is configured with the following environment variables:

CENTRIFUGO_ADMIN_PASSWORD=***
CENTRIFUGO_TOKEN_HMAC_SECRET_KEY=***
CENTRIFUGO_API_KEY=***
CENTRIFUGO_ADMIN_SECRET=true
CENTRIFUGO_ADMIN=true
CENTRIFUGO_DEBUG=true
CENTRIFUGO_LOG_LEVEL=error
CENTRIFUGO_UNI_GRPC=true
CENTRIFUGO_USER_SUBSCRIBE_TO_PERSONAL=true
CENTRIFUGO_HISTORY_SIZE=10
CENTRIFUGO_HISTORY_TTL=300s
CENTRIFUGO_FORCE_RECOVERY=true
CENTRIFUGO_PROMETHEUS=true
CENTRIFUGO_ENGINE=redis
CENTRIFUGO_REDIS_CLUSTER_ADDRESS=notifications-cluster.pnqyhf.clustercfg.euw1.cache.amazonaws.com:6379
CENTRIFUGO_PROXY_GRPC_METADATA=authorization client-id client-version debug
CENTRIFUGO_PROXY_CONNECT_ENDPOINT=http://api.ecs/centrifugo/connect
CENTRIFUGO_PROXY_CONNECT_TIMEOUT=60s
CENTRIFUGO_NAMESPACES=[{"allow_subscribe_for_client":true,"force_recovery":true,"history_size":10,"history_ttl":"300s","name":"public"},...]

FZambia (Member) commented Dec 18, 2024

I took a closer look at the goroutine profiles, and it seems the update to v5.4.8 won't help much here.

I was able to reproduce this case. It happens when a connection has several server-side subscriptions with recovery on, and there is a deadlock caused by recovery synchronization with PUB/SUB. It deadlocks with a goroutine profile very similar to the one provided. It's indirectly related to the rueidis client, but the main problem is in Centrifugo code, which acquires locks in an order that blocks the rueidis reader loop.
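
A contrived sketch of that kind of deadlock (my own illustration of the explanation above, not actual Centrifugo or rueidis code): one goroutine holds a lock while waiting for a reply that only the reader loop can deliver, and the reader loop is blocked trying to acquire that same lock.

package main

import (
	"fmt"
	"sync"
	"time"
)

func main() {
	var mu sync.Mutex
	reply := make(chan string) // unbuffered: only the "reader loop" can deliver it

	// Subscriber side: take the lock first, then wait for the reader's reply.
	mu.Lock()

	// "Reader loop": must take the same lock to dispatch an incoming PUB/SUB
	// message before it ever gets to deliver the subscribe reply.
	go func() {
		mu.Lock() // blocked: the subscriber above already holds mu
		fmt.Println("dispatching PUB/SUB message")
		mu.Unlock()
		reply <- "subscribe OK" // never reached: the lock above is never acquired
	}()

	select {
	case r := <-reply:
		fmt.Println(r)
	case <-time.After(2 * time.Second):
		// In the real incident there is no timeout, so this wait lasts forever
		// (the "chan receive, 43 minutes" goroutine in the profile).
		fmt.Println("deadlocked: reply never arrives while we hold the lock")
	}
}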

I have a fix idea and have already experimented with it before (though for different reasons) – processing PUB/SUB messages through an additional queue on the Centrifugo node. It looks like it's time to apply it, because it should effectively fix this problem and prevent similar ones in the future. And while this approach may introduce some additional latency in PUB/SUB message processing, it may help keep latencies smaller for other Redis operations, like subscribe/unsubscribe, etc.
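
A minimal sketch of that queue idea (my own illustration, assuming a simple buffered channel drained by a dedicated goroutine; not the actual patch): the reader loop only enqueues messages, and the handler that takes node-level locks runs elsewhere, so it can no longer block the reader.

package main

import "fmt"

type pubSubMessage struct {
	channel string
	payload []byte
}

type pubSubQueue struct {
	messages chan pubSubMessage
}

// newPubSubQueue starts a single worker goroutine that drains the queue and
// calls handle for every message. handle may acquire node-level locks; the
// reader loop that enqueues messages is not affected by that.
func newPubSubQueue(size int, handle func(pubSubMessage)) *pubSubQueue {
	q := &pubSubQueue{messages: make(chan pubSubMessage, size)}
	go func() {
		for msg := range q.messages {
			handle(msg)
		}
	}()
	return q
}

// enqueue is what the Redis reader loop would call: it only blocks if the
// buffer is full, never on handler locks.
func (q *pubSubQueue) enqueue(msg pubSubMessage) {
	q.messages <- msg
}

func main() {
	handled := make(chan struct{})
	q := newPubSubQueue(1024, func(m pubSubMessage) {
		fmt.Printf("handling %s: %s\n", m.channel, m.payload)
		close(handled)
	})
	q.enqueue(pubSubMessage{channel: "public:news", payload: []byte("hello")})
	<-handled
}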

I'll start working on the fix, but still let me know if you notice the problem reproducing with v5.4.8 – that will give more confidence that I am addressing the right scenario.

z7zmey (Author) commented Dec 18, 2024

Thanks. I'll do fresh profiling when I detect an incident.

z7zmey (Author) commented Dec 18, 2024

New goroutine profiles for v5.4.8:

goroutine1.txt
goroutine2.txt

z7zmey (Author) commented Dec 18, 2024

One more thing I forgot to mention: during incidents, the troublesome node is no longer visible to the healthy Centrifugo nodes. At the same time, the problematic node cannot obtain the node list from Redis and continues to report the same number of nodes in its metrics.

With 4 Centrifugo nodes, I can detect incidents when the healthy nodes see 3 nodes while the unhealthy node still reports 4.

FZambia (Member) commented Dec 23, 2024

Please try https://github.com/centrifugal/centrifugo/releases/tag/v5.4.9 – I believe it should fix the issue 🤞

z7zmey (Author) commented Dec 23, 2024

I have updated to v5.4.9.
