[bug] centrifugo node freezes periodically #925
Hello @z7zmey AWS ElastiCache operates using a Redis fork – so it may have some differences which should be addressed in a special way. Please try using the latest Centrifugo v5.4.8 – it looks like your image is cached from some previous Centrifugo version which depended on rueidis v1.0.48. v5.4.8 depends on rueidis v1.0.51 – there were some fixes which may be relevant. If the latest version does not help with the issue, let me know and we can continue investigating, and probably try to address this in the rueidis repo. But we need to make sure the issue reproduces with the latest version of rueidis first.
@z7zmey please also post your Centrifugo config.
I have updated it to v5.4.8. It could take a few days to confirm whether it helps. The application is configured using these environment variables.
I took a closer look at the goroutine profiles, and it seems the update to v5.4.8 won't help much here. I was able to reproduce this case. It happens when a connection has several server-side subscriptions with recovery on, and there is a deadlock due to recovery sync with PUB/SUB. It deadlocks with a goroutine profile very similar to the one provided. It's indirectly related to the rueidis client, but the main problem is in Centrifugo code which acquires locks in an order that blocks the rueidis reader loop.

I have a fix idea and already did some experiments with it before (though for different reasons) – processing PUB/SUB messages using an additional queue on the Centrifugo node. Looks like it's time to apply it, because it should effectively fix this problem and prevent similar ones in the future. And while this approach may introduce some additional latency in PUB/SUB message processing, it may help keep latencies smaller for other Redis operations, like subscribe/unsubscribe, etc. I'll start working on the fix, but still let me know if you notice the problem reproducing with v5.4.8 – that will give more confidence that I am addressing the right scenario.
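For illustration, here is a minimal Go sketch of the queue idea described above: the Redis reader callback only enqueues messages, and a separate goroutine applies them to node state, so the reader loop never waits on node-level locks. The types and names here (pubSubMessage, node, processQueue) are illustrative, not the actual Centrifugo/centrifuge internals.

```go
// Minimal sketch of the "additional queue" idea: the Redis PUB/SUB callback only
// enqueues messages, and a separate goroutine applies them to node state. The
// reader loop therefore never blocks on node-level locks. Names are illustrative,
// not the actual Centrifugo internals.
package main

import (
	"fmt"
	"sync"
)

type pubSubMessage struct {
	Channel string
	Data    []byte
}

type node struct {
	mu    sync.Mutex
	state map[string]int // per-channel counters guarded by mu

	queue chan pubSubMessage
}

func newNode() *node {
	n := &node{
		state: make(map[string]int),
		queue: make(chan pubSubMessage, 1024), // buffered so the reader rarely blocks
	}
	go n.processQueue()
	return n
}

// onRedisMessage is what the Redis client's reader loop would call. It must be
// cheap and must not take n.mu, otherwise the reader loop can deadlock with
// code paths that hold n.mu while waiting on Redis (the scenario in this issue).
func (n *node) onRedisMessage(msg pubSubMessage) {
	n.queue <- msg
}

// processQueue applies messages to node state in a separate goroutine, so lock
// acquisition here can never stall the Redis reader loop.
func (n *node) processQueue() {
	for msg := range n.queue {
		n.mu.Lock()
		n.state[msg.Channel]++
		n.mu.Unlock()
	}
}

func main() {
	n := newNode()
	n.onRedisMessage(pubSubMessage{Channel: "news", Data: []byte("hello")})
	fmt.Println("enqueued without touching node locks in the reader path")
}
```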
Thanks. I'll perform fresh profiling when I detect an incident.
New profiling for v5.4.8:
One more thing I forgot to mention: during incidents, the troublesome node is no longer visible to the healthy Centrifugo nodes. On the other hand, the problematic node cannot obtain the node list from Redis and continues to show the same number of nodes in its metrics. With 4 Centrifugo nodes, I can detect incidents when the healthy nodes see 3 nodes while the unhealthy node still reports 4.
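As a side note, this split view (healthy nodes seeing 3 while the stuck one reports 4) can be detected automatically. Below is a hedged Go sketch that polls each instance for its self-reported node count and compares them; the /api/info endpoint, X-API-Key header, response shape and instance URLs are assumptions about the Centrifugo v5 server HTTP API and this deployment, so adjust them to what is actually exposed.

```go
// Hedged sketch: poll each Centrifugo instance for the number of nodes it sees
// and flag a split view. The /api/info endpoint, X-API-Key header and response
// shape are assumptions based on the Centrifugo v5 server HTTP API; adjust to
// the deployed version.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type infoResult struct {
	Result struct {
		Nodes []json.RawMessage `json:"nodes"`
	} `json:"result"`
}

func nodeCount(apiURL, apiKey string) (int, error) {
	req, err := http.NewRequest("POST", apiURL, bytes.NewBufferString("{}"))
	if err != nil {
		return 0, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("X-API-Key", apiKey) // assumed auth header, see note above

	client := &http.Client{Timeout: 5 * time.Second}
	resp, err := client.Do(req)
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()

	var out infoResult
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		return 0, err
	}
	return len(out.Result.Nodes), nil
}

func main() {
	// Placeholder instance addresses – replace with the real ECS task endpoints.
	instances := []string{
		"http://centrifugo-1:8000/api/info",
		"http://centrifugo-2:8000/api/info",
	}
	counts := map[string]int{}
	for _, u := range instances {
		c, err := nodeCount(u, "API_KEY")
		if err != nil {
			fmt.Println("instance unreachable:", u, err)
			continue
		}
		counts[u] = c
	}
	// If instances disagree on the node count, one of them is likely stuck.
	fmt.Println("node counts per instance:", counts)
}
```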
Please try https://github.com/centrifugal/centrifugo/releases/tag/v5.4.9 – I believe it should fix the issue 🤞
I have updated to v5.4.9.
We are experiencing periodic connection issues between one of the Centrifugo instances (AWS ECS running docker image centrifugo/centrifugo:v5.4) and the ElastiCache OSS cluster Redis:7.1.0. Several times a week, one of the running Centrifugo instances starts to log a “context deadline exceeded” error with the message "error adding subscription". At that time the metrics showed a memory leak, so I did profiling during the incident; the full files are attached below.
It looks like there is no timeout either in the rueidis dedicatedClusterClient or in the centrifuge RedisBroker, so a missing Redis response blocks the channel forever.
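To illustrate the failure mode in general terms – this is a generic Go sketch, not rueidis or centrifuge code – a goroutine that waits on a reply channel without a deadline hangs forever when the reply never arrives, while bounding the wait with a context turns it into the observed “context deadline exceeded” error:

```go
// Generic illustration of the failure mode described above: without a timeout,
// waiting for a reply on a channel blocks forever if the reply never comes;
// bounding the wait with a context deadline yields an error instead of a hung
// goroutine. Simplified sketch, not rueidis or centrifuge internals.
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// awaitReply waits for a reply but gives up when the context expires.
func awaitReply(ctx context.Context, reply <-chan string) (string, error) {
	select {
	case r := <-reply:
		return r, nil
	case <-ctx.Done():
		return "", ctx.Err() // e.g. context.DeadlineExceeded
	}
}

func main() {
	reply := make(chan string) // the reply never arrives in this example

	ctx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
	defer cancel()

	if _, err := awaitReply(ctx, reply); errors.Is(err, context.DeadlineExceeded) {
		fmt.Println("error adding subscription:", err) // bounded wait instead of a frozen goroutine
	}
}
```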
Centrifugo version is 5.4
Client library connects with uni-grpc
The environment is AWS ECS running docker image centrifugo/centrifugo:v5.4 (cluster) and AWS ElastiCache OSS cluster Redis:7.1.0 as the Redis engine.
goroutine1.txt
goroutine2.txt