AWS ElastiCache Redis maintenance triggered disconnect from cluster #479

rtkcthomson · 2021-09-03T01:47:45Z

We are using AWS ElastiCache Redis in cluster mode with master and slave node pairs, accessed through a RedisCluster instance configured with read_from_replicas=True. During recent maintenance performed by AWS on the cache, one of our applications was effectively disconnected from the cache and left in a state where it had no mapping of keys to nodes, resulting in KeyErrors. Mixed in with the KeyErrors were the following AttributeErrors, which seem to indicate that the connection to the cluster was None:

...
File "/opt/awn/venv/site-packages/redis/client.py", line 1606, in get
return self.execute_command('GET', name)
File "/opt/awn/venv/site-packages/rediscluster/client.py", line 555, in execute_command
return self._execute_command(*args, **kwargs)
File "/opt/awn/venv/site-packages/rediscluster/client.py", line 713, in _execute_command
connection.disconnect()
AttributeError: 'NoneType' object has no attribute 'disconnect'

The odd thing is that we have multiple services using the same libraries and a similar connection setup, each with a similarly provisioned ElastiCache Redis cluster in AWS. The same maintenance was performed on the other ElastiCache Redis clusters, but only one of our services ended up in this odd state.

Sequence of errors during AWS maintenance for a service that ended up in a disconnected state:

ConnectionError
ClusterDownError
KeyError
AttributeError
where the KeyError and AttributeErrors persisted until we restarted the service instances.
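
As an in-process alternative to restarting, one option is a wrapper that rebuilds the client when these errors appear. The following is a minimal sketch, not something provided by the library: `ReconnectingClient`, `make_client`, and `FlakyClient` are hypothetical names, and in a real service `make_client` is assumed to return a freshly constructed cluster client.

```python
import threading


class ReconnectingClient:
    """Hypothetical wrapper: rebuilds the underlying client after a stale-state error."""

    # The error types reported above as persisting until a restart.
    STALE_ERRORS = (KeyError, AttributeError)

    def __init__(self, make_client):
        self._make_client = make_client  # assumed factory returning a new client
        self._lock = threading.Lock()
        self._client = make_client()
        self.rebuilds = 0

    def _rebuild(self, failed):
        with self._lock:
            # Rebuild only if no other thread has already replaced the failed client.
            if self._client is failed:
                self._client = self._make_client()
                self.rebuilds += 1

    def get(self, name):
        client = self._client
        try:
            return client.get(name)
        except self.STALE_ERRORS:
            self._rebuild(client)
            # Retry once on the fresh client; a second failure propagates.
            return self._client.get(name)


class FlakyClient:
    """Stand-in for a cluster client, used only to exercise the wrapper."""

    def __init__(self, broken):
        self.broken = broken

    def get(self, name):
        if self.broken:
            raise AttributeError("'NoneType' object has no attribute 'disconnect'")
        return "value-for-" + name


# First client simulates the stuck state, second one a healthy replacement.
clients = iter([FlakyClient(broken=True), FlakyClient(broken=False)])
cache = ReconnectingClient(lambda: next(clients))
result = cache.get("user:1")
```

Retrying only once keeps a genuinely broken cluster from causing a rebuild loop; whether rebuilding is safe for a given workload is something to verify before relying on it.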

Sequence of errors during AWS maintenance for a service that recovered by itself:

ConnectionError
ClusterDownError
MovedErrors
notably no or very few KeyError or AttributeErrors.

Has anyone else seen similar behaviour?

Grokzen · 2021-09-06T09:14:46Z

Maintainer

In some older versions of this lib there have been issues where one thread manipulated or destroyed the connection object while other threads did not properly check whether that object had been removed or become invalid. There are now several places where this is checked before running the functions, to avoid this exception. If you are not running the latest version you should try it, and if you find new places where this None check is missing then we should add it.
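
The None check described here amounts to reading the shared reference once into a local variable and testing it before use. A minimal sketch of that pattern, using hypothetical names (`FakeConnection`, `Holder`, `safe_disconnect`) rather than the library's actual code:

```python
import threading


class FakeConnection:
    """Stand-in for a connection object; not the library's class."""

    def __init__(self):
        self.closed = False

    def disconnect(self):
        self.closed = True


class Holder:
    """Holds a connection that another thread may set to None at any time."""

    def __init__(self, connection):
        self.connection = connection
        self._lock = threading.Lock()

    def release(self):
        # Simulates another thread destroying the connection.
        with self._lock:
            self.connection = None

    def safe_disconnect(self):
        # Take one snapshot under the lock, then check it before use,
        # so a concurrent release() cannot lead to an AttributeError.
        with self._lock:
            connection = self.connection
        if connection is None:
            return False
        connection.disconnect()
        return True


conn = FakeConnection()
holder = Holder(conn)
first = holder.safe_disconnect()   # connection present, disconnect runs
holder.release()
second = holder.safe_disconnect()  # connection already gone, no exception
```

The key point is that the check and the call operate on the same local snapshot; checking `self.connection` and then calling `self.connection.disconnect()` would leave a window for another thread to clear it in between.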

3 replies

rtkcthomson Sep 14, 2021
Author

I've confirmed that we are using version 2.1.3.

Grokzen Sep 24, 2021
Maintainer

@rtkcthomson I can't really help you any further. I do not use AWS myself, so I can't really test any of this out. You will have to dig further into the issues when they occur on your end and see what the cause is. Most likely there is a threading issue, but those are so darn difficult to dig into and clean out.

rtkcthomson Sep 27, 2021
Author

Thanks @Grokzen, we will dig more the next time we encounter the problem.
