fix: tcp_keepalive socket #3140
|
This is also impacting me. Unfortunately we are invoking Lambda from ECS via AWS Batch, which doesn't support adding these new options in the task definition yet. |
|
This issue is same for me. |
|
experiencing similar issues |
|
Hi @nateprewitt / @jonathan343 / @alexgromero / @SamRemis, I have an Airflow instance running on AWS and I'm using the Airflow I’ve tested the changes of this PR, and it is working as expected, handling Lambda functions that take up to 15 minutes to run without issues. Is there anything else needed for the review and merging process? I would appreciate any feedback and updates on its status. Thank you! |
|
bump. any movement on this PR? my 200s lambda sync invocations are constantly failing with |
|
Bump. Any news? |
|
just tagging active contributors on the repo to get some attention: |
|
One-year anniversary for this PR; I still need to use this workaround. Also tagging active contributors: |
|
Bumping this issue too, as I'm having the same problems with Lambdas that take more than 350 seconds. |
|
We are also facing this issue and it's impacting our production pipelines. |
|
While this is definitely worth addressing, these new defaults are significantly different from the old behavior and this would apply to every customer who has opted in to TCP keepalive. Merging this could break existing customer workflows for users who are relying on the current default configurations, and it wouldn't give them any ability to opt out back into the old behavior. To preserve backwards compatibility, I'd be more in favor of making this an opt in client level configuration - the other solution proposed in the description of this PR. I will bring this up to the botocore team to get some more thoughts and see where others stand. |
|
Also facing the same issue when waiting for a reply from a Lambda triggered by an MWAA-based DAG. Applying the proposed fix has resolved the issue. |
1efb793 to c922afb
c922afb to 6ec1b92
|
Hey @SamRemis, I've refactored the code to preserve backwards compatibility and make it configuration based. ☕ |
|
Bumping since this is affecting me too. @SamRemis thoughts on merging this now? Seems like the fix is backwards compatible now |
When using the botocore.config.Config option tcp_keepalive=True, the TCP socket is configured with the keepalive socket option (socket.SO_KEEPALIVE). By default, Linux sets the TCP keepalive time parameter to 7200 seconds, which exceeds the AWS NAT Gateway default idle timeout of 350 seconds [source]. This limitation leads to an inability to receive a response from a Lambda function under the following conditions:
Therefore, by configuring socket.TCP_KEEPIDLE, socket.TCP_KEEPINTVL and socket.TCP_KEEPCNT when tcp_keepalive is enabled during the _compute_socket_options function call, we can overcome this limitation. socket.IPPROTO_TCP is used to support cross-platform compatibility. The code submitted automatically calculates these values based on the read timeout. Another option would be to have them supplied in the scope/client object.
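For illustration only, the idea can be sketched with the standard library alone. This is not the code submitted in this PR: the helper names keepalive_socket_options and apply_options are hypothetical, and the 60/20/5 values are assumed placeholders chosen only so that the first probe goes out well before the 350 second idle timeout. The getattr guard is there because not every platform exposes all three TCP_KEEP* constants.

```python
import socket

# NAT Gateway idle timeout cited in the description above, in seconds.
NAT_IDLE_TIMEOUT = 350


def keepalive_socket_options(idle=60, interval=20, count=5):
    """Return (level, optname, value) tuples that enable tuned TCP keepalive.

    The default values are illustrative assumptions, not the values
    computed by the PR.
    """
    if idle >= NAT_IDLE_TIMEOUT:
        raise ValueError("idle must be shorter than the NAT idle timeout")

    # SO_KEEPALIVE switches keepalive probing on for the socket.
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]

    # Add the tuning options only where the platform defines them.
    tuning = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between probes
        ("TCP_KEEPCNT", count),       # failed probes before the connection is dropped
    )
    for name, value in tuning:
        optname = getattr(socket, name, None)
        if optname is not None:
            options.append((socket.IPPROTO_TCP, optname, value))
    return options


def apply_options(sock, options):
    """Apply each (level, optname, value) tuple to an existing socket."""
    for level, optname, value in options:
        sock.setsockopt(level, optname, value)


if __name__ == "__main__":
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        apply_options(sock, keepalive_socket_options())
        enabled = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)
        print("keepalive enabled:", enabled != 0)
```

With an idle time below 350 seconds, probes keep the NAT mapping alive while the client waits on a long-running synchronous invocation.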
Fixes issues: boto/boto3#2424, boto/boto3#2510 and #2916.
Fargate recently had a similar solution implemented to support this use case: https://aws.amazon.com/blogs/containers/announcing-additional-linux-controls-for-amazon-ecs-tasks-on-aws-fargate/.