fix: tcp_keepalive socket #3140

Open
ShaneNolan wants to merge 2 commits into boto:develop from ShaneNolan:fix/keepalive-socket

Conversation

@ShaneNolan

When the botocore.config.Config option tcp_keepalive=True is used, the TCP socket is configured with the keepalive socket option (socket.SO_KEEPALIVE). By default, Linux sets the TCP keepalive time parameter to 7200 seconds, which exceeds the AWS NAT Gateway default timeout of 350 seconds [source].
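
To make the mismatch concrete, here is a minimal sketch that reads the kernel's keepalive idle time on Linux and compares it with the 350-second figure quoted above. The helper names `read_keepalive_time` and `probes_arrive_too_late` are hypothetical and are not part of botocore; the sysctl path only exists on Linux.

```python
from pathlib import Path
from typing import Optional

# Idle timeout quoted in the description above (seconds).
NAT_GATEWAY_IDLE_TIMEOUT = 350

# Linux exposes the system-wide keepalive idle time here (seconds).
LINUX_KEEPALIVE_TIME = Path("/proc/sys/net/ipv4/tcp_keepalive_time")


def read_keepalive_time(path: Path = LINUX_KEEPALIVE_TIME) -> Optional[int]:
    """Return the kernel keepalive idle time, or None if it cannot be read."""
    try:
        return int(path.read_text().strip())
    except (OSError, ValueError):
        # Not Linux, no permission, or unexpected file content.
        return None


def probes_arrive_too_late(
    keepalive_time: int, idle_timeout: int = NAT_GATEWAY_IDLE_TIMEOUT
) -> bool:
    """True when the first keepalive probe would be sent only after the
    intermediary's idle timeout has already expired."""
    return keepalive_time >= idle_timeout
```

With the Linux default of 7200 seconds, `probes_arrive_too_late(7200)` is true, which is why enabling SO_KEEPALIVE alone does not keep the connection open through the NAT gateway.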

This limitation leads to an inability to receive a response from a Lambda function under the following conditions:

  • The Lambda function is invoked in synchronous mode (InvocationType='RequestResponse').
  • The invocation occurs within a VPC where a NAT gateway is required to access the internet from a private subnet.
  • The execution time of the Lambda function exceeds 350 seconds.

Therefore, by configuring socket.TCP_KEEPIDLE, socket.TCP_KEEPINTVL and socket.TCP_KEEPCNT during the _compute_socket_options function call when tcp_keepalive is enabled, we can overcome this limitation.

socket.IPPROTO_TCP is used to support cross platform compatibility.
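
As a rough illustration of the socket options named above, the following standard-library sketch builds a list of (level, option, value) tuples and applies them to a socket. It is not the code from this PR: `build_keepalive_options` and `apply_options` are hypothetical helpers, the numeric values are placeholders, and the `getattr` guards are an assumption to cope with platforms whose `socket` module lacks some of the TCP_KEEP* constants.

```python
import socket


def build_keepalive_options(idle, interval, count):
    """Return socket options that enable keepalive with explicit timings.

    idle:     seconds of silence before the first probe (TCP_KEEPIDLE)
    interval: seconds between unanswered probes (TCP_KEEPINTVL)
    count:    unanswered probes before the connection is dropped (TCP_KEEPCNT)
    """
    options = [(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)]
    wanted = (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", count),
    )
    for name, value in wanted:
        optname = getattr(socket, name, None)
        if optname is not None:  # skip constants this platform does not define
            options.append((socket.IPPROTO_TCP, optname, value))
    return options


def apply_options(sock, options):
    """Apply each (level, optname, value) tuple to an open socket."""
    for level, optname, value in options:
        sock.setsockopt(level, optname, value)
```

For example, `apply_options(sock, build_keepalive_options(60, 20, 3))` on a freshly created TCP socket asks the kernel to start probing after 60 idle seconds, well inside the 350-second window.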

The submitted code automatically calculates these values based on the read timeout. Another option would be to have them supplied by the user through the scope/client object.
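
The exact calculation is in the diff rather than in this description, so the following is only a hypothetical heuristic showing what deriving the three values from a read timeout could look like. The function name, the 350-second ceiling, the halving, and the default probe count are all assumptions made for illustration, not the PR's formula.

```python
from typing import NamedTuple


class KeepaliveParams(NamedTuple):
    idle: int      # seconds before the first probe
    interval: int  # seconds between probes
    count: int     # probes sent before giving up


def derive_keepalive_params(
    read_timeout: float, idle_ceiling: int = 350, count: int = 3
) -> KeepaliveParams:
    """Hypothetical heuristic: probe after half the read timeout, but never
    later than half of the intermediary's idle ceiling."""
    idle = max(1, min(int(read_timeout) // 2, idle_ceiling // 2))
    # Spread the probes so that they all fit inside one idle period.
    interval = max(1, idle // count)
    return KeepaliveParams(idle, interval, count)
```

Under these assumptions a 900-second read timeout yields an idle time of 175 seconds, so the first probe is sent well before a 350-second idle timeout can expire.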

Fixes issues: boto/boto3#2424, boto/boto3#2510 and #2916.

Fargate recently had a similar solution implemented to support this use case: https://aws.amazon.com/blogs/containers/announcing-additional-linux-controls-for-amazon-ecs-tasks-on-aws-fargate/.

@adammcdonagh

This is also impacting me. Unfortunately we are invoking Lambda from ECS via AWS Batch, which doesn't support adding these new options in the task definition yet.

@smasa1112

smasa1112 commented Oct 3, 2024

I have the same issue.
In my case, the connection hits a read timeout when an EC2 instance launched by CodeBuild invokes Lambda.
It works when the Lambda sleeps for 300 seconds, but a read timeout occurs when it sleeps for 450 seconds.
The CodeBuild EC2 instance is not attached to any VPC (the CodeBuild default).

@spawn-guy

experiencing similar issues

File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/botocore/retryhandler.py", line 247, in __call__
    return self._check_caught_exception(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/botocore/retryhandler.py", line 416, in _check_caught_exception
    raise caught_exception
  File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/aiobotocore/endpoint.py", line 181, in _do_get_response
    http_response = await self._send(request)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/aiobotocore/endpoint.py", line 294, in _send
    return await self.http_session.send(request)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/var/app/venv/staging-LQM1lest/lib64/python3.11/site-packages/aiobotocore/httpsession.py", line 261, in send
    raise ReadTimeoutError(endpoint_url=request.url, error=e)
botocore.exceptions.ReadTimeoutError: Read timeout on endpoint URL: "https://lambda.eu-west-1.amazonaws.com/2015-03-31/functions/redacted/invocations"

@pankajastro

pankajastro commented Oct 14, 2024

Hi @nateprewitt / @jonathan343 / @alexgromero / @SamRemis,

I have an Airflow instance running on AWS and I'm using the Airflow LambdaInvokeFunctionOperator to run AWS Lambda functions. When a Lambda function takes 5 minutes or longer to execute, we encounter a ReadTimeoutError. There is an issue in the Airflow repo with more information: apache/airflow#41498.

I’ve tested the changes of this PR, and it is working as expected, handling Lambda functions that take up to 15 minutes to run without issues. Is there anything else needed for the review and merging process? I would appreciate any feedback and updates on its status. Thank you!

@spawn-guy

bump. any movement on this PR? my 200s lambda sync invocations are constantly failing with botocore.ReadTimeoutError on Amazon Linux 2023

@rodrigofp-possiblefinance

rodrigofp-possiblefinance commented Jan 29, 2025

Bump. Any news?
I'm facing the same issue with a Lambda function invoked from Airflow through LambdaInvokeFunctionOperator.

@rawwar

rawwar commented Feb 13, 2025

just tagging active contributors on the repo to get some attention:
@alexgromero , @nateprewitt , @ubaskota

@ShaneNolan
Author

one-year anniversary for this PR; I still need to use this workaround.

also tagging active contributors:
@ubaskota @nateprewitt @arandito @SamRemis

@MartinBlanchard3012

MartinBlanchard3012 commented Mar 19, 2025

Bumping this issue too, as I'm having the same problems with Lambdas that take more than 350 seconds.
I'm forced to use requests with a specific configuration to circumvent the problem, so a workaround does exist.

@paperlinguist

We are also facing this issue and it's impacting our production pipelines.

@SamRemis
Contributor

While this is definitely worth addressing, these new defaults are significantly different from the old behavior and this would apply to every customer who has opted in to TCP keepalive. Merging this could break existing customer workflows for users who are relying on the current default configurations, and it wouldn't give them any ability to opt out back into the old behavior.

To preserve backwards compatibility, I'd be more in favor of making this an opt-in, client-level configuration, which is the other solution proposed in the description of this PR. I will bring this up with the botocore team to get some more thoughts and see where others stand.

@xaerto

xaerto commented Jun 22, 2025

Also facing the same issue when waiting for a reply from a Lambda triggered by an MWAA-based DAG. Applying the proposed fix has resolved the issue.

@ShaneNolan force-pushed the fix/keepalive-socket branch from 1efb793 to c922afb on July 16, 2025 17:57
@ShaneNolan force-pushed the fix/keepalive-socket branch from c922afb to 6ec1b92 on July 16, 2025 17:59
@ShaneNolan
Author

Hey @SamRemis, I've refactored the code to preserve backwards compatibility and make it configuration based. ☕

@kwikadi

kwikadi commented Jan 8, 2026

Bumping since this is affecting me too. @SamRemis, any thoughts on merging this now? It seems the fix is backwards compatible now.
