-
-
Notifications
You must be signed in to change notification settings - Fork 365
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Handle timeouts more gracefully by allowing the application to shutdown #895
base: master
Are you sure you want to change the base?
Conversation
Isn't every connection timeout an IO operation that PHP is waiting?
This seems to me like a strong enough limitation that may indicate it's not worth it to support it? It does bring a nice idea to the table though: what if we offer a Debugging Layer where you can write a basic function invocation and run the code that's getting timeout from it so that you might get a better idea where it's stopping? |
Hey, I wonder if something similar could be used in the PHP-FPM worker (the process handling the request)? That could be done as a Symfony/Laravel package? I'm not familiar enough with signals like that to know if that could work. Also another solution is to find a way for logs to be streamed correctly. Or else, find a way to force shorter timeouts on all connections in all frameworks, but I have no idea if that's even remotely possible. |
Yeah.. Each curl request, file load or
Im not sure.. I strongly believe in a future were all PHP applications are using a runtime. See #889.
Probably, yes.
Yes, it is called runtime/bref =)
I dont think that is possible. Or.. there is a difference between application logs and FPM logs. For the FPM logs => I have no idea.
I like the idea, But it does not solve the problem where 1/10000 request to my live application times out and I want to find out what happens. If the "debug layer" is not production ready, the I would not be happy adding it to my production environment. |
I reorganized things. I've updated I think this is a good way forward. I would like to get another review before I add tests and docs. |
You may also control this behavior with |
A PHP-FPM user would have to write something like: if (isset($_SERVER['LAMBDA_TASK_ROOT'])) {
Bref\Timeout\Timeout::enable();
} Preferably as soon as possible, ie |
Thanks, following our discussions I've opened Nyholm#3 to remove the option of configuring the duration timeout via an environment variable. |
* Always base the timeout duration on the configured Lambda timeout * Make sure that we can never enable the Bref timeout in the FPM bootstrap process * Remove unnecessary private method `timeoutAfter()` * Allow using falsy values for `BREF_FEATURE_TIMEOUT`, e.g. `0` or `''`, to allow disabling the feature * Cleanup global state * Improve tests following @Nyholm's feedback
The support PR is merged. This PR is now ready for a final rounds of review. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Im 👍
We don't mention that the logs from php-fpm also will be flushed, but that is a technical detail I think it is okey to skip.
error_log('Original stack trace'); | ||
error_log(Timeout::$stackTrace); | ||
Timeout::$stackTrace = null; | ||
exit(1); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we force a 500
status code before exit?
That would only be taken into account in the FPM layer, so that wouldn't really impact the php-XX
layer.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How?
And wouldn't that effect my AWS::ApiGateway::GatewayResponse
that triggers on timeouts or 500?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
http_response_code(500);
And wouldn't that effect my AWS::ApiGateway::GatewayResponse that triggers on timeouts or 500?
🤔 not sure about that, maybe. But in any case, the same would happen if your app returns a pretty 500 error page.
Does that make sense? What I mean is that, when using FPM, there would be no difference between:
- Symfony catches any exception and displays an error page with a 500 status code
- Bref timeout kicks in and forces a 500 status code
In both cases, API Gateway would behave the same way.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, no that is not true.
Symfony catches any exception and displays an error page with a 500 status code
This will show you a nice error page. API Gateway does nothing
Bref timeout kicks in and forces a 500 status code
No content(HTML) is provided. I would like the custom error page in API Gateway to be used.
Im not sure if http_response_code(500)
would make any difference.
Also note. We are not sure it is a HTTP context. It could be any other event.. So Im not sure about sending a HTTP status code.
error_log('Lambda timed out'); | ||
error_log((new LambdaTimeout)->getTraceAsString()); | ||
error_log('Original stack trace'); | ||
error_log(Timeout::$stackTrace); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, I'm testing and notice that error_log
adds noise in FPM:
NOTICE: PHP message: Lambda timed out
NOTICE: PHP message: #0 /var/task/demo/http.php(16): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}
NOTICE: PHP message: Original stack trace
NOTICE: PHP message: #0 /var/task/demo/http.php(9): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}
I'm changing that to:
$stderr = fopen('php://stderr', 'w');
fwrite($stderr, 'Lambda timed out' . PHP_EOL);
fwrite($stderr, (new LambdaTimeout)->getTraceAsString() . PHP_EOL);
fwrite($stderr, 'Original stack trace' . PHP_EOL);
fwrite($stderr, Timeout::$stackTrace . PHP_EOL);
fclose($stderr);
Which gives a cleaner log:
Lambda timed out
#0 /var/task/demo/http.php(16): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}
Original stack trace
#0 /var/task/demo/http.php(9): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}
No need to update the PR
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, I didnt know.
Thank you for testing.
OK following discussion through Slack, we were happy with the PR. However I tested and I'm getting weird results in FPM too:
I've been testing locally with PHP-FPM to pinpoint this, but my guess currently is that pcntl signals may not work well with FPM and its worker pool. Maybe we first set a signal handler on the process, and it's not correctly cleaned up by FPM on the next request cycle? I tried cleaning up manually the handlers, no changes. I tried putting the cleanup in a How to reproduce locally:
E.g. in my case: if (isset($_GET['sleep'])) {
echo 'Sleeping';
try {
sleep(6);
} catch (Throwable $e) {
echo 'CAUGHT!';
}
echo 'Sleeping';
sleep(2);
}
|
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug. With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection). If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which: - flushes the logs (logs end up in CloudWatch, which is great) - restarts a clean FPM worker, without doing a full FPM restart (which may take longer) Follow up of #770, #772, #895 May address some of #862 Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug. With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection). If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which: - flushes the logs (logs end up in CloudWatch, which is great) - restarts a clean FPM worker, without doing a full FPM restart (which may take longer) Follow up of #770, #772, #895 May address some of #862 Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
Since this wasn't working with FPM I went with a different route: #1133 This PR might still be worth following for the Function layer, so I'll leave it open. |
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug. With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection). If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which: - flushes the logs (logs end up in CloudWatch, which is great) - restarts a clean FPM worker, without doing a full FPM restart (which may take longer) Follow up of #770, #772, #895 May address some of #862 Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug. With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection). If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which: - flushes the logs (logs end up in CloudWatch, which is great) - restarts a clean FPM worker, without doing a full FPM restart (which may take longer) Follow up of #770, #772, #895 May address some of #862 Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
This PR builds on the work started by @t-geindre in #55.
If we are 1 second away from Lambda shutting down, then thrown an exception to let the application shutdown and (more importantly) flush the logs.
There are two things Im not super happy with:
This does not work for PHP-FPM since it the application run in a child processThe question is: do we want to continue down this path?
It would solve issues like #873 and #862.
TODO