Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle timeouts more gracefully by allowing the application to shutdown #895

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

Nyholm
Copy link
Contributor

@Nyholm Nyholm commented Mar 25, 2021

This PR builds on the work started by @t-geindre in #55.

If we are 1 second away from Lambda shutting down, then thrown an exception to let the application shutdown and (more importantly) flush the logs.

There are two things Im not super happy with:

  1. This does not work for PHP-FPM since it the application run in a child process
  2. I get a segmentation fault when SIGALRM is triggers while the application is waiting for IO (https://bugs.php.net/bug.php?id=80907)

The question is: do we want to continue down this path?

It would solve issues like #873 and #862.

TODO

  • Add docs
  • Do some manual testing to make sure this works as expected

@deleugpn
Copy link
Member

deleugpn commented Mar 25, 2021

I get a segmentation fault when SIGALRM is triggers while the application is waiting for IO

Isn't every connection timeout an IO operation that PHP is waiting?

This does not work for PHP-FPM since it the application run in a child process

This seems to me like a strong enough limitation that may indicate it's not worth it to support it? It does bring a nice idea to the table though: what if we offer a Debugging Layer where you can write a basic function invocation and run the code that's getting timeout from it so that you might get a better idea where it's stopping?
By making it an explicit debugging layer, we don't need to guarantee production-ready quality and perhaps we can even try to include default debugging statements that will get logged? In the end people will use this in order to figure out why their Lambda can't connect somewhere (VPC, Firewall, Security Group, NAT Gateway, Internet Gateway, NACL, etc) which has nothing to do with the code itself.

@mnapoli
Copy link
Member

mnapoli commented Mar 25, 2021

Hey, I wonder if something similar could be used in the PHP-FPM worker (the process handling the request)?

That could be done as a Symfony/Laravel package?

I'm not familiar enough with signals like that to know if that could work.

Also another solution is to find a way for logs to be streamed correctly. Or else, find a way to force shorter timeouts on all connections in all frameworks, but I have no idea if that's even remotely possible.

@Nyholm
Copy link
Contributor Author

Nyholm commented Mar 25, 2021

Isn't every connection timeout an IO operation that PHP is waiting?

Yeah.. Each curl request, file load or sleep is IO operations. But this is a PHP bug that I assume will be fixed soon. (I updated the PR description with a bug report).

This seems to me like a strong enough limitation that may indicate it's not worth it to support it

Im not sure.. I strongly believe in a future were all PHP applications are using a runtime. See #889.

I wonder if something similar could be used in the PHP-FPM worker

Probably, yes.

That could be done as a Symfony/Laravel package?

Yes, it is called runtime/bref =)

Also another solution is to find a way for logs to be streamed correctly.

I dont think that is possible. Or.. there is a difference between application logs and FPM logs. For the FPM logs => I have no idea.
For the application logs: Monolog buffers the logs and we need to tell the logger to flush the buffer (ie, destruct the logger). So it needs an exception to be thrown.

Debug layer

I like the idea, But it does not solve the problem where 1/10000 request to my live application times out and I want to find out what happens. If the "debug layer" is not production ready, the I would not be happy adding it to my production environment.

@Nyholm
Copy link
Contributor Author

Nyholm commented Mar 25, 2021

I reorganized things.

I've updated LambdaRuntime to get a bool parameter to enable throwing TimeoutException. This is enabled in the bootstrap files in php-xx and console. It is not enabled in the php-fpm. For FPM the user needs to update their application by adding Bref\Timeout\Timeout::enable(). This will read the environment variables and set the timeout.

I think this is a good way forward. I would like to get another review before I add tests and docs.

@Nyholm Nyholm changed the title POC handle timeouts Handle timeouts more gracefully by allowing the application to shutdown Mar 25, 2021
@Nyholm
Copy link
Contributor Author

Nyholm commented Mar 25, 2021

You may also control this behavior with BREF_TIMEOUT.
Value 0 is "auto". Value "-1" is disabled. Positive values sets the timeout.

src/Runtime/LambdaRuntime.php Outdated Show resolved Hide resolved
src/Runtime/LambdaRuntime.php Outdated Show resolved Hide resolved
tests/Timeout/TimeoutTest.php Outdated Show resolved Hide resolved
tests/Timeout/TimeoutTest.php Outdated Show resolved Hide resolved
@Nyholm
Copy link
Contributor Author

Nyholm commented Mar 26, 2021

A PHP-FPM user would have to write something like:

if (isset($_SERVER['LAMBDA_TASK_ROOT'])) {
    Bref\Timeout\Timeout::enable();
}

Preferably as soon as possible, ie Kernel::handle() or index.php.

src/Timeout/Timeout.php Outdated Show resolved Hide resolved
src/Timeout/Timeout.php Outdated Show resolved Hide resolved
@mnapoli
Copy link
Member

mnapoli commented Mar 26, 2021

Thanks, following our discussions I've opened Nyholm#3 to remove the option of configuring the duration timeout via an environment variable.

* Always base the timeout duration on the configured Lambda timeout

* Make sure that we can never enable the Bref timeout in the FPM bootstrap process

* Remove unnecessary private method `timeoutAfter()`

* Allow using falsy values for `BREF_FEATURE_TIMEOUT`, e.g. `0` or `''`, to allow disabling the feature

* Cleanup global state

* Improve tests following @Nyholm's feedback
@Nyholm
Copy link
Contributor Author

Nyholm commented Mar 30, 2021

The support PR is merged. This PR is now ready for a final rounds of review.

Copy link
Contributor Author

@Nyholm Nyholm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Im 👍

We don't mention that the logs from php-fpm also will be flushed, but that is a technical detail I think it is okey to skip.

error_log('Original stack trace');
error_log(Timeout::$stackTrace);
Timeout::$stackTrace = null;
exit(1);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we force a 500 status code before exit?

That would only be taken into account in the FPM layer, so that wouldn't really impact the php-XX layer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How?
And wouldn't that effect my AWS::ApiGateway::GatewayResponse that triggers on timeouts or 500?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

http_response_code(500);

And wouldn't that effect my AWS::ApiGateway::GatewayResponse that triggers on timeouts or 500?

🤔 not sure about that, maybe. But in any case, the same would happen if your app returns a pretty 500 error page.

Does that make sense? What I mean is that, when using FPM, there would be no difference between:

  • Symfony catches any exception and displays an error page with a 500 status code
  • Bref timeout kicks in and forces a 500 status code

In both cases, API Gateway would behave the same way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, no that is not true.

Symfony catches any exception and displays an error page with a 500 status code

This will show you a nice error page. API Gateway does nothing

Bref timeout kicks in and forces a 500 status code

No content(HTML) is provided. I would like the custom error page in API Gateway to be used.


Im not sure if http_response_code(500) would make any difference.


Also note. We are not sure it is a HTTP context. It could be any other event.. So Im not sure about sending a HTTP status code.

src/Timeout/Timeout.php Outdated Show resolved Hide resolved
error_log('Lambda timed out');
error_log((new LambdaTimeout)->getTraceAsString());
error_log('Original stack trace');
error_log(Timeout::$stackTrace);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the record, I'm testing and notice that error_log adds noise in FPM:

NOTICE: PHP message: Lambda timed out
NOTICE: PHP message: #0 /var/task/demo/http.php(16): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}
NOTICE: PHP message: Original stack trace
NOTICE: PHP message: #0 /var/task/demo/http.php(9): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}

I'm changing that to:

                $stderr = fopen('php://stderr', 'w');
                fwrite($stderr, 'Lambda timed out' . PHP_EOL);
                fwrite($stderr, (new LambdaTimeout)->getTraceAsString() . PHP_EOL);
                fwrite($stderr, 'Original stack trace' . PHP_EOL);
                fwrite($stderr, Timeout::$stackTrace . PHP_EOL);
                fclose($stderr);

Which gives a cleaner log:

Lambda timed out
#0 /var/task/demo/http.php(16): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}
Original stack trace
#0 /var/task/demo/http.php(9): Bref\Timeout\Timeout::Bref\Timeout\{closure}()
#1 {main}

No need to update the PR

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I didnt know.

Thank you for testing.

@mnapoli
Copy link
Member

mnapoli commented Mar 30, 2021

OK following discussion through Slack, we were happy with the PR.

However I tested and I'm getting weird results in FPM too:

  • 1st request: the timeouts work as expected
  • following requests: pcntl_alarm(...) is triggered, but our custom handler isn't called. That causes pending I/O (or sleep) to be interrupted, but without any exception being triggered

I've been testing locally with PHP-FPM to pinpoint this, but my guess currently is that pcntl signals may not work well with FPM and its worker pool. Maybe we first set a signal handler on the process, and it's not correctly cleaned up by FPM on the next request cycle?

I tried cleaning up manually the handlers, no changes. I tried putting the cleanup in a __destruct() to be sure it's executed, no changes. I'm not sure what to do anymore…

How to reproduce locally:

  • docker-compose up
  • add \Bref\Timeout\Timeout::enableInFpm(); at the top of demo/http.php
  • open http://localhost:8000/?sleep=1
  • play with demo/http.php in the sleep block (e.g. catch the exception)

E.g. in my case:

if (isset($_GET['sleep'])) {
    echo 'Sleeping';
    try {
        sleep(6);
    } catch (Throwable $e) {
        echo 'CAUGHT!';
    }
    echo 'Sleeping';
    sleep(2);
}
  • first request: I see CAUGHT!
  • next requests: the catch is not executed, yet the sleep() calls are interrupted (without errors/warnings)

mnapoli added a commit that referenced this pull request Jan 4, 2022
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug.

With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection).

If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which:

- flushes the logs (logs end up in CloudWatch, which is great)
- restarts a clean FPM worker, without doing a full FPM restart (which may take longer)

Follow up of #770, #772, #895

May address some of #862

Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
mnapoli added a commit that referenced this pull request Jan 4, 2022
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug.

With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection).

If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which:

- flushes the logs (logs end up in CloudWatch, which is great)
- restarts a clean FPM worker, without doing a full FPM restart (which may take longer)

Follow up of #770, #772, #895

May address some of #862

Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
@mnapoli
Copy link
Member

mnapoli commented Jan 4, 2022

Since this wasn't working with FPM I went with a different route: #1133

This PR might still be worth following for the Function layer, so I'll leave it open.

mnapoli added a commit that referenced this pull request Jan 4, 2022
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug.

With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection).

If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which:

- flushes the logs (logs end up in CloudWatch, which is great)
- restarts a clean FPM worker, without doing a full FPM restart (which may take longer)

Follow up of #770, #772, #895

May address some of #862

Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
mnapoli added a commit that referenced this pull request Feb 14, 2023
When Lambda times out with the PHP-FPM layer, the logs written by the PHP script are never flushed to stderr by PHP-FPM. That means they never reach CloudWatch, which makes timeouts really hard to debug.

With this change, Bref waits for the FPM response until 1 second before the actual Lambda timeout (via a connection timeout on the FastCGI connection).

If Bref reaches that point, it will ask PHP-FPM to gracefully restart the PHP-FPM worker, which:

- flushes the logs (logs end up in CloudWatch, which is great)
- restarts a clean FPM worker, without doing a full FPM restart (which may take longer)

Follow up of #770, #772, #895

May address some of #862

Note: this does not change anything for the Function layer (only affects FPM). Also this does not show a full stack track of the place in the code where the timeout happens (#895 did). Still it's an improvement over the current status.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants