Request handling periodically blocked #2

Open
barasimumatik opened this issue Nov 22, 2023 · 7 comments

Comments

@barasimumatik

barasimumatik commented Nov 22, 2023

As I mentioned in #1, I have a set of tests that makes requests to a PHP-based API, which in turn uses S3 for file storage.
Since the tests make actual requests (i.e. no functionality is mocked), each request has to finish within about one minute, or the nginx server in front of the API times out.

Now, the tests create a lot of requests to exercise the various API endpoints in different ways. The requests are made synchronously, so there should only be one request in flight at a time. Also, not all requests make use of S3, but those that do usually make multiple calls to S3.

For example, checking if a directory exists, creating it if it doesn't, and then uploading a file to that directory (while S3 is not a file system, the AWS PHP SDK has functionality that allows working with S3 as if it were a regular file system using familiar operations). This is why there are a lot of HeadObject operations in the log below (404 -> the file or directory does not exist).
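
To illustrate what that looks like in practice, here is a rough sketch of the filesystem-style usage via the SDK's stream wrapper (not my actual test code; the bucket, paths and client options are made up):

<?php
// Rough sketch only -- bucket name, paths and client options are made up.
require 'vendor/autoload.php';

$client = new \Aws\S3\S3Client([
    'region'   => 'us-east-1',
    'version'  => 'latest',
    'endpoint' => 'http://localstack:4566',
    'use_path_style_endpoint' => true,
    'credentials' => ['key' => 'test', 'secret' => 'test'],
]);

// The stream wrapper lets familiar filesystem functions operate on s3:// paths.
$client->registerStreamWrapper();

$dir = 's3://my-bucket/reports/2023/';

// is_dir()/mkdir() map roughly onto HeadObject (404 when the key is missing)
// and PutObject calls, which is why the log below is full of HeadObject => 404 entries.
if (!is_dir($dir)) {
    mkdir($dir);
}

// Uploading a "file" into the "directory" is another PutObject under the hood.
file_put_contents($dir . 'summary.csv', "id,total\n1,42\n");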

So, to the issue: when running the tests and watching the test script logs, it is clear that an API request is occasionally blocked for a while (the test log output stops). Most of the time it eventually returns a response (after whatever blocked it finishes).
Looking at the localstack logs, the output stops at the same time as the test's, and it seems to coincide with the Persisting state of service s3... log entry.

It's not clear what state is persisted, but it usually blocks the caller for a little while (a couple of seconds or so, not a big deal). The problem is that sometimes it is stuck for a much longer time (>1 minute), which is enough for the nginx server in front of the API to abort the request and return a 504 Gateway Timeout to the client (in this case, the test script). I verified this by raising the timeouts in the nginx server until I could run the same tests multiple times without a timeout.
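
For reference, the raised timeouts looked roughly like the snippet below (illustrative only, not my exact config; the upstream name and values are made up). nginx's default proxy_read_timeout is 60s, which matches the roughly one-minute window; if nginx talks to the API via FastCGI instead of HTTP, the analogous directive is fastcgi_read_timeout.

# Illustrative nginx snippet only -- upstream name and values are made up.
location / {
    proxy_pass http://api:8080;

    # proxy_read_timeout defaults to 60s, which is what produces the
    # 504 Gateway Timeout when a request is blocked for over a minute.
    proxy_connect_timeout 10s;
    proxy_read_timeout    300s;
    proxy_send_timeout    300s;
}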

In the logs below, I marked the point where the last test client request got blocked. After about a minute the request is aborted, but I believe the PHP script continues executing, which is why some S3 requests show up afterwards.

LocalStack version: 3.0.1.dev
LocalStack build date: 2023-11-21
LocalStack build git hash: f354fd43

2023-11-22T07:52:33.428  INFO --- [  MainThread] localstack_persist.state   : Loading persisted state of all services...
2023-11-22T07:52:33.428  INFO --- [  MainThread] localstack_persist.state   : Loading persisted state of service s3...
2023-11-22T07:52:33.984  INFO --- [  MainThread] localstack.utils.bootstrap : Execution of "_load_service_plugin" took 543.60ms
2023-11-22T07:52:34.053  INFO --- [-functhread4] hypercorn.error            : Running on https://0.0.0.0:4566 (CTRL + C to quit)
Ready.
2023-11-22T07:55:37.260  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 404 (NoSuchKey)
2023-11-22T07:55:37.334  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.PutObject => 200
2023-11-22T07:55:37.352  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 404 (NoSuchKey)
2023-11-22T07:55:37.369  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.PutObject => 200
2023-11-22T07:55:37.385  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 200
2023-11-22T07:55:37.411  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.CopyObject => 200
2023-11-22T07:55:38.177  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.ListObjects => 200
2023-11-22T07:55:38.462  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.DeleteObjects => 200
2023-11-22T07:55:38.730  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 404 (NoSuchKey)
2023-11-22T07:55:38.758  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.PutObject => 200
2023-11-22T07:55:38.780  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 404 (NoSuchKey)
2023-11-22T07:55:38.801  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.PutObject => 200
2023-11-22T07:55:38.819  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 200
2023-11-22T07:55:38.841  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.CopyObject => 200
2023-11-22T07:55:39.103  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.ListObjects => 200
2023-11-22T07:55:39.122  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.DeleteObjects => 200
2023-11-22T07:55:39.302  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 404 (NoSuchKey)
2023-11-22T07:55:39.319  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.PutObject => 200
2023-11-22T07:55:39.334  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 404 (NoSuchKey)
2023-11-22T07:55:39.351  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.PutObject => 200
2023-11-22T07:55:39.366  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 200
2023-11-22T07:55:39.381  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.CopyObject => 200
2023-11-22T07:55:43.991  INFO --- [ead-4 (_run)] localstack_persist.state   : Persisting state of service s3...
2023-11-22T07:55:45.863  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.ListObjects => 200
2023-11-22T07:55:45.885  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.DeleteObjects => 200
2023-11-22T07:55:50.674  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 404 (NoSuchKey)
2023-11-22T07:55:50.696  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.PutObject => 200
2023-11-22T07:55:50.709  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 404 (NoSuchKey)
2023-11-22T07:55:50.723  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.PutObject => 200
2023-11-22T07:55:50.739  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 200
2023-11-22T07:55:50.765  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.CopyObject => 200
2023-11-22T07:55:51.432  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.ListObjects => 200
2023-11-22T07:55:51.451  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.DeleteObjects => 200
2023-11-22T07:55:51.693  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 404 (NoSuchKey)
2023-11-22T07:55:51.711  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.PutObject => 200
blocks here --> 2023-11-22T07:55:54.036  INFO --- [ead-4 (_run)] localstack_persist.state   : Persisting state of service s3...  
2023-11-22T07:56:56.744  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 404 (NoSuchKey)
2023-11-22T07:56:56.762  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.PutObject => 200
2023-11-22T07:56:56.780  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.HeadObject => 200
2023-11-22T07:56:56.804  INFO --- [   asgi_gw_0] localstack.request.aws     : AWS s3.CopyObject => 200
2023-11-22T07:57:04.044  INFO --- [ead-4 (_run)] localstack_persist.state   : Persisting state of service s3...
@GREsau
Owner

GREsau commented Nov 23, 2023

Thanks for the report! Do you know roughly how many objects are being created? Are they all in the same bucket and/or same account? And which docker tag are you using - latest?

I did some local load testing where I created about 80,000 small objects in a single bucket, and the serialization of the data took about 14 seconds. Obviously this would vary massively depending on system load - but it did indeed prevent any API requests from being handled while serialization was ongoing. The fact that persistence blocks API requests while it's running is not intentional - part of the reason it happens is due to some overzealous locking, so I can improve that.

It should also be feasible to switch to a more efficient serialization mechanism. localstack-persist currently uses jsonpickle, which is fantastic for observability and debuggability, but not great for performance. I whipped up a proof-of-concept that used the pickle module, and that only took 0.5 seconds for the same data - a 28x speedup! I'll need to clean it up a bit and handle backward-compatibility before merging it in properly, but I have a basic implementation working, so I don't foresee any major problems.

@GREsau
Owner

GREsau commented Nov 26, 2023

localstack-persist v3.0.2 should improve this, as persistence no longer blocks request handling 🙂

However, it can still use up most of the available CPU while it's occurring - you may want to try setting PERSIST_FORMAT=binary which further improves this (using pickle as I mentioned in the above comment). I plan on setting the persistence format to binary by default in a future version.
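
For reference, setting it via docker-compose would look something like this (a sketch only - adjust the image tag and volume path to your own setup):

# Sketch only -- image tag and volume path are examples, not prescriptive.
services:
  localstack:
    image: gresau/localstack-persist:3   # or whichever tag you're using
    ports:
      - "4566:4566"
    environment:
      - PERSIST_FORMAT=binary            # opt in to the pickle-based format
    volumes:
      - ./persisted-data:/persisted-data # host directory where state is persisted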

@barasimumatik
Author

Thanks for the report! Do you know roughly how many objects are being created? Are they all in the same bucket and/or same account? And which docker tag are you using - latest?

No problem (sorry for the wall of text, btw 😅). I don't really know how many objects are created each test run, but I'm pretty sure it's well below the amount you tested (80,000). It's all in the same bucket and account. I used 3.0.1 (when it was tagged latest).

The fact that persistence blocks API requests while it's running is not intentional - part of the reason it happens is due to some overzealous locking, so I can improve that.

Yeah, I suspected it was something like that (not that I know where to look in the source but anyway).

It should also be feasible to switch to a more efficient serialization mechanism. localstack-persist currently uses jsonpickle, which is fantastic for observability and debuggability, but not great for performance. I whipped up a proof-of-concept that used the pickle module, and that only took 0.5 seconds for the same data, so a 28x speedup! I'll need to clean it up a bit and handle backward-compatibility properly before merging it in properly, but I have a basic implementation working so I don't foresee any major problems.

Awesome!🥇

@barasimumatik
Author

localstack-persist v3.0.2 should improve this, as persistence no longer blocks request handling 🙂

However, it can still use up most of the available CPU while it's occurring - you may want to try setting PERSIST_FORMAT=binary which further improves this (using pickle as I mentioned in the above comment). I plan on setting the persistence format to binary by default in a future version.

Fantastic! I'll take it for a spin.

@barasimumatik
Author

barasimumatik commented Nov 28, 2023

I tested 3.0.2 (which prints LocalStack version: 3.0.1, LocalStack build date: 2023-11-24, LocalStack build git hash: e5636f4c) and latest (which prints LocalStack version: 3.0.2.dev, LocalStack build date: 2023-11-27, LocalStack build git hash: 7743f159).

Both versions were tested with and without the PERSIST_FORMAT=binary env var set.

I ran the tests multiple times (both several runs in a row and several runs with a fresh container in between). I'd like to say it feels more stable, though the tests took about the same time to finish both with and without PERSIST_FORMAT=binary.

I got the 504 Gateway Timeout several times, but it was more or less random and seemed to happen a lot less often (I think I could consistently trigger it before). There was no obvious pattern besides the timeout happening around the time state was being persisted (stuck for a minute and then the test script fails). I noticed that the Container CPU chart in Docker Desktop dropped to about 0.03% for both my API container and the localstack container (it basically flatlined) while the API was waiting for a response (and the last log statement was about persisting state), so it still seems like something is being blocked.

I'm working from home now, but I'll try it out at the office as well. I want to see if it's possible to set a timeout in the S3 client so I can get a proper error before nginx aborts the request.

@GREsau
Owner

GREsau commented Nov 28, 2023

When you get a chance, could you try running it again with the environment variable DEBUG=1 and upload the container logs?

@barasimumatik
Author

I enabled the debug flag and ran lots of tests, but it didn't really say much. However, I may have found the cause, and it's on the client side. When initializing the S3 client, it is possible to specify two different timeouts (connect_timeout and timeout, in seconds) that apply to the underlying HTTP handler interacting with S3 (or any other AWS service for that matter).

It's not entirely clear what the timeout is for, but I assume it's a general request timeout. As a test I set the values to 5 and 20 for connect_timeout and timeout respectively. I then ran the tests in a while loop that would stop at the first failure (and timed it with the time command). After about 54 minutes (I kinda forgot about it) the test script was still running, so I decided to stop it and experiment with the timeouts individually to try to determine which one to prefer.

It seems to be sufficient to set only connect_timeout, but with just that one the tests could still fail once in a while (though not as frequently, and only after much more time had passed).

For reference, this is how one might initialize the S3 client using the PHP SDK (v3):

$client = new \Aws\S3\S3Client(
    [
        'region' => $config['region'],
        'version' => $config['version'],
        // Point the SDK at LocalStack instead of the real AWS endpoint
        'endpoint' => 'http://localstack:4566',
        'http' => [
            // Timeouts (in seconds) for the underlying HTTP handler
            'connect_timeout' => 5,
            'timeout' => 20,
            // Write curl-style debug output for each request to stderr
            'debug' => fopen('php://stderr', 'w')
        ]
    ]
);

The debug option was kinda handy as well. It configures the HTTP handler to send debug output to a file. The debug output looks like the verbose output from curl, so I guess the underlying handler uses curl via PHP (libcurl is usually built into PHP as I understand it). I think there have been a few issues with curl hanging indefinitely, so I'm going to look into that a little bit (it could very well be that my versions of PHP and libcurl are a bit outdated).

Also, the S3 client has a retry mechanism with (by default) three retries per call, so it could be that when the timeout is set, one of the retries succeeds (as opposed to waiting indefinitely on the first try). That could explain why the tests kept running instead of failing when the timeout was triggered.
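
If it turns out the retries are masking a hung first attempt, the retry count can also be made explicit on the client (a sketch based on the snippet above, using the SDK's retries option):

$client = new \Aws\S3\S3Client(
    [
        'region' => $config['region'],
        'version' => $config['version'],
        'endpoint' => 'http://localstack:4566',
        // Default is 3 retries; set it lower (or to 0) so a hung call surfaces
        // as an error quickly instead of being retried silently.
        'retries' => 1,
        'http' => [
            'connect_timeout' => 5,
            'timeout' => 20
        ]
    ]
);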
