Network.Socket.accept: resource exhausted (Too many open files) #2042
Seems related to #1399 (comment) and #905. |
A way to reproduce this was found on #2092 (comment), basically using |
Seems yesodweb/wai#831 would help with EMFILE in that postgrest won't become blocked and will start responding if sockets become freed. However, if the db is still busy and can't handle incoming traffic, EMFILE will be reached again and thus, postgrest will stop responding again. I think it would be better if we apply backpressure whenever EMFILE happens so we give a chance for db connections to become available. So we would start rejecting any new request with 429 Too Many Requests, this would also serve as a hint to scale or tune the db. |
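As an illustration, that kind of backpressure could be a small WAI middleware that rejects requests early while the pool is saturated. This is only a sketch: the `poolSaturated` flag is hypothetical, and how to actually detect that condition is the open question discussed below.

```haskell
{-# LANGUAGE OverloadedStrings #-}

import Data.IORef (IORef, readIORef)
import Network.HTTP.Types (status429)
import Network.Wai (Middleware, responseLBS)

-- Reject requests early while the pool is saturated, instead of letting them
-- pile up and hold sockets open until EMFILE is reached.
backpressure :: IORef Bool -> Middleware
backpressure poolSaturated app req respond = do
  saturated <- readIORef poolSaturated
  if saturated
    then respond $
      responseLBS status429 [("Retry-After", "1")] "connection pool exhausted, try again later"
    else app req respond
```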
I guess we could try catching that exception somehow but I'm not sure how we would stop responding with 429. Can we count the number of used sockets? |
Another option could be to try and fix this at the pool level. With that we could even stop hitting EMFILE by somehow failing when too many requests are waiting for a pool resource. Hasql.Pool is a small wrapper on Data.Pool, which seems to be lacking an introspection feature (bos/pool#36) that could help us. It also looks to be unmaintained (bos/pool#39). |
429 is a client error, i.e. rate limiting. A user could hit this with just a single request - it's not their fault. This absolutely needs to be a 5xx error code. |
Looks like there is a maintained fork: https://hackage.haskell.org/package/resource-pool-fork-avanov-0.2.4.0 Does that have the introspection feature? Maybe we should just make |
Maybe we can also use https://hackage.haskell.org/package/unix-2.7.2.2/docs/System-Posix-Resource.html. We could:
|
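For reference, a minimal sketch of reading those limits with System.Posix.Resource (the function name is just for illustration):

```haskell
import System.Posix.Resource
  ( Resource (ResourceOpenFiles)
  , ResourceLimit (..)
  , ResourceLimits (..)
  , getResourceLimit
  )

-- Print the current process' soft and hard limits for open file descriptors.
printOpenFilesLimits :: IO ()
printOpenFilesLimits = do
  limits <- getResourceLimit ResourceOpenFiles
  putStrLn $ "soft: " <> render (softLimit limits) <> ", hard: " <> render (hardLimit limits)
  where
    render (ResourceLimit n)     = show n
    render ResourceLimitInfinity = "infinity"
    render ResourceLimitUnknown  = "unknown"
```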
True, agree. I think having a different status than 503 for this would be better. Relevant options:
I'm leaning towards 529 for now. |
I don't see why, yet. 503 is perfect. For the client, there is no difference. The server admin needs to take a look into the logs anyway, to figure out what's happening. |
Yes! It has PoolStats, I think the
Seeing this recommendation by Nikita, I think I'll vendor the Hasql pool code for now and go from there.
Just thought that making the status noticeable would be better so admins can detect faster that it's time to scale or tune, but yeah 503 with a special message will do as well. |
Doesn't help. So I have a branch that for now logs the pool stats on each request. With a
And the request waits until the connections are freed, then it gets a response. Further requests only increase

Now

The waiter queue is already included in avanov's fork, but it doesn't have the size, so adding that would be the next step. Besides that, I'll do a load test and move the pool stats to a

Edit: These are the stats when running GETSingle with
|
Hm. Using pool stats for metrics seems like a nice idea. However, I don't really understand why we're even looking at the db-pool in terms of this issue? I think we should be looking at the http connections, not the database connections or a queue for that. This won't help dealing with connections that don't hit the database anyway.

So what if we were implementing a simple counter with warp's

This would just reject the connection immediately and would not return a "nice" response to the client. But this is actually what we should do, because accepting the connection to be able to return a nice error response might lead to the same issue with enough workload. By rejecting connections temporarily we can at least keep the process alive.

We could use the same counter for both admin and main connections - but could then have two different thresholds (higher for admin) to be able to keep some connections free for that use. This way we can still respond 200 to the liveness probe. We could then return 503 for the readiness probe, when the main threshold has been exceeded. This way we would not get any new traffic routed to us in a cluster environment. |
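A rough sketch of what such a counter could look like with warp's `setOnOpen`/`setOnClose` hooks. The threshold value and names are made up here; in practice the threshold would be derived from the file descriptor soft limit, with a separate (higher) one reserved for the admin listener:

```haskell
import Control.Concurrent.STM (TVar, atomically, modifyTVar', readTVar)
import Network.Socket (SockAddr)
import qualified Network.Wai.Handler.Warp as Warp

-- Hypothetical threshold; in practice derived from the fd soft limit.
maxMainConnections :: Int
maxMainConnections = 1000

-- Count open connections via warp's hooks: refuse new ones above the threshold
-- and decrement the counter when a connection closes.
limitConnections :: TVar Int -> Warp.Settings -> Warp.Settings
limitConnections openConns =
    Warp.setOnOpen onOpen . Warp.setOnClose onClose
  where
    onOpen :: SockAddr -> IO Bool
    onOpen _ = atomically $ do
      n <- readTVar openConns
      -- count up unconditionally, assuming warp pairs every onOpen with an onClose
      modifyTVar' openConns (+ 1)
      pure (n < maxMainConnections)  -- returning False drops the connection

    onClose :: SockAddr -> IO ()
    onClose _ = atomically $ modifyTVar' openConns (subtract 1)
```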
Because EMFILE is the consequence of a starved pool. When all the pool connections are consumed, each new request will wait for a connection to become free (as mentioned above); during this time the http connection remains open and thus the socket is open as well, and enough of these requests will then generate EMFILE. So if we solve the starved pool (the root cause), EMFILE will be solved as well. To be clear, the solution is to respond fast with a 529 status (or any 5xx) whenever the pool is starved, so we won't hit EMFILE.
That doesn't seem reliable at all to me: the number of http connections supported could be high if the db queries are fast or low if the db queries are slow; it seems too hard to measure that for every case. Also, this type of error should notify the admins/devs that the server is overloaded (needs scaling); it wouldn't be helpful at all to not return any error message. On the contrary,
🤔 Can you elaborate on that? |
Assuming
The reason we are hitting that limit is that we are getting more requests/s than we can handle. In my understanding this can happen in regular operation. I don't really know what you mean by "starved pool". Maybe this is a separate problem that I don't fully understand yet, which can lead to the same consequence. But surely the "we're getting more requests than we can handle" is a problem that will lead to the EMFILE problem eventually, too - even if you solve the "starved pool" thing.
Think

But still: When we're getting busted with maaaaany of those requests, I can imagine getting to the same limit of file descriptors and getting that EMFILE error.
The limit would be based on the number of available file descriptors. The limit would be there only to prevent reaching EMFILE. You can read the current process' soft limit for file descriptors and then base the threshold on that. You don't need to measure anything. This is much more reliable than anything else - if the goal is preventing EMFILE.
The nature of the problem ("having no more http connections available") prevents us from responding with a proper error to every request in that case... because that's exactly the problem. We can't handle more requests. But we can (1) stay alive and (2) report that failure at the readiness endpoint - and this is what admins need to observe. |
That the pool is left without resources (connections); pool starvation seems to be a common term. See Dealing with database connection pool starvation on Oracle DB.
Oh, I don't think so. Warp is pretty fast, it can handle 100K reqs while we can only handle about 2K reqs (reads) due to the pool limitation. That's why in general I think the direction should be improving the pool so we can get closer to Warp limits. It doesn't seem right to limit Warp when it's not the main culprit and it can handle so much. Improving the pool means, for example, allowing the pool to use more than one db. I've found on supabase/benchmarks#24 that vertically scaling a pg instance doesn't lead to increased reads, so we need to enable horizontal scaling at the pool level. In fact doing this would also clear EMFILE in most cases I've seen, since postgrest can then respond so much faster (numbers pending yet, but many recommend horizontal scaling for scaling reads). Of course improving the pool also means it should do backpressure; tuning queries is another option to handle load.
If those respond, that doesn't look like a problem to me. That being said, my main idea was to have a global state whenever we do backpressure so those requests could be rejected as well. Same for the admin endpoint, it would fail.
I don't know what other external factors can affect the http connection limits; that's why it seems so unreliable to me. Meanwhile, the pool is something we fully control. Why so opposed to the pool approach :)? What's the actual drawback? I cannot be the only one who thought of such an approach (encode/starlette#802 (comment)). |
Because it does not relate to this issue. You have many great ideas to improve performance given in this post, which are well worth discussing. But those don't solve the problem we have here. Those might make it less likely to occur, but the problem itself will still be there.
But Warp is the main culprit. Warp is not dealing with EMFILE correctly. The issue here is the following:
Improving the pool is just a workaround to that very problem. Again: Improving the pool and performance is a great thing on its own and we should discuss it - elsewhere. There is no point in just trying to make the "postgrest hangs and needs to be restarted manually" less likely, when we can just eliminate it altogether.

I think we're just talking different problems here. Given my example above, with a pool size of 10 and a limit of file descriptors of 1024, I am talking about limiting the queue of waiting http requests at around 1000 to make sure we never have our PostgREST process hang.

Now, of course, once we are in that state there will be a different problem: Those 1000 requests still in the queue might take a loong time until all of them are served. And that seems to be what you are concerned about. It makes perfect sense to have a much smaller threshold to start rejecting new requests with proper error messages. And to base this on pool performance. But then again, it also makes a lot of sense to keep this part configurable. I assume it would make a big difference whether you run a single instance of postgrest or a cluster and how you would want to deal with this in both cases.

But: This is a different issue. |
Looks like Akka has a similar way to limit its http connection queue:
At the db pool level, SQLAlchemy has a max_overflow: |
Just to inform about some progress. I found that

Using more db connections for internal purposes doesn't seem right though (and also we cannot add 200 connections for a waiter queue, because connections are scarce in many cases), so it would be better to improve the
When I try this test (simple table GET with filter) with 500 concurrent users (VUs in k6), I get a

`{"dbPoolStats":{"currentUsage":10,"creates":421,"highwaterUsage":10,"createFailures":0,"totalWaiters":483,"takes":190445}}`

After 30 seconds, the

Now with 1000 concurrent users, I manage to reproduce EMFILE:
Which happens because the soft limit is 1024 file descriptors:
Cannot query the

If I change the soft limit to 5000, EMFILE is not hit, and the total waiters can be seen to be 987:

`{"dbPoolStats":{"currentUsage":10,"creates":12,"highwaterUsage":10,"createFailures":0,"totalWaiters":987,"takes":38264}}`

So seems limiting

I'll step back and instead explore the soft limit threshold.
Could we respond with a proper client error if we compare |
Note the huge difference between soft and hard threshold. I guess most distributions will allow a much higher hard limit. We can set the soft limit from within the postgrest process itself, as mentioned earlier. Other webservers do, too. Increasing the soft limit to the hard limit will already solve this issue in 95% of cases, because the hard limit is so high that the bug will just occur even less frequently. The remaining 5% will be turning down new requests against some kind of threshold. We could compare against:
|
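As an illustration of such a comparison, here is a sketch that counts the process' open file descriptors via /proc (a Linux-only assumption) and checks them against the soft limit; the `margin` parameter and function name are made up:

```haskell
import System.Directory (listDirectory)
import System.Posix.Resource
  ( Resource (ResourceOpenFiles)
  , ResourceLimit (..)
  , ResourceLimits (..)
  , getResourceLimit
  )

-- Count the file descriptors this process currently holds open (via /proc, so
-- Linux-specific) and check whether we are within 'margin' of the soft limit.
nearOpenFilesLimit :: Integer -> IO Bool
nearOpenFilesLimit margin = do
  used   <- fromIntegral . length <$> listDirectory "/proc/self/fd"
  limits <- getResourceLimit ResourceOpenFiles
  pure $ case softLimit limits of
    ResourceLimit n -> used + margin >= n
    _               -> False  -- no finite soft limit, nothing to guard against
```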
I do see Nginx having the worker_rlimit_nofile, which manipulates the soft limit; tuning that is also a suggested solution to EMFILE. So I guess it's fine to manipulate the soft limit within postgrest itself.
Hm, I'm not sure about setting the soft limit to the hard limit - do you remember which server did that? I'd like to see what the actual implications are. |
Btw, yesodweb/wai#831 was recently merged. |
Sweet, now updating warp and then raising the soft limit should be enough.
I thought that was nginx, but reading the link you posted I'm not sure anymore. Can't remember where I read that. There are only two options really:
Because raising the soft limit to anything else is always quite an arbitrary choice. Maybe updating warp is enough, too. It's not that often that the bug was hit, right? And the real problem was just that the server was unresponsive after that. Once we reject a few connections, the queue should be getting smaller quickly anyway... |
I tried the new warp version. The good news is that with this change postgrest continues to function once file descriptors become available again; there are no more EMFILEs reported from warp. However, there are EMFILEs coming from other places. Same 1000 concurrent users load test as before:
Another EMFILE which I think comes from re-reading the config file (not critical). This is triggered by the 503 that comes from the failed
When re-running the load test, sometimes this error also shows on the postgrest logs:
But as can be seen, postgrest recovers because it then retries the connection. While the load test runs, I've also tried running other curl requests to see what happens. Unlike before, when curling the admin port it doesn't give

Same thing as the above happens for a new request on the main app port(
@wolfgangwalther How about if we just recommend setting the file descriptors higher than 1024 in the docs? The |
My understanding about soft and hard limits is that soft limits enable a process to self-limit, while hard limits allow the admin/user to limit the process to an enforced limit. Having low soft limits makes sense for applications that don't expect to have many file descriptors open - it's basically a check against bugs in the application, which might keep those file descriptors open for longer than needed. But that's not the case here: With high load we expect a high number of file descriptors. It's perfectly fine, none of those file descriptors are orphaned.

Systemd allows to set

I think what we should do:
This will set the default in a way that reaching the limit is more unlikely. |
Cool, that's the correct way to handle it.
Yes, that would be expected.
Not sure what you're looking at here. Didn't you just say a few lines up that you are getting EMFILEs instead of waiting - until the number of connections decreases again? In any case: To still allow re-reading the config file and serving useful responses at the admin port, we need to keep a safety buffer of available file descriptors. This is exactly what I suggested by counting the open http connections - and rejecting new connections before we hit EMFILE. |
Those are curl requests I tried while the load test is running (I've edited my comment above), and no, I don't get EMFILEs (coming from the pg socket) there, they just wait. The EMFILEs I shared came from the k6 load test and seem to happen randomly for some requests (1-4 reqs). Maybe not a problem though, since we keep running.
Hm, keeping the admin port working seems like the right behavior, yes. But I'm not sure if counting http connections would somehow affect performance. Ideally warp would offer us an option to reserve some fds for the admin server, but I don't think there's any. I'll keep digging on this topic.. |
I haven't found other webservers (in other languages) doing something like this. That makes me a bit uneasy about trying that path.
I'll try the soft limit to hard limit approach, that seems enough. The admin port should keep serving requests while the main port requests will wait. |
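A minimal sketch of the soft-limit-to-hard-limit approach with System.Posix.Resource, assuming it is done once at startup (the function name is illustrative):

```haskell
import System.Posix.Resource
  ( Resource (ResourceOpenFiles)
  , ResourceLimits (..)
  , getResourceLimit
  , setResourceLimit
  )

-- Raise the soft limit for open file descriptors to the hard limit, so only
-- the admin-controlled hard limit remains as the effective cap.
raiseOpenFilesLimit :: IO ()
raiseOpenFilesLimit = do
  limits <- getResourceLimit ResourceOpenFiles
  setResourceLimit ResourceOpenFiles limits { softLimit = hardLimit limits }
```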
Tried
Now we don't "fork", but I assume there must be some cost to having many green threads; the soft limit is preventing us from reaching 524288 of those.
We might as well just advise in the docs to set
That looks better, but I don't think we need it. Nginx docs say that worker_rlimit_core is for
Makes sense for nginx since changing LimitNOFILE in systemd would require a restart. However, in our case failure (timeouts) under high load won't be solved by more file descriptors; the db needs to be tuned/scaled. So I'll patch the EMFILE bug; that plus a section in the admin docs about timeouts/EMFILE and how to scale (will take longer as
👍 |
Noted that these EMFILE errors don't happen when |
Hm. Did you do a test to confirm that those database-side EMFILEs are recoverable from? Given the observation you made, it could also be that 4 out of 10 pool connections are now in a dead state, similar to warp earlier. You could set |
Yes, when the db-side EMFILE happens the response is 503 and then (see postgrest/src/PostgREST/App.hs, lines 175 to 176 at 41a4396):

The pool is released and new connections are made; doing a new load test (not hitting EMFILE, say 100 concurrent users) then reveals no errors. |
Perfect. |
(PostgREST v8.0.0.20211102 behind Kong/Nginx)
This happens when there's a good amount of traffic going into PostgREST, then it stops processing any new requests until it's restarted manually.
Ideally PostgREST should recover or at least die and let systemd restart it.
This is a fix that needs to be done upstream: yesodweb/wai#830.
A patch is already up for review: yesodweb/wai#831
Also, a way to reproduce is mentioned on: yesodweb/wai#825.
Not yet confirmed, but increasing the `db-pool` size should help, since PostgREST will be able to process more requests without running out of sockets first.