Flask is erroring out with BrokenProcessPool #205
Hmm, I never experienced this. As a workaround for now, you could disable internal parallelism by changing this line. You lose concurrent RGB band retrieval, though.
I have several possible fixes in mind, but the workaround proposed above should work. Can you see from the logs why the process pool breaks, though? This should be the first exception you see.
The error log above is the only output I am getting from gunicorn, unfortunately. It's weird because I am actually only using singleband requests. The fix above wouldn't work for me, as I don't have a local copy of terracotta but rather install it via pip (which is really the only way for our production deployments).
Alright. I'll try to add a setting to disable multiprocessing within the coming days and make a release as soon as #203 is merged. That should do as a workaround. Permanent fix ideas, with increasing cleverness:
Respawn broken process pool (#205)
👋 @rico-ci, can you try with the latest master? You can pip-install it with `pip install git+https://github.com/DHI-GRAS/terracotta.git`. This should spawn a new pool automatically if it breaks.
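For context, here is a minimal sketch of the respawn idea described above. This is not Terracotta's actual implementation, just the general pattern of recovering from a broken pool; the worker count and function names are made up.

```python
# Illustrative only: recover from a BrokenProcessPool by creating a fresh pool
# and retrying the submitted task once.
from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

_executor = ProcessPoolExecutor(max_workers=4)

def run_with_respawn(fn, *args, **kwargs):
    global _executor
    try:
        return _executor.submit(fn, *args, **kwargs).result()
    except BrokenProcessPool:
        # Once a worker dies unexpectedly, the whole pool is unusable;
        # replace it and give the task one more chance.
        _executor = ProcessPoolExecutor(max_workers=4)
        return _executor.submit(fn, *args, **kwargs).result()
```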
Just tried it out, and now I'm getting this behaviour:
So I guess the pool fails but then gets re-instantiated. However, where the ...
This seems to be the real error:
So it seems like your I/O is flaky for some reason. That doesn't seem like something we can fix from Terracotta's side. I read your log as showing that the re-spawning works as intended, since Terracotta is able to recover from the failing I/O.
BTW, is this the first request, or does it occur after some time? Does it re-appear?
OK, I will look into what could be causing the I/O error. However, it seems weird to me that this is popping up out of nowhere, and only sometimes, claiming that the DB does not exist. Regarding the error frequency:
I can really recommend switching to a MySQL DB. Concurrency + SQLite has caused quite some trouble for me before...
Yeah, I was thinking the same just now. I will have a look at whether that solves not only the read I/O issue but also the ThreadPool one. I'll keep you posted.
Switching to MySQL would probably fix this, whether it's a flaky disk or SQLite acting up. The error saying the file does not exist is just a guess from Terracotta; SQLite only reports an I/O error, and a missing file is the most common cause. Can you say a bit more about your setup? Are either the rasters or the database on S3, or is it all local?
So we're currently storing COGTiffs locally on the pod where we run Terracotta. We are thinking of using S3 further down the line, but not for now. Hence, we generate COGTiffs, store them locally on the pod, and insert the dataset keys + file path into the Terracotta database. Then we GET the tiles from Terracotta. I just switched all ... I guess that's still to do with I/O on my side?
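For reference, a minimal sketch of that ingestion step using Terracotta's Python driver API; the key names and file paths below are made-up examples.

```python
import terracotta as tc

# Point the driver at a local SQLite database (created on first use).
driver = tc.get_driver("/data/terracotta.sqlite")

# The key schema only needs to be created once per database.
driver.create(keys=("type", "date"))

# Register a locally stored COG under its dataset keys.
with driver.connect():
    driver.insert({"type": "ndvi", "date": "20210101"}, "/data/ndvi_20210101_cog.tif")
```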
Maybe this is the perfect case for a no-db option #172?
No, that is unfortunately not sufficient. You need a running MySQL server and to ingest the datasets into that. We infer which DB provider you are using from the scheme of the driver path, or you can configure it explicitly with the driver provider setting. You could run the server on the same pod, I guess, by making a hybrid container with a MySQL server and Terracotta. But all-local deployments are really what we thought SQLite databases should work for, so maybe we should rather try to get that to work. IMO, MySQL makes the most sense if you have that server running outside of your pod and also have the rasters stored remotely (in S3).
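As a rough sketch of what pointing Terracotta at a MySQL server could look like: the host, credentials, and database name below are placeholders, and the same settings can alternatively be supplied as TC_-prefixed environment variables.

```python
import terracotta as tc

# Placeholder connection string: user, password, host, and database name must
# match your actual MySQL deployment.
tc.update_settings(
    DRIVER_PATH="mysql://tc_user:tc_password@mysql-service:3306/terracotta",
    DRIVER_PROVIDER="mysql",  # usually inferred from the mysql:// scheme
)

driver = tc.get_driver(tc.get_settings().DRIVER_PATH)
```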
Right, that makes much more sense. I thought for some reason that MySQL was also serverless, just like SQLite. I'll add another container with a MySQL server to the deployment and give that a shot. We'll be looking to move to S3 eventually anyway for horizontal scaling, so the effort is well spent regardless. Also, I would definitely be pro ditching the database. The database was actually the one thing that made me hesitate when exploring options for serving raster tiles. It seemed difficult to keep up to date, especially when you're constantly ingesting new COGTiff files as we are.
SQLite can handle concurrent processes reading and writing to it perfectly fine. So I would be very surprised if this is a fundamental issue with Terracotta or SQLite. The internet has several suggestions for this error. Some examples:
I think you should at least check for disk space or memory issues before moving on. Especially when you're out of memory, you should see semi-random worker crashes like the ones you are experiencing.
OKAY! So, I got it to work, and I am a bit embarrassed about the underlying issue.
So it probably was the pod running out of memory eventually? Each worker gets its own raster cache, which can grow up to the configured raster cache size.
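If per-worker memory is the concern, one knob to try is shrinking that cache. A minimal sketch, assuming the raster cache size setting; the 100 MB figure is an arbitrary example:

```python
import terracotta as tc

# Cap the in-memory raster cache at ~100 MB per worker process, so N gunicorn
# workers use at most roughly N * 100 MB for caching instead of the default limit.
tc.update_settings(RASTER_CACHE_SIZE=100 * 1024 * 1024)
```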
That is very probable. I never saw the overall memory/CPU usage top any of my pod's limits, but it could be that gunicorn's worker allocation did some background magic that caused the error. Otherwise, I could also imagine that there were simply too many processes running at the same time, causing this cascade of errors.
No worries. Happy to help, and I think the changes I implemented are still valuable for the future. |
@dionhaefner is there a timeline for when a new version of Terracotta including the changes discussed here will be released?
Good point. I'll make a release later today. |
Hi there! Since I last posted, this issue has been cropping up again and I'm pretty clueless about what to do. My specs and settings:
The error log I am receiving again:
It's unfortunate that this only happens for one dataset, and I can't exactly pinpoint what that dataset does differently from the others I'm serving. I know for a fact that Terracotta has no trouble serving COGTiffs of around 20-100 MB; the dataset where the above error is thrown is around 400 MB. I have also tried reducing the number of workers. That actually stops the error from being thrown, but the performance is really poor: most tiles don't load, and to load them I need to zoom to a specific section of the map and wait around 4-10 seconds. Not sure what I can do about this, but it prevents us from bringing Terracotta into production. Honestly, at this point any hints are welcome.
Are you absolutely sure that your files are valid COGs? That would explain your slow tile reads and out-of-memory errors.
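One way to check is with the third-party rio-cogeo package rather than anything built into Terracotta (the file path below is a placeholder); files that fail validation can then be rewritten with Terracotta's `terracotta optimize-rasters` command.

```python
# Validate that a GeoTIFF is a proper Cloud-Optimized GeoTIFF.
from rio_cogeo.cogeo import cog_validate

is_valid, errors, warnings = cog_validate("/data/suspect_dataset.tif")
print(f"valid: {is_valid}")
print("errors:", errors)
print("warnings:", warnings)
```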
Some other services had been changed, and this indeed was the issue. Thanks @dionhaefner! I was really losing my wits but didn't think of that.
Hi there!
We have started using Terracotta in our K8s infrastructure in production. Basically, we are serving the WSGI Flask application (`terracotta.server.app:app`) using gunicorn, alongside an internal gRPC server that takes internal requests, queries the Terracotta HTTP endpoint for a singleband tile, and returns it as a bytes object (a rough sketch of that tile-fetching step follows below). However, while the first 10-50 requests work fine, I now get this error from Terracotta afterwards:
The worst thing about this is that the Flask application doesn't seem to actually error out. Instead, every subsequent request throws the error above. That's problematic, as K8s then doesn't know that the pod needs to be restarted. In the longer term, it also means that we could never handle the volume of requests we have (around 50 RPS) using Terracotta if this persists.
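For context, here is a rough sketch of the tile-fetching step mentioned above; the host, key values, and tile coordinates are placeholders, and the URL path must be adjusted to your own key schema.

```python
import requests

def fetch_singleband_tile(host: str, keys: list, z: int, x: int, y: int) -> bytes:
    # Singleband tiles are addressed by the dataset keys followed by the tile
    # coordinates, e.g. /singleband/<key1>/<key2>/<z>/<x>/<y>.png
    url = f"{host}/singleband/{'/'.join(keys)}/{z}/{x}/{y}.png"
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    return resp.content  # raw PNG bytes, handed back to the gRPC caller

# Example call (placeholder values):
# tile = fetch_singleband_tile("http://localhost:5000", ["ndvi", "20210101"], 10, 530, 320)
```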
Has anyone encountered this yet?