
Flask is erroring out with BrokenProcessPool #205

Closed
rico-ci opened this issue Apr 26, 2021 · 26 comments

@rico-ci

rico-ci commented Apr 26, 2021

Hi there!

We have started using Terracotta in our K8s infrastructure in production. Basically, we serve the Flask WSGI application (terracotta.server.app:app) with gunicorn, alongside an internal gRPC server that takes internal requests, queries the Terracotta HTTP endpoint for a singleband tile, and returns it as a bytes object.
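
For reference, we launch the server with:

$ gunicorn -w 4 -b :8000 terracotta.server.app:app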

However, while the first 10–50 requests work fine, I then get this error from Terracotta:

 [-] Exception on /singleband/some_path/25/10/506/313.png [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/flask_api.py", line 49, in inner
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 121, in get_singleband
    return _get_singleband_image(keys, tile_xyz)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 166, in _get_singleband_image
    image = singleband(parsed_keys, tile_xyz=tile_xyz, **options)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/handlers/singleband.py", line 43, in singleband
    tile_data = xyz.get_tile_data(
  File "/usr/local/lib/python3.8/dist-packages/terracotta/xyz.py", line 44, in get_tile_data
    return driver.get_raster_tile(
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/base.py", line 20, in inner
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/raster_base.py", line 557, in get_raster_tile
    future = executor.submit(retrieve_tile)
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 629, in submit
    raise BrokenProcessPool(self._broken)
concurrent.futures.process.BrokenProcessPool: A child process terminated abruptly, the process pool is not usable anymore

The worst thing about this is that the Flask application doesn't actually crash. Instead, every subsequent request throws the error above. That's problematic, because K8s then doesn't know that the pod needs to be restarted. In the long run it also means that we could never handle our request volume (around 50 RPS) with Terracotta if this persists.

Has anyone encountered this yet?

@j08lue
Collaborator

j08lue commented Apr 26, 2021

Hmm, I never experienced this.

As a workaround for now, you could disable internal parallelism by changing this line:

https://github.com/DHI-GRAS/terracotta/blob/76534672bbce38d50a5e33afb2a9a2c5f1339909/terracotta/drivers/raster_base.py#L43

You lose concurrent RGB band retrieval on /rgb calls, but it should make no difference for /singleband (it might even be faster, since threads are cheaper than processes).
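
Something along these lines (just a sketch; the max_workers value is my assumption, matching the three concurrently fetched RGB bands):

from concurrent.futures import ThreadPoolExecutor

# A thread pool cannot break the way a process pool does when a child
# process dies, so tile reads keep working even after worker crashes.
executor = ThreadPoolExecutor(max_workers=3)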

@dionhaefner
Collaborator

I have several possible fixes in mind; the workaround proposed above should work for now. Can you see from the logs why the process pool breaks, though? That should be the first exception you see.

@rico-ci
Author

rico-ci commented Apr 26, 2021

Unfortunately, the error log above is the only output I am getting from gunicorn. It's weird, because I am only making singleband requests. The fix above won't work for me, as I don't have a local copy of Terracotta but install it via pip (which is really the only option for our production deployments).

@dionhaefner
Collaborator

Alright. I'll try to add a setting to disable multiprocessing within the coming days and make a release as soon as #203 is merged. That should do as a workaround.

Permanent fix ideas, with increasing cleverness:

  1. Disable multiprocessing by default and accept that /rgb takes 3x as long as /singleband.
  2. Try multithreading again with GDAL 3, maybe the race condition is fixed by now.
  3. Detect whether the process pool is broken and spawn new workers as needed (see the sketch below).
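
Roughly what I have in mind for idea 3 (a sketch only, not necessarily what will land in Terracotta):

from concurrent.futures import ProcessPoolExecutor
from concurrent.futures.process import BrokenProcessPool

_executor = ProcessPoolExecutor(max_workers=3)

def submit_with_respawn(fn, *args, **kwargs):
    # If the pool has died (e.g. a worker was OOM-killed), replace it
    # with a fresh one and retry the submission once.
    global _executor
    try:
        return _executor.submit(fn, *args, **kwargs)
    except BrokenProcessPool:
        _executor.shutdown(wait=False)
        _executor = ProcessPoolExecutor(max_workers=3)
        return _executor.submit(fn, *args, **kwargs)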

dionhaefner added a commit that referenced this issue Apr 27, 2021
@dionhaefner
Collaborator

👋 @rico-ci, can you try with the latest master?

You can pip-install it with

$ pip install git+https://github.com/DHI-GRAS/terracotta.git

This should spawn a new pool automatically if it breaks.

@rico-ci
Author

rico-ci commented Apr 27, 2021

Just tried it out and now am getting this behaviour:

  [-] Exception on /singleband/some_path/25/12/2022/1251.png [GET]
 Traceback (most recent call last):
   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2447, in wsgi_app
     response = self.full_dispatch_request()
   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1952, in full_dispatch_request
     rv = self.handle_user_exception(e)
   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1821, in handle_user_exception
     reraise(exc_type, exc_value, tb)
   File "/usr/local/lib/python3.8/dist-packages/flask/_compat.py", line 39, in reraise
     raise value
   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1950, in full_dispatch_request
     rv = self.dispatch_request()
   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1936, in dispatch_request
     return self.view_functions[rule.endpoint](**req.view_args)
   File "/usr/local/lib/python3.8/dist-packages/terracotta/server/flask_api.py", line 49, in inner
     return fun(*args, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 121, in get_singleband
     return _get_singleband_image(keys, tile_xyz)
   File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 166, in _get_singleband_image
     image = singleband(parsed_keys, tile_xyz=tile_xyz, **options)
   File "/usr/lib/python3.8/contextlib.py", line 75, in inner
     return func(*args, **kwds)
   File "/usr/local/lib/python3.8/dist-packages/terracotta/handlers/singleband.py", line 43, in singleband
     tile_data = xyz.get_tile_data(
   File "/usr/local/lib/python3.8/dist-packages/terracotta/xyz.py", line 44, in get_tile_data
     return driver.get_raster_tile(
   File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/base.py", line 20, in inner
     return fun(self, *args, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/raster_base.py", line 607, in get_raster_tile
     result = future.result()
   File "/usr/lib/python3.8/concurrent/futures/_base.py", line 439, in result
     return self.__get_result()
   File "/usr/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
     raise self._exception
 concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
  [-] Exception on /singleband/some_path/25/12/2018/1252.png [GET]
 Traceback (most recent call last):
   File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/sqlite.py", line 35, in convert_exceptions
     yield
   File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/sqlite.py", line 121, in _connect
     self._connection = sqlite3.connect(
 sqlite3.OperationalError: disk I/O error
 
 The above exception was the direct cause of the following exception:
 
 Traceback (most recent call last):
   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2447, in wsgi_app
     response = self.full_dispatch_request()
   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1952, in full_dispatch_request
     rv = self.handle_user_exception(e)
   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1821, in handle_user_exception
     reraise(exc_type, exc_value, tb)
   File "/usr/local/lib/python3.8/dist-packages/flask/_compat.py", line 39, in reraise
     raise value
   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1950, in full_dispatch_request
     rv = self.dispatch_request()
   File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1936, in dispatch_request
     return self.view_functions[rule.endpoint](**req.view_args)
   File "/usr/local/lib/python3.8/dist-packages/terracotta/server/flask_api.py", line 49, in inner
     return fun(*args, **kwargs)
   File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 121, in get_singleband
     return _get_singleband_image(keys, tile_xyz)
   File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 166, in _get_singleband_image
     image = singleband(parsed_keys, tile_xyz=tile_xyz, **options)
   File "/usr/lib/python3.8/contextlib.py", line 75, in inner
     return func(*args, **kwds)
   File "/usr/local/lib/python3.8/dist-packages/terracotta/handlers/singleband.py", line 41, in singleband
     with driver.connect():
   File "/usr/lib/python3.8/contextlib.py", line 113, in __enter__
     return next(self.gen)
   File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/sqlite.py", line 121, in _connect
     self._connection = sqlite3.connect(
   File "/usr/lib/python3.8/contextlib.py", line 131, in __exit__
     self.gen.throw(type, value, traceback)
   File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/sqlite.py", line 37, in convert_exceptions
     raise exceptions.InvalidDatabaseError(msg) from exc
 terracotta.exceptions.InvalidDatabaseError: Could not connect to database. Make sure that the given path points to a valid Terracotta database, and that you ran driver.create().

So I guess the pool fails but then gets re-instantiated. Where the InvalidDatabaseError comes from, however, I have no idea. It's strange, because subsequent calls are successful, so it can't be the actual error.

@dionhaefner
Collaborator

This seems to be the real error:

sqlite3.OperationalError: disk I/O error

So it seems like your I/O is flaky for some reason. That doesn't look like something we can fix from Terracotta's side.

I read your log as the re-spawning working as intended, since Terracotta is able to recover from the failing I/O.

@dionhaefner
Collaborator

BTW, is this the first request, or does it occur after some time? Does it re-appear?

@rico-ci
Author

rico-ci commented Apr 27, 2021

OK, I will look into what could be causing the I/O error. However, it seems weird to me that this pops up out of nowhere, and only sometimes, claiming that the DB does not exist.

Regarding the error frequency:
It occurs after some arbitrary number of requests (~40 or so) and then re-appears consistently.

@j08lue
Collaborator

j08lue commented Apr 27, 2021

I can really recommend switching to a MySQL DB. Concurrency + SQLite has caused me quite some trouble before...

@rico-ci rico-ci closed this as completed Apr 27, 2021
@rico-ci rico-ci reopened this Apr 27, 2021
@rico-ci
Author

rico-ci commented Apr 27, 2021

Yeah, I was just thinking the same. I will check whether that solves not only the read I/O issue but also the broken process pool. I'll keep you posted.

@dionhaefner
Collaborator

Switching to MySQL would probably fix this, whether it's a flaky disk or SQLite acting up.

The error saying the file does not exist is just a guess on Terracotta's part. SQLite only reports an I/O error, and a missing file is the most common cause.

Can you say a bit more about your setup? Are the rasters or the database on S3, or is it all local?

@rico-ci
Author

rico-ci commented Apr 27, 2021

So we're currently storing COGTiffs locally on the pod where we run Terracotta. We are thinking of using S3 further down the line, but not for now. Hence, we generate COGTiffs, store them locally on the pod, and insert the dataset keys + file path into the Terracotta database. Then we GET the tiles from Terracotta.

I just switched all .sqlite file endings to .mysql (is that how you are supposed to do it?). While the overall performance is definitely better (or I am imagining it), I still receive the error messages below.

I guess that's still to do with I/O on my side?

 [-] Exception on /singleband/some_path/25/11/1011/623.png [GET]
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "rasterio/_io.pyx", line 697, in rasterio._io.DatasetReaderBase._read
  File "rasterio/shim_rasterioex.pxi", line 142, in rasterio._shim.io_multi_band
  File "rasterio/_err.pyx", line 190, in rasterio._err.exc_wrap_int
rasterio._err.CPLE_AppDefinedError: IReadBlock failed at X offset 0, Y offset 1: /tiles/bd33e0e2-e1a9-4ba6-8d60-58e259a5ab0b/25_1614280683.tif, band 1: IReadBlock failed at X offset 1, Y offset 3: TIFFReadEncodedTile() failed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/raster_base.py", line 532, in _get_raster_tile
    tile_data = vrt.read(
  File "rasterio/_warp.pyx", line 1085, in rasterio._warp.WarpedVRTReaderBase.read
  File "rasterio/_io.pyx", line 361, in rasterio._io.DatasetReaderBase.read
  File "rasterio/_io.pyx", line 700, in rasterio._io.DatasetReaderBase._read
rasterio.errors.RasterioIOError: Read or write failed. IReadBlock failed at X offset 0, Y offset 1: /tiles/bd33e0e2-e1a9-4ba6-8d60-58e259a5ab0b/25_1614280683.tif, band 1: IReadBlock failed at X offset 1, Y offset 3: TIFFReadEncodedTile() failed.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python3.8/dist-packages/flask/_compat.py", line 39, in reraise
    raise value
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/flask_api.py", line 49, in inner
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 121, in get_singleband
    return _get_singleband_image(keys, tile_xyz)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 166, in _get_singleband_image
    image = singleband(parsed_keys, tile_xyz=tile_xyz, **options)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/handlers/singleband.py", line 43, in singleband
    tile_data = xyz.get_tile_data(
  File "/usr/local/lib/python3.8/dist-packages/terracotta/xyz.py", line 44, in get_tile_data
    return driver.get_raster_tile(
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/base.py", line 20, in inner
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/raster_base.py", line 607, in get_raster_tile
    result = future.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
rasterio.errors.RasterioIOError: Read or write failed. IReadBlock failed at X offset 0, Y offset 1: /tiles/bd33e0e2-e1a9-4ba6-8d60-58e259a5ab0b/25_1614280683.tif, band 1: IReadBlock failed at X offset 1, Y offset 3: TIFFReadEncodedTile() failed.

@j08lue
Collaborator

j08lue commented Apr 27, 2021

we're currently storing COGTiffs locally on the pod

Maybe this is the perfect case for a no-db option #172?

I just switched all .sqlite file-endings to .mysql

No, that is unfortunately not sufficient. You need a running MySQL server and to ingest the datasets into it. We infer which DB provider you are using from the scheme:

https://github.com/DHI-GRAS/terracotta/blob/76534672bbce38d50a5e33afb2a9a2c5f1339909/terracotta/drivers/__init__.py#L64-L73
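
Conceptually, the detection boils down to something like this (illustrative only, not the verbatim Terracotta code):

from urllib.parse import urlparse

def guess_provider(url_or_path):
    # 'mysql://user:password@host/db' has scheme 'mysql'; a plain file
    # path like '/data/terracotta.sqlite' has no scheme and maps to SQLite.
    scheme = urlparse(str(url_or_path)).scheme
    return 'mysql' if scheme == 'mysql' else 'sqlite'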

or you can configure it with the DRIVER_PROVIDER setting:

https://github.com/DHI-GRAS/terracotta/blob/253e02c5361c1c8621638538ca9206e019051a77/terracotta/config.py#L20
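
For example (assuming the usual TC_ prefix for environment-variable settings; the exact connection-string format may differ):

$ export TC_DRIVER_PROVIDER=mysql
$ export TC_DRIVER_PATH=user:password@hostname/database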

You could run the server on the same pod, I guess, by making a hybrid container with a MySQL server and Terracotta. But all-local deployments are really what we intended SQLite databases for, so maybe we should rather try to get that working.

IMO, MySQL makes the most sense if the server runs outside of your pod and the rasters are also stored remotely (e.g. on S3).

@rico-ci
Author

rico-ci commented Apr 27, 2021

Right, that makes much more sense. I thought for some reason that MySQL was serverless, just like SQLite. I'll add another container with a MySQL server to the deployment and give that a shot. We'll be looking to move to S3 eventually anyway for horizontal scaling, so the effort is well spent regardless.

Well, I would definitely be pro ditching the database. The database was actually the one thing that made me hesitate when exploring options for serving raster tiles. It seemed difficult to keep up to date, especially when you're constantly ingesting new COGTiff files as we are.

@dionhaefner
Collaborator

SQLite can handle concurrent processes reading and writing to it perfectly fine. So I would be very surprised if this is a fundamental issue with Terracotta or SQLite.

The internet has several suggestions for this error. Some examples:

  • Disk is full (system disk or the one holding the database)
  • Worker is out of memory
  • File permission issues (journal file is not writable)
  • Uppercase filenames on case-insensitive platforms
  • Disk is an NFS share or mounted Google drive that doesn't support locking

I think you should at least check for disk space and memory issues before moving on. Especially when you're out of memory, you would see semi-random worker crashes like the ones you are experiencing.
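
A quick way to check both from inside the pod (the /tiles mount is just an example path):

import shutil

# Free space on the volume holding the database and rasters
usage = shutil.disk_usage('/tiles')
print('disk free: %.1f GiB of %.1f GiB' % (usage.free / 2**30, usage.total / 2**30))

# Available memory on a Linux pod, parsed from /proc/meminfo
with open('/proc/meminfo') as f:
    meminfo = dict(line.split(':', 1) for line in f)
print('MemAvailable:', meminfo['MemAvailable'].strip())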

@rico-ci
Author

rico-ci commented Apr 28, 2021

OKAY! So, I got it to work, and I am a bit embarrassed about the underlying issue.
Basically, I had far too many gunicorn workers for the number of cores I had. Reducing them to a lower number got rid of all the concurrency and I/O errors and made my deployment as a whole more stable. There are no broken-pipe errors or core dumps anymore.

@rico-ci rico-ci closed this as completed Apr 28, 2021
@dionhaefner
Collaborator

So it probably was the pod eventually running out of memory? Each worker gets its own raster cache, which can grow up to TC_RASTER_CACHE_SIZE (490 MB by default).
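
(With, say, four workers, the caches alone could grow to roughly 4 × 490 MB ≈ 2 GB.)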

@rico-ci
Author

rico-ci commented Apr 28, 2021

That is very probable. I never saw the overall memory/CPU usage top any of my pod's limits, but it could be that gunicorn's allocation did some background magic that caused the error. Otherwise, I could also imagine that there were simply too many processes running at the same time, causing this assortment of errors.
Either way, thank you all for following up and offering ideas about what it could have been.
Maybe still a valid discussion for future reference.

@dionhaefner
Collaborator

No worries. Happy to help, and I think the changes I implemented are still valuable for the future.

@rico-ci
Author

rico-ci commented May 12, 2021

@dionhaefner is there a timeline for a new Terracotta release that will include the changes discussed here?

@dionhaefner
Collaborator

Good point. I'll make a release later today.

@dionhaefner
Collaborator

Done

@rico-ci
Author

rico-ci commented Jun 2, 2021

Hi there! Since I last posted, this issue has been cropping up again, and I'm pretty clueless about what to do about it.

My specs and settings:

pod:
    CPU: 1
    RAM: 1Gi
WSGI server:
    name: gunicorn
    workers: 4
    command: gunicorn -w 4 -b :8000 terracotta.server.app:app
terracotta settings:
    TC_DRIVER_PATH: "tiles/terracotta.sqlite"
    RASTER_PATTERN: "{dataset_id}.tiff"
    TC_RESAMPLING_METHOD: "nearest"
    TC_RASTER_CACHE_SIZE: 128

The error-log I am receiving again:

 [-] Exception on /singleband/some_path/tile.png [GET]
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/flask_api.py", line 49, in inner
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 121, in get_singleband
    return _get_singleband_image(keys, tile_xyz)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 166, in _get_singleband_image
    image = singleband(parsed_keys, tile_xyz=tile_xyz, **options)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/handlers/singleband.py", line 43, in singleband
    tile_data = xyz.get_tile_data(
  File "/usr/local/lib/python3.8/dist-packages/terracotta/xyz.py", line 44, in get_tile_data
    return driver.get_raster_tile(
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/base.py", line 20, in inner
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/raster_base.py", line 607, in get_raster_tile
    result = future.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
concurrent.futures.process.BrokenProcessPool: A process in the process pool was terminated abruptly while the future was running or pending.
 [-] Exception on /singleband/some_path/tile.png [GET]
concurrent.futures.process._RemoteTraceback: 
"""
Traceback (most recent call last):
  File "rasterio/_io.pyx", line 697, in rasterio._io.DatasetReaderBase._read
  File "rasterio/shim_rasterioex.pxi", line 142, in rasterio._shim.io_multi_band
  File "rasterio/_err.pyx", line 190, in rasterio._err.exc_wrap_int
rasterio._err.CPLE_AppDefinedError: IReadBlock failed at X offset 0, Y offset 6: ./tiles/some_path/tile.tif, band 1: IReadBlock failed at X offset 0, Y offset 1186: TIFFReadEncodedStrip() failed.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/lib/python3.8/concurrent/futures/process.py", line 239, in _process_worker
    r = call_item.fn(*call_item.args, **call_item.kwargs)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/raster_base.py", line 532, in _get_raster_tile
    tile_data = vrt.read(
  File "rasterio/_warp.pyx", line 1085, in rasterio._warp.WarpedVRTReaderBase.read
  File "rasterio/_io.pyx", line 361, in rasterio._io.DatasetReaderBase.read
  File "rasterio/_io.pyx", line 700, in rasterio._io.DatasetReaderBase._read
rasterio.errors.RasterioIOError: Read or write failed. IReadBlock failed at X offset 0, Y offset 6: ./tiles/some_path/tile.tif, band 1: IReadBlock failed at X offset 0, Y offset 1186: TIFFReadEncodedStrip() failed.
"""

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1515, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1513, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python3.8/dist-packages/flask/app.py", line 1499, in dispatch_request
    return self.ensure_sync(self.view_functions[rule.endpoint])(**req.view_args)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/flask_api.py", line 49, in inner
    return fun(*args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 121, in get_singleband
    return _get_singleband_image(keys, tile_xyz)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/server/singleband.py", line 166, in _get_singleband_image
    image = singleband(parsed_keys, tile_xyz=tile_xyz, **options)
  File "/usr/lib/python3.8/contextlib.py", line 75, in inner
    return func(*args, **kwds)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/handlers/singleband.py", line 43, in singleband
    tile_data = xyz.get_tile_data(
  File "/usr/local/lib/python3.8/dist-packages/terracotta/xyz.py", line 44, in get_tile_data
    return driver.get_raster_tile(
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/base.py", line 20, in inner
    return fun(self, *args, **kwargs)
  File "/usr/local/lib/python3.8/dist-packages/terracotta/drivers/raster_base.py", line 607, in get_raster_tile
    result = future.result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 439, in result
    return self.__get_result()
  File "/usr/lib/python3.8/concurrent/futures/_base.py", line 388, in __get_result
    raise self._exception
rasterio.errors.RasterioIOError: Read or write failed. IReadBlock failed at X offset 0, Y offset 6: ./tiles/some_path/tile.tif, band 1: IReadBlock failed at X offset 0, Y offset 1186: TIFFReadEncodedStrip() failed.
 [!] Re-creating broken process pool

Unfortunately, this only happens for one dataset, and I can't pinpoint what that dataset does differently from the others I'm serving. I know for a fact that Terracotta has no trouble serving COGTiffs of around 20–100 MB. The dataset that throws the error above is around 400 MB.

I have also tried reducing the workers. That does stop the error from being thrown, but the performance is really poor: most tiles don't load, and to load them I need to zoom to a specific section of the map and wait around 4–10 seconds.

I'm not sure what I can do about this, but it prevents us from bringing Terracotta into production. Honestly, at this point any hints are welcome.

@rico-ci rico-ci reopened this Jun 2, 2021
@dionhaefner
Collaborator

Are you absolutely sure that your files are valid COGs?

That would explain your slow tile reads and out-of-memory errors.
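
One quick way to check, using the third-party rio-cogeo package (not part of Terracotta):

from rio_cogeo.cogeo import cog_validate

# cog_validate returns (is_valid, errors, warnings) for a given file
is_valid, errors, warnings = cog_validate('./tiles/some_path/tile.tif')
if not is_valid:
    print('Not a valid COG:', errors)

Terracotta's optimize-rasters CLI command can also rewrite files into valid COGs.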

@rico-ci
Author

rico-ci commented Jun 2, 2021

Some other services had been changed, and this was indeed the issue. Thanks @dionhaefner! I was really losing my wits but didn't think of that.

@rico-ci rico-ci closed this as completed Jun 2, 2021