Ditch databases? #172
Comments
I am split about this:
We could also decide to strengthen the DB feature by adding query functionality. But then we open the box of people wanting to query in very different ways.
Like in GeoTIFF tags? I assume you could still retrieve the metadata reasonably fast from the tags, and build the path → metadata mapping quickly enough for subsequent filtering, don't you think? It's completely parallelizable, at least.
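As a rough sketch of that scan, assuming rasterio and that the metadata has previously been written into each file's tags (the tag name is invented for illustration):

# Hypothetical sketch: build a path -> metadata mapping by reading
# GeoTIFF tags in parallel. "TC_METADATA" is an invented tag name.
import glob
import json
from concurrent.futures import ThreadPoolExecutor

import rasterio


def read_tags(path):
    # Only the file header needs to be read, so this is cheap per file
    with rasterio.open(path) as src:
        tags = src.tags()
    return path, json.loads(tags.get("TC_METADATA", "{}"))


paths = glob.glob("myrasters/**/*.tif", recursive=True)

# The per-file reads are independent, so the scan parallelizes well
with ThreadPoolExecutor() as pool:
    metadata = dict(pool.map(read_tags, paths))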
Well, if you do it right, the user would never know they are looking at paths. Before:
After:
Just have the folder structure
And in my proposal, you would still have the option to use a (fully searchable) database; it just wouldn't contain the raster metadata by default. So you give either
Yes.
This is a good idea in any case.
This would mean that keys would have to match the file/folder structure directly; wouldn't that be a step backwards? I guess it could be good as a default behavior; then a DB is only needed if the user wants to have arbitrary key->raster mappings.
But the DB would still be "managed" by Terracotta, right?
Yes, so you have pattern matching on the file system by default, and the option to use a database. The database could be anything that maps your search criteria to what we call keys now (and might call "path" or just "key" in a new version). An example I have in mind: a STAC catalogue of all your rasters; you index it via SQL-like queries, and thereby obtain the key you can use to query TC for the tiles.
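A rough sketch of what that default pattern matching could look like, using only the standard library (the pattern syntax mirrors the {date}/{tile}/{band} placeholders used in the examples further down):

# Hypothetical sketch: resolve {key} placeholders in a path pattern
# against the filesystem, with no database involved.
import glob
import re


def compile_pattern(pattern):
    # Escape regex metacharacters, then turn \{name\} into named groups
    escaped = re.escape(pattern)
    return re.compile(escaped.replace(r"\{", "(?P<").replace(r"\}", ">[^/]+)") + "$")


regex = compile_pattern("myrasters/{date}/{tile}/{band}.tif")

# Glob for candidate files, then extract the key values from each path
datasets = {}
for path in glob.glob("myrasters/**/*.tif", recursive=True):
    match = regex.match(path)
    if match:
        datasets[tuple(match.groupdict().values())] = path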
I'm not sure. I think it would be good to do that for the most common DB use case(s), but we should also have the option of databases that are externally managed.
The reason I'm asking is that if we don't have these tools, users would have to tinker with a database themselves to get the same functionality as now.
Yeah, we could still fully support e.g. the simple SQLite DB, and make any external DB read-only.
I'm not sure, though, in what way the SQLite DB would be easier to use than the filesystem. Ingest == dump the file into an S3 bucket sounds like the dream to me.
Yes, it's very plausible that the people who would want the SQLite DB are a small minority, so it might be fully justified to make their lives a tiny bit harder, for the greater good.
Well, the data has to live somewhere? S3 was just an example; it could be any filesystem.
Aha, S3 could actually be an example where you would want to use a database. You can't run glob patterns on S3 buckets, so if you have tens of thousands of files it could become slow to filter those on |
Right, I'm just thinking out loud, trying to cover every angle from the user's perspective. Imagining how this would work for the most unlikely user, who wants to have their rasters scattered all over the place 😄 But yeah, decoupling from the DB would probably make it better for the vast majority of users, at the possible cost of making it slightly worse for a small minority. Computing the metadata would then happen during
Okay, that changes things a bit, I guess. Using Lambdas and S3 should not require any external DB tinkering, right?
Simplest mode of usage: (only recommended for data exploration)

$ terracotta serve -r myrasters/{date}/{tile}/{band}.tif

In this case

Slightly more advanced: (recommended usage)

# This optimizes the files and dumps them into the S3 bucket when done
$ terracotta prepare-rasters myrasters/**/*.tif -o s3://myrasters
$ export TC_KEY_DESC="s3://key_desc.json"
$ terracotta serve -r s3://myrasters/{date}/{tile}/{band}.tif

Advanced:

$ terracotta prepare-rasters myrasters/**/*.tif -o s3://myrasters
$ terracotta ingest s3://myrasters/{date}/{tile}/{band}.tif -o s3://myrasters/tc.sqlite
$ export TC_KEY_DESC="s3://key_desc.json"
$ terracotta serve -d s3://myrasters/tc.sqlite

Custom paths to rasters: (keys are not coupled to file paths)

Same as before, but use the Python API to create the SQLite database.

External database:

Option 1a:
$ terracotta serve -r s3://myrasters/{date}/{tile}/{band}.tif --external

Option 1b:
$ terracotta serve --external

Option 2a:
$ terracotta serve -r s3://myrasters/{date}/{tile}/{band}.tif --external mysql://example.com:123456

Option 2b:
$ terracotta serve -r s3://myrasters/{date}/{tile}/{band}.tif --external mysql://example.com:123456
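The contents of key_desc.json are not pinned down above; as a purely hypothetical example, it could simply map key names to human-readable descriptions:

# Hypothetical sketch: key_desc.json is referenced above but its format
# is not specified, so the field names and descriptions here are
# invented for illustration.
import json

key_descriptions = {
    "date": "acquisition date of the raster",
    "tile": "tile identifier",
    "band": "spectral band",
}

with open("key_desc.json", "w") as f:
    json.dump(key_descriptions, f, indent=2)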
Options 2x look nice, but they are a bit half-baked, since there's no real way to browse the database...
Slower than a database lookup for sure, but listing keys in a bucket and running some regex on them is not that slow either.
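For example, the same idea as the filesystem sketch above, but against a bucket listing (assuming boto3; bucket name and key layout are made up):

# Hypothetical sketch: list all keys in an S3 bucket and filter them
# with a regex instead of querying a database. Assumes boto3.
import re

import boto3

regex = re.compile(r"(?P<date>[^/]+)/(?P<tile>[^/]+)/(?P<band>[^/]+)\.tif$")

s3 = boto3.client("s3")
paginator = s3.get_paginator("list_objects_v2")

datasets = {}
for page in paginator.paginate(Bucket="myrasters"):
    for obj in page.get("Contents", []):
        match = regex.match(obj["Key"])
        if match:
            datasets[tuple(match.groupdict().values())] = obj["Key"]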
Currently, we have the following tables in a Terracotta database:

keys: Contains the defined key names and their descriptions
datasets: Maps key values to the physical raster path
metadata: Maps key values to raster metadata (such as min and max value, bounds, footprint, ...)

An alternative model could be to save raster metadata on the rasters themselves. In that case, it would be much less likely for the raster metadata to go out of date. Having the metadata in a database only makes sense if we want to search it, which we currently don't allow.
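A minimal sketch of what storing the metadata on the raster itself could look like, assuming rasterio (the tag name and field names are invented):

# Hypothetical sketch: persist the raster metadata in the GeoTIFF's
# own tags instead of a database. "TC_METADATA" is an invented name.
import json

import rasterio

metadata = {"min": 0.0, "max": 255.0, "bounds": [10.0, 50.0, 11.0, 51.0]}

# Write the metadata into the file's tags (requires update mode)
with rasterio.open("raster.tif", "r+") as src:
    src.update_tags(TC_METADATA=json.dumps(metadata))

# Later, read it back without consulting any database
with rasterio.open("raster.tif") as src:
    metadata = json.loads(src.tags()["TC_METADATA"])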
Doing this would even allow us to decouple the database from Terracotta entirely. We could then have an external database that the frontend can query for valid rasters, and request them from Terracotta by filename. This gives users flexibility to have a searchable catalogue outside of TC, and we could recover the current behavior by running directory listings on the raster folder.
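To make that flow concrete, a hypothetical sequence (all URLs and endpoint shapes are invented for illustration):

# Hypothetical sketch of the decoupled flow: the frontend asks an
# external catalogue for a raster path, then requests tiles from
# Terracotta using that path. All URLs here are invented.
import requests

# 1. Query the external catalogue (e.g. a STAC API) for matching rasters
results = requests.get(
    "https://catalogue.example.com/search",
    params={"datetime": "2018-06-01", "band": "red"},
).json()
path = results["features"][0]["assets"]["data"]["href"]

# 2. Request a tile for that raster from Terracotta by filename
tile = requests.get(f"https://tc.example.com/singleband/{path}/10/530/340.png")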