Defining Bounding Boxes for Optimizing Spatial Queries #188

jwass · 2023-11-13T13:47:41Z

jwass
Nov 13, 2023
Collaborator

After the GeoParquet meetup last week, I said I'd kick off the discussion around spatial indexing and partitioning as part of the drive to 1.1.0

Background & Motivation

GeoParquet files can dramatically improve the performance of spatial queries by including every row's minimum bounding rectangle (MBR) coordinates. With the MBR coordinates stored as ordinary Parquet columns, the min/max summary statistics in the Parquet footer metadata will automatically store the MBR coordinates for each row group. Readers can execute spatial queries by first using the Parquet metadata to quickly determine which row groups have an MBR that intersects a region of interest, needing to process only those row groups any further and ignoring the rest of the dataset. This allows us to use standard Parquet capabilities to make Parquet files behave like a spatial index. The benefit is even more pronounced in remote/cloud-native environments where only subsets of row groups need to be transferred over the network to clients rather than entire files.

Proposed approach and next steps

Define a standard representation for a row-level MBR

Overture Maps distributes large Parquet (soon to be GeoParquet) datasets. The Overture schema includes a bounding box column called bbox defined as a Parquet struct with fields minx, maxx, miny, maxy. This struct column allows the Parquet schema to present the MBR as a single column, but underneath is 4 separate arrays that take advantage of the summary statistics as explained above. Should we start with this definition as a straw man to understand where it helps or is lacking?

Note: when the GeoArrow format is adopted and used, a separate MBR column may be entirely unnecessary.

Specify the MBR column in the GeoParquet Column Metadata

Add a new optional field for identifying the bounding box column of geometry (bbox_column_name)?

Not In Scope

This definition for the MBR will not impose any requirement of how to store or sort spatial data within a GeoParquet file. Instead we should explore different strategies and techniques and measure their performance. We can then recommend best practices and develop tooling to help data producers and consumers.

Prior discussions and exploration:

Add an spatial index in geoparquet format as optional #13
Eugene Cheipesh's presentation: https://www.youtube.com/watch?v=uNQrwMMn1jk
Admin-partitioned GeoParquet (https://cholmes.medium.com/the-admin-partitioned-geoparquet-distribution-59f0ca1c6d96)

paleolimbot · 2023-11-13T14:10:47Z

paleolimbot
Nov 13, 2023
Collaborator

I think identifying a separate column with simplified geometry information amenable to column statistics will probably be useful for quite some time and is independent of geoarrow encoding (although I will try to get the geoarrow encoding PR up before our next sync call anyway). Maybe as an example:

{
  "columns": {
    "geometry": {
      "encoding": "WKB", 
      "covering": {"geometry_bbox": {"encoding": "box"}}
  }
}

...where the geometry_bbox column has the content you described above. The JSON encoding above allows for futures where there is another type of covering is allowed (e.g., H3 index, S2 index, or lists of those defining a covering for spherical geometry).

minx, maxx, miny, maxy

I like this ordering (because the indices of all of those are compile-time constants even if you add in Z or M), but elsewhere in the spec we use [xmin, ymin, xmax, ymax]. I tend to prefer "xmin" vs "minx" but I know there's precedent for both.

9 replies

cholmes Nov 14, 2023
Maintainer

I think identifying a separate column with simplified geometry information amenable to column statistics will probably be useful for quite some time and is independent of geoarrow encoding

+1 - we won't be rid of WKB as an option in any 1.x series, and the geoarrow encoding would have to really take off to argue that we should do a 2.0 that doesn't have WKB. And it seems like it'll be key to making WKB work well with indexing. It'd make sense to me to make it an optional field, so that geoarrow encodings would just not include it.

paleolimbot Nov 15, 2023
Collaborator

is the term "covering" a generally used term? It seems to be used in S2

Probably not common, I plucked it from S2. "Feature bounds" or "bounds" could work too...I think "covering" is intuitive also (if never before used in a Cartesian context).

paleolimbot Nov 15, 2023
Collaborator

Also, s2_cell is probably better than s2_cell_union (can be int64 for one cell or list for multiple). A "cell union" in S2 C++ land implies that the cells don't intersect and I don't think it's necessary to impose that constraint here.

jwass Nov 15, 2023
Collaborator Author

Small nitpick / question on the naming: is the term "covering" a generally used term? It seems to be used in S2, but if I just google "covering GIS", it doesn't really give me anything useful.

I wonder if these really need to sit under a common covering or other term in the first place?

What about:

{
  "columns": {
    "geometry": {
      "encoding": "WKB", 
      "box": {"column", "geometry_bbox"},
      "s2_cell_union": {"column": "geometry_cov"},
       ...
    }
  }
}

I can't see myself iterating over the different coverings rather than just look up the type(s) I'm interested in. Also - this way restricts there to be only one top of "box" per geometry... where if the encoding is a key somebody could do something unusual like:

{
  "columns": {
    "geometry": {
      "encoding": "WKB", 
      "covering": [
        {"column": "geometry_bbox", "encoding": "box"},
        {"column": "geometry_bbox2", "encoding": "box"},
      ]
    }
  }
}

And clients have to pick one. The schema could restrict there to be only one of each kind, but it feels like it can be avoided entirely.

paleolimbot Nov 17, 2023
Collaborator

Definitely a better idea to use the bounding box type as the key rather than the column! Then you could something like if "box" in spec["covering"]. I don't have strong feelings about the nesting although we already have a "bbox" key, so that might be a tiny bit confusing.

PostholerCom · 2023-11-13T18:09:44Z

PostholerCom
Nov 13, 2023

Having a spatially indexed parquet file will be far more useful for the GIS crowd.

In context of cloud native data.

Using the MBR as an index, will the data be stored in a manner making http range requests useable?
How would you store the MBR index data for cloud usage? As a header to the .parquet, making multiple requests to an additional file?

Parquet's real power is columnar layout, working well with massive data sets. Moving huge amounts of data for one time use/storage is fine. Parquet as a cloud native format, moving huge amounts of data across a network for immediate use by a client negates any performance advantage of the layout, even with a spatial index.

Even an 8 byte quadkey can have performance implications on a massive cloud native .parquet file, if useable at all. Have you considered R-Tree?

Not in cloud context where spatial subsets of data are needed.

What is the advantage of using a spatially indexed parquet file over GeoPackage format?

Edit: As you develop this, it would be cool to see a performance comparison to GeoPackage.

1 reply

paleolimbot Nov 13, 2023
Collaborator

I think the spatial-indexing issue ( #13 ) is orthogonal to the issue here...most internal formats (e.g., GeoPackage's prefixed-WKB, PostGIS' internal serialization, DuckDB's internal serialization) have a way to store a covering prefixed to the actual geometry as a low-hanging optimization for operations where the "possible intersection" check can avoid reading entire geometries. We can't/don't want to change the serialization of the geometry column, so the ability to provide another place where that information is stored seems reasonable.

jiayuasu · 2023-11-13T19:59:30Z

jiayuasu
Nov 13, 2023

@jwass @paleolimbot Thanks for the effort. I like the idea of having a explicit indexing column in addition to the current bbox info in the column metadata. In my opinion, this indexing column should be optional and recommended in the best practice.

On a side note, Apache Sedona geoparquet reader is already leveraging the bbox in the column metadata to perform implicit geospatial filter pushdown.

Additional resolution field?

For the type of the spatial indexing system, I believe the min/max XY is good enough. If we want to allow S2/H3/GeoHash, we might want to include another field resolution. Or do you think the users should be able to infer the resolution from the digits of the S2/H3/GeoHash code themselves?

Generic indexing system

We should keep the indexing system generic. In other words, we should allow users to choose any indexing system from S2/H3/GeoHash. This is because the libraries that can generate these indexes might be only available to certain languages. E.g., the core of H3 is in C with bindings to Java. We ran into a problem before: Snowflake does not allow arbitrary C code execution via Java JNI.

Consider geometries other than points

We should also consider geometries like polygons and linestrings. Different from points, these geometries might yield more than 1 H3/S2/GeoHash id. If we just choose one of the produced Id as the index value used in parquet fiels, filtering based on the index will give inaccurate filtering results.

7 replies

paleolimbot Nov 13, 2023
Collaborator

How does Sedona expect this to be specified in the Parquet columns

I'll put an example of this in the PR proposing the GeoArrow encoding, but I think this would be at the row-group level? An example of extracting those values for GeoArrow-encoded columns is here: https://github.com/geoarrow/geoarrow-python/blob/main/geoarrow-pyarrow/src/geoarrow/pyarrow/dataset.py#L434-L468

jiayuasu Nov 13, 2023

@jwass

Corrected some of my statement.

Yes, Sedona stores bbox info in the file level of geoparquet ~~in the row-group level of geoparquet~~ . ~~If a geoparquet file has multiple columns, then the statistics (min, max) of each column are available in the row-group statistics as well~~ . If we have a spatial query like this,

SELECT *
FROM s3://my-parquet-files-*.parquet T
WHERE ST_Contains(WA_State_Bound, T.geom) AND T.minX> -30 AND T.maxX <30

Sedona will first check the file level bbox and then check themin/max value of minX column, and min/max value of maxX column. So there are 3 filters here. A row-group (i.e., file) will be fetched from S3 by Sedona only if all three filters return true.

Existing parquet writer should generate min/max of nested columns as well. If that exists, Sedona will leverage it together with other filters.

jwass Nov 17, 2023
Collaborator Author

@jiayuasu One of the outcomes I was hoping for with the proposal is that when the bounding box is present, Sedona can detect that and use it without needing to be explicitly told by the user.

So for your query above my thought is a user can write:

SELECT *
FROM s3://my-parquet-files-*.parquet T
WHERE ST_Contains(WA_State_Bound, T.geom)

which Sedona can rewrite the query plan as if they wrote

SELECT *
FROM s3://my-parquet-files-*.parquet T
WHERE 
    ST_Contains(WA_State_Bound, T.geom)
    AND WA_State_bound.bbox.xmin < T.bbox.xmax
    AND WA_State_bound.bbox.xmax > T.bbox.xmin
    AND WA_State_bound.bbox.miny < T.bbox.ymax
    AND WA_State_bound.bbox.maxy > T.bbox.ymin

(hopefully I got the bounding box check right but you get the idea).

So users can just write a simple contains check but under the hood, Sedona can just do the right thing and use the bounding box to both:

Sub-select into the Parquet dataset using the row group and file-level bbox bounds
Use the boxes to short-circuit the contains call when the boxes don't overlap

humaidkidwai Apr 3, 2024

@jiayuasu
Sorry I am quite late to the discussion here but I have a question.

Sedona will first check the file level bbox and then check the min/max value of minX column, and min/max value of maxX column
Existing parquet writer should generate min/max of nested columns as well. If that exists, Sedona will leverage it together with other filters.

Does Sedona parquet writer explicitly define the columns maxX, minX for every geometry column?

jiayuasu Apr 4, 2024

@humaidkidwai No, we don't. maxX, minX are not required fields in GeoParquet spec

jorisvandenbossche · 2023-11-14T19:12:22Z

jorisvandenbossche
Nov 14, 2023
Maintainer

@jwass thanks for opening the discussion! I am very much +1 on your proposal, standardizing a way to include such basic (but useful in many cases) bbox information at the geometry-level, with @paleolimbot's suggestions on how to encode that information in the metadata such that we can easily expand it to other types of information in the future. I think this would be a very useful start.

0 replies

rouault · 2023-11-14T19:21:49Z

rouault
Nov 14, 2023

Remark: for 2D point datasets, it is a bit of a pity to have the same coordinate encoded 3 times... in minx,miny , in maxx,maxy and in the WKB blob... This is where a geoarrow encoding would make sense

Regarding the bbox columns, if we want to optimize space a bit, we could suggest (require ?) the data type to be FLOAT32 instead of FLOAT64. For example the SQLite RTree uses Float32 to store bboxes, with rounding down for minx,minyx and rounding up for maxx, maxy

6 replies

jwass Nov 14, 2023
Collaborator Author

Just another thought - Parquet does have a logical type for fixed precision decimal columns which can make use of delta-encoding and some others. That could provide better compression performance over float23s especially when nearby data is clustered together in the file - which is something we have to do anyway to take full advantage of using the bbox for spatial searches. Maybe something to experiment with.

jorisvandenbossche Nov 14, 2023
Maintainer

I personally wouldn't use decimals, because I think they generally have less wide support in reader implementations. We want to use this column for compute (e.g. select all rows where xmin > 10), and thus that compute needs to work on the decimal data.

Parquet also has Byte Stream Split encoding for float data, which combined with compression should also give a good performance, I assume (but indeed, will need some experiments comparing both to have an actual idea).

jwass Nov 14, 2023
Collaborator Author

I personally wouldn't use decimals, because I think they generally have less wide support in reader implementations.

I didn't realize that. That alone seems like a good reason to avoid it.

paleolimbot Nov 15, 2023
Collaborator

we could suggest (require ?) the data type to be FLOAT32

If we do this, I think "suggest" or "allow" would be a better option. I only say this because I think if we require it, we may end up in a situation where writers are doing a blind cast from double -> float, which I don't think works in this context (you have to be very careful to pick the next lowest float for mins and next highest float for maxes, as Even noted).

I have seen Byte Stream Split encoding work here but I forget where the issue is where somebody demonstrated this for me. I'm hoping to have a proof-of-concept implementation of the GeoArrow encoding to accompany the PR proposing it and will try to add in an example there (byte stream split encoding + compression is probably also useful for GeoArrow columns in addition to bounding boxes).

rouault Nov 15, 2023

we may end up in a situation where writers are doing a blind cast from double -> float, which I don't think works in this context (you have to be very careful to pick the next lowest float for mins and next highest float for maxes, as Even noted).

cf https://github.com/sqlite/sqlite/blob/90e4a3b7fcdf63035d6f35eb44d11ff58ff4b068/ext/rtree/rtree.c#L2993-L3020 for a potential implementation

PostholerCom · 2023-11-14T23:44:56Z

PostholerCom
Nov 14, 2023

Readers can execute spatial queries by first using the Parquet metadata to quickly determine which row groups have an MBR that intersects a region of interest

With a large data set, say 150M features like CONUS building footprints, what would the size of this meta data be? What would be minimum bytes per meta data group/row/block? Is it desirable or realistic to move MB's of meta data over a network?

needing to process only those row groups any further and ignoring the rest of the dataset.

This requires additional processing, that cloud native formats do not require.

This allows us to use standard Parquet capabilities to make Parquet files behave like a spatial index. The benefit is even more pronounced in remote/cloud-native environments where only subsets of row groups need to be transferred over the network to clients rather than entire files.

The benefits are not pronounced. Downloading potentially huge amounts of meta data, processing it and then, finally, moving the results. This requires 2 extra steps cloud native formats do not require.

5 replies

cholmes Nov 14, 2023
Maintainer

I'm pretty sure with this scheme that if you wanted the GeoParquet to be a 'cloud native format' according to your definition you can just set the row group size (like row_group_size in https://gdal.org/drivers/vector/parquet.html#layer-creation-options) to 1. Then it is an index of every single row. So it becomes a bit more like zarr, where the chunk size is configurable.

barbuz Nov 15, 2023

This requires additional processing, that cloud native formats do not require.

I think this is the same sort of processing that is done by any Parquet library, min/max statistics for each column chunk are stored exactly for the purpose of filtering entire row groups before looking at each row. What this proposal would do is enable the same sort of filtering on spatial attributes as is normally available for integers or strings.

jwass Nov 15, 2023
Collaborator Author

@barbuz Exactly. This is what I like about this approach of having the bbox right there in the Parquet data - plain Parquet readers are essentially navigating an r-tree (albeit one of limited depth) and doing a spatial search without needing any special spatial software or other capabilities. (This is also what I love about GeoParquet more broadly).

PostholerCom Nov 15, 2023

set the row group size to 1. Then it is an index of every single row.

This isn't an option for a dataset with 100M+ features. Frankly, that's absolutely ridiculous.

I'm just a bit frustrated when organizations like @Overture or @cloudnativegeo drop massive datasets in parquet format only, passing it off as 'map data' or 'cloud native'. It's not and it's not a matter of debate. Parquet is incapable of doing so. The only option is to download the entire dataset and convert it to a useful, genuine cloud native format, which contradicts the whole idea of storing data in the cloud in the first place.

Maybe I'm just being naive. Maybe the whole point is intentionally making this data difficult to access, $pecifically for building infrastructure in front of the data.

Storing vector data in the cloud as FGB, download partial data using bbox and save as parquet, seems much more KISS. Downloading the whole data set to save partial data is not Simple or Standard, it's just Silly.

jorisvandenbossche Nov 15, 2023
Maintainer

To all: can we please leave out the discussion about whether GeoParquet is a cloud native format or not here? You can continue that at cloudnativegeo/cloud-optimized-geospatial-formats-guide#82 if you want. There are plenty of use cases where GeoParquet is very valuable, and where this specific proposal about adding a bbox column is useful. This issue is to discuss that specific proposal.

jiayuasu · 2023-11-15T07:09:05Z

jiayuasu
Nov 15, 2023

Out of curiosity: I think this bbox column is useful. However, since GeoParquet is going to support geoarrow encoding soon, why do we want to introduce this additional bbox indexing and eventually drop it in another release? This might introduce a bunch of compatibility issues for the downstream readers and writers that want to support multiple geoparquet versions.

2 replies

jorisvandenbossche Nov 15, 2023
Maintainer

There is no plan to drop the WKB encoding when geoarrow support will be added (and I think for the foreseeable future, WKB will still be the recommended encoding if you want files that optimize for interoperability).
So given there are and will be a lot of GeoParquet files out there using WKB, IMO this is a useful addition to the spec for that kind of files, regardless of future geoarrow support.

And even for geoarrow encoded files, if your geometries are complex linestrings or polygons, there might still be use cases for storing the bounding box of each geometry, if you want to filter individual geometries on this bbox (in addition to filter on min/max per rowgroup). But that's probably very application-dependent whether that would be useful or not.

paleolimbot Nov 15, 2023
Collaborator

Just a +1 for that...GeoArrow encoding gets you row-group bounding box filtering for free but does not get you feature bounding box filtering for free. Depending how your file is written, row group filtering may not be useful anyway.

Writers are also always free to omit writing the feature-level bounding box, and readers are free to ignore it. If we end up in a situation where it is either always ignored or always omitted, we can deprecate it in the spec.

jwass · 2023-11-20T19:20:17Z

jwass
Nov 20, 2023
Collaborator Author

It seems like there's enough momentum to push this forward. At the GeoParquet meetup today, @jatorre expressed some caution and asked that we first do our homework to ensure most systems will be able to properly take advantage of the bbox proposal.

Our plan is that I will submit the PR to flesh this out. Simultaneously, we'll do a survey of BigQuery, Athena, Snowflake, duckdb, etc. to ensure they can take full advantage of the specific proposal prior to merging it in.

0 replies

jatorre · 2023-11-20T23:10:28Z

jatorre
Nov 20, 2023

thanks jacob,

…

On Mon, Nov 20, 2023 at 8:20 PM Jacob Wasserman ***@***.***> wrote: It seems like there's enough momentum to push this forward. At the GeoParquet meetup today, @jatorre <https://github.com/jatorre> expressed some caution and asked that we first do our homework to ensure most systems will be able to properly take advantage of the bbox proposal. Our plan is that I will submit the PR to flesh this out. Simultaneously, we'll do a survey of BigQuery, Athena, Snowflake, duckdb, etc. to ensure they can take full advantage of the specific proposal prior to merging it in. — Reply to this email directly, view it on GitHub <#188 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AAA7GO43PKELTPPSQ7RRTJ3YFOUPXAVCNFSM6AAAAAA7JFYHUSVHI2DSMVQWIX3LMV43SRDJONRXK43TNFXW4Q3PNVWWK3TUHM3TMMRSHE2TS> . You are receiving this because you were mentioned.Message ID: ***@***.*** com>

-- *Javier de la Torre* CSO +34689414420 | ***@***.*** | CARTO <https://www.carto.com/> <https://www.carto.com/>

0 replies

jwass · 2023-11-27T00:48:03Z

jwass
Nov 27, 2023
Collaborator Author

From my earlier comment I went to put together the pull request and realized there's no consensus on how to define the per-row box column in the geoparquet column metadata. I figured it makes more sense to iterate here rather than write up the specifics in the documentation/schema only to overhaul it later.

There were a few constraints we came up with:

Allow multiple "covering" columns to be specified. Bounding box for now, but maybe s2 or h3 in the future.
Make it easy to look for specific covering options for when you can only handle one kind
Allow only one kind of covering encoding per geometry column
Avoid name clash between existing bbox column (total over the file) and the per-row one being defined here.

I think there are two directions we landed on:

Coverings key

Create a covering section like this:

"geometry": {
  "encoding": "WKB",
  "geometry_types": [],
  "bbox": [-71.1, 42.3, -70.9, 42.5],
  "covering": {
    "box": {"column": "bbox"}
  }
}

covering's keys would be enumerated of a few different encoding options (box, s2_cell_union).

Top-Level

Add a fixed key to the top-level definition like this:

"geometry": {
  "encoding": "WKB",
  "geometry_types": [],
  "bbox": [-71.1, 42.3, -70.9, 42.5],
  "geometry_bbox": {"column": "bbox"}
}

I'm partial to just adding a new key in the top-level. Since the keys will be fixed/enumerations (e.g. geometry_bbox, geometry_s2_cell_union, etc), the covering section feels a bit too heavy-handed. But maybe I'm alone? Thoughts @paleolimbot @jorisvandenbossche - who had some opinions in that original discussion. Either way - the definition bbox column (struct with xmin, xmax, ... fields) seems to have solid agreement.

0 replies

jwass · 2023-11-27T20:18:39Z

jwass
Nov 27, 2023
Collaborator Author

Associated PR for this discussion: #191

0 replies

csringhofer · 2024-02-01T19:02:20Z

csringhofer
Feb 1, 2024

Looked through the PR and it looks great in general. The area I missing is page level indexing.
https://github.com/apache/parquet-format/blob/master/PageIndex.md

The discussion so far only mentions row group level min/max stats, while page level stats are also available in Parquet and supported by all the libraries I know (parquet-mr, parquet-cpp, Apache Impala's own cpp scanner/writer (the project I work on))
arrow/parquet-cpp added write support relatively recently (apache/arrow#34053), but read support is there since some time and the other libraries implemented it years ago.

This doesn't really affect how bounding boxes should be defined, as the current proposal with nested minx,maxx,miny,maxy would work perfectly well with page level indexes - but it would be good to mention it, as "smart enough" Parquet readers and writers can benefit from it.

In my understanding the Parquet design provides page indexing as the main solution for fine grained min/max stat filtering. Meanwhile creating smaller row groups is a valid workaround in many uses cases, especially if the reader doesn't use page indexes efficiently.

I saw some mentions of very small row group sizes, e.g. row_group_size=2000 in #183
This may work great in some scenarios, but it can be pretty inefficient if per-row group overhead is large compered to actual row size, e.g. if there are large dictionary pages.

5 replies

jwass Feb 2, 2024
Collaborator Author

Thanks! Continued some of the discussion over on the PR itself. But I updated the language to reflect that both row groups and page indexes can benefit here.

jwass Feb 2, 2024
Collaborator Author

We definitely have work to do on recommending different spatial sorting strategies alongside row group and file sizes to understand how these affect performance under different conditions. Especially as page indexes increase the size of the metadata beyond the row group overhead.

csringhofer Feb 2, 2024

specially as page indexes increase the size of the metadata beyond the row group overhead.

Generally I wouldn't worry too much about the overhead, as by default pages are not that small (AFAIK in 10KB-1MB range by default, depending on the writer library). Compared to that the extra 20-30 bytes in the footer per page should be negligible. There can be some problematic cases, e.g. long strings and very small pages.

Page level indexes were also designed to be skippable in case the reader doesn't need it for a given column chunk - a reader that doesn't support page indexes will simply not read that part of the file. Meanwhile row group stats are part of the big Thrift structure in the footer (FileMetaData), so practically any operation that reads the file (e,g, just to read the num_rows field for count(*)) will have to read and parse all the stats of all columns.

We definitely have work to do on recommending different spatial sorting strategies alongside row group and file sizes

I could look into a few libraries about how they read/write page indexes. Without understanding their capabilities it would be hard to interpret benchmarks and recommend strategies. Which libraries are at focus currently for writing / reading GeoParquet?

kylebarron Feb 2, 2024
Maintainer

We definitely have work to do on recommending different spatial sorting strategies alongside row group and file sizes

I could look into a few libraries about how they read/write page indexes. Without understanding their capabilities it would be hard to interpret benchmarks and recommend strategies. Which libraries are at focus currently for writing / reading GeoParquet?

I think @jwass was referring to the act of spatially sorting rows in general, not the act of writing the page indexes. I assume most implementations would automatically write page indexes for numeric columns?

I do a lot in Rust and I know the parquet crate supports page indexes.

csringhofer Feb 2, 2024

I assume most implementations would automatically write page indexes for numeric columns?

I think that the reality is more messy, because not all libraries read/write the same index.

I was a bit unclear about what I mean by "page level indexing", as there are two mechanism (sorry for the confusion!):
"page index": per page min/max statistics in data page header
"column index": per page min/max statistics stored in a separate area in the file
I meant the newer mechanism "column index", which is described in https://github.com/apache/parquet-format/blob/master/PageIndex.md - pretty confusingly the url has "pageIndex", while the content describes why they created a new mechanism to replace page index. This is not the only place where the two are used in a confusing way. My impression is that some considered the old mechanism deprecated, so assumed that "page index" can be also used as a name for "column index", but this is not stated clearly in the Parquet spec.

The benefit of "column index" is that IO cost of skipped pages can be also saved, while the original "page index" doesn't allow efficient skipping of IO (as it has to read some bytes from the start of each page).

The support of these two seems pretty diverse:
In Impala it was assumed that "column index" is clearly superior to "page index" and makes it redundant. So Impala doesn't write "page index" and ignores it during reading. Meanwhile writing "column index" can be turned on/off (on by default).

parquet-mr also opted for "column index" and removed "page index" writing:
apache/parquet-java@a69f2b3
"Page level statistics were never used in production and became pointless after adding column indexes."
AFAIK paruet-mr only skips decompression of skipped pages but not the IO cost.

Quickly looked into arrow-rs, and my impression is that it writes both "page index" and "column index". It seems to have efficient skipping logic for "column index" that also skips IO:
apache/arrow-rs@2185ce2

The documentation for pyarrow says that it can be configured whether "page index" or "column index" are written, see 'write_page_index':
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html#pyarrow.parquet.write_table
To add to the confusion, it refers to as "page index" to "column index".
It also states the read is doesn't use "page index" (which probably means "column index" here).

Again, sorry for the confusion about this - I didn't want to go into the details of "page index" vs "column index" in my original comment, but I could have mentioned it.

Generally I think that "column index" is clearly the better way to move forward, but it is not completely supported by all libraries.

jwass · 2024-03-26T01:34:33Z

jwass
Mar 26, 2024
Collaborator Author

Now that #191 got merged and 1.1 is on the way, here's a quick look at how performance improved for spatial lookups on remote geoparquet files using duckdb. There are two main improvements:

duckdb v0.10 gained row group filtering on nested group (struct) fields - e.g. bbox.minx > -71.1
The March Overture Maps release improved on the spatial partitioning of the dataset to better assist spatial queries. Briefly documented here with some nice pictures showing file/row group bounding boxes: GeoParquet Spatial Sorting/Indexing/Partitioning OvertureMaps/data#91 (comment)

We can look at the performance before/after each of these steps.

(1) above isn't really a GeoParquet win other than our nudging some projects to implement row group filtering on struct fields. We tracked these in #191. The projects that now implement this (duckdb, gdal, arrow-cpp, sedona) should likely see similar performance improvements as what's documented here.

This will query the same small bounding box region in the Boston area of the Overture Maps buildings dataset. The entire dataset is 2.3 billion rows, but only 891 are in the region of interest.

Step 1 - Duckdb v0.9 on Overture Feb Release - baseline

Duckdb v0.9 did not have row group filtering of nested fields

Data Transferred: 63.4 GB
Total Runtime: 6660.2 s

Step 2 - Duckdb v0.10 on Overture Feb release (row group filtering)

Data Transferred: 1.1 GB (57x improvement)
Total Runtime: 142.6s (46.7x improvement)

Step 3 - Duckdb v0.10 on Overture March release (spatial partitioning)

Data Transferred: 88.9 MiB (11.8x improvement over Step 2, 681x improvement over baseline)
Total Runtime: 34.75s (4.1x improvement over Step 2, 191x improvement over baseline)

Next Steps

Future work:

Modify Overture's bounding box columns to be floats rather than doubles. The spec allows either. This should further reduce the amount of data transferred and therefore runtime.
Build and include the _metadata summary file that includes the row group summary statistics across all files. This will avoid hundreds of head/get requests and should improve overall runtime dramatically. @kylebarron proposed this (How should metadata be written in a partitioned dataset? #79) nearly two years ago. Not all systems support this but they can fall back to existing behavior to probe each parquet file.
Other spatial partitioning methods
Compare with arrow, gdal, and others

Query profiles

Step 1 - Duckdb v0.9 on Overture February release

D explain analyze select count(1) from read_parquet('s3://overturemaps-us-west-2/release/2024-02-15-alpha.0/theme=buildings/type=building/*') where bbox.minx <  -71.058 and bbox.maxx > -71.068 and bbox.miny < 42.363 and bbox.maxy > 42.353;
100% ▕████████████████████████████████████████████████████████████▏

┌─────────────────────────────┐
│┌───────────────────────────┐│
│└───────────────────────────┘│
└─────────────────────────────┘
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
explain analyze select count(1) from read_parquet('s3://overturemaps-us-west-2/release/2024-02-15-alpha.0/theme=buildings/type=building/*') where bbox.minx <  -71.058 and bbox.maxx > -71.068 and bbox.miny < 42.363 and bbox.maxy > 42.353;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││            HTTP Stats:            ││
││                                   ││
││             in: 63.4GB            ││
││            out: 0 bytes           ││
││             #HEAD: 231            ││
││            #GET: 17217            ││
││              #PUT: 0              ││
││              #POST: 0             ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││        Total Time: 6660.28s       ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    UNGROUPED_AGGREGATE    │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        count_star()       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             1             │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│  ((struct_extract(bbox,   │
│  'minx') < -71.058) AND   │
│(struct_extract(bbox, ...  │
│ .068) AND (struct_extract │
│(bbox, 'miny') < 42.36...  │
│(struct_extract(bbox, 'maxy│
│       ') > 42.353))       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│       EC: 3485329504      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            891            │
│          (19.76s)         │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            bbox           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│       EC: 3485329504      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│         2350057850        │
│        (106448.56s)       │
└───────────────────────────┘

Step 2 - Duckdb v0.10.0 on Overture February release

D explain analyze select count(1) from read_parquet('s3://overturemaps-us-west-2/release/2024-02-15-alpha.0/theme=buildings/type=building/*') where bbox.minx <  -71.058 and bbox.maxx > -71.068 and bbox.miny < 42.363 and bbox.maxy > 42.353;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
explain analyze select count(1) from read_parquet('s3://overturemaps-us-west-2/release/2024-02-15-alpha.0/theme=buildings/type=building/*') where bbox.minx <  -71.058 and bbox.maxx > -71.068 and bbox.miny < 42.363 and bbox.maxy > 42.353;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││            HTTP Stats:            ││
││                                   ││
││            in: 1.1 GiB            ││
││            out: 0 bytes           ││
││             #HEAD: 232            ││
││             #GET: 789             ││
││              #PUT: 0              ││
││              #POST: 0             ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││        Total Time: 142.57s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    UNGROUPED_AGGREGATE    │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        count_star()       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             1             │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ Filters: bbox.minx<-71.058│
│  AND bbox.minx IS NOT NULL│
│  AND bbox.maxx>-71.068 AND│
│  bbox.maxx IS NOT NULL AND│
│ bbox.miny<42.363 AND ...  │
│  IS NOT NULL AND bbox.maxy│
│>42.353 AND bbox.maxy ...  │
│            NULL           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│       EC: 697065900       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            891            │
│         (2153.00s)        │
└───────────────────────────┘

Step 3 - Duckdb v0.10 on Overture March release

D explain analyze select count(1) from read_parquet('s3://overturemaps-us-west-2/release/2024-03-12-alpha.0/theme=buildings/type=building/*') where bbox.minx <  -71.058 and bbox.maxx > -71.068 and bbox.miny < 42.363 and bbox.maxy > 42.353;
100% ▕████████████████████████████████████████████████████████████▏
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
explain analyze select count(1) from read_parquet('s3://overturemaps-us-west-2/release/2024-03-12-alpha.0/theme=buildings/type=building/*') where bbox.minx <  -71.058 and bbox.maxx > -71.068 and bbox.miny < 42.363 and bbox.maxy > 42.353;
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││            HTTP Stats:            ││
││                                   ││
││            in: 88.9 MiB           ││
││            out: 0 bytes           ││
││             #HEAD: 232            ││
││             #GET: 471             ││
││              #PUT: 0              ││
││              #POST: 0             ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││         Total Time: 34.75s        ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
┌───────────────────────────┐
│      RESULT_COLLECTOR     │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             0             │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│    UNGROUPED_AGGREGATE    │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│        count_star()       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│             1             │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│       READ_PARQUET        │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│ Filters: bbox.minx<-71.058│
│  AND bbox.minx IS NOT NULL│
│  AND bbox.maxx>-71.068 AND│
│  bbox.maxx IS NOT NULL AND│
│ bbox.miny<42.363 AND ...  │
│  IS NOT NULL AND bbox.maxy│
│>42.353 AND bbox.maxy ...  │
│            NULL           │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│       EC: 467647132       │
│   ─ ─ ─ ─ ─ ─ ─ ─ ─ ─ ─   │
│            891            │
│         (538.63s)         │
└───────────────────────────┘

@paleolimbot You had asked for a copy of these results I mentioned today.

1 reply

paleolimbot Mar 26, 2024
Collaborator

Thanks!

Defining Bounding Boxes for Optimizing Spatial Queries #188

jwass Nov 13, 2023 Collaborator

Background & Motivation

Proposed approach and next steps

Define a standard representation for a row-level MBR

Specify the MBR column in the GeoParquet Column Metadata

Not In Scope

Prior discussions and exploration:

Replies: 13 comments · 36 replies

paleolimbot Nov 13, 2023 Collaborator

cholmes Nov 14, 2023 Maintainer

paleolimbot Nov 15, 2023 Collaborator

paleolimbot Nov 15, 2023 Collaborator

jwass Nov 15, 2023 Collaborator Author

paleolimbot Nov 17, 2023 Collaborator

paleolimbot Nov 13, 2023 Collaborator

Additional resolution field?

Generic indexing system

Consider geometries other than points

paleolimbot Nov 13, 2023 Collaborator

jwass Nov 17, 2023 Collaborator Author

jorisvandenbossche Nov 14, 2023 Maintainer

jwass Nov 14, 2023 Collaborator Author

jorisvandenbossche Nov 14, 2023 Maintainer

jwass Nov 14, 2023 Collaborator Author

paleolimbot Nov 15, 2023 Collaborator

cholmes Nov 14, 2023 Maintainer

jwass Nov 15, 2023 Collaborator Author

jorisvandenbossche Nov 15, 2023 Maintainer

jorisvandenbossche Nov 15, 2023 Maintainer

paleolimbot Nov 15, 2023 Collaborator

jwass Nov 20, 2023 Collaborator Author

jwass Nov 27, 2023 Collaborator Author

Coverings key

Top-Level

jwass Nov 27, 2023 Collaborator Author

jwass Feb 2, 2024 Collaborator Author

jwass Feb 2, 2024 Collaborator Author

kylebarron Feb 2, 2024 Maintainer

jwass Mar 26, 2024 Collaborator Author

Step 1 - Duckdb v0.9 on Overture Feb Release - baseline

Step 2 - Duckdb v0.10 on Overture Feb release (row group filtering)

Step 3 - Duckdb v0.10 on Overture March release (spatial partitioning)

Next Steps

Query profiles

paleolimbot Mar 26, 2024 Collaborator

jwass
Nov 13, 2023
Collaborator

Replies: 13 comments 36 replies

paleolimbot
Nov 13, 2023
Collaborator

cholmes Nov 14, 2023
Maintainer

paleolimbot Nov 15, 2023
Collaborator

paleolimbot Nov 15, 2023
Collaborator

jwass Nov 15, 2023
Collaborator Author

paleolimbot Nov 17, 2023
Collaborator

paleolimbot Nov 13, 2023
Collaborator

paleolimbot Nov 13, 2023
Collaborator

jwass Nov 17, 2023
Collaborator Author

jorisvandenbossche
Nov 14, 2023
Maintainer

jwass Nov 14, 2023
Collaborator Author

jorisvandenbossche Nov 14, 2023
Maintainer

jwass Nov 14, 2023
Collaborator Author

paleolimbot Nov 15, 2023
Collaborator

cholmes Nov 14, 2023
Maintainer

jwass Nov 15, 2023
Collaborator Author

jorisvandenbossche Nov 15, 2023
Maintainer

jorisvandenbossche Nov 15, 2023
Maintainer

paleolimbot Nov 15, 2023
Collaborator

jwass
Nov 20, 2023
Collaborator Author

jwass
Nov 27, 2023
Collaborator Author

jwass
Nov 27, 2023
Collaborator Author

jwass Feb 2, 2024
Collaborator Author

jwass Feb 2, 2024
Collaborator Author

kylebarron Feb 2, 2024
Maintainer

jwass
Mar 26, 2024
Collaborator Author

paleolimbot Mar 26, 2024
Collaborator