Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mixed concerns: Encoding + Geometry Type #207

Closed
m-mohr opened this issue Apr 27, 2024 · 15 comments
Closed

Mixed concerns: Encoding + Geometry Type #207

m-mohr opened this issue Apr 27, 2024 · 15 comments

Comments

@m-mohr
Copy link
Collaborator

m-mohr commented Apr 27, 2024

I stumbled across the new geometry types mentioned in the encoding. It seems the encoding is GeoArrow, but why are the concerns mixed here?

Why are geometry types in the encoding and that happens with the geometry type then? Are they alway the same for GeoArrow?

From an external perspective I'd have expected something like this:
endocing = geoarrow
geometry_types = [Point]

Also, the schema allows for example:
encoding = point
geometry_types = [Polygon]

The empty array for geometry_types also doesn't make sense for GeoArrow encoding.

In case of geoarrow the geometry_types can have a maximum of one array item, but that could be encoded...

(Without reading the spec, I also can't guess from the value that it's a GeoArrow encoding. Might make things just a little simpler to grasp by default.)

Wouldn't this also be more future-proof with regards to apache/parquet-format#44 ?

@jorisvandenbossche
Copy link
Collaborator

There was quite some discussion about this on the PR #189 (see for example the thread above and below this comment: #189 (comment)). The PR initially started with an "encoding": "geoarrow", and the original issue (#185) also mentioned the option of combining this with the existing geometry_types key.

But in the end we moved away of this tight coupling with geoarrow for the naming in the spec here, although we are still using a subset of the geoarrow specification.
(from the top of my head, some reasons: this allows deviating in the future (eg adding a struct-based (sparse union) geometrycollection type that might not match exactly with something from GeoArrow), we are using a subset of the GeoArrow specification anyway (not all things allowed in GeoArrow are allowed here), and the name is also not necessarily relevant for all Parquet implementations (they cam have nothing to do with Arrow)

The format spec itself is still very clear about this being based on GeoArrow: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#native-encodings-based-on-geoarrow (and yes, that means you might need to read the spec when encountering a file with such a column, but IMO that would still be the case anyway even if the encoding said "geoarrow").

It's true that the geometry_types key is not very useful in this specific case, but it should of course exactly match the type in the data, so something like encoding = point, geometry_types = [Polygon] is invalid (that already follows from the current spec saying this field should match with the data, but we could also be more explicit and mention in the native encodings section that geometry_type then is always a length-1 list)

@jorisvandenbossche
Copy link
Collaborator

The format spec itself is still very clear about this being based on GeoArrow: https://github.com/opengeospatial/geoparquet/blob/main/format-specs/geoparquet.md#native-encodings-based-on-geoarrow

Although I see we only link to the GeoArrow document with the names, not the the document with the memory layouts (https://geoarrow.org/format.html#native-encoding). That's something we should improve.

@kylebarron
Copy link
Collaborator

It's true that the geometry_types key is not very useful in this specific case

It could be useful to know in some circumstances that the encoding is multi polygon but the column includes both polygons and multi polygons

@paleolimbot
Copy link
Collaborator

I think that it is impossible not to mix some concerns with the single geometry-type encodings...the solution we settled on does have some overlap between the encoding name and the geometry type, but avoids mixing some other concerns and concepts to more accurately convey the relationships among the single-geometry layouts, Parquet and Arrow, for example.

It could be useful to know in some circumstances that the encoding is multi polygon but the column includes both polygons and multi polygons

I was under the impression that any features in a "multipolygon" encoding were all multipolygons (even if they only contained one element). This would be consistent with something like GDAL's reporting a layer's geometry type. We should probably make that explicit in the spec language if it is currently ambiguous.

It's true that the geometry_types key is not very useful in this specific case,

Technically it will tell you if you have Z coordinates or not (because the encoding key does not contain information about dimensions but the geometry_types key does). Leaving it empty would be a strange thing to do and I don't think it will be a problem for writers to get this part right; however, it is also not difficult (and likely safer, since it needs to check the memory layout anyway) for readers to fill in this piece of information themselves.

@jiayuasu
Copy link
Contributor

jiayuasu commented Apr 29, 2024

In the future, GeoParquet should really support mixed geometry types in the same column. Sedona community will propose some solutions soon.

For example, in the lates release of Overture Maps data (https://docs.overturemaps.org/schema/), base/water has LineString type water (rivers) and Polygon type water (lakes). This is not possible in the new GeoParquet encoding.

@cholmes
Copy link
Member

cholmes commented May 29, 2024

But in the end we moved away of this tight coupling with geoarrow for the naming in the spec here, although we are still using a subset of the geoarrow specification.
(from the top of my head, some reasons: this allows deviating in the future (eg adding a struct-based (sparse union) geometrycollection type that might not match exactly with something from GeoArrow), we are using a subset of the GeoArrow specification anyway (not all things allowed in GeoArrow are allowed here), and the name is also not necessarily relevant for all Parquet implementations (they cam have nothing to do with Arrow)

I think it may be too late, as we've already got some implementations released with this, but I was also wondering why we didn't put some 'prefix' on the encodings. Like if not geoarrow.point then at least like columnar.point or something (there's likely some better name). Or a new field for 'columnar_geometry_type' and have the encoding just be 'columnar' and 'wkb'.

Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints', and I had been thinking that geoparquet spec is 'open' to having a newer encoding, but the current arrow ones seem to take up the 'default namespace'. Like we could add some cool new encoding, and they'll likely have the same geometry types. I don't think it's a huge deal to have like new_encoding.point, but the latest names seem to imply that they're the 'right' way to do things, and then they also sorta 'hide' WKB, since there's a big list of arrow ones right next to the single 'wkb' name.

@m-mohr
Copy link
Collaborator Author

m-mohr commented May 29, 2024

That's the risk of implementing an unreleased spec, I'd say... It should still be possible to make a change.

@jorisvandenbossche
Copy link
Collaborator

Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints',

While we didn't add prefixes now, that doesn't mean we can't add suffixes later if we want to add other encodings, like "point_delta" (or at that point add a prefix like "delta.point")

but the latest names seem to imply that they're the 'right' way to do things, and then they also sorta 'hide' WKB, since there's a big list of arrow ones right next to the single 'wkb' name.

I wouldn't say the "right" way, but I think it is the simplest way. I personally think it is fine to use the implicit "default" namespace for that.
About hiding the single "wkb", that's seems is something we can solve with better rephrasing in the documentation (also renaming them with a prefix still requires to list all options)


While I am personally happy with what we have right now, I agree we should still allow ourselves to change things (although if we think we want to do that, I wouldn't wait too long with that, so GDAL/geopandas can be updated to follow this as fast as possible).
Both in geopandas and GDAL I think this is opt-in, and not used by default (well, for this topic; the also unreleased bbox covering column is written by default in GDAL)

(that's generally the trade-off we have with the desire to have some implementations to test things before calling the 1.1 spec final, and then actually thinking to change things before doing that ..)

@jorisvandenbossche
Copy link
Collaborator

have the encoding just be 'columnar'

FWIW I also want to clarify that I find using "columnar" in this context is ambiguous / confusing. When using "columnar" in the context of Parquet as a "columnar file format" or Arrow's "Columnar (in-memory) Format" specification, everything we discuss here is columnar. Also with the WKB encoding, all those WKB values in the geometry column are stored in a "columnar" fashion (all the binary blobs of the WKB values are stored together in one buffer).

@jwass
Copy link
Collaborator

jwass commented May 29, 2024

Someone recently pointed out to me that there's potentially more efficient geometry encoding techniques, like 'zigzag encoding coordinate deltas in varints

Somewhat tangential - I've been pretty eager to test out new encodings like the varint/delta/zigzag Chris mentioned (used in OSM PBF) and also Parquet's native bit-packed integer encodings as well. Although these would also require x,y columns to be integers not doubles - which I thought might be a non-starter

@paleolimbot
Copy link
Collaborator

zigzag encoding coordinate deltas in varints

I am assuming that's from/inspired by twkb? https://github.com/TWKB/Specification/blob/master/twkb.md#zigzag-encode . It would be derailing this thread to talk about that, but I think that would undo the main benefit of this encoding which is that no Geo-specific parsing is required.

That's the risk of implementing an unreleased spec

This particular change was an open PR for months that had some pretty significant discussion around how to name the encodings. It was not easy to come up with a consensus...I don't think the result is perfect, but I also don't think it has to perfectly separate concerns (just be specific enough that implementations are able to work with this, which it seems that they are). If/when a better encoding comes along, the name can be updated.

@jwass
Copy link
Collaborator

jwass commented May 29, 2024

I am assuming that's from/inspired by twkb? https://github.com/TWKB/Specification/blob/master/twkb.md#zigzag-encode . It would be derailing this thread to talk about that, but I think that would undo the main benefit of this encoding which is that no Geo-specific parsing is required.

I agree and don't want to derail the conversation here. However if we could enable x and y to be fixed precision decimal we could utilize parquet native integer encodings (e.g. delta binary packed) and not need geo-specific parsing to read the coordinates properly... but let's leave that for another discussion :)

@cholmes
Copy link
Member

cholmes commented May 31, 2024

This particular change was an open PR for months that had some pretty significant discussion around how to name the encodings. It was not easy to come up with a consensus...I don't think the result is perfect, but I also don't think it has to perfectly separate concerns (just be specific enough that implementations are able to work with this, which it seems that they are).

Yeah, I felt bad that I didn't read it closely enough / didn't think about there being new potential encodings.

If/when a better encoding comes along, the name can be updated.

Ah, that's a good point - I was sorta thinking we'd be 'stuck' with these names. But I guess the client logic would just have to be that at version 1.4 or whatever it would need to check the version number.

While we didn't add prefixes now, that doesn't mean we can't add suffixes later if we want to add other encodings, like "point_delta" (or at that point add a prefix like "delta.point")

Cool, I like this idea.

About hiding the single "wkb", that's seems is something we can solve with better rephrasing in the documentation (also renaming them with a prefix still requires to list all options)

Yeah, that sounds good. I'll look into tweaks that may make it clearer. It's good in the full description (one of the geometry types, or WKB), but in the single line it just looks like one of many options.

I personally think it is fine to use the implicit "default" namespace for that.

Sounds great. Thanks to everyone for sounding in, and I agree. I definitely didn't want to try to be pushing for a change at this late time. I mostly just wanted to be sure we had a path to other encodings, and weren't backing ourselves into a corner with declaring this one. But it sounds like we have lots of options. And it's good to have the discussion, to point at in the future if/when people want to propose other encodings.

@cholmes
Copy link
Member

cholmes commented May 31, 2024

have the encoding just be 'columnar'

FWIW I also want to clarify that I find using "columnar" in this context is ambiguous / confusing. When using "columnar" in the context of Parquet as a "columnar file format" or Arrow's "Columnar (in-memory) Format" specification, everything we discuss here is columnar.

Yeah, I was pretty sure that was a poor name - I just didn't want to spend a lot of time crafting an 'ideal' name when all I was trying to do was to make the point. If we had overwhelming enthusiasm to change (which to be clear I personally don't) then we could figure out the 'right' name.

@cholmes
Copy link
Member

cholmes commented Jun 3, 2024

Closing this issue, as it looks like it was well discussed. There were some interesting things in this discussion, like #207 (comment) but we should just have cleaner issues / PR's.

@cholmes cholmes closed this as completed Jun 3, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants