Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Initial geoarrow.shapely implementation #2

Open
wants to merge 10 commits into
base: main
Choose a base branch
from

Conversation

kylebarron
Copy link
Member

@kylebarron kylebarron commented Sep 24, 2023

Change list

  • Add pyarrow extension types for each geometry type
    • The pyarrow type definitions support both interleaved and separated coords, and XY, XYZ, XYM, XYZM dimensions.
  • Add pyarrow extension arrays for each geometry type. These have a to_shapely method to convert back to a shapely array.
  • Add pyarrow extension scalars for each geometry type. Designed to support PointArray()[i].as_py() to get a shapely object.
  • Add some cursory tests for shapely -> geoarrow -> shapely

This is loosely based on @jorisvandenbossche 's python-geoarrow (ref jorisvandenbossche/python-geoarrow#2), but updated to use shapely instead of pygeos and updated to have a separate extension array for each geometry type.

Questions:

  • Thoughts on the package name? This mixes some core pyarrow extension array classes with the shapely IO. In the future we might want to have the core pyarrow arrays without a GEOS dependency via shapely, but we can probably punt that decision for later?

TODO:

  • CRS support (probably storing in Python as a pyproj CRS object and (de)serializing through arrow when needed
  • Handle struct coordinates when converting from pyarrow to shapely
  • Handle offsets correctly for scalars after the first index

@jorisvandenbossche
Copy link

jorisvandenbossche commented Sep 26, 2023

I am still exploring all the work you and Dewey have been doing, but so there are also pyarrow extension type and array definitions in the geoarrow-pyarrow package. So that's something to discuss how we see that interact or work together? I assume we want to keep a version of the pyarrow extension types that doesn't require shapely as a dependency? But at the same time we can't really extend those classes from another namespace package like geoarrow-shapely. Those to_shapely methods could also live in geoarrow-pyarrow with an optional dependency on shapely? Or would there be a use case of having the geoarrow<->shapely conversions without a pyarrow dependency?

@kylebarron
Copy link
Member Author

so there are also pyarrow extension type and array definitions in the geoarrow-pyarrow package

yeah I didn't know those existed when I wrote these 🙈

I assume we want to keep a version of the pyarrow extension types that doesn't require shapely as a dependency? But at the same time we can't really extend those classes from another namespace package like geoarrow-shapely.

This might be a case like in Shapely's API design, where you might like to put methods directly on the array class, but in shapely's case, that's a numpy array, and you can't add methods here. And so likewise even if I'd like an API design that's PointArray.to_shapely(), maybe that's too difficult to square with our namespacing.

Those to_shapely methods could also live in geoarrow-pyarrow with an optional dependency on shapely? Or would there be a use case of having the geoarrow<->shapely conversions without a pyarrow dependency?

I think my questions are:

  • (more for @jorisvandenbossche because Dewey and I chatted a little about this already) is there a usefulness in having pyarrow extension types and arrays that aren't intricately linked to the C parsing. The existing extension arrays in geoarrow-python are very closely linked to geoarrow-c

  • What kind of API design makes sense to make clear that the C and Rust bindings are largely equal citizens in terms of how they can handle the geometry arrays. Like we could have

    • geoarrow.core which is pure-python pyarrow-extension types
    • geoarrow.shapely which is pure-python interop to shapely
    • geoarrow.c which are the C bindings to use the geoarrow.core classes
    • geoarrow.rs as the Rust bindings to use the same geoarrow.core classes

    The .c and .rs names do leak some implementation details. 🤷‍♂️

    I would also love for the API to be designed with good static typing support, but I understand if that's not a priority for you.

@paleolimbot
Copy link
Contributor

I definitely didn't do a good job of advertising that I'd done any of this! I basically was just poking away at it over the last two quarters to make sure our devrel team had something to demo when they write a post about this. The geoarrow-python repo is definitely the place for this (makes it clear what Python users should look for).

FWIW, I think that geoarrow.c (and perhaps geoarrow.rs) shouldn't be imported by users. In geoarrow.pyarrow, it should really be a lazy import: for compute functions where it makes sense to use the C implementation, geoarrow.c can be imported as needed (or maybe it won't be needed when geoarrow.rs can do all the same things). I think the type implementation would be better in pure Python...for the next person that feels like making that PR, there are a lot of tests to help already (although many more tests would be required since our type system is rather complicated and I do a lot of testing at the C level so that I don't have to rewrite it in R). The main motivation for using the C implementation was that I already had a C++ class that implemented the whole thing and didn't feel like rewriting it.

I'm also pro static typing but I'm not very good at it. PRs welcome!

I see the ecosystem as more like:

  • geaorrow.c: Provides zero-dependency type implementation + basic coordinate shuffling functions that take + produce Arrow C Data interface objects
  • geoarrow.rs: Provides zero-dependency type implementation + many more functions (because it can draw on rust/geo)
  • geoarrow.pyarrow: Provide all of geoarrow.c and geoarrow.rs to pyarrow users (e.g., by registering extension types with pyarrow and providing functions that take/return Array or ChunkedArray)
  • geoarrow.pandas: Provide all of geoarrow.c and geoarrow.rs to pandas users (e.g., through the geoarrow accessor that accepts Series and produces Series)
  • shapely and/or geopandas: Provide ability to import/export an Arrow C Data interface array/stream from array of shapely.

...but obviously that's off-the-cuff!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants