[FEA]: Improve pyarrow integration/IO performance using geoarrow-python #1288

Open
paleolimbot opened this issue Oct 24, 2023 · 5 comments
Labels
External Issues filed by people outside the team feature request New feature or request Needs Triage Need team to review and classify

Comments

@paleolimbot

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Now that geoarrow-pyarrow ( https://github.com/geoarrow/geoarrow-python ) is available and the GeoArrow specification has an initial 0.1 release, there are potential synergies we may be able to leverage given the common memory layout! Basically, geoarrow-pyarrow implements a pyarrow.DataType subclass for geometry with a type-level place to store the coordinate reference system. It would be very cool if cudf.Series.from_arrow() could handle these (or whatever the best interface is from your end).

I also think it has the potential to significantly speed up IO compared with the current geopandas.read_file() + cuspatial.GeoSeries.from_geopandas() path (as a rough estimate, the musings below assembled linestrings from a large-ish FlatGeobuf file about 20x faster).

Happy to implement anything in geoarrow-c or geoarrow-python that makes this easier! We're slowly working on getting both on conda-forge (they're on pip already).

Describe any alternatives you have considered

The closest thing that currently provides this functionality is from_geopandas(), with Shapely's to_ragged_array and from_ragged_array also providing similar buffer building/parsing capability.
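For readers unfamiliar with the "ragged array" layout mentioned above, here is a minimal pure-Python sketch of the idea: coordinates are flattened into one interleaved xy buffer, and an offsets buffer records where each linestring starts and ends. The helper names build_linestring_buffers and linestring_at are hypothetical and exist only for illustration; this is not Shapely's or geoarrow's implementation.

```python
def build_linestring_buffers(linestrings):
    """Flatten nested linestrings into (xy, geometry_offsets).

    Hypothetical helper for illustration only. Offsets count
    coordinate pairs, not individual floats.
    """
    xy = []
    offsets = [0]
    for line in linestrings:
        for x, y in line:
            xy.extend((float(x), float(y)))  # interleaved x0, y0, x1, y1, ...
        offsets.append(offsets[-1] + len(line))
    return xy, offsets


def linestring_at(xy, offsets, i):
    """Recover the i-th linestring from the flat buffers."""
    start, end = offsets[i], offsets[i + 1]
    return [(xy[2 * k], xy[2 * k + 1]) for k in range(start, end)]


lines = [[(0, 0), (1, 1)], [(2, 2), (3, 3), (4, 4)]]
xy, offsets = build_linestring_buffers(lines)
second = linestring_at(xy, offsets, 1)
```

Because both sides agree on this shape, handing the buffers from host to device should mostly be a matter of copying them rather than re-parsing geometries.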

Additional context

Some musings with a large-ish linestring dataset (with apologies if I'm missing some obvious usage I should be aware of):

# Get the data in .fgb form
# ! curl -L https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_line.fgb.zip \
#     -o ns-water-water_line.fgb.zip
# ! unzip ns-water-water_line.fgb.zip
# pip install geoarrow-pyarrow
import cudf
import cuspatial
import geopandas
import geoarrow.pyarrow as ga
from geoarrow.pyarrow import io
import pyarrow as pa

host_table = io.read_pyogrio_table("ns-water-water_line.fgb")
#> 0.4 sec
#> Would be great if this worked!
#> cudf.Series.from_arrow(host_table["wkb_geometry"])
#> CUDF failure at:/opt/conda/conda-bld/work/cpp/src/interop/from_arrow.cu:87: Unsupported type_id conversion to cudf

# Workaround
def geoarrow_to_cuspatial(arr):
    arr = ga.as_geoarrow(arr, coord_type=ga.CoordType.INTERLEAVED)
    validity, part_offset, geometry_offset, xy = arr.geobuffers()
    assert validity is None # null geometries not supported?
    assert arr.offset == 0 # slices not reflected in geobuffers currently
    return cuspatial.GeoSeries.from_linestrings_xy(xy, geometry_offset, part_offset)

chunks = [geoarrow_to_cuspatial(chunk) for chunk in host_table["wkb_geometry"].chunks]
#> 0.7 sec
#> Can't seem to concatenate to get a contiguous array for direct comparison
#> gpu_geom2 = cudf.concat(chunks)
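
One possible way around the concatenation problem is to merge the per-chunk buffers on the host before building a single GeoSeries. The sketch below is pure Python and only illustrates the offset arithmetic: each chunk's offsets are shifted by the number of coordinates already emitted, and offsets that do not start at zero (as with a sliced array) are rebased. The function name concat_linestring_chunks is hypothetical, and it handles a single offsets level; I'd assume the same shift applies to each level when there are both part and geometry offsets.

```python
def concat_linestring_chunks(chunks):
    """Merge (xy, geometry_offsets) pairs into one contiguous pair.

    Hypothetical helper for illustration only. xy is interleaved
    (x0, y0, x1, y1, ...) and offsets count coordinate pairs.
    """
    all_xy = []
    all_offsets = [0]
    for xy, offsets in chunks:
        base = all_offsets[-1]   # coordinates already emitted
        first = offsets[0]       # non-zero if the chunk was sliced
        last = offsets[-1]
        # Copy only the coordinates this chunk actually references.
        all_xy.extend(xy[2 * first:2 * last])
        # Rebase this chunk's offsets onto the merged buffer.
        all_offsets.extend(base + (o - first) for o in offsets[1:])
    return all_xy, all_offsets


chunk_a = ([0.0, 0.0, 1.0, 1.0], [0, 2])
chunk_b = ([2.0, 2.0, 3.0, 3.0, 4.0, 4.0], [0, 3])
merged_xy, merged_offsets = concat_linestring_chunks([chunk_a, chunk_b])

# A "sliced" chunk whose offsets start at 1 instead of 0.
sliced = ([9.0, 9.0, 5.0, 5.0, 6.0, 6.0], [1, 3])
sliced_xy, sliced_offsets = concat_linestring_chunks([chunk_a, sliced])
```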

There are more example datasets at https://geoarrow.org/data (although I'm sure you have many internally as well).

@paleolimbot paleolimbot added the feature request New feature or request label Oct 24, 2023
@GPUtester GPUtester added Needs Triage Need team to review and classify External Issues filed by people outside the team labels Oct 24, 2023
@GPUtester
Contributor

Hi @paleolimbot!

Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can!
In the meantime, feel free to add any relevant information to this issue.

@harrism
Member

harrism commented Oct 24, 2023

Thanks for the feature request. @paleolimbot where is the CRS in the example?

@paleolimbot
Author

It's a property of the (Arrow) type!

from geoarrow.pyarrow import io

tbl = io.read_pyogrio_table("/vsizip/vsicurl/https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-basin_point.fgb.zip")
tbl["wkb_geometry"].type.crs
#> '{"$schema":"https://proj.org/schemas/v0.7/projjson.schema.json","type":"Projected...

The full serialization of the type is described in the 'extension types' section ( https://github.com/geoarrow/geoarrow/blob/main/extension-types.md ), and you can access it using type.__arrow_ext_serialize__() (e.g., tbl["wkb_geometry"].type.__arrow_ext_serialize__() above). The CRS is the main thing that's in the serialization.
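
As a rough sketch of how a consumer might pull the CRS out of those serialized bytes: this assumes the serialization is a JSON object with an optional "crs" key, which is my reading of the extension types document linked above. The helper name crs_from_extension_metadata and the example bytes are made up for illustration, and the CRS value may be a string or a nested PROJJSON object depending on the producer.

```python
import json


def crs_from_extension_metadata(serialized):
    """Return the "crs" entry from serialized extension metadata, or None.

    Hypothetical helper; assumes the metadata is a JSON object with an
    optional "crs" key. Empty metadata is treated as "no CRS".
    """
    if not serialized:
        return None
    metadata = json.loads(serialized)
    return metadata.get("crs")


# Made-up metadata bytes, standing in for type.__arrow_ext_serialize__().
example = b'{"crs": {"type": "ProjectedCRS", "name": "hypothetical CRS"}}'
crs = crs_from_extension_metadata(example)
```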

@thomcom
Contributor

thomcom commented Oct 25, 2023

Hey @paleolimbot! Thanks for the update. I've been following your geoarrow work for a long while and am pretty excited to integrate it. A few months ago, before geoarrow.pyarrow existed, I wrote a simple wrapper that pulled the offset buffers and was able to construct cuspatial data from them easily and quickly. We will definitely be integrating your work. Is it available as a dependency in pip, yet?

@paleolimbot
Author

Is it available as a dependency in pip, yet?

Yes! pip install geoarrow-pyarrow should do it. I have the lower-level geoarrow-c on conda-forge and will submit the PR to add geoarrow-pyarrow in the next few days.

Projects
Status: Todo
Development

No branches or pull requests

4 participants