[FEA]: Improve pyarrow integration/IO performance using geoarrow-python #1288

paleolimbot · 2023-10-24T17:23:38Z

Is this a new feature, an improvement, or a change to existing functionality?

New Feature

How would you describe the priority of this feature request

Medium

Please provide a clear description of problem you would like to solve.

Now that geoarrow-pyarrow ( https://github.com/geoarrow/geoarrow-python ) is available and the GeoArrow specification has an initial 0.1 release, there are potential synergies we may be able to leverage given the common memory layout! Basically, geoarrow-pyarrow implements a pyarrow.DataType subclass for geometry with a type-level place to store the coordinate reference system. It would be very cool if cudf.Series.from_arrow() could handle these (or whatever the best interface is from your end).

I also think it has the potential to significantly speed up IO compared with the current geopandas.read_file() + cuspatial.GeoSeries.from_geopandas() path. As a rough estimate from the musings below, assembling linestrings from a large-ish FlatGeobuf file was about 20x faster.

Happy to implement anything in geoarrow-c or geoarrow-python that makes this easier! We're slowly working on getting both on conda-forge (they're on pip already).

Describe any alternatives you have considered

The closest thing that currently provides this functionality is from_geopandas(), with Shapely's to_ragged_array and from_ragged_array also providing similar buffer building/parsing capability.

Additional context

Some musings with a large-ish linestring dataset (with apologies if I'm missing some obvious usage I should be aware of):

# Get the data in .fgb form
# ! curl -L https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-water_line.fgb.zip \
#     -o ns-water-water_line.fgb.zip
# ! unzip ns-water-water_line.fgb.zip
# pip install geoarrow-pyarrow
import cudf
import cuspatial
import geopandas
import geoarrow.pyarrow as ga
from geoarrow.pyarrow import io
import pyarrow as pa

host_table = io.read_pyogrio_table("ns-water-water_line.fgb")
#> 0.4 sec
#> Would be great if this worked!
#> cudf.Series.from_arrow(host_table["wkb_geometry"])
#> CUDF failure at:/opt/conda/conda-bld/work/cpp/src/interop/from_arrow.cu:87: Unsupported type_id conversion to cudf

# Workaround
def geoarrow_to_cuspatial(arr):
    arr = ga.as_geoarrow(arr, coord_type=ga.CoordType.INTERLEAVED)
    validity, part_offset, geometry_offset, xy = arr.geobuffers()
    assert validity is None # null geometries not supported?
    assert arr.offset == 0 # slices not reflected in geobuffers currently
    return cuspatial.GeoSeries.from_linestrings_xy(xy, geometry_offset, part_offset)

chunks = [geoarrow_to_cuspatial(chunk) for chunk in host_table["wkb_geometry"].chunks]
#> 0.7 sec
#> Can't seem to concatenate to get a contiguous array for direct comparison
#> gpu_geom2 = cudf.concat(chunks)
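To make the buffer layout in the workaround easier to follow, here is a minimal pure-Python sketch of how interleaved coordinates plus two offset arrays can describe multilinestrings, and how per-chunk buffers could be joined by shifting offsets. It assumes that part_offset indexes into coordinate pairs and geometry_offset indexes into part_offset, and that each chunk's offsets start at 0. The helper names (concat_offsets, concat_chunks, decode_multilinestrings) are hypothetical and are not part of cuspatial, cudf, or geoarrow.

```python
def concat_offsets(offset_chunks):
    """Join offset arrays that each start at 0 into one offset array."""
    combined = [0]
    for offsets in offset_chunks:
        base = combined[-1]  # running total of items seen so far
        combined.extend(base + o for o in offsets[1:])
    return combined


def concat_chunks(chunks):
    """Join per-chunk buffers (dicts with xy, part_offset, geometry_offset)."""
    xy = [value for chunk in chunks for value in chunk["xy"]]
    part_offset = concat_offsets([c["part_offset"] for c in chunks])
    geometry_offset = concat_offsets([c["geometry_offset"] for c in chunks])
    return xy, part_offset, geometry_offset


def decode_multilinestrings(xy, part_offset, geometry_offset):
    """Expand the buffers into nested lists: geometry -> part -> (x, y)."""
    points = list(zip(xy[0::2], xy[1::2]))  # de-interleave x and y
    geometries = []
    for g in range(len(geometry_offset) - 1):
        parts = []
        for p in range(geometry_offset[g], geometry_offset[g + 1]):
            parts.append(points[part_offset[p]:part_offset[p + 1]])
        geometries.append(parts)
    return geometries


# Hypothetical data: chunk_a holds one geometry with two parts,
# chunk_b holds one geometry with a single part.
chunk_a = {
    "xy": [0, 0, 1, 1, 2, 2, 10, 10, 11, 11],
    "part_offset": [0, 3, 5],
    "geometry_offset": [0, 2],
}
chunk_b = {
    "xy": [20, 20, 21, 21, 22, 22],
    "part_offset": [0, 3],
    "geometry_offset": [0, 1],
}

xy, part_offset, geometry_offset = concat_chunks([chunk_a, chunk_b])
decoded = decode_multilinestrings(xy, part_offset, geometry_offset)
```

On the GPU the same offset shifting would presumably be done on device arrays before building a single GeoSeries; the lists here only illustrate the arithmetic.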

There are more example datasets at https://geoarrow.org/data as well (although I'm sure you have many internally as well).

GPUtester · 2023-10-24T17:23:55Z

Hi @paleolimbot!

Thanks for submitting this issue - our team has been notified and we'll get back to you as soon as we can!
In the meantime, feel free to add any relevant information to this issue.

harrism · 2023-10-24T23:15:20Z

Thanks for the feature request. @paleolimbot where is the CRS in the example?

paleolimbot · 2023-10-25T00:59:45Z

It's a property of the (Arrow) type!

from geoarrow.pyarrow import io

tbl = io.read_pyogrio_table("/vsizip/vsicurl/https://github.com/geoarrow/geoarrow-data/releases/download/v0.1.0/ns-water-basin_point.fgb.zip")
tbl["wkb_geometry"].type.crs
#> '{"$schema":"https://proj.org/schemas/v0.7/projjson.schema.json","type":"Projected...

The full serialization of the type is described in the 'extension types' section ( https://github.com/geoarrow/geoarrow/blob/main/extension-types.md ), and you can access it using type.__arrow_ext_serialize__() (e.g., tbl["wkb_geometry"].type.__arrow_ext_serialize__() above). The CRS is the main thing in the serialization.
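As a rough illustration of what a consumer could do with that serialization, the sketch below pulls the CRS out of the serialized bytes. It assumes the payload is a UTF-8 JSON object with an optional "crs" key, which is my reading of the linked extension-types document. The function name crs_from_extension_metadata and the sample payload are hypothetical, not output from a real file.

```python
import json


def crs_from_extension_metadata(serialized):
    """Return the "crs" entry of serialized extension metadata, or None."""
    if not serialized:
        return None  # empty metadata: no CRS recorded
    metadata = json.loads(serialized.decode("utf-8"))
    return metadata.get("crs")


# Hypothetical payload standing in for type.__arrow_ext_serialize__()
example = json.dumps(
    {"crs": {"type": "ProjectedCRS", "name": "hypothetical CRS"}}
).encode("utf-8")

crs = crs_from_extension_metadata(example)
```

A GPU-side reader could carry this value alongside the coordinate buffers, since the buffers themselves do not encode the CRS.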

thomcom · 2023-10-25T15:33:41Z

Hey @paleolimbot ! Thanks for the update. I've been following your geoarrow work for a long while and am pretty excited to integrate it. I wrote a simple wrapper a few months ago before geoarrow.pyarrow that pulled the offset buffers and was able to construct cuspatial data from it easily and fast. We will definitely be integrating your work. Is it available as a dependency in pip, yet?

paleolimbot · 2023-10-25T15:36:09Z

Is it available as a dependency in pip, yet?

Yes! pip install geoarrow-pyarrow should do it. I have the lower-level geoarrow-c on conda-forge and will submit the PR to add geoarrow-pyarrow in the next few days.

paleolimbot added the "feature request" (New feature or request) label on Oct 24, 2023

GPUtester added the "Needs Triage" (Need team to review and classify) and "External" (Issues filed by people outside the team) labels on Oct 24, 2023

kylebarron mentioned this issue Jan 30, 2024

[FEA]: Implement Arrow PyCapsule Interface #1332

Open
