geoparquet data on the Microsoft Planetary Computer with GDAL/OGR #101
-
Awesome! @rouault - does GDAL/OGR support reading a 'parquet dataset', i.e. a directory of parquet files? If yes, any advice would be great. If not, this could be a great dataset to test against while adding it; it would be a nice win to support that. It's also interesting to think about a 'write' option in OGR to split large datasets into directories of parquet files.
-
Sidenote, in Python, both |
-
@TomAugspurger It looks like the _metadata parquet file is corrupted, or unreadable by recent versions of arrow-cpp (tested with >= 7.0 here, whereas the file has been created by 4.0)
-
I've noticed that https://ai4edataeuwest.blob.core.windows.net/us-census/2020/census_blocks_geo.parquet/_metadata and https://ai4edataeuwest.blob.core.windows.net/us-census/2020/census_blocks_geo.parquet/_common_metadata have in their "geo" schema metadata the bounding box of part.0.parquet, and not that of the whole dataset. This probably stems from the underlying libraries assuming that all parts share the same schema. Ideally, though, the global bounding box should be recorded somewhere.
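As a rough illustration of what a "global bounding box" would be, here is a minimal sketch that takes the per-part boxes and returns their union. It assumes the boxes have already been extracted from each part's "geo" metadata as (xmin, ymin, xmax, ymax) tuples; the `union_bbox` helper and the sample values are hypothetical, and the sketch ignores geometries that cross the antimeridian.

```python
from typing import Iterable, Sequence, Tuple

BBox = Tuple[float, float, float, float]  # (xmin, ymin, xmax, ymax)


def union_bbox(bboxes: Iterable[Sequence[float]]) -> BBox:
    """Return the smallest box that covers every input box."""
    boxes = list(bboxes)
    if not boxes:
        raise ValueError("need at least one bounding box")
    # Transpose the list of boxes into four columns of coordinates.
    xmins, ymins, xmaxs, ymaxs = zip(*boxes)
    return (min(xmins), min(ymins), max(xmaxs), max(ymaxs))


# Hypothetical per-part bounding boxes, for illustration only.
part_bboxes = [
    (-88.5, 30.2, -84.9, 35.0),
    (-179.1, 51.2, -129.9, 71.4),
]
dataset_bbox = union_bbox(part_bboxes)
```

A writer could store the result in the dataset-level `_metadata` instead of copying the first part's box, but that is a design suggestion rather than something the libraries are known to do.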
-
I've added basic read support for partitioned datasets per OSGeo/gdal#5759 using the arrowdataset library
or
Note that no optimization is currently done regarding filtering. I can see in the arrowdataset API that attribute filtering can be done, so the driver could likely be improved by translating SQL to the arrowdataset API, but I'm not sure how spatial filtering could be done when the geometries are stored as WKB.
-
@TomAugspurger I would like to download I know that the data href is Thank you |
-
A quick demo of accessing geoparquet data on the Planetary Computer. There are a couple of things making this more difficult than it should be.
We'll work with some US Census data that's cataloged in https://planetarycomputer.microsoft.com/dataset/us-census. https://planetarycomputer.microsoft.com/dataset/us-census#Example-Notebook is an example accessing the data with geopandas / dask. We'll use just GDAL here.
First, data assets in the Planetary Computer are generally in private storage containers. You need to "sign" the data, which grants you a short-lived read-only access token (https://planetarycomputer.microsoft.com/docs/concepts/sas/). We'll get a token by making an HTTP request to https://planetarycomputer.microsoft.com/api/sas/v1/token/ai4edataeuwest/us-census.
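For readers who prefer to script the token request, here is a minimal Python sketch using only the standard library. The endpoint is the one given above; the helper names are hypothetical, and the assumption that the response is a JSON object with a "token" field should be checked against the SAS documentation linked above.

```python
import json
import urllib.request

SAS_ENDPOINT = "https://planetarycomputer.microsoft.com/api/sas/v1/token"


def sas_token_url(account: str, container: str) -> str:
    """Build the token URL for a storage account and container."""
    return f"{SAS_ENDPOINT}/{account}/{container}"


def parse_token(body: str) -> str:
    """Extract the token from the response body.

    Assumption: the body is a JSON object with a "token" field.
    """
    return json.loads(body)["token"]


def fetch_sas_token(account: str, container: str, timeout: float = 30.0) -> str:
    """Request a short-lived, read-only SAS token over HTTP."""
    url = sas_token_url(account, container)
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return parse_token(resp.read().decode("utf-8"))


# Usage (performs a network request):
# token = fetch_sas_token("ai4edataeuwest", "us-census")
```

The token is short-lived, so a script should fetch it just before use and not store it.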
Next we need the URL of the asset. We'll use the data for voting districts, cataloged in the https://planetarycomputer.microsoft.com/api/stac/v1/collections/us-census/items/2020-cb_2020_us_vtd_500k item. Currently, the item just has an fsspec-style abfs:// URL, which GDAL doesn't understand. We can (manually) transform that to an HTTP URL by putting the table:storage_options.account_name as the hostname, and prepending everything with GDAL's /vsicurl/ prefix.
Which hopefully is what's expected :)
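The manual URL transformation described above can be sketched in Python as follows. This is an illustration under stated assumptions, not a Planetary Computer API: the `abfs_to_vsicurl` helper is hypothetical, it assumes the simple abfs://container/path form, and it assumes the blob endpoint is `<account_name>.blob.core.windows.net`, as in the HTTP URLs used elsewhere in this thread.

```python
from urllib.parse import urlsplit


def abfs_to_vsicurl(abfs_url: str, account_name: str, sas_token: str = "") -> str:
    """Turn an fsspec-style abfs:// URL into a GDAL /vsicurl/ path.

    Assumption: the URL has the form abfs://<container>/<path>.
    """
    parts = urlsplit(abfs_url)
    if parts.scheme != "abfs":
        raise ValueError(f"expected an abfs:// URL, got {abfs_url!r}")
    container = parts.netloc
    path = parts.path.lstrip("/")
    url = f"https://{account_name}.blob.core.windows.net/{container}/{path}"
    if sas_token:
        # The SAS token is appended as the query string.
        url += "?" + sas_token
    return "/vsicurl/" + url


# Example with an assumed abfs:// href and a placeholder token.
gdal_path = abfs_to_vsicurl(
    "abfs://us-census/2020/census_blocks_geo.parquet",
    account_name="ai4edataeuwest",
    sas_token="sv=placeholder",
)
```

The resulting string can be passed to ogrinfo in place of a hand-built path.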
What about datasets? https://planetarycomputer.microsoft.com/api/stac/v1/collections/us-census/items/2020-census-blocks-geo is a parquet dataset ("directory" of parquet files) with one parquet file per state. It didn't seem like GDAL was able to read this, but I might have messed it up:
$ ogrinfo "/vsicurl/https://ai4edataeuwest.blob.core.windows.net/us-census/2020/census_blocks_geo.parquet?${SAS_TOKEN}"
I'll work on updating the STAC metadata for the assets. I think having a separate asset with the HTTPS URL would be helpful for non-fsspec clients.