geoparquet data on the Microsoft Planetary Computer with GDAL/OGR #101

TomAugspurger · 2022-05-17T21:41:32Z

TomAugspurger
May 17, 2022
Maintainer

A quick demo of accessing geoparquet data on the Planetary Computer. There are a couple of things making this more difficult than it should be.

We'll work with some US Census data that's cataloged in https://planetarycomputer.microsoft.com/dataset/us-census. https://planetarycomputer.microsoft.com/dataset/us-census#Example-Notebook is an example accessing the data with geopandas / dask. We'll use just GDAL here.

First, data assets in the Planetary Computer are generally in private storage containers. You need to "sign" the data, which grants you a short-lived read-only access token (https://planetarycomputer.microsoft.com/docs/concepts/sas/). We'll get a token by making an HTTP request to https://planetarycomputer.microsoft.com/api/sas/v1/token/ai4edataeuwest/us-census.

$ docker run -it --rm osgeo/gdal /bin/bash
$ apt-get update && apt-get install -y jq  # for parsing the token request
$ SAS_TOKEN=$(curl https://planetarycomputer.microsoft.com/api/sas/v1/token/ai4edataeuwest/us-census | jq -r .token)
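
For anyone scripting this instead of using curl and jq, here is a minimal Python sketch of the same token request using only the standard library. The helper names (token_url, parse_token, fetch_token) are made up for this illustration, and the only response field it relies on is token, the same one the jq filter above extracts.

```python
import json
import urllib.request

# Base of the token endpoint used in the curl command above.
SAS_API = "https://planetarycomputer.microsoft.com/api/sas/v1/token"


def token_url(account, container):
    """Build the token endpoint URL for a storage account and container."""
    return f"{SAS_API}/{account}/{container}"


def parse_token(body):
    """Extract the `token` field from the JSON response body."""
    return json.loads(body)["token"]


def fetch_token(account, container):
    """Request a short-lived, read-only SAS token (needs network access)."""
    with urllib.request.urlopen(token_url(account, container)) as resp:
        return parse_token(resp.read().decode("utf-8"))
```

Calling fetch_token("ai4edataeuwest", "us-census") should give the same string that ends up in SAS_TOKEN above.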

Next we need the URL of the asset. We'll use the data for voting districts, cataloged in the https://planetarycomputer.microsoft.com/api/stac/v1/collections/us-census/items/2020-cb_2020_us_vtd_500k item. Currently, the item just has an fsspec-style abfs:// URL, which GDAL doesn't understand. We can (manually) transform that to an HTTP URL by using the table:storage_options.account_name as the first part of the hostname (<account_name>.blob.core.windows.net), and prepending everything with GDAL's /vsicurl/ prefix:

$ ogrinfo "/vsicurl/https://ai4edataeuwest.blob.core.windows.net/us-census/2020/cb_2020_us_vtd_500k.parquet?${SAS_TOKEN}"
INFO: Open of `/vsicurl/https://ai4edataeuwest.blob.core.windows.net/us-census/2020/cb_2020_us_vtd_500k.parquet?<token>'
      using driver `Parquet' successful.
1: cb_2020_us_vtd_500k (Multi Polygon)

Which hopefully is what's expected :)
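
The manual URL rewrite described above can be sketched in Python. This is an illustration only: abfs_to_vsicurl is a hypothetical helper, and it assumes the item's href has the form abfs://<container>/<path> and that the account lives under blob.core.windows.net, as in the ogrinfo command above.

```python
from urllib.parse import urlsplit


def abfs_to_vsicurl(href, account_name, sas_token=""):
    """Turn an fsspec-style abfs:// href into a GDAL /vsicurl/ path.

    Assumes the href looks like abfs://<container>/<path>.
    """
    parts = urlsplit(href)
    if parts.scheme != "abfs":
        raise ValueError(f"not an abfs:// URL: {href!r}")
    # The container is parsed as the network location; the rest is the blob path.
    container, blob_path = parts.netloc, parts.path.lstrip("/")
    url = f"https://{account_name}.blob.core.windows.net/{container}/{blob_path}"
    if sas_token:
        # The SAS token is appended as the query string.
        url = f"{url}?{sas_token}"
    return f"/vsicurl/{url}"
```

With href "abfs://us-census/2020/cb_2020_us_vtd_500k.parquet" (assumed) and account name "ai4edataeuwest", this produces the path passed to ogrinfo above.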

What about datasets? https://planetarycomputer.microsoft.com/api/stac/v1/collections/us-census/items/2020-census-blocks-geo is a parquet dataset ("directory" of parquet files) with one parquet file per state. It didn't seem like GDAL was able to read this, but I might have messed it up:

$ ogrinfo "/vsicurl/https://ai4edataeuwest.blob.core.windows.net/us-census/2020/census_blocks_geo.parquet?${SAS_TOKEN}"

I'll work on updating the STAC metadata for the assets. I think having a separate asset with the HTTPS URL would be helpful for non-fsspec clients.

cholmes · 2022-05-17T21:56:20Z

cholmes
May 17, 2022
Maintainer

Awesome! @rouault - does GDAL/OGR support reading a 'parquet dataset', the directory of parquet files? If yes, then any advice would be great. If not, this could be a great dataset to test against when adding it - it would be a nice win to support that.

Also interesting to think about a 'write' option in OGR to split up large datasets into directories of parquet files.

3 replies

kylebarron May 17, 2022
Maintainer

Just connecting this to the Parquet dataset discussion here: #79

jorisvandenbossche May 17, 2022
Maintainer

I don't think GDAL/OGR supports reading partitioned datasets (it's using the Parquet C++ APIs, which deal with single files).

If GDAL were to consider supporting partitioned datasets, one option is to use the Arrow Datasets C++ API. It abstracts this for different file formats, including Parquet, and already handles the logic around different partitioning types, adding columns when the partitioning structure encodes data fields, filtering on those fields, etc. But that might not be straightforward to integrate (for example, GDAL has its own filesystem layer, while the Arrow C++ Datasets API is integrated with the Arrow filesystem layer).

I would be cautious with implementing this in GDAL itself. The simple things are probably quite straightforward (e.g. reading a flat directory of files instead of a single file), but once you start with this, there is a wide array of features that people will start asking for (the things I mentioned above, like nested partitioning, filtering based on the partitioning, different partitioning layouts, etc.).
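
To make "the partitioning structure encodes data fields" concrete: in hive-style layouts each directory is named key=value, and a reader turns those path segments into columns and can skip files before opening them. Below is a toy sketch in plain Python. It is not the Arrow Datasets API, and the function names and paths are hypothetical.

```python
def partition_fields(path):
    """Collect the key=value directory segments of a hive-style path."""
    fields = {}
    for segment in path.split("/")[:-1]:  # the last segment is the file name
        key, sep, value = segment.partition("=")
        if sep:  # only segments that actually contain "=" encode a field
            fields[key] = value
    return fields


def prune(paths, **wanted):
    """Keep only the files whose partition fields match every wanted value."""
    return [
        p for p in paths
        if all(partition_fields(p).get(k) == v for k, v in wanted.items())
    ]
```

For example, prune(paths, state="NY") keeps only the files under a state=NY directory, without reading any Parquet footers. Nested partitioning, typed values and other layouts are exactly where this simple approach stops being enough.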

cholmes May 17, 2022
Maintainer

> I would also be cautious with implementing this in GDAL itself. The simple things are probably quite straightforward (eg reading a flat directory of file instead of reading a single file), but once you start with this, there is a wide array of features that people will start asking for (things I mentioned above, like nested partitioning, filtering based on the partitioning, different partitioning layouts, etc).

Ah, that's a good warning. I do think it'd be interesting for GDAL to eventually be able to handle more complex cases. But it does feel like we should get to GeoParquet 1.0 first and then see how much of the data being distributed actually uses partitioned datasets. I could see it being popular for global dataset distribution, like https://github.com/microsoft/GlobalMLBuildingFootprints, but instead of just a bunch of geojson files, the data would all be partitioned parquet files with a top-level access directory.

jorisvandenbossche · 2022-05-17T22:11:33Z

jorisvandenbossche
May 17, 2022
Maintainer

> Currently, the item just has an fsspec-style abfs:// URL, which GDAL doesn't understand.

Sidenote, in Python, both fiona and pyogrio (Python GDAL bindings) support this style of URIs and can convert those to GDAL /vsi paths.

0 replies

rouault · 2022-05-17T22:27:50Z

rouault
May 17, 2022

@TomAugspurger It looks like the _metadata parquet file is corrupted, or unreadable by recent versions of arrow-cpp (tested with >= 7.0 here, whereas the file was created by 4.0):

$ curl "https://ai4edataeuwest.blob.core.windows.net/us-census/2020/census_blocks_geo.parquet/_metadata?${SAS_TOKEN}" > _metadata.parquet

$ ogrinfo _metadata.parquet  -al -q
Layer name: _metadata
ERROR 1: GetRecordBatchReader() failed

$ ~/arrow/cpp/build/release/parquet-reader _metadata.parquet
File Name: _metadata.parquet
Version: 1.0
Created By: parquet-cpp-arrow version 4.0.0
[snip]
--- Values ---
STATEFP                       |COUNTYFP                      |TRACTCE                       |BLOCKCE                       |
Parquet error: Invalid column metadata (corrupt file?)

2 replies

jorisvandenbossche May 17, 2022
Maintainer

The _metadata file is a metadata-only Parquet file, without any actual RowGroups (grouping all the metadata information of all other Parquet files in the partitioned dataset, see https://arrow.apache.org/docs/python/parquet.html#writing-metadata-and-common-metadata-files).
So it might be normal that GetRecordBatchReader fails (although, without looking at the code, I would have expected it to result in a RecordBatchReader that returns 0 batches).

jorisvandenbossche May 17, 2022
Maintainer

Downloading it with curl as you did above, and then trying to read the FileMetaData of it with pyarrow 8.0.0 works fine:

In [1]: import pyarrow.parquet as pq

In [2]: pq.read_metadata("_metadata.parquet")
Out[2]: 
<pyarrow._parquet.FileMetaData object at 0x7ff7399f9040>
  created_by: parquet-cpp-arrow version 4.0.0
  num_columns: 10
  num_rows: 8180866
  num_row_groups: 56
  format_version: 1.0
  serialized_size: 111476

Trying to read it as a normal file indeed gives the same error as you showed above as well (that's an error message that could be improved on the Arrow side ...):

In [3]: pq.read_table("_metadata.parquet")
...
OSError: Invalid column metadata (corrupt file?)

rouault · 2022-05-19T14:51:49Z

rouault
May 19, 2022

I've noticed that https://ai4edataeuwest.blob.core.windows.net/us-census/2020/census_blocks_geo.parquet/_metadata and https://ai4edataeuwest.blob.core.windows.net/us-census/2020/census_blocks_geo.parquet/_common_metadata have in their "geo" schema metadata the bounding box of part.0.parquet, and not that of the whole dataset. That's probably because the underlying libraries assume that the schemas of all parts are the same. Ideally, though, the global bounding box should be stored somewhere.

4 replies

TomAugspurger May 22, 2022
Maintainer Author

That global vs. first-partition part is discussed a bit in geopandas/dask-geopandas#94

rouault May 22, 2022

I've found a fix for the global bounding box by iterating over the metadata of all "fragments", which must be the equivalent of what dask-geopandas does : OSGeo/gdal#5775
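
The idea of deriving a dataset-wide extent from per-fragment metadata can be sketched in a few lines. This is not the GDAL implementation from the linked PR. It is plain Python that assumes each fragment's "geo" metadata is a JSON string with a primary_column entry and a per-column bbox of [minx, miny, maxx, maxy]; the function names are made up.

```python
import json


def fragment_bbox(geo_json):
    """Return the bbox of the primary geometry column of one fragment."""
    meta = json.loads(geo_json)
    return meta["columns"][meta["primary_column"]]["bbox"]


def global_bbox(geo_jsons):
    """Union the per-fragment bboxes into one dataset-wide bbox."""
    boxes = [fragment_bbox(g) for g in geo_jsons]
    if not boxes:
        raise ValueError("no fragments to compute an extent from")
    return [
        min(b[0] for b in boxes),  # minx
        min(b[1] for b in boxes),  # miny
        max(b[2] for b in boxes),  # maxx
        max(b[3] for b in boxes),  # maxy
    ]
```

The cost is one metadata read per fragment, and the sketch ignores bounding boxes that cross the antimeridian.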

kylebarron May 23, 2022
Maintainer

Just curious, does the GDAL dataset implementation exclusively use the _metadata file if it exists? Or does it always read the footer of every file in the directory?

rouault May 23, 2022

It uses the arrow-dataset library helpers. When it finds _metadata, it uses the arrow::dataset::ParquetDatasetFactory class, which avoids reading each fragment file for most operations (except for GetExtent(), since there's no way of getting the global extent without iterating over each fragment's metadata). When it doesn't find _metadata, it uses the arrow::dataset::FileSystemDatasetFactory class (not sure if that one needs to open each file).

rouault · 2022-05-19T21:32:09Z

rouault
May 19, 2022

I've added basic read support for partitioned datasets per OSGeo/gdal#5759, using the arrowdataset library.
With that, the following works:

ogrinfo "/vsicurl/https://ai4edataeuwest.blob.core.windows.net/us-census/2020/census_blocks_geo.parquet?${SAS_TOKEN}" -al -so

or

AZURE_STORAGE_SAS_TOKEN=$SAS_TOKEN AZURE_STORAGE_ACCOUNT=ai4edataeuwest ogrinfo /vsiaz/us-census/2020/census_blocks_geo.parquet -al -so
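
When driving ogrinfo from a script, the /vsiaz/ variant keeps the token out of the dataset path. A hedged Python sketch: vsiaz_invocation is a hypothetical helper that only builds the argument list and environment shown above, and it assumes the asset href has the form abfs://<container>/<path>.

```python
import os
from urllib.parse import urlsplit


def vsiaz_invocation(href, account_name, sas_token):
    """Build the ogrinfo arguments and environment for a /vsiaz/ path."""
    parts = urlsplit(href)
    # /vsiaz/ paths are <container>/<blob path>; the account comes from the env.
    vsi_path = f"/vsiaz/{parts.netloc}{parts.path}"
    env = dict(
        os.environ,
        AZURE_STORAGE_ACCOUNT=account_name,
        AZURE_STORAGE_SAS_TOKEN=sas_token,
    )
    return ["ogrinfo", vsi_path, "-al", "-so"], env


# Nothing runs until the result is handed to subprocess, e.g.:
#   args, env = vsiaz_invocation(href, "ai4edataeuwest", sas_token)
#   subprocess.run(args, env=env, check=True)
```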

Note that no optimization is currently done regarding filtering. I can see in the arrowdataset API that attribute filtering can be done, so the driver could likely be improved by translating SQL to the arrowdataset API, but I'm not sure how spatial filtering could be done with WKB-encoded geometries.

0 replies

aborruso · 2023-07-20T11:00:45Z

aborruso
Jul 20, 2023

@TomAugspurger I would like to download the KingdomofSaudiArabia_123022032_2023-04-25 STAC item.

I know that the data href is abfs://footprints/delta/2023-04-25/ml-buildings.parquet/RegionName=KingdomofSaudiArabia/quadkey=123022032, but I did not understand how to build the vsicurl URL you use.

Thank you

3 replies

TomAugspurger Jul 20, 2023
Maintainer Author

Our full example is at https://planetarycomputer.microsoft.com/dataset/ms-buildings#Example-Notebook.

If you're using GDAL, the URL is something like /vsicurl/https://bingmlbuildings.blob.core.windows.net/footprints/delta/..., where bingmlbuildings is the storage account name, which you can get from the msft:storage_account field in the collection metadata.

aborruso Jul 20, 2023

> If you're using GDAL, the URL is something like /vsicurl/https://bingmlbuildings.blob.core.windows.net/footprints/delta/..., where bingmlbuildings is the storage account name, which you can get from the msft:storage_account field in the collection metadata.

Yesterday I extracted the storage_account parameter and used this command:

ogrinfo "/vsicurl/https://bingmlbuildings.blob.core.windows.net/footprints/delta/2023-04-25/ml-buildings.parquet"

But I get this error:

ogrinfo failed - unable to open '/vsicurl/https://bingmlbuildings.blob.core.windows.net/footprints/delta/2023-04-25/ml-buildings.parquet'

What's wrong with my command?

aborruso Jul 20, 2023

I get the same error when using the token.
