Manifest driven downloads #117

Bonkles · 2024-07-25T18:51:27Z

This PR attempts to improve our download times by reducing the file count in two ways:

1. Using pre-calculated manifests with a bbox per parquet file.
2. Omitting any non-visible themes from the download data.

The first item relies on the manifests generated by @Bonkles/manifest-generator.

Together, these two bits of logic let us refine the total catalog of files we might need to consider, greatly decreasing the number of files to consult.

So, when the user clicks the download button, we can massively pare down the number of HTTP requests and the amount of data loading required to assemble a valid catalog. If the user wants buildings, we previously had to send 4 HTTP requests per file, a total of 800+ requests, before we could start downloading data.
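As a rough illustration of the two refinements described above, a minimal sketch might look like the following. The manifest shape (`basePath`, `files`, `theme`, `path`, and the `xmin`/`ymin`/`xmax`/`ymax` bbox keys), the sample paths, and the function names are assumptions made for illustration; they are not the actual format produced by the manifest generator or the code in this PR.

```javascript
// Hypothetical manifest shape; the real manifest layout may differ.
const manifest = {
  basePath: "release/example",
  files: [
    { theme: "buildings", path: "theme=buildings/part-0.parquet", bbox: { xmin: -10, ymin: 40, xmax: 0, ymax: 50 } },
    { theme: "buildings", path: "theme=buildings/part-1.parquet", bbox: { xmin: 100, ymin: -10, xmax: 120, ymax: 10 } },
    { theme: "places", path: "theme=places/part-0.parquet", bbox: { xmin: -20, ymin: 30, xmax: 20, ymax: 60 } },
  ],
};

// Two axis-aligned boxes overlap unless one lies entirely to one side of the other.
function bboxIntersects(a, b) {
  return a.xmin <= b.xmax && a.xmax >= b.xmin && a.ymin <= b.ymax && a.ymax >= b.ymin;
}

// Keep only files whose theme is visible and whose bbox overlaps the viewport,
// returning full paths built from the manifest's base path.
function selectFiles(manifest, viewport, visibleThemes) {
  const visible = new Set(visibleThemes);
  return manifest.files
    .filter((f) => visible.has(f.theme) && bboxIntersects(f.bbox, viewport))
    .map((f) => `${manifest.basePath}/${f.path}`);
}

const viewport = { xmin: -5, ymin: 42, xmax: -1, ymax: 48 };
const selected = selectFiles(manifest, viewport, ["buildings"]);
console.log(selected);
```

Files that fall outside the viewport or belong to hidden themes are dropped before any network traffic happens, which is where the savings in metadata requests would come from. This sketch ignores viewports that cross the antimeridian.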

Still to-do:

~~Now that the types are available for enabling/disabling in the selector, we can plumb that info to the download button and further refine our file catalog.~~
~~Rebase on top of Charlie's 'Select a feature' and other UI improvements~~

This PR has been pushed up to the canary for over a month, and at least a few folks who have submitted issues against the main site have been happy with the download code in this branch. So we've already got some good signal from the community that this approach 'just works'.

H-Plus-Time · 2024-07-27T12:06:15Z

Two pieces of information would be super useful in the manifests:

1. The serialized_size value of each file's FileMetaData: geoarrow-wasm can (with a minor tweak) forward that through to the with_footer_size_hint method of each file's reader instance. This effectively cuts 1 of the 4 requests (instant disk cache hit).
2. The file size: strictly speaking, object-store wants last_modified as well, but that's straightforward to fudge (the implementation in geoarrow-wasm doesn't pay attention to it). This should cut out the HEAD request entirely.

It likely won't make a huge difference to this repo given the bounding box optimizations (and the CF distro of course), but a 50% cut in metadata requests is still nice (that and I reckon a bit of offline behaviour/speculative read-ahead is possible with that). Also assuming these manifests make their way out to general use, broad bounding boxes + non-spatial filters will benefit greatly.
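To make the request savings concrete, here is a minimal sketch of how a client could use those two manifest values. A Parquet file ends with the serialized FileMetaData, followed by a 4-byte little-endian metadata length and the 4-byte magic `PAR1`, so knowing the file size and the metadata size up front is enough to compute the footer's byte range without a HEAD request or a separate read of the length field. The field names `fileSize` and `serializedSize` and the helper `footerRangeHeader` are hypothetical, and the sketch assumes `serialized_size` counts only the FileMetaData bytes, not the 8-byte trailer. It does not show geoarrow-wasm's API.

```javascript
// 4-byte metadata length + 4-byte "PAR1" magic at the very end of the file.
const PARQUET_TRAILER_BYTES = 8;

// Build an HTTP Range header value covering the whole footer
// (FileMetaData + trailer) from values stored in a manifest entry.
function footerRangeHeader(entry) {
  const footerBytes = entry.serializedSize + PARQUET_TRAILER_BYTES;
  if (footerBytes > entry.fileSize) {
    throw new RangeError("footer cannot be larger than the file");
  }
  const start = entry.fileSize - footerBytes;
  const end = entry.fileSize - 1; // HTTP byte ranges are inclusive
  return `bytes=${start}-${end}`;
}

// Hypothetical manifest entry with made-up sizes.
const example = { path: "part-0.parquet", fileSize: 1_000_000, serializedSize: 4_096 };
const rangeHeader = footerRangeHeader(example);
console.log(rangeHeader);
```

With both values known, a single ranged GET can fetch the footer, which is the kind of saving described above.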

Bonkles · 2024-07-27T15:17:16Z

Agreed @H-Plus-Time - the manifests are supposed to be general purpose info that helps everybody, so this is great info for me to consider. While I have this site in mind as a use case, I want to make sure useful info goes in for others.

msbarry · 2024-08-09T11:17:20Z

site/src/07-22-manifest.json

@@ -0,0 +1,4038 @@
+{


Would it be possible to publish a manifest like this (or at least just a list of file names) alongside each Overture release in S3 or Azure, so that other tools like this could process the list of files without having to know or care which blob storage API is needed to list files?

Ah, I realize this is similar to @H-Plus-Time's comment. It seems like this could go in this repo to benefit this one site, in Overture releases to benefit all tools that work with Overture data, or even into the GeoParquet spec to benefit other datasets as well. It seems like it helps bridge the gap from Parquet's initial big-data roots, where downloading an extra 250 MB is no big deal, to more consumer-facing use cases like this site.

Bonkles · 2024-08-09

@msbarry yes, that's the eventual goal of this sort of work. I saw that you'd also commented on OvertureMaps/data#25 from last year and understand your sentiments.

I've put the beginnings of a manifest file creation tool in my own personal GitHub repo at https://github.com/Bonkles/manifest-generator to try and alleviate this kind of concern. Definitely interested in any other ideas around info the manifest should contain.

msbarry · 2024-08-10

Nice! I think the bare minimum (not Overture-specific) would be a list of: relative path, file size, and GeoParquet metadata (or a subset like bbox and geometry types if the entire metadata is too big). Anything Overture-specific beyond that could be nice to limit processing but is not strictly necessary.
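One possible reading of that bare minimum, sketched as a data structure with a small validator. The entry layout, the key names `path`, `size`, and `geo`, the sample path, and `validateEntry` are all hypothetical; the `bbox` and `geometry_types` keys are borrowed from GeoParquet's column metadata, and the sketch assumes a 2D `[xmin, ymin, xmax, ymax]` bbox.

```javascript
// Hypothetical manifest entry covering the three suggested fields.
const exampleEntry = {
  path: "theme=places/type=place/part-00000.parquet", // relative to the release root
  size: 123456789, // file size in bytes
  geo: { bbox: [-180, -90, 180, 90], geometry_types: ["Point"] }, // metadata subset
};

// Return a list of problems; an empty list means the entry has the minimal fields.
function validateEntry(entry) {
  const problems = [];
  if (typeof entry.path !== "string" || entry.path.startsWith("/")) {
    problems.push("path must be a relative string");
  }
  if (!Number.isInteger(entry.size) || entry.size < 0) {
    problems.push("size must be a non-negative integer number of bytes");
  }
  const bbox = entry.geo && entry.geo.bbox;
  if (!Array.isArray(bbox) || bbox.length !== 4 || !bbox.every(Number.isFinite)) {
    problems.push("geo.bbox must be four finite numbers");
  }
  return problems;
}

console.log(validateEntry(exampleEntry));
```

Keeping paths relative means the same manifest could be served from any host or storage backend, which fits the goal of not needing to know a specific blob storage API.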

Bonkles mentioned this pull request Jul 26, 2024

HTTP/1.1 -> HTTP/2 for parquet resources - hosting #122

Closed

Bonkles mentioned this pull request Jul 27, 2024

Add serialized_size and the actual file size OvertureMaps/stac#3

Open

msbarry reviewed Aug 9, 2024

Bonkles force-pushed the manifest_driven_downloads branch from 0703d46 to 999499a October 2, 2024 21:44

Bonkles marked this pull request as ready for review October 3, 2024 14:46

Bonkles changed the title ~~[Draft] Manifest driven downloads~~ Manifest driven downloads Oct 3, 2024

Bonkles self-assigned this Oct 3, 2024

Bonkles force-pushed the manifest_driven_downloads branch from 6f71cf5 to 7aa57ab November 14, 2024 16:08

charliemcgrady reviewed Dec 17, 2024

site/src/06-24-manifest.json (resolved)

site/src/nav/DownloadButton.jsx (outdated, resolved)

charliemcgrady reviewed Dec 17, 2024

site/src/nav/DownloadButton.jsx (outdated, resolved)

charliemcgrady reviewed Dec 17, 2024

site/src/nav/DownloadButton.jsx (outdated, resolved)

Benjamin Clark added 15 commits December 18, 2024 11:23

af1aa85 Add manifest and logic to precompute the minimal dataset file list before download.

f854abd Change maxx, maxy to xmax, ymax, etc.

767ffc2 Update to july catalog for further testing.

286eb0e Lift the theme enablement state up to the App level and plumb it to the download button.

a8e50c6 Get the download catalog to honor the visible themes, discarding ones that aren't visible.

dbabe5b Update download catalog generation to return a base path and list of filespecs relative to that basepath.

909216e Split download file catalog by type, and download one file per type using concurrent promise pipelining.

4b217be Rebase this branch on top of main, and pick up the latest manifest for the 09-18 release.

30924fe Fix #168 but only writing tables to files if there are any batch records to speak of.

78f0a37 Trigger deploy

b137d65 Update package-lock file.

101ac37 Update geoparquet to support downloads.

8ccbc27 Fix the download bbox path specs to conform to the latest geoparquet-wasm release.

4abed7f Update the download manifest to point to the november release.

13a62b2 Remove the beta labelling from the wordmark. It's time for GA, Bay-bee

Bonkles force-pushed the manifest_driven_downloads branch from 1e58781 to 13a62b2 December 18, 2024 16:29

Benjamin Clark added 3 commits December 18, 2024 11:36

d4ba58f Fix formatting as per PR feedback.

744e987 Add a simple catch statement to alert the user when the download goes awry.

a63bc92 Move manifests into their own folder.

Bonkles merged commit 4fbdbf8 into main Dec 18, 2024
5 checks passed
