WIP: Add zone-wide Zarr format for consolidated UTM zone stores #194
Draft
avsm wants to merge 74 commits into ucam-eo:main
Conversation
Introduces geotessera/zarr_zone.py with support for building Zarr v3 stores that consolidate all tiles within a UTM zone into a single store per year (e.g. utm30_2024.zarr). Each store contains int8 embeddings and float32 scales arrays in native UTM coordinates with 10m pixels, chunked at (1024, 1024, 128) and (1024, 1024) respectively. Water pixels are marked as NaN in scales via landmask TIFFs. Exposed only via `geotessera-registry zarr-build` for building from local tile data, with --zones flag to test a single zone at a time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace text callbacks with Rich Progress bars showing:
- Tile scanning: bar with count and current tile coordinates
- Per-zone writing: bar with count, elapsed time, current tile
- Dry-run: table of zone/EPSG/tile-count
- Zone header: grid dimensions in pixels and km
Use ProcessPoolExecutor to read landmask CRS/transform in parallel during the scan phase. The rasterio Affine transform is serialised as a plain tuple to cross the process boundary. Defaults to min(cpu_count, 8) workers.
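The pattern can be sketched as follows. The worker body here is a stand-in (the real code opens each landmask with rasterio), and `_scan_tile`/`scan_tiles` are hypothetical names; the point is that the Affine is handed back as a plain 6-tuple so it crosses the process boundary cleanly.

```python
import os
from concurrent.futures import ProcessPoolExecutor

def _scan_tile(path):
    # Real worker would do roughly:
    #   with rasterio.open(path) as src:
    #       return path, src.crs.to_epsg(), tuple(src.transform)[:6]
    # Stand-in values shown here; the transform travels as a plain tuple.
    return path, 32630, (10.0, 0.0, 500000.0, 0.0, -10.0, 6200000.0)

def scan_tiles(paths, workers=None):
    # Default worker count mirrors the commit: min(cpu_count, 8).
    workers = workers or min(os.cpu_count() or 1, 8)
    with ProcessPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(_scan_tile, paths))
```

The parent can rebuild a rasterio `Affine` from the tuple with `Affine(*t)` if it needs one.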
Add compute_tile_grid() which derives EPSG, transform, and pixel dimensions from a tile's lon/lat using pyproj, without opening any files. During scanning, each tile's computed grid is compared against the actual landmask TIFF and mismatches are reported as warnings. This validates whether the landmask-free approach can replace per-tile I/O in future.
zarr 3.x create_array() expects `compressors=ZstdCodec(level=3)`, not `codecs=[BytesCodec(), ZstdCodec(level=3)]`.
The landmask TIFFs use the raw projected upper-left corner as origin (no snap to 10m pixel grid) and round() for dimensions. Our function was using floor/ceil snapping which caused off-by-one pixel mismatches at high latitudes.
Zarr v3 uses native dimension_names metadata (not _ARRAY_DIMENSIONS) to tell xarray which arrays are dimension coordinates. Adding these makes xr.open_zarr() automatically assign northing/easting/band as indexed dimension coordinates on embeddings and scales.
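For illustration, a trimmed-down `zarr.json` for the embeddings array carrying this metadata might look like the fragment below (shape values invented, most required fields abridged):

```json
{
  "zarr_format": 3,
  "node_type": "array",
  "shape": [21800, 21800, 128],
  "data_type": "int8",
  "dimension_names": ["northing", "easting", "band"]
}
```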
During tile scanning, check whether a tile's west and east edges (lon ± 0.05°) fall in different UTM zones. These tiles are placed in whichever zone the landmask EPSG specifies, but the warning surfaces them for review.
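The zone test reduces to simple arithmetic with the standard 6° zone formula (ignoring the Norway/Svalbard zone exceptions; function names are illustrative):

```python
def utm_zone(lon):
    # Standard 6-degree UTM zones: zone 1 starts at 180°W.
    return int((lon + 180.0) // 6) + 1

def spans_zone_boundary(lon, half_width=0.05):
    # Flag tiles whose west/east edges (lon ± 0.05°) land in different zones.
    return utm_zone(lon - half_width) != utm_zone(lon + half_width)
```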
Standalone HTML viewer that streams Zarr v3 UTM zone stores and renders false-colour overlays on a dark CartoDB basemap using MapLibre GL JS. Client-side UTM-to-WGS84 reprojection via proj4, zstd decompression via fzstd, and per-chunk contrast-stretched RGB from any 3 of 128 embedding bands selected by the user. The `serve` CLI command starts a CORS-enabled HTTP server, auto-detects .zarr stores in the target directory, and opens the viewer in the browser.
The server writes a _stores.json manifest listing available .zarr directories. The viewer fetches this on load, populates a dropdown, and auto-loads the first store. Manual URL entry remains available as a collapsible fallback.
All fetches (_stores.json, zarr.json, chunks) now use relative paths derived from window.location.pathname rather than absolute URLs built from window.location.origin. This ensures everything is same-origin regardless of which hostname the server is accessed from.
Server scans each store's embeddings/c/ directory on startup and writes _chunk_manifest.json listing existing chunk indices. Viewer loads the manifest and skips requests for non-existent chunks. Also stops suppressing 404 log messages on the server.
…ter fetching
- Chunk grid overlay shows data vs nodata positions with cyan/gray styling
- Chunk index labels appear at higher zoom levels
- UTM zone boundary shown as dashed green outline
- Both overlays toggleable via checkboxes in the control panel
- Chunks to fetch are sorted by distance from viewport center so nearest chunks render first when panning
- Raster layers insert below overlays for correct z-ordering
Raster chunks were inserted below overlay fills, making them invisible. Now inserted above fills but below grid lines so data is visible with grid outlines on top. Added console logging for viewport changes, chunk fetch lists, and per-chunk render results.
- Remove chunk-grid-labels symbol layer that caused font 404 from demotiles.maplibre.org, likely breaking MapLibre's rendering pipeline
- Remove unused glyphs config from map style
- Replace updatePending/updateQueued with loadGeneration counter so viewport changes immediately cancel in-flight chunk loads
- Cap chunks loaded per viewport update to 80 (nearest to center first) with yellow status message when capped
- Bump fetch concurrency from 3 to 4
Root cause: using beforeId to insert raster layers at specific positions was unreliable. Now raster layers are added on top, then grid lines and UTM outline are moved above them with map.moveLayer(). This guarantees rasters are visible with overlays on top. Also: auto-zoom to the first rendered chunk so the user can immediately see the data at an appropriate zoom level. Logs full layer order after each render for diagnostics.
Add precomputed RGBA uint8 preview to Zarr stores so the viewer can show a fast false-colour overview (~50-200KB per chunk) instead of fetching full 128-band embeddings (~130KB compressed) for every tile.
- zarr_zone.py: BandStats reservoir sampler, compute_rgb_chunk, write_rgb_pass (two-pass build), add_rgb_to_existing_store
- registry_cli.py: --rgb-only and --no-rgb flags for zarr-build, has_rgb field in chunk manifests
- viewer: RGB preview loading, click-to-inspect for full bands, split rendering pipeline with automatic fallback
Use arr.chunks[:2] instead of arr.metadata.chunk_grid.configuration, which doesn't exist on RegularChunkGrid in zarr-python v3.
Replace the sequential BandStats reservoir sampler with compute_stretch_from_store() which uses ThreadPoolExecutor to read chunks in parallel (zarr I/O releases the GIL). Each thread reads only the 3 RGB bands and returns a subsample of dequantised values; the main thread computes global p2/p98 percentiles. This also simplifies build_zone_stores: stretch is now computed from the written store rather than accumulated during tile writing.
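The core of the parallel stretch can be sketched as follows. All names here are hypothetical: `read_chunk` stands in for a zarr chunk read (which releases the GIL, so threads are enough), and the subsample size is an arbitrary choice.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def _sample_chunk(read_chunk, idx, bands=(0, 1, 2), n=10_000):
    # Worker: read one chunk, dequantise just the preview bands, subsample.
    emb, scales = read_chunk(idx)                    # int8 (h, w, 128), float32 (h, w)
    vals = emb[..., bands].astype(np.float32) * scales[..., None]
    vals = vals[np.isfinite(vals)]                   # NaN scales mark water pixels
    rng = np.random.default_rng(idx)
    return rng.choice(vals, size=min(n, vals.size), replace=False)

def compute_stretch(read_chunk, chunk_ids, workers=8):
    # Parent: pool subsamples and take global 2nd/98th percentiles.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        samples = list(pool.map(lambda i: _sample_chunk(read_chunk, i), chunk_ids))
    pooled = np.concatenate(samples)
    return np.percentile(pooled, [2, 98])
```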
- compute_rgb_chunk: replace NaN scales with 0 before multiplication to avoid RuntimeWarning in np.clip().astype(uint8)
- write_rgb_pass: use ThreadPoolExecutor to read, compute, and write RGB chunks in parallel (each chunk is an independent zarr file, so concurrent writes are safe)
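The NaN fix is essentially a one-liner with numpy (function name illustrative, not from the diff):

```python
import numpy as np

def to_uint8_rgb(rgb):
    # Replace NaN (water) with 0 first; clipping/casting NaN directly
    # emits "invalid value encountered" RuntimeWarnings.
    safe = np.nan_to_num(rgb, nan=0.0)
    return np.clip(safe, 0.0, 255.0).astype(np.uint8)
```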
All three I/O-heavy phases now use ThreadPoolExecutor:
1. Tile writing: each tile reads .npy + .tiff from disk and writes to a non-overlapping zarr region, which is safe for concurrent threads
2. RGB stretch computation (already parallel)
3. RGB chunk writing (already parallel)
Thread count is configurable via the --workers / -j flag on zarr-build (default: min(cpu_count, 16)) and is threaded through to all parallel phases.
The --rgb-only path was iterating all .zarr stores in the output directory, ignoring the --zones flag. Now parses zone number from store names (e.g. utm30_2025.zarr) and filters accordingly.
Move zstd decompression and band rendering off the main thread into a pool of WebWorkers (hardwareConcurrency, capped at 8). Replace the loadGeneration counter with AbortController so in-flight fetches are actually cancelled on viewport change. Add a fixed progress bar at the bottom of the viewport that tracks chunk loading progress.
Compute PCA across all 128 embedding bands and write a pca_rgb RGBA preview array to the zarr store. The 3 principal components map to RGB channels, giving a maximally informative false-colour visualization. Server side: compute_pca_basis() subsamples all bands in parallel, fits SVD, then write_pca_pass() projects and stretches each chunk. CLI: --pca-only adds PCA to existing stores, --no-pca skips it. Viewer: detects pca_rgb, adds RGB/PCA toggle button, routes preview chunk loads through previewSource state.
Query MultiIndex directly for year (avoids iterating all 2.7M tiles), pre-filter by UTM zone before file existence checks, add progress bars for file checking phase, and pass --workers through to gather_tile_infos.
Replace manual zarr v3 reader + fzstd decompression with zarrita.js which handles sharded stores natively. Workers now only render typed arrays (no decompression). Manifest scanner uses zarr library instead of filesystem traversal. Set zarr async concurrency to 128.
Zarr v3 sharding causes each incremental write to read-modify-write the entire shard file (~2GB when full), making per-tile writes progressively slower. Switch to grouping tiles by shard, assembling each shard in memory, and writing once — benchmarked at ~87ms/tile vs ~542ms/tile incremental. Also iterate RGB/PCA preview passes at SHARD_PX granularity instead of inner chunk size, and extract SHARD_PX/INNER_CHUNK_PX constants.
Revert sharding (shards/inner-chunks) back to simple 1024x1024 chunks and disable compression (compressors=None) since embeddings are high-entropy quantised int8 where zstd gives negligible benefit. Keeps the zarrita.js-based viewer from the previous commit.
Static STAC catalog generation from Zarr store directories using pystac. One collection per year, one item per store, with spatial extents computed from store metadata.
Scans a directory of Zarr stores and generates a spec-compliant static STAC catalog using pystac. Produces catalog -> collection (per year) -> item (per store) hierarchy with WGS84 bounding boxes reprojected from UTM store metadata.
Each store covers exactly one UTM zone, so derive longitude directly from the zone number rather than reprojecting pixel grid corners. Only northing needs reprojection for the latitude extent.
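Since UTM zones are fixed 6° slabs, the longitude extent is pure arithmetic (illustrative helper, ignoring the Norway/Svalbard zone exceptions):

```python
def zone_lon_extent(zone):
    # Zone 1 covers [-180, -174), zone 30 covers [-6, 0), and so on.
    west = (zone - 1) * 6 - 180
    return west, west + 6
```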
Add PYRAMID_LEVELS constant and build_preview_pyramid() function that creates iteratively coarsened copies of RGB/PCA preview arrays. Each level halves spatial dimensions via xarray coarsen with mean downsampling, storing per-level metadata (pixel_size, transform, shape).
Store-level wrapper that opens a Zarr store in r+ mode, checks which previews exist (rgb, pca_rgb), and calls build_preview_pyramid for each. Updates store attrs with has_{name}_pyramid and {name}_pyramid_levels. Follows the pattern of add_rgb_to_existing_store and add_pca_to_existing_store.
Add CLI support for generating multi-resolution pyramids during Zarr store builds (--pyramid) or as a post-processing step on existing stores (--pyramid-only). The --pyramid-only handler follows the same pattern as --rgb-only and --pca-only, with zone filtering support.
Progress bar now shows per-level dimensions (e.g. "level 3 (7500x7500px)") using a status field. Chunk writes within each pyramid level are parallelised via ThreadPoolExecutor when --workers/-j > 1, matching the existing parallelism pattern used for RGB/PCA preview generation.
The previous implementation loaded the entire preview array into memory and used xarray.coarsen().mean(), which promotes uint8 to float64 — for a 365680x21800x4 array that requires 238 GiB. Now processes in horizontal strips aligned to chunk boundaries:
- Read 2*chunk_h source rows from the zarr array (not the full array into memory)
- Coarsen 2x2 blocks using numpy reshape+mean in float32 (not float64)
- Write chunk_h output rows to the new pyramid level
- Each subsequent level reads from the previous level's zarr array
Memory usage per strip is ~2*1024*W*4 bytes uint8 plus the float32 intermediate, typically <500 MB regardless of total array size.
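The per-strip 2x2 coarsening step can be sketched with a plain numpy reshape (function name illustrative; strip height and width assumed even):

```python
import numpy as np

def coarsen_2x2(strip):
    # Mean-pool 2x2 blocks of a (h, w, c) uint8 strip, accumulating in
    # float32 rather than float64 to halve the intermediate's footprint.
    h, w, c = strip.shape
    blocks = strip.reshape(h // 2, 2, w // 2, 2, c).astype(np.float32)
    return blocks.mean(axis=(1, 3)).astype(np.uint8)
```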
Replace UTM-native preview pyramids with server-side Web Mercator reprojection using ndpyramid. Eliminates client-side per-pixel reprojection, fixes tile/basemap misalignment, and dramatically simplifies the tile handler.
9 tasks covering: ndpyramid dependency, Python pyramid generation, CLI updates, old code removal, TypeScript types/reader/tile handler simplification, tests, and visual verification.
Add three new functions to zarr_zone.py for Web Mercator pyramid support:
- _compute_mercator_zoom_range(): determines min/max zoom levels based on native pixel size and latitude
- _utm_array_to_xarray(): wraps a UTM zarr preview array as a georeferenced xarray DataArray with rioxarray CRS metadata, in (band, y, x) order
- build_mercator_pyramid(): orchestrates reprojection from UTM to EPSG:3857 via ndpyramid.pyramid_reproject, storing results as zoom-level-keyed arrays with 256x256 chunks (1 chunk = 1 XYZ map tile)
The existing UTM pyramid functions are preserved for now.
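The max-zoom side of the range computation follows from the standard Web Mercator resolution formula. This is a simplified sketch, not the real `_compute_mercator_zoom_range`, which may clamp or pad the range:

```python
import math

# Web Mercator ground resolution at zoom 0 with 256 px tiles (m/px at equator).
_Z0_RES = 2 * math.pi * 6378137 / 256  # ≈ 156543.03

def max_zoom(native_res_m, lat_deg):
    # Smallest zoom whose ground resolution at this latitude is at least as
    # fine as the native pixel size (10 m for these stores).
    z = 0
    while _Z0_RES * math.cos(math.radians(lat_deg)) / (2 ** z) > native_res_m:
        z += 1
    return z
```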
New function that replaces add_pyramids_to_existing_store for the --pyramid-only CLI path. It removes old UTM pyramid groups/attrs and builds Web Mercator pyramids via build_mercator_pyramid(). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update build_zone_stores() pyramid path and --pyramid-only handler to call build_mercator_pyramid / add_mercator_pyramids_to_existing_store instead of the old UTM pyramid functions. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete PYRAMID_LEVELS constant, _coarsen_strip(), build_preview_pyramid(), and add_pyramids_to_existing_store() — all superseded by the ndpyramid-based Web Mercator pyramid functions. Update module docstring to reflect new layout. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace old UTM pyramid verification with mercator pyramid checks: has_rgb_mercator attr, zoom_range, levels, shape, chunks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous code loaded the entire UTM array into memory with np.asarray(arr[:]), which OOMs on large zones (e.g. 254860x49313x4 = ~47 GB). Now uses dask.array.from_zarr() for lazy access — the data is only read when ndpyramid actually needs each chunk during reprojection. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d OOM
ndpyramid's pyramid_reproject loads the entire source array into memory for every zoom level via rioxarray's .reproject() (which calls .values). For large stores (~47GB source), this causes OOM kills. New approach: for each zoom level, coarsen the dask-backed source array to roughly match the target resolution (lazy via dask), compute the small result, then reproject with rasterio. Each level is written to zarr and freed before processing the next. Peak memory is now bounded by the coarsened source size (~12GB max at z=12 with 2x coarsen) rather than the full source (~47GB) multiplied by the number of levels.
Three improvements to the mercator pyramid builder:
1. Crop output to data extent: instead of a full-world grid (mostly zeros), compute which XYZ tiles overlap the UTM data and output only those. Store tile_offset_x/tile_offset_y per level so the client can map z/x/y requests to array positions. This eliminates GDAL warnings about failed transform points outside the UTM zone.
2. Cap per-level output at 4 GB: Mercator distortion at high latitudes (e.g. 84°N = 9.5x stretch) can make fine zoom levels enormous. Skip levels that would exceed 4 GB and log the cap.
3. Replace NaN with 0 before uint8 cast to suppress RuntimeWarning.
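The crop computation comes down to standard XYZ tiling math. This is an illustrative sketch working from a WGS84 bounding box; the actual builder works from the reprojected UTM bounds, and the function names are not from the diff:

```python
import math

def lonlat_to_tile(lon, lat, z):
    # Standard XYZ (slippy-map) tile indices containing a WGS84 point.
    n = 2 ** z
    x = int((lon + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    return x, y

def tile_range(west, south, east, north, z):
    # Inclusive (x0, y0, x1, y1) tile range covering a bbox: allocate only
    # these tiles and record (x0, y0) as the per-level offset.
    x0, y1 = lonlat_to_tile(west, south, z)
    x1, y0 = lonlat_to_tile(east, north, z)
    return x0, y0, x1, y1
```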
The shape parameter to rioxarray's reproject() should be (height, width) for the spatial dimensions only. Including num_bands as the first element caused the output shape to be misinterpreted, silently failing the reprojection for every strip (caught by blanket except). Also added debug logging for failed reprojections. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the manual coarsening and reprojection in _write_global_store() with a two-phase approach:
- Phase 1 reprojects UTM zones into a temporary level-0 zarr using windowed rasterio.warp.reproject() for bounded memory
- Phase 2 uses topozarr.create_pyramid() to build coarsened pyramid levels with proper zarr-conventions/multiscales metadata
Also removes PCA embedding computation, mercator pyramid generation, and their CLI flags (--pca, --pca-only, --pyramid, --pyramid-only) to focus purely on RGB preview via the global-preview command.
Using src_nodata=0 caused rasterio to treat valid black pixels as nodata, triggering CPLE_AppDefined warnings about value substitution. Initialize destination with NaN and omit nodata params so all source pixel values are preserved faithfully. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Still a WIP while I experiment with a few conversions and build a visualiser.