WIP: Add zone-wide Zarr format for consolidated UTM zone stores#194

Draft
avsm wants to merge 74 commits into ucam-eo:main from avsm:zarr2

Conversation


avsm commented Feb 22, 2026

Introduces geotessera/zarr_zone.py with support for building Zarr v3 stores that consolidate all tiles within a UTM zone into a single store per year (e.g. utm30_2024.zarr). Each store contains int8 embeddings and float32 scales arrays in native UTM coordinates with 10m pixels, chunked at (1024, 1024, 128) and (1024, 1024) respectively. Water pixels are marked as NaN in scales via landmask TIFFs.

Exposed only via geotessera-registry zarr-build for building from local tile data, with --zones flag to test a single zone at a time.

Still a WIP while I experiment with a few conversions and build a visualiser.

avsm and others added 30 commits February 22, 2026 15:40
Introduces geotessera/zarr_zone.py with support for building Zarr v3
stores that consolidate all tiles within a UTM zone into a single store
per year (e.g. utm30_2024.zarr). Each store contains int8 embeddings
and float32 scales arrays in native UTM coordinates with 10m pixels,
chunked at (1024, 1024, 128) and (1024, 1024) respectively. Water
pixels are marked as NaN in scales via landmask TIFFs.

Exposed only via `geotessera-registry zarr-build` for building from
local tile data, with --zones flag to test a single zone at a time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace text callbacks with Rich Progress bars showing:
- Tile scanning: bar with count and current tile coordinates
- Per-zone writing: bar with count, elapsed time, current tile
- Dry-run: table of zone/EPSG/tile-count
- Zone header: grid dimensions in pixels and km

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use ProcessPoolExecutor to read landmask CRS/transform in parallel
during the scan phase. The rasterio Affine transform is serialised
as a plain tuple to cross the process boundary. Defaults to
min(cpu_count, 8) workers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add compute_tile_grid() which derives EPSG, transform, and pixel
dimensions from a tile's lon/lat using pyproj, without opening any
files. During scanning, each tile's computed grid is compared against
the actual landmask TIFF and mismatches are reported as warnings.

This validates whether the landmask-free approach can replace per-tile
I/O in future.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
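The zone/hemisphere half of that derivation needs no pyproj at all. A pyproj-free sketch (the helper name and the clamping are mine, and the Norway/Svalbard zone exceptions are ignored):

```python
import math

def utm_epsg_for(lon: float, lat: float) -> int:
    """EPSG code of the UTM zone containing (lon, lat).

    Covers only the zone/hemisphere logic; the real compute_tile_grid()
    additionally derives the affine transform and pixel dimensions.
    """
    zone = int(math.floor((lon + 180.0) / 6.0)) + 1
    zone = min(max(zone, 1), 60)  # clamp lon = 180 into zone 60
    return (32600 if lat >= 0 else 32700) + zone

print(utm_epsg_for(-1.5, 52.0))  # southern England, zone 30N → 32630
```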
zarr 3.x create_array() expects `compressors=ZstdCodec(level=3)`
not `codecs=[BytesCodec(), ZstdCodec(level=3)]`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The landmask TIFFs use the raw projected upper-left corner as origin
(no snap to 10m pixel grid) and round() for dimensions. Our function
was using floor/ceil snapping which caused off-by-one pixel mismatches
at high latitudes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
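The off-by-one is easy to reproduce with invented numbers — an extent that is not an exact multiple of the 10 m pixel size gains a column under floor/ceil snapping but not under round():

```python
import math

# Hypothetical UTM easting range in metres, 3 m shy of a pixel boundary.
west, east = 500000.0, 600003.0
pixel = 10.0

# floor/ceil snapping widens the grid to cover the snapped extent...
snapped_w = math.ceil(east / pixel) * pixel - math.floor(west / pixel) * pixel
cols_snapped = int(snapped_w / pixel)

# ...whereas the landmask TIFFs keep the raw origin and round() the size.
cols_rounded = round((east - west) / pixel)

print(cols_snapped, cols_rounded)  # one pixel apart
```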
Zarr v3 uses native dimension_names metadata (not _ARRAY_DIMENSIONS)
to tell xarray which arrays are dimension coordinates. Adding these
makes xr.open_zarr() automatically assign northing/easting/band as
indexed dimension coordinates on embeddings and scales.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
During tile scanning, check whether a tile's west and east edges
(lon ± 0.05°) fall in different UTM zones. These tiles are placed
in whichever zone the landmask EPSG specifies, but the warning
surfaces them for review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Standalone HTML viewer that streams Zarr v3 UTM zone stores and renders
false-colour overlays on a dark CartoDB basemap using MapLibre GL JS.
Client-side UTM-to-WGS84 reprojection via proj4, zstd decompression
via fzstd, and per-chunk contrast-stretched RGB from any 3 of 128
embedding bands selected by the user.

The `serve` CLI command starts a CORS-enabled HTTP server, auto-detects
.zarr stores in the target directory, and opens the viewer in the browser.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The server writes a _stores.json manifest listing available .zarr
directories. The viewer fetches this on load, populates a dropdown,
and auto-loads the first store. Manual URL entry remains available
as a collapsible fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All fetches (_stores.json, zarr.json, chunks) now use relative paths
derived from window.location.pathname rather than absolute URLs built
from window.location.origin. This ensures everything is same-origin
regardless of which hostname the server is accessed from.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Server scans each store's embeddings/c/ directory on startup and writes
_chunk_manifest.json listing existing chunk indices. Viewer loads the
manifest and skips requests for non-existent chunks. Also stops
suppressing 404 log messages on the server.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ter fetching

- Chunk grid overlay shows data vs nodata positions with cyan/gray styling
- Chunk index labels appear at higher zoom levels
- UTM zone boundary shown as dashed green outline
- Both overlays toggleable via checkboxes in the control panel
- Chunks to fetch are sorted by distance from viewport center so nearest
  chunks render first when panning
- Raster layers insert below overlays for correct z-ordering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Raster chunks were inserted below overlay fills, making them invisible.
Now inserted above fills but below grid lines so data is visible with
grid outlines on top. Added console logging for viewport changes,
chunk fetch lists, and per-chunk render results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove chunk-grid-labels symbol layer that caused font 404 from
  demotiles.maplibre.org, likely breaking MapLibre's rendering pipeline
- Remove unused glyphs config from map style
- Replace updatePending/updateQueued with loadGeneration counter so
  viewport changes immediately cancel in-flight chunk loads
- Cap chunks loaded per viewport update to 80 (nearest to center first)
  with yellow status message when capped
- Bump fetch concurrency from 3 to 4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: using beforeId to insert raster layers at specific positions
was unreliable. Now raster layers are added on top, then grid lines and
UTM outline are moved above them with map.moveLayer(). This guarantees
rasters are visible with overlays on top.

Also: auto-zoom to the first rendered chunk so the user can immediately
see the data at an appropriate zoom level. Logs full layer order after
each render for diagnostics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add precomputed RGBA uint8 preview to Zarr stores so the viewer can
show a fast false-colour overview (~50-200KB per chunk) instead of
fetching full 128-band embeddings (~130KB compressed) for every tile.

- zarr_zone.py: BandStats reservoir sampler, compute_rgb_chunk,
  write_rgb_pass (two-pass build), add_rgb_to_existing_store
- registry_cli.py: --rgb-only and --no-rgb flags for zarr-build,
  has_rgb field in chunk manifests
- viewer: RGB preview loading, click-to-inspect for full bands,
  split rendering pipeline with automatic fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use arr.chunks[:2] instead of arr.metadata.chunk_grid.configuration
which doesn't exist on RegularChunkGrid in zarr-python v3.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the sequential BandStats reservoir sampler with
compute_stretch_from_store() which uses ThreadPoolExecutor to read
chunks in parallel (zarr I/O releases the GIL). Each thread reads
only the 3 RGB bands and returns a subsample of dequantised values;
the main thread computes global p2/p98 percentiles.

This also simplifies build_zone_stores: stretch is now computed from
the written store rather than accumulated during tile writing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
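The sample-in-threads, reduce-on-main-thread pattern can be sketched with stdlib + numpy; random arrays stand in for the zarr chunk reads, and the strided subsampling is my simplification of whatever compute_stretch_from_store() actually does:

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np

# Stand-ins for dequantised 3-band chunk reads from the zarr store.
rng = np.random.default_rng(0)
chunks = [rng.normal(size=(256, 256, 3)).astype(np.float32) for _ in range(8)]

def sample_chunk(chunk: np.ndarray, n: int = 4096) -> np.ndarray:
    """Per-thread work: return a strided flat subsample of one chunk."""
    flat = chunk.reshape(-1)
    step = max(1, flat.size // n)
    return flat[::step]

with ThreadPoolExecutor(max_workers=4) as pool:
    samples = list(pool.map(sample_chunk, chunks))

# Global percentiles computed once on the main thread.
p2, p98 = np.percentile(np.concatenate(samples), [2, 98])
print(f"stretch range: [{p2:.3f}, {p98:.3f}]")
```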
- compute_rgb_chunk: replace NaN scales with 0 before multiplication
  to avoid RuntimeWarning in np.clip().astype(uint8)
- write_rgb_pass: use ThreadPoolExecutor to read, compute, and write
  RGB chunks in parallel (each chunk is an independent zarr file,
  so concurrent writes are safe)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
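The NaN-scrubbing fix in compute_rgb_chunk boils down to this pattern (toy values and the 0–255 clip range are illustrative, not the actual dequantisation pipeline):

```python
import numpy as np

emb = np.array([[100, -50], [20, 7]], dtype=np.int8)
scales = np.array([[0.5, np.nan], [1.0, np.nan]], dtype=np.float32)  # NaN = water

# Replace NaN with 0 *before* multiplying, so clip/astype never sees NaN
# and np.clip(...).astype(np.uint8) raises no RuntimeWarning.
vals = emb.astype(np.float32) * np.nan_to_num(scales, nan=0.0)
rgb = np.clip(vals, 0, 255).astype(np.uint8)
print(rgb)
```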
All three I/O-heavy phases now use ThreadPoolExecutor:
1. Tile writing: each tile reads .npy + .tiff from disk and writes to
   a non-overlapping zarr region — safe for concurrent threads
2. RGB stretch computation (already parallel)
3. RGB chunk writing (already parallel)

Thread count is configurable via --workers / -j flag on zarr-build
(default: min(cpu_count, 16)). The count is threaded through to all
parallel phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The --rgb-only path was iterating all .zarr stores in the output
directory, ignoring the --zones flag. Now parses zone number from
store names (e.g. utm30_2025.zarr) and filters accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move zstd decompression and band rendering off the main thread into
a pool of WebWorkers (hardwareConcurrency, capped at 8). Replace the
loadGeneration counter with AbortController so in-flight fetches are
actually cancelled on viewport change. Add a fixed progress bar at the
bottom of the viewport that tracks chunk loading progress.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Compute PCA across all 128 embedding bands and write a pca_rgb RGBA
preview array to the zarr store. The 3 principal components map to
RGB channels, giving a false-colour visualisation that captures the
most variance across bands.

Server side: compute_pca_basis() subsamples all bands in parallel,
fits SVD, then write_pca_pass() projects and stretches each chunk.
CLI: --pca-only adds PCA to existing stores, --no-pca skips it.
Viewer: detects pca_rgb, adds RGB/PCA toggle button, routes preview
chunk loads through previewSource state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Query MultiIndex directly for year (avoids iterating all 2.7M tiles),
pre-filter by UTM zone before file existence checks, add progress bars
for file checking phase, and pass --workers through to gather_tile_infos.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace manual zarr v3 reader + fzstd decompression with zarrita.js
which handles sharded stores natively. Workers now only render typed
arrays (no decompression). Manifest scanner uses zarr library instead
of filesystem traversal. Set zarr async concurrency to 128.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zarr v3 sharding causes each incremental write to read-modify-write the
entire shard file (~2GB when full), making per-tile writes progressively
slower. Switch to grouping tiles by shard, assembling each shard in memory,
and writing once — benchmarked at ~87ms/tile vs ~542ms/tile incremental.

Also iterate RGB/PCA preview passes at SHARD_PX granularity instead of
inner chunk size, and extract SHARD_PX/INNER_CHUNK_PX constants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert sharding (shards/inner-chunks) back to simple 1024x1024 chunks
and disable compression (compressors=None) since embeddings are
high-entropy quantised int8 where zstd gives negligible benefit.
Keeps the zarrita.js-based viewer from the previous commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
avsm and others added 30 commits February 25, 2026 14:55
Static STAC catalog generation from Zarr store directories using
pystac. One collection per year, one item per store, with spatial
extents computed from store metadata.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Scans a directory of Zarr stores and generates a spec-compliant static
STAC catalog using pystac. Produces catalog -> collection (per year) ->
item (per store) hierarchy with WGS84 bounding boxes reprojected from
UTM store metadata.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Each store covers exactly one UTM zone, so derive longitude directly
from the zone number rather than reprojecting pixel grid corners.
Only northing needs reprojection for the latitude extent.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
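That longitude derivation is pure arithmetic: each UTM zone spans 6° of longitude, with zone 1 starting at 180°W. A sketch (function name is mine):

```python
def zone_lon_bounds(zone: int) -> tuple[float, float]:
    """West/east longitude extent of a UTM zone."""
    west = (zone - 1) * 6.0 - 180.0
    return west, west + 6.0

print(zone_lon_bounds(30))  # → (-6.0, 0.0), i.e. a utm30 store spans 6°W–0°
```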
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add PYRAMID_LEVELS constant and build_preview_pyramid() function that
creates iteratively coarsened copies of RGB/PCA preview arrays. Each
level halves spatial dimensions via xarray coarsen with mean
downsampling, storing per-level metadata (pixel_size, transform, shape).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Store-level wrapper that opens a Zarr store in r+ mode, checks which
previews exist (rgb, pca_rgb), and calls build_preview_pyramid for
each. Updates store attrs with has_{name}_pyramid and
{name}_pyramid_levels. Follows the pattern of add_rgb_to_existing_store
and add_pca_to_existing_store.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add CLI support for generating multi-resolution pyramids during Zarr
store builds (--pyramid) or as a post-processing step on existing
stores (--pyramid-only). The --pyramid-only handler follows the same
pattern as --rgb-only and --pca-only, with zone filtering support.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Progress bar now shows per-level dimensions (e.g. "level 3 (7500x7500px)")
using a status field. Chunk writes within each pyramid level are
parallelised via ThreadPoolExecutor when --workers/-j > 1, matching the
existing parallelism pattern used for RGB/PCA preview generation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous implementation loaded the entire preview array into memory
and used xarray.coarsen().mean() which promotes uint8 to float64 —
for a 365680x21800x4 array that requires 238 GiB.

Now processes in horizontal strips aligned to chunk boundaries:
- Read 2*chunk_h source rows from the zarr array (not into full memory)
- Coarsen 2x2 blocks using numpy reshape+mean in float32 (not float64)
- Write chunk_h output rows to the new pyramid level
- Each subsequent level reads from the previous level's zarr array

Memory usage per strip is ~2*1024*W*4 bytes uint8 + float32 intermediate,
typically <500 MB regardless of total array size.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
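The reshape+mean trick at the heart of the strip coarsening looks like this on a tiny array (the helper is my sketch of one strip's worth of work, not the actual _coarsen code):

```python
import numpy as np

def coarsen2x2(strip: np.ndarray) -> np.ndarray:
    """Halve both spatial dims of an (H, W, C) uint8 strip by averaging
    2x2 blocks in float32 (never float64), then cast back to uint8."""
    h, w, c = strip.shape
    blocks = strip.reshape(h // 2, 2, w // 2, 2, c).astype(np.float32)
    return blocks.mean(axis=(1, 3)).astype(np.uint8)

strip = np.arange(4 * 4 * 1, dtype=np.uint8).reshape(4, 4, 1)
out = coarsen2x2(strip)
print(out[..., 0])
```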
Replace UTM-native preview pyramids with server-side Web Mercator
reprojection using ndpyramid. Eliminates client-side per-pixel
reprojection, fixes tile/basemap misalignment, and dramatically
simplifies the tile handler.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
9 tasks covering: ndpyramid dependency, Python pyramid generation,
CLI updates, old code removal, TypeScript types/reader/tile handler
simplification, tests, and visual verification.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add three new functions to zarr_zone.py for Web Mercator pyramid support:

- _compute_mercator_zoom_range(): determines min/max zoom levels based on
  native pixel size and latitude
- _utm_array_to_xarray(): wraps a UTM zarr preview array as a georeferenced
  xarray DataArray with rioxarray CRS metadata, in (band, y, x) order
- build_mercator_pyramid(): orchestrates reprojection from UTM to EPSG:3857
  via ndpyramid.pyramid_reproject, storing results as zoom-level-keyed
  arrays with 256x256 chunks (1 chunk = 1 XYZ map tile)

The existing UTM pyramid functions are preserved for now.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
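The zoom-range computation follows from standard Web Mercator geometry: resolution at zoom z is 2πR/(256·2^z) metres per pixel at the equator, and Mercator stretches ground distance by 1/cos(lat). A sketch of the max-zoom half (function name and rounding policy are my assumptions about _compute_mercator_zoom_range):

```python
import math

EQ_RES = 2 * math.pi * 6378137 / 256  # Web Mercator m/px at zoom 0

def max_zoom_for(pixel_m: float, lat_deg: float) -> int:
    """Finest zoom whose Mercator resolution matches a native ground pixel
    at the given latitude."""
    merc_px = pixel_m / math.cos(math.radians(lat_deg))  # stretched pixel
    return max(0, math.ceil(math.log2(EQ_RES / merc_px)))

# 10 m pixels need fewer Mercator levels at high latitude:
print(max_zoom_for(10.0, 52.0), max_zoom_for(10.0, 84.0))
```

This also shows why the later 4 GB cap matters: at 84°N the 9.5x stretch pushes the same ground area across far more tile pixels per fine zoom level.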
New function that replaces add_pyramids_to_existing_store for the
--pyramid-only CLI path. It removes old UTM pyramid groups/attrs
and builds Web Mercator pyramids via build_mercator_pyramid().

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update build_zone_stores() pyramid path and --pyramid-only handler
to call build_mercator_pyramid / add_mercator_pyramids_to_existing_store
instead of the old UTM pyramid functions.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Delete PYRAMID_LEVELS constant, _coarsen_strip(), build_preview_pyramid(),
and add_pyramids_to_existing_store() — all superseded by the ndpyramid-based
Web Mercator pyramid functions. Update module docstring to reflect new layout.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace old UTM pyramid verification with mercator pyramid checks:
has_rgb_mercator attr, zoom_range, levels, shape, chunks.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The previous code loaded the entire UTM array into memory with
np.asarray(arr[:]), which OOMs on large zones (e.g. 254860x49313x4
= ~47 GB). Now uses dask.array.from_zarr() for lazy access — the
data is only read when ndpyramid actually needs each chunk during
reprojection.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…d OOM

ndpyramid's pyramid_reproject loads the entire source array into memory
for every zoom level via rioxarray's .reproject() (which calls .values).
For large stores (~47GB source), this causes OOM kills.

New approach: for each zoom level, coarsen the dask-backed source array
to roughly match the target resolution (lazy via dask), compute the
small result, then reproject with rasterio. Each level is written to
zarr and freed before processing the next. Peak memory is now bounded
by the coarsened source size (~12GB max at z=12 with 2x coarsen) rather
than the full source (~47GB) multiplied by number of levels.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Three improvements to the mercator pyramid builder:

1. Crop output to data extent: instead of a full-world grid (mostly
   zeros), compute which XYZ tiles overlap the UTM data and output only
   those. Store tile_offset_x/tile_offset_y per level so the client can
   map z/x/y requests to array positions. This eliminates GDAL warnings
   about failed transform points outside the UTM zone.

2. Cap per-level output at 4 GB: Mercator distortion at high latitudes
   (e.g. 84°N = 9.5x stretch) can make fine zoom levels enormous. Skip
   levels that would exceed 4 GB and log the cap.

3. Replace NaN with 0 before uint8 cast to suppress RuntimeWarning.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The shape parameter to rioxarray's reproject() should be (height, width)
for the spatial dimensions only. Including num_bands as the first element
caused the output shape to be misinterpreted, silently failing the
reprojection for every strip (caught by blanket except). Also added
debug logging for failed reprojections.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the manual coarsening and reprojection in _write_global_store()
with a two-phase approach: Phase 1 reprojects UTM zones into a temporary
level-0 zarr using windowed rasterio.warp.reproject() for bounded memory;
Phase 2 uses topozarr.create_pyramid() to build coarsened pyramid levels
with proper zarr-conventions/multiscales metadata.

Also removes PCA embedding computation, mercator pyramid generation, and
their CLI flags (--pca, --pca-only, --pyramid, --pyramid-only) to focus
purely on RGB preview via the global-preview command.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Using src_nodata=0 caused rasterio to treat valid black pixels as
nodata, triggering CPLE_AppDefined warnings about value substitution.
Initialize destination with NaN and omit nodata params so all source
pixel values are preserved faithfully.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>