WIP: Add zone-wide Zarr format for consolidated UTM zone stores #194

Draft
avsm wants to merge 43 commits into ucam-eo:main from avsm:zarr2

Conversation


avsm commented Feb 22, 2026

Introduces geotessera/zarr_zone.py with support for building Zarr v3 stores that consolidate all tiles within a UTM zone into a single store per year (e.g. utm30_2024.zarr). Each store contains int8 embeddings and float32 scales arrays in native UTM coordinates with 10m pixels, chunked at (1024, 1024, 128) and (1024, 1024) respectively. Water pixels are marked as NaN in scales via landmask TIFFs.

Exposed only via `geotessera-registry zarr-build` for building from local tile data, with a `--zones` flag to test a single zone at a time.

Still a WIP while I experiment with a few conversions and build a visualiser.
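The int8-plus-scales layout implies a simple dequantisation step when reading the store. A minimal sketch (values made up for illustration; the real arrays are per-zone and 128-band):

```python
import numpy as np

# One band, 2x2 pixels: int8 embeddings with a float32 per-pixel scale.
# A NaN scale marks a water pixel, as written from the landmask TIFFs.
embeddings = np.array([[12, -7], [3, 0]], dtype=np.int8)
scales = np.array([[0.01, 0.02], [np.nan, 0.01]], dtype=np.float32)

# Dequantise: float value = int8 * scale; NaN scales propagate to NaN
dequantised = embeddings.astype(np.float32) * scales
water = np.isnan(dequantised)
```

The NaN propagation means downstream code can mask water with a single `isnan` check rather than carrying the landmask separately.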

avsm and others added 30 commits February 22, 2026 15:40
Introduces geotessera/zarr_zone.py with support for building Zarr v3
stores that consolidate all tiles within a UTM zone into a single store
per year (e.g. utm30_2024.zarr). Each store contains int8 embeddings
and float32 scales arrays in native UTM coordinates with 10m pixels,
chunked at (1024, 1024, 128) and (1024, 1024) respectively. Water
pixels are marked as NaN in scales via landmask TIFFs.

Exposed only via `geotessera-registry zarr-build` for building from
local tile data, with --zones flag to test a single zone at a time.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace text callbacks with Rich Progress bars showing:
- Tile scanning: bar with count and current tile coordinates
- Per-zone writing: bar with count, elapsed time, current tile
- Dry-run: table of zone/EPSG/tile-count
- Zone header: grid dimensions in pixels and km

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use ProcessPoolExecutor to read landmask CRS/transform in parallel
during the scan phase. The rasterio Affine transform is serialised
as a plain tuple to cross the process boundary. Defaults to
min(cpu_count, 8) workers.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
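The tuple serialisation mentioned above can be sketched without rasterio: an Affine transform is just six coefficients, so it crosses the process boundary as a plain tuple and can be applied (or rebuilt) on the other side. Names here are hypothetical:

```python
def affine_to_tuple(transform):
    # rasterio's Affine is iterable over its coefficients; the first six
    # (a, b, c, d, e, f) fully describe the 2-D transform
    return tuple(transform)[:6]

def apply_affine(t, col, row):
    # Map a pixel (col, row) to projected coordinates (x, y)
    a, b, c, d, e, f = t
    return (a * col + b * row + c, d * col + e * row + f)

# 10 m pixels, origin at (500000, 6000000): a typical UTM landmask transform
t = (10.0, 0.0, 500000.0, 0.0, -10.0, 6000000.0)
x, y = apply_affine(t, 0, 0)  # upper-left corner
```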
Add compute_tile_grid() which derives EPSG, transform, and pixel
dimensions from a tile's lon/lat using pyproj, without opening any
files. During scanning, each tile's computed grid is compared against
the actual landmask TIFF and mismatches are reported as warnings.

This validates whether the landmask-free approach can replace per-tile
I/O in future.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
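The EPSG part of this derivation is pure arithmetic and needs no pyproj at all. A sketch of the standard UTM zone/EPSG mapping (the transform and pixel dimensions would still come from pyproj in the real code):

```python
def utm_epsg_for(lon: float, lat: float) -> int:
    # Standard 6-degree UTM zones; EPSG 326xx for north, 327xx for south
    zone = int((lon + 180) // 6) + 1
    return (32600 if lat >= 0 else 32700) + zone
```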
zarr 3.x create_array() expects `compressors=ZstdCodec(level=3)`
not `codecs=[BytesCodec(), ZstdCodec(level=3)]`.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The landmask TIFFs use the raw projected upper-left corner as origin
(no snap to 10m pixel grid) and round() for dimensions. Our function
was using floor/ceil snapping which caused off-by-one pixel mismatches
at high latitudes.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zarr v3 uses native dimension_names metadata (not _ARRAY_DIMENSIONS)
to tell xarray which arrays are dimension coordinates. Adding these
makes xr.open_zarr() automatically assign northing/easting/band as
indexed dimension coordinates on embeddings and scales.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
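In Zarr v3, `dimension_names` is an optional field of each array's `zarr.json` metadata document. For the 3-D embeddings array the relevant fragment would look roughly like this (other required metadata fields omitted):

```json
{
  "zarr_format": 3,
  "node_type": "array",
  "dimension_names": ["northing", "easting", "band"]
}
```

With matching 1-D `northing`/`easting`/`band` arrays in the store, xarray picks these up as indexed coordinates without any `_ARRAY_DIMENSIONS` attribute.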
During tile scanning, check whether a tile's west and east edges
(lon ± 0.05°) fall in different UTM zones. These tiles are placed
in whichever zone the landmask EPSG specifies, but the warning
surfaces them for review.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
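The edge check described above reduces to comparing the UTM zone of the tile's west and east margins. A dependency-free sketch:

```python
def utm_zone(lon: float) -> int:
    # Standard 6-degree UTM zone number (1..60)
    return int((lon + 180) // 6) + 1

def crosses_zone_boundary(lon: float, margin: float = 0.05) -> bool:
    # True if the tile's west and east edges fall in different UTM zones
    return utm_zone(lon - margin) != utm_zone(lon + margin)
```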
Standalone HTML viewer that streams Zarr v3 UTM zone stores and renders
false-colour overlays on a dark CartoDB basemap using MapLibre GL JS.
Client-side UTM-to-WGS84 reprojection via proj4, zstd decompression
via fzstd, and per-chunk contrast-stretched RGB from any 3 of 128
embedding bands selected by the user.

The `serve` CLI command starts a CORS-enabled HTTP server, auto-detects
.zarr stores in the target directory, and opens the viewer in the browser.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The server writes a _stores.json manifest listing available .zarr
directories. The viewer fetches this on load, populates a dropdown,
and auto-loads the first store. Manual URL entry remains available
as a collapsible fallback.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All fetches (_stores.json, zarr.json, chunks) now use relative paths
derived from window.location.pathname rather than absolute URLs built
from window.location.origin. This ensures everything is same-origin
regardless of which hostname the server is accessed from.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Server scans each store's embeddings/c/ directory on startup and writes
_chunk_manifest.json listing existing chunk indices. Viewer loads the
manifest and skips requests for non-existent chunks. Also stops
suppressing 404 log messages on the server.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ter fetching

- Chunk grid overlay shows data vs nodata positions with cyan/gray styling
- Chunk index labels appear at higher zoom levels
- UTM zone boundary shown as dashed green outline
- Both overlays toggleable via checkboxes in the control panel
- Chunks to fetch are sorted by distance from viewport center so nearest
  chunks render first when panning
- Raster layers insert below overlays for correct z-ordering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Raster chunks were inserted below overlay fills, making them invisible.
Now inserted above fills but below grid lines so data is visible with
grid outlines on top. Added console logging for viewport changes,
chunk fetch lists, and per-chunk render results.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove chunk-grid-labels symbol layer that caused font 404 from
  demotiles.maplibre.org, likely breaking MapLibre's rendering pipeline
- Remove unused glyphs config from map style
- Replace updatePending/updateQueued with loadGeneration counter so
  viewport changes immediately cancel in-flight chunk loads
- Cap chunks loaded per viewport update to 80 (nearest to center first)
  with yellow status message when capped
- Bump fetch concurrency from 3 to 4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: using beforeId to insert raster layers at specific positions
was unreliable. Now raster layers are added on top, then grid lines and
UTM outline are moved above them with map.moveLayer(). This guarantees
rasters are visible with overlays on top.

Also: auto-zoom to the first rendered chunk so the user can immediately
see the data at an appropriate zoom level. Logs full layer order after
each render for diagnostics.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add precomputed RGBA uint8 preview to Zarr stores so the viewer can
show a fast false-colour overview (~50-200KB per chunk) instead of
fetching full 128-band embeddings (~130KB compressed) for every tile.

- zarr_zone.py: BandStats reservoir sampler, compute_rgb_chunk,
  write_rgb_pass (two-pass build), add_rgb_to_existing_store
- registry_cli.py: --rgb-only and --no-rgb flags for zarr-build,
  has_rgb field in chunk manifests
- viewer: RGB preview loading, click-to-inspect for full bands,
  split rendering pipeline with automatic fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use arr.chunks[:2] instead of arr.metadata.chunk_grid.configuration
which doesn't exist on RegularChunkGrid in zarr-python v3.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the sequential BandStats reservoir sampler with
compute_stretch_from_store() which uses ThreadPoolExecutor to read
chunks in parallel (zarr I/O releases the GIL). Each thread reads
only the 3 RGB bands and returns a subsample of dequantised values;
the main thread computes global p2/p98 percentiles.

This also simplifies build_zone_stores: stretch is now computed from
the written store rather than accumulated during tile writing.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
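The shape of this parallel-sample/serial-percentile split can be sketched with stand-in data (random arrays replace the zarr chunk reads; the real workers read only the 3 RGB bands per chunk):

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Stand-in for dequantised RGB-band chunk reads
rng = np.random.default_rng(0)
chunks = [rng.normal(size=(256, 256, 3)) for _ in range(8)]

def sample_chunk(chunk, stride=64):
    # Each worker returns a strided subsample of its chunk's RGB values
    return chunk.reshape(-1, 3)[::stride]

with ThreadPoolExecutor(max_workers=4) as ex:
    samples = np.concatenate(list(ex.map(sample_chunk, chunks)))

# Main thread computes the global contrast-stretch percentiles
p2, p98 = np.percentile(samples, [2, 98], axis=0)
```

Threads work here because the heavy operations (zarr I/O in the real code, numpy slicing in the sketch) release the GIL.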
- compute_rgb_chunk: replace NaN scales with 0 before multiplication
  to avoid RuntimeWarning in np.clip().astype(uint8)
- write_rgb_pass: use ThreadPoolExecutor to read, compute, and write
  RGB chunks in parallel (each chunk is an independent zarr file,
  so concurrent writes are safe)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All three I/O-heavy phases now use ThreadPoolExecutor:
1. Tile writing: each tile reads .npy + .tiff from disk and writes to
   a non-overlapping zarr region — safe for concurrent threads
2. RGB stretch computation (already parallel)
3. RGB chunk writing (already parallel)

Thread count is configurable via --workers / -j flag on zarr-build
(default: min(cpu_count, 16)). The count is threaded through to all
parallel phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The --rgb-only path was iterating all .zarr stores in the output
directory, ignoring the --zones flag. Now parses zone number from
store names (e.g. utm30_2025.zarr) and filters accordingly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move zstd decompression and band rendering off the main thread into
a pool of WebWorkers (hardwareConcurrency, capped at 8). Replace the
loadGeneration counter with AbortController so in-flight fetches are
actually cancelled on viewport change. Add a fixed progress bar at the
bottom of the viewport that tracks chunk loading progress.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Compute PCA across all 128 embedding bands and write a pca_rgb RGBA
preview array to the zarr store. The 3 principal components map to
RGB channels, giving a maximally informative false-colour visualization.

Server side: compute_pca_basis() subsamples all bands in parallel,
fits SVD, then write_pca_pass() projects and stretches each chunk.
CLI: --pca-only adds PCA to existing stores, --no-pca skips it.
Viewer: detects pca_rgb, adds RGB/PCA toggle button, routes preview
chunk loads through previewSource state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Query MultiIndex directly for year (avoids iterating all 2.7M tiles),
pre-filter by UTM zone before file existence checks, add progress bars
for file checking phase, and pass --workers through to gather_tile_infos.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
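The MultiIndex query pattern can be illustrated with a toy registry (column names and index levels here are hypothetical, not the real registry schema):

```python
import pandas as pd

# Miniature registry with a (year, lon, lat) MultiIndex
idx = pd.MultiIndex.from_tuples(
    [(2024, -3.0, 52.0), (2024, 0.0, 51.0), (2025, -3.0, 52.0)],
    names=["year", "lon", "lat"],
)
registry = pd.DataFrame({"path": ["a", "b", "c"]}, index=idx)

# Select one year via the index instead of iterating every row
tiles_2024 = registry.xs(2024, level="year")
```

`DataFrame.xs` uses the index directly, which is what makes this cheap even at millions of rows.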
Replace manual zarr v3 reader + fzstd decompression with zarrita.js
which handles sharded stores natively. Workers now only render typed
arrays (no decompression). Manifest scanner uses zarr library instead
of filesystem traversal. Set zarr async concurrency to 128.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zarr v3 sharding causes each incremental write to read-modify-write the
entire shard file (~2GB when full), making per-tile writes progressively
slower. Switch to grouping tiles by shard, assembling each shard in memory,
and writing once — benchmarked at ~87ms/tile vs ~542ms/tile incremental.

Also iterate RGB/PCA preview passes at SHARD_PX granularity instead of
inner chunk size, and extract SHARD_PX/INNER_CHUNK_PX constants.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert sharding (shards/inner-chunks) back to simple 1024x1024 chunks
and disable compression (compressors=None) since embeddings are
high-entropy quantised int8 where zstd gives negligible benefit.
Keeps the zarrita.js-based viewer from the previous commit.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
avsm and others added 13 commits February 24, 2026 07:02
Adds optional zarrs (Rust) backend for zarr I/O. When the zarrs
package is installed, the Rust codec pipeline is activated automatically;
otherwise it falls back to zarr-python. An info log records the choice
either way.

Install with: uv pip install geotessera[fast]

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zarrs has issues with partial writes to non-existent chunks
(read-modify-write on sparse stores), so restrict the Rust codec
pipeline to read-only operations and use zarr-python for writes.
Logs pipeline switches only on transitions to reduce noise.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The RGB/PCA preview passes are slow and not needed for the initial build. Use
--rgb / --pca to include them, or --rgb-only / --pca-only to add
them to existing stores afterwards.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two-phase write strategy:
1. Read all tiles in parallel (I/O bound, all workers)
2. Group tiles by chunk, assemble each 1024x1024 chunk buffer
   in memory, write once — no read-modify-write on chunk files

Each chunk buffer is ~132 MB (128 MB embeddings + 4 MB scales)
with writers capped at 8 to limit peak memory to ~1 GB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
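The grouping step of phase 2 is a simple bucketing by chunk index. A sketch (real tiles can span chunk borders, which this simplification ignores):

```python
from collections import defaultdict

CHUNK_PX = 1024  # chunk edge from the store layout

def group_tiles_by_chunk(tile_offsets):
    """Group (row_off, col_off) pixel offsets by the chunk they start in."""
    groups = defaultdict(list)
    for row, col in tile_offsets:
        groups[(row // CHUNK_PX, col // CHUNK_PX)].append((row, col))
    return groups
```

Each group's tiles are then copied into one in-memory chunk buffer and written in a single operation, so no chunk file is ever read back and rewritten.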
The zarrs codec pipeline is not reliable for our write patterns. Pure
zarr-python is sufficient now that chunk-batched writes eliminate
read-modify-write.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update zarrita.js from broken 0.4.0-next.18 pre-release to stable
0.6.1 (fixes missing _internal_get_array_context export). Add --store
flag to `geotessera-registry serve` to skip discovery and serve a
single store directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zarrita 0.6.1 requires absolute URLs in FetchStore constructor.
resolveStoreUrl now prepends window.location.origin.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace chunk-batched writes with simple per-tile sequential writes
- Merge _sample_chunk_stats and _sample_chunk_pca_stats into one function
- Merge write_rgb_pass and write_pca_pass into generic write_preview_pass
- Add _run_parallel() helper to eliminate repeated progress/executor boilerplate
- Remove ~850 lines of redundant code (2124 -> 1440 lines)
- Improve PCA colour range by equalising per-component variance (IQR
  normalisation) so all three channels use the full [0,255] range
- Add diagnostic message showing expected paths when all tiles are missing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Search base_dir and up to 2 parent directories for registry.parquet
instead of requiring --registry-dir. The flag still works as an
explicit override.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Move os.path.exists() checks from a separate loop in gather_tile_infos
into _scan_one_tile, eliminating a redundant pass over all tiles.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eliminate the parallel rasterio scanning phase that opened every
landmask file to read CRS/transform metadata. Instead, compute grid
info deterministically from tile coordinates via compute_tile_grid().
This avoids thousands of file I/O operations during gather_tile_infos,
which was especially slow on network filesystems like CephFS.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cache pyproj Transformer per EPSG (one per zone instead of one per
tile) and remove os.path.exists() checks that were slow on network
filesystems. Missing files are already handled by the try/except in
the write phase. This makes gather_tile_infos near-instant for any
number of tiles.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
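Caching one Transformer per EPSG is a natural fit for `functools.lru_cache`. A dependency-free sketch (the comment shows the real pyproj call; a sentinel object stands in for it here):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def transformer_for(epsg: int):
    # Real code would build:
    #   pyproj.Transformer.from_crs("EPSG:4326", f"EPSG:{epsg}", always_xy=True)
    # A sentinel object keeps this sketch dependency-free while still
    # demonstrating the one-object-per-zone caching behaviour.
    return object()
```

Since a UTM zone store has a single EPSG, every tile in the zone shares one cached Transformer instead of constructing its own.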
Scales arrays can contain inf values, which propagate through
dequantisation (int8 * inf = inf) into the PCA sample matrix.
This causes mean() to produce inf, centred data to contain inf/NaN,
and ultimately np.linalg.svd to fail with "SVD did not converge".

Fix at two layers:
- _sample_chunk_stats: use np.isfinite() instead of ~np.isnan() to
  filter out both NaN and inf scales at the source
- compute_pca_basis: drop any remaining non-finite rows from the
  sample matrix before computing mean and SVD

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
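The difference between the buggy and fixed filters is easy to demonstrate:

```python
import numpy as np

scales = np.array([0.01, np.nan, np.inf, -np.inf, 0.02], dtype=np.float32)

nan_filtered = scales[~np.isnan(scales)]  # buggy: +/-inf survives
finite = scales[np.isfinite(scales)]      # fixed: drops NaN and +/-inf
```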