WIP: Add zone-wide Zarr format for consolidated UTM zone stores #194
Draft
avsm wants to merge 43 commits into ucam-eo:main from
Conversation
Introduces geotessera/zarr_zone.py with support for building Zarr v3 stores that consolidate all tiles within a UTM zone into a single store per year (e.g. utm30_2024.zarr). Each store contains int8 embeddings and float32 scales arrays in native UTM coordinates with 10m pixels, chunked at (1024, 1024, 128) and (1024, 1024) respectively. Water pixels are marked as NaN in scales via landmask TIFFs. Exposed only via `geotessera-registry zarr-build` for building from local tile data, with --zones flag to test a single zone at a time. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
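For orientation, the mapping from a tile's UTM origin to its pixel (and chunk) offset in a zone-wide array can be sketched as below. The zone and tile origins are hypothetical coordinates, not taken from the repository; only the 10 m pixel size and 1024-pixel spatial chunking come from the store layout described above.

```python
# Sketch: locate a tile inside a consolidated zone-wide array.
# The origins below are hypothetical UTM coordinates; 10 m pixels and
# 1024 px spatial chunks match the store layout described above.
PIXEL_M = 10      # 10 m pixels
CHUNK_PX = 1024   # spatial chunk edge in pixels

def tile_offset(zone_origin_en, tile_origin_en):
    """Row/col pixel offset of a tile's upper-left corner in the zone array."""
    ze, zn = zone_origin_en   # zone store upper-left (easting, northing)
    te, tn = tile_origin_en   # tile upper-left (easting, northing)
    col = (te - ze) // PIXEL_M
    row = (zn - tn) // PIXEL_M   # northing decreases down the rows
    return int(row), int(col)

row, col = tile_offset((300000, 6100000), (420000, 5980000))
print(row, col)                           # 12000 12000
print(row // CHUNK_PX, col // CHUNK_PX)   # chunk indices: 11 11
```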
Replace text callbacks with Rich Progress bars showing:

- Tile scanning: bar with count and current tile coordinates
- Per-zone writing: bar with count, elapsed time, current tile
- Dry-run: table of zone/EPSG/tile-count
- Zone header: grid dimensions in pixels and km

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use ProcessPoolExecutor to read landmask CRS/transform in parallel during the scan phase. The rasterio Affine transform is serialised as a plain tuple to cross the process boundary. Defaults to min(cpu_count, 8) workers. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add compute_tile_grid() which derives EPSG, transform, and pixel dimensions from a tile's lon/lat using pyproj, without opening any files. During scanning, each tile's computed grid is compared against the actual landmask TIFF and mismatches are reported as warnings. This validates whether the landmask-free approach can replace per-tile I/O in future. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
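The EPSG part of that derivation can be illustrated with the standard UTM zone formula. This is a hypothetical stand-in for `compute_tile_grid()`, not the project's actual implementation, and it ignores the Norway/Svalbard zone exceptions:

```python
import math

def utm_epsg(lon, lat):
    """Standard UTM zone/EPSG derivation from geographic coordinates.

    Hypothetical sketch of the EPSG step only; the real compute_tile_grid()
    also derives the transform and pixel dimensions via pyproj.
    """
    zone = int(math.floor((lon + 180.0) / 6.0)) + 1
    return (32600 if lat >= 0 else 32700) + zone

print(utm_epsg(-3.5, 52.0))   # 32630 — zone 30 north, matching utm30 stores
print(utm_epsg(-3.5, -33.0))  # 32730 — same zone, southern hemisphere
```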
zarr 3.x create_array() expects `compressors=ZstdCodec(level=3)` not `codecs=[BytesCodec(), ZstdCodec(level=3)]`. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The landmask TIFFs use the raw projected upper-left corner as origin (no snap to 10m pixel grid) and round() for dimensions. Our function was using floor/ceil snapping which caused off-by-one pixel mismatches at high latitudes. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
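The off-by-one can be shown with a small worked example (the bounds below are made-up values, chosen only to trigger the mismatch):

```python
import math

# Hypothetical raw projected tile bounds (metres), 10 m pixels.
west, east = 499_987.0, 612_340.0

# Old behaviour: snap edges outward to the 10 m grid, then derive the width.
snapped_w = math.floor(west / 10) * 10   # 499980
snapped_e = math.ceil(east / 10) * 10    # 612340
width_snapped = (snapped_e - snapped_w) // 10

# Landmask convention: keep the raw corner as origin and round() the width.
width_round = round((east - west) / 10)

print(width_snapped, width_round)  # 11236 11235 — the off-by-one mismatch
```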
Zarr v3 uses native dimension_names metadata (not _ARRAY_DIMENSIONS) to tell xarray which arrays are dimension coordinates. Adding these makes xr.open_zarr() automatically assign northing/easting/band as indexed dimension coordinates on embeddings and scales. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
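A sketch of the relevant Zarr v3 array metadata (the `zarr.json` document) for the embeddings array — the shape here is hypothetical, but the field names follow the v3 spec, and it is the native `dimension_names` field (rather than the Zarr v2 `_ARRAY_DIMENSIONS` attribute) that xarray reads:

```python
# Hypothetical zarr.json fragment for the embeddings array, expressed as a
# Python dict. "dimension_names" is what lets xr.open_zarr() assign
# northing/easting/band as indexed dimension coordinates.
embeddings_meta = {
    "zarr_format": 3,
    "node_type": "array",
    "shape": [112_640, 112_640, 128],   # made-up zone dimensions
    "data_type": "int8",
    "dimension_names": ["northing", "easting", "band"],
}
print(embeddings_meta["dimension_names"])
```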
During tile scanning, check whether a tile's west and east edges (lon ± 0.05°) fall in different UTM zones. These tiles are placed in whichever zone the landmask EPSG specifies, but the warning surfaces them for review. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
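The edge check reduces to comparing the zone numbers of the two offset longitudes. A minimal sketch, again using the standard zone formula (zone exceptions ignored) rather than the project's exact code:

```python
import math

def utm_zone(lon):
    """Standard UTM zone number from longitude (zone exceptions ignored)."""
    return int(math.floor((lon + 180.0) / 6.0)) + 1

def straddles_zone(lon, half_width=0.05):
    """True when a tile's west/east edges (lon +/- half_width degrees)
    fall in different UTM zones. Hypothetical sketch of the scan-time check."""
    return utm_zone(lon - half_width) != utm_zone(lon + half_width)

print(straddles_zone(-3.5))  # False: well inside zone 30
print(straddles_zone(-6.0))  # True: sits on the zone 29/30 boundary
```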
Standalone HTML viewer that streams Zarr v3 UTM zone stores and renders false-colour overlays on a dark CartoDB basemap using MapLibre GL JS. Client-side UTM-to-WGS84 reprojection via proj4, zstd decompression via fzstd, and per-chunk contrast-stretched RGB from any 3 of 128 embedding bands selected by the user. The `serve` CLI command starts a CORS-enabled HTTP server, auto-detects .zarr stores in the target directory, and opens the viewer in the browser. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The server writes a _stores.json manifest listing available .zarr directories. The viewer fetches this on load, populates a dropdown, and auto-loads the first store. Manual URL entry remains available as a collapsible fallback. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All fetches (_stores.json, zarr.json, chunks) now use relative paths derived from window.location.pathname rather than absolute URLs built from window.location.origin. This ensures everything is same-origin regardless of which hostname the server is accessed from. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Server scans each store's embeddings/c/ directory on startup and writes _chunk_manifest.json listing existing chunk indices. Viewer loads the manifest and skips requests for non-existent chunks. Also stops suppressing 404 log messages on the server. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ter fetching

- Chunk grid overlay shows data vs nodata positions with cyan/gray styling
- Chunk index labels appear at higher zoom levels
- UTM zone boundary shown as dashed green outline
- Both overlays toggleable via checkboxes in the control panel
- Chunks to fetch are sorted by distance from viewport center so nearest chunks render first when panning
- Raster layers insert below overlays for correct z-ordering

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Raster chunks were inserted below overlay fills, making them invisible. Now inserted above fills but below grid lines so data is visible with grid outlines on top. Added console logging for viewport changes, chunk fetch lists, and per-chunk render results. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Remove chunk-grid-labels symbol layer that caused font 404 from demotiles.maplibre.org, likely breaking MapLibre's rendering pipeline
- Remove unused glyphs config from map style
- Replace updatePending/updateQueued with loadGeneration counter so viewport changes immediately cancel in-flight chunk loads
- Cap chunks loaded per viewport update to 80 (nearest to center first) with yellow status message when capped
- Bump fetch concurrency from 3 to 4

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Root cause: using beforeId to insert raster layers at specific positions was unreliable. Now raster layers are added on top, then grid lines and UTM outline are moved above them with map.moveLayer(). This guarantees rasters are visible with overlays on top. Also: auto-zoom to the first rendered chunk so the user can immediately see the data at an appropriate zoom level. Logs full layer order after each render for diagnostics. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Add precomputed RGBA uint8 preview to Zarr stores so the viewer can show a fast false-colour overview (~50-200KB per chunk) instead of fetching full 128-band embeddings (~130KB compressed) for every tile.

- zarr_zone.py: BandStats reservoir sampler, compute_rgb_chunk, write_rgb_pass (two-pass build), add_rgb_to_existing_store
- registry_cli.py: --rgb-only and --no-rgb flags for zarr-build, has_rgb field in chunk manifests
- viewer: RGB preview loading, click-to-inspect for full bands, split rendering pipeline with automatic fallback

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use arr.chunks[:2] instead of arr.metadata.chunk_grid.configuration which doesn't exist on RegularChunkGrid in zarr-python v3. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace the sequential BandStats reservoir sampler with compute_stretch_from_store() which uses ThreadPoolExecutor to read chunks in parallel (zarr I/O releases the GIL). Each thread reads only the 3 RGB bands and returns a subsample of dequantised values; the main thread computes global p2/p98 percentiles. This also simplifies build_zone_stores: stretch is now computed from the written store rather than accumulated during tile writing. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
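The shape of that pattern — threads each return a cheap subsample, the main thread pools them and takes global p2/p98 — can be sketched as follows. Chunk data here is synthetic and the names are illustrative, not the project's actual `compute_stretch_from_store()`:

```python
# Sketch of the parallel stretch computation: worker threads each return a
# subsample of dequantised values; the main thread computes global p2/p98.
# Synthetic data stands in for zarr chunk reads.
from concurrent.futures import ThreadPoolExecutor
import numpy as np

rng = np.random.default_rng(0)
chunks = [rng.integers(-128, 128, size=(64, 64, 3)).astype(np.int8)
          for _ in range(8)]
scale = 0.01  # hypothetical dequantisation scale

def sample_chunk(chunk):
    """Dequantise the 3 RGB bands and return a cheap deterministic subsample."""
    vals = chunk.astype(np.float32).reshape(-1) * scale
    return vals[::37]

with ThreadPoolExecutor(max_workers=4) as pool:
    samples = list(pool.map(sample_chunk, chunks))

pooled = np.concatenate(samples)
p2, p98 = np.percentile(pooled, [2, 98])
print(p2, p98)  # global stretch bounds used for contrast scaling
```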
- compute_rgb_chunk: replace NaN scales with 0 before multiplication to avoid RuntimeWarning in np.clip().astype(uint8)
- write_rgb_pass: use ThreadPoolExecutor to read, compute, and write RGB chunks in parallel (each chunk is an independent zarr file, so concurrent writes are safe)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
All three I/O-heavy phases now use ThreadPoolExecutor:

1. Tile writing: each tile reads .npy + .tiff from disk and writes to a non-overlapping zarr region — safe for concurrent threads
2. RGB stretch computation (already parallel)
3. RGB chunk writing (already parallel)

Thread count is configurable via --workers / -j flag on zarr-build (default: min(cpu_count, 16)). The count is threaded through to all parallel phases.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The --rgb-only path was iterating all .zarr stores in the output directory, ignoring the --zones flag. Now parses zone number from store names (e.g. utm30_2025.zarr) and filters accordingly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
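The zone filter amounts to parsing the store-name convention. A minimal sketch (the regex and helper name are illustrative, not the code in registry_cli.py):

```python
import re

def zone_from_store(name):
    """Extract the UTM zone number from a store name like 'utm30_2025.zarr'.

    Hypothetical helper; returns None when the name does not match.
    """
    m = re.fullmatch(r"utm(\d+)_(\d{4})\.zarr", name)
    return int(m.group(1)) if m else None

stores = ["utm29_2025.zarr", "utm30_2025.zarr", "utm31_2024.zarr"]
wanted = {30}  # e.g. from a --zones flag
print([s for s in stores if zone_from_store(s) in wanted])
# ['utm30_2025.zarr']
```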
Move zstd decompression and band rendering off the main thread into a pool of WebWorkers (hardwareConcurrency, capped at 8). Replace the loadGeneration counter with AbortController so in-flight fetches are actually cancelled on viewport change. Add a fixed progress bar at the bottom of the viewport that tracks chunk loading progress. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Compute PCA across all 128 embedding bands and write a pca_rgb RGBA preview array to the zarr store. The 3 principal components map to RGB channels, giving a maximally informative false-colour visualization.

- Server side: compute_pca_basis() subsamples all bands in parallel, fits SVD, then write_pca_pass() projects and stretches each chunk.
- CLI: --pca-only adds PCA to existing stores, --no-pca skips it.
- Viewer: detects pca_rgb, adds RGB/PCA toggle button, routes preview chunk loads through previewSource state.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Query MultiIndex directly for year (avoids iterating all 2.7M tiles), pre-filter by UTM zone before file existence checks, add progress bars for file checking phase, and pass --workers through to gather_tile_infos. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Replace manual zarr v3 reader + fzstd decompression with zarrita.js which handles sharded stores natively. Workers now only render typed arrays (no decompression). Manifest scanner uses zarr library instead of filesystem traversal. Set zarr async concurrency to 128. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Zarr v3 sharding causes each incremental write to read-modify-write the entire shard file (~2GB when full), making per-tile writes progressively slower. Switch to grouping tiles by shard, assembling each shard in memory, and writing once — benchmarked at ~87ms/tile vs ~542ms/tile incremental. Also iterate RGB/PCA preview passes at SHARD_PX granularity instead of inner chunk size, and extract SHARD_PX/INNER_CHUNK_PX constants. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Revert sharding (shards/inner-chunks) back to simple 1024x1024 chunks and disable compression (compressors=None) since embeddings are high-entropy quantised int8 where zstd gives negligible benefit. Keeps the zarrita.js-based viewer from the previous commit. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Adds optional zarrs (Rust) backend for zarr I/O. When the zarrs package is installed, the Rust codec pipeline is activated automatically; otherwise falls back to zarr-python with an info log either way. Install with: uv pip install geotessera[fast] Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zarrs has issues with partial writes to non-existent chunks (read-modify-write on sparse stores), so restrict the Rust codec pipeline to read-only operations and use zarr-python for writes. Logs pipeline switches only on transitions to reduce noise. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
These passes are slow and not needed for the initial build. Use --rgb / --pca to include them, or --rgb-only / --pca-only to add them to existing stores afterwards. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Two-phase write strategy:

1. Read all tiles in parallel (I/O bound, all workers)
2. Group tiles by chunk, assemble each 1024x1024 chunk buffer in memory, write once — no read-modify-write on chunk files

Each chunk buffer is ~132 MB (128 MB embeddings + 4 MB scales) with writers capped at 8 to limit peak memory to ~1 GB.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
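The memory figures follow directly from the chunk shapes and dtypes given earlier:

```python
# Back-of-envelope check of the chunk-buffer memory figures above.
emb = 1024 * 1024 * 128 * 1   # int8 embeddings, (1024, 1024, 128): bytes
scl = 1024 * 1024 * 4         # float32 scales, (1024, 1024): bytes

per_chunk_mb = (emb + scl) / 2**20
print(per_chunk_mb)              # 132.0 MB per chunk buffer
print(per_chunk_mb * 8 / 1024)   # ~1.03 GB peak with 8 concurrent writers
```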
Not reliable for our write patterns. Pure zarr-python is sufficient now that chunk-batched writes eliminate read-modify-write. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Update zarrita.js from broken 0.4.0-next.18 pre-release to stable 0.6.1 (fixes missing _internal_get_array_context export). Add --store flag to `geotessera-registry serve` to skip discovery and serve a single store directly. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
zarrita 0.6.1 requires absolute URLs in FetchStore constructor. resolveStoreUrl now prepends window.location.origin. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Replace chunk-batched writes with simple per-tile sequential writes
- Merge _sample_chunk_stats and _sample_chunk_pca_stats into one function
- Merge write_rgb_pass and write_pca_pass into generic write_preview_pass
- Add _run_parallel() helper to eliminate repeated progress/executor boilerplate
- Remove ~850 lines of redundant code (2124 -> 1440 lines)
- Improve PCA colour range by equalising per-component variance (IQR normalisation) so all three channels use the full [0,255] range
- Add diagnostic message showing expected paths when all tiles are missing

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Search base_dir and up to 2 parent directories for registry.parquet instead of requiring --registry-dir. The flag still works as an explicit override. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
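The upward search can be sketched with pathlib. Function name and signature are illustrative, not the CLI's actual code:

```python
from pathlib import Path
import tempfile

def find_registry(base_dir, levels=2, name="registry.parquet"):
    """Look for `name` in base_dir and up to `levels` parent directories.

    Hypothetical sketch of the discovery described above; returns None
    when no registry is found, letting --registry-dir act as the override.
    """
    d = Path(base_dir).resolve()
    for candidate in [d, *list(d.parents)[:levels]]:
        if (candidate / name).exists():
            return candidate / name
    return None

# Usage: a registry two levels above the working directory is found.
with tempfile.TemporaryDirectory() as root:
    (Path(root) / "registry.parquet").touch()
    sub = Path(root) / "tiles" / "utm30"
    sub.mkdir(parents=True)
    print(find_registry(sub))
```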
Move os.path.exists() checks from a separate loop in gather_tile_infos into _scan_one_tile, eliminating a redundant pass over all tiles. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Eliminate the parallel rasterio scanning phase that opened every landmask file to read CRS/transform metadata. Instead, compute grid info deterministically from tile coordinates via compute_tile_grid(). This avoids thousands of file I/O operations during gather_tile_infos, which was especially slow on network filesystems like CephFS. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Cache pyproj Transformer per EPSG (one per zone instead of one per tile) and remove os.path.exists() checks that were slow on network filesystems. Missing files are already handled by the try/except in the write phase. This makes gather_tile_infos near-instant for any number of tiles. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
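The caching pattern is a one-per-key memo; with a counter standing in for the expensive `pyproj.Transformer.from_crs()` construction (the placeholder object and call counting are purely illustrative):

```python
from functools import lru_cache

calls = {"n": 0}

@lru_cache(maxsize=None)
def transformer_for(epsg):
    """Stand-in for pyproj.Transformer.from_crs(4326, epsg): construction
    is expensive, so cache one instance per EPSG (i.e. per UTM zone)."""
    calls["n"] += 1
    return f"transformer-to-{epsg}"  # hypothetical placeholder object

# 6 tiles spread over 2 zones -> only 2 transformer constructions.
for epsg in [32630, 32630, 32631, 32630, 32631, 32631]:
    transformer_for(epsg)
print(calls["n"])  # 2
```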
Scales arrays can contain inf values, which propagate through dequantisation (int8 * inf = inf) into the PCA sample matrix. This causes mean() to produce inf, centred data to contain inf/NaN, and ultimately np.linalg.svd to fail with "SVD did not converge". Fix at two layers:

- _sample_chunk_stats: use np.isfinite() instead of ~np.isnan() to filter out both NaN and inf scales at the source
- compute_pca_basis: drop any remaining non-finite rows from the sample matrix before computing mean and SVD

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
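The difference between the two filters in miniature (synthetic scales values):

```python
import numpy as np

# Why ~np.isnan is not enough: inf survives it and poisons downstream PCA.
scales = np.array([0.02, np.nan, np.inf, 0.03], dtype=np.float32)

kept_nan_only = scales[~np.isnan(scales)]    # still contains inf
kept_finite   = scales[np.isfinite(scales)]  # NaN and inf both dropped

print(kept_nan_only)
print(kept_finite)
```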
Still a WIP while I experiment with a few conversions and build a visualiser.