If something goes wrong while writing a component of a Zarr store, especially a .zattrs (JSON) file, the corrupted data can prevent the Zarr store from being read. We could implement a tolerance mechanism that reads as much non-corrupted data as possible and reports to the user what has been detected as corrupted. That way the user knows what is broken and can fix it manually.
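A minimal sketch of what such a tolerance mechanism could look like, assuming the SpatialData on-disk layout (`read_tolerant` and `read_element` are hypothetical names, not existing spatialdata functions):

```python
import warnings

import zarr


def read_element(group: zarr.Group):
    # Placeholder for the format-specific readers spatialdata already has;
    # merely accessing the attributes is enough to trip over corrupted .zattrs.
    return dict(group.attrs)


def read_tolerant(store_path: str) -> tuple[dict, dict]:
    """Read every element of a store, skipping (and reporting) corrupted
    elements instead of aborting the whole read."""
    root = zarr.open_group(store_path, mode="r")
    elements, corrupted = {}, {}
    for element_type in ("images", "labels", "points", "shapes", "table"):
        if element_type not in root:
            continue
        for name in root[element_type].group_keys():
            path = f"{element_type}/{name}"
            try:
                elements[path] = read_element(root[path])
            except Exception as exc:  # bad JSON, missing chunks, ...
                corrupted[path] = exc
                warnings.warn(f"Skipping corrupted element {path!r}: {exc}")
    return elements, corrupted
```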
The leading question is: how do I handle corrupted data that users bring to me? The bigger a dataset collection, the greater its value, but also the greater the risk that some small part of it becomes corrupted.
By corruption I mostly mean:

- inconsistent JSON files (see the consistency-check sketch after this list):
  - labels/.zattrs referring to a label name that does not exist in labels
  - table/table/.zarr referring to a region that is not found
  - table/table/obs/.zarr referring to a column that is not found or has been renamed
  - consolidated zmetadata inconsistent with the elements that actually exist
- unreadable JSON file (write aborted, syntax error; element not recoverable)
- unreadable binary array data (element not recoverable)

If the corrupted file is not a top-level file and not strictly required (such as a table's region/instance column), the other elements should remain readable.
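For the first case a consistency check is easy to sketch; `check_labels_consistency` is a hypothetical helper, and it relies on the OME-NGFF convention that `labels/.zattrs` lists the label names under a `"labels"` key:

```python
import json
from pathlib import Path


def check_labels_consistency(store_path: str) -> list[str]:
    """Report label names listed in labels/.zattrs that have no matching
    subgroup, as well as an unreadable labels/.zattrs itself."""
    zattrs = Path(store_path) / "labels" / ".zattrs"
    try:
        attrs = json.loads(zattrs.read_text())
    except (OSError, json.JSONDecodeError) as exc:
        return [f"unreadable JSON file {zattrs}: {exc}"]
    return [
        f"labels/.zattrs refers to missing label {name!r}"
        for name in attrs.get("labels", [])
        if not (zattrs.parent / name).is_dir()
    ]
```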
### Motivation
spatialdata is the only API able to read SpatialData stores, so when a store is partially corrupted, we can currently only use external tools to manipulate (or delete) files until a valid state is reached. One should expect that, when handling the non-corrupted files, the official API is safer than any external tool.
### Feature
The read function should have an optional "forgiving" read mode in which the severity of read errors is reduced (to a warning, or perhaps a pydantic-like collection of validation errors), so that in this mode a SpatialData object is always returned containing at least the valid elements (in the worst case, none). I can then cleanly remove the corrupted elements or overwrite them with valid data.
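As a sketch of the proposed API shape (the `on_error` keyword, the tuple return value, and the error records are purely illustrative; none of this exists in spatialdata yet):

```python
from spatialdata import read_zarr

# Hypothetical forgiving mode: instead of failing on the first unreadable
# element, collect the errors and return whatever could be read.
sdata, errors = read_zarr("data.zarr", on_error="collect")
for error in errors:
    print(f"{error.element_path}: {error.reason}")
# `sdata` contains only the elements that could be read, so corrupted ones
# can now be removed from disk or overwritten with valid data.
```

Returning the errors alongside the object, rather than only emitting warnings, would let callers decide programmatically how to repair the store.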
Originally suggested by @aeisenbarth