Tolerance when reading corrupted data #457

Open
LucaMarconato opened this issue Feb 14, 2024 · 1 comment · May be fixed by #765

@LucaMarconato (Member)

If something goes wrong while writing a component of a Zarr store, especially a .zattrs (JSON) file, the corrupted data can prevent the Zarr store from being read. We could implement a tolerance mechanism that reads as much non-corrupted data as possible and reports to the user what has been detected as corrupted. That way the user knows what is broken and can fix it manually.
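
As a rough sketch of what such a mechanism could look like (all names here are hypothetical placeholders, not existing spatialdata API):

```python
# Hypothetical sketch of the proposed tolerance mechanism; iter_element_paths
# and read_element stand in for spatialdata's element discovery and readers.
import warnings

def read_zarr_tolerant(store_path: str) -> tuple[dict, dict]:
    elements: dict = {}
    corrupted: dict = {}
    for name, path in iter_element_paths(store_path):  # hypothetical helper
        try:
            elements[name] = read_element(path)  # hypothetical per-element reader
        except Exception as err:  # broad on purpose: survive any corruption
            corrupted[name] = err
            warnings.warn(f"Skipping corrupted element {name!r}: {err}")
    return elements, corrupted
```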

Originally suggested by @aeisenbarth

@aeisenbarth (Contributor)

Thanks! To expand on the issue:

The leading question is: how do I handle corrupted data that users bring to me? The bigger a dataset collection, the greater its value, but also the greater the risk that some small part of it becomes corrupted.

By corruption I mostly mean:

  • inconsistent JSON file:
    • labels/.zattrs referring to a label name not existing in labels
    • table/table/.zattrs referring to a region that is not found
    • table/table/obs/.zattrs referring to a column that is not found or has been renamed
    • consolidated .zmetadata inconsistent with the actually existing elements
  • unreadable JSON file (aborted during write, syntax error, element not recoverable)
  • unreadable binary array data (element not recoverable)

If a corrupted file is not top-level and not required (e.g. a table's region/instance columns), the other elements should remain readable.
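
For the "unreadable JSON file" case above, detection can be sketched with the standard library alone (the function name is illustrative, not existing spatialdata API):

```python
# Sketch: locate unreadable .zattrs files (one corruption class listed above)
# using only the standard library; find_broken_zattrs is an illustrative name.
import json
from pathlib import Path

def find_broken_zattrs(store_path: str) -> list[Path]:
    broken = []
    for zattrs in Path(store_path).rglob(".zattrs"):
        try:
            json.loads(zattrs.read_text())
        except (OSError, ValueError):  # I/O error, bad encoding, or invalid JSON
            broken.append(zattrs)
    return broken
```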

Motivation

spatialdata is the only API able to read SpatialData stores, so when a store is partially corrupted, we can currently only use external tools to manipulate (or delete) files until a valid state is reached. One should expect the official API to handle the non-corrupted parts of a store more safely than any external tool.

Feature

The read function should have an optional, "forgiving" read mode in which the severity level of read errors can be reduced (to a warning, or perhaps a pydantic-like collection of validation errors), so that in this mode a SpatialData object is always returned, containing at least the valid elements (in the worst case none). I can then cleanly remove corrupted elements or overwrite them with valid data.
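
A possible shape for such an API, mirroring pandas' on_bad_lines; the on_bad_elements parameter and the ReadReport container are illustrative names, not existing spatialdata API:

```python
# Hypothetical signature for the proposed forgiving read mode.
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class ReadReport:
    # Maps element path -> the exception raised while reading it.
    errors: dict[str, Exception] = field(default_factory=dict)

def read_zarr(
    store: str,
    on_bad_elements: Literal["error", "warn", "skip"] = "error",
) -> tuple["SpatialData", ReadReport]:
    """In "warn"/"skip" mode, always return a SpatialData containing the
    valid elements (in the worst case none), plus a report of failures."""
    ...
```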

Examples

pandas.read_csv

  • on_bad_lines: {‘error’, ‘warn’, ‘skip’} or Callable, default ‘error’
  • encoding_errors: str, optional, default ‘strict’
    with values from the standard library codecs module, which even offers repair options
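
A small runnable illustration of those two pandas options (pandas ≥ 1.3):

```python
# Skip malformed rows and replace undecodable bytes instead of raising.
import io
import pandas as pd

raw = b"a,b\n1,2\n3,4,5\n\xff,6\n"  # extra field in row 2, bad byte in row 3
df = pd.read_csv(
    io.BytesIO(raw),
    on_bad_lines="skip",        # drop the malformed row instead of erroring
    encoding_errors="replace",  # map undecodable bytes to U+FFFD
)
print(df)  # row "3,4,5" is gone; "\xff" shows up as the replacement character
```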

@aeisenbarth aeisenbarth linked a pull request Nov 6, 2024 that will close this issue