Tolerance when reading corrupted data #457

Open
LucaMarconato opened this issue Feb 14, 2024 · 1 comment · May be fixed by #765

@LucaMarconato (Member)

If something goes wrong while writing a component of a Zarr store, especially a .zattrs (JSON) file, the corrupted data can prevent the Zarr store from being read. We could implement a tolerance mechanism that reads as much non-corrupted data as possible and reports to the user what has been detected as corrupted. That way the user knows what is broken and can fix it manually.
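
As a rough sketch of what such a mechanism could look like (all names here are hypothetical placeholders, not existing spatialdata API):

```python
# Hypothetical sketch of the proposed tolerance mechanism; iter_element_paths
# and read_element stand in for spatialdata's element discovery and readers.
import warnings

def read_zarr_tolerant(store_path: str) -> tuple[dict, dict]:
    elements: dict = {}
    corrupted: dict = {}
    for name, path in iter_element_paths(store_path):  # hypothetical helper
        try:
            elements[name] = read_element(path)  # hypothetical per-element reader
        except Exception as err:  # broad on purpose: survive any corruption
            corrupted[name] = err
            warnings.warn(f"Skipping corrupted element {name!r}: {err}")
    return elements, corrupted
```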

Originally suggested by @aeisenbarth

@aeisenbarth (Contributor)

Thanks! To expand on the issue:

The leading question is: how do I handle corrupted data that users bring to me? The bigger a dataset collection, the greater its value, but also the greater the risk that some small part of it becomes corrupted.

By corruption I mostly mean:

  • inconsistent JSON file:
    • labels/.zattrs referring to a label name not existing in labels
    • table/table/.zattrs referring to a region that is not found
    • table/table/obs/.zattrs referring to a column that is not found or has been renamed
    • consolidated .zmetadata inconsistent with the actually existing elements
  • unreadable JSON file (aborted during write, syntax error, element not recoverable)
  • unreadable binary array data (element not recoverable)

If a corrupted file is not top-level and not required (e.g. a table's region/instance columns), the other elements should remain readable.
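
For the "unreadable JSON file" case above, detection can be sketched with the standard library alone (the function name is illustrative, not existing spatialdata API):

```python
# Sketch: locate unreadable .zattrs files (one corruption class listed above)
# using only the standard library; find_broken_zattrs is an illustrative name.
import json
from pathlib import Path

def find_broken_zattrs(store_path: str) -> list[Path]:
    broken = []
    for zattrs in Path(store_path).rglob(".zattrs"):
        try:
            json.loads(zattrs.read_text())
        except (OSError, ValueError):  # I/O error, bad encoding, or invalid JSON
            broken.append(zattrs)
    return broken
```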

Motivation

spatialdata is the only API able to read SpatialData stores, so when a store is partially corrupted, we can currently only use external tools to manipulate (or delete) files until a valid state is reached. One should expect the official API to handle the non-corrupted parts of a store more safely than any external tool.

Feature

The read function should have an optional, "forgiving" read mode in which the severity level of read errors can be reduced (to a warning, or perhaps a pydantic-like collection of validation errors), so that in this mode a SpatialData object is always returned, containing at least the valid elements (in the worst case none). I can then cleanly remove corrupted elements or overwrite them with valid data.
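
A possible shape for such an API, mirroring pandas' on_bad_lines; the on_bad_elements parameter and the ReadReport container are illustrative names, not existing spatialdata API:

```python
# Hypothetical signature for the proposed forgiving read mode.
from dataclasses import dataclass, field
from typing import Literal

@dataclass
class ReadReport:
    # Maps element path -> the exception raised while reading it.
    errors: dict[str, Exception] = field(default_factory=dict)

def read_zarr(
    store: str,
    on_bad_elements: Literal["error", "warn", "skip"] = "error",
) -> tuple["SpatialData", ReadReport]:
    """In "warn"/"skip" mode, always return a SpatialData containing the
    valid elements (in the worst case none), plus a report of failures."""
    ...
```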

Examples

pandas.read_csv

  • on_bad_lines: {‘error’, ‘warn’, ‘skip’} or Callable, default ‘error’
  • encoding_errors: str, optional, default ‘strict’
    with values from the standard library codecs module, which even offers repair options
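
A small runnable illustration of those two pandas options (pandas ≥ 1.3):

```python
# Skip malformed rows and replace undecodable bytes instead of raising.
import io
import pandas as pd

raw = b"a,b\n1,2\n3,4,5\n\xff,6\n"  # extra field in row 2, bad byte in row 3
df = pd.read_csv(
    io.BytesIO(raw),
    on_bad_lines="skip",        # drop the malformed row instead of erroring
    encoding_errors="replace",  # map undecodable bytes to U+FFFD
)
print(df)  # row "3,4,5" is gone; "\xff" shows up as the replacement character
```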

@aeisenbarth aeisenbarth linked a pull request Nov 6, 2024 that will close this issue