development_overview.md

zarr_implementations Developer Guide

Overview

This repository contains scripts to generate datasets in zarr-v2, zarr-v3 and N5 formats via a number of different implementations. Once generated, the test suite will attempt to have each library read all datasets with a supported format.

Both the data generation and testing steps can be run from the root of the repository via the command:

make test

To generate data only use:

make data

or to generate only the data from a specific library, use:

make zarrita

Creating a development environment

The environment.yml file in the base of the repository lists the requirements for zarr_implementations. To create a new conda environment named zarr_impl_dev with the necessary dependencies installed, use

conda env create --name zarr_impl_dev --file environment.yml

and then activate the environment via

conda activate zarr_impl_dev

Data Generation

All test data is currently derived from a common reference image created by the script generate_data/generate_reference_image.py. Each library that generates data currently has either a Python script or a shell script that generates datasets for the supported file types and codecs. Some implementations also generate versions using both flat and nested file storage.

For libraries with a Python API, the data generation script is itself written in Python. For non-Python projects, a subfolder named after the project is typically created; within this subfolder is a shell script that compiles and/or runs a program to generate the data. See the Makefile and the corresponding scripts that it executes for reference. At the time of writing, these include the following concrete examples:

| Library Name | Data generation script language(s) | Base output file name |
|---|---|---|
| zarr-python | Python | zarr |
| z5py | Python | z5py |
| zarrita | Python | zarrita |
| pyn5 | Python | pyn5 |
| zarr.js | bash + node-js | js |
| n5-java | bash + Java (Maven) | n5-java |
| xtensor-zarr | bash + C++ (CMake) | xtensor_zarr |

Care should be taken to note the output file name used during generation, as the same name will need to be used in the test scripts later.
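To make the on-disk layout concrete, the sketch below hand-writes a minimal flat zarr-v2 array with no compressor (the "raw" codec) as a single chunk, using only the standard library and NumPy. This is purely illustrative; the actual generation scripts use the libraries themselves, and the function name `write_minimal_zarr_v2` is hypothetical.

```python
import json
import os

import numpy as np


def write_minimal_zarr_v2(out_dir, name, data):
    # Illustrative sketch: a flat zarr-v2 array stored as one chunk with no
    # compressor. Real generation scripts rely on each library's own writer.
    arr_dir = os.path.join(out_dir, name)
    os.makedirs(arr_dir, exist_ok=True)
    meta = {
        "zarr_format": 2,
        "shape": list(data.shape),
        "chunks": list(data.shape),   # one chunk covering the whole array
        "dtype": data.dtype.str,
        "compressor": None,           # the "raw" codec
        "fill_value": 0,
        "filters": None,
        "order": "C",
    }
    with open(os.path.join(arr_dir, ".zarray"), "w") as f:
        json.dump(meta, f)
    # Flat storage: the single chunk of a 2-D array is keyed "0.0".
    with open(os.path.join(arr_dir, "0.0"), "wb") as f:
        f.write(np.ascontiguousarray(data).tobytes())
```

The chunk key separator (`.` for flat, `/` for nested) is what distinguishes the two storage layouts discussed below.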

The name of the directory used to store the data should end with one of the following extensions. The extension is parsed at testing time to determine the contained file format.

| File Format | Directory Extension |
|---|---|
| zarr (v2) | .zr |
| zarr (v3) | .zr3 |
| N5 | .n5 |
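A minimal sketch of this extension-based dispatch, using a hypothetical helper (the mapping values mirror the table above; the function is not part of the repository):

```python
from pathlib import Path

# Assumed mapping from directory extension to file format, per the table above.
EXTENSION_TO_FORMAT = {".zr": "zarr (v2)", ".zr3": "zarr (v3)", ".n5": "N5"}


def format_from_path(fpath):
    # Hypothetical helper: infer the contained file format from the
    # directory's extension, as the test scripts do at collection time.
    return EXTENSION_TO_FORMAT[Path(fpath).suffix]
```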

Codec storage

The same data is currently stored as separate arrays within the dataset, with the array names matching the codec used. The following codecs are tested for libraries that support them:

| Codec | Array (Dataset) Name |
|---|---|
| None | raw |
| gzip | gzip |
| zlib | zlib |
| BLOSC + lz4 | blosc/lz4 |

Flat vs. Nested Data

Some libraries support storing arrays in either a flat (single directory) format or a nested storage format. In such cases the base filename (excluding its extension) should end in either _flat or _nested to indicate this. For example, xtensor-zarr data stored in zarr-v3 format with nested storage would be named xtensor_zarr_nested.zr3. Not all implementations have metadata indicating whether the storage is flat or nested, so the testing scripts currently set a nested boolean flag based on whether _nested appears in the filename.
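That filename convention can be captured with a small helper like the following (a sketch; the helper name is hypothetical, not the repository's actual code):

```python
from pathlib import Path


def is_nested(fpath):
    # Assumed helper mirroring the convention above: treat data as nested
    # when the base filename (extension excluded) ends in "_nested".
    return Path(fpath).stem.endswith("_nested")
```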

Storage classes

Currently zarr-python also stores data with multiple different underlying storage classes. In this specific case, the name of the storage class is also indicated in the filename. For example, zarr (v2) is stored in flat format with both a DirectoryStore class and an FSStore class, corresponding to the filenames zarr_DirectoryStore_flat.zr and zarr_FSStore_flat.zr, respectively.
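Putting the naming pieces together, the convention composes as base name, optional storage class, optional flat/nested layout, and format extension. A hypothetical helper (not in the repository) makes the composition explicit:

```python
def output_name(base, store=None, layout=None, ext=".zr"):
    # Hypothetical helper illustrating the naming convention:
    # base [+ storage class] [+ "flat"/"nested"] + format extension.
    parts = [base] + [p for p in (store, layout) if p]
    return "_".join(parts) + ext
```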

Read Tests

Running the tests

Currently all tests reside in a single file, test/test_read_all.py. If the data has already been generated, tests can be run from the root folder of the repository via:

pytest test/test_read_all.py -v

Alternatively, make test will generate the data and then run the tests.

Writing tests

When adding a new library to the test suite, a corresponding entry for the library should be added to the READABLE_CODECS dictionary. Here is the entry for zarrita as a concrete example:

    "zarrita": {
        "zarr": [],
        "zarr-v3": ["blosc", "gzip", "raw", "zlib"],
        "N5": [],
    },

The value here is a dictionary where the keys correspond to the three file types tested by zarr_implementations and the values are the codecs that this implementation should be able to read. For zarrita, the lists for zarr (v2) and N5 are empty because these file formats are unsupported.
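How such an entry expands into test cases can be sketched as follows. The generator below is an assumed illustration of the idea, not the repository's create_params function, which plays this role in the actual test suite:

```python
READABLE_CODECS = {
    "zarrita": {
        "zarr": [],
        "zarr-v3": ["blosc", "gzip", "raw", "zlib"],
        "N5": [],
    },
}


def readable_combinations(readable_codecs):
    # Sketch: expand the dictionary into (reader, format, codec) test cases.
    # Empty codec lists (unsupported formats) contribute no cases.
    for reader, formats in readable_codecs.items():
        for fmt, codecs in formats.items():
            for codec in codecs:
                yield (reader, fmt, codec)
```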

Secondly, a Python function capable of reading a file with the library needs to be created, and a corresponding entry added to the dictionary within _get_read_fn that maps implementation names to a corresponding function for reading from that format. This read function should have the following signature:

 read_with_LIBNAME(fpath, ds_name, nested=None)

where fpath is a file path, ds_name is a dataset name within the file and nested is an optional boolean flag indicating whether the data is stored in a nested format. Typically this function just reads in the specified dataset and returns a NumPy array corresponding to it.

Read tests for non-Python implementations

For non-Python implementations, the data reader function may call a binary program or shell script that does the actual reading and/or validation. This program can be called via Python's subprocess module.
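A sketch of such a subprocess wrapper, under the assumption that the external reader takes the file path and dataset name as positional arguments and a --nested flag (the binary name and argument conventions are hypothetical and vary per implementation):

```python
import subprocess


def build_reader_command(binary, fpath, ds_name, nested=None):
    # Hypothetical command layout for an external reader program.
    cmd = [binary, str(fpath), ds_name]
    if nested:
        cmd.append("--nested")
    return cmd


def read_with_binary(binary, fpath, ds_name, nested=None):
    # check=True raises CalledProcessError when the program exits non-zero
    # (i.e. its internal validation failed); returns None on success.
    subprocess.run(build_reader_command(binary, fpath, ds_name, nested),
                   check=True)
    return None
```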

For non-Python implementations it is also possible to have the data reader function that is returned by _get_read_fn perform the array validation internally rather than return a NumPy array. When internal validation is performed, the function should return None to indicate that validation succeeded, and raise an error otherwise.

For concrete examples of non-Python validation functions, see the xtensor-zarr (C++) and jzarr (Java) implementations. xtensor-zarr calls a compiled C++ program that reads the specified data file and writes the contents out to a NumPy .npy file for later validation in Python. In contrast, the jzarr reading program performs the validation internally and returns a non-zero exit code if validation failed.

Test parametrization

Adding the read function entry to _get_read_fn and updating READABLE_CODECS dictionary as described above should be sufficient to get new parameterized test cases automatically generated for the new library via the create_params functions. The test_correct_read function is the core test function that then runs over all sets of parameters generated by create_params. In most cases it should not be necessary to modify create_params or test_correct_read itself.

Summary Report Generation

A summary report can be generated by calling the same test function, but via python rather than pytest. In this case, Pandas summary tables corresponding to success or failure of each write/read library combination are generated and exported as Markdown (report.md) and HTML (report.html) files. The reports are generated using the same create_params parameter sets used by pytest so it should not be necessary to make any changes to the reporting functions themselves as new libraries are added.
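The shape of such a summary table can be sketched with pandas as below. This is an assumed illustration of the idea (pivoting per-combination results into a writer-by-reader grid), not the repository's actual reporting code:

```python
import pandas as pd


def summary_table(results):
    # Sketch: each result is assumed to be a (writer, reader, passed) tuple;
    # pivot into the writer-by-reader grid exported to report.md/report.html.
    df = pd.DataFrame(results, columns=["writer", "reader", "passed"])
    return df.pivot(index="writer", columns="reader", values="passed")
```

A table like this can then be exported with pandas' built-in `to_html()` for the HTML report.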

The specific format of the reports and method of report generation is still considered experimental and is subject to change as the library evolves.