This repository contains scripts to generate datasets in zarr-v2, zarr-v3 and N5 formats via a number of different implementations. Once generated, the test suite will attempt to have each library read all datasets with a supported format.
Both the data generation and testing steps can be run from the root of the repository via the command:
```
make test
```
To generate the data only, use:
```
make data
```
or, to generate only the data from a specific library, use:
```
make zarrita
```
The `environment.yml` file in the base of the repository lists the requirements for `zarr_implementations`. To create a new conda environment named `zarr_impl_dev` with the necessary dependencies installed, use:
```
conda env create --name zarr_impl_dev --file environment.yml
```
and then activate the environment via:
```
conda activate zarr_impl_dev
```
All test data is currently generated from a common image file that is produced by the script `generate_data/generate_reference_image.py`. Each library that generates data currently has either a Python script or a shell script that generates datasets for the supported file formats and codecs. Some implementations also generate versions using both flat and nested file storage.
The data generation script is a Python script in the case of libraries with a Python API. For non-Python projects, a subfolder named after the project is typically created; within this subfolder is a shell script that compiles and/or runs a program to generate the data. See the `Makefile` and the corresponding scripts that it executes for reference. At the time of writing, these include the following concrete examples:
Library Name | data generation script language(s) | base output file name |
---|---|---|
zarr-python | Python | zarr |
z5py | Python | z5py |
zarrita | Python | zarrita |
pyn5 | Python | pyn5 |
zarr.js | bash + node-js | js |
n5-java | bash + Java (Maven) | n5-java |
xtensor-zarr | bash + C++ (CMake) | xtensor_zarr |
Care should be taken to note the output file name used during generation, as the same name will need to be used in the test scripts later.
The name of the directory used to store the data should end with one of the following extensions; the extension is parsed at testing time to determine the contained file format.
File Format | Directory Extension |
---|---|
zarr (v2) | .zr |
zarr (v3) | .zr3 |
N5 | .n5 |
The same data is currently stored as separate arrays within each dataset, with array names matching the codec used. The following codecs are tested for libraries that support them (an example generation script following these conventions is sketched after the table):
Codec | Array (Dataset) Name |
---|---|
None | raw |
gzip | gzip |
zlib | zlib |
BLOSC + lz4 | blosc/lz4 |
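For a library with a Python API, a data-generation script following the naming conventions above might look roughly like the sketch below. It uses zarr-python and numcodecs purely for illustration; the output name `mylib_flat.zr` and the reference-image path are placeholders rather than the repository's actual code.
```python
import numpy as np
import zarr
from numcodecs import Blosc, GZip, Zlib

# Load the shared reference image; the path is a placeholder for the output
# of generate_data/generate_reference_image.py.
image = np.load("reference_image.npy")

# ".zr" marks a zarr (v2) directory; "_flat" marks flat (non-nested) storage.
group = zarr.open_group("mylib_flat.zr", mode="w")

# One array per codec, named after the codec as in the table above.
group.create_dataset("raw", data=image, compressor=None)
group.create_dataset("gzip", data=image, compressor=GZip(level=1))
group.create_dataset("zlib", data=image, compressor=Zlib(level=1))
group.create_dataset("blosc/lz4", data=image, compressor=Blosc(cname="lz4"))
```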
Some libraries support storing arrays in either a flat (single directory) format or a nested storage format. In such cases the base filename (excluding its extension) should end in either `_flat` or `_nested` to indicate this.
For example, `xtensor-zarr` data stored in zarr-v3 format with nested storage would be named `xtensor_zarr_nested.zr3`. Not all implementations have metadata indicating whether the storage is flat or nested, so the testing scripts currently set a `nested` boolean flag depending on whether `_nested` is in the filename.
Currently `zarr-python` also stores data with multiple different underlying storage classes. In this specific case, the name of the storage class is also indicated in the filename. For example, zarr (v2) data is stored in flat format with both a `DirectoryStore` class and an `FSStore` class, corresponding to the filenames `zarr_DirectoryStore_flat.zr` and `zarr_FSStore_flat.zr`, respectively.
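As a rough illustration of this convention (assuming zarr-python 2.x store classes; the repository's actual generation script may differ), writing the same data through two store classes might look like:
```python
import numpy as np
import zarr
from zarr.storage import DirectoryStore, FSStore

# Stand-in for the reference image; illustration only.
image = np.arange(64, dtype="uint8").reshape(8, 8)

# Write the same array through two store classes, encoding the store name
# and layout ("_flat") in the directory name.
for store_cls, fname in [(DirectoryStore, "zarr_DirectoryStore_flat.zr"),
                         (FSStore, "zarr_FSStore_flat.zr")]:
    root = zarr.group(store=store_cls(fname), overwrite=True)
    root.create_dataset("raw", data=image, compressor=None)
```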
Currently all tests reside in a single file, `test/test_read_all.py`. If the data has already been generated, the tests can be run from the root folder of the repository via:
```
pytest test/test_read_all.py -v
```
Alternatively, the make command `make test` will generate the data and then run the tests.
When adding a new library to the test suite, a corresponding entry for the library should be added to the `READABLE_CODECS` dictionary. Here is the entry for zarrita as a concrete example:
```python
"zarrita": {
    "zarr": [],
    "zarr-v3": ["blosc", "gzip", "raw", "zlib"],
    "N5": [],
},
```
The value here is a dictionary whose keys correspond to the three file formats tested by `zarr_implementations` and whose values are the codecs that the implementation should be able to read. For `zarrita`, the lists for zarr (v2) and N5 are empty because these file formats are unsupported.
Secondly, a Python function capable of reading a file with the library needs to be created, and a corresponding entry added to the dictionary within `_get_read_fn` that maps implementation names to the corresponding function for reading that format. This read function should have the following signature:
```python
read_with_LIBNAME(fpath, ds_name, nested=None)
```
where `fpath` is a file path, `ds_name` is a dataset name within the file, and `nested` is an optional boolean flag indicating whether the data is stored in a nested format. Typically this function just reads the specified dataset and returns a NumPy array corresponding to it.
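For a library with a Python API, such a function can be quite small. The following is a minimal sketch that uses zarr-python as the reading library; it ignores the `nested` flag, and the readers actually used in `test/test_read_all.py` may differ in detail.
```python
import zarr  # zarr-python, used here purely for illustration

def read_with_zarr_python(fpath, ds_name, nested=None):
    # Open the store read-only and return the named array as NumPy data.
    # The ``nested`` flag is accepted for signature compatibility only;
    # handling nested storage would require choosing an appropriate store.
    group = zarr.open_group(str(fpath), mode="r")
    return group[ds_name][:]
```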
For non-Python implementations, the data reader function may call a binary program or shell script that does the actual reading and/or validation. This program can be called via Python's `subprocess` module.
For non-Python implementations it is also possible to have the data reader function returned by `_get_read_fn` perform the array validation internally rather than return a NumPy array. When internal validation is performed, the function should return `None` to indicate that validation succeeded, and raise an error otherwise. Both patterns are combined in the sketch below.
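The sketch below illustrates invoking an external program with `subprocess` and returning `None` after internal validation. The executable name `verify_read` and its command-line options are hypothetical placeholders, not part of the repository.
```python
import subprocess

def read_with_external_tool(fpath, ds_name, nested=None):
    # Build the (hypothetical) command line for the external reader.
    cmd = ["./verify_read", str(fpath), ds_name]
    if nested:
        cmd.append("--nested")
    # check=True raises CalledProcessError on a non-zero exit status,
    # which pytest reports as a test failure.
    subprocess.run(cmd, check=True)
    # Returning None signals that the external program already validated
    # the data, so no NumPy comparison is needed.
    return None
```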
For concrete examples of non-Python validation functions, see the `xtensor-zarr` (C++) and `jzarr` (Java) implementations. `xtensor-zarr` calls a compiled C++ program that reads the specified data file and writes the contents out to a NumPy `.npy` file for later validation in Python. In contrast, the `jzarr` reading program does the validation internally and returns a non-zero exit code if validation fails.
Adding the read function entry to `_get_read_fn` and updating the `READABLE_CODECS` dictionary as described above should be sufficient for new parameterized test cases to be generated automatically for the new library via the `create_params` functions. The `test_correct_read` function is the core test function that then runs over all sets of parameters generated by `create_params`. In most cases it should not be necessary to modify `create_params` or `test_correct_read` itself. A schematic sketch of this wiring is shown below.
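The following sketch shows the general shape of the wiring for a hypothetical new library called `mylib`; the exact structure of `_get_read_fn` in `test/test_read_all.py` may differ.
```python
def read_with_mylib(fpath, ds_name, nested=None):
    # Hypothetical reader for a new library called "mylib"; in practice this
    # would call the library's API (or an external program) and return a
    # NumPy array, or None after validating internally.
    raise NotImplementedError("replace with a real reading call")


def _get_read_fn(implementation):
    # Map implementation names to reader functions; the real dictionary in
    # test_read_all.py contains one entry per supported implementation.
    read_fns = {
        "mylib": read_with_mylib,
        # "zarrita": read_with_zarrita, ... (existing entries)
    }
    return read_fns[implementation]
```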
A summary report can be generated by calling the same test function via Python rather than pytest. In this case, Pandas summary tables corresponding to the success or failure of each write/read library combination are generated and exported as Markdown (`report.md`) and HTML (`report.html`) files. The reports are generated using the same `create_params` parameter sets used by pytest, so it should not be necessary to make any changes to the reporting functions themselves as new libraries are added.
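Assuming the test module provides a command-line entry point for this (typically a `__main__` block; check `test/test_read_all.py` for the exact mechanism), the invocation would look something like:
```
python test/test_read_all.py
```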
The specific format of the reports and the method of report generation are still considered experimental and are subject to change as the library evolves.