Review Comments from ESO #340

Open
Jan-Willem opened this issue Nov 26, 2024 · 3 comments

Comments


Jan-Willem commented Nov 26, 2024

https://xradio.readthedocs.io/en/latest/measurement_set_overview.html

  1. The documentation says: "The current MS v4 schema focuses on offline processing capabilities and
    does not encompass all information present in the ASDM."

    • It would be useful to list all the aspects of the ASDM that are not (yet?) covered by the MSv4.
    • If there is work on "future expansion to incorporate additional data", would that be mere
      additions or is there a risk that more fundamental changes to the MSv4 would be needed?
  2. It is clear that the MSv4 is not backwards compatible with MSv2/3. What about the backwards
    compatibility of MSv4 itself?

    • Will all MSv4 be backwards compatible with all other future MSv4?
    • I.e. does a paradigm like 'Once FITS always FITS' also apply to MSv4?
    • I.e. MSv4.X.X would only contain additions, no changes to existing concepts nor structure.
    • If not: this will massively reduce the longevity of the data stored in MSv4. (See also below.)
      If yes: this guarantee should be prominently spelled out, and it should be made clear which
      organizational structure will guarantee the backwards/forwards compatibility of MSv4 itself.
  3. "The sub-package currently allows direct opening of data from zarr and will support WSU ASDM
    (Wide Band Sensitivity Upgrade) and NetCDF in the future."

    It would be very beneficial if the XRADIO package supported reading current ALMA ASDMs for
    bulk reprocessing. Comparing read times with processing times suggests that, for all practical
    purposes, the time needed to read current ASDMs would be entirely negligible compared to the
    processing time.

  4. In general, the documentation mixes the data model, the serialization (zarr) of the data, and
    the implementation of the data access quite a bit.

    The data model can and should exist independently of which serialization (zarr) is used.
    The serialization can and should exist independently of which data-access mechanism/language/tool
    is used.

    I would suggest making it extremely clear throughout the documentation that

    • the data model with semantics,
    • the serialization, and
    • the software to access the data

    are entirely separate.

    It should also be spelled out clearly what is meant by 'data-format'. Often people use that
    expression to indicate a combination of

    • data-model
    • the semantics (defining what each concept precisely means and what units are possible for
      each)
    • the serialization to e.g. disk

    I must say that the semantics specified in

    https://xradio.readthedocs.io/en/latest/measurement_set/schema_and_api/measurement_set_schema.html

    are really very well done. Congrats!

    ESO's FITS keywords are here in case someone wants to have a look for potentially missing
    concepts:
    https://www.eso.org/sci/observing/phase3/p3sdpstd.pdf
    https://hst-docs.stsci.edu/acsdhb/chapter-2-acs-data-structure/2-4-headers-and-keywords

  5. In particular, following 4), the data model and the serialization must be entirely
    programming-language independent!

    It seems that by using zarr the serialization is programming-language independent. And it
    looks really good and future-proof, too.

  6. Provenance?
    One issue that was problematic for ALMA was that when manipulations were done to an MS, the
    information about which original row was which was lost due to constant renumbering. In other
    words, the provenance information was missing.

    I suggest that the MSv4 be augmented so that the data model itself allows tracking, in a
    machine-readable way, of what happened to each spectral window (see the sketch after this
    list).

    This might already be alleviated in MSv4 compared to MSv2, as the SPWs seem to be identified
    by a name rather than by a row number.

  7. The current MSv4 is centered around processing. That's also why several SPWs are called a
    processing set. Is that good enough also for the future? Would a different name and concept be
    more useful than 'processing set'? Measurement sets typically would exist independently of the
    action of processing. Maybe 'observation set' instead of 'processing set'?

  8. Depending on how self-contained the MSv4 should be, there would be other metadata required, like
    Project code, title of observation, PI name, ...

    For the ObservationInfoDict there are also items that are missing for ALMA:
    Member_OUS_ID
    Group_OUS_ID

    and certainly many more.
    These could be taken from the IVOA data-model standards, the FITS keyword dictionaries, the ASDM
    etc.

    At least for ALMA processing, it would be ideal/crucial if all the information that was needed to
    create the FITS files was also present in the MS directly.
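
Regarding point 6 above, a minimal sketch of what a machine-readable provenance record could look
like if stored in the dataset attributes. All key names here ("history", "task", "inputs") are
hypothetical illustrations, not part of the MSv4 schema:

    # Hypothetical provenance entry appended to a dataset's attribute dictionary.
    # None of these keys are defined by the MSv4 schema; this only illustrates the
    # kind of machine-readable record the data model could allow.
    attrs = {"history": []}
    attrs["history"].append({
        "task": "flagdata",                   # name of the operation applied
        "timestamp": "2024-11-26T00:00:00Z",  # when it ran
        "inputs": {"spw_name": "spw_23"},     # which spectral window was touched
    })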

https://xradio.readthedocs.io/en/latest/measurement_set/tutorials/ps_vis.html

  1. It is clear that a new implementation of data-access software should make use of existing tools
    rather than writing everything by itself. However, the danger is that there are many dependencies
    which render the product unstable.

    The list of dependencies of the XRADIO installation is 155 packages long!
    Each of these packages has a version requirement.

    Compare that to CFITSIO, whose only dependency is a working C compiler.

    This seems to be an enormous risk. XRADIO data access will need to exist for decades. While
    the existence of a C compiler can be guaranteed for 40 years, none of these packages can be;
    probably not even Python itself can.

    We have seen that it is very hard to run old versions of CASA on a modern OS even though the
    entire Python installation and all packages were shipped together with the tar-ball. We also
    know from machine-learning applications, which typically use a similar number of packages, that
    it is essentially impossible to install software that was put onto GitHub even just a couple of
    years earlier, because packages have changed, versions have changed, functions are deprecated,
    and functionality has moved.

    To me this is a BLOCKER item: the access function for a serialization of a next-generation
    data model must rely on a very small number of libraries, all of which are under the control
    of the entity providing the serialization and which can reasonably well be guaranteed to still
    exist, e.g., 40 years from now. (Not that relevant for ALMA itself, as ALMA will most likely
    continue to store in ASDM and FITS, but highly relevant for all other observatories who might
    store data in MSv4.)

    The conversion from MSv2 to MSv4 uses casatasks. Are they guaranteed to be available for 40
    years?

    Maybe several packages are needed: one that uses only numpy as a dependency and allows reading
    the MSv4. That one could also move natively into astropy, like fitsio.

    And then other packages, adding progressively more functionality, that users can but do not
    have to install.

    Overall, the number of packages that XRADIO uses must be reduced to the absolutely bare
    minimum.

  2. Related to the BLOCKER in 9) is that I got the following error message when trying to install
    the XRADIO package:

    ERROR: pip's dependency resolver does not currently take into account all the packages that are
    installed. This behaviour is the source of the following dependency conflicts.
    tensorflow 2.13.0 requires numpy<=1.24.3,>=1.22, but you have numpy 2.0.2 which is incompatible.
    tensorflow 2.13.0 requires typing-extensions<4.6.0,>=3.6.6, but you have typing-extensions 4.12.2
    which is incompatible.
    boto3 1.24.85 requires botocore<1.28.0,>=1.27.85, but you have botocore 1.35.36 which is
    incompatible.

    Sure, people can use venv, but that already proves the point: the software to just read a
    serialization of an MSv4 cannot already be so complicated that it requires an entire virtual
    environment!

    Just following the documentation breaks:
    from xradio.measurement_set import estimate_conversion_memory_and_cores
    returns the error
    ImportError: cannot import name 'estimate_conversion_memory_and_cores' from
    'xradio.measurement_set'

  3. Related to the BLOCKER in 9) is that, with these dependencies for the data access,
    the data format cannot be used for long-term storage of data, i.e. archiving.

    Here is the definition of sustainability that the US Library of Congress uses:

    https://www.loc.gov/preservation/digital/formats/sustain/sustain.shtml

    Here is the entry for FITS just as an example:

    https://www.loc.gov/preservation/digital/formats/fdd/fdd000317.shtml

    zarr is suitable for long-term storage. It defines the byte layout on the disk. The main
    reading routines (C, C++, Java) should be so low-level that they would still be usable in 40
    years. Whatever language/AI is fancy 40 years from now can then still at least wrap that
    reading routine.

    And XRADIO should make use of that minimal function itself, so that it is guaranteed that this
    core reading function of MSv4 is always up to date, maintained, and bug-free.

  4. Related to the BLOCKER in 9) is a non-exhaustive list of properties that I think are needed
    to call a data format suitable for archiving.

    • Format needs to be open, widely adopted, transparent and documented, patent-free, (self)-
      documented, standardized, without external dependencies
    • Needs to be readable by as many of the tools astronomers use as possible
    • Needs a low entry barrier for usage (simple is better than complex)
    • Needs to be managed by deliberate process under configuration control (e.g. a
      standardization body/group)
    • Must be efficient to download, read and manipulate
    • Has to remain stable (forwards compatible) in time (once FITS always FITS)
    • Must be described by the byte-layout on the disk, not by an API (Must be possible to write a
      reading routine in a few decades from now in a then modern programming language)
    • Must be programming-language agnostic and have readers in many programming languages
      available
    • ...

    ALMA is currently planning to store the final products in FITS also for the WSU era. Therefore
    the XRADIO dependencies and risks are not an issue for ALMA itself.

    For processing it is good enough if the data-format is stably accessible for the time of
    processing a single dataset.

    But for other facilities, in case they would plan to store MSv4 for longer than a month or two,
    this is a serious issue in my opinion that needs sufficient attention.

  5. Related to the BLOCKER in 9), it would probably be good to explicitly specify non-functional
    requirements for the MSv4 as well as for XRADIO somewhere, i.e. the vision and mission goal.

    Even if only a single version of data access is provided for the MSv4 with the zarr
    serialization, i.e. XRADIO using Python as a programming language, the requirements would need
    to be that

    • the reading software can be installed on operating systems at least going 10 years back in time
    • the reading software must work without containers and be simple to use, not more complicated
      than cfitsio (or e.g. pyfits).
    • there must be at least two reference implementations for the reading software (that's
      for example a requirement that the IVOA gave themselves for each new standard they approve)
    • ...
  6. Do not use Conda
    This page
    https://github.com/casangi/xradio
    starts with using conda.

    Do not use conda! Anaconda is a commercial entity. Organizations with more than 200 employees
    need to get a license:
    "Use of Anaconda’s Offerings at an organization of more than 200 employees requires a Business or
    Enterprise license. For more information, see our full Terms of Service, or read Frequently Asked
    Questions about our Terms of Service."

    The entire workflow and usage must rely only on entirely open-source and free software (e.g.
    GPL, LGPL, ...)!
    Conda was free previously; all software used must be carefully checked so that even 10 or 20
    years from now, there is no dependency that could turn into a vendor lock-in.

  7. While the serialization (zarr) is really programming-language independent, I had overlooked
    that the definition of the data model, and thus the origin from which code generation can
    happen, is Python:

    https://xradio.readthedocs.io/en/latest/_modules/xradio/measurement_set/schema.html

    Python is in general not really suited for longevity. There is no standard as in 'ANSI C', for
    example. The schema definition imports 'annotations' from '__future__', it uses 'typing', it
    uses 'xarray_dataset_schema', definitions from 'numpy', etc. All of those are subject to
    substantial change over time.

    This seems to be a risk.

    The main source of the schema should in my opinion be written in a standardized language (UML?,
    XML?) that is not subject to change for decades and that can be used for code generation in any
    other language where needed. In the same way zarr is language-agnostic, the schema needs to be,
    too.

    In other words: if there is a change to Python 6 and a change to numpy, xarray, ..., none of
    those should require changing the schema. And as well-supported as xarray is, it is not
    guaranteed to live for, e.g., 40 years either.

    A suggestion would be to create such a programming-language-independent schema. And then, to
    prove that the schema and serialization really are programming-language independent, one way
    could be to implement code generation for reading and writing directly from the schema (as
    exists for the ASDM) in several programming languages, e.g. Python, C and Java, and to make
    that a standard regression test included in the xradio test suite (see the sketch after this
    list).
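
As a toy illustration of point 7, the schema could live in a neutral notation (JSON is used below
purely as an example) and drive generated or hand-written checkers in each target language. The
schema fragment and every name in it are hypothetical stand-ins, not the actual MSv4 schema:

    # Toy sketch: a language-neutral schema fragment consumed by a Python checker.
    # Equivalent checkers could be generated for C or Java from the same source.
    import json

    SCHEMA = json.loads("""
    {
      "data_vars": {
        "VISIBILITY": {"dims": ["time", "baseline_id", "frequency", "polarization"],
                       "dtype": "complex64"}
      }
    }
    """)

    def conforms(var_dims: dict) -> bool:
        """Check that each schema variable is present with the expected dimensions."""
        return all(var_dims.get(name) == spec["dims"]
                   for name, spec in SCHEMA["data_vars"].items())

    # A dataset whose VISIBILITY variable has the expected dimension order passes.
    print(conforms({"VISIBILITY": ["time", "baseline_id", "frequency", "polarization"]}))  # True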

On Data-formats in general

One of the FITS fathers: "Creating a new data-format is easy. Everybody can do it. The challenge
is to make it the world-wide used standard."

This is certainly as true now as it was then.

I think that for MSv4 you are taking the right steps: involving the largest community, getting
buy-in, presenting at ADASS relentlessly. There is no guarantee that this works (see the ASDF
format, where everything was done right but there is still no real uptake), but I keep my fingers
crossed.
Jan-Willem (Member, Author) commented:

  1. We can add a list of tables that have not been included. The future expansion currently planned would consist of backwards-compatible changes (similar to the changes made to MSv2 and the ASDM; for example, VLBI sub-tables were added to MSv2 last year).

  2. MSv4.x.x will be backward compatible (no changes to existing concepts or structure). See https://xradio.readthedocs.io/en/latest/overview.html#Schema-Versioning.

  3. Reading from the current ALMA ASDM should be possible using PyASDM (which is under development); however, efficiency cannot be guaranteed.

  4. We will add a section defining the data model with semantics, serialization, software to access, etc. We will then check the rest of the documentation for consistency in our usage.

  5. Yes, https://zarr.dev/implementations/.

  6. We have moved away from using indices to string labels (antenna_names, spw_names, field_names, etc.). The only remaining numbering is scan_number and baseline_id. We are considering changing scan_numbers to scan_names. The baseline_id represents a type of multi-index where baseline_antenna1_name and baseline_antenna2_name should be used. Consequently, this will allow comparing a modified MSv4 with the original. We have also started discussing adding a log/history to track changes with the requirement that it is machine-readable.
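
To make the label-based comparison in point 6 concrete, a minimal sketch using xarray. The store
path and antenna name are hypothetical; only baseline_antenna1_name comes from the discussion
above:

    # Sketch: selecting rows by string label instead of row number, so a modified
    # MSv4 can be matched against the original even after renumbering/reordering.
    import xarray as xr

    ms = xr.open_zarr("msv4_example.zarr")  # hypothetical store path
    da41 = ms.where(ms.baseline_antenna1_name == "DA41", drop=True)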

Jan-Willem (Member, Author) commented:

  1. The processing set is a construct used specifically for processing data together (similar to a Multi-Measurement Set). It can consist of any collection of MSv4s (including those from different telescopes and observation times). Using observation set naming was proposed earlier but was rejected because it's already used in another context.

  2. The MSv4 is intended to be completely self-contained. We will add Project code, title of observation, PI name, Member_OUS_ID, and Group_OUS_ID as optional keys (note that Member_OUS_ID and Group_OUS_ID are not specified in the ASDM standard). Since it will be very difficult to account for all differences in observatory labeling of their data, we could allow for observatory-specified items in the observation_info.

  3. The 155 dependencies include all the interaction and visualization tools that are not necessary for accessing the data (Jupyter Labs, GraphViz, Matplotlib, etc.). To read the data, only the Zarr package and its dependencies are required:

numpy>=1.25
numcodecs[crc32c]>=0.14
typing_extensions>=4.9
donfig>=0.8
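
For illustration, a minimal sketch of reading such a store with only the zarr package (the store
path is a hypothetical placeholder, not the documented MSv4 layout):

    # Minimal read using only zarr; "msv4_example.zarr" is a made-up path.
    import zarr

    root = zarr.open_group("msv4_example.zarr", mode="r")  # read-only open
    print(dict(root.attrs))                                # group-level metadata
    for name, arr in root.arrays():                        # data variables as zarr arrays
        print(name, arr.shape, arr.dtype)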

Jan-Willem (Member, Author) commented:

  1. See response to 340.8.

  2. See response to 340.8.

  3. Non-exhaustive list of archive properties:

    • Format needs to be open, widely adopted, transparent and documented, patent-free, (self)-
      documented, standardized, without external dependencies:
      Zarr is widely used: https://zarr.dev/adopters/
    • Needs to be readable by as many of the tools astronomers use as possible:
      https://zarr.dev/implementations/
    • Needs a low entry barrier for usage (simple is better than complex):
      pip install zarr is very simple.
    • Needs to be managed by deliberate process under configuration control (e.g. a
      standardization body/group):
      https://zarr.dev/community/
    • Must be efficient to download, read and manipulate:
      From the initial benchmarking we have done, it is efficient.
    • Has to remain stable (forwards compatible) in time (once FITS always FITS):
      Both Zarr 3.x and MSv4.x.x are backwards compatible.
    • Must be described by the byte-layout on the disk, not by an API (Must be possible to write a
      reading routine in a few decades from now in a then modern programming language):
      https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html
    • Must be programming-language agnostic and have readers in many programming languages
      available:
      https://zarr.dev/implementations/
  4. Ask for input from the review committee on the suggested non-functional requirements.

  5. The suggested conda does not come from Anaconda but from miniforge:

"It is recommended to use the conda environment manager from miniforge to create a clean, self-contained runtime where XRADIO and all its dependencies can be installed"

miniforge is a community-driven open-source project and makes use of the conda-forge channel, which is not controlled by Anaconda. The miniforge license is BSD-3: https://github.com/conda-forge/miniforge/blob/main/LICENSE. XRADIO can also be installed using a standard Python virtual environment on Linux, but for Mac the conda-forge channel is needed (this is a restriction from python-casacore).
