Review Comments from ESO #340
"It is recommended to use the conda environment manager from miniforge to create a clean, self-contained runtime where XRADIO and all its dependencies can be installed" miniforge is a community-driven open source project and makes use of the conda-forge channel which is not controlled by Anaconda. Miniforge license is BSD-3: https://github.com/conda-forge/miniforge/blob/main/LICENSE. XRADIO can also be installed using a standard Python virtual environment on Linux but for Mac, the conda forge channel is needed (this is a restriction from python-casacore). |
https://xradio.readthedocs.io/en/latest/measurement_set_overview.html
The documentation says: "The current MS v4 schema focuses on offline processing capabilities and does not encompass all information present in the ASDM."
Can the missing information later be incorporated through backwards-compatible additions, or is there a risk that more fundamental changes to the MSv4 would be needed?
It is clear that the MSv4 is not backwards compatible with MSv2/3. What about the backwards compatibility of MSv4 itself? Is it guaranteed that future revisions of the MSv4 schema will remain backwards compatible?
If yes: this guarantee should be prominently spelled out, and it should be made clear which organizational structure will guarantee the back/forwards compatibility of MSv4 itself.
"The sub-package currently allows direct opening of data from zarr and will support WSU ASDM
(Wide Band Sensitivity Upgrade) and NetCDF in the future."
It would be very beneficial if the XRADIO package also supported reading current ALMA ASDMs for bulk reprocessing. Comparing read times with processing times indicates that, for all practical purposes, the time to read current ASDMs would be entirely negligible compared to the processing time.
In general, the documentation seems to mix the data model, the serialization (zarr) of the data, and the implementation of the data access quite a bit.
The data model can and should exist independently of which serialization (zarr) is used.
The serialization can and should exist independently of which data-access mechanism/language/tool
is used.
I would suggest making it extremely clear throughout the documentation that these three (the data model, the serialization, and the data access) are entirely separate.
It should also be spelled out clearly what is meant by 'data format'. Often people use that expression to indicate a combination of two or all three of these (with different implications for each).
I must say that the semantics specified in
https://xradio.readthedocs.io/en/latest/measurement_set/schema_and_api/measurement_set_schema.html
are really very well done. Congrats!
ESO's FITS keywords are here in case someone wants to look for potentially missing concepts:
https://www.eso.org/sci/observing/phase3/p3sdpstd.pdf
HST's ACS keywords, for comparison:
https://hst-docs.stsci.edu/acsdhb/chapter-2-acs-data-structure/2-4-headers-and-keywords
In particular, following 4), the data model and the serialization must be entirely programming
language independent!
It seems to be the case that by using zarr the serialization is programming language
independent. And it looks really good and future-proof, too.
Provenance?
One issue that was problematic for ALMA was that when an MS was manipulated, the mapping from rows back to the original rows was lost due to constant renumbering. In other words, the provenance information was missing.
I suggest that the MSv4 be augmented so that the data model itself allows tracking, in a machine-readable way, of what happened to each spectral window.
This might already be alleviated in MSv4 compared to MSv2, as the SPWs seem to be identified by a name rather than by a row number.
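To make the suggestion concrete, here is a minimal sketch of what machine-readable per-SPW provenance could look like as structured metadata on an xarray dataset. The attribute names (`provenance`, `derived_from`, `history`) and all values are hypothetical illustrations, not part of the MSv4 schema.

```python
# Hypothetical sketch: recording per-SPW provenance as structured metadata.
# Attribute names and values are illustrative, not part of the MSv4 schema.
import numpy as np
import xarray as xr

ms = xr.Dataset(
    {"VISIBILITY": (("time", "baseline", "frequency"),
                    np.zeros((2, 3, 4), dtype=complex))}
)
ms.attrs["provenance"] = {
    "spectral_window_name": "ALMA_B6_SPW_17",  # a stable name, not a row number
    "derived_from": ["ALMA_B6_SPW_17@original_dataset"],
    "history": [
        {"operation": "channel_average", "parameters": {"width": 4}},
        {"operation": "split", "parameters": {"field": "J1924-2914"}},
    ],
}
```

With something like this, each manipulation appends to `history` instead of silently renumbering, so the chain back to the original data is never lost.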
The current MSv4 is centered around processing. That's also why several SPWs are called a
processing set. Is that good enough also for the future? Would a different name and concept be
more useful than 'processing set'? Measurement sets typically would exist independently of the
action of processing. Maybe 'observation set' instead of 'processing set'?
Depending on how self-contained the MSv4 should be, there would be other metadata required, like
Project code, title of observation, PI name, ...
For the ObservationInfoDict there are also items that are missing for ALMA:
Member_OUS_ID
Group_OUS_ID
and certainly many more.
These could be taken from the IVOA data-model standards, the FITS keyword dictionaries, the ASDM
etc.
At least for ALMA processing, it would be ideal/crucial if all the information that was needed to
create the FITS files was also present in the MS directly.
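As a concrete illustration of the two previous points, a hedged sketch of what an extended observation-info structure could carry. Only Member_OUS_ID and Group_OUS_ID come from this review; every other key and value is a hypothetical example in the style of FITS/IVOA metadata, not an agreed schema.

```python
# Hypothetical sketch of an extended observation-info structure. Only
# Member_OUS_ID / Group_OUS_ID are named in this review; the remaining keys
# are illustrative FITS/IVOA-style examples, not an agreed MSv4 schema.
observation_info = {
    "project": "2024.1.00123.S",        # project code
    "title": "Example Band 6 survey",   # title of observation
    "pi_name": "A. Astronomer",
    "Member_OUS_ID": "uid://A001/X1/X2",
    "Group_OUS_ID": "uid://A001/X1/X1",
    # ... plus whatever else is needed to regenerate the archive FITS files
}
```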
https://xradio.readthedocs.io/en/latest/measurement_set/tutorials/ps_vis.html
It is clear that a new implementation of data-access software should make use of existing tools rather than writing everything itself. However, the danger is that many dependencies render the product unstable.
The list of dependencies of the XRADIO installation is 155 packages long!
Each of these packages has a version requirement.
Compare that to CFITSIO, whose only dependency is that a working C compiler exists.
This seems to be an enormous risk. XRADIO data access will need to exist for decades. While the existence of a C compiler can be guaranteed for 40 years, none of these packages can be, probably not even Python itself.
We have seen that it is very hard to run old versions of CASA on a modern OS, even though the entire Python installation and all packages were shipped together in the tar-ball. We also know from machine-learning applications, which typically use a similar number of packages, that it is essentially impossible to install software that was put onto GitHub even just a couple of years earlier, because packages have changed, versions have changed, functions have been deprecated, and functionality has moved.
To me this is a BLOCKER item: the access function of a serialization of a next-generation data model must rely on a very small number of libraries, all of which are under the control of the entity providing the serialization and can reasonably be guaranteed to still exist in, e.g., 40 years from now. (Not that relevant for ALMA itself, as ALMA will most likely continue to store in ASDM and FITS, but highly relevant for all other observatories that might store data in MSv4.)
The conversion from MSv2 to MSv4 uses casatasks. Are they guaranteed to be available for 40 years?
Maybe several packages are needed: one that depends only on numpy and allows reading the MSv4 (a sketch of how small such a reader could be follows below). That one could also move natively into astropy, like fitsio. And then further packages, adding more and more functionality, that users can but do not have to install.
Overall, the number of packages that XRADIO uses must be reduced to the absolute bare minimum.
Related to the BLOCKER in 8) is that I got the following error message when trying to install the XRADIO package:
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow 2.13.0 requires numpy<=1.24.3,>=1.22, but you have numpy 2.0.2 which is incompatible.
tensorflow 2.13.0 requires typing-extensions<4.6.0,>=3.6.6, but you have typing-extensions 4.12.2 which is incompatible.
boto3 1.24.85 requires botocore<1.28.0,>=1.27.85, but you have botocore 1.35.36 which is incompatible.
Sure, people can use venv, but that already proves the point: software whose only job is to read a serialization of an MSv4 cannot be so complicated that it requires an entire virtual environment!
Just following the documentation breaks:
from xradio.measurement_set import estimate_conversion_memory_and_cores
returns the error
ImportError: cannot import name 'estimate_conversion_memory_and_cores' from 'xradio.measurement_set'
Related to the BLOCKER in 9) is that, with these dependencies for the data access, the data format cannot be used for long-term storage of data, i.e. archiving.
Here is the definition of sustainability that the US Library of Congress uses:
https://www.loc.gov/preservation/digital/formats/sustain/sustain.shtml
Here is the entry for FITS just as an example:
https://www.loc.gov/preservation/digital/formats/fdd/fdd000317.shtml
zarr is suitable for long-term storage: it defines the byte layout on disk. The main reading routines (C, C++, Java) should be so low-level that they would still be usable in 40 years. Whatever language/AI is fashionable 40 years from now can then still at least wrap that reading routine.
And XRADIO should use that minimal function itself, so that this core reading function for MSv4 is guaranteed to always be up to date, maintained, and bug-free.
Related to the BLOCKER in 9) is a non-exhaustive list of properties that I think are needed before a data format can be called suitable for archiving:
- documented, standardized, without external dependencies
- maintained by a recognized standardization body/group
- simple enough that one could write a reading routine a few decades from now in a then-modern programming language
- openly available
ALMA is currently planning to store the final products in FITS also for the WSU era. Therefore the XRADIO dependencies and risks are not an issue there.
For processing, it is good enough if the data format is stably accessible for the time it takes to process a single dataset.
But for other facilities, in case they plan to store MSv4 for longer than a month or two, this is in my opinion a serious issue that needs sufficient attention.
Related to the BLOCKER in 9), it would probably be good to explicitly specify non-functional requirements for the MSv4 as well as for XRADIO somewhere, i.e. the vision and mission goal.
Even if only a single data-access implementation is provided for the MSv4 with the zarr serialization, i.e. XRADIO using Python as a programming language, the requirements would need to state, for example, that the implementation is no harder to install and keep running than cfitsio (or e.g. pyfits), and that independent implementations remain possible (for example, requiring independent implementations is a requirement that the IVOA gave themselves for each new standard they approve).
Do not use Conda
This page
https://github.com/casangi/xradio
starts with using conda.
Do not use conda! Anaconda is a commercial entity. Organizations with more than 200 employees
need to get a license:
"Use of Anaconda’s Offerings at an organization of more than 200 employees requires a Business or
Enterprise license. For more information, see our full Terms of Service, or read Frequently Asked
Questions about our Terms of Service."
The entire workflow and usage must rely only on entirely open-source and free software (e.g. GPL, LGPL, ...)!
Conda, too, was free previously; all software used must be carefully checked so that even 10 or 20 years from now there is no dependency that could turn into vendor lock-in.
While the serialization (zarr) is really programming-language independent, I had overlooked that the definition of the data model, and thus the origin from which code generation can happen, is Python:
https://xradio.readthedocs.io/en/latest/_modules/xradio/measurement_set/schema.html
Python is in general not really suited for longevity. There is no standard as in 'ANSI C', for example. The schema definition imports 'annotations' from '__future__', it uses 'typing', it uses 'xarray_dataset_schema', definitions from 'numpy', etc. All of those are subject to substantial change over time.
This seems to be a risk.
The main source of the schema should in my opinion be written in a standardized language (UML? XML?) that is not subject to change for decades and that can be used for code generation in any other language where needed. In the same way that zarr is language-agnostic, the schema needs to be, too.
In other words: if there is a change to Python 6 or a change to numpy, xarray, ..., none of those should require changing the schema. And as well-supported as xarray is, it is not guaranteed to live for, e.g., 40 years either.
A suggestion would be to make such a programming-language-independent schema. And then, to prove that the schema and serialization really are programming-language independent, one way could be to implement code generation for reading and writing directly from the schema (as exists for the ASDM) in several programming languages, e.g. Python, C and Java, and to make that a standard regression test included in the xradio test suite. A sketch of the idea follows below.
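As an illustration of this suggestion, a hedged sketch: a schema fragment kept in a language-neutral form (JSON here; UML or XML would serve equally) from which per-language accessor code can be generated. The variable and dimension names are invented for the example and are not the MSv4 schema.

```python
# Hypothetical sketch: a language-neutral schema fragment (JSON) plus a toy
# Python code generator. The schema content is invented for illustration;
# the same JSON could just as well feed C or Java generators.
import json

SCHEMA_JSON = """
{
  "dataset": "visibility_xds",
  "variables": [
    {"name": "VISIBILITY", "dtype": "complex64",
     "dims": ["time", "baseline_id", "frequency", "polarization"]},
    {"name": "FLAG", "dtype": "bool",
     "dims": ["time", "baseline_id", "frequency", "polarization"]}
  ]
}
"""

def generate_python_reader(schema: dict) -> str:
    """Emit a (toy) Python accessor class from the neutral schema."""
    lines = [f"class {schema['dataset'].title().replace('_', '')}:"]
    for var in schema["variables"]:
        dims = ", ".join(repr(d) for d in var["dims"])
        lines.append(f"    # {var['dtype']} array with dims ({dims})")
        lines.append(f"    def read_{var['name'].lower()}(self): ...")
    return "\n".join(lines)

print(generate_python_reader(json.loads(SCHEMA_JSON)))
```

Making such generators, in several languages, part of the regression suite would continuously demonstrate that the schema has not silently become Python-specific.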
On Data-formats in general