Review Comments from ESO #340

Open
Jan-Willem opened this issue Nov 26, 2024 · 3 comments

Comments


Jan-Willem commented Nov 26, 2024

https://xradio.readthedocs.io/en/latest/measurement_set_overview.html

  1. The documentation says: "The current MS v4 schema focuses on offline processing capabilities and
    does not encompass all information present in the ASDM."

    • It would be useful to list all the aspects of the ASDM that are not (yet?) covered by the MSv4.
    • If there is work on "future expansion to incorporate additional data", would that be mere
      additions or is there a risk that more fundamental changes to the MSv4 would be needed?
  2. It is clear that the MSv4 is not backwards compatible with MSv2/3. What about the backwards
    compatibility of MSv4 itself?

    • Will all MSv4 be backwards compatible with all other future MSv4?
    • I.e. does a paradigm like 'Once FITS always FITS' also apply to MSv4?
    • I.e. MSv4.X.X would only contain additions, no changes to existing concepts nor structure.
    • If not: this will massively reduce the longevity of the data stored in MSv4. (See also below.)
      If yes: this guarantee should be prominently spelled out, and it should be made clear which
      organizational structure will guarantee the backwards/forwards compatibility of MSv4 itself.
  3. "The sub-package currently allows direct opening of data from zarr and will support WSU ASDM
    (Wide Band Sensitivity Upgrade) and NetCDF in the future."

    It would be very beneficial if the XRADIO package supported reading current ALMA ASDMs for
    bulk reprocessing. Comparing read times with processing times suggests that, for all practical
    purposes, the time needed to read current ASDMs would be entirely negligible compared to the
    processing time.

  4. In general, the documentation mixes the data model, the serialization (zarr) of the data, and
    the implementation of the data access quite a bit.

    The data model can and should exist independently of which serialization (zarr) is used.
    The serialization can and should exist independently of which data-access mechanism/language/tool
    is used.

    I would suggest making it extremely clear throughout the documentation that

    • the data model with semantics,
    • the serialization, and
    • the software to access the data

    are entirely separate.

    It should also be spelled out clearly what is meant by 'data-format'. Often people use that
    expression to indicate a combination of

    • data-model
    • the semantics (defining what each concept precisely means and what units are possible for
      each)
    • the serialization to e.g. disk

    I must say that the semantics specified in

    https://xradio.readthedocs.io/en/latest/measurement_set/schema_and_api/measurement_set_schema.html

    are really very well done. Congrats!

    ESO's FITS keywords are here in case someone wants to have a look for potentially missing
    concepts:
    https://www.eso.org/sci/observing/phase3/p3sdpstd.pdf
    https://hst-docs.stsci.edu/acsdhb/chapter-2-acs-data-structure/2-4-headers-and-keywords

  5. In particular, following 4), the data model and the serialization must be entirely
    programming-language independent!

    It seems that by using zarr the serialization is programming-language independent. And it
    looks really good and future-proof, too.

  6. Provenance?
    One issue that was problematic for ALMA was that when manipulations were done to an MS, the
    information about which original row was which was lost due to constant renumbering. In other
    words, the provenance information was missing.

    I suggest that the MSv4 be augmented so that the data model itself allows tracking, in a
    machine-readable way, of what happened to each spectral window (see the sketch after this
    list).

    This might already be alleviated in MSv4 compared to MSv2, as the SPWs seem to be identified
    by a name rather than by a row number.

  7. The current MSv4 is centered around processing. That's also why several SPWs are called a
    processing set. Is that good enough also for the future? Would a different name and concept be
    more useful than 'processing set'? Measurement sets typically would exist independently of the
    action of processing. Maybe 'observation set' instead of 'processing set'?

  8. Depending on how self-contained the MSv4 should be, there would be other metadata required, like
    Project code, title of observation, PI name, ...

    For the ObservationInfoDict there are also items that are missing for ALMA:
    Member_OUS_ID
    Group_OUS_ID

    and certainly many more.
    These could be taken from the IVOA data-model standards, the FITS keyword dictionaries, the ASDM
    etc.

    At least for ALMA processing, it would be ideal/crucial if all the information that was needed to
    create the FITS files was also present in the MS directly.
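
Regarding point 6 above, a minimal sketch of what a machine-readable provenance record could look
like if stored in the dataset attributes. All key names here ("history", "task", "inputs") are
hypothetical illustrations, not part of the MSv4 schema:

    # Hypothetical provenance entry appended to a dataset's attribute dictionary.
    # None of these keys are defined by the MSv4 schema; this only illustrates the
    # kind of machine-readable record the data model could allow.
    attrs = {"history": []}
    attrs["history"].append({
        "task": "flagdata",                   # name of the operation applied
        "timestamp": "2024-11-26T00:00:00Z",  # when it ran
        "inputs": {"spw_name": "spw_23"},     # which spectral window was touched
    })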

https://xradio.readthedocs.io/en/latest/measurement_set/tutorials/ps_vis.html

  1. It is clear that a new implementation of data-access software should make use of existing tools
    rather than writing everything by itself. However, the danger is that there are many dependencies
    which render the product unstable.

    The list of dependencies of the XRADIO installation is 155 packages long!
    Each of these packages has a version requirement.

    Compare that to CFITSIO, whose only dependency is a working C compiler.

    This seems to be an enormous risk. XRADIO data access will need to exist for decades. While
    the existence of a C compiler can be guaranteed for 40 years, none of these packages can be;
    probably not even Python itself can.

    We have seen that it is very hard to run old versions of CASA on a modern OS even though the
    entire Python installation and all packages were shipped together with the tar-ball. We also
    know from machine-learning applications, which typically use a similar number of packages, that
    it is essentially impossible to install software that was put onto GitHub even just a couple of
    years earlier, because packages have changed, versions have changed, functions are deprecated,
    and functionality has moved.

    To me this is a BLOCKER item: the access function for a serialization of a next-generation
    data model must rely on a very small number of libraries, all of which are under the control
    of the entity providing the serialization and which can reasonably well be guaranteed to still
    exist, e.g., 40 years from now. (Not that relevant for ALMA itself, as ALMA will most likely
    continue to store in ASDM and FITS, but highly relevant for all other observatories who might
    store data in MSv4.)

    The conversion from MSv2 to MSv4 uses casatasks. Are they guaranteed to be available for 40
    years?

    Maybe several packages are needed: one that uses only numpy as a dependency and allows reading
    the MSv4. That one could also move natively into astropy, like fitsio.

    And then other packages, adding progressively more functionality, that users can but do not
    have to install.

    Overall, the number of packages that XRADIO uses must be reduced to the absolutely bare
    minimum.

  2. Related to the BLOCKER in 9) is that I got the following error message when trying to install
    the XRADIO package:

    ERROR: pip's dependency resolver does not currently take into account all the packages that are
    installed. This behaviour is the source of the following dependency conflicts.
    tensorflow 2.13.0 requires numpy<=1.24.3,>=1.22, but you have numpy 2.0.2 which is incompatible.
    tensorflow 2.13.0 requires typing-extensions<4.6.0,>=3.6.6, but you have typing-extensions 4.12.2
    which is incompatible.
    boto3 1.24.85 requires botocore<1.28.0,>=1.27.85, but you have botocore 1.35.36 which is
    incompatible.

    Sure, people can use venv, but that already proves the point: the software to just read a
    serialization of an MSv4 cannot already be so complicated that it requires an entire virtual
    environment!

    Just following the documentation breaks:
    from xradio.measurement_set import estimate_conversion_memory_and_cores
    returns the error
    ImportError: cannot import name 'estimate_conversion_memory_and_cores' from
    'xradio.measurement_set'

  3. Related to the BLOCKER in 9) is that, with these dependencies for the data access,
    the data format cannot be used for long-term storage of data, i.e. archiving.

    Here is the definition of sustainability that the US Library of Congress uses:

    https://www.loc.gov/preservation/digital/formats/sustain/sustain.shtml

    Here is the entry for FITS just as an example:

    https://www.loc.gov/preservation/digital/formats/fdd/fdd000317.shtml

    zarr is suitable for long-term storage. It defines the byte layout on the disk. The main
    reading routines (C, C++, Java) should be so low-level that they would still be usable in 40
    years. Whatever language/AI is fancy 40 years from now can then still at least wrap that
    reading routine.

    And XRADIO should make use of that minimal function itself, so that it is guaranteed that this
    core reading function of MSv4 is always up to date, maintained, and bug-free.

  4. Related to the BLOCKER in 9) is a non-exhaustive list of properties that I think are needed
    to call a data format suitable for archiving.

    • Format needs to be open, widely adopted, transparent and documented, patent-free, (self)-
      documented, standardized, without external dependencies
    • Needs to be readable by as many of the tools astronomers use as possible
    • Needs a low entry barrier for usage (simple is better than complex)
    • Needs to be managed by deliberate process under configuration control (e.g. a
      standardization body/group)
    • Must be efficient to download, read and manipulate
    • Has to remain stable (forwards compatible) in time (once FITS always FITS)
    • Must be described by the byte-layout on the disk, not by an API (Must be possible to write a
      reading routine in a few decades from now in a then modern programming language)
    • Must be programming-language agnostic and have readers in many programming languages
      available
    • ...

    ALMA is currently planning to store the final products in FITS also for the WSU era. Therefore
    the XRADIO dependencies and risks are not an issue for ALMA itself.

    For processing it is good enough if the data-format is stably accessible for the time of
    processing a single dataset.

    But for other facilities, in case they would plan to store MSv4 for longer than a month or two,
    this is a serious issue in my opinion that needs sufficient attention.

  5. Related to the BLOCKER in 9), it would probably be good to explicitly specify non-functional
    requirements for the MSv4 as well as for XRADIO somewhere, i.e. the vision and mission goal.

    Even if only a single version of data access is provided for the MSv4 with the zarr
    serialization, i.e. XRADIO using Python as a programming language, the requirements would need
    to be that

    • the reading software can be installed on operating systems at least going 10 years back in time
    • the reading software must work without containers and be simple to use, not more complicated
      than cfitsio (or e.g. pyfits).
    • there must be at least two reference implementations for the reading software (that's
      for example a requirement that the IVOA gave themselves for each new standard they approve)
    • ...
  6. Do not use Conda
    This page
    https://github.com/casangi/xradio
    starts with using conda.

    Do not use conda! Anaconda is a commercial entity. Organizations with more than 200 employees
    need to get a license:
    "Use of Anaconda’s Offerings at an organization of more than 200 employees requires a Business or
    Enterprise license. For more information, see our full Terms of Service, or read Frequently Asked
    Questions about our Terms of Service."

    The entire workflow and usage must rely only on entirely open-source and free software (e.g.
    GPL, LGPL, ...)!
    Conda was free previously; all software used must be carefully checked so that even 10 or 20
    years from now, there is no dependency that could turn into a vendor lock-in.

  7. While the serialization (zarr) is really programming-language independent, I had overlooked
    that the definition of the data model, and thus the origin from which code generation can
    happen, is Python:

    https://xradio.readthedocs.io/en/latest/_modules/xradio/measurement_set/schema.html

    Python is in general not really suited for longevity. There is no standard as in 'ANSI C', for
    example. The schema definition imports 'annotations' from '__future__', it uses 'typing', it
    uses 'xarray_dataset_schema', definitions from 'numpy', etc. All of those are subject to
    substantial change over time.

    This seems to be a risk.

    The main source of the schema should in my opinion be written in a standardized language (UML?,
    XML?) that is not subject to change for decades and that can be used for code generation in any
    other language where needed. In the same way zarr is language-agnostic, the schema needs to be,
    too.

    In other words: if there is a change to Python 6 and a change to numpy, xarray, ..., none of
    those should require changing the schema. And as well-supported as xarray is, it is not
    guaranteed to live for, e.g., 40 years either.

    A suggestion would be to create such a programming-language-independent schema. And then, to
    prove that the schema and serialization really are programming-language independent, one way
    could be to implement code generation for reading and writing directly from the schema (as
    exists for the ASDM) in several programming languages, e.g. Python, C and Java, and to make
    that a standard regression test included in the xradio test suite (see the sketch after this
    list).
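
As a toy illustration of point 7, the schema could live in a neutral notation (JSON is used below
purely as an example) and drive generated or hand-written checkers in each target language. The
schema fragment and every name in it are hypothetical stand-ins, not the actual MSv4 schema:

    # Toy sketch: a language-neutral schema fragment consumed by a Python checker.
    # Equivalent checkers could be generated for C or Java from the same source.
    import json

    SCHEMA = json.loads("""
    {
      "data_vars": {
        "VISIBILITY": {"dims": ["time", "baseline_id", "frequency", "polarization"],
                       "dtype": "complex64"}
      }
    }
    """)

    def conforms(var_dims: dict) -> bool:
        """Check that each schema variable is present with the expected dimensions."""
        return all(var_dims.get(name) == spec["dims"]
                   for name, spec in SCHEMA["data_vars"].items())

    # A dataset whose VISIBILITY variable has the expected dimension order passes.
    print(conforms({"VISIBILITY": ["time", "baseline_id", "frequency", "polarization"]}))  # True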

On Data-formats in general

One of the FITS fathers: "Creating a new data-format is easy. Everybody can do it. The challenge
is to make it the world-wide used standard."

This is certainly as true now as it was then.

I think that for MSv4 you are taking the right steps: involving the largest community, getting
buy-in, presenting at ADASS relentlessly. There is no guarantee that this works (see the ASDF
format, where everything was done right but there is still no real uptake), but I keep my fingers
crossed.
Jan-Willem (Member, Author) commented:

  1. We can add a list of tables that have not been included. The future expansion currently planned would consist of backwards-compatible changes (similar to the changes made to MSv2 and the ASDM; for example, VLBI sub-tables were added to MSv2 last year).

  2. MSv4.x.x will be backward compatible (no changes to existing concepts or structure). See https://xradio.readthedocs.io/en/latest/overview.html#Schema-Versioning.

  3. Reading from the current ALMA ASDM should be possible using PyASDM (which is under development); however, efficiency cannot be guaranteed.

  4. We will add a section defining the data model with semantics, serialization, software to access, etc. We will then check the rest of the documentation for consistency in our usage.

  5. Yes, https://zarr.dev/implementations/.

  6. We have moved away from using indices to string labels (antenna_names, spw_names, field_names, etc.). The only remaining numbering is scan_number and baseline_id. We are considering changing scan_numbers to scan_names. The baseline_id represents a type of multi-index where baseline_antenna1_name and baseline_antenna2_name should be used. Consequently, this will allow comparing a modified MSv4 with the original. We have also started discussing adding a log/history to track changes with the requirement that it is machine-readable.
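
To make the label-based comparison in point 6 concrete, a minimal sketch using xarray. The store
path and antenna name are hypothetical; only baseline_antenna1_name comes from the discussion
above:

    # Sketch: selecting rows by string label instead of row number, so a modified
    # MSv4 can be matched against the original even after renumbering/reordering.
    import xarray as xr

    ms = xr.open_zarr("msv4_example.zarr")  # hypothetical store path
    da41 = ms.where(ms.baseline_antenna1_name == "DA41", drop=True)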

Jan-Willem (Member, Author) commented:

  1. The processing set is a construct used specifically for processing data together (similar to a Multi-Measurement Set). It can consist of any collection of MSv4s (including those from different telescopes and observation times). Using observation set naming was proposed earlier but was rejected because it's already used in another context.

  2. The MSv4 is intended to be completely self-contained. We will add Project code, title of observation, PI name, Member_OUS_ID, and Group_OUS_ID as optional keys (note that Member_OUS_ID and Group_OUS_ID are not specified in the ASDM standard). Since it will be very difficult to account for all differences in observatory labeling of their data, we could allow for observatory-specified items in the observation_info.

  3. The 155 dependencies include all the interaction and visualization tools that are not necessary for accessing the data (Jupyter Labs, GraphViz, Matplotlib, etc.). To read the data, only the Zarr package and its dependencies are required:

numpy>=1.25
numcodecs[crc32c]>=0.14
typing_extensions>=4.9
donfig>=0.8
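
For illustration, a minimal sketch of reading such a store with only the zarr package (the store
path is a hypothetical placeholder, not the documented MSv4 layout):

    # Minimal read using only zarr; "msv4_example.zarr" is a made-up path.
    import zarr

    root = zarr.open_group("msv4_example.zarr", mode="r")  # read-only open
    print(dict(root.attrs))                                # group-level metadata
    for name, arr in root.arrays():                        # data variables as zarr arrays
        print(name, arr.shape, arr.dtype)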

Jan-Willem (Member, Author) commented:

  1. See response to 340.8.

  2. See response to 340.8.

  3. Non-exhaustive list of archive properties:

    • Format needs to be open, widely adopted, transparent and documented, patent-free, (self)-
      documented, standardized, without external dependencies:
      Zarr is widely used: https://zarr.dev/adopters/
    • Needs to be readable by as many of the tools astronomers use as possible:
      https://zarr.dev/implementations/
    • Needs a low entry barrier for usage (simple is better than complex):
      pip install zarr is very simple.
    • Needs to be managed by deliberate process under configuration control (e.g. a
      standardization body/group):
      https://zarr.dev/community/
    • Must be efficient to download, read and manipulate:
      From the initial benchmarking we have done, it is efficient.
    • Has to remain stable (forwards compatible) in time (once FITS always FITS):
      Both Zarr 3.x and MSv4.x.x are backwards compatible.
    • Must be described by the byte-layout on the disk, not by an API (Must be possible to write a
      reading routine in a few decades from now in a then modern programming language):
      https://zarr-specs.readthedocs.io/en/latest/v3/core/v3.0.html
    • Must be programming-language agnostic and have readers in many programming languages
      available:
      https://zarr.dev/implementations/
  4. Ask for input from the review committee on the suggested non-functional requirements.

  5. The suggested conda does not come from Anaconda but from miniforge:

"It is recommended to use the conda environment manager from miniforge to create a clean, self-contained runtime where XRADIO and all its dependencies can be installed"

miniforge is a community-driven open-source project and makes use of the conda-forge channel, which is not controlled by Anaconda. The miniforge license is BSD-3: https://github.com/conda-forge/miniforge/blob/main/LICENSE. XRADIO can also be installed using a standard Python virtual environment on Linux, but for Mac the conda-forge channel is needed (this is a restriction from python-casacore).
