Historic (prov-record) vs up-to-date compute instructions #2

Open
mih opened this issue Apr 29, 2024 · 0 comments

mih commented Apr 29, 2024

There are two principal use cases for a compute instruction record:

  • provenance record: how something was computed. Here, the primary aim is to document.
  • compute instruction: specify how something can be computed.

The current implementation of run/rerun in datalad effectively ignores that these are two different things, and documents provenance in a format that aims to be re-executable. However, this has problems:

  • when a prior record no longer works, because the necessary environment is no longer available and the instructions have to be updated, we need a new record; but apart from that, there is no change to the dataset or to individual keys that would record that they are now provided via different means
  • updating a record nevertheless means rewriting history

We need to come up with a specification that supports both cases equally well. This possibly means:

  • establish a dedicated prov-record (document-only) to be used in commits for the purpose of recording outcomes of datalad run
  • develop the concept of compute instructions as a (library of) templates for executing code with prov-capture (a rough sketch of how these two pieces could relate follows this list).
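A minimal sketch of how such a document-only prov-record might relate to an instruction template, assuming a JSON-serializable record stored with the commit; all field names, the template name, and the paths are made-up illustrations, not an agreed format:

```python
# Hypothetical, for illustration only: a document-only prov-record that
# points at a reusable compute-instruction template instead of embedding
# re-executable commands itself.
prov_record = {
    "type": "prov-record",            # documents what happened, not how to redo it
    "template": "example-template",   # name of a compute-instruction template (made up)
    "parameters": {"subject": "01"},
    "inputs": ["inputs/raw/subject-01.dat"],
    "outputs": ["derivatives/subject-01/result.dat"],
}
```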

CWL-based solution

With #10 we would factor a specification into three components:

  • (1) what to compute on?
  • (2) how to compute?
  • (3) what to store?

A "historic" record needs to capture all three verbatim. This would be easy, because the would appear in the form of a modular CWL workflow with three steps, represented would three CWL sub-workflows (or rather command line tool invocation): data provisioning, compute, data capture.
Together they become part of the commit that captures the outcome (just like a traditional run record, but more modular).
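A rough sketch of what such a modular workflow could look like, written here as a Python dictionary that is serialized to CWL YAML; the step names, tool files, and input/output identifiers are illustrative assumptions, not a fixed layout:

```python
import yaml  # PyYAML

# Illustrative three-step workflow: each step points at its own tool/sub-workflow
# description, so the provisioning and capture parts can later be swapped out
# without touching the compute step.
workflow = {
    "cwlVersion": "v1.2",
    "class": "Workflow",
    "inputs": {"dataset": "Directory", "keys": "string[]"},
    "outputs": {
        "captured": {"type": "Directory", "outputSource": "capture/captured"},
    },
    "steps": {
        "provision": {
            "run": "provision.cwl",              # (1) what to compute on
            "in": {"dataset": "dataset", "keys": "keys"},
            "out": ["worktree"],
        },
        "compute": {
            "run": "compute.cwl",                # (2) how to compute
            "in": {"worktree": "provision/worktree"},
            "out": ["results"],
        },
        "capture": {
            "run": "capture.cwl",                # (3) what to store
            "in": {"results": "compute/results"},
            "out": ["captured"],
        },
    },
}

print(yaml.safe_dump(workflow, sort_keys=False))
```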

Theoretically, each of the three components can be "fixed up", and would need to be whenever the API of an underlying tool changes.

Importantly, (1) and (3) are more likely to change (they would need to track datalad's evolution and run natively on the client system), while (2) could be implemented in a more "static" fashion via container-based execution.
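For instance, the compute step could pin its execution environment with CWL's standard DockerRequirement, keeping (2) stable while (1) and (3) evolve; the command and image name below are placeholders:

```python
# Sketch of a "static" compute step: the tool itself changes rarely because
# its runtime is pinned to a container image (placeholder names throughout).
compute_tool = {
    "cwlVersion": "v1.2",
    "class": "CommandLineTool",
    "baseCommand": ["run-analysis"],           # placeholder command inside the image
    "requirements": {
        "DockerRequirement": {"dockerPull": "example.org/analysis-env:1.0"},
    },
    "inputs": {"worktree": "Directory"},
    "outputs": {"results": {"type": "Directory", "outputBinding": {"glob": "results"}}},
}
```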

When rerunning historic records it should be possible to provide updated workflow step definitions. It may be meaningful to employ an updatable approach from the beginning. Something like:

  • The three-step workflow is read and upgraded to the current CWL version; the upgraded copy is written to a temporary location
  • Only then is the CWL workflow executed, using that temporary copy (see the sketch below)
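A minimal sketch of that flow, assuming cwltool as the runner; `upgrade_to_current_cwl` is a hypothetical stand-in for whatever upgrade mechanism (e.g. cwl-upgrader, or datalad-specific rewrites of the provisioning/capture steps) would actually be used:

```python
import subprocess
import tempfile
from pathlib import Path


def upgrade_to_current_cwl(text: str) -> str:
    # Hypothetical stand-in: in practice this could invoke cwl-upgrader
    # and/or swap in updated provisioning and capture step definitions.
    return text


def rerun_historic(record: Path, job: Path) -> None:
    """Execute a historic workflow record via an upgraded temporary copy."""
    with tempfile.TemporaryDirectory() as tmp:
        upgraded = Path(tmp) / record.name
        upgraded.write_text(upgrade_to_current_cwl(record.read_text()))
        # Only the temporary, upgraded copy is executed; the pristine record
        # in the dataset history is never modified.
        subprocess.run(["cwltool", str(upgraded), str(job)], check=True)
```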

Now, when a workflow is written out (by rerun, or by the special remote handler), we can also apply updates to the data provisioning and capture steps (possibly replacing them entirely, informed by the previous configuration). However, we would not write such updates back to the dataset, but instead maintain the pristine original record. Upgrades are applied on the fly, each time.

@mih mih transferred this issue from another repository May 2, 2024
@github-project-automation github-project-automation bot moved this to discussion needed in DataLad remake May 2, 2024