There are two principal use cases for a compute instruction record:

- provenance record: how something *was* computed. Here, the primary aim is to document.
- compute instruction: specify how something *can* be computed.
The current implementation of `run`/`rerun` in datalad tries to ignore that these are two different things, and records provenance in a format that aims to be re-executable. However, this has problems:
- when a prior record no longer works, because the necessary environment is no longer available and the instructions have to be updated, we need a new record; but there is no other change to the dataset or its individual keys that could record that they are now provided via different means
- updating a record nevertheless means rewriting history
We need to come up with a specification that supports both cases equally well. This possibly means:
- establish a dedicated prov-record (document-only) to be used in commits for the purpose of recording outcomes of `datalad run`
- develop the concept of compute instructions as a (library of) templates for executing code with provenance capture
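As a rough sketch, the two record types could be kept apart along these lines (all class and field names below are hypothetical illustrations, not an existing datalad API):

```python
from dataclasses import dataclass, field

@dataclass
class ProvRecord:
    """Document-only provenance record, stored in a commit.

    Records how an outcome *was* produced; it is never meant to be
    mechanically re-executed, so it can stay immutable even when the
    original environment disappears.
    """
    command: str
    inputs: list
    outputs: list
    environment: dict = field(default_factory=dict)

@dataclass
class ComputeInstruction:
    """Template-based description of how something *can* be computed.

    References a template from a (hypothetical) instruction library,
    plus the parameters needed to instantiate it. This part may be
    updated independently of any commit history.
    """
    template: str
    parameters: dict = field(default_factory=dict)
```

The split lets the document-only record stay pristine in history while the executable instructions evolve separately.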
CWL-based solution
With #10 we would factor a specification into three components:
(1) what to compute on?
(2) how to compute?
(3) what to store?
A "historic" record needs to capture all three verbatim. This would be easy, because they would appear in the form of a modular CWL workflow with three steps, represented as three CWL sub-workflows (or rather, command line tool invocations): data provisioning, compute, data capture.
Together they become part of the commit that captures the outcome (just like a traditional run record, but more modular).
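A minimal sketch of what such a three-step record could look like, written as a CWL workflow. All step names and the referenced `datalad-*.cwl` tool descriptions are assumptions for illustration, not existing files:

```cwl
# Hypothetical three-step provenance record as a modular CWL workflow.
cwlVersion: v1.2
class: Workflow
inputs:
  keys: string[]          # dataset keys to operate on
outputs:
  captured:
    type: File[]
    outputSource: capture/saved
steps:
  provision:              # (1) what to compute on
    run: datalad-provision.cwl
    in: {keys: keys}
    out: [files]
  compute:                # (2) how to compute
    run: compute-tool.cwl
    in: {files: provision/files}
    out: [results]
  capture:                # (3) what to store
    run: datalad-capture.cwl
    in: {results: compute/results}
    out: [saved]
```

Each step being a separate sub-workflow or tool description is what makes it possible to fix up one component (e.g. provisioning) without touching the others.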
Theoretically, each of the three components can be "fixed up", and would need to be whenever any API of an underlying tool changes.
Importantly, (1) and (3) will be more likely to change (they would need to track datalad evolution and run on the client system natively), while (2) could be implemented in a more "static" fashion via container-based execution.
When rerunning historic records it should be possible to provide updated workflow step definitions. It may be meaningful to employ an updatable approach from the beginning. Something like:
- The three-step workflow is read and upgraded to the most recent CWL version, which is written to a temporary location
- Only then does CWL run
Now, when a workflow is written out (by `rerun`, or by the special remote handler), we can also apply updates to the data provisioning and capture steps, possibly replacing them entirely, informed by the previous configuration. However, we would not write such updates back to the dataset, but instead maintain the pristine original record. Upgrades are applied on the fly, each time.
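The on-the-fly upgrade flow described above could look roughly like this. A hedged sketch in Python: the function name, the JSON record layout, and the `step_upgrades` mapping are assumptions chosen for illustration, not part of any real datalad or CWL interface:

```python
import json
import tempfile
from pathlib import Path

def prepare_for_rerun(record_path, step_upgrades):
    """Load a stored three-step workflow record, apply per-step
    upgrades, and write the result to a temporary location.

    The pristine original record at ``record_path`` is never
    modified; upgrades are applied on the fly, each time.

    ``step_upgrades`` maps a step name (e.g. 'provision', 'capture')
    to a replacement step definition.
    """
    record = json.loads(Path(record_path).read_text())
    for name, replacement in step_upgrades.items():
        if name in record["steps"]:
            record["steps"][name] = replacement
    tmp = Path(tempfile.mkdtemp()) / "upgraded-workflow.json"
    tmp.write_text(json.dumps(record, indent=2))
    return tmp
```

Because the upgraded copy lives only in a temporary location, rerunning with updated provisioning or capture steps leaves the committed record untouched.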