Status quo
`datalad-run` (as shipped with datalad v1) executes an opaque, single-step workflow that is defined by a sequence of strings ultimately given to Python's `subprocess` for execution. The command supports placeholder expansion for these strings, with a few commonly defined variables and any number of custom definitions that are evaluated at runtime.
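For illustration, a minimal `datalad run` call using the built-in placeholders, plus a custom one (file names and the `datadir` placeholder are made up for this sketch):

```sh
# built-in placeholders: {inputs} and {outputs} expand to the declared files
datalad run \
  -i data/raw.csv -o results/summary.csv \
  'python code/analyze.py {inputs} {outputs}'

# custom placeholders are read from configuration at runtime,
# e.g. a {datadir} placeholder defined via git config
git config datalad.run.substitutions.datadir data
datalad run 'ls {datadir}'
```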
As an execution precondition, `datalad-run` requires a checkout of a Git repository. The `--input` parameter is used to guarantee the presence of selected annexed files. The `--output` parameter is used to ensure that particular files can be written to.
From a data sink perspective, a checkout of a Git repository is again required. Typically, all workflow outputs in the working directory are committed to that worktree as a new commit. A partial capture can be achieved via a combination of `--output` and `--explicit`.
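As a concrete sketch of the partial-capture combination (paths invented), only the declared output ends up in the resulting commit:

```sh
# --input fetches the annexed file if needed; --output unlocks it for writing;
# --explicit restricts the resulting commit to the declared outputs
datalad run \
  -i data/raw.csv \
  -o results/summary.csv \
  --explicit \
  'python code/analyze.py data/raw.csv results/summary.csv'
```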
Possible re-envisioning
Running a future `datalad-run` could (see the sketch after this list):

- generate a data source workflow step with the dataset ID and the commit hash of `HEAD` of the worktree;
- if `--input` is given, add the respective items to that specification;
- if `--output` is given, make those items writable;
- add a workflow step with the equivalent of `datalad-run --explicit --assume-ready both <cmd>` to get the placeholder expansion;
- generate a data sink workflow step that commits either all modifications or just the declared outputs.
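A rough sketch of this decomposition, with `datalad-provision` and `datalad-capture` as purely hypothetical names for the factored-out steps (they do not exist today):

```sh
# data source step: provision a worktree at a pinned dataset state
# (hypothetical command; <dataset-id>/<HEAD-sha> stand in for real values)
datalad-provision --dataset <dataset-id> --commit <HEAD-sha> \
  --input data/raw.csv --output results/summary.csv /tmp/work

# execution step: provisioning is already done, so only placeholder
# expansion and command execution remain
datalad -C /tmp/work run --explicit --assume-ready both \
  -i data/raw.csv -o results/summary.csv '<cmd>'

# data sink step: commit all modifications, or just the declared outputs
# (hypothetical command)
datalad-capture --worktree /tmp/work --outputs-only
```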
The actual execution could also take place in a temporary clone/worktree without issue.
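For example (assuming a local source dataset at `$DS` whose repository accepts pushes to its current branch):

```sh
tmp=$(mktemp -d)
datalad clone "$DS" "$tmp/ds"
datalad -C "$tmp/ds" run -i data/raw.csv -o results/summary.csv \
  'python code/analyze.py'
# hand the captured commit (and annexed outputs) back to the source
datalad -C "$tmp/ds" push --to origin
rm -rf "$tmp"
```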
A `datalad rerun` would be very similar. It would keep the workflow discovery (`--since`), and either execute sequentially on top of `HEAD`, or on a different branch (i.e., an adjusted data sink).
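In terms of the existing CLI (tag and branch names illustrative):

```sh
# re-execute all recorded commands after a given point, on top of HEAD
datalad rerun --since start-tag

# the same, but onto a fresh branch, i.e. an adjusted data sink
datalad rerun --since start-tag --branch reproduced
```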
Why blow up a perfectly simple implementation with a complex CWL dependency?
That is not necessary. What the above sketch boils down to is a refactoring: rather than having one monolithic `run`, we factor out provisioning and output capture from the execution. Whether we run the two respective commands directly or via CWL does not matter much. However, the resulting new helpers would also become available in a CWL context (and for `remake`), thereby increasing usage.
Maybe worth an addendum: apart from `--since`, the closest thing to workflow management with the current `datalad rerun` is the procedure described in the Handbook's subsection 5.1.4.2 (DVC comparison): execute a series of `datalad run` commands, tag important steps, and re-run a range of commits (optionally creating a new branch) with `datalad rerun --branch foo start-tag..end-tag`.
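Spelled out, that procedure is roughly (step names, paths, and tags invented for the sketch):

```sh
# tag the state before the pipeline steps
git tag pipe-start
datalad run -m "preprocess" -i raw/ -o derived/ 'python code/preprocess.py'
datalad run -m "analyze" -i derived/ -o results/ 'python code/analyze.py'
git tag pipe-end

# later: re-execute the recorded range on a dedicated branch
datalad rerun --branch verify pipe-start..pipe-end
```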