Build demo mapping of a datalad-run record as a CWL CommandLineTool #7

mih · 2024-05-02T07:01:47Z

Here is the target spec: https://www.commonwl.org/v1.2/CommandLineTool.html

The source is much simpler: http://docs.datalad.org/en/stable/design/provenance_capture.html#the-provenance-record

The challenge is that the datalad run record is a combination of three things that are recognized as separate entities in the CWL world:

workflow/command line tool specification
workflow inputs
workflow execution provenance

Following the cwltool documentation, the first two can be linked to form a single execution specification:

positional arguments:
  cwl_document
          path or URL to a CWL Workflow, CommandLineTool, or ExpressionTool.
          If the `inputs_object` has a `cwl:tool` field indicating
          the path or URL to the cwl_document, then the `cwl_document`
          argument is optional.
  inputs_object
         path or URL to a YAML or JSON formatted description of the required input
         values for the given `cwl_document`.
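
The `cwl:tool` lookup described in this help text can be sketched in Python. This is only an illustration of the rule, not cwltool's actual implementation. The helper name `resolve_tool` is hypothetical, and two behaviours are assumptions: an explicit `cwl_document` takes precedence, and a relative `cwl:tool` is resolved against the directory of the inputs file. A JSON-formatted inputs object is assumed so that only the standard library is needed.

```python
import json
from pathlib import Path


def resolve_tool(inputs_path, cwl_document=None):
    """Return the path of the CWL document to run for an inputs object.

    Hypothetical helper: an explicit cwl_document wins (assumption);
    otherwise the `cwl:tool` field of the inputs object is used,
    resolved relative to the directory of the inputs file (assumption).
    """
    if cwl_document is not None:
        return Path(cwl_document)
    inputs = json.loads(Path(inputs_path).read_text())
    tool = inputs.get("cwl:tool")
    if tool is None:
        raise ValueError("no cwl_document given and no 'cwl:tool' field found")
    return Path(inputs_path).parent / tool
```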

Here is a demo of that:

cp.cwl.yaml

cwlVersion: v1.2
class: CommandLineTool
baseCommand: [cp, -v]
inputs:
  src:
    type: File
    inputBinding:
      position: 1
  dstpath:
    type: string
    inputBinding:
      position: 2
outputs:
  dst:
    type: File
    outputBinding:
      glob: $(inputs.dstpath)

cp.inputs.yaml

cwl:tool: cp.cwl.yaml # this is the key bit
src:
  class: File
  path: input.txt
dstpath: output.txt

Both files can then be executed as a single instruction set:

❯ cwltool cp.inputs.yaml
INFO /usr/bin/cwltool 3.1.20240404144621
INFO Resolved 'cp.inputs.yaml' to 'file:///tmp/cwl/some/cp.inputs.yaml'
INFO [job cp.cwl.yaml] /tmp/pi5sa5fc$ cp \
    -v \
    /tmp/sbqjcgn3/stg4b4f504d-626f-43c2-92bc-fe2cca85ab43/input.txt \
    output.txt
'/tmp/sbqjcgn3/stg4b4f504d-626f-43c2-92bc-fe2cca85ab43/input.txt' -> 'output.txt'
INFO [job cp.cwl.yaml] completed success
{
    "dst": {
        "location": "file:///tmp/cwl/some/output.txt",
        "basename": "output.txt",
        "class": "File",
        "checksum": "sha1$b63c7c3a7543014bd34d99d31a85606d485837f9",
        "size": 7,
        "path": "/tmp/cwl/some/output.txt"
    }
}INFO Final process status is success
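
The `checksum` field in the output object is the SHA-1 of the file content, prefixed with `sha1$`. A minimal sketch for recomputing it, for example to compare a file in a dataset against a recorded output; the helper name `cwl_checksum` is hypothetical:

```python
import hashlib


def cwl_checksum(path, chunk_size=65536):
    """Compute a CWL-style File checksum ('sha1$' + hex digest)."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        # read in chunks so large files need not fit in memory
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return "sha1$" + digest.hexdigest()
```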

Now this can be taken a step further. With cwlprov (https://github.com/common-workflow-language/cwlprov) we get an instant PROV record, packaged as a BagIt:

❯ cwltool --provenance prov.out --enable-host-provenance cp.inputs.yaml
INFO /home/mih/env/datalad-dev/bin/cwltool 3.1.20240404144621
INFO [cwltool] /home/mih/env/datalad-dev/bin/cwltool --provenance prov.out --enable-host-provenance cp.inputs.yaml
INFO Resolved 'cp.inputs.yaml' to 'file:///tmp/cwl/some/cp.inputs.yaml'
INFO [provenance] Adding to RO file:///tmp/cwl/some/input.txt
INFO [job cp.cwl.yaml] /tmp/_5kqmfwq$ cp \
    -v \
    /tmp/b3ftolbq/stg52f39a5d-524a-46da-9bd5-4df4e513a05e/input.txt \
    output.txt
'/tmp/b3ftolbq/stg52f39a5d-524a-46da-9bd5-4df4e513a05e/input.txt' -> 'output.txt'
INFO [job cp.cwl.yaml] completed success
/home/mih/env/datalad-dev/lib/python3.11/site-packages/rdflib/plugins/serializers/nt.py:40: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
{
    "dst": {
        "location": "file:///tmp/cwl/some/output.txt",
        "basename": "output.txt",
        "class": "File",
        "checksum": "sha1$b63c7c3a7543014bd34d99d31a85606d485837f9",
        "size": 7,
        "path": "/tmp/cwl/some/output.txt"
    }
}INFO Final process status is success
INFO [provenance] Finalizing Research Object
INFO [provenance] Research Object saved to /tmp/cwl/some/prov.out

This leads to:

❯ tree -a prov.out
prov.out
├── bag-info.txt
├── bagit.txt
├── data
│   ├── ad
│   │   └── adbe6c7d3c0d8b19ecd492bec9532c13a6e1c9ad
│   └── b6
│       └── b63c7c3a7543014bd34d99d31a85606d485837f9
├── manifest-sha1.txt
├── metadata
│   ├── logs
│   │   └── engine.a844e9af-9c50-4208-be9f-76db7579c11b.txt
│   ├── manifest.json
│   └── provenance
│       ├── primary.cwlprov.json
│       ├── primary.cwlprov.jsonld
│       ├── primary.cwlprov.nt
│       ├── primary.cwlprov.provn
│       ├── primary.cwlprov.ttl
│       └── primary.cwlprov.xml
├── snapshot
│   └── cp.cwl.yaml
├── tagmanifest-sha1.txt
├── tagmanifest-sha256.txt
├── tagmanifest-sha512.txt
└── workflow
    ├── packed.cwl
    ├── primary-job.json
    └── primary-output.json

9 directories, 20 files
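
Judging from this listing, payload files sit under `data/`, keyed by their SHA-1, with the first two hex characters used as a subdirectory. A sketch that maps a CWL checksum to that location, assuming the layout holds in general (it is inferred from this one tree only); `payload_path` is a hypothetical name:

```python
from pathlib import PurePosixPath


def payload_path(checksum):
    """Map 'sha1$<hex>' to the relative path of the payload in the bag."""
    algo, _, hexdigest = checksum.partition("$")
    if algo != "sha1" or len(hexdigest) != 40:
        raise ValueError(f"unsupported checksum: {checksum!r}")
    # layout assumed from the tree output: data/<first two chars>/<digest>
    return PurePosixPath("data") / hexdigest[:2] / hexdigest
```

For the `dst` output above, `payload_path("sha1$b63c7c3a7543014bd34d99d31a85606d485837f9")` yields `data/b6/b63c7c3a7543014bd34d99d31a85606d485837f9`, which matches the tree.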

Does this have all the information from a datalad run record?

cmd: comprehensively captured in the workflow declaration.
inputs: captured in the workflow inputs, in much more detail. Also in prov.out/workflow/primary-job.json (careful with absolute file:// URLs).
outputs: see prov.out/workflow/primary-output.json.
dsid: absent; CWL has no concept of this. A related "associated with dataset" property can be defined easily. But with Design datalad remake-provision #12 the dsid could even become an explicit workflow parameter.
exit: not recorded verbatim, but CWL allows labeling exit codes as success, temporary failure, or permanent failure. Although this reduces the information, it is also more flexible (not every non-zero exit code is a problem) and enables decision-making.
pwd: in the provenance output, everything is recoded to match the organization of the BagIt, which includes its own data hashtree.
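
The exit code labeling mentioned above corresponds to the `successCodes`, `temporaryFailCodes` and `permanentFailCodes` fields of a CommandLineTool. A rough sketch of how a runner might apply them; the function `classify_exit` is hypothetical, and the order of the checks is an assumption rather than a quote of cwltool's code:

```python
def classify_exit(code, success_codes=(), temporary_fail_codes=(),
                  permanent_fail_codes=()):
    """Label an exit code as 'success', 'temporaryFail' or 'permanentFail'."""
    if code in success_codes:
        return "success"
    if code in temporary_fail_codes:
        return "temporaryFail"
    if code in permanent_fail_codes:
        return "permanentFail"
    # nothing declared for this code: only 0 counts as success
    return "success" if code == 0 else "permanentFail"
```

For a tool like `grep`, which exits with 1 when nothing matched, declaring `success_codes=(0, 1)` makes `classify_exit(1, success_codes=(0, 1))` return `"success"`, whereas a verbatim `exit: 1` in a run record leaves the interpretation to the reader.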

So not everything is readily available in the right format, but missing bits can be added easily.
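
To make the mapping concrete, here is a deliberately naive sketch that splits a run record into a tool description and an inputs object shaped like the `cp` demo. Everything in it is an assumption for illustration: the function name `run_record_to_cwl`, the generated parameter names, and the heuristic that any command token listed in `inputs` or `outputs` becomes a positional parameter while all preceding tokens form `baseCommand`. It ignores `dsid`, `exit` and `pwd`, and it refuses commands with fixed tokens after the first file argument.

```python
import shlex


def run_record_to_cwl(record, tool_path="tool.cwl.json"):
    """Naively split a run record into (tool, job) dicts (hypothetical)."""
    cmd = record["cmd"]
    tokens = shlex.split(cmd) if isinstance(cmd, str) else list(cmd)
    inputs = record.get("inputs", [])
    outputs = record.get("outputs", [])
    tool = {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": [],
        "inputs": {},
        "outputs": {},
    }
    job = {"cwl:tool": tool_path}
    position = 0
    for token in tokens:
        if token in inputs:
            position += 1
            name = f"in{position}"
            tool["inputs"][name] = {
                "type": "File", "inputBinding": {"position": position}}
            job[name] = {"class": "File", "path": token}
        elif token in outputs:
            position += 1
            param = f"out{position}_path"
            tool["inputs"][param] = {
                "type": "string", "inputBinding": {"position": position}}
            tool["outputs"][f"out{position}"] = {
                "type": "File",
                "outputBinding": {"glob": f"$(inputs.{param})"}}
            job[param] = token
        elif position == 0:
            # fixed tokens before the first file argument form the command
            tool["baseCommand"].append(token)
        else:
            raise ValueError(f"cannot place fixed token {token!r}")
    return tool, job
```

Feeding it `{"cmd": "cp -v input.txt output.txt", "inputs": ["input.txt"], "outputs": ["output.txt"]}` gives `baseCommand: [cp, -v]`, one `File` input and one string input with a matching output glob, which is the same structure as the hand-written demo.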

Going with the BagIt as the main/only output format seems unnecessarily complex. With a datalad dataset we can capture most/all of this information without taking apart the dataset worktree.

mih · 2024-05-15T08:29:40Z

Closing. Continued in #14

mih added this to DataLad remake May 2, 2024

mih converted this from a draft issue May 2, 2024

This was referenced May 2, 2024

Define API for recording/setting compute instructions in dataset #4

Open

Define specification for compute instructions #5

Open

mih mentioned this issue May 15, 2024

CWL-aligned design/implementation #14

Open

1 task

mih closed this as completed May 15, 2024

github-project-automation bot moved this from workable to done in DataLad remake May 15, 2024
