Build demo mapping of a datalad-run record as a CWL CommandLineTool #7

Closed
5 of 6 tasks
mih opened this issue May 2, 2024 · 1 comment

mih commented May 2, 2024

Here is the target spec: https://www.commonwl.org/v1.2/CommandLineTool.html

The source is much simpler: http://docs.datalad.org/en/stable/design/provenance_capture.html#the-provenance-record

The challenge is that the datalad run record is a combination of three things that are recognized as separate entities in the CWL world:

  • workflow/command line tool specification
  • workflow inputs
  • workflow execution provenance

Following the cwltool documentation, the first two can be linked to form a single execution specification:

positional arguments:
  cwl_document
          path or URL to a CWL Workflow, CommandLineTool, or ExpressionTool.
          If the `inputs_object` has a `cwl:tool` field indicating
          the path or URL to the cwl_document, then the `cwl_document`
          argument is optional.
  inputs_object
         path or URL to a YAML or JSON formatted description of the required input
         values for the given `cwl_document`.

Here is a demo of that:

cp.cwl.yaml

cwlVersion: v1.2
class: CommandLineTool
baseCommand: [cp, -v]
inputs:
  src:
    type: File
    inputBinding:
      position: 1
  dstpath:
    type: string
    inputBinding:
      position: 2
outputs:
  dst:
    type: File
    outputBinding:
      glob: $(inputs.dstpath)

cp.inputs.yaml

cwl:tool: cp.cwl.yaml # this is the key bit
src:
  class: File
  path: input.txt
dstpath: output.txt

This can be executed as one instruction set:

❯ cwltool cp.inputs.yaml
INFO /usr/bin/cwltool 3.1.20240404144621
INFO Resolved 'cp.inputs.yaml' to 'file:///tmp/cwl/some/cp.inputs.yaml'
INFO [job cp.cwl.yaml] /tmp/pi5sa5fc$ cp \
    -v \
    /tmp/sbqjcgn3/stg4b4f504d-626f-43c2-92bc-fe2cca85ab43/input.txt \
    output.txt
'/tmp/sbqjcgn3/stg4b4f504d-626f-43c2-92bc-fe2cca85ab43/input.txt' -> 'output.txt'
INFO [job cp.cwl.yaml] completed success
{
    "dst": {
        "location": "file:///tmp/cwl/some/output.txt",
        "basename": "output.txt",
        "class": "File",
        "checksum": "sha1$b63c7c3a7543014bd34d99d31a85606d485837f9",
        "size": 7,
        "path": "/tmp/cwl/some/output.txt"
    }
}INFO Final process status is success
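
The hand-written pair above could in principle be generated from a run record. Here is a minimal sketch of such a mapping, not a proposed implementation. It assumes a record with `cmd`, `inputs` and `outputs` keys, and that every input/output path appears verbatim as a token of the command (no `{inputs}` placeholders, no globs). The function name `run_record_to_cwl` and the generated parameter names (`input_2`, `output_3_path`, ...) are made up for illustration. The result is emitted as JSON, which is also valid YAML.

```python
import json
import shlex


def run_record_to_cwl(record, tool_path="tool.cwl.json"):
    """Split a run-record-like dict into a CWL tool and an inputs object.

    Assumption: `record` has `cmd` (string or list) plus `inputs` and
    `outputs` lists whose entries appear verbatim as command tokens.
    """
    cmd = record["cmd"]
    tokens = shlex.split(cmd) if isinstance(cmd, str) else list(cmd)
    in_files = set(record.get("inputs", []))
    out_files = set(record.get("outputs", []))

    tool = {
        "cwlVersion": "v1.2",
        "class": "CommandLineTool",
        "baseCommand": [],
        "arguments": [],
        "inputs": {},
        "outputs": {},
    }
    job = {"cwl:tool": tool_path}  # links the inputs object to the tool

    seen_parameter = False
    for position, token in enumerate(tokens):
        if token in in_files:
            # Declared input file: becomes a File parameter.
            seen_parameter = True
            name = f"input_{position}"
            tool["inputs"][name] = {
                "type": "File",
                "inputBinding": {"position": position},
            }
            job[name] = {"class": "File", "path": token}
        elif token in out_files:
            # Declared output: pass the path as a string, collect it via glob.
            seen_parameter = True
            name = f"output_{position}"
            tool["inputs"][f"{name}_path"] = {
                "type": "string",
                "inputBinding": {"position": position},
            }
            tool["outputs"][name] = {
                "type": "File",
                "outputBinding": {"glob": f"$(inputs.{name}_path)"},
            }
            job[f"{name}_path"] = token
        elif not seen_parameter:
            # Leading tokens before any parameter form the base command.
            tool["baseCommand"].append(token)
        else:
            # Remaining literal tokens keep their place via `arguments`.
            tool["arguments"].append({"position": position, "valueFrom": token})

    if not tool["arguments"]:
        del tool["arguments"]
    return tool, job


record = {
    "cmd": "cp -v input.txt output.txt",
    "inputs": ["input.txt"],
    "outputs": ["output.txt"],
}
tool, job = run_record_to_cwl(record, tool_path="cp.cwl.json")
print(json.dumps(tool, indent=2))
print(json.dumps(job, indent=2))
```

For the `cp` example this yields the same split as the demo: `baseCommand: [cp, -v]`, one `File` input, one string input for the destination path, and one output globbed from that string.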

Now this can be taken a step further. With cwlprov (https://github.com/common-workflow-language/cwlprov) we can get an instant PROV record, packaged as a BagIt bag:

❯ cwltool --provenance prov.out --enable-host-provenance cp.inputs.yaml
INFO /home/mih/env/datalad-dev/bin/cwltool 3.1.20240404144621
INFO [cwltool] /home/mih/env/datalad-dev/bin/cwltool --provenance prov.out --enable-host-provenance cp.inputs.yaml
INFO Resolved 'cp.inputs.yaml' to 'file:///tmp/cwl/some/cp.inputs.yaml'
INFO [provenance] Adding to RO file:///tmp/cwl/some/input.txt
INFO [job cp.cwl.yaml] /tmp/_5kqmfwq$ cp \
    -v \
    /tmp/b3ftolbq/stg52f39a5d-524a-46da-9bd5-4df4e513a05e/input.txt \
    output.txt
'/tmp/b3ftolbq/stg52f39a5d-524a-46da-9bd5-4df4e513a05e/input.txt' -> 'output.txt'
INFO [job cp.cwl.yaml] completed success
/home/mih/env/datalad-dev/lib/python3.11/site-packages/rdflib/plugins/serializers/nt.py:40: UserWarning: NTSerializer always uses UTF-8 encoding. Given encoding was: None
  warnings.warn(
{
    "dst": {
        "location": "file:///tmp/cwl/some/output.txt",
        "basename": "output.txt",
        "class": "File",
        "checksum": "sha1$b63c7c3a7543014bd34d99d31a85606d485837f9",
        "size": 7,
        "path": "/tmp/cwl/some/output.txt"
    }
}INFO Final process status is success
INFO [provenance] Finalizing Research Object
INFO [provenance] Research Object saved to /tmp/cwl/some/prov.out

Leading to

❯ tree -a prov.out
prov.out
├── bag-info.txt
├── bagit.txt
├── data
│   ├── ad
│   │   └── adbe6c7d3c0d8b19ecd492bec9532c13a6e1c9ad
│   └── b6
│       └── b63c7c3a7543014bd34d99d31a85606d485837f9
├── manifest-sha1.txt
├── metadata
│   ├── logs
│   │   └── engine.a844e9af-9c50-4208-be9f-76db7579c11b.txt
│   ├── manifest.json
│   └── provenance
│       ├── primary.cwlprov.json
│       ├── primary.cwlprov.jsonld
│       ├── primary.cwlprov.nt
│       ├── primary.cwlprov.provn
│       ├── primary.cwlprov.ttl
│       └── primary.cwlprov.xml
├── snapshot
│   └── cp.cwl.yaml
├── tagmanifest-sha1.txt
├── tagmanifest-sha256.txt
├── tagmanifest-sha512.txt
└── workflow
    ├── packed.cwl
    ├── primary-job.json
    └── primary-output.json

9 directories, 20 files
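
The listing suggests that files under `data/` are stored under their SHA-1 hex digest, fanned out by the first two characters: `data/b6/b63c7c3a...` matches the `sha1$b63c7c3a...` checksum reported for `output.txt` above. A small sketch of that relation follows. The layout is inferred from this listing only, not taken from the cwlprov specification, and the helper names are hypothetical.

```python
import hashlib


def cwl_checksum(content: bytes) -> str:
    """Return a checksum string in the `sha1$<hexdigest>` form seen above."""
    return "sha1$" + hashlib.sha1(content).hexdigest()


def bag_data_path(checksum: str) -> str:
    """Map a `sha1$<hexdigest>` checksum to the assumed path inside the bag."""
    algorithm, _, digest = checksum.partition("$")
    if algorithm != "sha1" or len(digest) != 40:
        raise ValueError(f"unsupported checksum: {checksum!r}")
    # Assumed layout: data/<first two hex chars>/<full digest>
    return f"data/{digest[:2]}/{digest}"


print(bag_data_path("sha1$b63c7c3a7543014bd34d99d31a85606d485837f9"))
```

With this, an entry in `primary-output.json` can be matched to its payload file in the bag without relying on the original file name.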

Does this have all information from a datalad run-record?

  • cmd is comprehensively captured in the workflow declaration
  • inputs are captured in the workflow inputs, in much more detail. They are also in prov.out/workflow/primary-job.json (careful with the absolute file:// URLs)
  • outputs: see prov.out/workflow/primary-output.json
  • dsid is absent; CWL has no concept of this. A related "associated with dataset" property can be defined easily. But with #12 (Design datalad remake-provision) the dsid could even become an explicit workflow parameter
  • exit is not recorded verbatim, but CWL allows for labeling exit codes as success, temporary failure, or permanent failure. Although this reduces information, it is also more flexible (not every non-zero exit is a problem), and it enables decision-making
  • pwd: in the prov output everything is recoded to match the organization of the BagIt bag, which includes its own data hashtree.
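
To illustrate the exit code point: a CWL `CommandLineTool` can declare `successCodes`, `temporaryFailCodes` and `permanentFailCodes`. The sketch below shows how a verbatim `exit` value from a run record could be reduced to one of the three labels. It is a simplified reading of the CWL semantics (explicit lists first, then the default "0 is success"), and `classify_exit` is a hypothetical helper, not a cwltool API.

```python
def classify_exit(code, success_codes=(), temporary_fail_codes=(),
                  permanent_fail_codes=()):
    """Label an exit code as success, temporaryFail or permanentFail."""
    if code in success_codes:
        return "success"
    if code in temporary_fail_codes:
        return "temporaryFail"
    if code in permanent_fail_codes:
        return "permanentFail"
    # Default when no list matches: only 0 counts as success.
    return "success" if code == 0 else "permanentFail"


# Example: `grep` exits with 1 when nothing matched, which need not be an error.
print(classify_exit(1, success_codes=(0, 1)))
print(classify_exit(75, temporary_fail_codes=(75,)))
print(classify_exit(2))
```

A "temporaryFail" label is what enables decision-making, e.g. retrying the execution, which a bare exit code in a run record does not express.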

So not everything is readily available in the right format, but missing bits can be added easily.
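
One example of such a missing bit is the absolute `file://` URL issue noted above. A sketch of how `location`/`path` values like the ones in the output object could be made relative to a dataset root follows. `relativize` is a hypothetical helper, and treating `/tmp/cwl/some` as the root is just the value from the demo.

```python
import posixpath
from urllib.parse import unquote, urlparse


def relativize(node, base_dir):
    """Return a copy of a JSON-like structure with absolute file locations
    and paths rewritten relative to `base_dir` (POSIX paths assumed)."""
    if isinstance(node, dict):
        result = {}
        for key, value in node.items():
            if key == "location" and isinstance(value, str) \
                    and value.startswith("file://"):
                # Strip the scheme, then express the path relative to base_dir.
                local = unquote(urlparse(value).path)
                result[key] = posixpath.relpath(local, base_dir)
            elif key == "path" and isinstance(value, str) \
                    and posixpath.isabs(value):
                result[key] = posixpath.relpath(value, base_dir)
            else:
                result[key] = relativize(value, base_dir)
        return result
    if isinstance(node, list):
        return [relativize(item, base_dir) for item in node]
    return node


output = {
    "dst": {
        "location": "file:///tmp/cwl/some/output.txt",
        "basename": "output.txt",
        "class": "File",
        "path": "/tmp/cwl/some/output.txt",
    }
}
print(relativize(output, "/tmp/cwl/some"))
```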

Going with the bagit as main/only output format seems unnecessarily complex. With a datalad dataset we can capture most/all info without taking apart the dataset worktree.

mih commented May 15, 2024

Closing. Continued in #14

@mih mih closed this as completed May 15, 2024
@github-project-automation github-project-automation bot moved this from workable to done in DataLad remake May 15, 2024