-
Notifications
You must be signed in to change notification settings - Fork 187
[ENH][BEP028] Specification update for BEP028 BIDS-Prov #2099
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
for more information, see https://pre-commit.ci
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some initial notes
!!! bug | ||
TODO: Environment not currently defined in the context |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what to be done about that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should be added.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there an IRI from PROV-O or from any other ontology we should map with Enviroment
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Environment
is a prov:Entity
!!! example "`Activity`, `Environment`, `Software` naming examples" | ||
- `bids:ds001734:prov#conversion-xfMMbHK1`: a conversion `Activity` described inside the `ds001734` dataset; | ||
- `bids::prov#fedora-uldfv058`: a Fedora based `Environment` described inside the current dataset. | ||
- `bids:derivatives:prov#fmriprep-r4kzzMt8`: the fMRIPrep `Software` described inside the `derivatives` dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I feel we would need some very explicit statement about uniqueness of Ids, and enforce that in validator somehow.
- `bids:derivatives:prov#fmriprep-r4kzzMt8`: the fMRIPrep `Software` described inside the `derivatives` dataset. | |
- `bids:derivatives:prov#fmriprep-r4kzzMt8`: the fMRIPrep `Software` described inside the `derivatives` dataset. | |
An `Id` must be unique, as to identify an instance and thus REQUIRED to be different across entries and different instances MUST have different `Id`s. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my suggestion above.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I added a sentence in the Consistency and uniqueness of Ids section, as suggested by Satra. Please let me know if it is ok for both of you.
|
||
!!! example | ||
The following graph represents examples of links between provenance records. In this example, `Entities` *sub-001_brainmask.nii* and *sub-001_T1w.nii* represent files used by `Activity` *Brain extraction* to generate another file represented by `Entity` *sub-001_T1w_preproc.nii*. `Activities` *Brain extraction* and *Move to MNI* were associated with the `Software` *FSL* and used `Environment` *Linux* as a software environment. | ||
 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is there a way to ship an original graph or ideally a collection of prov files + script to produce such a graph?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For now, the bids_prov.merge
and bids_prov.visualize
python modules are available in https://github.com/bids-standard/BEP028_BIDSprov/, to produce such a graph from a collection of prov files.
Would you like to have the prov files for this minimal example inside bids examples ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO prov files should be provided in this repo. makefile could be provided to generate them invoking code from that package, or even provide that code here (may be later).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@yarikoptic I just discussed this with Boris. I am not sure I follow here, can you detail a bit? I mean typically examples have been saved in BIDS-examples. Is there a specific reason here to instead have the prov files in the repo of the BIDS standard? If so, could you describe a bit more how this would look like.
} | ||
) }} | ||
|
||
The `dataset_description.json` file of a BIDS-Derivatives dataset MUST include the following key to describe provenance: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this is from the "common-principles", so better be referred to it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you suggest to remove all metadata tables from the Provenance at dataset level section because these metadata are already listed in src/modality-agnostic-files/dataset-description.json
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
looks very nice @bclenet - left some comments which should be very lightweight to address.
{ | ||
"Id": "bids::sub-01/func/sub-01_task-tonecounting_bold.nii", | ||
"Label": "sub-01_task-tonecounting_bold.nii", | ||
"AtLocation": "sub-01/func/sub-01_task-tonecounting_bold.nii", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
any reason not to use the same uri as Id
here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the Label
, I guess we want to use the name of the file (without directory path) so that is more human readable.
About AtLocation
, we describe this metadata as:
For input files, this is the relative path to the file on disk.
Should we be more specific for this definition ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i meant to use the bids uri here as well.
!!! example "`Activity`, `Environment`, `Software` naming examples" | ||
- `bids:ds001734:prov#conversion-xfMMbHK1`: a conversion `Activity` described inside the `ds001734` dataset; | ||
- `bids::prov#fedora-uldfv058`: a Fedora based `Environment` described inside the current dataset. | ||
- `bids:derivatives:prov#fmriprep-r4kzzMt8`: the fMRIPrep `Software` described inside the `derivatives` dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see my suggestion above.
|
||
## Minimal example | ||
|
||
Here is a comprehensive example that considers the following dataset: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for these examples, do consider the changes about study-level organization and what that means for prov.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hi @satra! I discussed this with Boris. Can you give us some more insights on what you mean by "study-level organization". Is this a change in the BIDS spec? Or do you mean a re-organization of the files within a BIDS folder for a given study?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here is the relevant doc: https://bids-specification.readthedocs.io/en/stable/common-principles.html#study-dataset this was merged with the recent release.
The following provenance record is defined in `prov/prov-dcm2niix_soft.json`, so its `Id` SHOULD start with `bids:<dataset>:prov#` or `bids::prov#`. | ||
|
||
```JSON | ||
{ | ||
"Software": [ | ||
{ | ||
"Id": "bids::prov#dcm2niix-70ug8pl5", | ||
"Label": "dcm2niix", | ||
"Version": "v1.1.3" | ||
} | ||
] | ||
} | ||
``` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure i'm following this one. does the Id
really need to match the filename, or would that simply be a good convention?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i say this because in reproschema we make the id
match the filename so the file can be accessed through a http server. here it doesn't seem necessary and also because the suffix-id would be unique.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just rephrased this ; is it ok ?
We simply give an example illustrating the Consistency and uniqueness of Ids section here. There is indeed no correlation between the file prov/prov-dcm2niix_soft.json
and the fact that the Id starts with bids:<dataset>:prov#
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Next batch of comments and recommendations
https://github.com/bids-standard/bids-specification/blob/master/macros_doc.md | ||
--> | ||
{{ MACROS___make_subobject_table("metadata.GeneratedBy.items") }} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All these removals is text and headings leave examples "hanging". If all of this is to move to the provenance section, surgery might be needed to become more thorough and have explicit reference to that section.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a link to the Provenance section in the description of GeneratedBy
to cover the fact that this table was moved to the Provenance section.
|
||
### Principles for encoding provenance in BIDS | ||
|
||
- Provenance information SHOULD be included in a BIDS dataset when possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I believe that GeneratedBy is just MAY, so here it then should also be such to be consistent
- Provenance information SHOULD be included in a BIDS dataset when possible. | |
- Provenance information MAY be included in a BIDS dataset when possible. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GeneratedBy
is RECOMMENDED in BIDS datasets and REQUIRED in BIDS derivative datasets.
|
||
- Provenance information SHOULD be included in a BIDS dataset when possible. | ||
- If provenance records are included, these MUST be described using the conventions detailed by this specification. | ||
- Provenance records MAY be used to reflect the provenance of a full dataset and/or of specific files at any level of the BIDS hierarchy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it is not a matter of possibility or enforcement, rather a factual statement
- Provenance records MAY be used to reflect the provenance of a full dataset and/or of specific files at any level of the BIDS hierarchy. | |
- Provenance records reflect the provenance of a full dataset and/or of specific files at any level of the BIDS hierarchy. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here, I think we want to say that provenance records may have different scopes. Would this sentence be better ?
Provenance records MAY describe the provenance of a full dataset or of specific files at any level of the BIDS hierarchy.
|
||
!!! example | ||
The following graph represents examples of links between provenance records. In this example, `Entities` *sub-001_brainmask.nii* and *sub-001_T1w.nii* represent files used by `Activity` *Brain extraction* to generate another file represented by `Entity` *sub-001_T1w_preproc.nii*. `Activities` *Brain extraction* and *Move to MNI* were associated with the `Software` *FSL* and used `Environment` *Linux* as a software environment. | ||
 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IMHO prov files should be provided in this repo. makefile could be provided to generate them invoking code from that package, or even provide that code here (may be later).
## Provenance at dataset level | ||
|
||
Provenance metadata MAY be stored inside the `dataset_description.json` of any BIDS dataset (or BIDS-Derivatives dataset) it applies to. | ||
This metadata describes the provenance of the whole dataset. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly to SidecarGeneratedBy
vs GeneratedBy
, we have ambiguity as to define the scope, ie - does it apply to all files? What if file had it's own records? We could instruct the same need for a copy but it might be too hard to keep consent etc...
We should spell it out one way or another here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure to fully understand what you mean.
But we can leave Provenance at dataset level as a (minimal) option to describe the provenance of the dataset as an Entity.
Of course this most probably contains partial information, but this does not prevent more precise descriptions from being included inside sidecars.
Hi @effigies , Here are quick questions about the schema modifications in this PR.
I think this is due to the fact that the Also, files inside derivative datasets lead to NOT_INCLUDED / ALL_FILENAME_RULES_HAVE_ISSUES / FILENAME_MISMATCH / ENTITY_WITH_NO_LABEL errors because their names do not conform to BIDS naming. It seems to be contradictory to this part of the spec Could you please give your thoughts about these ? Let me know if I should post these anywhere else. Thanks, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@bclenet - wondering if these could be created with mermaid markdown? this should not hold things up, but may be easier to embed and render if possible.
] | ||
} | ||
``` | ||
This snippet is an extract of the following comprehensive example: [Provenance records for fMRI preprocessing using `fMRIPrep`](https://github.com/bclenet/bids-examples/tree/BEP028_fmriprep/provenance_fmriprep). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if we are going to point to a separate repo, we should note that it will have to be maintained after the PR is merged. perhaps something to think about and consider whether these examples are transitioned into bids examples repo and linked here.
This is a work in progress PR proposing a specification update for BEP028 BIDS-Prov.
- [ ] being proofread
- [ ] validator error :
/prov/*
NOT_INCLUDED- [ ] validator error :
/prov/*.json
SIDECAR_WITHOUT_DATAFILE- [ ] validator error : derivative files are listed as NOT_INCLUDED / ALL_FILENAME_RULES_HAVE_ISSUES /FILENAME_MISMATCH / ENTITY_WITH_NO_LABEL