Skip to content

Conversation

bclenet
Copy link
Contributor

@bclenet bclenet commented Apr 10, 2025

This is a work in progress PR proposing a specification update for BEP028 BIDS-Prov.

bclenet and others added 30 commits March 18, 2025 11:06
@effigies effigies added this to the 1.11.0 milestone Aug 27, 2025
Copy link
Collaborator

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some initial notes

Comment on lines 379 to 380
!!! bug
TODO: Environment not currently defined in the context
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what to be done about that?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should be added.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there an IRI from PROV-O or from any other ontology we should map with Enviroment ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Environment is a prov:Entity

!!! example "`Activity`, `Environment`, `Software` naming examples"
- `bids:ds001734:prov#conversion-xfMMbHK1`: a conversion `Activity` described inside the `ds001734` dataset;
- `bids::prov#fedora-uldfv058`: a Fedora based `Environment` described inside the current dataset.
- `bids:derivatives:prov#fmriprep-r4kzzMt8`: the fMRIPrep `Software` described inside the `derivatives` dataset.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we would need some very explicit statement about uniqueness of Ids, and enforce that in validator somehow.

Suggested change
- `bids:derivatives:prov#fmriprep-r4kzzMt8`: the fMRIPrep `Software` described inside the `derivatives` dataset.
- `bids:derivatives:prov#fmriprep-r4kzzMt8`: the fMRIPrep `Software` described inside the `derivatives` dataset.
An `Id` must be unique, as to identify an instance and thus REQUIRED to be different across entries and different instances MUST have different `Id`s.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my suggestion above.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a sentence in the Consistency and uniqueness of Ids section, as suggested by Satra. Please let me know if it is ok for both of you.


!!! example
The following graph represents examples of links between provenance records. In this example, `Entities` *sub-001_brainmask.nii* and *sub-001_T1w.nii* represent files used by `Activity` *Brain extraction* to generate another file represented by `Entity` *sub-001_T1w_preproc.nii*. `Activities` *Brain extraction* and *Move to MNI* were associated with the `Software` *FSL* and used `Environment` *Linux* as a software environment.
![](../images/provenance_definitions_graph.png)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is there a way to ship an original graph or ideally a collection of prov files + script to produce such a graph?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For now, the bids_prov.merge and bids_prov.visualize python modules are available in https://github.com/bids-standard/BEP028_BIDSprov/, to produce such a graph from a collection of prov files.
Would you like to have the prov files for this minimal example inside bids examples ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO prov files should be provided in this repo. makefile could be provided to generate them invoking code from that package, or even provide that code here (may be later).

Copy link
Collaborator

@cmaumet cmaumet Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@yarikoptic I just discussed this with Boris. I am not sure I follow here, can you detail a bit? I mean typically examples have been saved in BIDS-examples. Is there a specific reason here to instead have the prov files in the repo of the BIDS standard? If so, could you describe a bit more how this would look like.

}
) }}

The `dataset_description.json` file of a BIDS-Derivatives dataset MUST include the following key to describe provenance:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is from the "common-principles", so better be referred to it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you suggest to remove all metadata tables from the Provenance at dataset level section because these metadata are already listed in src/modality-agnostic-files/dataset-description.json ?

Copy link
Collaborator

@satra satra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks very nice @bclenet - left some comments which should be very lightweight to address.

{
"Id": "bids::sub-01/func/sub-01_task-tonecounting_bold.nii",
"Label": "sub-01_task-tonecounting_bold.nii",
"AtLocation": "sub-01/func/sub-01_task-tonecounting_bold.nii",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any reason not to use the same uri as Id here?

Copy link
Contributor Author

@bclenet bclenet Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For the Label, I guess we want to use the name of the file (without directory path) so that is more human readable.
About AtLocation, we describe this metadata as:

For input files, this is the relative path to the file on disk.

Should we be more specific for this definition ?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i meant to use the bids uri here as well.

!!! example "`Activity`, `Environment`, `Software` naming examples"
- `bids:ds001734:prov#conversion-xfMMbHK1`: a conversion `Activity` described inside the `ds001734` dataset;
- `bids::prov#fedora-uldfv058`: a Fedora based `Environment` described inside the current dataset.
- `bids:derivatives:prov#fmriprep-r4kzzMt8`: the fMRIPrep `Software` described inside the `derivatives` dataset.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see my suggestion above.


## Minimal example

Here is a comprehensive example that considers the following dataset:
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for these examples, do consider the changes about study-level organization and what that means for prov.

Copy link
Collaborator

@cmaumet cmaumet Sep 9, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @satra! I discussed this with Boris. Can you give us some more insights on what you mean by "study-level organization". Is this a change in the BIDS spec? Or do you mean a re-organization of the files within a BIDS folder for a given study?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here is the relevant doc: https://bids-specification.readthedocs.io/en/stable/common-principles.html#study-dataset this was merged with the recent release.

Comment on lines 450 to 462
The following provenance record is defined in `prov/prov-dcm2niix_soft.json`, so its `Id` SHOULD start with `bids:<dataset>:prov#` or `bids::prov#`.

```JSON
{
"Software": [
{
"Id": "bids::prov#dcm2niix-70ug8pl5",
"Label": "dcm2niix",
"Version": "v1.1.3"
}
]
}
```
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure i'm following this one. does the Id really need to match the filename, or would that simply be a good convention?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i say this because in reproschema we make the id match the filename so the file can be accessed through a http server. here it doesn't seem necessary and also because the suffix-id would be unique.

Copy link
Contributor Author

@bclenet bclenet Sep 3, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just rephrased this ; is it ok ?

We simply give an example illustrating the Consistency and uniqueness of Ids section here. There is indeed no correlation between the file prov/prov-dcm2niix_soft.json and the fact that the Id starts with bids:<dataset>:prov#

Copy link
Collaborator

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Next batch of comments and recommendations

https://github.com/bids-standard/bids-specification/blob/master/macros_doc.md
-->
{{ MACROS___make_subobject_table("metadata.GeneratedBy.items") }}

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All these removals is text and headings leave examples "hanging". If all of this is to move to the provenance section, surgery might be needed to become more thorough and have explicit reference to that section.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is a link to the Provenance section in the description of GeneratedBy to cover the fact that this table was moved to the Provenance section.


### Principles for encoding provenance in BIDS

- Provenance information SHOULD be included in a BIDS dataset when possible.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe that GeneratedBy is just MAY, so here it then should also be such to be consistent

Suggested change
- Provenance information SHOULD be included in a BIDS dataset when possible.
- Provenance information MAY be included in a BIDS dataset when possible.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GeneratedBy is RECOMMENDED in BIDS datasets and REQUIRED in BIDS derivative datasets.


- Provenance information SHOULD be included in a BIDS dataset when possible.
- If provenance records are included, these MUST be described using the conventions detailed by this specification.
- Provenance records MAY be used to reflect the provenance of a full dataset and/or of specific files at any level of the BIDS hierarchy.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it is not a matter of possibility or enforcement, rather a factual statement

Suggested change
- Provenance records MAY be used to reflect the provenance of a full dataset and/or of specific files at any level of the BIDS hierarchy.
- Provenance records reflect the provenance of a full dataset and/or of specific files at any level of the BIDS hierarchy.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here, I think we want to say that provenance records may have different scopes. Would this sentence be better ?

Provenance records MAY describe the provenance of a full dataset or of specific files at any level of the BIDS hierarchy.


!!! example
The following graph represents examples of links between provenance records. In this example, `Entities` *sub-001_brainmask.nii* and *sub-001_T1w.nii* represent files used by `Activity` *Brain extraction* to generate another file represented by `Entity` *sub-001_T1w_preproc.nii*. `Activities` *Brain extraction* and *Move to MNI* were associated with the `Software` *FSL* and used `Environment` *Linux* as a software environment.
![](../images/provenance_definitions_graph.png)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

IMHO prov files should be provided in this repo. makefile could be provided to generate them invoking code from that package, or even provide that code here (may be later).

## Provenance at dataset level

Provenance metadata MAY be stored inside the `dataset_description.json` of any BIDS dataset (or BIDS-Derivatives dataset) it applies to.
This metadata describes the provenance of the whole dataset.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly to SidecarGeneratedBy vs GeneratedBy, we have ambiguity as to define the scope, ie - does it apply to all files? What if file had it's own records? We could instruct the same need for a copy but it might be too hard to keep consent etc...

We should spell it out one way or another here

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure to fully understand what you mean.

But we can leave Provenance at dataset level as a (minimal) option to describe the provenance of the dataset as an Entity.
Of course this most probably contains partial information, but this does not prevent more precise descriptions from being included inside sidecars.

@effigies effigies added the BEP label Sep 4, 2025
@bclenet
Copy link
Contributor Author

bclenet commented Sep 11, 2025

Hi @effigies ,

Here are quick questions about the schema modifications in this PR.
I tested the BIDS examples listed in the description of this PR against a BIDS validator using the modified schema, but I still encounter errors.

  • Provenance files (.json) inside the prov directory are considered as sidecars, leading to SIDECAR_WITHOUT_DATAFILE error files
  • Provenance files (.json) inside the prov directory are NOT_INCLUDED according to the validator, although

I think this is due to the fact that the src/schema/rules/files/common/modality_agnostic.yaml file I've added has no impact for now.

Also, files inside derivative datasets lead to NOT_INCLUDED / ALL_FILENAME_RULES_HAVE_ISSUES / FILENAME_MISMATCH / ENTITY_WITH_NO_LABEL errors because their names do not conform to BIDS naming. It seems to be contradictory to this part of the spec

Could you please give your thoughts about these ? Let me know if I should post these anywhere else.

Thanks,

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bclenet - wondering if these could be created with mermaid markdown? this should not hold things up, but may be easier to embed and render if possible.

]
}
```
This snippet is an extract of the following comprehensive example: [Provenance records for fMRI preprocessing using `fMRIPrep`](https://github.com/bclenet/bids-examples/tree/BEP028_fmriprep/provenance_fmriprep).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if we are going to point to a separate repo, we should note that it will have to be maintained after the PR is merged. perhaps something to think about and consider whether these examples are transitioned into bids examples repo and linked here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants