
Rethinking pseudonymization #832

@mih

Description

I have been unable to use heudiconv whenever I combine -o and -a to split the destinations for pseudonymized (anonymized) and clear-ID outputs. I always run into this kind of error:

nipype.pipeline.engine.nodes.NodeExecutionError: Exception raised while executing Node convert.

Cmdline:
        dcm2niix -b y -z y -x n -t n -m 0 -f sub-aachen_T1w_heudiconv484 -o sub-aachen/anat -s n -v n /tmp/dcm2niixz3uhnlmj/convert
Stdout:
        Chris Rorden's dcm2niiX version v1.0.20241211  (JP2:OpenJPEG) GCC14.2.0 x86-64 (64-bit Linux)
Stderr:
        Error: Output folder invalid: sub-aachen/anat
Traceback:
        RuntimeError: subprocess exited with code 6.

With the data I am working with (individual DICOM files, sorted into some directory structure), I appear to be forced to use --files. Using -d, I run into:

IsADirectoryError: [Errno 21] Is a directory: '/tmp/bids/sourcedata/sessions/aachen/dicom'
[INFO   ] == Command exit (modification check follows) ===== 

That all being said: I am looking for a setup that is suitable for use in a minimize-personal-data-leakage scenario -- and usable in a datalad provenance tracking context. This means that I need to arrive at a CLI call that does not include any identifiers used in the DICOM dataset organization or file names. Otherwise they would "leak" into the provenance record of the BIDS output dataset. The existing pseudonymization features appear unsuitable for that, because they ultimately require specifying the "original" subject identifiers in the heudiconv call.

Alternative approach to pseudonymization

Here is a sketch of a setup that I am presently favoring, which should achieve that (and coincidentally does not require heudiconv support). I am describing it in slightly more detail than is needed to understand its relation to heudiconv functionality; maybe it is helpful for someone else.

First the full code to perform the conversion via datalad and an apptainer-driven heudiconv container.

datalad create bids
cd bids
datalad clone -d . https://hub.trr379.de/q02/heudiconv-container.git code/heudiconv
datalad clone -d . https://hub.trr379.de/q01/phantom-mri-dicoms.git sourcedata
datalad containers-run \
  -m "Convert subject 001 data" \
  -n code/heudiconv/apptainer \
  -o sub-001 -o .heudiconv/001 \
  -i sourcedata/code/heuristic-q01.py \
  -- \
  --bids notop --overwrite --minmeta \
  -o . \
  -f '{inputs[0]}' \
  -s 001 \
  --files 'sourcedata/sessions/$(python3 sourcedata/reidentify 001)'

(Note the single-quoted value of the --files parameter: it causes the reidentify utility to run inside the heudiconv container.)
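For illustration, the reidentify helper can be as small as a lookup in a mapping table that never leaves protected infrastructure. This is a hypothetical sketch, not the actual implementation: the TSV mapping format and any file locations are assumptions, and a one-line `print(reidentify(sys.argv[1], mapping_path))` wrapper would make it callable from the command line as above.

```python
from pathlib import Path

# Hypothetical sketch of a 'reidentify' helper: translate a
# pseudonymized subject ID into the institutional acquisition ID.
# The mapping file format (TSV: pseudo<TAB>original) is an
# assumption for illustration only.

def reidentify(pseudo_id, mapping_file):
    """Return the institutional ID recorded for a pseudonymized ID."""
    for line in Path(mapping_file).read_text().splitlines():
        pseudo, original = line.split('\t')
        if pseudo == pseudo_id:
            return original
    raise KeyError(f'no mapping for {pseudo_id!r}')
```

The mapping file itself must, of course, be kept inside the protected sourcedata/ dataset (or elsewhere with restricted access), never in the BIDS output dataset.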

Components

[All referenced datasets are public examples and can be inspected for more detail.]

code/heudiconv/: a datalad dataset with a configured heudiconv container. Using a dataset for this has the advantage that a common, separately maintained setup can be reused, rather than inventing something for a specific study only.

sourcedata/: a datalad dataset that tracks the DICOM acquisitions for a study as individual datalad (sub)datasets. This sourcedata/ dataset also provides the heudiconv heuristic (a more or less arbitrary decision; it could also come from elsewhere, but if it is not a standard one (like reproin), it makes sense to keep it with the rest of the study sources). Importantly, this dataset also provides a reidentification helper that translates a pseudonymized ID into the identifier used to select the particular DICOMs (directories) associated with an individual subject. Within this sourcedata/ dataset, institutional acquisition identifiers can be used, but it is also meaningful to apply a first level of pseudonymization at the acquisition level here, and only have institutional identifiers in the individual DICOM acquisition subdatasets.

Outcome

The code above yields a dataset state with a provenance record that (thanks to heudiconv and dcm2niix) affords re-execution producing a bit-identical outcome (which is extremely helpful! THANKS!).

The provenance record looks like this (and does not contain the original, non-pseudonymized subject identifier, aachen):

    {
     "chain": [],
     "cmd": "apptainer exec code/heudiconv/apptainer/nipy-heudiconv--1.3.3.sing bash -c 'export PATH=/opt/dcm2niix-v1.0.20240202/bin:/opt/miniconda-py39_4.12.0/bin:$PATH && heudiconv --bids notop --overwrite --minmeta -o . -f '{inputs[0]}' -s 001 --files 'sourcedata/sessions/$(python3 sourcedata/reidentify 001)''",
     "dsid": "07efeddd-26eb-4345-aabe-bac882205d71",
     "exit": 0,
     "extra_inputs": [
      "code/heudiconv/apptainer/nipy-heudiconv--1.3.3.sing"
     ],
     "inputs": [
      "sourcedata/code/heuristic-q01.py"
     ],
     "outputs": [
      "sub-001",
      ".heudiconv/001"
     ],
     "pwd": "."
    }

The one major insufficiency of this approach is that the key data input dependency (the source DICOMs for subject 001) is not declared. This is not possible, because the true path contains an identifier that must not be leaked. It is also not possible to use the reidentify script, because the subdataset containing it is not guaranteed to be present yet either.

This means that the data provisioning cannot be handled by datalad based on the provenance record alone.

It would be an option to declare a dependency on sourcedata/ as a whole. But this is not desirable in my use case (1000+ subjects, each with multiple acquisitions).

It would be an option to provide a "linkfarm" inside sourcedata/ that links the true DICOM subdataset locations via symlinks using only pseudonymized IDs in their names. However, this is also not desirable in my case, because one sourcedata/ DICOM dataset needs to be converted into a large number of BIDS datasets, each using its own pseudonymization setup. Maintaining the linkfarm would be quite an effort, and systems without filesystem support for symlinks would be tricky to handle.
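For completeness, here is what generating such a linkfarm would entail. This is a hypothetical sketch (function name, mapping format, and directory layout are all assumptions); it illustrates the maintenance burden rather than recommending the approach.

```python
import os
from pathlib import Path


def build_linkfarm(mapping, dicom_root, farm_dir):
    """Create symlinks named by pseudonymized IDs that point at the
    true (identifier-bearing) DICOM dataset locations.

    'mapping' is {pseudo_id: institutional_id}; the directory layout
    is an assumption for illustration.
    """
    farm = Path(farm_dir)
    farm.mkdir(parents=True, exist_ok=True)
    for pseudo, original in mapping.items():
        link = farm / pseudo
        target = Path(dicom_root) / original
        if link.is_symlink():
            link.unlink()
        # use relative symlinks so the farm survives dataset relocation
        link.symlink_to(os.path.relpath(target, farm))
```

Every new BIDS derivative with its own pseudonymization scheme would need its own farm kept in sync with the mapping, which is exactly the maintenance effort described above.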

Data protection risk assessment

The top-level dataset resulting from the above procedure tracks the BIDS/NIfTI-format data. However, hardly any content is tracked directly with Git, as documented below.

Full output of `git fast-export HEAD` (user-facing branch)
blob
mark :1
data 32
config annex.largefiles=nothing

blob
mark :2
data 63
[datalad "dataset"]
        id = 07efeddd-26eb-4345-aabe-bac882205d71

blob
mark :3
data 55
* annex.backend=MD5E
**/.git* annex.largefiles=nothing

reset refs/heads/main
commit refs/heads/main
mark :4
author Michael Hanke <[email protected]> 1756625245 +0200
committer Michael Hanke <[email protected]> 1756625245 +0200
data 22
[DATALAD] new dataset
M 100644 :1 .datalad/.gitattributes
M 100644 :2 .datalad/config
M 100644 :3 .gitattributes

blob
mark :5
data 225
[submodule "code/heudiconv"]
        path = code/heudiconv
        url = https://hub.trr379.de/q02/heudiconv-container.git
        datalad-id = eca069c6-f79d-4ed5-bee8-cebea17709cb
        datalad-url = https://hub.trr379.de/q02/heudiconv-container.git

commit refs/heads/main
mark :6
author Michael Hanke <[email protected]> 1756625255 +0200
committer Michael Hanke <[email protected]> 1756625255 +0200
data 27
[DATALAD] Added subdataset
from :4
M 100644 :5 .gitmodules
M 160000 079c3e842182ca08f5812f8b0eac166228d30ece code/heudiconv

blob
mark :7
data 440
[submodule "code/heudiconv"]
        path = code/heudiconv
        url = https://hub.trr379.de/q02/heudiconv-container.git
        datalad-id = eca069c6-f79d-4ed5-bee8-cebea17709cb
        datalad-url = https://hub.trr379.de/q02/heudiconv-container.git
[submodule "sourcedata"]
        path = sourcedata
        url = https://hub.trr379.de/q01/phantom-mri-dicoms.git
        datalad-id = ce3c1975-413c-420c-9a83-f06d45b50880
        datalad-url = https://hub.trr379.de/q01/phantom-mri-dicoms.git

commit refs/heads/main
mark :8
author Michael Hanke <[email protected]> 1756625269 +0200
committer Michael Hanke <[email protected]> 1756625269 +0200
data 27
[DATALAD] Added subdataset
from :6
M 100644 :7 .gitmodules
M 160000 7d3e1c35b8a620593e07571a9ac460000d2de2af sourcedata

blob
mark :9
data 141
../../../.git/annex/objects/7K/mZ/MD5E-s2010--b5c39f810e4e506ef07692f3afc4ab9e.auto.txt/MD5E-s2010--b5c39f810e4e506ef07692f3afc4ab9e.auto.txt
blob
mark :10
data 141
../../../.git/annex/objects/Q4/70/MD5E-s2010--b5c39f810e4e506ef07692f3afc4ab9e.edit.txt/MD5E-s2010--b5c39f810e4e506ef07692f3afc4ab9e.edit.txt
blob
mark :11
data 131
../../../.git/annex/objects/34/pZ/MD5E-s4738--276821c0ca36fcdda7f4e1fcc34dd54c.tsv/MD5E-s4738--276821c0ca36fcdda7f4e1fcc34dd54c.tsv
blob
mark :12
data 135
../../../.git/annex/objects/k6/FG/MD5E-s76571--ee3104153096532204302a421d4bf008.json/MD5E-s76571--ee3104153096532204302a421d4bf008.json
blob
mark :13
data 129
../../../.git/annex/objects/Gg/04/MD5E-s6118--b3a3c788aa562963310e9c30d3892f7a.py/MD5E-s6118--b3a3c788aa562963310e9c30d3892f7a.py
blob
mark :14
data 130
../../.git/annex/objects/vm/J9/MD5E-s2281--3497c389890035b63646ecdea6c7d8fb.json/MD5E-s2281--3497c389890035b63646ecdea6c7d8fb.json
blob
mark :15
data 142
../../.git/annex/objects/J4/VF/MD5E-s16630016--67802d9a81a5324faf49eb25dff5faa6.nii.gz/MD5E-s16630016--67802d9a81a5324faf49eb25dff5faa6.nii.gz
blob
mark :16
data 130
../../.git/annex/objects/zw/Mj/MD5E-s2420--8fb30965880e0c3a26619367ffe75d40.json/MD5E-s2420--8fb30965880e0c3a26619367ffe75d40.json
blob
mark :17
data 142
../../.git/annex/objects/zf/6f/MD5E-s14559806--bc03ce1530c2600d5d7bf55728c5b53e.nii.gz/MD5E-s14559806--bc03ce1530c2600d5d7bf55728c5b53e.nii.gz
blob
mark :18
data 128
../../.git/annex/objects/90/wk/MD5E-s340--45d49b5c77a98584f509e2e3c260346d.bval/MD5E-s340--45d49b5c77a98584f509e2e3c260346d.bval
blob
mark :19
data 130
../../.git/annex/objects/V6/7J/MD5E-s1885--0b3d83e4d4b76be65033d684d3a42470.bvec/MD5E-s1885--0b3d83e4d4b76be65033d684d3a42470.bvec
blob
mark :20
data 130
../../.git/annex/objects/vK/04/MD5E-s3217--88217e8074f9a9a5345564d827375e53.json/MD5E-s3217--88217e8074f9a9a5345564d827375e53.json
blob
mark :21
data 144
../../.git/annex/objects/GF/1J/MD5E-s144018434--1b991de8707c01933cf7b70ee72f4d68.nii.gz/MD5E-s144018434--1b991de8707c01933cf7b70ee72f4d68.nii.gz
blob
mark :22
data 126
../../.git/annex/objects/8P/24/MD5E-s11--18f27c2f7d350592cf726bff51068858.bval/MD5E-s11--18f27c2f7d350592cf726bff51068858.bval
blob
mark :23
data 126
../../.git/annex/objects/WG/pM/MD5E-s33--cba8264586708b2eaa0f610f6c776aad.bvec/MD5E-s33--cba8264586708b2eaa0f610f6c776aad.bvec
blob
mark :24
data 130
../../.git/annex/objects/9z/7m/MD5E-s3201--d02d1bb431a5564a3f959d60e5eda88a.json/MD5E-s3201--d02d1bb431a5564a3f959d60e5eda88a.json
blob
mark :25
data 142
../../.git/annex/objects/qM/j1/MD5E-s11214965--e17d1f29f3790b22232429d405a4b80d.nii.gz/MD5E-s11214965--e17d1f29f3790b22232429d405a4b80d.nii.gz
blob
mark :26
data 128
../../.git/annex/objects/WP/17/MD5E-s479--ca6f1d8b58ed10ca7dc007404f2d74f3.bval/MD5E-s479--ca6f1d8b58ed10ca7dc007404f2d74f3.bval
blob
mark :27
data 130
../../.git/annex/objects/1P/g2/MD5E-s2770--02345a06531ba20e61332cb54e6d8540.bvec/MD5E-s2770--02345a06531ba20e61332cb54e6d8540.bvec
blob
mark :28
data 130
../../.git/annex/objects/M4/kj/MD5E-s3255--f4c668fd83e9ee7a7aaf356f7019f72e.json/MD5E-s3255--f4c668fd83e9ee7a7aaf356f7019f72e.json
blob
mark :29
data 144
../../.git/annex/objects/P4/kZ/MD5E-s195106707--b98fcf14bad1cafa1f148ec6359c17d3.nii.gz/MD5E-s195106707--b98fcf14bad1cafa1f148ec6359c17d3.nii.gz
blob
mark :30
data 130
../../.git/annex/objects/63/gV/MD5E-s3259--7b0a2f2272f828654f7bf027fa3962ce.json/MD5E-s3259--7b0a2f2272f828654f7bf027fa3962ce.json
blob
mark :31
data 142
../../.git/annex/objects/Mz/V8/MD5E-s11502025--1f9042bd5aadf570036c4280aa571eb4.nii.gz/MD5E-s11502025--1f9042bd5aadf570036c4280aa571eb4.nii.gz
blob
mark :32
data 130
../../.git/annex/objects/Q6/37/MD5E-s2935--17810a7cf136c891458665fbec767c86.json/MD5E-s2935--17810a7cf136c891458665fbec767c86.json
blob
mark :33
data 140
../../.git/annex/objects/xP/g4/MD5E-s2800702--930ded27c217cd7d6487abeaff6b932b.nii.gz/MD5E-s2800702--930ded27c217cd7d6487abeaff6b932b.nii.gz
blob
mark :34
data 130
../../.git/annex/objects/1j/MP/MD5E-s2943--7b35241683424727145a6fbd999a5e49.json/MD5E-s2943--7b35241683424727145a6fbd999a5e49.json
blob
mark :35
data 140
../../.git/annex/objects/2g/pF/MD5E-s2767826--33b1da7b14a76b5c7b156937a2e53a68.nii.gz/MD5E-s2767826--33b1da7b14a76b5c7b156937a2e53a68.nii.gz
blob
mark :36
data 130
../../.git/annex/objects/Gv/GQ/MD5E-s2982--27abcc738236f991acaa6f369ec90e04.json/MD5E-s2982--27abcc738236f991acaa6f369ec90e04.json
blob
mark :37
data 144
../../.git/annex/objects/MQ/P9/MD5E-s460389665--e79aaba0b6cf97561bf7e2169fd7018e.nii.gz/MD5E-s460389665--e79aaba0b6cf97561bf7e2169fd7018e.nii.gz
blob
mark :38
data 123
../.git/annex/objects/6Q/GP/MD5E-s700--2f28a67f65535bc4806a814476c18ee0.tsv/MD5E-s700--2f28a67f65535bc4806a814476c18ee0.tsv
commit refs/heads/main
mark :39
author Michael Hanke <[email protected]> 1756629742 +0200
committer Michael Hanke <[email protected]> 1756629742 +0200
data 698
[DATALAD RUNCMD] Convert subject 001 data

=== Do not change lines below ===
{
 "chain": [],
 "cmd": "apptainer exec code/heudiconv/apptainer/nipy-heudiconv--1.3.3.sing bash -c 'export PATH=/opt/dcm2niix-v1.0.20240202/bin:/opt/miniconda-py39_4.12.0/bin:$PATH && heudiconv --bids notop --overwrite --minmeta -o . -f '{inputs[0]}' -s 001 --files 'sourcedata/sessions/$(python3 sourcedata/reidentify 001)''",
 "dsid": "07efeddd-26eb-4345-aabe-bac882205d71",
 "exit": 0,
 "extra_inputs": [
  "code/heudiconv/apptainer/nipy-heudiconv--1.3.3.sing"
 ],
 "inputs": [
  "sourcedata/code/heuristic-q01.py"
 ],
 "outputs": [
  "sub-001",
  ".heudiconv/001"
 ],
 "pwd": "."
}
^^^ Do not change lines above ^^^
from :8
M 120000 :9 .heudiconv/001/info/001.auto.txt
M 120000 :10 .heudiconv/001/info/001.edit.txt
M 120000 :11 .heudiconv/001/info/dicominfo.tsv
M 120000 :12 .heudiconv/001/info/filegroup.json
M 120000 :13 .heudiconv/001/info/heuristic.py
M 120000 :14 sub-001/anat/sub-001_T1w.json
M 120000 :15 sub-001/anat/sub-001_T1w.nii.gz
M 120000 :16 sub-001/anat/sub-001_T2w.json
M 120000 :17 sub-001/anat/sub-001_T2w.nii.gz
M 120000 :18 sub-001/dwi/sub-001_acq-b1200_dwi.bval
M 120000 :19 sub-001/dwi/sub-001_acq-b1200_dwi.bvec
M 120000 :20 sub-001/dwi/sub-001_acq-b1200_dwi.json
M 120000 :21 sub-001/dwi/sub-001_acq-b1200_dwi.nii.gz
M 120000 :22 sub-001/dwi/sub-001_acq-b1200_sbref.bval
M 120000 :23 sub-001/dwi/sub-001_acq-b1200_sbref.bvec
M 120000 :24 sub-001/dwi/sub-001_acq-b1200_sbref.json
M 120000 :25 sub-001/dwi/sub-001_acq-b1200_sbref.nii.gz
M 120000 :26 sub-001/dwi/sub-001_acq-mshell_dwi.bval
M 120000 :27 sub-001/dwi/sub-001_acq-mshell_dwi.bvec
M 120000 :28 sub-001/dwi/sub-001_acq-mshell_dwi.json
M 120000 :29 sub-001/dwi/sub-001_acq-mshell_dwi.nii.gz
M 120000 :22 sub-001/dwi/sub-001_acq-mshell_sbref.bval
M 120000 :23 sub-001/dwi/sub-001_acq-mshell_sbref.bvec
M 120000 :30 sub-001/dwi/sub-001_acq-mshell_sbref.json
M 120000 :31 sub-001/dwi/sub-001_acq-mshell_sbref.nii.gz
M 120000 :32 sub-001/fmap/sub-001_dir-ap_run-1_epi.json
M 120000 :33 sub-001/fmap/sub-001_dir-ap_run-1_epi.nii.gz
M 120000 :34 sub-001/fmap/sub-001_dir-pa_run-1_epi.json
M 120000 :35 sub-001/fmap/sub-001_dir-pa_run-1_epi.nii.gz
M 120000 :36 sub-001/func/sub-001_task-rest_bold.json
M 120000 :37 sub-001/func/sub-001_task-rest_bold.nii.gz
M 120000 :38 sub-001/sub-001_scans.tsv
Full output of `git fast-export git-annex` (identity and availability metadata)
reset refs/heads/git-annex
commit refs/heads/git-annex
mark :1
author Michael Hanke <[email protected]> 1756625245 +0200
committer Michael Hanke <[email protected]> 1756625245 +0200
data 15
branch created

blob
mark :2
data 83
96f9ea23-28cf-4e81-a388-0d70fc430e8d mih@meiner:/tmp/bidsnew timestamp=1756625245s

commit refs/heads/git-annex
mark :3
author Michael Hanke <[email protected]> 1756625245 +0200
committer Michael Hanke <[email protected]> 1756625245 +0200
data 7
update
from :1
M 100644 :2 uuid.log

blob
mark :4
data 51
1756627031s 1 96f9ea23-28cf-4e81-a388-0d70fc430e8d

blob
mark :5
data 51
1756627030s 1 96f9ea23-28cf-4e81-a388-0d70fc430e8d

blob
mark :6
data 51
1756627032s 1 96f9ea23-28cf-4e81-a388-0d70fc430e8d

commit refs/heads/git-annex
mark :7
author Michael Hanke <[email protected]> 1756627032 +0200
committer Michael Hanke <[email protected]> 1756627032 +0200
data 7
update
from :3
M 100644 :4 1da/74f/MD5E-s33--cba8264586708b2eaa0f610f6c776aad.bvec.log
M 100644 :4 1e4/a8a/MD5E-s11--18f27c2f7d350592cf726bff51068858.bval.log
M 100644 :4 1e7/f04/MD5E-s479--ca6f1d8b58ed10ca7dc007404f2d74f3.bval.log
M 100644 :5 2f9/cde/MD5E-s2281--3497c389890035b63646ecdea6c7d8fb.json.log
M 100644 :5 394/c00/MD5E-s3217--88217e8074f9a9a5345564d827375e53.json.log
M 100644 :5 40e/a44/MD5E-s340--45d49b5c77a98584f509e2e3c260346d.bval.log
M 100644 :4 4ce/869/MD5E-s2943--7b35241683424727145a6fbd999a5e49.json.log
M 100644 :4 5e2/84a/MD5E-s2770--02345a06531ba20e61332cb54e6d8540.bvec.log
M 100644 :4 5e4/b48/MD5E-s2800702--930ded27c217cd7d6487abeaff6b932b.nii.gz.log
M 100644 :4 6af/a9e/MD5E-s3201--d02d1bb431a5564a3f959d60e5eda88a.json.log
M 100644 :4 706/7f7/MD5E-s2982--27abcc738236f991acaa6f369ec90e04.json.log
M 100644 :5 724/f02/MD5E-s6118--b3a3c788aa562963310e9c30d3892f7a.py.log
M 100644 :4 7f7/f07/MD5E-s144018434--1b991de8707c01933cf7b70ee72f4d68.nii.gz.log
M 100644 :4 83b/9cb/MD5E-s3259--7b0a2f2272f828654f7bf027fa3962ce.json.log
M 100644 :4 845/f3b/MD5E-s195106707--b98fcf14bad1cafa1f148ec6359c17d3.nii.gz.log
M 100644 :4 84c/e38/MD5E-s3255--f4c668fd83e9ee7a7aaf356f7019f72e.json.log
M 100644 :4 867/50e/MD5E-s2935--17810a7cf136c891458665fbec767c86.json.log
M 100644 :5 86d/bff/MD5E-s76571--ee3104153096532204302a421d4bf008.json.log
M 100644 :4 8a8/6ee/MD5E-s11502025--1f9042bd5aadf570036c4280aa571eb4.nii.gz.log
M 100644 :5 944/29b/MD5E-s14559806--bc03ce1530c2600d5d7bf55728c5b53e.nii.gz.log
M 100644 :4 969/efa/MD5E-s460389665--e79aaba0b6cf97561bf7e2169fd7018e.nii.gz.log
M 100644 :5 a40/51c/MD5E-s2010--b5c39f810e4e506ef07692f3afc4ab9e.edit.txt.log
M 100644 :5 b1c/aea/MD5E-s2420--8fb30965880e0c3a26619367ffe75d40.json.log
M 100644 :4 b2f/04d/MD5E-s2767826--33b1da7b14a76b5c7b156937a2e53a68.nii.gz.log
M 100644 :6 b6e/9f5/MD5E-s700--2f28a67f65535bc4806a814476c18ee0.tsv.log
M 100644 :5 c45/0cd/MD5E-s4738--276821c0ca36fcdda7f4e1fcc34dd54c.tsv.log
M 100644 :5 c4f/56d/MD5E-s16630016--67802d9a81a5324faf49eb25dff5faa6.nii.gz.log
M 100644 :4 da1/a32/MD5E-s11214965--e17d1f29f3790b22232429d405a4b80d.nii.gz.log
M 100644 :5 e67/e1d/MD5E-s1885--0b3d83e4d4b76be65033d684d3a42470.bvec.log
M 100644 :5 f95/13f/MD5E-s2010--b5c39f810e4e506ef07692f3afc4ab9e.auto.txt.log

blob
mark :8
data 51
1756629213s 1 96f9ea23-28cf-4e81-a388-0d70fc430e8d

commit refs/heads/git-annex
mark :9
author Michael Hanke <[email protected]> 1756629213 +0200
committer Michael Hanke <[email protected]> 1756629213 +0200
data 7
update
from :7
M 100644 :8 5db/0cb/MD5E-s101--610bf165083743f24eb04ba672c08b1c.tsv.log
M 100644 :8 7c0/b1c/MD5E-s1837--3a48d986a296cf56136bb2c33655f763.auto.txt.log
M 100644 :8 8f1/553/MD5E-s186--932cf1963a98118612e727f6757dac72.json.log
M 100644 :8 a9d/32c/MD5E-s1837--3a48d986a296cf56136bb2c33655f763.edit.txt.log
M 100644 :8 bc2/bfe/MD5E-s621--8fa96ba41aaa3217284ddb90d9ae0b36.tsv.log

This is critical for being able to address subject consent withdrawal and other events that require removing information from a dataset -- without having to rewrite history (which is undesirable, because it would break any Git-based version/state identifiers in any consuming metadata systems).

Importantly, all information on the source DICOMs (and participant identifier mappings) is hidden behind a GITSHA state identifier of the sourcedata/ dataset, its DataLad UUID, and the download URL.

Heudiconv's feature to populate top-level BIDS metadata files (like participants.tsv) has been disabled for two reasons: (1) it only covers information available in the DICOMs (which may be inaccurate due to institutional metadata policies), and (2) it might still contain information (like participant age) that is undesirable to share in a particular context. A dedicated, separate metadata injection implementation is likely to be needed in many cases.
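Such a dedicated injection step could be as simple as writing participants.tsv from a curated record set that is maintained outside the DICOMs. A minimal sketch (function name, column choice, and record source are assumptions; only fields cleared for sharing in the given context would be included):

```python
import csv


def write_participants_tsv(records, out_path):
    """Write a minimal BIDS participants.tsv from curated records.

    'records' is a list of dicts containing only the fields cleared
    for sharing; the column selection here is an example, not a
    recommendation.
    """
    fields = ['participant_id', 'sex']
    with open(out_path, 'w', newline='') as f:
        writer = csv.DictWriter(f, fieldnames=fields, delimiter='\t')
        writer.writeheader()
        for rec in records:
            # BIDS uses 'n/a' for missing values
            writer.writerow({k: rec.get(k, 'n/a') for k in fields})
```

Because the record source is decoupled from the DICOMs, institutional metadata inaccuracies and over-sharing of fields like age are avoided by construction.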

The top-level dataset includes the heudiconv conversion/override records as annex'ed files (only the key is tracked with Git):

.heudiconv
└── 001
    └── info
        ├── 001.auto.txt
        ├── 001.edit.txt
        ├── dicominfo.tsv
        ├── filegroup.json
        └── heuristic.py

The content of these files is problematic:

  • heuristic.py: might contain per-participant special-casing, and thereby participant identifiers
  • filegroup.json: contains the full original path of each DICOM file (likely to contain participant-related identifiers)
  • dicominfo.tsv: contains DICOM metadata excerpts (likely to contain participant-related identifiers)

With the approach above, this file content can be deposited (separately) on a protected infrastructure.

Alternatively, the .heudiconv/ directory can be configured to be gitignored and excluded from any tracking. This means that this information is lost at some point. However, given that the conversion is deterministic, it can be regenerated at any time. It would need to be kept, however, if participant-specific configuration overrides need to be implemented (001.edit.txt).

Yet another alternative would be to place .heudiconv/ into its own DataLad subdataset, and manage all conversion configuration related files separately in there.

Files tracked with git-annex might still contain information that is undesirable in a particular context. Examples are

  • exact acquisition time reports in *_scans.tsv
  • custom DICOM metadata in JSON sidecar files (e.g. ProcedureStepDescription or SeriesDescription) which could contain identifying information in some scenarios

However, a custom metadata post-processing script can remove or alter any of this information to produce a final dataset state without leaving undesirable traces in the Git history (as long as the original file content versions are discarded and not shared). An implementation of such a procedure could be kept alongside the reidentify tool in the sourcedata/ dataset.
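Such a post-processing step could be as small as a pass over the JSON sidecars that drops a configurable set of keys. A sketch under the assumption that key removal is sufficient (the key list is an example; which fields are actually problematic depends on the study and the applicable data protection policies):

```python
import json
from pathlib import Path

# Example field set only; not a recommendation or a complete list.
STRIP_KEYS = {'ProcedureStepDescription', 'SeriesDescription'}


def scrub_sidecar(path, strip_keys=STRIP_KEYS):
    """Remove the configured keys from a JSON sidecar file, in place."""
    sidecar = Path(path)
    meta = json.loads(sidecar.read_text())
    cleaned = {k: v for k, v in meta.items() if k not in strip_keys}
    sidecar.write_text(json.dumps(cleaned, indent=2, sort_keys=True))
    return cleaned
```

As noted above, this only achieves its goal if the pre-scrub file versions are discarded (or kept exclusively on protected infrastructure) rather than committed and shared.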
