Add option to anonymize acquisition datetimes in scans.tsv #265

tsalo · 2023-08-17T14:09:22Z

Following a discussion in today's informatics scrum, I was thinking that it would be nice to be able to anonymize acquisition datetimes in the scans.tsv files (and potentially in sidecar JSON files). @mattcieslak thought this could be made part of the purge-metadata command.

Desiderata:

Set the first scan's acquisition date to 1800-01-01.
Give users the option to either anonymize the full datetime or just anonymize the date (i.e., retain the time of day).
Preserve relative timing between scans in each session.
Preserve relative timing between sessions.
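
The desiderata above could be sketched roughly as follows. This is a minimal illustration, not code from the project or the eventual purge-metadata behavior: `shift_acq_times`, `BASELINE`, and the `keep_time_of_day` flag are hypothetical names, and the input is assumed to be ISO 8601 `acq_time` strings pooled across all of one subject's sessions.

```python
import pandas as pd

# Hypothetical baseline date, taken from the first desideratum.
BASELINE = pd.Timestamp("1800-01-01")


def shift_acq_times(acq_times, reference, keep_time_of_day=True):
    """Shift ISO 8601 datetimes so that `reference` lands on BASELINE.

    acq_times: Series of datetime strings (one subject, any number of sessions).
    reference: the subject's earliest acquisition datetime.
    keep_time_of_day: if True, shift by whole days only, so clock times
        are retained; if False, the reference scan maps to midnight.
    """
    times = pd.to_datetime(acq_times)
    reference = pd.Timestamp(reference)
    if keep_time_of_day:
        # Drop the time component so the offset is a whole number of days.
        reference = reference.normalize()
    offset = reference - BASELINE
    # One offset for every row keeps all intervals between scans
    # (within and between sessions) unchanged.
    # Note: this output format drops fractional seconds.
    return (times - offset).dt.strftime("%Y-%m-%dT%H:%M:%S")


acq = pd.Series(
    ["2023-08-17T14:09:22", "2023-08-17T14:39:22", "2023-09-07T12:50:36"]
)
first = pd.to_datetime(acq).min()

date_only = shift_acq_times(acq, first, keep_time_of_day=True)
# 1800-01-01T14:09:22, 1800-01-01T14:39:22, 1800-01-22T12:50:36

full = shift_acq_times(acq, first, keep_time_of_day=False)
# 1800-01-01T00:00:00, 1800-01-01T00:30:00, 1800-01-21T22:41:14
```

Both modes subtract a single per-subject offset, which is what keeps the timing between scans and between sessions intact; the only difference is whether the offset is rounded down to whole days.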

tsalo · 2023-09-07T12:50:36Z

Here's some code I've used to do this in another project:

"""Anonymize acquisition datetimes for a dataset.

Works for both longitudinal
and cross-sectional studies. The time of day is preserved, but the first
scan is set to January 1st, 1800. In a longitudinal study, each session is
anonymized relative to the first session, so that time between sessions is
preserved.

Overwrites scan tsv files in dataset. Only run this *after* data collection
is complete for the study, especially if it's longitudinal.
"""
import os
from glob import glob

import pandas as pd
from dateutil import parser

if __name__ == "__main__":
    dset_dir = "/path/to/dset"

    bl_dt = parser.parse("1800-01-01")

    subject_dirs = sorted(glob(os.path.join(dset_dir, "sub-*")))
    for subject_dir in subject_dirs:
        sub_id = os.path.basename(subject_dir)
        print(f"Processing {sub_id}")

        scans_files = sorted(glob(os.path.join(subject_dir, "ses-*/*_scans.tsv")))

        for i_ses, scans_file in enumerate(scans_files):
            ses_dir = os.path.dirname(scans_file)
            ses_name = os.path.basename(ses_dir)
            print(f"\t{ses_name}")

            df = pd.read_table(scans_file)
            if i_ses == 0:
                # Anonymize in terms of first scan for subject.
                first_scan = df["acq_time"].min()
                first_dt = parser.parse(first_scan.split("T")[0])
                diff = first_dt - bl_dt

            acq_times = df["acq_time"].apply(parser.parse)
            acq_times = (acq_times - diff).astype(str)
            df["acq_time"] = acq_times
            df["acq_time"] = df["acq_time"].str.replace(" ", "T")

            # Delete the original file instead of just overwriting it, for Datalad.
            os.remove(scans_file)

            df.to_csv(
                scans_file,
                sep="\t",
                lineterminator="\n",  # spelled line_terminator before pandas 1.5
                na_rep="n/a",
                index=False,
            )
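
One caveat with the script above: `pd.read_table` reads `n/a` entries as NaN by default, and `parser.parse` raises a `TypeError` on non-string input, so a scans file with a missing `acq_time` would likely stop the loop. A possible workaround, sketched under the assumption that such rows should simply stay `n/a`, is to parse with `pd.to_datetime(..., errors="coerce")`. The helper name `anonymize_scans_table` is made up for illustration.

```python
import pandas as pd


def anonymize_scans_table(df, offset):
    """Return a copy of a scans table with acq_time shifted back by `offset`.

    Rows whose acq_time is missing or unparseable become NaN, which
    `to_csv(na_rep="n/a")` then writes back out as "n/a".
    """
    out = df.copy()
    # errors="coerce" turns missing/unparseable entries into NaT
    # instead of raising.
    parsed = pd.to_datetime(out["acq_time"], errors="coerce")
    # strftime yields NaN for NaT rows; fractional seconds are dropped.
    out["acq_time"] = (parsed - offset).dt.strftime("%Y-%m-%dT%H:%M:%S")
    return out


# Hypothetical two-row scans table; the second scan has no acq_time.
scans = pd.DataFrame(
    {
        "filename": ["anat/sub-01_T1w.nii.gz", "func/sub-01_bold.nii.gz"],
        "acq_time": ["2023-08-17T14:09:22", None],
    }
)
offset = pd.Timestamp("2023-08-17") - pd.Timestamp("1800-01-01")
anonymized = anonymize_scans_table(scans, offset)

# With no path given, to_csv returns the TSV text instead of writing a file.
tsv_text = anonymized.to_csv(sep="\t", na_rep="n/a", index=False)
```

The per-subject `offset` would still need to come from the earliest valid `acq_time`, as in the script.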

tsalo added the enhancement New feature or request label Aug 17, 2023

mattcieslak self-assigned this Aug 28, 2023
