This repository contains a data pipeline for processing medical imaging data. It includes modules for anonymizing DICOM files, encrypting patient IDs, extracting metadata, and processing the data. The pipeline is flexible and extensible, allowing users to customize and expand its functionality to meet specific project requirements, and it is designed to handle large volumes of medical imaging data efficiently. Its modular design promotes code reusability, ease of maintenance, and future enhancements.
Below are the key functionalities encapsulated within the pipeline:
- Anonymization Module: This module is responsible for anonymizing DICOM files, ensuring the removal of sensitive patient-related information while adhering to regulatory compliance standards. It sanitizes the data by eliminating identifiable attributes, thereby safeguarding patient privacy.
- Encryption Module: The encryption module adds an extra layer of security by encrypting patient IDs, thus enhancing data protection measures. By encrypting sensitive identifiers, the module ensures that patient information remains confidential and inaccessible to unauthorized parties.
- Metadata Extraction: This module facilitates the extraction of metadata from DICOM files, enabling users to access valuable information embedded within the imaging data. It parses the DICOM headers to retrieve essential metadata attributes, providing insights into the imaging parameters and acquisition details.
- Data Processing: The data processing module orchestrates the sequential execution of various operations, including preprocessing, analysis, and transformation of medical imaging data. It streamlines the processing pipeline, enabling seamless integration of diverse data processing tasks.
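The anonymization step can be illustrated with a minimal sketch. Here a plain dictionary stands in for the DICOM dataset (the real module would read and rewrite tags with a DICOM library such as pydicom), and the tag list is illustrative, not the repository's actual configuration:

```python
# Minimal sketch of DICOM anonymization: strip identifying tags and
# substitute a stable anonymized patient ID. A dict stands in for the
# dataset; the tag list below is illustrative, not the repo's actual one.
IDENTIFYING_TAGS = {"PatientName", "PatientBirthDate", "PatientAddress"}

def anonymize(dataset: dict, anon_id: str) -> dict:
    """Return a copy of `dataset` with identifying tags removed and the
    real PatientID replaced by the anonymized one."""
    clean = {tag: value for tag, value in dataset.items()
             if tag not in IDENTIFYING_TAGS}
    clean["PatientID"] = anon_id
    return clean

ds = {"PatientName": "Doe^Jane", "PatientID": "12345",
      "PatientBirthDate": "19700101", "Modality": "MG"}
anon = anonymize(ds, "ANON_0001")
```

The original dataset is left untouched so the raw files remain available for the later verification steps.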
Encompassing these modules, the data pipeline provides a robust framework for effectively managing medical imaging data. Whether it involves anonymizing patient information, encrypting identifiers, extracting metadata, or processing imaging data, the pipeline offers a versatile solution tailored to meet the intricate demands of medical and biomedical imaging workflows (10.1007/s10278-021-00522-6). With its modular architecture, the pipeline facilitates seamless integration into existing healthcare systems and can be customized to accommodate specific use cases and requirements.
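The patient-ID protection step can be sketched as keyed-hash pseudonymization; the actual scheme used by the encryption module is not documented here, so the HMAC-SHA256 approach below is an assumption for illustration:

```python
import hmac
import hashlib

def pseudonymize_id(patient_id: str, key: bytes) -> str:
    """Derive a deterministic, non-reversible pseudonym from a patient ID
    using HMAC-SHA256 keyed with a project secret (assumed scheme)."""
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
    return "ANON_" + digest.hexdigest()[:12].upper()

key = b"project-secret"  # in practice, load from secure configuration
anon_a = pseudonymize_id("12345", key)
anon_b = pseudonymize_id("12345", key)
```

A keyed hash keeps the mapping deterministic (the same patient always gets the same pseudonym) while remaining infeasible to invert without the key.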
- `identifier.py`: This script processes DICOM files in the "checking" folder by extracting the SOP Instance UID and comparing it to files in the "raw" folder. If a match is found, it renames the file with the corresponding Anonymized Patient ID and moves it to the "identified" folder.
- `anonymizer.py`: Module for anonymizing DICOM files by removing patient-related information and renaming them according to a specified format.
- `encryption.py`: Module for encrypting patient IDs.
- `extractor.py`: Module for extracting metadata from DICOM files.
- `main.py`: Main script for executing the data processing pipeline.
- `processor.py`: Module for processing medical imaging data.
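The role of `extractor.py` can be sketched as pulling a fixed set of header attributes into a flat record for downstream use. The attribute names below are standard DICOM keywords, but the exact set the module extracts is an assumption, and a plain dict stands in for the parsed header:

```python
import csv
import io

# Standard DICOM keywords; the exact set extractor.py pulls is an assumption.
FIELDS = ["SOPInstanceUID", "PatientID", "Modality",
          "ImageLaterality", "ViewPosition"]

def extract_metadata(header: dict) -> dict:
    """Pull the selected attributes from a parsed header, defaulting to
    an empty string when an attribute is absent."""
    return {field: header.get(field, "") for field in FIELDS}

def rows_to_csv(rows: list) -> str:
    """Serialize extracted rows to CSV text for auditing."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

header = {"SOPInstanceUID": "1.2.3", "Modality": "MG", "ImageLaterality": "L"}
row = extract_metadata(header)
```

Defaulting missing attributes to an empty string keeps the CSV columns aligned even when headers are incomplete.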
To use the data pipeline, follow these steps:
- Clone the repository:

  ```shell
  git clone https://github.com/MIMBCD-UI/data-pipeline.git
  ```

- Install the required dependencies by creating a virtual environment and installing the packages listed in `requirements.txt`:

  ```shell
  cd data-pipeline
  python3 -m venv .venv
  source .venv/bin/activate
  pip3 install -r requirements.txt
  ```

- Run the main script to execute the data processing pipeline:

  ```shell
  python3 src/main.py
  ```
This section details the scripts involved in processing DICOM files within the MIMBCD-UI data pipeline. These scripts are responsible for handling various aspects of anonymization, metadata extraction, and file validation, ensuring the integrity and consistency of medical imaging data.
The data post-processing curation involves a series of steps to verify, anonymize, and validate DICOM files. These steps operate on the `curation/` folder of the `dataset-multimodal-breast` repository, which contains DICOM files at different stages of processing.
The following sequence outlines the steps involved in the post-processing pipeline inside the `curation/` folder. If at any stage a file is found to be incorrect, it is moved to the `curation/unsolvable/` folder; if the file is correct, it is moved to the `dicom/` folder.
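The correct/unsolvable triage can be sketched with a structural check plus a move. In practice the decision also involves metadata verification and human review; the sketch below uses only the DICOM magic number (a 128-byte preamble followed by `DICM`) as a simplified stand-in for "correct", and the folder layout is simulated with temporary directories:

```python
import shutil
import tempfile
from pathlib import Path

def is_dicom(path: Path) -> bool:
    """Check the DICOM magic number: 128-byte preamble followed by b'DICM'."""
    with open(path, "rb") as f:
        f.seek(128)
        return f.read(4) == b"DICM"

def triage(path: Path, dicom_dir: Path, unsolvable_dir: Path) -> Path:
    """Move a structurally valid file to dicom/, anything else to unsolvable/."""
    target = dicom_dir if is_dicom(path) else unsolvable_dir
    target.mkdir(parents=True, exist_ok=True)
    return Path(shutil.move(str(path), str(target / path.name)))

# Demonstration with temporary files standing in for the curation/ folder.
root = Path(tempfile.mkdtemp())
good = root / "good.dcm"
good.write_bytes(b"\x00" * 128 + b"DICM" + b"\x00" * 16)
bad = root / "bad.dcm"
bad.write_bytes(b"not a dicom file")
good_dest = triage(good, root / "dicom", root / "unsolvable")
bad_dest = triage(bad, root / "dicom", root / "unsolvable")
```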
The following scripts should be executed in sequence as part of the data processing pipeline. Each script serves a specific purpose and contributes to the overall goal of maintaining high-quality, anonymized medical imaging data.
- `identifier.py` - Initial DICOM File Identification
  - Purpose: This script processes DICOM files in the "checking" folder by extracting the SOP Instance UID and comparing it to files in the "raw" folder. If a match is found, it renames the file with the corresponding Anonymized Patient ID and moves it to the "identified" folder.
  - When to Run: Run this script first to identify and organize the DICOM files before any further processing.
  - Outcome: The files are identified and renamed based on the SOP Instance UID, making them ready for further processing.
- `laterality.py` - Initial Metadata Extraction and File Preparation
  - Purpose: This script processes DICOM files by converting anonymized patient IDs to their corresponding real patient IDs. It extracts critical metadata such as laterality (which side of the body the image represents) and renames/moves the files accordingly.
  - When to Run: Run this script after `identifier.py` to further organize and prepare the DICOM files.
  - Outcome: The files are organized with accurate metadata, making them ready for comparison and validation.
- `compare.py` - Verification of Anonymized and Non-Anonymized File Correspondence
  - Purpose: This script compares anonymized and non-anonymized DICOM files to ensure they match based on metadata like `InstanceNumber`, `ViewPosition`, and `ImageLaterality`. It also renames the files and moves them to a "checked" directory for further processing.
  - When to Run: Run this script after `laterality.py` to verify the correspondence between anonymized and non-anonymized files.
  - Outcome: Matched files are confirmed and organized in the "checked" directory.
- `checker.py` - File Comparison and Logging
  - Purpose: This script provides an additional verification step by comparing anonymized and non-anonymized DICOM files based on `InstanceNumber`. It logs the paths of matching files to a CSV file for auditing and further analysis.
  - When to Run: Execute this script after `compare.py` to ensure a documented trail of matched files.
  - Outcome: A CSV file is generated, listing the paths of successfully matched files, ensuring traceability in the pipeline.
- `reanonimyzer.py` - Final Correction and Re-Anonymization
  - Purpose: The final script in the sequence, `reanonimyzer.py`, corrects any discrepancies in the anonymized patient IDs and metadata based on predefined mappings. It updates the filenames and DICOM metadata as necessary and moves the corrected files to the final "checked" directory.
  - When to Run: This script should be run last, after `checker.py`, to finalize the anonymization and ensure data consistency.
  - Outcome: The DICOM files are fully re-anonymized, with all metadata and filenames accurately reflecting the correct anonymized patient IDs, ensuring they are ready for secure storage or further analysis.
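The identification step at the start of this sequence boils down to a lookup on the SOP Instance UID. In the sketch below, plain dicts stand in for the header reads and file moves, and the renaming format (`<anonymized ID>.dcm`) is an assumption, not the pipeline's actual convention:

```python
def match_files(checking_uids: dict, raw_index: dict) -> dict:
    """Map each file in the "checking" folder to its new anonymized name.

    `checking_uids` maps filename -> SOP Instance UID (read from the
    header); `raw_index` maps SOP Instance UID -> anonymized patient ID.
    Files whose UID has no match in the raw index are left out, so they
    can be routed to manual review instead.
    """
    renamed = {}
    for filename, uid in checking_uids.items():
        anon_id = raw_index.get(uid)
        if anon_id is not None:
            renamed[filename] = anon_id + ".dcm"
    return renamed

checking = {"img001.dcm": "1.2.840.1", "img002.dcm": "1.2.840.9"}
raw = {"1.2.840.1": "ANON_0007"}
result = match_files(checking, raw)
```

Matching on the SOP Instance UID works because it uniquely identifies an image instance, so it survives renaming and anonymization of the surrounding metadata.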
- To execute the pipeline, follow the order outlined above:

  ```shell
  # Step 1: Run main.py
  python3 src/main.py
  ```

- Step 2 (manual): open the `curation/verifynig/` folder and move the files to the `curation/checking/` folder.

- Then run the remaining scripts in sequence:

  ```shell
  # Step 3: Run identifier.py
  python3 src/identifier.py

  # Step 4: Run laterality.py
  python3 src/laterality.py

  # Step 5: Run compare.py
  python3 src/compare.py

  # Step 6: Run checker.py
  python3 src/checker.py

  # Step 7: Run reanonimyzer.py
  python3 src/reanonimyzer.py
  ```
Contributions are welcome! If you'd like to contribute to this project, please fork the repository and submit a pull request with your proposed changes.
This project is licensed under the MIT License.
Our team brings everything together, sharing ideas and a common purpose to develop even better work. This section lists the people who are important to this repository, along with their respective links.
- Francisco Maria Calisto [ Academic Website | ResearchGate | GitHub | Twitter | LinkedIn ]
- Diogo Araújo
- Carlos Santiago [ ResearchGate ]
- Catarina Barata
- Jacinto C. Nascimento [ ResearchGate ]
- João Fernandes [ ResearchGate ]
- Margarida Morais [ ResearchGate ]
- João Maria Abrantes [ ResearchGate ]
- Nuno Nunes [ ResearchGate ]
- Hugo Lencastre
- Nádia Mourão
- Miguel Bastos
- Pedro Diogo
- João Bernardo
- Madalena Pedreira
- Mauro Machado
- Bruno Dias
- Bruno Oliveira
- Luís Ribeiro Gomes
This work was partially supported by national funds by FCT through both UID/EEA/50009/2013 and LARSyS - FCT Project 2022.04485.PTDC (MIA-BREAST) projects hosted by IST, as well as both BL89/2017-IST-ID and PD/BD/150629/2020 grants. We are indebted to those who gave their time and expertise to evaluate our work, who among others are giving us crucial information for the BreastScreening project.
Our organization is a non-profit. However, we have many needs across our activities, from infrastructure to services, and we rely on time, contributions, and help to support our team and projects.
This project exists thanks to all the people who contribute. [Contribute].
Thank you to all our backers! 🙏 [Become a backer]
Support this project by becoming a sponsor. Your logo will show up here with a link to your website. [Become a sponsor]