CarDS-Yale/target-ai-shared

Repository for the project "TARGET-AI: a foundational approach for the targeted deployment of artificial intelligence electrocardiography in the electronic health record"

Authors: Evangelos K. Oikonomou, Bruno Batinica, Lovedeep S. Dhingra, Arya Aminorroaya, Andreas Coppi, Rohan Khera

Section of Cardiovascular Medicine, Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA. Cardiovascular Data Science (CarDS) Lab, Yale School of Medicine, New Haven, CT, USA.

This is the public code repository for the project "TARGET-AI: a foundational approach for the targeted deployment of artificial intelligence electrocardiography in the electronic health record". Parts of this code have been de-identified to remove references to potentially identifiable or health-system-specific information.

Model weights for the pre-trained ECG image vision transformer (ViT) are available at https://huggingface.co/CarDSLab/ecg-clip-beit-base-384.


Environment Setup

# first clone the repo and navigate to the local folder
git clone https://github.com/CarDS-Yale/target-ai-shared.git
cd target-ai-shared
conda env create -f ./environment_files/ecg_ehr_clip_env.yml # heavy environment with all packages
conda env create -f ./environment_files/ecg_image_vit_light.yml # light environment with only packages for inference

For Quick Inference on New Images

You can run quick inference on a folder of ECG images using the model weights hosted on Hugging Face. In the command below, the following arguments can be adjusted:

  • hf_repo: the Hugging Face repository hosting the published model and weights
  • image_dir: a folder containing ECG images in any of the four standard layouts
  • centroid_csv: a CSV with reference embeddings/centroids for cases and controls; we provide examples derived from our training set and from EchoNext (Elias P, Finer J. EchoNext: A Dataset for Detecting Echocardiogram-Confirmed Structural Heart Disease from ECGs, version 1.1.0. PhysioNet, 2025. RRID:SCR_007345. https://doi.org/10.13026/3ykd-bf14)
  • output_csv: path where the output CSV will be written

conda activate ecg_image_vit_light
python ./demo/zero_shot_from_reference_embedding.py \
  --hf_repo "CarDSLab/ecg-clip-beit-base-384" \
  --image_dir "./demo/online_ecg_images_credit_to_liftl" \
  --centroid_csv "./demo/reference_embeddings/ynhhs_reference_centroids.csv" \
  --output_csv "./demo/output/ynhhs_output.csv" \
  --batch_size 32 \
  --device cuda
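
Under the hood, the zero-shot score for each image reflects its similarity to the case vs. control reference centroids in the shared embedding space. A minimal sketch of that logic (function and variable names are illustrative assumptions; the released script handles image preprocessing, batching, and the exact score definition):

import numpy as np

def zero_shot_scores(image_embeddings, case_centroid, control_centroid):
    # Cosine similarity of each image embedding to a reference centroid
    def cosine(a, b):
        a = a / np.linalg.norm(a, axis=-1, keepdims=True)
        return a @ (b / np.linalg.norm(b))

    # Higher score = embedding sits closer to the case centroid than to the control centroid
    return cosine(image_embeddings, case_centroid) - cosine(image_embeddings, control_centroid)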

Part I: Defining the Project Cohorts

The analysis includes pairs of ECG-TTE studies performed within 90 days of each other. The full inclusion and exclusion criteria are described in the accompanying manuscript. Briefly:

  • Development cohort: 2016 to 2021
  • Temporally distinct internal test set: 2022 to 2023

Part II: CLIP Pre-training of ECG Image ViT and TTE Text Encoder

conda activate ecg_ehr_clip_env
python ./ecg_echo_clip_training/main_run_experiments.py

Each training run produces:

Tokenizer and Configuration Files:

  • added_tokens.json, config.json, merges.txt, preprocessor_config.json, special_tokens_map.json, tokenizer_config.json, tokenizer.json, vocab.json

Model Checkpoints:

  • best_model.pth, final_model.pth, epoch_10.pth, ...

Processor and Metrics:

  • processor_initial/, best_processor/, metrics.txt

Data Files:

  • train_data.csv, val_data.csv, frequent_words.txt

Output files are saved under hyperparameter-specific subdirectories in output_dir.
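
For orientation, the pre-training objective is the standard CLIP-style symmetric contrastive loss over paired ECG images and TTE report text. A minimal PyTorch sketch of that objective (illustrative only; the actual training loop, schedules, and hyperparameters live in main_run_experiments.py):

import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, temperature=0.07):
    # L2-normalize both sets of embeddings
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    # Pairwise similarity matrix; matched ECG-TTE pairs sit on the diagonal
    logits = image_emb @ text_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    # Symmetric cross-entropy: image-to-text and text-to-image
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets)) / 2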


Part III: CLMBR-T Mapping of EHR

This section is based on a JDAT extraction from the Yale EHR.

# Stepwise preprocessing
python ./clmbrt-mapping/ynhhs/preprocess_ynhhs_population.py
python ./clmbrt-mapping/preprocess_ynhhs_labs.py
python ./clmbrt-mapping/preprocess_ynhhs_meds.py
python ./clmbrt-mapping/preprocess_ynhhs_pool_events.py
python ./clmbrt-mapping/preprocess_split_parquets.py

# Parallel representation generation
bash ./clmbrt-mapping/ynhhs/run_parallel.sh

# Merge outputs
python ./clmbrt-mapping/combine_representation_chunks.py
python ./clmbrt-mapping/combine_ecg_echo_clmbrt_embeddings.py

Final output: combined_ecg_echo_clmbr_embeddings_YYYYMMDD.parquet. For the UKB, see the analogous scripts under ./clmbrt-mapping/ukb.
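
The merge step is conceptually a concatenation of the per-chunk parquet files followed by a save under the dated filename. A minimal pandas sketch (the chunk file pattern and key column are assumptions; see combine_representation_chunks.py for the actual logic):

import glob
import pandas as pd

# Gather the per-chunk CLMBR-T representation files (hypothetical naming pattern)
chunks = sorted(glob.glob("representations/representation_chunk_*.parquet"))
combined = pd.concat((pd.read_parquet(p) for p in chunks), ignore_index=True)

# Drop duplicates in case a record landed in more than one chunk (assumed key column)
combined = combined.drop_duplicates(subset="person_id")
combined.to_parquet("combined_ecg_echo_clmbr_embeddings_YYYYMMDD.parquet")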


Part IV: Zero-shot Inference

This module performs zero-shot evaluation using image embeddings and centroids.

Step 1: Extract Centroids

python ./ecg_echo_clip_training/extract_embeddings.py

Inputs: train_image_embeddings.parquet, train_labels.csv
Output: centroid_results.csv
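
Conceptually, each reference centroid is the mean of the training-set image embeddings within a label group. A minimal sketch (the join key, label column, and embedding column prefix are assumptions about the file schemas):

import pandas as pd

emb = pd.read_parquet("train_image_embeddings.parquet")   # one embedding per ECG image
labels = pd.read_csv("train_labels.csv")                  # binary label per study
df = emb.merge(labels, on="study_id")                     # assumed join key

emb_cols = [c for c in df.columns if c.startswith("emb_")]  # assumed embedding columns
case_centroid = df.loc[df["label"] == 1, emb_cols].mean().to_numpy()
control_centroid = df.loc[df["label"] == 0, emb_cols].mean().to_numpy()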

Step 2: Evaluate on New Data

python ./ecg_echo_clip_training/zero_shot_from_reference_embeddings.py \
    --val_data_path /path/to/val.csv \
    --train_data_path /path/to/train.csv \
    --embeddings_path /path/to/clip_embeddings \
    --output_dir /path/to/output \
    --gpu 0

Outputs: overall_results.json, val_mrn.csv

Optional: Evaluate CLMBR-T Representations

python ./clmbrt-mapping/inference/run_zero_shot_clmbrt.py

Part V: Experiments of EHR-guided vs Untargeted AI-ECG

This module compares three deployment strategies:

  • Approach I (Image Only): ECG embeddings + classifier (F1-optimized)
  • Approach II (Concatenated): ECG + CLMBR-T embeddings + classifier (F1-optimized)
  • Approach III (Gated): CLMBR-T gating (90% sensitivity) --> ECG classifier (F1-optimized); see the sketch below

# Assumes a parquet (or similar) file containing the image and CLMBR-T embeddings; see the script
python ./targeted_vs_untargeted_experiments/targeted_vs_untargeted_aiecg_deployment.py

Output: a CSV file with label-level performance metrics for each strategy.
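
For intuition, the gating in Approach III fixes a CLMBR-T threshold that retains 90% of true cases on a labeled tuning set, and only patients above that gate are passed to the F1-optimized ECG classifier. A minimal sketch (all variable names are hypothetical; the script implements the full experiment):

import numpy as np

def gated_predictions(clmbrt_prob, ecg_prob, y_true, ecg_threshold):
    # Gate threshold chosen so that 90% of true cases pass (i.e., 90% sensitivity)
    gate = np.quantile(clmbrt_prob[y_true == 1], 0.10)
    passed = clmbrt_prob >= gate

    # Screened-out patients are labeled negative; the rest get the ECG classifier decision
    preds = np.zeros(len(y_true), dtype=int)
    preds[passed] = (ecg_prob[passed] >= ecg_threshold).astype(int)
    return preds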

