Source code and data for
"Contrastive learning-based computational histopathology predict differential expression of cancer driver genes"
arxiv
HistCode is a multi-stage model. Firstly, adversarial contrastive learning is used to unsupervised extract tile-level features, then the attention-pooling is used to aggregate tile-level features into slide-level features, and finally it is used in the downstream tumor diagnosis and differential gene expression prediction tasks.
You can download your own wsi dataset to the directory slides,
then run data_processing/create_patches_fp.py to seg and tile wsis,
adjust the parameters according to your needs.
For example, you can use following command for segment and tile.
python create_patches_fp.py --source ../slides/TCGA-LUNG --data_type tcga_lung --patch_size 256 --save_dir ../tile_results --patch --seg
When you run this command, it will run in default parameter, if you want to run with your parameter, you can modify tcga_lung.csv in directory preset, and add --preset ../preset/tcga_lung.csv
.
Then the coordinate files will be saved to tile_results/patches
and the mask files that show contours of slides will be saved to tile_results/masks
.
Run train/train_adco.py to train contrast learning model on tiles,
you should write Adco/ops/argparser.py to configure the data source
and the save address and ADCO related parameters firstly.
In addition, you need to prepare a CSV file similar to dataset_csv/sample_data.csv,
this file needs to save the name of the WSI file used for training.
For example, you can use following command for training ADCO model with default parameter.
python train_adco.py --csv_path ../dataset_csv/sample_data.csv --save_path ../MODELS --data_h5_dir ../tile_result --data_slide_dir ../slides/TCGA-LUNG --data_type tcga_lung
Run data_processing/extract_features_fp.py to extract the tile-level features. For example, you can use following command for extracting features.
python extract_features_fp.py --data_h5_dir ../tile_results --data_slide_dir ../slides/TCGA-LUNG --csv_path ../dataset_csv/sample_data.csv --feat_dir ../FEATURES --data_type tcga_lung --model_path ../MODELS/adco_tcga_lung_not_sym.pth.tar
The above command will use the trained ADCO model in model_path
to extract tile features in data_slide_dir
and save the features to feat_dir
.
Run train/train_clf_model.py to perform downstream classification task. For example:
python train_clf_model.py --data_root_dir ../FEATURES --extract_model ADCO --results_dir ../results
The above command will use the feature file in data_root_dir
to train the classification model, and then output the test results to results_dir
.
User needs to divide the data set into training set, verification set and test set in advance and put them under dataset_csv/clf, such as:
dataset_csv/clf
├── train_dataset_1.csv
├── ...
├── train_dataset_5.csv
├── test_dataset_1.csv
├── ...
├── test_dataset_5.csv
├── val_dataset_1.csv
├── ...
├── val_dataset_5.csv
The default number of folds is 5, if user want to change fold numbers, add --k fold_number
and prepare corresponding training files in dataset_csv/clf.
The training files is like dataset_csv/clf/sample_data2.csv.
Run train/train_gene_reg_model.py to perform downstream regression task. For example:
python train_gene_reg_model.py --data_root_dir ../FEATURES --extract_model ADCO --results_dir ../MODELS/gene --csv_dir ../dataset_csv/gene --k 5
The above command will train regression that using attention-pooling to aggregate tile features by default. User should prepare gene dataset like this:
dataset_csv/gene
├── train_dataset_1.csv
├── ...
├── train_dataset_5.csv
├── test_dataset_1.csv
├── ...
├── test_dataset_5.csv
The training files is like dataset_csv/gene/sample_gene_data.csv.