Weizhe Li, Weijie Chen
Code link: Training image extraction
Code link: Image processing
Code link: Neural network training
Code link: Heatmap construction
Code link: Slide-based prediction
Code link: Lesion-based prediction
Code link: WSI-heatmap visualization
- Note: The code was developed on Python 3.5 and also works on Python 3.6.
-
Package Installation for Color Normalization
Note: SPAMS no longer needs to be installed manually from source because it can be installed through pip.
-
Tensorflow and Keras version
The code here is based on Keras 2.0.0 and TensorFlow 1.9. Compatibility between TensorFlow and Keras, and between TensorFlow and CUDA, is important, especially when the code runs on different machines. Some machines can only run an older version of TensorFlow, which in turn requires an older version of Keras. When a model trained with newer versions of TensorFlow and Keras is loaded on such a machine, the weights of the trained model are not fully loaded; the model still works for testing, but not for further training (e.g., transfer learning). The code needs some changes if a Keras version higher than 2.0.0 is used for model training (see the comments in the model training code).
-
OpenSlide can serve as a backend for a web viewer of WSI.
-
Mask images can be generated from the XML files that store the pathologist's annotation of tumor contours; they serve as ground truth for model training. A mask is a binary image in which each pixel of the corresponding WSI is coded as ‘0’ for normal tissue or ‘1’ for tumor tissue. So that the masks can be displayed directly, the code here uses 255 (rather than 1) for tumor pixels. See the update on mask file generation on the CAMELYON17 website.
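As a rough illustration of the XML-to-mask step, the sketch below parses polygon vertices and fills them with 255 on a small canvas. It assumes the ASAP-style layout (`Annotation` elements containing `Coordinate` elements with `X` and `Y` attributes); the function names, the tiny 8x8 canvas, and the even-odd fill on pixel centers are illustrative choices, not the repository's implementation.

```python
import xml.etree.ElementTree as ET
import numpy as np

def read_polygons(xml_text):
    """Return a list of (N, 2) arrays of (x, y) vertices, one per annotation."""
    root = ET.fromstring(xml_text)
    polygons = []
    for annotation in root.iter("Annotation"):
        pts = [(float(c.get("X")), float(c.get("Y")))
               for c in annotation.iter("Coordinate")]
        if len(pts) >= 3:
            polygons.append(np.array(pts))
    return polygons

def polygons_to_mask(polygons, height, width, tumor_value=255):
    """Fill each polygon using the even-odd rule, tested at pixel centers."""
    ys, xs = np.mgrid[0:height, 0:width]
    px, py = xs + 0.5, ys + 0.5
    mask = np.zeros((height, width), dtype=np.uint8)
    for poly in polygons:
        inside = np.zeros((height, width), dtype=bool)
        x0, y0 = poly[:, 0], poly[:, 1]
        x1, y1 = np.roll(x0, -1), np.roll(y0, -1)
        for ax, ay, bx, by in zip(x0, y0, x1, y1):
            if ay == by:
                continue  # a horizontal edge never crosses a horizontal ray
            crosses = (ay > py) != (by > py)
            x_at_y = ax + (py - ay) * (bx - ax) / (by - ay)
            inside ^= crosses & (px < x_at_y)
        mask[inside] = tumor_value  # 255 so the mask can be viewed directly
    return mask

# Hypothetical annotation: one square tumor contour from (2, 2) to (6, 6).
xml_text = """<ASAP_Annotations><Annotations>
<Annotation Name="a0" Type="Polygon"><Coordinates>
<Coordinate Order="0" X="2" Y="2"/>
<Coordinate Order="1" X="6" Y="2"/>
<Coordinate Order="2" X="6" Y="6"/>
<Coordinate Order="3" X="2" Y="6"/>
</Coordinates></Annotation>
</Annotations></ASAP_Annotations>"""

example_mask = polygons_to_mask(read_polygons(xml_text), height=8, width=8)
```

For a real WSI the canvas would be far larger, so a library rasterizer would normally replace the per-edge loop.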
-
Note:
-
The mask file has a pyramid structure corresponding to multiple levels of magnification. Except for the 40x magnification at which the xml annotation was made, the mask size may be slightly different from the corresponding WSI because the method used for creating the pyramid structure in the mask file may be different from that used for WSI.
-
The code below for the CAMELYON16 training slides is based on ASAP code, but we found that it did not work for testing slides. We therefore wrote our own mask generation code for the testing slides.
-
-
Time consuming
-
WSI and Mask file (Example): tumor_026
- Note: Some tumor slides were not fully annotated. Normal_86: Originally misclassified, renamed to Tumor_111. Test_049: Duplicate slide. Test_114: Does not have exhaustive annotations. Test_049 was removed (by the organizer) for slide based and lesion based tasks; Test_114 was removed (by the organizer) from lesion based tasks.
- Annotation Visualization over Image Based on XML File
To reduce computation, blank regions (no tissue) on the slide are excluded.
-
Color space switch to HSV
-
Tissue region segmentation (Otsu’s method of foreground segmentation)
(code embedded in Patch Extraction)
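The two steps above (HSV conversion, then Otsu thresholding) can be sketched with NumPy alone. This is a simplified illustration: it thresholds only the saturation channel, which is an assumption, and the helper names and the synthetic 8x8 image are hypothetical.

```python
import numpy as np

def rgb_to_saturation(rgb):
    """Saturation channel of HSV, scaled to 0..255, from an RGB uint8 image."""
    rgb = rgb.astype(np.float64)
    cmax = rgb.max(axis=-1)
    cmin = rgb.min(axis=-1)
    sat = np.where(cmax > 0, (cmax - cmin) / np.maximum(cmax, 1e-12), 0.0)
    return np.round(sat * 255).astype(np.uint8)

def otsu_threshold(channel):
    """Return the level t that maximizes the between-class variance."""
    hist = np.bincount(channel.ravel(), minlength=256).astype(np.float64)
    prob = hist / hist.sum()
    omega = np.cumsum(prob)                # probability of the class <= t
    mu = np.cumsum(prob * np.arange(256))  # cumulative mean up to t
    mu_total = mu[-1]
    denom = omega * (1.0 - omega)
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = np.where(denom > 0, (mu_total * omega - mu) ** 2 / denom, 0.0)
    return int(np.argmax(sigma_b))

# Synthetic slide region: white background with a pink "tissue" square.
image = np.full((8, 8, 3), 255, dtype=np.uint8)
image[2:6, 2:6] = (200, 100, 150)

sat = rgb_to_saturation(image)
tissue_mask = sat > otsu_threshold(sat)  # True where tissue is present
```

Blank glass has near-zero saturation, so pixels above the Otsu level are kept as tissue and the rest are skipped during patch extraction.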
Extract normal image patches from normal slides
Extract normal and tumor image patches from tumor slides
- Note: The code for the following procedures is part of the modules for CNN training.
Step 0: Tumor training images and normal training images were randomly divided into a training data set and a validation data set.
80% for training data set;
20% for validation data set.
Images in the validation data set:
tumor: tumor_002.tif, tumor_008.tif, tumor_010.tif, tumor_019.tif, tumor_022.tif, tumor_024.tif, tumor_025.tif,
tumor_031.tif, tumor_040.tif, tumor_045.tif, tumor_049.tif, tumor_069.tif, tumor_076.tif, tumor_083.tif,
tumor_084.tif, tumor_085.tif, tumor_088.tif, tumor_091.tif, tumor_101.tif, tumor_102.tif, tumor_108.tif,
tumor_109.tif
normal: normal_003.tif, normal_013.tif, normal_021.tif, normal_023.tif, normal_024.tif, normal_030.tif, normal_031.tif,
normal_040.tif, normal_045.tif, normal_057.tif, normal_062.tif, normal_066.tif, normal_068.tif, normal_075.tif,
normal_076.tif, normal_080.tif, normal_087.tif, normal_099.tif, normal_100.tif, normal_102.tif, normal_106.tif,
normal_112.tif, normal_117.tif, normal_127.tif, normal_132.tif, normal_139.tif, normal_141.tif, normal_149.tif,
normal_150.tif, normal_151.tif, normal_152.tif, normal_156.tif
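A slide-level 80/20 split like the one above could be produced along these lines. The helper name and the seed are hypothetical, so this sketch does not reproduce the exact validation slides listed; it only shows that splitting 111 tumor slides this way yields 22 validation slides, the same count as in the list.

```python
import random

def split_slides(slide_names, val_fraction=0.2, seed=0):
    """Randomly split slide file names into (training, validation) lists."""
    names = sorted(slide_names)
    rng = random.Random(seed)  # fixed seed (assumed) keeps the split repeatable
    rng.shuffle(names)
    n_val = round(len(names) * val_fraction)
    return sorted(names[n_val:]), sorted(names[:n_val])

tumor_slides = ["tumor_%03d.tif" % i for i in range(1, 112)]  # 111 slides
train_tumor, val_tumor = split_slides(tumor_slides)
```

Splitting by slide rather than by patch keeps patches from the same slide out of both sets at once.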
Tumor slides: 1K positive and 1K negative patches from each slide, so the total number of patches for tumor tissue is 111k.
Normal slides: 1K negative patches from each slide, so the total number of patches for normal tissue is 111k + 159k = 270k.
-
Method I: flip, rotation and cropping
Each 256x256 image patch from step 1 was flipped and rotated 3 times, turning the original 256x256 patch into 4 patches. From each of these 4 256x256 patches, 2 224x224 patches were randomly cropped, so I had a total of 8 224x224 patches derived from 1 256x256 patch. For normal image patches, however, only 1 224x224 patch was randomly cropped from each of them. The totals are therefore 111k x 8 = 888k image patches for tumor tissue and 270k x 4 = 1080k for normal tissue.
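The arithmetic above (4 views, then 2 crops per view for tumor or 1 for normal) can be sketched as follows. Using four 90-degree rotations as the 4 views is an assumption, and `augment_patch` is a hypothetical helper rather than the repository's function.

```python
import numpy as np

def augment_patch(patch, crops_per_view, rng, crop_size=224):
    """Make 4 rotated views of a square patch, then random crops of each view."""
    crops = []
    for k in range(4):                      # 0, 90, 180 and 270 degrees
        view = np.rot90(patch, k)
        for _ in range(crops_per_view):
            top = rng.integers(0, view.shape[0] - crop_size + 1)
            left = rng.integers(0, view.shape[1] - crop_size + 1)
            crops.append(view[top:top + crop_size, left:left + crop_size])
    return crops

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(256, 256, 3), dtype=np.uint8)

tumor_crops = augment_patch(patch, crops_per_view=2, rng=rng)   # 8 patches
normal_crops = augment_patch(patch, crops_per_view=1, rng=rng)  # 4 patches
```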
-
Method II: stain (color) normalization and adding color noise
For Method II, I will have about 2 million 224x224 image patches per class (tumor and normal) for model training. All of the previous 224x224 image patches are color-normalized, and color noise is then added to a copy of each. The color-normalized patch plus its color-noise version give me 2 224x224 images from every 1 224x224 image, so the totals are 888k x 2 = 1776k for tumor tissue and 1080k x 2 = 2160k for normal tissue.
-
The color variety among patches
The patches before and after stain normalization
Dayong Wang's method (PathAI) (Based on HSV image patches):
If a large value is added to the image patches, so that some pixel values exceed 255, the patches after adding color noise will look like this (not preferred):
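One plausible cause of such artifacts is 8-bit wraparound: adding an offset to a `uint8` array silently wraps values above 255 back to small numbers. The sketch below shows the wraparound and a clipped alternative. It assumes every channel is stored as `uint8` in 0..255 (OpenCV stores 8-bit hue in 0..179, so that bound would need adjusting), and `add_color_noise` and `max_shift` are hypothetical names.

```python
import numpy as np

def add_color_noise(hsv_patch, max_shift, rng):
    """Add one random offset per channel, clipping instead of wrapping."""
    shift = rng.integers(-max_shift, max_shift + 1, size=3)
    shifted = hsv_patch.astype(np.int16) + shift  # widen before adding
    return np.clip(shifted, 0, 255).astype(np.uint8)

rng = np.random.default_rng(0)
patch = np.full((4, 4, 3), 250, dtype=np.uint8)

noisy = add_color_noise(patch, max_shift=20, rng=rng)

# Naive uint8 addition wraps around: 250 + 20 becomes 14, not 255.
wrapped = patch + np.full_like(patch, 20)
```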
Yun Liu's method (Google) (also called color perturbation):
Step 3 : Image Generator
Patches:
Ground Truth:
-
Optimization method: Stochastic gradient descent
-
Weight initialization: Random sampling from a Gaussian distribution
-
Batch size: 32
-
Batch normalization: No
-
Regularization: L2-regularization (0.0005) and 50% dropout
-
Learning rate: 0.01, multiplied by 0.5 every 50,000 iterations (0.01, multiplied by 0.1 per epoch)
-
Activation function: ReLU
-
Loss function: Cross-entropy
-
Number of training epochs/iterations: 300,000 iterations
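The step schedule in the list above (0.01, multiplied by 0.5 every 50,000 iterations) corresponds to the small function below. It is a sketch of the iteration-based rule only; the function name is hypothetical and the per-epoch variant mentioned in parentheses is not covered.

```python
def learning_rate(iteration, base_lr=0.01, factor=0.5, step=50000):
    """Step decay: multiply the rate by `factor` once every `step` iterations."""
    return base_lr * factor ** (iteration // step)

# Rate at the start of training and at the final (300,000th) iteration.
first_lr = learning_rate(0)
last_lr = learning_rate(299999)
```

Over 300,000 iterations the rate is halved five times, ending at 0.01 x 0.5^5 = 0.0003125.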
Details (11-06-18): 120k 224x224 image patches were extracted based on the predictions and then augmented by rotation and horizontal flipping. For example, 1 224x224 image patch was rotated 3 times to give 4 image patches; these 4 patches were then flipped horizontally to give 8 different patches in total. About 1 million patches were therefore used for hard negative mining.
To get a model with hard negative mining, GoogLeNet v1 was retrained after adding the above-mentioned 1 million patches to my original training patches (1 million normal 224x224 patches and 1 million tumor 224x224 patches).
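The rotation-and-flip augmentation described above produces 8 views per patch, which is how 120k patches become roughly 1 million (120k x 8 = 960k). A minimal sketch, with `eight_views` as a hypothetical helper name:

```python
import numpy as np

def eight_views(patch):
    """Return 4 rotations of a patch plus the horizontal flip of each one."""
    rotated = [np.rot90(patch, k) for k in range(4)]
    return rotated + [np.fliplr(view) for view in rotated]

rng = np.random.default_rng(0)
patch = rng.integers(0, 256, size=(224, 224, 3), dtype=np.uint8)
views = eight_views(patch)
```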
-
Prerequisites
A: a trained CNN model B: [patch_index.py in utils](https://github.com/DIDSR/DeepLearningCamelyon/blob/master/dldp/dldp/utils/patch_index.py) needs to be run first to get the position information of the image patches to be predicted.
-
False positive patches
-
Some false positive patches from partially annotated tumor slides are actually true positives and will be excluded.
The training will use the same code, but with the folder (hnm_dir) for false positive patches included.
The training will use the same code, but with the folder ("hnm_dir") for false positive patches and the folder ("pnt_dir") for normal patches near tumor regions included.
Test images were divided into non-overlapping small patches; for each patch, the model predicts an image in which each pixel is assigned a probability.
Heatmap Generation
-
Prerequisite: patch_index.py in utils needs to be run first to get image patch information such as the location in the WSI.
-
Code for prediction - sliding window - GoogLeNet - HPC version
-
Code for prediction - sliding window - GoogLeNet - Workstation version
-
Put all the patches together and get prediction for the whole slide (code for heatmap stitching - HPC version).
-
Put all the patches together and get prediction for the whole slide (code for heatmap stitching - Workstation version).
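Conceptually, stitching places each patch's predicted probability tile at that patch's location in an empty whole-slide array. The sketch below assumes the tile positions are already expressed in heatmap coordinates; the function name, the 2x2 tile size, and the example values are hypothetical.

```python
import numpy as np

def stitch_heatmap(patch_predictions, positions, heatmap_shape):
    """Place non-overlapping prediction tiles into one slide-level heatmap."""
    heatmap = np.zeros(heatmap_shape, dtype=np.float32)
    for pred, (row, col) in zip(patch_predictions, positions):
        h, w = pred.shape
        heatmap[row:row + h, col:col + w] = pred
    return heatmap

# Four 2x2 probability tiles covering a 4x4 heatmap.
tiles = [np.full((2, 2), p, dtype=np.float32) for p in (0.1, 0.9, 0.4, 0.7)]
positions = [(0, 0), (0, 2), (2, 0), (2, 2)]  # top-left (row, col) of each tile
heatmap = stitch_heatmap(tiles, positions, (4, 4))
```

Regions never visited by a patch, such as blank glass, stay at probability 0.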
the heatmap for test_075 (the part with score < 0.5 is not shown here)
the heatmap for test_073 (the part with score < 0.5 is not shown here)
Comparison of predicted with ground truth for tumor_005:
Step 1: find the failed tasks
Step 2: redo the failed tasks
Step 3: copy the results
- The ratio between the area of metastatic regions and the tissue area.
- The sum of all cancer metastasis probabilities detected in the metastasis identification task, divided by the tissue area. Both features are calculated at 5 different thresholds (0.5, 0.6, 0.7, 0.8, 0.9), giving 10 global features in total.
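The two global features, evaluated at the five thresholds, could be computed roughly as below. Treating a pixel as metastatic when its probability is at or above the threshold, and summing only those pixels' probabilities, are assumptions; `global_features` is a hypothetical name.

```python
import numpy as np

THRESHOLDS = (0.5, 0.6, 0.7, 0.8, 0.9)

def global_features(heatmap, tissue_mask):
    """Return 10 features: (area ratio, probability sum ratio) per threshold."""
    tissue_area = float(tissue_mask.sum())
    feats = []
    for t in THRESHOLDS:
        detected = (heatmap >= t) & tissue_mask
        feats.append(detected.sum() / tissue_area)           # area ratio
        feats.append(heatmap[detected].sum() / tissue_area)  # probability ratio
    return np.array(feats)

# Toy 4x4 slide: everything is tissue, three pixels look metastatic.
heatmap = np.zeros((4, 4))
heatmap[0, 0], heatmap[0, 1], heatmap[1, 0] = 0.95, 0.75, 0.55
tissue_mask = np.ones((4, 4), dtype=bool)

feats = global_features(heatmap, tissue_mask)
```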
Based on 2 largest metastatic candidate regions (select them based on a threshold of 0.5).
10 features were extracted from the 2 largest regions:
- Area: the area of connected region
- Eccentricity: The eccentricity of the ellipse that has the same second-moments as the region
- Extent: The ratio of region area over the total bounding box area
- Bounding box area
- Major axis length: the length of the major axis of the ellipse that has the same normalized second central moments as the region
- Max/mean/min intensity: The max/mean/minimum probability value in the region
- Aspect ratio of the bounding box
- Solidity: Ratio of region area over the surrounding convex area
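Several of the features listed above can be computed with NumPy alone, as sketched below. This is a simplified stand-in: it uses 4-connectivity and defines aspect ratio as width over height (both assumptions), and it omits eccentricity, major axis length, and solidity, which a region-properties routine such as scikit-image's `regionprops` provides. All function names here are hypothetical.

```python
import numpy as np

def label_regions(binary):
    """Label 4-connected regions with an explicit stack (flood fill)."""
    labels = np.zeros(binary.shape, dtype=np.int32)
    current = 0
    for r, c in zip(*np.nonzero(binary)):
        if labels[r, c]:
            continue
        current += 1
        labels[r, c] = current
        stack = [(r, c)]
        while stack:
            y, x = stack.pop()
            for ny, nx in ((y - 1, x), (y + 1, x), (y, x - 1), (y, x + 1)):
                if (0 <= ny < binary.shape[0] and 0 <= nx < binary.shape[1]
                        and binary[ny, nx] and not labels[ny, nx]):
                    labels[ny, nx] = current
                    stack.append((ny, nx))
    return labels, current

def region_features(heatmap, threshold=0.5, n_largest=2):
    """Features of the n largest candidate regions above the threshold."""
    labels, count = label_regions(heatmap > threshold)
    regions = []
    for lab in range(1, count + 1):
        rows, cols = np.nonzero(labels == lab)
        height = int(rows.max() - rows.min() + 1)
        width = int(cols.max() - cols.min() + 1)
        probs = heatmap[rows, cols]
        regions.append({
            "area": int(rows.size),
            "bbox_area": height * width,
            "extent": rows.size / (height * width),
            "aspect_ratio": width / height,
            "max_intensity": float(probs.max()),
            "mean_intensity": float(probs.mean()),
            "min_intensity": float(probs.min()),
        })
    regions.sort(key=lambda reg: reg["area"], reverse=True)
    return regions[:n_largest]

# Toy heatmap with three candidate regions of area 6, 2 and 1.
heatmap = np.zeros((6, 6))
heatmap[0:2, 0:3] = 0.9
heatmap[4, 4], heatmap[4, 5] = 0.7, 0.6
heatmap[3, 0] = 0.8

largest_two = region_features(heatmap)
```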
- Combine the prediction results from Model-1 and Model-2
Model-1 is the model from step 2 (with hard negative mining) in section 3 - "Training Neural Network";
Model-2 is the model from step 3 (with hard negative mining patches and normal patches near tumor regions) in section 3 - "Training Neural Network".
The x, y coordinates of predicted tumor lesion come from Model-1; The scores of predicted tumor lesion are the average of scores from Model-1 and Model-2.
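The combination rule above can be sketched as follows. Reading each model's score from its heatmap at the Model-1 lesion coordinate is an assumption about how the scores are looked up, and `combine_lesions` is a hypothetical name.

```python
import numpy as np

def combine_lesions(coords_model1, heatmap1, heatmap2):
    """Keep Model-1 coordinates; average the two models' scores at each one."""
    combined = []
    for x, y in coords_model1:
        score = (heatmap1[y, x] + heatmap2[y, x]) / 2.0
        combined.append((x, y, float(score)))
    return combined

heatmap1 = np.zeros((4, 4))
heatmap2 = np.zeros((4, 4))
heatmap1[1, 2] = 0.8  # Model-1 score at x=2, y=1
heatmap2[1, 2] = 0.6  # Model-2 score at the same location

lesions = combine_lesions([(2, 1)], heatmap1, heatmap2)
```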
- Results
-
Note: The ROC curve is generated in the Python script mentioned above.
-
Result
-
Wang, D., et al., Deep learning for identifying metastatic breast cancer https://arxiv.org/abs/1606.05718
-
Ehteshami Bejnordi, B., et al., Diagnostic Assessment of Deep Learning Algorithms for Detection of Lymph Node Metastases in Women With Breast Cancer https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5820737/
-
Litjens G., et al., 1399 H&E-stained sentinel lymph node sections of breast cancer patients: the CAMELYON dataset https://academic.oup.com/gigascience/article/7/6/giy065/5026175