
02_PreTrained

Michael Bornholdt edited this page Sep 16, 2021 · 11 revisions

In the second chapter, I use DeepProfiler, the cell images, and the masked cell locations to infer the features of each cell by running the images through a pre-trained net in DeepProfiler. After applying postprocessing similar to that used for the CellProfiler (CP) profiles, I compare the CP baseline to the pre-trained nets using the evaluation metrics.

Content

  • Images & location files
  • DeepProfiler
  • Results

Images & location files

The CellPainting assay outputs 5 channels for each site. This is analogous to an RGB image, which holds 3 channels of the same subject. One of the many capabilities of the CellProfiler software is to segment and mask cells, and the location of each cell is saved in so-called location files.

Both images and location files can be downloaded from the S3 jump-cellpainting bucket. All images are compressed to PNG format and accessible under /projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/2016_04_01_a549_48hr_batch1_compressed/images/ while the location CSV files can be found under s3://jump-cellpainting/projects/2015_10_05_DrugRepurposing_AravindSubramanian_GolubLab_Broad/workspace/locations/. Note that these location files can be extracted again from the original SQLite files, if needed.

You will need to download these files into the correct folders of your DeepProfiler project (project/inputs/images and project/inputs/locations). Refer to the DeepProfiler wiki and demo for details.

Nomenclature

Images: Images are currently held in a shallow structure, /Plate/image.png, but this may change in the future because we are trying to keep fewer files per folder. An example image name is r01c01f01p01-ch1sk1fk1fl1.png, which follows the pattern rXXcYYfZZp01-chCsk1fk1fl1.png, where:

  • XX is the row number with A = 01, B = 02 etc.
  • YY is the column number: c02 = column 2
  • ZZ is the site number: f09 = image of the 9th site
  • C is the CP channel number with 1 = AGP, 2 = ER, 3 = RNA, 4 = DNA and 5 = Mito
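
The naming scheme above can be parsed with a regular expression. The following is a minimal sketch, not part of DeepProfiler: the function names are hypothetical, and the mapping to a location file name assumes that the column is zero-padded to two digits and the site is not padded, as in the A01-1-Nuclei.csv example.

```python
import re

# Channel numbers as listed above.
CHANNELS = {1: "AGP", 2: "ER", 3: "RNA", 4: "DNA", 5: "Mito"}

# Pattern rXXcYYfZZp01-chCsk1fk1fl1.png; the fixed parts are matched literally.
IMAGE_NAME = re.compile(r"r(\d{2})c(\d{2})f(\d{2})p01-ch([1-5])sk1fk1fl1\.png")


def parse_image_name(name):
    """Split an image file name into row letter, column, site and channel."""
    match = IMAGE_NAME.fullmatch(name)
    if match is None:
        raise ValueError(f"unexpected image name: {name!r}")
    row, column, site, channel = (int(group) for group in match.groups())
    return {
        "row": chr(ord("A") + row - 1),  # 01 -> A, 02 -> B, ...
        "column": column,
        "site": site,
        "channel": CHANNELS[channel],
    }


def location_file_name(info):
    """Build the matching location file name (padding is an assumption)."""
    return f"{info['row']}{info['column']:02d}-{info['site']}-Nuclei.csv"


info = parse_image_name("r01c01f01p01-ch1sk1fk1fl1.png")
print(info, location_file_name(info))
```

Names that do not follow the scheme raise a ValueError, which makes stray files in an image folder easy to spot.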

Below is an example image of the RNA channel showing the Nucleoli and cytoplasmic RNA.

The location files follow a simpler naming scheme, RowColumn-Site-Nuclei.csv, for example A01-1-Nuclei.csv. Each location file holds just two columns (X value and Y value) listing the positions of all cells in that particular site. These files can be conveniently read with pandas via pd.read_csv("location_file"). Here are the first rows of such a file.

Nuclei_Location_Center_X,Nuclei_Location_Center_Y
1224.92495189224,29.781911481718996
302.20882352941203,40.9143382352941
1902.96504782929,46.1699779249448
1247.293622142,54.040914560770204
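
As a small illustration of reading such a file with pandas, the sketch below loads the four rows shown above from an in-memory string instead of a real file. The helper name load_locations is hypothetical; the column names are taken from the header above.

```python
import io

import pandas as pd

EXPECTED_COLUMNS = ["Nuclei_Location_Center_X", "Nuclei_Location_Center_Y"]

# The example rows from above, standing in for a file such as A01-1-Nuclei.csv.
sample = io.StringIO(
    "Nuclei_Location_Center_X,Nuclei_Location_Center_Y\n"
    "1224.92495189224,29.781911481718996\n"
    "302.20882352941203,40.9143382352941\n"
    "1902.96504782929,46.1699779249448\n"
    "1247.293622142,54.040914560770204\n"
)


def load_locations(source):
    """Read a location file (path or file-like object) and check its columns."""
    locations = pd.read_csv(source)
    missing = [col for col in EXPECTED_COLUMNS if col not in locations.columns]
    if missing:
        raise ValueError(f"location file is missing columns: {missing}")
    return locations[EXPECTED_COLUMNS]


locations = load_locations(sample)
print(f"{len(locations)} cells in this site")
```

Passing a path such as project/inputs/locations/... instead of the in-memory sample works the same way, since pd.read_csv accepts both.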

DeepProfiler

DeepProfiler (DP) is a set of tools written by Juan Caicedo and his colleagues that allows me to use deep learning to train models as well as infer profiles from the cell images.

Since DeepProfiler is a work-in-progress code base that has recently been updated to TensorFlow 2.5, there are some tricks to be aware of when using DP or trying to reproduce my work. On this separate page, I have accumulated a long list of practical tips for working with DeepProfiler. I ran most of my experiments on an HPC cluster at the University of Wisconsin. Running DP on such a server requires a Docker container that contains all relevant software and installations.

Docker images: All Docker images can be found here. Currently, the tags tf2_v13 and tf2_v12 are active. V13 refers to the image that already incorporates augmentation during training, while V12 has no augmentation. Tf2 indicates that both images run on TensorFlow 2.5 and the new tf2 branch of DeepProfiler.

The models I have used for creating a baseline for neural net profiles are EfficientNetB0 and ResNet50V2, since these two nets were already available in DP. To aggregate the resulting single-cell output of DeepProfiler, I used DeepProfiler_processing, an additional function of pycytominer that was started by Greg and finished by myself. After aggregating this data to the well level, the same pycytominer functions can be used to proceed to the evaluation techniques. As can be seen in these notebooks, feature selection neither increases nor decreases the metrics.
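
To make the aggregation step concrete, here is a minimal pandas sketch of collapsing single-cell features to one profile per well. It is not the DeepProfiler_processing code from pycytominer: the column names (Metadata_Plate, Metadata_Well, feature_0, feature_1), the helper name and the choice of the median are assumptions made for illustration.

```python
import pandas as pd


def aggregate_to_wells(single_cells, group_cols=("Metadata_Plate", "Metadata_Well")):
    """Collapse single-cell rows to one median profile per plate and well."""
    group_cols = list(group_cols)
    feature_cols = [col for col in single_cells.columns if col not in group_cols]
    return single_cells.groupby(group_cols)[feature_cols].median().reset_index()


# Toy single-cell table: three cells in well A01, two cells in well B02.
cells = pd.DataFrame(
    {
        "Metadata_Plate": ["P1"] * 5,
        "Metadata_Well": ["A01", "A01", "A01", "B02", "B02"],
        "feature_0": [1.0, 2.0, 6.0, 4.0, 8.0],
        "feature_1": [0.0, 0.5, 1.0, 2.0, 4.0],
    }
)

wells = aggregate_to_wells(cells)
print(wells)
```

The median is less sensitive than the mean to a few badly segmented cells, which is one reason it is a common choice for this kind of aggregation.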

The folders /pre-trained/efficientnet and /pre-trained/resnet contain the resulting profiles.

Methods

File sizes: Given the images, the location files and a functional Docker image, be aware that a few terabytes will be needed to host these experiments. Approximate sizes are shown below.

  • The original images and locations: ~900G
  • sample folder containing crops: 50-100G
  • profile folder containing profiles: 50-80G
  • Training folder containing training checkpoints: 1G

Naturally, sizes will differ depending on, for example, the number of crops in the sample.
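
Before starting a run, it can help to check that the target disk has enough free space. The sketch below sums the upper ends of the sizes listed above and compares the total with the free space reported by the operating system. The 20% headroom, the interpretation of "G" as 10^9 bytes and the helper names are assumptions.

```python
import shutil

# Upper ends of the approximate sizes listed above, in gigabytes.
ESTIMATED_GB = {
    "images and locations": 900,
    "sample": 100,
    "profiles": 80,
    "training": 1,
}


def required_gb(estimates=ESTIMATED_GB, headroom=1.2):
    """Total estimated size, scaled by a safety factor (an assumption)."""
    return sum(estimates.values()) * headroom


def has_enough_space(path, needed_gb):
    """Return True if the disk holding `path` has at least `needed_gb` free."""
    free_gb = shutil.disk_usage(path).free / 1e9
    return free_gb >= needed_gb


needed = required_gb()
print(f"need about {needed:.0f} GB, enough space here: {has_enough_space('.', needed)}")
```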

Input file: DeepProfiler reads all information from the index file, such as the paths to the location files and images and which data belong to the training and test sets. The full index files and the sub-selected index can be found in the pre-trained/data-prep/02_index_preperation/ folder.
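
To give a feel for what an index file contains, the sketch below groups image file names by well and site and writes one row per site with one column per channel. It is an illustration only: the column names (Metadata_Plate, Metadata_Well, Metadata_Site and the channel names), the well format A01 and the helper names are assumptions, and the training/test split column is left out. The DeepProfiler wiki remains the reference for the actual format.

```python
import csv
import io
import re
from collections import defaultdict

CHANNELS = {1: "AGP", 2: "ER", 3: "RNA", 4: "DNA", 5: "Mito"}
IMAGE_NAME = re.compile(r"r(\d{2})c(\d{2})f(\d{2})p01-ch([1-5])sk1fk1fl1\.png")


def build_index_rows(plate, image_names):
    """Group images by (well, site); keep only sites with all five channels."""
    sites = defaultdict(dict)
    for name in image_names:
        match = IMAGE_NAME.fullmatch(name)
        if match is None:
            continue  # skip files that do not follow the naming scheme
        row, column, site, channel = (int(group) for group in match.groups())
        well = f"{chr(ord('A') + row - 1)}{column:02d}"
        sites[(well, site)][CHANNELS[channel]] = f"{plate}/{name}"
    rows = []
    for (well, site), paths in sorted(sites.items()):
        if len(paths) == len(CHANNELS):
            rows.append(
                {"Metadata_Plate": plate, "Metadata_Well": well,
                 "Metadata_Site": site, **paths}
            )
    return rows


def write_index(rows, handle):
    """Write the rows as CSV with a fixed column order."""
    fieldnames = ["Metadata_Plate", "Metadata_Well", "Metadata_Site", *CHANNELS.values()]
    writer = csv.DictWriter(handle, fieldnames=fieldnames)
    writer.writeheader()
    writer.writerows(rows)


names = [f"r01c01f01p01-ch{c}sk1fk1fl1.png" for c in range(1, 6)]
names += ["r01c02f01p01-ch1sk1fk1fl1.png", "notes.txt"]  # incomplete site, stray file
buffer = io.StringIO()
write_index(build_index_rows("PlateX", names), buffer)
print(buffer.getvalue())
```

Sites with missing channels are dropped here rather than written with empty cells, on the assumption that an incomplete site cannot be profiled.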

All details on how to train, sample, or profile can be found in the config. The DP wiki has a good description of all functions and an example config can be found on the DeepProfiler page.

Results

Again, spherizing proves to be the best way to alleviate batch effects and improve the metric scores. Furthermore, there is no significant difference between EfficientNet and ResNet. The enrichment scores for the pre-trained nets outperform the classic baseline by a factor of about 1.3.
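
For readers unfamiliar with spherizing, the sketch below shows one common variant, ZCA whitening, in NumPy. It is not the code used for the results above: the function names, the small regularisation constant eps and the idea of fitting the transform on a reference set of profiles (for example control wells) are assumptions.

```python
import numpy as np


def fit_spherize(reference, eps=1e-6):
    """Estimate the mean and ZCA whitening matrix from reference profiles."""
    mean = reference.mean(axis=0)
    cov = np.cov(reference - mean, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    eigvals = np.clip(eigvals, 0.0, None)  # guard against tiny negative values
    whitening = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + eps)) @ eigvecs.T
    return mean, whitening


def spherize(profiles, mean, whitening):
    """Center the profiles and decorrelate their features."""
    return (profiles - mean) @ whitening


# Toy data: 500 profiles with 4 correlated features.
rng = np.random.default_rng(0)
mixing = np.array(
    [[2.0, 0.5, 0.0, 0.0],
     [0.0, 1.0, 0.3, 0.0],
     [0.0, 0.0, 1.5, 0.2],
     [0.0, 0.0, 0.0, 1.0]]
)
reference = rng.normal(size=(500, 4)) @ mixing

mean, whitening = fit_spherize(reference)
whitened = spherize(reference, mean, whitening)
print(np.round(np.cov(whitened, rowvar=False), 3))
```

After the transform, the covariance of the reference profiles is close to the identity matrix, so no single correlated group of features dominates distance-based comparisons between wells.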
