
01_Baseline

Michael Bornholdt edited this page Sep 21, 2021 · 5 revisions

The aim of this first part is to decide on useful evaluation metrics and thereby establish a baseline of how well the classical pipeline performs. Classical feature extraction is done with CellProfiler, followed by the postprocessing steps found in the LINCS repository.

Contents

  • LINCS dataset
  • Pycytominer pipeline
  • Evaluation metrics
  • Results

LINCS dataset

Details on the LINCS dataset can be found in its repository; the most important ones are:

  • A549 cell line
  • 1,571 compounds/perturbations
  • 6 doses per compound
  • 5 replicates of each dose-compound combination

Size:

  • 5 batches
  • 136 plates
  • 384 wells per plate
  • 9 sites per well

With each well containing ~2,000 cells and with ~52,000 wells in total, the LINCS database holds around 100 million cells.
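These back-of-the-envelope figures can be checked in a few lines (assuming the standard 384-well plate format):

```python
# Rough size of the LINCS dataset, using the figures listed above.
plates = 136
wells_per_plate = 384      # standard 384-well plate format (assumption)
sites_per_well = 9
cells_per_well = 2_000     # approximate average

total_wells = plates * wells_per_plate
total_sites = total_wells * sites_per_well
total_cells = total_wells * cells_per_well

print(f"~{total_wells:,} wells, ~{total_cells / 1e6:.0f} million cells")
# → ~52,224 wells, ~104 million cells
```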

The different stages of processing, from images of each site to the aggregated profile (feature space) of a compound, are described in the profiles section of the LINCS repository.

This repository contains selected outputs of the LINCS database, including both level 3 and level 5 data. The /baseline/01_data/ folder contains the data and metadata needed for all baseline analyses. The notebooks in the /baseline/02_analysis/ folder also give additional insight into the contents of the data frames; I highly recommend looking at the 01_data_insights.ipynb notebook.


Pycytominer pipeline

Pycytominer is a codebase built by members of the Carpenter-Singh lab. It allows for easy processing of CellProfiler (CP) and DeepProfiler (DP) data and contains all functions that were used to process the data in this repository.

Pycytominer allows you to aggregate data to higher levels, e.g. collapsing single-cell profiles into one profile per well, and to select a subset of features (a relevant task for CP profiles). Most importantly, though, Pycytominer contains various normalization and sphering functions, which normalize your CP or DP data with respect to controls or transform it so that plate and well effects become less relevant.
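As a hypothetical illustration of the two core steps, aggregation and control-based normalization, here is a plain-pandas sketch (the project itself uses Pycytominer's own functions; all column and compound names below are made up):

```python
import numpy as np
import pandas as pd

# Toy single-cell table with made-up feature and metadata names.
cells = pd.DataFrame({
    "Metadata_Plate": ["P1"] * 6,
    "Metadata_Well":  ["A01", "A01", "A02", "A02", "B02", "B02"],
    "Metadata_pert":  ["DMSO", "DMSO", "DMSO", "DMSO", "cmpd", "cmpd"],
    "Cells_Area":     [10.0, 12.0, 14.0, 16.0, 20.0, 22.0],
    "Nuclei_Area":    [5.0, 6.0, 7.0, 8.0, 10.0, 11.0],
})
features = ["Cells_Area", "Nuclei_Area"]

# 1) Aggregate: collapse single cells into one median profile per well.
wells = (cells
         .groupby(["Metadata_Plate", "Metadata_Well", "Metadata_pert"],
                  as_index=False)[features]
         .median())

# 2) Normalize with respect to controls: z-score every feature using the
#    mean and standard deviation of the DMSO (negative control) wells.
ctrl = wells.loc[wells["Metadata_pert"] == "DMSO", features]
wells[features] = (wells[features] - ctrl.mean()) / ctrl.std(ddof=1)
```

After these steps, control wells center around zero and treated wells are expressed in units of control standard deviations.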

The LINCS repository describes this processing pipeline in detail, and the Pycytominer repo explains each function and provides example scripts. The pipeline decisions made in this project can be traced in the notebooks in the /baseline/02_analysis folder.

Furthermore, I have built custom helper functions specific to the post-processing and analysis of the neural profiling project. These functions make the pipeline more manageable and easier to work with in notebooks.


Evaluation metrics

Over the course of this project, four evaluation metrics have been identified as helpful tools to assess the quality of the profiles.

  • Enrichment
  • Average Precision @k
  • Average Recall @k
  • Hits @k

Generally, we want profiles to be a good representation of the real morphology of the cells. "Good" in this context means for example that:

  • the profiles of compounds are distinguishable from the negative controls (DMSO)
  • the profiles contain only minimal batch effects, i.e. they do not correlate with well position, plate, or batch number
  • compounds with the same MOA (mechanism of action) are close neighbors

The last point in the list above is what most of the metrics are based on. We define replicates as compounds that share the same MOA and can then calculate how many compounds have their replicates among their closest neighbors.

All evaluation metrics used in this project can be found in the cytominer-eval repository.

Enrichment

Enrichment is a score that looks at the closest connections (highest pairwise similarities) among all compounds and checks how often they are replicates. This metric is used throughout the whole project and has proven to be the most consistent and comparable metric of all. A simple demonstration of the enrichment metric can be found in this notebook.
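One common formulation of such a score is an odds ratio: how over-represented are same-MOA pairs among the strongest pairwise similarities? The sketch below is a simplified, hypothetical version (the cytominer-eval implementation differs in its details), using cosine similarity and a +0.5 Haldane correction to avoid division by zero:

```python
import numpy as np

def enrichment_odds_ratio(X, moa, percentile=50):
    """Simplified enrichment sketch: odds ratio of same-MOA pairs among the
    top pairwise cosine similarities versus the remaining pairs."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    i, j = np.triu_indices(len(X), k=1)          # all unique compound pairs
    sims = sim[i, j]
    same = np.asarray(moa)[i] == np.asarray(moa)[j]
    top = sims >= np.percentile(sims, percentile)
    a, b = np.sum(top & same), np.sum(top & ~same)
    c, d = np.sum(~top & same), np.sum(~top & ~same)
    # Haldane correction (+0.5 per cell) keeps the ratio finite.
    return ((a + 0.5) * (d + 0.5)) / ((b + 0.5) * (c + 0.5))

# Two well-separated MOA clusters yield a score well above 1.
X = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
score = enrichment_odds_ratio(X, ["a", "a", "b", "b"])
```

A score of 1 would mean same-MOA pairs are no more common among the top connections than chance predicts; larger values indicate better profiles.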

Precision and Recall

While these are technically two different metrics, they are calculated within one function and are thus often mentioned hand in hand. We calculate the precision and recall of the first k neighbors for every compound and then average these numbers. Example: say compound A has two other compounds (B and C) with the same MOA, and the 5 nearest neighbors of A are [W, X, Y, Z, B]. Then the precision is 1/5 = 0.2 and the recall is 1/2 = 0.5. These values are averaged over all compounds in the dataset to arrive at the average precision/recall @ k.
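The worked example can be reproduced in a few lines (a hypothetical helper, not the cytominer-eval implementation):

```python
def precision_recall_at_k(neighbors, replicates, k):
    """Precision and recall of the top-k neighbor list for one compound."""
    hits = sum(1 for n in neighbors[:k] if n in replicates)
    return hits / k, hits / len(replicates)

# Compound A's replicates (same MOA) are B and C; its 5 nearest neighbors:
precision, recall = precision_recall_at_k(["W", "X", "Y", "Z", "B"],
                                          {"B", "C"}, k=5)
print(precision, recall)  # → 0.2 0.5
```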

An example analysis can be found here.


Hits @k

I designed the hits @k metric specifically for this project: I was looking for a metric that is useful to the pharmacological discovery process while being simple enough to be understood by people without a statistical background. Similar to the precision calculation, the position of each replicate in the list of nearest neighbors (NN) is determined and collected into one long list. This list can then be plotted as a histogram showing, for example, how many MOA replicates are found within the first 5 NN and how many at positions 50-55. Ideally, the first few NN positions are much more strongly represented.
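A minimal sketch of that idea (hypothetical, not the cytominer-eval code): for every compound, record the 1-based rank of each same-MOA compound in its nearest-neighbor list, then histogram the pooled ranks.

```python
import numpy as np

def replicate_ranks(sim, moa):
    """1-based NN-list positions of same-MOA compounds, pooled over compounds."""
    moa = np.asarray(moa)
    ranks = []
    for i in range(len(moa)):
        order = np.argsort(-sim[i])      # most similar first
        order = order[order != i]        # drop the compound itself
        ranks += [pos for pos, j in enumerate(order, start=1)
                  if moa[j] == moa[i]]
    return ranks

# Toy similarity matrix: compounds 0/1 share one MOA, 2/3 another.
sim = np.array([[1.0, 0.9, 0.1, 0.2],
                [0.9, 1.0, 0.3, 0.1],
                [0.1, 0.3, 1.0, 0.8],
                [0.2, 0.1, 0.8, 1.0]])
ranks = replicate_ranks(sim, ["a", "a", "b", "b"])
counts, _ = np.histogram(ranks, bins=[1, 2, 3, 4])
```

In this toy case every replicate sits at rank 1, so the histogram piles all hits into the first bin, which is exactly the shape a good profile set should produce.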



Results

All important results from the analysis folder can be found in /baseline/03_results/. This folder contains CSV files holding the best evaluation metrics and plot images of those metrics. The main lesson from this first part of the overall project is that normalizing and then spherizing the data leads to the best metric scores.
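For reference, one common sphering transform is ZCA whitening, sketched below (an illustration of the general technique, not necessarily the exact transform Pycytominer applies):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 4)) @ rng.normal(size=(4, 4))  # correlated features

# ZCA whitening ("sphering"): decorrelate features and scale them to unit
# variance, then rotate back so the axes stay interpretable.
Xc = X - X.mean(axis=0)
cov = Xc.T @ Xc / (len(Xc) - 1)
vals, vecs = np.linalg.eigh(cov)
W = vecs @ np.diag(1.0 / np.sqrt(vals + 1e-12)) @ vecs.T
X_sphered = Xc @ W

# The sphered data now has (approximately) identity covariance, so linear
# plate/well correlations between features are removed.
print(np.allclose(np.cov(X_sphered, rowvar=False), np.eye(4), atol=1e-6))
```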