README.Rmd

---
output: github_document
---

<!-- README.md is generated from README.Rmd. Please edit that file -->

```{r, include = FALSE}
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  fig.align="center",
  out.width = "100%"
)
```

# SepstratifieR

<!-- badges: start -->
<!-- badges: end -->

The goal of SepstratifieR is to stratify patients with suspected infection into groups with different molecular characteristics. This is done based on the expression level of a small set of genes measured from whole blood.

License: MIT + file LICENSE

## Installation

You can install the development version of SepstratifieR from [GitHub](https://github.com/) with:

``` r
# Install dependencies
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
    
BiocManager::install("batchelor")
BiocManager::install("MatrixGenerics")

# Install SepstratifieR
# install.packages("devtools")
devtools::install_github("jknightlab/SepstratifieR")
```
## Details

### Background
This package is designed to stratify patients with suspected infectious disease into different molecular groups based on a sample of their gene expression from whole blood. These molecular groups are defined based on a small gene signature, and are referred to as sepsis response signature (SRS) groups.

There are three SRS groups, which are as follows:

SRS1 = Composed of sepsis patients with an immunosupressed profile. These individuals are often at high risk of mortality.

SRS2 = Composed of sepsis patients with an immunocompetent profile. These individuals are at lower risk of mortality.

SRS3 = Composed mostly of healthy individuals.


For more information on how SRS groups were originally defined, please refer to the following publications:

https://doi.org/10.1016/S2213-2600(16)00046-1

https://doi.org/10.1164/rccm.201608-1685OC

### The stratification algorithm
To perform stratification on a group of patient samples (i.e. the user's input), SepstratifieR first aligns the input samples to a reference data set containing gene expression profiles from healthy individuals and sepsis patients. This alignment is performed using the mutual nearest neighbours (mNN) algorithm for batch correction. This has the purpose of bringing the predictor variables to the desired scale. 

Next, the samples are classified into SRS groups based on a previously trained random forest model. In addition, each sample is also assigned a quantitative sepsis response score (SRSq) based on a second random forest prediction model. This score (SRSq) goes from 0 to 1. Patients with SRSq close to zero are likely to be healthy, while patients with SRSq close to one are at high risk.

The diagram below describes how the models used by SepstratifieR were built (top panel), as well as how the package's functions perform alignment and classification (bottom panel):

![Schematic diagram of the analysis steps performed by the SepstratifieR package](./man/figures/README-method-diagram.png)

### Input format
The input expected by this function is a data frame object with rows corresponding to individuals/samples and columns corresponding to genes. This data frame must contain at least the following columns: 

1. When using the 7-gene signature defined by Davenport et al:
```{r show_table_of_columns, results='asis', echo=FALSE}
knitr::kable(
  data.frame(
    "Column_name"=c("ENSG00000152219", "ENSG00000100814", "ENSG00000127334","ENSG00000131355","ENSG00000137337","ENSG00000156414","ENSG00000115085"),
    "Gene_name"=c("ARL14EP", "CCNB1IP1", "DYRK2", "ADGRE3","MDC1","TDRD9","ZAP70")
    )
  )
```

2. When using the extended 19-gene signature:
```{r show_table_of_columns_extended_set, results='asis', echo=FALSE}
knitr::kable(
  data.frame(
    "Column_name"=c("ENSG00000144659", "ENSG00000103423", "ENSG00000135372", "ENSG00000079134", "ENSG00000135972", "ENSG00000087157", "ENSG00000165006", "ENSG00000111667", "ENSG00000182670", "ENSG00000097033", "ENSG00000165733", "ENSG00000103264","ENSG00000152219", "ENSG00000100814", "ENSG00000127334","ENSG00000131355","ENSG00000137337","ENSG00000156414","ENSG00000115085"),
    "Gene_name"=c("SLC25A38","DNAJA3","NAT10","THOC1","MRPS9","PGS1","UBAP1","USP5","TTC3","SH3GLB1","BMS1","FBXO31","ARL14EP", "CCNB1IP1", "DYRK2", "ADGRE3","MDC1","TDRD9","ZAP70")
             )
  )
```

If more columns are present, they will simply be ignored.


We recommend that predictor variables have the following units:

**Microarray:** Background-corrected, VSN-normalized, log-transformed intensity values

**RNA-seq:** Log-transformed counts per million (i.e. log-cpm)

**qRT-PCR:** Negative Cq values


In addition, any technical batch effects should be removedfrom the input data set before using SepstratifieR.

### A brief example
Below is a basic example which shows you how to use this package to stratify a small set of patients into sepsis response groups:

```{r example}
# Load package
library(SepstratifieR)

# Load test data set
data(test_data)
head(test_data)

# Stratify patients
predictions <- stratifyPatients(test_data)
```

The results from this prediction look as follows:
```{r example_output}
predictions
```

Futhermore, you can use SepstratifieR's built-in plotting function to check whether the input samples were successfully mapped to the reference set and if there are any clear outliers.

```{r example_plot}
plotAlignedSamples(predictions)
```

## Using a minimal or an extended gene signature for prediction
SepstratifieR enables predictions based on two different gene signatures:

1. A minimal set of 7 genes defined by Davenport et al. in 2016
2. An extended set of 19 genes defined by Cano-Gamez et al. in 2021. This extended set includes the 7 genes proposed by Davenport plus an additional 12 genes derived from integrative analysis of RNA-seq and microarray data for sepsis patients in the GAinS study. 

For further details on the definition of these signatures, please refer to the relevant publications.

The user can specify which gene signature to use for prediction by simply assigning a value to the 'gene_set' parameter. The default behaviour of SepstratifieR is to use the minimal 7-gene signature. This parameter can be modified as illustrated bellow:

```{r stratify_with_minimal_and_extended_gene_signatures, message=FALSE, warning=FALSE}
# Stratify patients based on the 7-gene signature (this is the default option)
predictions <- stratifyPatients(test_data, gene_set = "davenport")

# Stratify patients based on the 19-gene signature
predictions_extended <- stratifyPatients(test_data, gene_set = "extended")
```

IMPORTANT: Note that the extended gene signature was trained using RNA-seq and microarray measurements only. Thus, in its current state this signature should not be used to predict labels from qPCR data.

## Setting the number of mutual nearest neighbours (k)
Perhaps the most important parameter to take into account when performing patient stratification is the number of mutual nearest neighbours in the data alignment step (k). The impact of this parameter on data integration has been previously summarized in the documentation of mNN (https://rdrr.io/github/LTLA/batchelor/man/mnnCorrect.html).

In brief, lower values of 'k' will retain more substructure in the input data, with samples that do not closely resemble the reference set being flagged as outliers (i.e. samples for which no mutual nearest neighbor was found). Conversely, higher values of 'k' will result in a more lenient and aggressive merging, where samples are forced to align with the reference data even if they differ in certain aspects. Higher values of 'k' often result in better performance of data integration, but can also cause outliers to pass undetected, especially when a small group of samples is not well represented in the reference set. 

The authors of mNN suggest that 'k' be set to the expected size for the smallest subpopulation. Based on the proportion of individuals from different sepsis response (SRS) groups previously reported in the context of sepsis patients in intensive care, we recommend that this parameter be set to 20-30% the number of input samples. However, please not that this value might not be ideal if you are using this algorithm in a different patient population.

In the section below we explain how to assess if the predictions obtained with SepstratifieR are robust to the choice of 'k'.

## Running a sensitivity analysis
It is often unclear which value of 'k' is appropriate for a specific analysis. Moreover, if 'k' is low, many samples might be flagged as potential outliers by SepstratifieR, and it can be difficult to distinguish whether they are true outliers or simply samples which remained unmapped in the mNN step. 

The best way to tackle both of these problems is by performing a sensitivity analysis. In this analysis, patient stratification and SRS prediction are repeatedly performed for a range of 'k' values, and the results from each iteration  are compared to each other so as to assess their stability. 

SepstratifieR has a built in function for sensitivity analysis. You can run this function on the same input used for patient stratification, as shown below:
```{r sensitivity_analysis_example}
sensitivity_results <- runSensitivityAnalysis(test_data)
```

The heatmap above clearly shows that the quantitative predictions from SepstratifieR (SRSq) are robust to the choice of 'k', as they do not seem to change. In addition, when computing the variability in predicted SRSq scores across all of the evaluated values of 'k', we obtain a very low variance, as shown in the scatter plot. Finally, because samples flagged as outliers are randomly distributed in the plot and do not show a higher variance compared to the rest of the data, we can be fairly confident that they are not true outliers.

Now let's see how this would look like if a group of outlier samples were present in the input data.

We begin by artificially increasing the expression levels of ARL14EP for the last 30 samples in the data set.
```{r create_outliers}
set.seed(1)
test_data$ENSG00000152219[114:143] <- test_data$ENSG00000152219[114:143] + rnorm(30, mean=8, sd=1)

tail(test_data)
```

These samples should now act as a subgruop of outliers and, indeed, they very clearly separate along PC3.
```{r plot_pca_with_outliers}
preds <- stratifyPatients(test_data, verbose=F)
plotAlignedSamples(preds, pcs=c(1,3), color_by = "mNN_outlier")
```

To confirm this, let's now re-run the sensitivity analysis on this data set. Notice how now the scatter plot clearly shows that the group of samples set as outliers has higher variance in their SRSq estimations. Indeed, most of them are correctly flagged as outliers during mNN alignment. You can also see this in the heatmap, where the SRSq value estimated for the outlier samples abruptly decreases as 'k' increases.
```{r sensitivity_analysis_with_outliers}
sensitivity_results <- runSensitivityAnalysis(test_data)
```

Under this last scenario, the sensitivity analysis suggests we should exclude at least a subset of the samples flagged as outliers. 

## Working with single samples or limited sample sizes
SepstratifieR relies on aligning samples to a reference gene expression set. This step requires the availability of information from multiple samples, which is used to identify shared patterns of variation between batches and achieve a high quality alignment. Due to this requirement, we do not recommend using the main functions in SepstratifieR when dealing with a single sample or a limited sample size. This is because instability in batch alignment makes these predictions unreliable. 

Based on simulations and data subsampling, we estimate that the stratifyPatients() function should only be applied to data sets over 25 samples.

For situations where sample size is limited, we instead provide a purpose-built function which uses a 'lazy learning' approach to estimate SRS and SRSq.
This approach is based on identifying the samples in our reference set which are most similar to the sample of interest (i.e. their nearest neighbours), and then "projecting" the SRS and SRSq labels of these nearest neighbours into the sample of interest. Similarities between sample are estimated using cosine similarities, which are independent of scale differences and thus robust to technical variation. Projection is then done using a "majority vote" system, where each nearest neighbour contributes information proportionally to its similarity to the sample of interest.

The following diagram illustrates our lazy learning approach for patient stratification:

![Schematic diagram of the lazy learning approach for predicting SRS/SRSq in individual samples](./man/figures/README-sample-projection-approach.png)


### Model parameters and input variables
Our lazy learning algorithm can be performed using either of the two gene signatures, as specified by the user. Moreover, the number of nearest neighbours (k) used to estimate SRS/SRSq by majority voting can also be specified.

For this function, we recommend that predictor variables have the following units:

**Microarray:** Background-corrected, VSN-normalized, log-transformed intensity values

**RNA-seq:** Log-transformed counts per million (i.e. log-cpm)

**qRT-PCR:** 2^(Negative Cq values)

**IMPORTANT NOTES:**

1. The expected units for qRT-PCR data are not the same in stratifyPatients() than in projectPatient(). The latter function expects positive values (i.e. 2^-Cq).

2. The meaning of 'k' in this function is not the same as in stratifyPatients(). The latter uses k for alignment but not for prediction. For lazy learning, 'k' has a direct impact on prediction.

### A brief example
Below is an example of how to predict SRS/SRSq for a single isolated sample.

Let's first choose one random sample from our test set:
```{r choose_example_sample}
set.seed(2)
test_sample <- test_data[sample(rownames(test_data),1),]
```

We can stratify this sample by SRS/SRSq by calling the projectPatient() function within SepstratifieR, as follows:
```{r single_sample_mode}
# Predict SRS and SRSq using kNN-based label projection
prediction <- projectPatient(test_sample)
```

The resulting object, of class SepsisProjection, contains the SRS and SRSq values estimated using this approach, as well as metadata on how the algorithm was run:
```{r print_sepsis_projection}
prediction
```

Note that this function is not as accurate as stratifyPatients(), since the latter is based on cross-validated random forest models, for which accuracy is known and stable. In contrast, projectPatient() is not model based and is substantially more computationally intensive. Thus, we only recommend using this approach when less than 25 samples are available. 

## Contact
Eddie Cano-Gamez:  ecg@well.ox.ac.uk