D. Smirnov, P. Mazin, M. Osetrova, E. Stekolshchikova, E. Khrameeva 8/18/2021
Here we present step-by-step guide to bioinformatic analysis of untargeted LS-MS lipidomic data. This manual reproduces the key steps described in the manuscript:
-
Data importing
-
Lipid signal extraction (peak picking, peak alignment, peak grouping)
-
Filtering and normalization
-
Visualization
Note: normalization and statistical methods as well as preprocessing parameters cannot be universally applicable and should be chosen in each individual case based on experimental conditions, instrument characteristics and study purposes!
Loading the packages required for analysis
library(xcms)
library(ggplot2)
library(DT)
library(IPO)
library(mixOmics)
library(dplyr)
library(missForest)
library(reshape2)
library(gridExtra)
library(SummarizedExperiment)
We will demonstrate the key concepts of LC-MS untargeted lipidomic analysis using a test dataset. Raw MS files (~1.87 GB) from this dataset converted into the .mzXML format can be downloaded into the current directory using the following code:
#url <- "https://makarich.fbb.msu.ru/khrameeva/brainmap/sampledata.tar.gz"
#download.file(url, destfile = 'sampledata.tar.gz', method = "curl")
#untar('sampledata.tar.gz')
Raw MS files located in the sampledata/
folder are organized into
three subfolders according to the sample groups (2 files per group +
blank measurement). The code below will create a table with sample
metadata
mzfiles <- list.files('sampledata/', recursive = TRUE, full.names = TRUE, pattern = '.mzXML')
group <- unlist(lapply(strsplit(mzfiles,"/"), function (x) x[[3]]))
pd <- data.frame(sample_name = sub(basename(mzfiles), pattern = ".mzXML", replacement = "", fixed = TRUE),
sample_group = group,
stringsAsFactors = FALSE)
knitr::kable(pd)
sample_name | sample_group |
---|---|
Blank7_B_23 | blank |
MS299_HB_38 | group1 |
MS650_HD_346 | group1 |
MS122_MA_307 | group2 |
MS423_MB_628 | group2 |
Now .mzXML files can be imported into MSnExp
object via readMSData
function
raw_data <- readMSData(files = mzfiles,
pdata = new("NAnnotatedDataFrame", pd),
mode = "onDisk",
msLevel = 1,
verbose = T,
centroided = T)
## Reading 2880 spectra from file Blank7_B_23.mzXML
## Reading 2881 spectra from file MS299_HB_38.mzXML
## Reading 2876 spectra from file MS650_HD_346.mzXML
## Reading 2874 spectra from file MS122_MA_307.mzXML
## Reading 2877 spectra from file MS423_MB_628.mzXML
Feature detection CentWave
algorithm based on continuous wavelet
transformation allows to distinguish lipid peaks from background noise.
To perform peak calling we need to set up CentWave parameters first
cwp <- CentWaveParam(peakwidth = c(9.5, 36),
ppm = 11.5,
noise = 0,
snthresh = 10,
mzdiff = -0.001,
prefilter = c(3, 100),
mzCenterFun = "wMean",
integrate = 1,
fitgauss = FALSE)
It’s highly recommended to specify peakwidth
(minimum and maximum MS
peak width in RT dimension) and ppm
(width of region of interest in the
m/z dimension) parameters based on ion chromatograms for internal
standards.
Once the parameters are set one can proceed to chromatographic peak
detection using findChromPeaks
function.
xset <- findChromPeaks(raw_data, param = cwp)
Peak alignment procedure aims to eliminate retention times shifts
between identified chromatographic peaks from samples. We will use
OBI-warp algorithm implemented within xcms
package to match peaks
between MS runs.
arp <- ObiwarpParam(distFun = "cor_opt",
binSize = 1,
response = 1,
gapInit = 0.32,
gapExtend = 2.688,
factorDiag = 2,
factorGap = 1,
localAlignment = FALSE)
xset <- adjustRtime(xset, param = arp)
## Sample number 3 used as center sample.
## Applying retention time adjustment to the identified chromatographic peaks ... OK
ObiwarpParam
function creates an object with parameters for the
alignment and adjustRtime
performs the peak matching.
An example of peaks before and after RT shift correction is shown below
group_colors <- c("black", "red", "forestgreen")
names(group_colors) <- unique(xset$sample_group)
par(mfrow=c(1,2),las=1)
chr <- chromatogram(xset, rt = c(405, 435), mz = c(797.58, 797.63), aggregationFun = "max", adjustedRtime = F)
chr.adj <- chromatogram(xset, rt = c(405, 435), mz = c(797.58, 797.63), aggregationFun = "max", adjustedRtime = T)
plot(chr, peakType = "none", col=group_colors[xset$sample_group], main = "Before alignment")
legend(426, 32000, legend=c("Blank", "Group1", "Group2"), col=group_colors, lty=1:1, cex=0.45)
plot(chr.adj, peakType = "none", col=group_colors[xset$sample_group], main = "After alignment")
legend(426, 32000, legend=c("Blank", "Group1", "Group2"), col=group_colors, lty=1:1, cex=0.45)
Once retention time correction is done one can proceed to the
correspondence analysis of aligned peaks. The general idea of peak
grouping is to identify peaks from the same ion across samples and group
them together to form a lipid feature. In order to do this analysis, we
will use Peak density method that accessible in xcms
via
groupChromPeaks
function. PeakDensityParam
object contains grouping
settings and is used as input for groupChromPeaks
.
pdp <- PeakDensityParam(sampleGroups = xset$sample_group,
bw = 0.879999999999999,
binSize = 0.02412,
minFraction = 0.000001,
minSamples = 1,
maxFeatures = 50)
xset <- groupChromPeaks(xset, param = pdp)
## Processing 90525 mz slices ... OK
For simplicity and saving time, in the code sections above we provided
parameters optimized for untargeted lipidome LC-MS measurements on a
Reversed-Phase Bridged Ethyl Hybrid (BEH) C8 column reverse coupled to a
Vanguard precolumn, using a Waters Acquity UPLC system and a heated
electrospray ionization source in combination with a Bruker Impact II
QTOF (quadrupole-Time-of-Flight) mass spectrometer. To customize the
parameters for a particular LS-MS experiment we recommend optimizing
them via IPO
package.
Please note that optimizing peak calling parameters with
optimizeXcmsSet
function is a quite computationally intensive
operation. It may take hours or even days (depending on the number of
samples) before the optimization process ends!
To perform the optimization just uncomment the code below. It will return the R script with optimized processing parameters.
Set up default parameters for peak picking optimization procedure:
#peakpickingParameters <- getDefaultXcmsSetStartingParams('centWave')
#peakpickingParameters$min_peakwidth = c(0,10)
#peakpickingParameters$max_peakwidth = c(10,30)
#peakpickingParameters$ppm = c(0,10)
Optimize peak picking parameters:
#resultPeakpicking <- optimizeXcmsSet(files = mzfiles,
# params = peakpickingParameters,
# nSlaves = 0,
# subdir = NULL)
#optimizedXcmsSetObject <- resultPeakpicking$best_settings$xset
Optimize retention time correction and grouping parameters:
#retcorGroupParameters <- getDefaultRetGroupStartingParams()
#resultRetcorGroup <- optimizeRetGroup(xset = optimizedXcmsSetObject,
# params = retcorGroupParameters,
# nSlaves = 0,
# subdir = NULL)
#writeRScript(resultPeakpicking$best_settings$parameters,
# resultRetcorGroup$best_settings, 1)
Unfortunately, peak peaking algorithm may produces a sufficient numbers
of NAs for those samples in which it wasn’t able to identify MS peaks.
We will try to impute missing chromatographic peaks within samples using
fillChromPeaks
function.
xset <- fillChromPeaks(xset)
## Defining peak areas for filling-in .... OK
## Start integrating peak areas from original files
Please note that fillChromPeaks
may not impute all the gaps in MS
data. The remaining missing values will be further removed/imputed in
the section “Filtering of peaks” below.
Now we can extract feature matrix from xset
object
pks <- chromPeaks(xset)
grs <- featureDefinitions(xset)
mtx <- featureValues(xset, method="maxint", value="into", filled=T)
knitr::kable(head(mtx))
Blank7_B_23.mzXML | MS299_HB_38.mzXML | MS650_HD_346.mzXML | MS122_MA_307.mzXML | MS423_MB_628.mzXML | |
---|---|---|---|---|---|
FT00001 | 1353.6443 | 992.3192 | 786.8389 | 858.8071 | 806.18414 |
FT00002 | 3119.5129 | 724.5263 | 2068.3573 | 146.3082 | 56.23612 |
FT00003 | 51652.7105 | 59957.6374 | 46196.7493 | 41635.2908 | 49026.81721 |
FT00004 | 255.9576 | 737.5624 | 1122.6784 | 1180.6987 | 1649.92238 |
FT00005 | 4420.6834 | 4915.5724 | 4860.1878 | 2665.9554 | 4412.18887 |
FT00006 | 32.2304 | 3396.0094 | 1792.0359 | 942.3421 | 3048.21657 |
Change the column names of mtx matrix
colnames(mtx) <- unlist(strsplit(colnames(mtx), split = '.mzXML'))
To correctly normalize abundance matrix mtx
further we will need the
information about internal standard used in the experiment. The code
below will extract peaks corresponding to TG(15:0-18:1-d7-15:0).
source("src/rt-mz.annotator.R") # load the annotator function
Generate TAG that corresponds to the standard:
std = generateTGL(15+18+15,1)
std$FORMULA = 'C51H89D7O6' #change 7 H to 7 D
std$EXACT_MASS = calcExactMass(std$FORMULA) # recalculate the exact mass using updated formula
knitr::kable(std)
n | k | FORMULA | EXACT_MASS | LM_ID | |
---|---|---|---|---|---|
C51H96O6 | 48 | 1 | C51H89D7O6 | 811.7646 | TAG_48:1 |
Using annotateByMass
function we will find all the peaks in the data
that matched internal standard by exact mass value and filter them based
on the elution time. In this function adducts can be defined by ions
parameter, by default the function uses H, Na, NH4, K and aNH4. We will
specify ppm threshold for acceptable m/z deviation in annotateByMass
to 20.
std.ann = annotateByMass(data.frame(id=rownames(grs),rt=grs$rtmed,mz=grs$mzmed), std, ppm=20)
std.ann = std.ann[std.ann$rt > 600, ] # Although we don't know the exact RT value we know that TAGs elute after 10 minutes
knitr::kable(std.ann)
id | rt | mz | ion | LM_ID | EXACT_MASS | FORMULA | ppm | delta | ppmd | |
---|---|---|---|---|---|---|---|---|---|---|
1 | FT11076 | 935.0534 | 812.7639 | H | TAG_48:1 | 811.7646 | C51H89D7O6 | 9.856172 | 0.0080108 | 9.856172 |
01 | FT11602 | 928.0410 | 829.8016 | NH4 | TAG_48:1 | 811.7646 | C51H89D7O6 | 3.813919 | 0.0031648 | -3.813919 |
11 | FT11737 | 927.5400 | 834.7573 | Na | TAG_48:1 | 811.7646 | C51H89D7O6 | 4.137053 | 0.0034534 | -4.137053 |
03 | FT12188 | 928.0410 | 850.7315 | K | TAG_48:1 | 811.7646 | C51H89D7O6 | 4.402539 | 0.0037454 | -4.402539 |
04 | FT12700 | 928.0410 | 870.8291 | aNH4 | TAG_48:1 | 811.7646 | C51H89D7O6 | 4.700613 | 0.0040934 | -4.700613 |
H-adduct looks suspicious (high ppm, different rt), so we can exclude it from analysis:
std.ann = std.ann[std.ann$ppm < 5, ]
Finally, compute the median intensity of peaks found within each sample. Those will come in handy for normalization procedure.
div <- apply(mtx[std.ann$id, c(2:5)], 2, function (x) median(x, na.rm = T))
Annotation is arguably the most tricky part of untargeted LS-MS analysis. Here we present a method to create annotation for lipid features obtained in the previous steps.
To obtain the full annotation set we will again utilize annotateByMass
function, but this time m/z values of peaks will be matched to exact
masses of existing lipids from LIPID MAPS database.
ann <- annotateByMass(data.frame(id=rownames(grs),rt=grs$rtmed,mz=grs$mzmed), db = LMDB, ppm = 20)
knitr::kable(head(ann))
id | rt | mz | ion | LM_ID | EXACT_MASS | FORMULA | SYSTEMATIC_NAME | ABBREV | ppm | delta | ppmd | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | FT00001 | 286.703735 | 113.1329 | H | LMFA11000592 | 112.1252 | C8H16 | 3-methyl-1-heptene//cis-1,2-dimethylcyclohexane//trans-1,2-dimethylcyclohexane//1,4-dimethylcyclohexane//Ethylcyclohexane | - | 3.336934 | 0.0003775 | -3.336934 |
02 | FT00003 | 286.204010 | 121.0283 | Na | LMFA01030099 | 98.0368 | C5H6O2 | 2,4-pentadienoic acid//penta-2,4-dienoic acid | FA 5:2 | 18.635676 | 0.0022554 | -18.635676 |
03 | FT00006 | 35.376669 | 123.0556 | K | LMFA11000035 | 84.0939 | C6H12 | 2E-Hexene | - | 11.744882 | 0.0014453 | 11.744882 |
1 | FT00012 | 5.270351 | 125.0961 | H | LMFA06000034 | 124.0888 | C8H12O | 2,4-octadienal//5,7-octadienal//2,4-Dimethyl-2E,4E-hexadienal//2E,7-Octadienal//2E,6E-Octadienal//2E,4Z-Octadienal//2E,6Z-Octadienal | FAL 8:2 | 0.435522 | 0.0000545 | -0.435522 |
2 | FT00012 | 5.270351 | 125.0961 | H | LMFA12000017 | 124.0888 | C8H12O | 3E,5E-Octadien-2-one//6-Methyl-3E,5-heptadien-2-one | - | 0.435522 | 0.0000545 | -0.435522 |
11002 | FT00012 | 5.270351 | 125.0961 | Na | LMFA05000111 | 102.1045 | C6H14O | 3-Methylpentan-1-ol//Hexan-1-ol//Hexan-3S-ol//Hexan-2-ol//3-methyl-3-Pentanol//4-Methyl-pentan-1-ol | FOH 6:0 | 19.285398 | 0.0024125 | -19.285398 |
table(ann$ion)
##
## aNH4 H K Na NH4
## 9666 9941 7243 9040 9602
Here ppmd
is calculated as
where is a peak m/z
value,
is an exact
mass of known lipid from LIPID MAPS database and
represents the adduct mass.
Next we will explore the distribution of ppmd
values from the
annotation table:
h = hist(ann$ppmd, 1000, main = 'ppmd distribution', xlab = 'ppmd')
mode = which.max(h$counts)
mode = h$breaks[mode]/2 + h$breaks[mode+1]/2
abline(v = mode, col = 'red')
Most likely there is a m/z shift around -4 ppm (shown by red vertical
line), it also corresponds well with
ppmd
of the internal standard.
Display the value of m/z shift
mode
## [1] -3.925
Take only annotation within mode +- 10
ann = ann[ann$ppmd > (mode - 10) & ann$ppmd < (mode + 10), ]
Check annotation uniqueness
annotation.freq <- table(table(ann$id))
print(paste(annotation.freq[1], "peaks have unique annotation"))
## [1] "3590 peaks have unique annotation"
print(paste(annotation.freq[2], "peaks are annotated with two lipids"))
## [1] "2182 peaks are annotated with two lipids"
Add two columns with lipid categories and classes to ann
table
ann$category = substr(ann$LM_ID,3,4)
ann$class = substr(ann$LM_ID,3,6)
Display the number of lipids annotated uniquely on category level
annc = unique(ann[,c('id','rt','mz','category')])
annc.freq <- table(table(annc$id))
print(paste(annc.freq[1], "lipids have unique annotation on category level"))
## [1] "5845 lipids have unique annotation on category level"
We will retain features with unique annotation to build rt-mz plot:
t = table(annc$id) # count the annotation variants for all features
annc = annc[annc$id %in% names(t)[t==1],] # filter out all features with multiply annotation names
cats = unique(annc$category) # get the list of lipid categories
cols = setNames(RColorBrewer::brewer.pal(length(cats),'Set1'),cats) # set color palette
annc <- annc[annc$rt > 120 & annc$rt < 1200, ] # specify the range of RT
plot(annc$rt,annc$mz,pch=16,col=cols[annc$category],xlab='RT', ylab='m/z', xlim = c(120, 1330), bty='n')
legend('topright',pch=16,col=cols,legend=cats)
Print the number of lipids annotated uniquely on class level
annc <- unique(ann[,c('id','rt','mz','class')])
annc.freq <- table(table(annc$id))
print(paste(annc.freq[1], "lipids have unique annotation on class level"))
## [1] "4567 lipids have unique annotation on class level"
Keep annotated features only
mtx <- mtx[unique(ann$id),]
Also one can try to perform annotation for specific lipid class only. We will illustrate such approach using a custom TAG generator that creates a table of triacylglycerols (TAG(10:0) - TAG(70:8)) with corresponding lipid formula and exact masses.
tags = generateTGL(10:70,0:8)
electron.mass = 0.00054858
knitr::kable(head(tags))
n | k | FORMULA | EXACT_MASS | LM_ID | |
---|---|---|---|---|---|
C13H22O6 | 10 | 0 | C13H22O6 | 274.1416 | TAG_10:0 |
C13H20O6 | 10 | 1 | C13H20O6 | 272.1260 | TAG_10:1 |
C13H18O6 | 10 | 2 | C13H18O6 | 270.1103 | TAG_10:2 |
C13H16O6 | 10 | 3 | C13H16O6 | 268.0947 | TAG_10:3 |
C13H14O6 | 10 | 4 | C13H14O6 | 266.0790 | TAG_10:4 |
C13H12O6 | 10 | 5 | C13H12O6 | 264.0634 | TAG_10:5 |
We will use the same annotateByMass
function as a before, but with
previously generated TAG table instead of full LIPIDMAPS database.
ann2 = annotateByMass(data.frame(id=rownames(grs),rt=grs$rtmed,mz=grs$mzmed),
tags,
ions = c(NH4=calcExactMass('NH4') - electron.mass),
ppm=20)
knitr::kable(head(ann2))
id | rt | mz | ion | LM_ID | EXACT_MASS | FORMULA | ppm | delta | ppmd | |
---|---|---|---|---|---|---|---|---|---|---|
0 | FT01070 | 37.19504 | 290.1612 | NH4 | TAG_10:1 | 272.1260 | C13H20O6 | 4.632649 | 0.0013442 | -4.632649 |
1 | FT01208 | 60.22953 | 304.0841 | NH4 | TAG_12:8 | 286.0477 | C15H10O6 | 8.201367 | 0.0024939 | -8.201367 |
2 | FT01355 | 41.18413 | 318.1925 | NH4 | TAG_12:1 | 300.1573 | C15H24O6 | 4.487987 | 0.0014280 | -4.487987 |
3 | FT01815 | 50.21149 | 358.2254 | NH4 | TAG_15:2 | 340.1886 | C18H28O6 | 8.342262 | 0.0029884 | -8.342262 |
4 | FT01816 | 53.73159 | 358.2288 | NH4 | TAG_15:2 | 340.1886 | C18H28O6 | 17.826423 | 0.0063858 | -17.826423 |
5 | FT01953 | 227.58200 | 366.1916 | NH4 | TAG_16:5 | 348.1573 | C19H24O6 | 1.326931 | 0.0004859 | -1.326931 |
ann2
annotation table contains a number of peaks that have non-unique
annotation. We will plot the rt-m/z scatter plot and highlight the area
of correct TAGs by red rectangle
plot(ann2$rt,ann2$mz, xlab = 'RT', ylab = 'm/z')
rect(820, 700, 1050, 1000, border = 'red', col=NA)
Since the plot above showed that there are additional peaks in obtained TAG annotation table, we will try to exclude them using an approach based on grid-like pattern in rt-mz coordinates (please see the manuscript for the method description).
nets <- lookForNets(ann2,rt.win=c(1,30)) # by default rt.win is in mins, we will change it to secs
net.table <- sort(table(nets$start)) # count the number of features per net
start <- names(net.table)[length(names(net.table))] # get id of the biggest net
tags = nets[!is.na(nets$start) & nets$start==as.numeric(start),] # select the biggest net
plotNet(tags)
The net looks quite noisy, so we will plot it again, but with circle
size based on value of
ppmd
deviation of peaks
plot(tags$rt, tags$mz, cex = abs(tags$ppmd-mode)/6, xlab = 'RT', ylab = 'm/z')
Filter out the peaks with high ppmd
deviation
net <- plotNet(tags[abs(tags$ppmd-mode) < 5, ])
The net looks better, but there are still some peaks that pretends to be same lipid, so the filtering procedure needs manual curation.
We will perform feature filtering based on blank samples to retain all features for which the median concentration ratio between biological samples and blank samples is greater than 2.
med.MS <- apply(mtx, 1, function(x) log10(median(x[grep("MS",colnames(mtx),perl=T)], na.rm=T)))
med.blank <- apply(mtx, 1, function(x) log10(median(x[grep("Blank",colnames(mtx))], na.rm=T)))
filter.blank <- (med.MS - med.blank) > log10(2)
filter.blank[filter.blank==T] <- NA
filter.blank[is.na(filter.blank)] <- T
mtx <- mtx[filter.blank, c(2:5)]
The proportion of filtered features can be visualized using a mean-difference plot:
plot((med.MS+med.blank)/2,
med.MS-med.blank,
pch=21, las=1, bg="gray",
col="dimgray", cex=1.2, lwd=0.4, xlab = "Sample Intensity", ylab = "Sample - Blank")
points((med.MS[filter.blank]+med.blank[filter.blank])/2,
med.MS[filter.blank]-med.blank[filter.blank],
pch=21, bg="#B20F25", col="dimgray", cex=1.2, lwd=0.4)
abline(h=log10(2),col="#B20F25")
The code below removes all features that possess more than 30% of NA across samples.
th <- 0.3
peaks.nas <- apply(mtx, 1, function (x) sum(is.na(x)))
mtx <- mtx[(peaks.nas/ncol(mtx)) < th, ]
To impute missing values not filled by fillChromPeaks
and not removed
by filtering we will use an implementation of random forest algorithm
from MissForest
package.
mtx.imp <- missForest(mtx)
## missForest iteration 1 in progress...done!
## missForest iteration 2 in progress...done!
## missForest iteration 3 in progress...done!
## missForest iteration 4 in progress...done!
## missForest iteration 5 in progress...done!
## missForest iteration 6 in progress...done!
mtx <- mtx.imp$ximp
In order to make samples comparable to each other we will utilize sample-specific normalization by wet weight and internal standard.
Load the matrix of wet weights for our samples
wetw <- as.matrix(read.csv("POS.WETWEIGHT.csv", header=F, row.names=1))
knitr::kable(wetw)
V2 | |
---|---|
MS122_MA_307 | 9.1 |
MS299_HB_38 | 10.5 |
MS423_MB_628 | 14.0 |
MS650_HD_346 | 12.9 |
Perform the normalization
wetw <- wetw[colnames(mtx), ]
mtx.normalized <- t(apply(mtx, 1, function (x) x*mean(wetw)*mean(div)/(wetw*div)))
mtx.log <- log10(mtx.normalized)
Processed matrix with quantified lipid abundances can be used for downstream analysis. We will apply two classical multivariate approaches (PCA and PLS-DA) as well as univariate statistical methods to analyze the intergroup differences between lipid profiles.
PCA is a multivariate technique that extremely useful for classification
purpose. The key idea of the method is to project original matrix of
lipid abundances into low dimensional space. To perform dimensionality
reduction PCA computes the reduced set of uncorrelated variables named
Principal Components
. For a given matrix
,
where features are rows and samples are columns, principal component
vectors can be defined by finding eigenvectors of the following sample
covariance matrix
:
where
is a centering matrix,
represents an
identity matrix of size
.
To calculate principal components we will use base R function prcomp
pca <- prcomp(t(mtx.log), center = TRUE, scale. = TRUE)
To visualize relationships between samples in a new low dimensional space we will plot PC1 and PC2 against each other.
Y <- pd$sample_group[2:5]
pca.data <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], class = Y)
ggplot(data = pca.data, aes_string(x = "PC1", y = "PC2", color = "class", shape = "class")) +
geom_point(size = 5) +
theme_light()
While both PCA and PLS-DA achieve dimensionality reduction computing the
principal components, PLS-DA can be used rather for classification and
feature selection than for clustering purpose. Mathematically, PLS-DA
principal components can be obtained in a similar to PCA manner, as
eigenvectors of a matrix of covariances between
and
:
where and
represent a centering
matrix and a total number of samples, respectively.
We will use the sparse version of the algorithm - sPLS-DA. First of all, matrix of predictors and vector of responses should be defined
X <- t(mtx.normalized)
Y <- as.factor(Y)
Tune sPLS-DA parameters using Leave-One-Out cross validation
list.keepX <- seq(1, 100, 2)
tune.splsda <- tune.splsda(X, Y, ncomp = 2, validation = 'loo', folds = 4,
progressBar = FALSE, dist = 'max.dist',
test.keepX = list.keepX, nrepeat = 1)
Run sPLS-DA with optimized parameters
splsda.model <- splsda(X, Y, ncomp = 2, keepX = tune.splsda$choice.keepX)
plotIndiv(splsda.model, ind.names = FALSE, legend=TRUE, ellipse = TRUE)
A vector of feature contributions can be retrieved from the model in the following way
plsda.contributions <- selectVar(splsda.model, comp = 1)$value
Univariate methods (e.g. t-test, Wilcoxon rank sum test, ANOVA) are especially useful for detecting differences in concentration between samples on the level of single molecules.
As an illustrative example, we will apply Wilcoxon rank sum test (WRST) to the task of comparing Group1 and Group1 lipid abundances. In contrast to t-test, this test doesn’t assume that the data has a normal distribution.
stats <- apply(mtx.normalized, 1, function (x) wilcox.test(x[1:2], x[3:4])$p.value)
Given the small sample size of groups studied (two samples per group) the minimal raw p-value we can obtain with WRST is 0.33.
min(stats)
## [1] 0.3333333
The one more important step when dealing with statistical testing of hundreds or thousands features is a multiple testing correction procedure. Setting a significance level to 0.05 one can found a 5% of features to be significant even they are not. So, if the 5000 lipids are tested, 250 lipids could be significant by chance. To avoid having large number of false positive results, raw p-value can be adjust using multiply testing correction methods (e.g. Bonferroni, Holm or FDR corrections).
p.val <- p.adjust(stats, method = 'fdr')
The statistical testing often accompanied by Fold Change
(FC)
calculation. FC shows the magnitude and direction of change in lipid
concentration between group studied.
lfc <- log2(rowMeans(mtx.normalized[,c(1,2)])) - log2(rowMeans(mtx.normalized[,c(3,4)]))
The code below creates a table with columns describing the mean concentrations across samples and log2(Fold Change) values obtained above. We will also specify the 2-fold FC cutoff for up- and downregulated features.
mean.abundance <- (log2(rowMeans(mtx.normalized[,c(1,2)])) + log2(rowMeans(mtx.normalized[,c(3,4)])))/2
ma.data <- as.data.frame(mean.abundance)
ma.data$lfc <- lfc
colnames(ma.data) <- c('MeanAbundance', 'Lfc')
ma.data$Sig <- 'NS'
ma.data$Sig[ma.data$Lfc > 1] <- 'Up'
ma.data$Sig[ma.data$Lfc < -1] <- 'Down'
Explore the number of up- and downregulated features:
table(ma.data$Sig)
##
## Down NS Up
## 1288 2456 502
Now the differences in lipid abundance between Group1 and Group2 can be visualize using MA plot, where x-axis represents average abundance level and y-axis represents log2(Fold Change) values.
p <- ggplot(ma.data, aes(x = MeanAbundance, y = Lfc, color = Sig)) +
geom_point() +
geom_hline(yintercept = c(-1, 1), linetype = "dashed") +
scale_colour_manual(values=c("#1465AC", "darkgray", "#B31B21"))+
xlab('log2(Mean normalized abundance)') + ylab('log2(Fold Change)') +
theme_light()
p
sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
##
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
##
## locale:
## [1] ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8
##
## attached base packages:
## [1] stats4 parallel stats graphics grDevices utils datasets
## [8] methods base
##
## other attached packages:
## [1] SummarizedExperiment_1.20.0 GenomicRanges_1.42.0
## [3] GenomeInfoDb_1.26.7 IRanges_2.24.1
## [5] MatrixGenerics_1.2.1 matrixStats_0.60.1
## [7] gridExtra_2.3 reshape2_1.4.4
## [9] missForest_1.4 itertools_0.1-3
## [11] iterators_1.0.13 foreach_1.5.1
## [13] randomForest_4.6-14 dplyr_1.0.7
## [15] mixOmics_6.14.1 lattice_0.20-44
## [17] MASS_7.3-54 IPO_1.16.0
## [19] CAMERA_1.46.0 rsm_2.10.2
## [21] DT_0.18 ggplot2_3.3.5
## [23] xcms_3.12.0 MSnbase_2.16.1
## [25] ProtGenerics_1.22.0 S4Vectors_0.28.1
## [27] mzR_2.24.1 Rcpp_1.0.7
## [29] BiocParallel_1.24.1 Biobase_2.50.0
## [31] BiocGenerics_0.36.1
##
## loaded via a namespace (and not attached):
## [1] colorspace_2.0-2 ellipsis_0.3.2 htmlTable_2.2.1
## [4] corpcor_1.6.9 XVector_0.30.0 base64enc_0.1-3
## [7] rstudioapi_0.13 farver_2.1.0 affyio_1.60.0
## [10] ggrepel_0.9.1 RSpectra_0.16-0 fansi_0.5.0
## [13] codetools_0.2-18 splines_4.0.3 ncdf4_1.17
## [16] doParallel_1.0.16 impute_1.64.0 robustbase_0.93-8
## [19] knitr_1.33 Formula_1.2-4 cluster_2.1.2
## [22] vsn_3.58.0 png_0.1-7 graph_1.68.0
## [25] BiocManager_1.30.16 compiler_4.0.3 backports_1.2.1
## [28] assertthat_0.2.1 Matrix_1.3-4 limma_3.46.0
## [31] htmltools_0.5.1.1 tools_4.0.3 igraph_1.2.6
## [34] gtable_0.3.0 glue_1.4.2 GenomeInfoDbData_1.2.4
## [37] affy_1.68.0 RANN_2.6.1 MALDIquant_1.20
## [40] vctrs_0.3.8 preprocessCore_1.52.1 xfun_0.25
## [43] stringr_1.4.0 lifecycle_1.0.0 XML_3.99-0.7
## [46] DEoptimR_1.0-9 zlibbioc_1.36.0 scales_1.1.1
## [49] pcaMethods_1.82.0 RBGL_1.66.0 MassSpecWavelet_1.56.0
## [52] RColorBrewer_1.1-2 yaml_2.2.1 rpart_4.1-15
## [55] latticeExtra_0.6-29 stringi_1.7.3 highr_0.9
## [58] checkmate_2.0.0 rlang_0.4.11 pkgconfig_2.0.3
## [61] bitops_1.0-7 mzID_1.28.0 evaluate_0.14
## [64] purrr_0.3.4 labeling_0.4.2 htmlwidgets_1.5.3
## [67] tidyselect_1.1.1 plyr_1.8.6 magrittr_2.0.1
## [70] R6_2.5.1 generics_0.1.0 Hmisc_4.5-0
## [73] DelayedArray_0.16.3 DBI_1.1.1 pillar_1.6.2
## [76] foreign_0.8-81 withr_2.4.2 MsCoreUtils_1.2.0
## [79] survival_3.2-13 RCurl_1.98-1.4 nnet_7.3-16
## [82] tibble_3.1.3 crayon_1.4.1 rARPACK_0.11-0
## [85] utf8_1.2.2 ellipse_0.4.2 rmarkdown_2.10
## [88] jpeg_0.1-9 grid_4.0.3 data.table_1.14.0
## [91] digest_0.6.27 tidyr_1.1.3 munsell_0.5.0