Skip to content

Latest commit

 

History

History
912 lines (707 loc) · 35.4 KB

LipidomicAnalysis.md

File metadata and controls

912 lines (707 loc) · 35.4 KB

The Hitchhiker’s Guide to untargeted lipidomics analysis: Practical guidelines

D. Smirnov, P. Mazin, M. Osetrova, E. Stekolshchikova, E. Khrameeva 8/18/2021

Introduction

Here we present step-by-step guide to bioinformatic analysis of untargeted LS-MS lipidomic data. This manual reproduces the key steps described in the manuscript:

  • Data importing

  • Lipid signal extraction (peak picking, peak alignment, peak grouping)

  • Filtering and normalization

  • Visualization

Note: normalization and statistical methods as well as preprocessing parameters cannot be universally applicable and should be chosen in each individual case based on experimental conditions, instrument characteristics and study purposes!

Package import

Loading the packages required for analysis

library(xcms)
library(ggplot2)
library(DT)
library(IPO)
library(mixOmics)
library(dplyr)
library(missForest)
library(reshape2)
library(gridExtra)
library(SummarizedExperiment)

Data import

We will demonstrate the key concepts of LC-MS untargeted lipidomic analysis using a test dataset. Raw MS files (~1.87 GB) from this dataset converted into the .mzXML format can be downloaded into the current directory using the following code:

#url <- "https://makarich.fbb.msu.ru/khrameeva/brainmap/sampledata.tar.gz"
#download.file(url, destfile = 'sampledata.tar.gz', method = "curl")
#untar('sampledata.tar.gz')

Raw MS files located in the sampledata/ folder are organized into three subfolders according to the sample groups (2 files per group + blank measurement). The code below will create a table with sample metadata

mzfiles <- list.files('sampledata/', recursive = TRUE, full.names = TRUE, pattern = '.mzXML')
group <- unlist(lapply(strsplit(mzfiles,"/"), function (x) x[[3]]))
pd <- data.frame(sample_name = sub(basename(mzfiles), pattern = ".mzXML", replacement = "", fixed = TRUE), 
                 sample_group = group, 
                 stringsAsFactors = FALSE)

knitr::kable(pd)
sample_name sample_group
Blank7_B_23 blank
MS299_HB_38 group1
MS650_HD_346 group1
MS122_MA_307 group2
MS423_MB_628 group2

Now .mzXML files can be imported into MSnExp object via readMSData function

raw_data <- readMSData(files = mzfiles, 
                       pdata = new("NAnnotatedDataFrame", pd), 
                       mode = "onDisk", 
                       msLevel = 1, 
                       verbose = T, 
                       centroided = T)
## Reading 2880 spectra from file Blank7_B_23.mzXML

## Reading 2881 spectra from file MS299_HB_38.mzXML

## Reading 2876 spectra from file MS650_HD_346.mzXML

## Reading 2874 spectra from file MS122_MA_307.mzXML

## Reading 2877 spectra from file MS423_MB_628.mzXML

Peak picking

Feature detection CentWave algorithm based on continuous wavelet transformation allows to distinguish lipid peaks from background noise. To perform peak calling we need to set up CentWave parameters first

cwp <- CentWaveParam(peakwidth = c(9.5, 36),
                     ppm = 11.5,
                     noise = 0, 
                     snthresh = 10, 
                     mzdiff = -0.001, 
                     prefilter = c(3, 100), 
                     mzCenterFun = "wMean", 
                     integrate = 1, 
                     fitgauss = FALSE)

It’s highly recommended to specify peakwidth (minimum and maximum MS peak width in RT dimension) and ppm(width of region of interest in the m/z dimension) parameters based on ion chromatograms for internal standards.

Once the parameters are set one can proceed to chromatographic peak detection using findChromPeaks function.

xset <- findChromPeaks(raw_data, param = cwp)

Peak alignment

Peak alignment procedure aims to eliminate retention times shifts between identified chromatographic peaks from samples. We will use OBI-warp algorithm implemented within xcms package to match peaks between MS runs.

arp <- ObiwarpParam(distFun = "cor_opt", 
                    binSize = 1, 
                    response = 1, 
                    gapInit = 0.32, 
                    gapExtend = 2.688, 
                    factorDiag = 2, 
                    factorGap = 1,
                    localAlignment = FALSE)

xset <- adjustRtime(xset, param = arp)
## Sample number 3 used as center sample.

## Applying retention time adjustment to the identified chromatographic peaks ... OK

ObiwarpParam function creates an object with parameters for the alignment and adjustRtime performs the peak matching.

An example of peaks before and after RT shift correction is shown below

group_colors <- c("black", "red", "forestgreen")
names(group_colors) <- unique(xset$sample_group)

par(mfrow=c(1,2),las=1)
chr <- chromatogram(xset, rt = c(405, 435), mz = c(797.58, 797.63), aggregationFun = "max", adjustedRtime = F)
chr.adj <- chromatogram(xset, rt = c(405, 435), mz = c(797.58, 797.63), aggregationFun = "max", adjustedRtime = T)
plot(chr, peakType = "none", col=group_colors[xset$sample_group], main = "Before alignment")
legend(426, 32000, legend=c("Blank", "Group1", "Group2"), col=group_colors, lty=1:1, cex=0.45)
plot(chr.adj, peakType = "none", col=group_colors[xset$sample_group], main = "After alignment")
legend(426, 32000, legend=c("Blank", "Group1", "Group2"), col=group_colors, lty=1:1, cex=0.45)

Peak grouping

Once retention time correction is done one can proceed to the correspondence analysis of aligned peaks. The general idea of peak grouping is to identify peaks from the same ion across samples and group them together to form a lipid feature. In order to do this analysis, we will use Peak density method that accessible in xcms via groupChromPeaks function. PeakDensityParam object contains grouping settings and is used as input for groupChromPeaks.

pdp <- PeakDensityParam(sampleGroups = xset$sample_group, 
                        bw = 0.879999999999999, 
                        binSize = 0.02412, 
                        minFraction = 0.000001, 
                        minSamples = 1, 
                        maxFeatures = 50)

xset <- groupChromPeaks(xset, param = pdp)
## Processing 90525 mz slices ... OK

Selection of parameters for peak picking, alignment, and grouping

For simplicity and saving time, in the code sections above we provided parameters optimized for untargeted lipidome LC-MS measurements on a Reversed-Phase Bridged Ethyl Hybrid (BEH) C8 column reverse coupled to a Vanguard precolumn, using a Waters Acquity UPLC system and a heated electrospray ionization source in combination with a Bruker Impact II QTOF (quadrupole-Time-of-Flight) mass spectrometer. To customize the parameters for a particular LS-MS experiment we recommend optimizing them via IPO package.

Please note that optimizing peak calling parameters with optimizeXcmsSet function is a quite computationally intensive operation. It may take hours or even days (depending on the number of samples) before the optimization process ends!

To perform the optimization just uncomment the code below. It will return the R script with optimized processing parameters.

Set up default parameters for peak picking optimization procedure:

#peakpickingParameters <- getDefaultXcmsSetStartingParams('centWave')
#peakpickingParameters$min_peakwidth = c(0,10)
#peakpickingParameters$max_peakwidth = c(10,30)
#peakpickingParameters$ppm = c(0,10)

Optimize peak picking parameters:

#resultPeakpicking <- optimizeXcmsSet(files = mzfiles, 
#                                     params = peakpickingParameters, 
#                                     nSlaves = 0, 
#                                     subdir = NULL)

#optimizedXcmsSetObject <- resultPeakpicking$best_settings$xset

Optimize retention time correction and grouping parameters:

#retcorGroupParameters <- getDefaultRetGroupStartingParams()
#resultRetcorGroup <- optimizeRetGroup(xset = optimizedXcmsSetObject, 
#                                      params = retcorGroupParameters,
#                                      nSlaves = 0,
#                                      subdir = NULL)

#writeRScript(resultPeakpicking$best_settings$parameters, 
#             resultRetcorGroup$best_settings, 1)

Imputation of missing values

Unfortunately, peak peaking algorithm may produces a sufficient numbers of NAs for those samples in which it wasn’t able to identify MS peaks. We will try to impute missing chromatographic peaks within samples using fillChromPeaks function.

xset <- fillChromPeaks(xset)
## Defining peak areas for filling-in .... OK
## Start integrating peak areas from original files

Please note that fillChromPeaks may not impute all the gaps in MS data. The remaining missing values will be further removed/imputed in the section “Filtering of peaks” below.

Data export

Now we can extract feature matrix from xset object

pks <- chromPeaks(xset)
grs <- featureDefinitions(xset)
mtx <- featureValues(xset, method="maxint", value="into", filled=T) 

knitr::kable(head(mtx))
Blank7_B_23.mzXML MS299_HB_38.mzXML MS650_HD_346.mzXML MS122_MA_307.mzXML MS423_MB_628.mzXML
FT00001 1353.6443 992.3192 786.8389 858.8071 806.18414
FT00002 3119.5129 724.5263 2068.3573 146.3082 56.23612
FT00003 51652.7105 59957.6374 46196.7493 41635.2908 49026.81721
FT00004 255.9576 737.5624 1122.6784 1180.6987 1649.92238
FT00005 4420.6834 4915.5724 4860.1878 2665.9554 4412.18887
FT00006 32.2304 3396.0094 1792.0359 942.3421 3048.21657

Change the column names of mtx matrix

colnames(mtx) <- unlist(strsplit(colnames(mtx), split = '.mzXML'))

To correctly normalize abundance matrix mtx further we will need the information about internal standard used in the experiment. The code below will extract peaks corresponding to TG(15:0-18:1-d7-15:0).

source("src/rt-mz.annotator.R") # load the annotator function

Generate TAG that corresponds to the standard:

std = generateTGL(15+18+15,1)
std$FORMULA = 'C51H89D7O6' #change 7 H to 7 D
std$EXACT_MASS = calcExactMass(std$FORMULA) # recalculate the exact mass using updated formula
knitr::kable(std)
n k FORMULA EXACT_MASS LM_ID
C51H96O6 48 1 C51H89D7O6 811.7646 TAG_48:1

Using annotateByMass function we will find all the peaks in the data that matched internal standard by exact mass value and filter them based on the elution time. In this function adducts can be defined by ions parameter, by default the function uses H, Na, NH4, K and aNH4. We will specify ppm threshold for acceptable m/z deviation in annotateByMass to 20.

std.ann = annotateByMass(data.frame(id=rownames(grs),rt=grs$rtmed,mz=grs$mzmed), std, ppm=20)
std.ann = std.ann[std.ann$rt > 600, ] # Although we don't know the exact RT value we know that TAGs elute after 10 minutes
knitr::kable(std.ann)
id rt mz ion LM_ID EXACT_MASS FORMULA ppm delta ppmd
1 FT11076 935.0534 812.7639 H TAG_48:1 811.7646 C51H89D7O6 9.856172 0.0080108 9.856172
01 FT11602 928.0410 829.8016 NH4 TAG_48:1 811.7646 C51H89D7O6 3.813919 0.0031648 -3.813919
11 FT11737 927.5400 834.7573 Na TAG_48:1 811.7646 C51H89D7O6 4.137053 0.0034534 -4.137053
03 FT12188 928.0410 850.7315 K TAG_48:1 811.7646 C51H89D7O6 4.402539 0.0037454 -4.402539
04 FT12700 928.0410 870.8291 aNH4 TAG_48:1 811.7646 C51H89D7O6 4.700613 0.0040934 -4.700613

H-adduct looks suspicious (high ppm, different rt), so we can exclude it from analysis:

std.ann = std.ann[std.ann$ppm < 5, ]

Finally, compute the median intensity of peaks found within each sample. Those will come in handy for normalization procedure.

div <- apply(mtx[std.ann$id, c(2:5)], 2, function (x) median(x, na.rm = T))

Annotation

Annotation is arguably the most tricky part of untargeted LS-MS analysis. Here we present a method to create annotation for lipid features obtained in the previous steps.

To obtain the full annotation set we will again utilize annotateByMass function, but this time m/z values of peaks will be matched to exact masses of existing lipids from LIPID MAPS database.

ann <- annotateByMass(data.frame(id=rownames(grs),rt=grs$rtmed,mz=grs$mzmed), db = LMDB, ppm = 20)
knitr::kable(head(ann))
id rt mz ion LM_ID EXACT_MASS FORMULA SYSTEMATIC_NAME ABBREV ppm delta ppmd
0 FT00001 286.703735 113.1329 H LMFA11000592 112.1252 C8H16 3-methyl-1-heptene//cis-1,2-dimethylcyclohexane//trans-1,2-dimethylcyclohexane//1,4-dimethylcyclohexane//Ethylcyclohexane - 3.336934 0.0003775 -3.336934
02 FT00003 286.204010 121.0283 Na LMFA01030099 98.0368 C5H6O2 2,4-pentadienoic acid//penta-2,4-dienoic acid FA 5:2 18.635676 0.0022554 -18.635676
03 FT00006 35.376669 123.0556 K LMFA11000035 84.0939 C6H12 2E-Hexene - 11.744882 0.0014453 11.744882
1 FT00012 5.270351 125.0961 H LMFA06000034 124.0888 C8H12O 2,4-octadienal//5,7-octadienal//2,4-Dimethyl-2E,4E-hexadienal//2E,7-Octadienal//2E,6E-Octadienal//2E,4Z-Octadienal//2E,6Z-Octadienal FAL 8:2 0.435522 0.0000545 -0.435522
2 FT00012 5.270351 125.0961 H LMFA12000017 124.0888 C8H12O 3E,5E-Octadien-2-one//6-Methyl-3E,5-heptadien-2-one - 0.435522 0.0000545 -0.435522
11002 FT00012 5.270351 125.0961 Na LMFA05000111 102.1045 C6H14O 3-Methylpentan-1-ol//Hexan-1-ol//Hexan-3S-ol//Hexan-2-ol//3-methyl-3-Pentanol//4-Methyl-pentan-1-ol FOH 6:0 19.285398 0.0024125 -19.285398
table(ann$ion)
## 
## aNH4    H    K   Na  NH4 
## 9666 9941 7243 9040 9602

Here ppmd is calculated as

ppmd = (1 - ) 1e^6

where MZ is a peak m/z value, EM is an exact mass of known lipid from LIPID MAPS database and Adduct represents the adduct mass.

Next we will explore the distribution of ppmd values from the annotation table:

h = hist(ann$ppmd, 1000, main = 'ppmd distribution', xlab = 'ppmd')
mode = which.max(h$counts)
mode = h$breaks[mode]/2 + h$breaks[mode+1]/2
abline(v = mode, col = 'red')

Most likely there is a m/z shift around -4 ppm (shown by red vertical line), it also corresponds well with ppmd of the internal standard.

Display the value of m/z shift

mode
## [1] -3.925

Take only annotation within mode +- 10

ann = ann[ann$ppmd > (mode - 10) & ann$ppmd < (mode + 10), ]

Check annotation uniqueness

annotation.freq <- table(table(ann$id))
print(paste(annotation.freq[1], "peaks have unique annotation"))
## [1] "3590 peaks have unique annotation"
print(paste(annotation.freq[2], "peaks are annotated with two lipids"))
## [1] "2182 peaks are annotated with two lipids"

Add two columns with lipid categories and classes to ann table

ann$category = substr(ann$LM_ID,3,4)
ann$class = substr(ann$LM_ID,3,6)

Display the number of lipids annotated uniquely on category level

annc = unique(ann[,c('id','rt','mz','category')])
annc.freq <- table(table(annc$id))
print(paste(annc.freq[1], "lipids have unique annotation on category level"))
## [1] "5845 lipids have unique annotation on category level"

We will retain features with unique annotation to build rt-mz plot:

t = table(annc$id) # count the annotation variants for all features
annc = annc[annc$id %in% names(t)[t==1],] # filter out all features with multiply annotation names
cats = unique(annc$category) # get the list of lipid categories
cols = setNames(RColorBrewer::brewer.pal(length(cats),'Set1'),cats) # set color palette
annc <- annc[annc$rt > 120 & annc$rt < 1200, ] # specify the range of RT 
plot(annc$rt,annc$mz,pch=16,col=cols[annc$category],xlab='RT', ylab='m/z', xlim = c(120, 1330), bty='n')
legend('topright',pch=16,col=cols,legend=cats) 

Print the number of lipids annotated uniquely on class level

annc <- unique(ann[,c('id','rt','mz','class')])
annc.freq <- table(table(annc$id))
print(paste(annc.freq[1], "lipids have unique annotation on class level"))
## [1] "4567 lipids have unique annotation on class level"

Keep annotated features only

mtx <- mtx[unique(ann$id),]

Also one can try to perform annotation for specific lipid class only. We will illustrate such approach using a custom TAG generator that creates a table of triacylglycerols (TAG(10:0) - TAG(70:8)) with corresponding lipid formula and exact masses.

tags = generateTGL(10:70,0:8) 
electron.mass = 0.00054858
knitr::kable(head(tags))
n k FORMULA EXACT_MASS LM_ID
C13H22O6 10 0 C13H22O6 274.1416 TAG_10:0
C13H20O6 10 1 C13H20O6 272.1260 TAG_10:1
C13H18O6 10 2 C13H18O6 270.1103 TAG_10:2
C13H16O6 10 3 C13H16O6 268.0947 TAG_10:3
C13H14O6 10 4 C13H14O6 266.0790 TAG_10:4
C13H12O6 10 5 C13H12O6 264.0634 TAG_10:5

We will use the same annotateByMass function as a before, but with previously generated TAG table instead of full LIPIDMAPS database.

ann2 = annotateByMass(data.frame(id=rownames(grs),rt=grs$rtmed,mz=grs$mzmed),
                      tags,
                      ions = c(NH4=calcExactMass('NH4') - electron.mass),
                      ppm=20)

knitr::kable(head(ann2))
id rt mz ion LM_ID EXACT_MASS FORMULA ppm delta ppmd
0 FT01070 37.19504 290.1612 NH4 TAG_10:1 272.1260 C13H20O6 4.632649 0.0013442 -4.632649
1 FT01208 60.22953 304.0841 NH4 TAG_12:8 286.0477 C15H10O6 8.201367 0.0024939 -8.201367
2 FT01355 41.18413 318.1925 NH4 TAG_12:1 300.1573 C15H24O6 4.487987 0.0014280 -4.487987
3 FT01815 50.21149 358.2254 NH4 TAG_15:2 340.1886 C18H28O6 8.342262 0.0029884 -8.342262
4 FT01816 53.73159 358.2288 NH4 TAG_15:2 340.1886 C18H28O6 17.826423 0.0063858 -17.826423
5 FT01953 227.58200 366.1916 NH4 TAG_16:5 348.1573 C19H24O6 1.326931 0.0004859 -1.326931

ann2 annotation table contains a number of peaks that have non-unique annotation. We will plot the rt-m/z scatter plot and highlight the area of correct TAGs by red rectangle

plot(ann2$rt,ann2$mz, xlab = 'RT', ylab = 'm/z')
rect(820, 700, 1050, 1000, border = 'red', col=NA)

Since the plot above showed that there are additional peaks in obtained TAG annotation table, we will try to exclude them using an approach based on grid-like pattern in rt-mz coordinates (please see the manuscript for the method description).

nets <- lookForNets(ann2,rt.win=c(1,30)) # by default rt.win is in mins, we will change it to secs
net.table <- sort(table(nets$start)) # count the number of features per net 
start <- names(net.table)[length(names(net.table))] # get id of the biggest net
tags = nets[!is.na(nets$start) & nets$start==as.numeric(start),] # select the biggest net
plotNet(tags)

The net looks quite noisy, so we will plot it again, but with circle size based on value of ppmd deviation of peaks

plot(tags$rt, tags$mz, cex = abs(tags$ppmd-mode)/6, xlab = 'RT', ylab = 'm/z')

Filter out the peaks with high ppmd deviation

net <- plotNet(tags[abs(tags$ppmd-mode) < 5, ])

The net looks better, but there are still some peaks that pretends to be same lipid, so the filtering procedure needs manual curation.

Filtering of peaks

We will perform feature filtering based on blank samples to retain all features for which the median concentration ratio between biological samples and blank samples is greater than 2.

med.MS <- apply(mtx, 1, function(x) log10(median(x[grep("MS",colnames(mtx),perl=T)], na.rm=T)))
med.blank <- apply(mtx, 1, function(x) log10(median(x[grep("Blank",colnames(mtx))], na.rm=T)))
filter.blank <- (med.MS - med.blank) > log10(2)
filter.blank[filter.blank==T] <- NA
filter.blank[is.na(filter.blank)] <- T
mtx <- mtx[filter.blank, c(2:5)]

The proportion of filtered features can be visualized using a mean-difference plot:

plot((med.MS+med.blank)/2, 
     med.MS-med.blank, 
     pch=21, las=1, bg="gray",  
     col="dimgray", cex=1.2, lwd=0.4, xlab = "Sample Intensity", ylab = "Sample - Blank")

points((med.MS[filter.blank]+med.blank[filter.blank])/2,
       med.MS[filter.blank]-med.blank[filter.blank],
       pch=21, bg="#B20F25", col="dimgray", cex=1.2, lwd=0.4)

abline(h=log10(2),col="#B20F25")

The code below removes all features that possess more than 30% of NA across samples.

th <- 0.3
peaks.nas <- apply(mtx, 1, function (x) sum(is.na(x)))
mtx <- mtx[(peaks.nas/ncol(mtx)) < th, ]

To impute missing values not filled by fillChromPeaks and not removed by filtering we will use an implementation of random forest algorithm from MissForest package.

mtx.imp <- missForest(mtx)
##   missForest iteration 1 in progress...done!
##   missForest iteration 2 in progress...done!
##   missForest iteration 3 in progress...done!
##   missForest iteration 4 in progress...done!
##   missForest iteration 5 in progress...done!
##   missForest iteration 6 in progress...done!
mtx <- mtx.imp$ximp

Normalization

In order to make samples comparable to each other we will utilize sample-specific normalization by wet weight and internal standard.

Load the matrix of wet weights for our samples

wetw <- as.matrix(read.csv("POS.WETWEIGHT.csv", header=F, row.names=1))
knitr::kable(wetw)
V2
MS122_MA_307 9.1
MS299_HB_38 10.5
MS423_MB_628 14.0
MS650_HD_346 12.9

Perform the normalization

wetw <- wetw[colnames(mtx), ]
mtx.normalized <- t(apply(mtx, 1, function (x) x*mean(wetw)*mean(div)/(wetw*div)))
mtx.log <- log10(mtx.normalized)

Downstream analysis

Processed matrix with quantified lipid abundances can be used for downstream analysis. We will apply two classical multivariate approaches (PCA and PLS-DA) as well as univariate statistical methods to analyze the intergroup differences between lipid profiles.

Principal Component Analysis (PCA)

PCA is a multivariate technique that extremely useful for classification purpose. The key idea of the method is to project original matrix of lipid abundances into low dimensional space. To perform dimensionality reduction PCA computes the reduced set of uncorrelated variables named Principal Components. For a given matrix X_{n \times m}, where features are rows and samples are columns, principal component vectors can be defined by finding eigenvectors of the following sample covariance matrix S:

where C_n = I_n - \frac{1}{n}1_{n} 1^{T}_{n} is a centering matrix, I_n represents an identity matrix of size n.

To calculate principal components we will use base R function prcomp

pca <- prcomp(t(mtx.log), center = TRUE, scale. = TRUE)

To visualize relationships between samples in a new low dimensional space we will plot PC1 and PC2 against each other.

Y <- pd$sample_group[2:5] 
pca.data <- data.frame(PC1 = pca$x[, 1], PC2 = pca$x[, 2], class = Y)
ggplot(data = pca.data, aes_string(x = "PC1", y = "PC2", color = "class", shape = "class")) + 
    geom_point(size = 5) +
    theme_light()

Partial Least-Squares Discriminant Analysis (PLS-DA)

While both PCA and PLS-DA achieve dimensionality reduction computing the principal components, PLS-DA can be used rather for classification and feature selection than for clustering purpose. Mathematically, PLS-DA principal components can be obtained in a similar to PCA manner, as eigenvectors of a matrix of covariances between X and Y:

where C_n and n represent a centering matrix and a total number of samples, respectively.

We will use the sparse version of the algorithm - sPLS-DA. First of all, matrix of predictors and vector of responses should be defined

X <- t(mtx.normalized)
Y <- as.factor(Y)

Tune sPLS-DA parameters using Leave-One-Out cross validation

list.keepX <- seq(1, 100, 2)
tune.splsda <- tune.splsda(X, Y, ncomp = 2, validation = 'loo', folds = 4, 
                           progressBar = FALSE, dist = 'max.dist',
                           test.keepX = list.keepX, nrepeat = 1)

Run sPLS-DA with optimized parameters

splsda.model <- splsda(X, Y, ncomp = 2, keepX = tune.splsda$choice.keepX)
plotIndiv(splsda.model, ind.names = FALSE, legend=TRUE, ellipse = TRUE)

A vector of feature contributions can be retrieved from the model in the following way

plsda.contributions <- selectVar(splsda.model, comp = 1)$value

Statistical analysis

Univariate methods (e.g. t-test, Wilcoxon rank sum test, ANOVA) are especially useful for detecting differences in concentration between samples on the level of single molecules.

As an illustrative example, we will apply Wilcoxon rank sum test (WRST) to the task of comparing Group1 and Group1 lipid abundances. In contrast to t-test, this test doesn’t assume that the data has a normal distribution.

stats <- apply(mtx.normalized, 1, function (x) wilcox.test(x[1:2], x[3:4])$p.value)

Given the small sample size of groups studied (two samples per group) the minimal raw p-value we can obtain with WRST is 0.33.

min(stats)
## [1] 0.3333333

The one more important step when dealing with statistical testing of hundreds or thousands features is a multiple testing correction procedure. Setting a significance level to 0.05 one can found a 5% of features to be significant even they are not. So, if the 5000 lipids are tested, 250 lipids could be significant by chance. To avoid having large number of false positive results, raw p-value can be adjust using multiply testing correction methods (e.g. Bonferroni, Holm or FDR corrections).

p.val <- p.adjust(stats, method = 'fdr') 

The statistical testing often accompanied by Fold Change (FC) calculation. FC shows the magnitude and direction of change in lipid concentration between group studied.

lfc <- log2(rowMeans(mtx.normalized[,c(1,2)])) - log2(rowMeans(mtx.normalized[,c(3,4)]))

The code below creates a table with columns describing the mean concentrations across samples and log2(Fold Change) values obtained above. We will also specify the 2-fold FC cutoff for up- and downregulated features.

mean.abundance <- (log2(rowMeans(mtx.normalized[,c(1,2)])) + log2(rowMeans(mtx.normalized[,c(3,4)])))/2 
ma.data <- as.data.frame(mean.abundance)
ma.data$lfc <- lfc
colnames(ma.data) <- c('MeanAbundance', 'Lfc')
ma.data$Sig <- 'NS'
ma.data$Sig[ma.data$Lfc > 1] <- 'Up'
ma.data$Sig[ma.data$Lfc < -1] <- 'Down'

Explore the number of up- and downregulated features:

table(ma.data$Sig)
## 
## Down   NS   Up 
## 1288 2456  502

Now the differences in lipid abundance between Group1 and Group2 can be visualize using MA plot, where x-axis represents average abundance level and y-axis represents log2(Fold Change) values.

p <- ggplot(ma.data, aes(x = MeanAbundance, y = Lfc, color = Sig)) + 
  geom_point() + 
  geom_hline(yintercept = c(-1, 1), linetype = "dashed") +
  scale_colour_manual(values=c("#1465AC", "darkgray", "#B31B21"))+
  xlab('log2(Mean normalized abundance)') + ylab('log2(Fold Change)') +
  theme_light()

p

Software used

sessionInfo()
## R version 4.0.3 (2020-10-10)
## Platform: x86_64-apple-darwin17.0 (64-bit)
## Running under: macOS Catalina 10.15.7
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRblas.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/4.0/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] ru_RU.UTF-8/ru_RU.UTF-8/ru_RU.UTF-8/C/ru_RU.UTF-8/ru_RU.UTF-8
## 
## attached base packages:
## [1] stats4    parallel  stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] SummarizedExperiment_1.20.0 GenomicRanges_1.42.0       
##  [3] GenomeInfoDb_1.26.7         IRanges_2.24.1             
##  [5] MatrixGenerics_1.2.1        matrixStats_0.60.1         
##  [7] gridExtra_2.3               reshape2_1.4.4             
##  [9] missForest_1.4              itertools_0.1-3            
## [11] iterators_1.0.13            foreach_1.5.1              
## [13] randomForest_4.6-14         dplyr_1.0.7                
## [15] mixOmics_6.14.1             lattice_0.20-44            
## [17] MASS_7.3-54                 IPO_1.16.0                 
## [19] CAMERA_1.46.0               rsm_2.10.2                 
## [21] DT_0.18                     ggplot2_3.3.5              
## [23] xcms_3.12.0                 MSnbase_2.16.1             
## [25] ProtGenerics_1.22.0         S4Vectors_0.28.1           
## [27] mzR_2.24.1                  Rcpp_1.0.7                 
## [29] BiocParallel_1.24.1         Biobase_2.50.0             
## [31] BiocGenerics_0.36.1        
## 
## loaded via a namespace (and not attached):
##  [1] colorspace_2.0-2       ellipsis_0.3.2         htmlTable_2.2.1       
##  [4] corpcor_1.6.9          XVector_0.30.0         base64enc_0.1-3       
##  [7] rstudioapi_0.13        farver_2.1.0           affyio_1.60.0         
## [10] ggrepel_0.9.1          RSpectra_0.16-0        fansi_0.5.0           
## [13] codetools_0.2-18       splines_4.0.3          ncdf4_1.17            
## [16] doParallel_1.0.16      impute_1.64.0          robustbase_0.93-8     
## [19] knitr_1.33             Formula_1.2-4          cluster_2.1.2         
## [22] vsn_3.58.0             png_0.1-7              graph_1.68.0          
## [25] BiocManager_1.30.16    compiler_4.0.3         backports_1.2.1       
## [28] assertthat_0.2.1       Matrix_1.3-4           limma_3.46.0          
## [31] htmltools_0.5.1.1      tools_4.0.3            igraph_1.2.6          
## [34] gtable_0.3.0           glue_1.4.2             GenomeInfoDbData_1.2.4
## [37] affy_1.68.0            RANN_2.6.1             MALDIquant_1.20       
## [40] vctrs_0.3.8            preprocessCore_1.52.1  xfun_0.25             
## [43] stringr_1.4.0          lifecycle_1.0.0        XML_3.99-0.7          
## [46] DEoptimR_1.0-9         zlibbioc_1.36.0        scales_1.1.1          
## [49] pcaMethods_1.82.0      RBGL_1.66.0            MassSpecWavelet_1.56.0
## [52] RColorBrewer_1.1-2     yaml_2.2.1             rpart_4.1-15          
## [55] latticeExtra_0.6-29    stringi_1.7.3          highr_0.9             
## [58] checkmate_2.0.0        rlang_0.4.11           pkgconfig_2.0.3       
## [61] bitops_1.0-7           mzID_1.28.0            evaluate_0.14         
## [64] purrr_0.3.4            labeling_0.4.2         htmlwidgets_1.5.3     
## [67] tidyselect_1.1.1       plyr_1.8.6             magrittr_2.0.1        
## [70] R6_2.5.1               generics_0.1.0         Hmisc_4.5-0           
## [73] DelayedArray_0.16.3    DBI_1.1.1              pillar_1.6.2          
## [76] foreign_0.8-81         withr_2.4.2            MsCoreUtils_1.2.0     
## [79] survival_3.2-13        RCurl_1.98-1.4         nnet_7.3-16           
## [82] tibble_3.1.3           crayon_1.4.1           rARPACK_0.11-0        
## [85] utf8_1.2.2             ellipse_0.4.2          rmarkdown_2.10        
## [88] jpeg_0.1-9             grid_4.0.3             data.table_1.14.0     
## [91] digest_0.6.27          tidyr_1.1.3            munsell_0.5.0