preprocessing-standards.Rmd

---
title: "Preprocessing of the *standards* data set"
author: "Andrea Vicini, Johannes Rainer"
output:
  rmarkdown::html_document:
    highlight: pygments
    toc: true
    toc_depth: 3
    fig_width: 5
---

```{r style, echo = FALSE, results = 'asis', message = FALSE}
library(BiocStyle)
knitr::opts_chunk$set(echo = TRUE, message = FALSE)
```

**Last modified:** `r file.info("preprocessing-standards.Rmd")$mtime`<br />
**Compiled**: `r date()`

```{r}
mix <- 1
mix_name <- paste0("Mix", ifelse(mix < 10, paste0(0, mix), mix)) 
```

```{r general-settings, echo = FALSE}
IMAGE_PATH <- paste0("images/preprocessing-standards/", tolower(mix_name),"/")
RDATA_PATH <- paste0("data/RData/preprocessing-standards/", tolower(mix_name), "/")
dir.create(IMAGE_PATH, showWarnings = FALSE, recursive = TRUE)
dir.create(RDATA_PATH, showWarnings = FALSE, recursive = TRUE)

## Define the *base* path where mzML files can be found. This is
## /data/massspec/mzML/ on the cluster
MZML_PATH <- "~/mix01/"
## MZML_PATH <- "/data/massspec/mzML/"

library(BiocParallel)
#' Set up parallel processing using 2 cores
if (.Platform$OS.type == "unix") {
    register(bpstart(MulticoreParam(2)))
} else {
    register(bpstart(SnowParam(2)))
}
```

# Introduction

In this document we perform the preprocessing and analysis of mzML files to
determine retention times and measured ions for the *standards*. These standards
are a collection of ~ 250 pure standards of polar metabolites which were used to
setup the HILIC-based LC-MS protocols to measure the polar metabolome in human
serum samples.


# Data preprocessing

In this section we perform the preprocessing of the LC-MS(/MS) data for all
mzML files of one mix of standards.

Below we load all required libraries and the definition of the dataset.

```{r libraries, message = FALSE, warning = FALSE}
library(xcms)
library(pander)
library(RColorBrewer)
library(magrittr)
library(Rdisop)
library(MetaboCoreUtils)
library(MsCoreUtils)
library(MsFeatures)
library(SummarizedExperiment)
library(MetaboAnnotation)
library(DBI)
library(dplyr)

std_serum_files <- read.table("data/std_serum_files.txt", header = TRUE)
```

We next subset the data set to files with samples from a single standards mix.

```{r}
std_serum_files01 <- std_serum_files[which(std_serum_files$type == mix_name), ]
```

The list of standards that constitute the present sample mix are listed below
along with the expected retention time and the most abundant adduct for positive
and negative polarity as defined in a previous analysis.

```{r, echo = FALSE, results = "asis"}
std_dilution <- read.table("data/standards_dilution.txt",
                           sep= "\t", header = TRUE)
std_dilution$exact_mass <- vapply(
    std_dilution$formula,
    function(z) getMolecule(z)$exactmass, numeric(1))

std_dilution01 <- std_dilution[std_dilution$mix == mix, ]
table1 <- std_dilution01[, c("name", "formula", "RT", "POS", "NEG")]
pandoc.table(table1, style = "rmarkdown", split.tables = Inf,
             caption = paste0("Standards of ", mix_name))
```

We next import the MS data from all mzML files of this mix.

```{r load-data, warning = FALSE, eval = !file.exists(paste0(RDATA_PATH, "data.RData"))}
fls <- paste0(MZML_PATH, std_serum_files01$folder, "/", std_serum_files01$mzML)
data <- readMSData(fls, pdata = new("NAnnotatedDataFrame", std_serum_files01),
                   mode = "onDisk")
save(data, file = paste0(RDATA_PATH, "data.RData"))
```


```{r load-data-cached, echo = FALSE, eval = file.exists(paste0(RDATA_PATH, "data.RData"))}
load(paste0(RDATA_PATH, "data.RData"))
```

We next create the base peak chromatogram (BPC) for each file and define colors
for the different sets of samples.

```{r}
bpc <- chromatogram(data, aggregationFun = "max")

#' Define a new variable combining the polarity and the matrix in which the
#' standards are solved.
tmp <- rep("Water", length(fileNames(data)))
tmp[grep("^QC", data$class)] <- "Serum"
matrix_pol <- paste0(tmp, "_", data$polarity)
data$matrix_pol <- matrix_pol

#' Define the colors for each type
col_matrix_pol <- brewer.pal(5, "Set1")[c(1, 2, 4, 5)]
names(col_matrix_pol) <- c("Water_POS", "Water_NEG", "Serum_NEG", "Serum_POS")
```

The BPC are shown below.

```{r bpc-mix01, fig.path = IMAGE_PATH, fig.cap = "Base peak chromatogram. Different colours for polarity and matrix. The vertical grey dotted lines indicate the retention time where an ion of a standard present in the current mix is expected.", fig.width = 10, fig.height = 5}
plot(bpc, col = paste0(col_matrix_pol[data$matrix_pol], 80))
legend("topright", col = col_matrix_pol,
       legend = names(col_matrix_pol), lwd = 1)
abline(v = std_dilution01$RT, lty = 3, col = "grey")
```

As expected, the signal in serum samples (orange and purple) seems to be higher
for the retention time regions in which a high number of compounds is expected
(20-60, 110-200 seconds). Surprisingly, also the samples with the mix solved in
water yield high signals for positive polarity (in the region from 60-120
seconds, but also around 160 seconds). In general, samples measured in positive
polarity seem to have higher intensities than the corresponding samples measured
in negative polarity.

Next we perform the chromatographic peak detection on all samples and
subsequently *refine* also the chromatographic peaks.

```{r peakdetection, eval = !file.exists(paste0(RDATA_PATH, "data_peakdetection.RData"))}
cwp <- CentWaveParam(ppm = 50,
                     peakwidth = c(2, 20),
                     snthresh = 5,
                     mzdiff = 0.001,
                     prefilter = c(4, 300),
                     noise = 100,
                     integrate = 2)

data <- findChromPeaks(data, param = cwp) 
save(data, file = paste0(RDATA_PATH, "data_peakdetection.RData"))
```

```{r peakdetection-cached, echo = FALSE, eval = file.exists(paste0(RDATA_PATH, "data_peakdetection.RData"))}
load(paste0(RDATA_PATH, "data_peakdetection.RData"))
```

```{r refinement, message = FALSE, warning = FALSE, eval = !file.exists(paste0(RDATA_PATH, "data_refined.RData"))}
mnp <- MergeNeighboringPeaksParam(expandRt = 2.5, expandMz = 0.001,
                                  minProp = 3/4)
data <- refineChromPeaks(data, param = mnp)
save(data, file = paste0(RDATA_PATH, "data_refined.RData"))
```

```{r refinement-cached, echo = FALSE, eval = file.exists(paste0(RDATA_PATH, "data_refined.RData"))}
load(paste0(RDATA_PATH, "data_refined.RData"))
```

We next evaluate the signal for Xanthine. For this we first calculate first its
monoisotopic mass based on the compound's chemical formula and then the m/z of
its [M+H]+ ion. 

```{r xanthine-mz}
frml_xan <- std_dilution01[std_dilution01$name == "Xanthine", "formula"]
mass_xan <- getMolecule(frml_xan)$exactmass    # calculate mass
mz_xan <- mass2mz(mass_xan, "[M+H]+")[1, 1]    # get the m/z for [M+H]+
mzr_xan <- mz_xan + c(1, -1) * ppm(mz_xan, 50) # calculate m/z range
```


```{r Xanthine-detected_peaks, fig.path = IMAGE_PATH, fig.cap = "Detected peaks for Xanthine.", fig.width = 7, fig.height = 5}
chr <- chromatogram(filterFile(data, file = which(data$polarity == "POS")), 
                    rt = c(130, 150), mz = mzr_xan)
col_samples <- col_matrix_pol[chr$matrix_pol]
plot(chr, col = col_samples, 
     peakBg = paste0(col_samples[chromPeaks(chr)[, "sample"]], 40))
legend("topright", col = col_matrix_pol,
       legend = names(col_matrix_pol), lwd = 1, cex = 1)
# The last two peaks have not been detected
```

We can clearly see a difference in signal for the different samples in this
experiment.

The number of detected chromatographic peaks per file is shown below.

```{r, echo = FALSE, results = "asis"}
nppf <- data.frame(file = data$mzML,
                   peak_count = as.numeric(table(chromPeaks(data)[, "sample"])))
nppf$matrix_pol <- data$matrix_pol
pandoc.table(nppf, style = "rmarkdown", split.tables = Inf,
             caption = "Number of peaks per file")
```

A larger number of peaks is detected in samples with a high concentration of the
internal standards. As can be expected, the number of peaks is higher for files
with the internal standards added to a QC samples compared to the files with
internal standards in water.

We split the data according to matrix and polarity.

```{r}
data_WP <- filterFile(data, file = which(data$matrix_pol == "Water_POS"))
data_WN <- filterFile(data, file = which(data$matrix_pol == "Water_NEG"))
data_SP <- filterFile(data, file = which(data$matrix_pol == "Serum_POS"))
data_SN <- filterFile(data, file = which(data$matrix_pol == "Serum_NEG"))
```

Next we perform a correspondence analysis to group chromatographic peaks across
samples. After that we fill-in missing peak data.

```{r correspondence, echo = FALSE, eval = !file.exists(paste0(RDATA_PATH, "data_postcorrespondence.RData"))}
pdp <- PeakDensityParam(sampleGroups = data_WP$matrix_pol, bw = 1.8,
                        minFraction = 0.7, binSize = 0.02)
data_WP <- groupChromPeaks(data_WP, param = pdp)
pdp <- PeakDensityParam(sampleGroups = data_WN$matrix_pol, bw = 1.8,
                        minFraction = 0.7, binSize = 0.02)
data_WN <- groupChromPeaks(data_WN, param = pdp)
pdp <- PeakDensityParam(sampleGroups = data_SP$matrix_pol, bw = 1.8,
                        minFraction = 0.7, binSize = 0.02)
data_SP <- groupChromPeaks(data_SP, param = pdp)
pdp <- PeakDensityParam(sampleGroups = data_SN$matrix_pol, bw = 1.8,
                        minFraction = 0.7, binSize = 0.02)
data_SN <- groupChromPeaks(data_SN, param = pdp)

## Gap-filling
data_WP <- fillChromPeaks(data_WP, param = ChromPeakAreaParam())
data_WN <- fillChromPeaks(data_WN, param = ChromPeakAreaParam())
data_SP <- fillChromPeaks(data_SP, param = ChromPeakAreaParam())
data_SN <- fillChromPeaks(data_SN, param = ChromPeakAreaParam())

save(data_WP, data_WN, data_SP, data_SN, 
     file = paste0(RDATA_PATH, "data_postcorrespondence.RData"))
```

```{r correspondence-cached, echo = FALSE, eval = file.exists(paste0(RDATA_PATH, "data_postcorrespondence.RData"))}
load(paste0(RDATA_PATH, "data_postcorrespondence.RData"))
```

We further subset according the data mode: MS1 data only and MS2 (i.e. data 
generated by LC-MS/MS).

```{r}
data_ms2_WP <- filterFile(data_WP, file = which(data_WP$mode != "FS"), 
                          keepFeatures = TRUE)
data_ms2_WN <- filterFile(data_WN, file = which(data_WN$mode != "FS"), 
                          keepFeatures = TRUE)
data_ms2_SP <- filterFile(data_SP, file = which(data_SP$mode != "FS"), 
                          keepFeatures = TRUE)
data_ms2_SN <- filterFile(data_SN, file = which(data_SN$mode != "FS"), 
                          keepFeatures = TRUE)
data_WP <- filterFile(data_WP, file = which(data_WP$mode == "FS"), 
                      keepFeatures = TRUE)
data_WN <- filterFile(data_WN, file = which(data_WN$mode == "FS"), 
                      keepFeatures = TRUE)
data_SP <- filterFile(data_SP, file = which(data_SP$mode == "FS"), 
                      keepFeatures = TRUE)
data_SN <- filterFile(data_SN, file = which(data_SN$mode == "FS"), 
                      keepFeatures = TRUE)
```

```{r}
## fmat_WP_f <- featureValues(data_WP, value = "into", method = "sum")
## fmat_WN_f <- featureValues(data_WN, value = "into", method = "sum")
## fmat_SP_f <- featureValues(data_SP, value = "into", method = "sum")
## fmat_SN_f <- featureValues(data_SN, value = "into", method = "sum")

## #' Percentage of missing values after filling
## sum(is.na(fmat_WP_f)) / length(fmat_WP_f) # positive polarity
## sum(is.na(fmat_WN_f)) / length(fmat_WN_f) # negative polarity
## sum(is.na(fmat_SP_f)) / length(fmat_SP_f) # positive polarity
## sum(is.na(fmat_SN_f)) / length(fmat_SN_f) # negative polarity
```

## Standards solved in pure water
      
```{r, echo = FALSE}
# This function returns a data frame of features with significant difference 
# betweenn low and high samples ordered according to rtmed
get_sign_f <- function(data_subs, drawplot = TRUE) {
  fmat <- featureValues(data_subs, value = "into", method = "sum", filled = FALSE)
  fmat_f <- featureValues(data_subs, value = "into", method = "sum", filled = TRUE)
  fD <- featureDefinitions(data_subs)
  keep_features <- rowSums(!is.na(fmat)) >= 2
  fmat_l2 <- log2(na.omit(fmat_f[keep_features,]))
  
  # H0: high==low vs H1: H0^c
  high <- grep("High", colnames(fmat_l2)) #c(1,3,6)
  low <- grep("Low", colnames(fmat_l2)) #c(2,4,5)
  
  pvalues <- apply(fmat_l2, 1, function(x) {
  res <- t.test(x[high], x[low], mu = 0)
  c(pvalue = res$p.value,
    M = unname(res$estimate[1] - res$estimate[2]))
  })
  pvalues <- t(pvalues)
  
  if(drawplot)
  {
    plot(pvalues[, "M"], -log10(pvalues[, "pvalue"]), pch = 21, col = "#00000080",
      bg = "#00000020", xlab = "M", ylab = expression(-log[10]~(p)))
  }
  
  sign_f_id <- rownames(fmat_l2)[which(pvalues[, "pvalue"] < 0.05 & 
                                       pvalues[, "M"] > 3)]
  sign_f <- cbind(sign_f_id, fD[sign_f_id, c("mzmed", "rtmed")], 
                pvalues[sign_f_id, ])
  sign_f <- sign_f[order(sign_f[, "rtmed" ], decreasing = FALSE), ]
  sign_f
}
```

```{r volcano-water-positive, fig.path = IMAGE_PATH, fig.width = 6, fig.height = 6, fig.cap = "Volcano plot for differential abundance of features in water, positive mode."}
sign_f_WP <- get_sign_f(data_WP)
```

```{r ,results = "asis", echo = FALSE}
pandoc.table(as.data.frame(sign_f_WP), style = "rmarkdown", split.table = Inf, 
             caption = "Significant features for Water in positive polarity")
```

```{r volcano-water-negative, fig.path = IMAGE_PATH, fig.width = 6, fig.height = 6, fig.cap = "Volcano plot for differential abundance of features in water, negative mode."}
sign_f_WN <- get_sign_f(data_WN)
```

```{r, results = "asis", echo = FALSE}
pandoc.table(data.frame(sign_f_WN), style = "rmarkdown",  split.table = Inf,
             caption = "Significant features for Water in negative polarity")
```


```{r plotfeatures, echo = FALSE}
# Function (to avoid code duplication) to plot all the features and save 
# their plot as a png files in the  selectedfeatures folder. 
# I have to adjust this 
plotfeatures <- function(features, data_subs, mp)
{
  if(!file.exists(paste0(RDATA_PATH, "f_chrs_", mp, ".RData")))
  {
    f_chrs <- featureChromatograms(data_subs, expandRt = 2, 
                                   features = features, filled = TRUE)
    save(f_chrs, file = paste0(RDATA_PATH, "f_chrs_", mp, ".RData"))
  }
  else
    load(paste0(RDATA_PATH, "f_chrs_", mp, ".RData"))
  
  dir.create(paste0(IMAGE_PATH, "selectedfeatures", mp),
             showWarnings = FALSE, recursive = TRUE)
  col_hl <- rep(brewer.pal(5, "Set1")[3], ncol(f_chrs)) # green
  low <- grep("Low", f_chrs$class)
  col_hl[low] <- brewer.pal(5, "Set1")[4] # purple
  for (i in seq_along(features)) {
    png(filename = paste0(IMAGE_PATH, "selectedfeatures",
                          mp, "/", features[i], ".png"), 
        width = 10, height = 6, res = 300, units = "cm", pointsize = 4)
    chr_obj <- f_chrs[i, ]
    plot(chr_obj, col = col_hl, 
         peakBg = paste0(col_hl[chromPeaks(chr_obj)[, "sample"]], 40))
    legend("topright", col = brewer.pal(5, "Set1")[c(3, 4)],
           legend = c("high", "low"), lwd = 1, cex = 1)
    abline(v = std_dilution01$RT, lty = 3)
    dev.off()
  }
}
```

```{r, message = FALSE, echo = FALSE}
plot_again <- FALSE
```

```{r plotfeatures-water, eval = plot_again, echo = plot_again}
plotfeatures(sign_f_WP$sign_f_id, data_WP, "Water_POS")
plotfeatures(sign_f_WN$sign_f_id, data_WN, "Water_NEG")
```

### Annotating significant features using MS2 data

Next we extract for all significant features potentially measured MS2 spectra
and match them against MassBank in order to annotate them. We first extract 
spectra from MassBank database.

```{r}
library(RMariaDB)
library(MsBackendMassbank)

co <- dbConnect(MariaDB(), user = "avwork", dbname = "MassBank",
                 host = "localhost", pass = "massbank")
mbank <- Spectra(co, source = MsBackendMassbankSql())
mbank <- setBackend(mbank, MsBackendDataFrame()) # breaks with parallel processing
```

The following function (to avoid code repetition) takes the ids of significant 
features `sign_f_id` check which ones are also available for the given MS2 data
`data_ms2`. For these it extracts the corresponding spectra and matches them 
against a reference database `mbank`. It returns a vector. Each entry contains 
for a given feature the compound names of the spectra in `mbank` that were
matched to the spectra related to the feature. 

```{r}
sign_f_ms2_mtch <- function(data_ms2, sign_f_id, mbank, case, recompute = FALSE)
{
  ids <- intersect(rownames(featureDefinitions(data_ms2)), sign_f_id)
  sign_f_ms2 <- featureSpectra(data_ms2, msLevel = 2L,
                               return.type = "Spectra", ppm = 10,
                               features = ids)
  fn <- paste0(RDATA_PATH, "comp_", case, "_ms2_MassBank.RData")
  if(!file.exists(fn) | recompute)
  {
    mtches <- matchSpectra(sign_f_ms2, mbank,
                         param = CompareSpectraParam(requirePrecursor = FALSE,
                                                     ppm = 10))
    save(mtches, file = fn)
  }
  else load(fn)
  mtches <- mtches[whichQuery(mtches)]
  tmp <- aggregate(mtches$target_compound_name,
                   by = list(feature = mtches$feature_id),
                   function(x) paste0(unique(x), collapse = ";"))
  #tmp[tmp[, 2] == ""] <- NA
  res <- rep(NA, length(sign_f_id))
  res[match(tmp[, 1], sign_f_id)] <- tmp[, 2]
  res
}
```


```{r}
sign_f_WP$compound_names <- sign_f_ms2_mtch(data_ms2_WP, sign_f_WP$sign_f_id,
                                            mbank, "WP")
sign_f_WN$compound_names <- sign_f_ms2_mtch(data_ms2_WN, sign_f_WN$sign_f_id, 
                                            mbank, "WN")
```

### Grouping of features

```{r}
plot(sign_f_WP$rtmed, sign_f_WP$mzmed,
     xlab = "retention time", ylab = "m/z", main = "features",
     col = "#00000060", pch = 16)
abline(v= table1$RT, lty = 2)
grid()
```

Some features overlap the expected retention times of the standards but for many
of them no significant feature was found.

```{r}
library(MsFeatures)
featureGroups(data_WP) <- NA_character_
featureDefinitions(data_WP)[sign_f_WP$sign_f_id, "feature_group"] <- "FG"

```

Grouping of features by similar retention time.

```{r, results = "asis"}
data_WP <- groupFeatures(data_WP, param = SimilarRtimeParam(diffRt = 4))
sign_f_WP$feature_group <-
    featureDefinitions(data_WP)[sign_f_WP$sign_f_id, "feature_group"]
sign_f_WP <- sign_f_WP[order(sign_f_WP$feature_group, sign_f_WP$mzmed), ]
pandoc.table(as.data.frame(sign_f_WP), style = "rmarkdown", split.tables = Inf)
```

```{r}
library(pheatmap)
fvals <- log2(featureValues(data_WP, filled = TRUE, 
                            method = "sum")[sign_f_WP$sign_f_id, ])
cormat <- cor(t(fvals), use = "pairwise.complete.obs")
ann <- data.frame(fgroup = featureDefinitions(data_WP)[sign_f_WP$sign_f_id, 
                                                         "feature_group"])
rownames(ann) <- rownames(cormat)
res <- pheatmap(cormat, annotation_row = ann, cluster_rows = TRUE,
                cluster_cols = TRUE)
```

Grouping of features by abundance correlation across samples

```{r, results = "asis"}
data_WP <- groupFeatures(data_WP, 
                         AbundanceSimilarityParam(threshold = 0.5, 
                                                  transform = log2), 
                         filled = TRUE, method = "sum")
sign_f_WP$feature_group <-
    featureDefinitions(data_WP)[sign_f_WP$sign_f_id, "feature_group"]
sign_f_WP <- sign_f_WP[order(sign_f_WP$feature_group, sign_f_WP$mzmed), ]
pandoc.table(as.data.frame(sign_f_WP), style = "rmarkdown", split.tables = Inf)
```

```{r}
plotFeatureGroups(data_WP)
grid()
```

Grouping of features by EIC correlation

```{r EIC-grouping, echo = FALSE, eval = !file.exists(paste0(RDATA_PATH, "data_post_groupingWater_POS.RData"))}
data_WP <- groupFeatures(data_WP, EicSimilarityParam(threshold = 0.7, n = 2))
save(data_WP, file = paste0(RDATA_PATH, "data_post_groupingWater_POS.RData"))
```

```{r grouping-cached, echo = FALSE, eval = file.exists(paste0(RDATA_PATH, "data_post_groupingWater_POS.RData"))}
load(paste0(RDATA_PATH, "data_post_groupingWater_POS.RData"))
```

```{r, echo = FALSE, results = "asis"}
sign_f_WP$feature_group <-
    featureDefinitions(data_WP)[sign_f_WP$sign_f_id, "feature_group"]
sign_f_WP <- sign_f_WP[order(sign_f_WP$feature_group, sign_f_WP$mzmed), ]
pandoc.table(as.data.frame(sign_f_WP), style = "rmarkdown", split.tables = Inf)
```

We next create plots for each feature group. These are saved to a folder and not
displayed here.

```{r}
# Plots the features in `data_subs` that have id in `features_id` and are 
# grouped as specified by `groups`
plot_groups_of_features <- function(features_id, groups, data_subs, dr)
{
  fgs <- unique(groups)
  dir.create(dr, showWarnings = FALSE, recursive = TRUE)
  for (fg in fgs) {
      fts_idx <- which(featureGroups(data_subs) == fg)
      eics <- featureChromatograms(data_subs, features = fts_idx,
                                   expandRt = 2, filled = TRUE, n = 2)
      png(paste0(dr, fg, ".png"), width = 12, height = 6, units = "cm",
          res = 300, pointsize = 4)
      par(mfrow = c(2, 2))
      ft_cols <- brewer.pal(max(3, length(fts_idx)), "Paired")[seq_along(fts_idx)]
      plotChromatogramsOverlay(normalize(eics[, 1, drop = FALSE]), lwd = 2,
                               col = paste0(ft_cols, 80),
                               main = paste0(fg, ": ", colnames(eics)[1]))
      legend("topright",
             legend = rownames(featureDefinitions(data_subs))[fts_idx],
             col = ft_cols, lty = 1)
      plotChromatogramsOverlay(normalize(eics[, 2, drop = FALSE]), lwd = 2,
                               col = paste0(ft_cols, 80),
                               main = paste0(fg, ": ", colnames(eics)[2]))
      legend("topright",
             legend = rownames(featureDefinitions(data_subs))[fts_idx],
             col = ft_cols, lty = 1)
      if (nrow(eics) > 1) {
          mz_y <- plotChromatogramsOverlay(eics[, 1, drop = FALSE],
                                           peakType = "none",
                                           col = ft_cols, stacked = 0.5)[[1L]]
          ## mz_y <- joyPlot(eics[, 1, drop = FALSE], col = ft_cols)
          abline(h = mz_y, col = paste0(ft_cols, 60), lty = 3)
          text(x = rep(min(unlist(lapply(eics, rtime))), length(mz_y)),
               y = mz_y, labels = format(mz(eics)[, 1], digits = 6),
               col = ft_cols, pos = 4)
          mz_y <- plotChromatogramsOverlay(eics[, 2, drop = FALSE],
                                           peakType = "none",
                                           col = ft_cols, stacked = 0.5)[[1L]]
          abline(h = mz_y, col = paste0(ft_cols, 40), lty = 3)
          text(x = rep(min(unlist(lapply(eics, rtime))), length(mz_y)),
               y = mz_y, labels = format(mz(eics)[, 1], digits = 6),
               col = ft_cols, pos = 4)
      }
      dev.off()
  }
}

```

```{r feature_groups_plots_WP, echo = FALSE, eval = plot_again}
## order feature groups by retention time and plot each.
sign_f_WP <- sign_f_WP[order(sign_f_WP$rtmed), ]
plot_groups_of_features(sign_f_WP$sign_f_id, sign_f_WP$feature_group,
                        data_WP, paste0(IMAGE_PATH, "feature_groups_WP/"))
```

Feature grouping in water and negative polarity

```{r, eval = !file.exists(paste0(RDATA_PATH, "data_post_groupingWater_NEG.RData"))}
featureGroups(data_WN) <- NA_character_
featureDefinitions(data_WN)[sign_f_WN$sign_f_id, "feature_group"] <- "FG"
data_WN <- groupFeatures(data_WN, param = SimilarRtimeParam(diffRt = 4))
data_WN <- groupFeatures(data_WN, 
                           AbundanceSimilarityParam(threshold = 0.5, 
                                                    transform = log2), 
                           filled = TRUE, method = "sum")
data_WN <- groupFeatures(data_WN, EicSimilarityParam(threshold = 0.7, n = 2))
save(data_WN, file = paste0(RDATA_PATH, "data_post_groupingWater_NEG.RData"))
```
```{r, eval = file.exists(paste0(RDATA_PATH, "data_post_groupingWater_NEG.RData"))}
load(file = paste0(RDATA_PATH, "data_post_groupingWater_NEG.RData"))
```

```{r, echo = FALSE, results = "asis"}
sign_f_WN$feature_group <-
    featureDefinitions(data_WN)[sign_f_WN$sign_f_id, "feature_group"]
sign_f_WN <- sign_f_WN[order(sign_f_WN$feature_group, sign_f_WN$mzmed), ]
pandoc.table(as.data.frame(sign_f_WN), style = "rmarkdown", split.tables = Inf)
```

```{r feature_groups_plots_WN, echo = FALSE, eval = plot_again}
## order feature groups by retention time and plot each.
sign_f_WN <- sign_f_WN[order(sign_f_WN$rtmed), ]
plot_groups_of_features(sign_f_WN$sign_f_id, sign_f_WN$feature_group,
                        data_WN, paste0(IMAGE_PATH, "feature_groups_WN/"))
```


### Mapping features to standards

Next we create a table that maps features to compounds' adducts. This table will
have one entry for each map between one feature and one adduct and this mapping
can thus be n:m.
To avoid code repetition we define the following function that matches the 
features in `sign_f` to possible adducts of the compounds in `info_stds` with
polarity specified by `pol`

```{r}
get_matchings <- function(info_stds, sign_f, pol, ppm = 20)
{
  #Compute exact masses of standars
  frmls <- info_stds[, "formula"]
  masses <- vapply(frmls, function(formula) getMolecule(formula)$exactmass,
                   numeric(1))
  
  #Compute m/z of possible adducts of the standards
  mz_adducts <- mass2mz(
    masses, adductNames(ifelse(pol == "POS", "positive", "negative")))
  rownames(info_stds) <- info_stds$name
  rownames(mz_adducts) <- rownames(info_stds)
  idx <- order(mz_adducts[, 1])
  mz_adducts <- mz_adducts[idx, ]
  info_stds <- info_stds[idx, ]
  
  #mzmed of significant features
  mzmed_f <- sign_f$mzmed
  names(mzmed_f) <- sign_f$sign_f_id
  mzmed_f <- sort(mzmed_f)
  
  #Compute matrix of matchings
  matches_mat <- apply(mz_adducts, 2, function(col)
    closest(mzmed_f, col, tolerance = 0, ppm = ppm))
  rownames(matches_mat) <- names(mzmed_f)
  
  get_adducts <- function(ft, matches_mat, info_stds = NULL) {
    not_na <- which(!is.na(matches_mat[ft, ]))
    if (length(not_na))
      data.frame(
        feature_id = ft,
        adduct = colnames(matches_mat)[not_na], 
        info_stds[matches_mat[ft, not_na], ])
    else data.frame()
  }
  
  tmp <- lapply(sign_f$sign_f_id, function(z) {
    addcts <- get_adducts(ft = z, matches_mat, info_stds[, c("name", "formula")])
  })
  
  matchings <- do.call(rbind, tmp)
  rownames(matchings) <- NULL
  matchings
}

match_all_adducts <- function(info_stds, sign_f, pol, ppm = 20,
                              adducts = adductNames(polarity = "pos")) {
    prm <- Mass2MzParam(adducts = adducts, tolerance = 0, ppm = 10)
    mtches <- matchMz(sign_f, info_stds, param = prm, mzColname = "mzmed",
                      massColname = "exact_mass")
    tmp <- aggregate(
        as.data.frame(matchedData(mtches, c("sign_f_id",
                                            "target_name", "adduct"))),
        by = list(mtches$sign_f_id), function(z)
            paste0(unique(z), collapse = ";"))
    cbind(sign_f,
          tmp[match(sign_f$sign_f_id, tmp[, 2]), c("target_name", "adduct")])
}

```

```{r}
sign_f_WP <- match_all_adducts(std_dilution01, sign_f_WP, ppm = 10,
                               adducts = adductNames(polarity = "pos"))
sign_f_WN <- match_all_adducts(std_dilution01, sign_f_WN, ppm = 10,
                               adducts = adductNames(polarity = "neg"))

matchings_WP <- get_matchings(std_dilution01, sign_f_WP, "POS")
matchings_WN <- get_matchings(std_dilution01, sign_f_WN, "NEG")
```

LLLLL: replace the `get_matchings` with `match_all_adducts`.


TODO
	- for each standard, create overlapping EIC plots (like in code block
      *feature_group_plots*) for the assigned features.

```{r features_by_standards, echo = FALSE, eval = plot_again}
## order feature groups by retention time and plot each.
plot_groups_of_features(matchings_WP$feature_id, matchings_WP$name,
                       data_WP, paste0(IMAGE_PATH, "features_by_standards_WP/"))
```


## Standards solved in human serum samples

Features with significant difference among high and low concentration samples

```{r volcano-serum-pos, fig.path = IMAGE_PATH, fig.width = 6, fig.height = 6, fig.cap = "Volcano plot representing the significant features for the comparison of high vs low concentration in serum, positive polarity."}
sign_f_SP <- get_sign_f(data_SP)
```


```{r, results = "asis", echo = FALSE}
pandoc.table(data.frame(sign_f_SP), style = "rmarkdown", 
             caption = "Significant features for Serum, positive polarity.")
```

```{r volcano-serum-neg, fig.path = IMAGE_PATH, fig.width = 6, fig.height = 6, fig.cap = "Volcano plot representing the significant features for the comparison of high vs low concentration in serum, negative polarity."}
sign_f_SN <- get_sign_f(data_SN)
```

```{r, results = "asis", echo = FALSE}
pandoc.table(data.frame(sign_f_SN), style = "rmarkdown", 
             caption = "Significant features for Serum, negative polarity.")
```

```{r plotfeatures-serum, eval = plot_again}
plotfeatures(sign_f_SP$sign_f_id, data_SP, "Serum_POS")
plotfeatures(sign_f_SN$sign_f_id, data_SN, "Serum_NEG")
```

```{r}
a <- sign_f_ms2_mtch(data_ms2_SP, sign_f_SP$sign_f_id, mbank, "SP")
a <- sign_f_ms2_mtch(data_ms2_SN, sign_f_SN$sign_f_id, mbank, "SN")
```

```{r, echo = FALSE, eval = file.exists(paste0(RDATA_PATH, "comp_S_ms2_MassBank.RData"))}
load(file = paste0(RDATA_PATH, "comp_S_ms2_MassBank.RData"))
```


```{r}
sign_f_SP$compound_names <- sign_f_ms2_mtch(data_ms2_SP, sign_f_SP$sign_f_id, 
                                            mbank, "SP")
sign_f_SN$compound_names <- sign_f_ms2_mtch(data_ms2_SN, sign_f_SN$sign_f_id, 
                                        mbank, "SN")
dbDisconnect(co)
```

Feature grouping in serum samples with positive polarity

```{r}
plot(sign_f_SP$rtmed, sign_f_SP$mzmed,
     xlab = "retention time", ylab = "m/z", main = "features",
     col = "#00000060")
abline(v= table1$RT, lty = 2)
grid()
```

```{r, eval = !file.exists(paste0(RDATA_PATH, "data_post_groupingSerum_POS.RData"))}
featureGroups(data_SP) <- NA_character_
featureDefinitions(data_SP)[sign_f_SP$sign_f_id, "feature_group"] <- "FG"
data_SP <- groupFeatures(data_SP, param = SimilarRtimeParam(diffRt = 4))
data_SP <- groupFeatures(data_SP, 
                           AbundanceSimilarityParam(threshold = 0.5, 
                                                    transform = log2), 
                           filled = TRUE, method = "sum")
data_SP <- groupFeatures(data_SP, EicCorrelationParam(threshold = 0.7, n = 2,
                                                          clean = TRUE))
save(data_SP, file = paste0(RDATA_PATH, "data_post_groupingSerum_POS.RData"))
```

```{r, eval = file.exists(paste0(RDATA_PATH, "data_post_groupingSerum_POS.RData"))}
load(file = paste0(RDATA_PATH, "data_post_groupingSerum_POS.RData"))
```

```{r, echo = FALSE, results = "asis"}
sign_f_SP$feature_group <-
    featureDefinitions(data_SP)[sign_f_SP$sign_f_id, "feature_group"]
sign_f_SP <- sign_f_SP[order(sign_f_SP$feature_group, sign_f_SP$mzmed), ]
pandoc.table(as.data.frame(sign_f_SP), style = "rmarkdown")
```

Feature grouping in serum samples with negative polarity

```{r, eval = !file.exists(paste0(RDATA_PATH, "data_post_groupingSerum_NEG.RData"))}
featureGroups(data_SN) <- NA_character_
featureDefinitions(data_SN)[sign_f_SN$sign_f_id, "feature_group"] <- "FG"
data_SN <- groupFeatures(data_SN, param = SimilarRtimeParam(diffRt = 4))
data_SN <- groupFeatures(data_SN, 
                           AbundanceSimilarityParam(threshold = 0.5, 
                                                    transform = log2), 
                           filled = TRUE, method = "sum")
data_SN <- groupFeatures(data_SN, EicCorrelationParam(threshold = 0.7, n = 2,
                                                          clean = TRUE))
save(data_SN, file = paste0(RDATA_PATH, "data_post_groupingSerum_NEG.RData"))
```

```{r, eval = file.exists(paste0(RDATA_PATH, "data_post_groupingSerum_NEG.RData"))}
load(file = paste0(RDATA_PATH, "data_post_groupingSerum_NEG.RData"))
```

```{r, echo = FALSE, results = "asis"}
sign_f_SN$feature_group <-
    featureDefinitions(data_SN)[sign_f_SN$sign_f_id, "feature_group"]
sign_f_SN <- sign_f_SN[order(sign_f_SN$feature_group, sign_f_SN$mzmed), ]
pandoc.table(as.data.frame(sign_f_SN), style = "rmarkdown")
```

Matching the features with significant difference between low and high 
concentration samples in serum

```{r}
matchings_SP <- get_matchings(std_dilution01, sign_f_SP, "POS")
matchings_SN <- get_matchings(std_dilution01, sign_f_SN, "NEG")
```

We next store this mapping information into a SQLite database.

```{r}
compound <- cbind(table1, mix = mix, pk = paste0(table1$name, "_", mix))
## correspondence was done seprately for each data subset, thus we can have
## duplicated feature ids. Appending thus the subset and the mix information
## to generate a primary key
ft_WP <- cbind(as.data.frame(sign_f_WP),
               pk = paste0(sign_f_WP$sign_f_id, "_WP_", mix))
ft_WN <- cbind(as.data.frame(sign_f_WN),
               pk = paste0(sign_f_WN$sign_f_id, "_WN_", mix))
ft_SP <- cbind(as.data.frame(sign_f_SP),
               pk = paste0(sign_f_SP$sign_f_id, "_SP_", mix))
ft_SN <- cbind(as.data.frame(sign_f_SN),
               pk = paste0(sign_f_SN$sign_f_id, "_SN_", mix))
## Mapping between compounds and features
ft_to_comp_WP <- cbind(
    compound_pk = paste0(matchings_WP$name, "_", mix),
    feature_pk = paste0(matchings_WP$feature_id, "_WP_", mix),
    adduct = matchings_WP$adduct, matrix_pol = "Water_POS")
ft_to_comp_WN <- cbind(
    compound_pk = paste0(matchings_WN$name, "_", mix),
    feature_pk = paste0(matchings_WN$feature_id, "_WN_", mix),
    adduct = matchings_WN$adduct, matrix_pol = "Water_NEG")
ft_to_comp_SP <- cbind(
    compound_pk = paste0(matchings_SP$name, "_", mix),
    feature_pk = paste0(matchings_SP$feature_id, "_SP_", mix),
    adduct = matchings_SP$adduct, matrix_pol = "Serum_POS")
ft_to_comp_SN <- cbind(
    compound_pk = paste0(matchings_SN$name, "_", mix),
    feature_pk = paste0(matchings_SN$feature_id, "_SN_", mix),
    adduct = matchings_SN$adduct, matrix_pol = "Serum_NEG")
feature <- rbind(ft_WP, ft_WN, ft_SP, ft_SN)
feature_to_compound <- as.data.frame(rbind(ft_to_comp_WP, ft_to_comp_WN, 
                                           ft_to_comp_SP, ft_to_comp_SN))
colnames(feature)[1] <- "feature_id"
```

```{r}
library(RSQLite)
#con <- dbConnect(RSQLite::SQLite(), paste0(RDATA_PATH," database.sqlite"))
con <- dbConnect(RSQLite::SQLite(), ":memory:")
dbWriteTable(con, "feature", feature, append = TRUE)
dbWriteTable(con, "compound", compound, append = TRUE)
dbWriteTable(con, "feature_to_compound", feature_to_compound, 
             append = TRUE)
```

The table below summarizes the mapping of features to adducts of standards.

```{r, results = "asis", echo = FALSE}
stab <- dbGetQuery(
  con, paste0("select compound.name, adduct, feature.feature_id, ",
              "feature.compound_names, ",
              "feature_group, rtmed, RT, POS, NEG, matrix_pol, mix ",
              "from feature_to_compound ",
              "join compound on ",
              "(feature_to_compound.compound_pk = compound.pk) join feature on",
              " (feature_to_compound.feature_pk = feature.pk)"))
stab <- stab[order(stab$name), ]
rownames(stab) <- NULL
pandoc.table(stab, style = "rmarkdown", split.table = "Inf",
             caption = paste0("Matching of theoretical adducts of standards ",
                              "with m/z of features with a significant ",
                              "difference of abundances."))
```

For the standards below no feature was identified.

```{r, results = "asis", echo = FALSE}
pandoc.table(compound[!compound$name %in% stab$name, ],
             style = "rmarkdown", caption = "Standards without features.")
```


```{r, results = "asis", echo = FALSE}
stab2 <- dbGetQuery(
    con, paste0("select feature.feature_id, feature_group, ",
                "compound.name, adduct, rtmed, RT, matrix_pol, mix ",
                "from feature_to_compound ",
                "join feature on ",
                "(feature_to_compound.feature_pk = feature.pk) join compound on",
                " (feature_to_compound.compound_pk = compound.pk)"))
stab2 <- stab2[order(stab2$rtmed), ]
rownames(stab2) <- NULL
pandoc.table(stab2, style = "rmarkdown", split.table = "Inf",
             caption = paste0("Matching of theoretical adducts of standards ",
                              "with m/z of features with a significant ",
                              "difference of abundances."))
```

We get the MS2 spectra for each of the "significant" features

```{r}
f_ms2_WP <- rownames(featureDefinitions(data_ms2_WP))
sign_f_WP_ms2 <- intersect(f_ms2_WP, sign_f_WP$sign_f_id)
obJ <- featureSpectra(data_ms2_WP, msLevel = 2L,
                      return.type = "Spectra",
                      features = sign_f_WP_ms2)
```


# Session information

The R version and packages used in this analysis are listed below.

```{r sessioninfo}
sessionInfo()
```

# Ideas for updates

Seems we're not finding "significant" features at the expected retention
times. Maybe we should update, and definitely simplify, the workflow:

- Analysis plan (for each mix, polarity, sample matrix (water/serum)):
  - load raw data
  - peak detection, correspondence.
  - find features that have a difference in abundance > 2? (without p-value,
    let's try to be less strict and have here also potentially false positives)
  - do feature grouping.
  - `matchSpectra`: extract MS2 spectra for all significant features and match
    them against MassBank (and/or HMDB?). Add information of matching MS2
    spectra to each spectrum.
  - `matchMz`: match m/z of all significant features against the expected
    standards in the mix (or maybe also against all to ensure there are no
    sample mixups?)
  - Maybe later: for each feature group, check if their m/z and intensity would
    match isotopes?
  - Result object could be the `Matched` object with the match of features to
    standards (since it represents n:m mapping).
  - re-compile result table to have it listed by standard: each row is one
    standard with:
	- feature IDs assigned to it.
	- the feature group IDs of these.
	- the information from the MS2 matching.