-
Notifications
You must be signed in to change notification settings - Fork 0
/
large_experiment_processing.Rmd
299 lines (236 loc) · 10.4 KB
/
large_experiment_processing.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
---
title: "MSnbase, efficient and elegant R-based processing and visualisation of raw mass spectrometry data"
author:
- name: Laurent Gatto
affiliation: Computational Biology Unit, de Duve Institute, Universit\'e catholique de Louvain, Brussels, Belgium
- name: Sebastian Gibb
affiliation: Department of Anaesthesiology and Intensive Care of the University Medicine Greifswald, Germany
- name: Johannes Rainer
affiliation: Institute for Biomedicine, Eurac Research, Affiliated Institueof the University of L\"ubeck, Bolzano, Italy
output:
rticles::acs_article:
---
```{r style, echo = FALSE, results = 'asis', message=FALSE}
BiocStyle::markdown()
knitr::opts_chunk$set(echo = TRUE, warning = FALSE, message = FALSE)
```
# Introduction
This document describes handling of mass spectrometry data from large
experiments using the `MSnbase` package and more specifically its *on-disk*
backend. For demonstration purposes, the
[MassIVE](https://massive.ucsd.edu/ProteoSAFe/static/massive.jsp) data set
[MSV000080030](https://massive.ucsd.edu/ProteoSAFe/dataset.jsp?task=5e7034cc98c54a47b803b144bff6a296)
is used. This consists of over 1,000 mzXML files from swab-samples collected
from hands and various personal objects of 80 volunteers.
# Data handling and analysis with `MSnbase`
In this section we demonstrate data handling and access by `MSnbase` on a large
experiment consisting of more than 1,000 data files.
To reproduce the analysis described in this document, download the *MSV000080030* folder
from
[ftp://massive.ucsd.edu/MSV000080030/](ftp://massive.ucsd.edu/MSV000080030/) and
place it into the same folder as this document.
Below we load the required libraries and define the files to be analyzed.
```{r}
library(MSnbase)
library(magrittr)
library(pryr)
fls <- dir("MSV000080030/ccms_peak/Forensic_study_80_volunteers/",
pattern = "mzXML", full.names = TRUE, recursive = TRUE)
```
The data set consists of `r length(fls)` mzXML files. We next load the
data using the two different `MSnbase` backends `"inMemory`" and `"onDisk"`. For
the in-memory backend, due to the larger memory requirements, we import the data
only from a subset of the files.
```{r import-inmem, eval = !file.exists("ms_mem_hand.RData")}
ms_mem <- readMSData(fls[grep("Hand", fls)], mode = "inMemory")
```
```{r import-inmem-save, echo = FALSE}
if (!file.exists("ms_mem_hand.RData")) {
save(ms_mem, file = "ms_mem_hand.RData")
} else {
load("ms_mem_hand.RData")
}
```
Next we load data from all mzXML files as an on-disk `MSnExp` object.
```{r import-ondisk, eval = !file.exists("ms_dsk_all.RData")}
ms_dsk <- readMSData(fls, mode = "onDisk")
```
```{r import-ondisk-save, echo = FALSE}
if (!file.exists("ms_dsk_all.RData")) {
save(ms_dsk, file = "ms_dsk_all.RData")
} else {
load("ms_dsk_all.RData")
}
```
Below we count the number of spectra per MS level of the whole experiment.
```{r ondisk-spectra-table}
table(msLevel(ms_dsk))
```
Note that the in-memory `MSnExp` object contains only MS2 spectra (in total
`r length(ms_mem)`) from a subset of data files. However, the data import was much
slower (over ~ 12 hours for the in-memory backend while creating the on-disk
object from the full data data set took ~ 3 hours).
Next we subset the on-disk object to contain the same set of spectra as the
in-memory `MSnExp` and compare their memory footprint.
```{r memory-footprint}
ms_dsk_hands <- ms_dsk %>%
filterFile(grep("Hand", fls)) %>%
filterMsLevel(2L)
object_size(ms_mem)
object_size(ms_dsk_hands)
```
Since the on-disk object stores only spectra metadata in memory it occupies also
much less system memory. As a comparison, the on-disk `MSnExp` for the full
experiment was still much smaller than the in-memory object:
```{r memory-footprint-all}
object_size(ms_dsk)
```
## Basic MS data access functionality
Before evaluating the `MSnbase` performance on the large data set we provide
some general description of the `MSnbase` data classes and basic
data access operations. MS data from raw data files in mzML, mzXML, mzData or
netCDF format is represented by the `MSnExp` object which organizes the spectra
from the original files in an one-dimensional list. Functions like `rtime` and
`msLevel` allow to extract the retention time and MS level, respectively.
They return a `numeric` (or `integer`) vector with the same length
as the number of spectra in the `MSnExp`. In the example below we use the
`rtime` function to extract the retention times for each spectrum.
```{r rtime}
rts <- rtime(ms_dsk)
length(rts)
head(rts)
```
The `fromFile` function can be used to determine the source file (sample)
of a specific spectrum in the `MSnExp` object. This function returns an `integer`
vector, of the same length as spectra in the experiment, with the file index. The
file names can be accessed with the `fileNames` method. An `MSnExp` object can be
subsetted with `[` and e.g. the index of the spectra that should be retained. In
the code block below we subset our `ms_dsk` object to keep only spectra from
the 3rd file.
```{r subset}
one_file <- ms_dsk[fromFile(ms_dsk) == 3]
length(one_file)
```
Note that there are also dedicated *filter* functions to subset an
`MSnExp` object such as `filterFile`, `filterMsLevel`, `filterRt`, `filterMz`,
`filterPrecursorMz` or `filterIsolationWindow`. In the example below we use the
`filterRt` function to further subset our data to keep only spectra acquired
within a certain time range.
```{r filterrt}
one_file <- filterRt(one_file, rt = c(40, 300))
length(one_file)
```
As mentioned above, the `MSnExp` object is comparable with a list of
spectra. Thus, to extract a single spectrum from it we can use `[[`. This will
return an object of type `Spectrum` which encapsules/represents all information
of the measured spectrum (i.e. m/z and intensity values as well as metadata
information). In the example below we extract the 15th spectrum from our data
subset and access its m/z values with the `mz` function.
```{r}
sp <- one_file[[15]]
mz(sp)
```
This particular spectrum has only 3 peaks.
Note that m/z or intensity values can also be directly extracted from the
`MSnExp` object as shown in the example below. The result will be a `list` of
`numeric` vectors, each element representing the m/z values for each spectrum in
the object.
```{r}
mzs <- mz(one_file)
class(mzs)
length(mzs)
```
In addition, it is also possible to extract all m/z and intensity values from an
`MSnExp` object as a `data.frame` as shown in the code block below, but this is
not suggested, since it loads all the data into memory but all MS spectrum
metadata such as MS level or precursor m/z get lost.
```{r data.frame}
df <- as(one_file, "data.frame")
head(df)
nrow(df)
```
Note that for all these operations it is irrelevant whether an in-memory or
on-disk backend was used. In general it is advisable to use the on-disk backend
especially for experiments with more than ~ 50 files.
## Performance of the *on-disk* backend on large scale data sets
To demonstrate `MSnbase`'s efficiency in processing large scale experiments we
perform some standard subsetting, data access and manipulation operations.
```{r parallel-setup, echo = FALSE}
library(BiocParallel)
register(SerialParam())
```
We first compare the performance of the on-disk and in-memory backend on
accessing m/z values with the `mz` function on a set of 100 randomly selected
spectra. The performance is assessed with the `microbenchmark` function.
```{r random-mz}
set.seed(123)
idx <- sample(seq_along(ms_mem), 100)
library(microbenchmark)
microbenchmark(mz(ms_mem[idx]),
mz(ms_dsk_hands[idx]),
times = 5)
```
For this combined subsetting and data access operation the on-disk backend
performed better than the in-memory `MSnExp`, while even requiring much less
memory.
Next we extract all MS2 spectra with a retention time between 50 and 60 seconds
and a precursor m/z of 108.5362 (+/- 5ppm). This subsetting operation is
performed on the on-disk `MSnExp` object representing the full experiment with
the `r length(fileNames(ms_dsk))` data files/samples. To assess the performance
of the following operations we use `system.time` calls that record elapsed time
in seconds.
```{r precursor-selection}
system.time(
ms_sub <- ms_dsk %>%
filterMsLevel(2L) %>%
filterRt(c(50, 60)) %>%
filterPrecursorMz(mz = 108.5362, ppm = 5)
)["elapsed"]
```
In total `length(ms_sub)` spectra were selected from in total
`r length(unique(fromFile(ms_sub)))` data files/samples. The plot below shows
the data for the first spectrum.
```{r precursor-selection-plot-1, fig.cap = "Example spectrum of the data set."}
system.time(
plot(ms_sub[[1]])
)["elapsed"]
```
Since there seems to be quite some background noise in the MS2 spectrum we next
remove peaks with an intensity below 50 by first replacing their intensities
with 0 (with the `removePeaks` call) and subsequently removing all 0-intensity
peaks from each spectrum with the `clean` call. In addition we *normalize* each
spectrum by dividing the maximum intensity per spectrum from the spectrum's
intensities.
```{r normalize}
system.time(
ms_sub <- ms_sub %>%
removePeaks(t = 50) %>%
clean(all = TRUE) %>%
normalize(method = "max")
)["elapsed"]
```
The result on the first spectrum is shown below.
```{r precursor-selection-plot-normalized, fig.cap = "Example spectrum after cleaning."}
system.time(
plot(ms_sub[[1]])
)["elapsed"]
```
Note that any of the data manipulations above are not directly applied to the
data but *cached* in the object's internal *lazy processing queue* (explaining
the very short running time of the normalization call). The operations are only
effectively applied to the data when m/z or intensity values are extracted from
the object, e.g. in the `plot` call above.
For additional workflows employing `MSnbase` see also
[metabolomics2018](https://jorainer.github.io/metabolomics2018/xcms-preprocessing.html)[^1]
that explains filtering, plotting and centroiding of profile-mode MS data with
`MSnbase` and subsequent pre-processing of the (label free/untargeted) LC-MS
data with the `xcms` package (that builds upon `MSnbase` for MS data
representation and access).
## System information
The present analysis was run on a MacBook Pro 16,1 with 2.3 GHz 8-Core Intel
Core i9 CPU and 64 GB 2667 MHz DDR4 memory running macOS version 10.15.5. The R
version and the version of the used packages are listed below.
```{r session-info}
sessionInfo()
```
[^1]: https://github.com/jorainer/metabolomics2018