\documentclass[journal=jacsat,manuscript=article]{achemso}
\usepackage[]{graphicx}
\usepackage[]{color}
\usepackage[]{xcolor}
\usepackage[normalem]{ulem}
%% maxwidth is the original width if it is less than linewidth
%% otherwise use linewidth (to make sure the graphics do not exceed the margin)
\makeatletter
\def\maxwidth{ %
\ifdim\Gin@nat@width>\linewidth
\linewidth
\else
\Gin@nat@width
\fi
}
\makeatother
\usepackage{subfig}
\usepackage{graphicx}
\usepackage{alltt}
\usepackage{chemformula} % Formula subscripts using \ch{}
\usepackage[T1]{fontenc} % Use modern font encodings
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% If issues arise when submitting your manuscript, you may want to
%% un-comment the next line. This provides information on the
%% version of every file you have used.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%\listfiles
\newcommand*\mycommand[1]{\texttt{\emph{#1}}}
\author{Laurent Gatto}
\email{[email protected]}
\affiliation[UCLouvain]{Computational Biology Unit, de Duve Institute, Universit\'e catholique de Louvain, Brussels, Belgium}
\author{Sebastian Gibb}
\affiliation[University of Greifswald]{Department of Anaesthesiology and Intensive Care of the University Medicine Greifswald, Germany}
\author{Johannes Rainer}
\affiliation[Eurac Research]{Institute for Biomedicine, Eurac Research, Affiliated Institute of the University of L\"ubeck, Bolzano, Italy}
\title[MSnbase version 2]
{\texttt{MSnbase}, efficient and elegant R-based processing and
visualisation of raw mass spectrometry data}
\abbreviations{}
\keywords{Bioconductor, mass spectrometry, software, metabolomics, proteomics, visualisation, reproducible research} %% up to 10 keywords
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\begin{document}
%% \begin{tocentry}
%% See achemso-demo.tex
%% \end{tocentry}
\begin{abstract} %% 200 words max
We present version 2 of the \texttt{MSnbase} R/Bioconductor
package. \texttt{MSnbase} provides infrastructure for the
manipulation, processing and visualisation of mass spectrometry
data. We focus on the new \textit{on-disk} infrastructure, which
allows the handling of large raw mass spectrometry
experiment\textcolor{black}{s} on commodity hardware and illustrate
how the package is used for elegant data processing, method
development, and visualisation.
\end{abstract}
Keywords: R, Bioconductor, mass spectrometry, software, metabolomics,
proteomics, visualisation, reproducible research
<<setup, include = FALSE, message = FALSE>>=
knitr::opts_chunk$set(echo = FALSE, cache = FALSE)
library("tidyverse")
library("patchwork")
@
\section{Introduction}
Mass spectrometry is a powerful technology to assay chemical and
biological samples. It is used in routine applications with well
characterised \textcolor{black}{protocols} \textcolor{black}{such as in
clinical settings}, as well as a development platform,
\textcolor{black}{with the aim} to improve on existing
\textcolor{black}{protocols} and devise new ones. The complexity and
diversity of mass spectrometry yield complex \textcolor{black}{data} of
considerable size that require non-trivial processing before
producing interpretable results. \textcolor{black}{The complexity and
size of these data} constitute a significant challenge for
\textcolor{black}{protocol} development: in addition to the development
of sample processing and mass spectrometry methods
\textcolor{black}{that yield the raw data}, \textcolor{black}{it is
essential} to process, analyse, interpret and assess these new data
to demonstrate the improvement in the technical, analytical and
computational workflows.
Practitioners have a diverse catalogue of software tools at their
disposal. These range from low level software libraries that are aimed
at programmers to enable \textcolor{black}{the} development of new
applications, to more user-oriented applications with graphical user
interfaces which provide a more limited set of functionalities to
address a defined scope. Examples of software libraries include
Java-based jmzML~\cite{Cote:2010} or C/C++-based
ProteoWizard~\cite{Chambers:2012}. \textcolor{black}{Thermo Scientific
Proteome Discoverer (Thermo Fisher Scientific)},
MaxQuant~\cite{Cox:2008} and PeptideShaker~\cite{Vaudel:2015} are
among the most widely used user-centric applications.
In this software note, we present version 2 of the
\texttt{MSnbase}~\cite{Gatto:2012} software, available from the
Bioconductor~\cite{Huber:2015} project. The package, like other
software such as Python-based {pyOpenMS}~\cite{Rost:2014},
spectrum\_utils~\cite{Bittremieux:2020} or
Pyteomics~\cite{Goloborodko:2013}, offers a platform that lies between
low level libraries and end-user software. \texttt{MSnbase} provides a
flexible R~\cite{R} command-line environment for metabolomics and
proteomics mass spectrometry-based applications. It lays out a sound
infrastructure to work with raw mass spectrometry \textcolor{black}{data
from MS files in mzML, mzXML, mzData or ANDI-MS/netCDF format as
well as} quantitative and proteomics identification data. The
package enables manipulation (for example subsetting, filtering, or
accessing specific parts thereof), detailed step-by-step processing
(for example smoothing and centroiding of \textcolor{black}{profile-mode
MS} data, or normalisation and imputation of quantitative data),
analysis and visualisation of these data and the development of novel
computational mass spectrometry
methods~\cite{Stanstrup:2019}. \textcolor{black}{Extensive documentation
and use cases are provided in \textit{package
vignettes}~\cite{MSnbaseVignettes} and
workflows~\cite{xcmsWorkflow}.} Here, we focus on the new
developments pertaining to raw mass spectrometry data handling and
processing.
\section{Infrastructure for raw data}
In \texttt{MSnbase}, mass spectrometry experiments are handled as
\texttt{MSnExp} objects. While the implementation is more complex, it
is useful to schematise a raw data experiment as being composed of raw
data, i.e. a collection of individual spectra, as well as
spectra-level metadata (Figure \ref{fig:raw}). Each spectrum is
composed of m/z values and associated intensities. The metadata are
represented by a \textcolor{black}{single} table with variables along
the columns and each row associated with a spectrum. Among the metadata
available for each spectrum, there are MS level, acquisition number,
retention time, precursor m/z and intensity (for MS level 2 and
above), and many more. \texttt{MSnbase} \textcolor{black}{relies on the
\texttt{mzR} package~\cite{Chambers:2012} to import raw mass
spectrometry data from one of the many community-maintained open
standards formats (mzML, mzXML, mzData or ANDI-MS/netCDF) and}
provides a rich \textcolor{black}{and principled} interface to
manipulate such objects. The code chunk below illustrates such an
object as displayed in the R console and an enumeration of the
metadata fields.
\begin{figure}
\centering
\includegraphics[width=0.8\linewidth]{./figure/raw.png}
\caption{\textcolor{black}{Schematic representation of what is
referred to by \textit{raw data}: a collection of mass spectra
and a table containing spectrum-level annotations, with one row
per spectrum. Raw data are imported from one of the many
community-maintained open standards formats (mzML, mzXML, mzData
or ANDI-MS/netCDF).} }
\label{fig:raw}
\end{figure}
<<msnexp, echo = TRUE, eval = FALSE>>=
> show(ms)
MSn experiment data ("OnDiskMSnExp")
Object size in memory: 0.54 Mb
- - - Spectra data - - -
MS level(s): 1 2 3
Number of spectra: 994
MSn retention times: 45:27 - 47:6 minutes
- - - Processing information - - -
Data loaded [Sun Apr 26 15:40:58 2020]
MSnbase version: 2.13.6
- - - Meta data - - -
phenoData
rowNames: MS3TMT11.mzML
varLabels: sampleNames
varMetadata: labelDescription
Loaded from:
MS3TMT11.mzML
protocolData: none
featureData
featureNames: F1.S001 F1.S002 ... F1.S994 (994 total)
fvarLabels: fileIdx spIdx ... spectrum (35 total)
fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
> fvarLabels(ms)
[1] "fileIdx" "spIdx"
[3] "smoothed" "seqNum"
[5] "acquisitionNum" "msLevel"
[7] "polarity" "originalPeaksCount"
[9] "totIonCurrent" "retentionTime"
[11] "basePeakMZ" "basePeakIntensity"
[13] "collisionEnergy" "ionisationEnergy"
[15] "lowMZ" "highMZ"
[17] "precursorScanNum" "precursorMZ"
[19] "precursorCharge" "precursorIntensity"
[21] "mergedScan" "mergedResultScanNum"
[23] "mergedResultStartScanNum" "mergedResultEndScanNum"
[25] "injectionTime" "filterString"
[27] "spectrumId" "centroided"
[29] "ionMobilityDriftTime" "isolationWindowTargetMZ"
[31] "isolationWindowLowerOffset" "isolationWindowUpperOffset"
[33] "scanWindowLowerLimit" "scanWindowUpperLimit"
[35] "spectrum"
@
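As a brief illustration of this interface, the spectrum-level metadata
listed above can be queried either as a table or through dedicated
accessor functions. The sketch below assumes the \texttt{ms} object
shown above; the three columns selected from the table are arbitrary
examples.

<<metadata_access, echo = TRUE, eval = FALSE>>=
library("MSnbase")
## Spectrum-level metadata as a table: one row per spectrum
head(fData(ms)[, c("msLevel", "retentionTime", "precursorMZ")])
## Dedicated accessors return one value per spectrum
table(msLevel(ms))      ## number of spectra per MS level
range(rtime(ms))        ## retention time range, in seconds
## Precursor information is only meaningful for MS level 2 and above
ms2 <- filterMsLevel(ms, 2L)
summary(precursorMz(ms2))
@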
In the following sections, we describe \textcolor{black}{how
\texttt{MSnbase}} can be used for data processing and
visualisation. \textcolor{black}{An example of its ability to also
efficiently handle very large mass spectrometry experiments (in this
case with 5,773,464 spectra in 1,182 mzXML files) is provided as
supplementary information.} We will also illustrate how it makes
use of the forward-pipe operator (\texttt{\%>\%}) defined in the
\texttt{magrittr} package. This operator has proved useful to develop
non-trivial analyses by combining individual functions into easily
readable \textcolor{black}{and elegant} pipelines.
\subsection{On-disk backend}
The main feature in version 2 of the \texttt{MSnbase} package was the
addition of different backends for raw data storage, namely
\textit{in-memory} and \textit{on-disk}. The following code chunk
demonstrates how to \textcolor{black}{import data from an mzML file} to
create two \texttt{MSnExp} objects that store the data either
in memory or on disk.
<<readMSData, eval = FALSE, echo = TRUE>>=
library("MSnbase")
raw_mem <- readMSData("file.mzML", mode = "inMemory")
raw_dsk <- readMSData("file.mzML", mode = "onDisk")
@
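A rough way to compare the two backends on a given file is to time
both calls. This is only a sketch and not the benchmarking code used
for Figure~\ref{fig:bench}; \texttt{file.mzML} is the same placeholder
file name as above, and a single timing is of course less reliable than
replicated measurements.

<<read_timing, echo = TRUE, eval = FALSE>>=
library("MSnbase")
## Elapsed reading time (in seconds) for each backend
t_mem <- system.time(raw_mem <- readMSData("file.mzML", mode = "inMemory"))
t_dsk <- system.time(raw_dsk <- readMSData("file.mzML", mode = "onDisk"))
c(inMemory = t_mem[["elapsed"]], onDisk = t_dsk[["elapsed"]])
## The on-disk object is a specialised MSnExp
is(raw_dsk, "OnDiskMSnExp")
is(raw_dsk, "MSnExp")
@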
<<time_sz, message=FALSE>>=
load("bench_time_sz.rda")
load("bench_time_2.rda")
sz <- sapply(time_sz, function(x) x[, "sz"])
colnames(sz) <- c(1, 5, 10)
rownames(sz) <- c("in-memory", "on-disk")
sz <- round(sz/(1024^2), 2) ## convert bytes to MB
set.seed(1L) ## set jittering
p_read <-
time_2 %>%
magrittr::set_colnames(c("n", "in-memory", "on-disk")) %>%
pivot_longer(names_to = "backend",
values_to = "time",
cols = -n) %>%
ggplot(aes(x = factor(n), y = time, colour = backend)) +
geom_jitter(width = 0.05, size = 2, alpha = 0.5) +
geom_smooth(aes(group = backend),
method = "loess",
se = FALSE,
size = 0.8) +
ylab("Reading time [s]") +
xlab("Number of files") +
theme(legend.position = c(0.35, 0.86),
legend.title = element_blank(),
legend.text=element_text(size = 6))
p_sz <-
data.frame(sz) %>%
rownames_to_column() %>%
pivot_longer(names_to = "files",
values_to = "size",
-rowname) %>%
rename(backend = rowname) %>%
mutate(files = as.numeric(sub("X", "", files))) %>%
ggplot(aes(x = files, y = size, colour = backend)) +
geom_point(size = 2, alpha = 0.5) +
geom_line(size = 0.8) +
scale_x_continuous(breaks=c(1, 5, 10)) +
ylab("Object size [MB]") +
xlab("Number of files") +
theme(legend.position = "none")
@
<<filt>>=
load("bench_t_filt.rda")
p_filt <-
tibble(time = microbenchmark:::convert_to_unit(t_filt$time, "ms"),
expr = as.character(t_filt$expr)) %>%
mutate(mode = if_else(grepl("mem", expr), "in-memory", "on-disk")) %>%
ggplot(aes(x = mode, y = time, fill = mode)) +
ggplot2::geom_violin() +
ggplot2::scale_y_log10() +
ylab("Filtering time [ms]") +
xlab("Backend") +
theme(legend.position = "none")
@
<<access>>=
load("bench_t_access.rda")
p_access <-
tibble(time = microbenchmark:::convert_to_unit(t_access$time, "ms"),
expr = as.character(t_access$expr)) %>%
mutate(n = sub("^access_.+_", "", expr)) %>%
mutate(n = sub("all", 6103, n)) %>%
mutate(mode = if_else(grepl("mem", expr), "in-memory", "on-disk")) %>%
ggplot(aes(x = n, y = time, fill = mode)) +
ggplot2::geom_violin() +
ggplot2::scale_y_log10() +
ylab("Raw data access [ms]") +
xlab("Number of spectra (out of 6103)") +
facet_wrap(~ mode) +
theme(legend.position = "none")
@
\begin{figure}[p]
\centering
<<plot_bench, warning=FALSE, message=FALSE>>=
(p_read + p_sz + p_filt) / p_access +
plot_layout(heights = c(1, 0.75)) +
plot_annotation(tag_levels = 'a') &
theme(axis.title = element_text(size = 10))
@
\caption{(a) Reading time (\textcolor{black}{triplicates,} in seconds)
and (b) data size in memory (in MB) to read/store 1, 5 and 10 files
containing 1431 MS1 (on-disk only) and 6103 MS2 (on-disk and
in-memory) spectra. (c) Filtering benchmark assessed over 10
iterations on in-memory and on-disk data containing 6103 MS2
spectra. (d) Access time to spectra for the in-memory (left) and
on-disk (right) backends for 1, 10, 100, 1000, 5000 and all 6103
spectra. Benchmarks were performed on a Dell XPS laptop with an
Intel i5-8250U processor 1.60 GHz (4 cores, 8 threads), 7.5 GB RAM
running Ubuntu 18.04.4 LTS 64-bit \textcolor{black}{and an SSD
drive}. \textcolor{black}{The data used for the benchmarking are a
TMT 4-plex experiment acquired on a LTQ Orbitrap Velos (Thermo
Fisher Scientific) available in the \texttt{msdata} package and
described in \citep{Gatto:2014}.}}
\label{fig:bench}
\end{figure}
Both modes rely on the \texttt{mzR}~\cite{Chambers:2012} package to
access the spectra (using the \texttt{mzR::peaks()} function) and the
metadata (using the \texttt{mzR::header()} function) in the data
files. The former is the legacy storage mode, implemented in the first
version of the package, that loads all the raw data and the metadata
into memory upon creation of the in-memory \texttt{MSnExp}
object. This solution does not scale to large modern datasets, and was
complemented by the on-disk backend. The on-disk backend only loads
the metadata into memory when the on-disk \texttt{MSnExp} is created
and accesses the spectra data (i.e. m/z and intensity values) in the
original files on disk only when needed (see below and Figure
\ref{fig:bench} (d)), for example for plotting. There are two
direct benefits to using the on-disk backend, namely faster reading and
reduced memory footprint. Figure \ref{fig:bench} shows 5-fold faster
reading times (a) and over a 10-fold reduction in memory usage (b).
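To make this division of labour concrete, the following sketch calls
\texttt{mzR} directly. It is an illustration of the two access
functions named above rather than the code used internally by
\texttt{MSnbase}, and \texttt{file.mzML} is again a placeholder file
name.

<<mzr_access, echo = TRUE, eval = FALSE>>=
library("mzR")
fh <- openMSfile("file.mzML")   ## file handle; no spectra are read yet
hd <- header(fh)                ## data.frame of metadata, one row per spectrum
pk <- peaks(fh, 1L)             ## two-column matrix of m/z and intensity values
dim(hd)
head(pk)
close(fh)
@

The on-disk backend essentially keeps the equivalent of \texttt{hd} in
memory and defers calls equivalent to \texttt{peaks()} until the m/z
and intensity values are requested.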
\textcolor{black}{Because the on-disk backend does not hold all the
spectra data in memory, direct manipulations of these data are not
possible. We thus implemented a \textit{lazy processing} mechanism
for this backend that caches any data manipulation operations in a
processing queue in the object itself. These operations are then
applied only when the user accesses m/z or intensity values. }
\textcolor{black}{As an additional advantage, operations on subsets of
the data become much faster since data manipulations are applied
only to data subsets instead of the full data set at once. Also,
on-disk data access is parallelised by data file, ensuring a higher
performance of this backend over conventional in-memory data
representations.} As an example, the following short analysis
pipeline, which can equally be applied to in-memory or on-disk data,
retains MS2 spectra acquired between 1000 and 3000 seconds, extracts
the m/z range corresponding to the TMT 6-plex range and focuses on the
MS2 spectra with a precursor intensity greater than $11 \times 10^6$
(the median precursor intensity).
<<filter, eval = FALSE, echo = TRUE>>=
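ms <- ms %>%
    filterRt(c(1000, 3000)) %>%   ## retention time window, in seconds
    filterMz(c(120, 135))         ## m/z range, passed as a single vector
## keep spectra with a precursor intensity above the median
ms[precursorIntensity(ms) > 11e6, ]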
@
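The deferred evaluation can be made visible by inspecting the object
after such a pipeline. The sketch below assumes that \texttt{ms} is an
on-disk \texttt{MSnExp}; the \texttt{spectraProcessingQueue} slot is
an internal detail that is accessed here for illustration only and
should not be relied upon in user code.

<<lazy_queue, echo = TRUE, eval = FALSE>>=
## Number of processing steps queued so far (internal slot, shown for
## illustration only; it may change between package versions)
length(ms@spectraProcessingQueue)
## Metadata-only queries do not touch the raw files
head(rtime(ms))
## Accessing peak data reads the requested spectrum from disk and
## applies the queued steps on the fly
sp <- ms[[1]]
head(mz(sp))
@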
As shown in Figure~\ref{fig:bench} (c), this lazy mechanism is
significantly faster than the same operations applied to in-memory
data. The advantageous reading and execution times and memory footprint
of the on-disk backend are made possible by retrieving spectra data only
for the selected subset, hence avoiding access to the full raw data.
Once access to the spectra m/z and intensity values becomes mandatory
(for example for plotting), the in-memory backend becomes more
efficient, as illustrated in Figure~\ref{fig:bench} (d).
\textcolor{black}{The benefit of accessing data in memory is however
reduced by underlying copies that are performed during the
subsetting operation. When subsetting an in-memory \texttt{MSnExp}
into a new, smaller in-memory \texttt{MSnExp} instance, the matrices
that contain the spectra for the new object are copied, thus leading
to increased execution time and (transient, if the original data are
replaced) memory usage. Figure~\ref{fig:bench} (d) shows that the
larger the subset, the smaller the benefits of an in-memory backend
become. The example with the 6103 spectra, corresponding to the full
data (i.e. all spectra are already in memory and there is no memory
management overhead) is representative of memory access only and
constitutes the best case scenario. }
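A comparison of this kind can be sketched with the
\texttt{microbenchmark} package. The code below is not the benchmark
behind Figure~\ref{fig:bench}; it assumes that \texttt{raw\_mem} and
\texttt{raw\_dsk} were created from the same file as shown earlier,
that \texttt{raw\_mem} holds MS2 spectra only, and that at least 100
MS2 spectra are available.

<<access_bench, echo = TRUE, eval = FALSE>>=
library("MSnbase")
library("microbenchmark")
## Restrict the on-disk data to MS2 so that both objects are comparable
dsk2 <- filterMsLevel(raw_dsk, 2L)
## Time the extraction of intensities for a subset of 100 spectra
microbenchmark(
    mem_100 = intensity(raw_mem[1:100]),
    dsk_100 = intensity(dsk2[1:100]),
    times = 10L)
@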
\textcolor{black}{The on-disk backend has become the preferred backend
for large data, and the only viable alternative when the size of the
data exceeds the available RAM and/or when several MS levels are to
be loaded and handled simultaneously. The in-memory backend can
still prove useful in cases when small MS2-only data are to be
analysed, and will remain available in future versions of
\texttt{MSnbase}.}
\subsection{Prototyping}
The \texttt{MSnExp} data structure and its interface constitute an
efficient prototyping environment for computational method
development. We illustrate this by demonstrating how to implement the
BoxCar\cite{Meier:2018} acquisition method. In a nutshell, BoxCar
acquisition aims at improving the detection of intact precursor ions
by distributing the charge capacity over multiple narrow m/z segments
and thus limiting the proportion of highly abundant precursors in each
segment. A full scan is reconstructed by combining the respective
adjacent segments of the BoxCar acquisitions. The
\texttt{MSnbaseBoxCar} package\cite{MSnbaseBoxCar} is a small package
that demonstrates this. The simple \textcolor{black}{pipeline} is
composed of three steps, described below, and illustrated with code
from \texttt{MSnbaseBoxCar} in the following code chunk.
\begin{enumerate}
\item Identify and filter the groups of spectra that represent
adjacent BoxCar acquisitions (Figure~\ref{fig:bc}~(b)). This can be
done using the \textit{filterString} metadata variable that
identifies BoxCar spectra by their adjacent m/z segments with the
\texttt{bc\_groups()} function and filtering relevant spectra with
the \texttt{filterBoxCar()} function.
\item Remove any signal outside the BoxCar segments using the
\texttt{bc\_zero\_out\_box()} function from \texttt{MSnbaseBoxCar}
(Figures~\ref{fig:bc}~(c) and (d)).
\item Using the \texttt{combineSpectra} function from the
\texttt{MSnbase} package, combine the cleaned BoxCar spectra into a new,
full spectrum (Figure~\ref{fig:bc}~(e)).
\end{enumerate}
<<bc1, echo=TRUE, eval=FALSE>>=
bc <- readMSData("boxcar.mzML", mode = "onDisk") %>%
bc_groups() %>% ## identify BoxCar groups (creates 'bc_groups')
filterBoxCar() %>% ## keep only BoxCar spectra
bc_zero_out_box() %>% ## remove signal outside of BoxCar segments
combineSpectra(fcol = "bc_groups", ## reconstruct full spectrum
method = boxcarCombine)
@
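The idea behind the second step can be illustrated independently of any
package. The helper below is hypothetical and is not the
\texttt{MSnbaseBoxCar} implementation: it is a minimal base R sketch
that sets to zero every intensity whose m/z value falls outside a set
of segments, using made-up values.

<<zero_out_sketch, echo = TRUE, eval = FALSE>>=
## Hypothetical helper, not part of MSnbaseBoxCar: set to zero every
## intensity whose m/z lies outside all [lower, upper] segments.
zero_outside_boxes <- function(mz, intensity, boxes) {
    stopifnot(length(mz) == length(intensity), ncol(boxes) == 2L)
    inside <- rep(FALSE, length(mz))
    for (i in seq_len(nrow(boxes)))
        inside <- inside | (mz >= boxes[i, 1L] & mz <= boxes[i, 2L])
    intensity[!inside] <- 0
    intensity
}
mz <- c(400, 410, 425, 440, 455)
int <- c(10, 20, 30, 40, 50)
boxes <- rbind(c(405, 415), c(435, 445))   ## two made-up segments
zero_outside_boxes(mz, int, boxes)
## expected: 0 20 0 40 0
@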
After processing of the BoxCar data, the final object can either be
further analysed \textcolor{black}{using \texttt{MSnbase}} or written
back to disk as an \texttt{mzML} file using \texttt{writeMSData()} for
processing with other tools.
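A minimal sketch of the export step is given below, assuming the
\texttt{bc} object created above; the output file name is a
placeholder.

<<write_back, echo = TRUE, eval = FALSE>>=
## Export the processed spectra; with 'copy = FALSE', general
## information from the original file is not copied to the new one
writeMSData(bc, file = "boxcar_processed.mzML", outformat = "mzml",
            copy = FALSE)
## The exported file can be read back like any other mzML file
bc2 <- readMSData("boxcar_processed.mzML", mode = "onDisk")
@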
\begin{figure}[p]
\centering
<<plot_boxcar, message=FALSE, warning=FALSE, fig.height=8>>=
load("boxcar.rda")
p4 <- p4 + xlab("m/z")
p1 + p2 + p3 + p4 +
plot_layout(ncol = 1,
heights = c(0.7, 1.2, 0.7, 0.7)) +
plot_annotation(tag_levels = 'a') &
theme(axis.title = element_text(size = 10),
axis.text = element_text(size = 6),
plot.margin = margin(0, 0, 0, 0, "mm"))
@
\caption{BoxCar processing with \texttt{MSnbase}. (a) Standard full
scan with (b) three corresponding BoxCar scans showing the adjacent
segments. Panel (c) shows the overlapping intact BoxCar segments
and (d) the same segments after cleaning, i.e. where peaks outside
of the segments were removed. The reconstructed full scan is shown
in panel (e). Spectra visualisation, as shown here, relies on the
\texttt{ggplot2} \cite{ggplot2} package. }
\label{fig:bc}
\end{figure}
All the functions for the processing of BoxCar spectra and segments in
\texttt{MSnbaseBoxCar} were developed using existing functionality
implemented in \texttt{MSnbase}, illustrating the flexibility and
adaptability of the \texttt{MSnbase} package for computational mass
spectrometry method development.
\subsection{Visualisation}
The R environment is well known for the quality of its visualisation
capacity. This also holds true for mass
spectrometry\cite{Gatto:2015,protViz,Pviz,Bemis:2015}. Here, we
conclude the overview of version 2 of the \texttt{MSnbase} package by
highlighting the flexibility of the software to visualise and assess
the efficiency of raw data processing. Figure \ref{fig:cent} compares
the raw MS profile data \textcolor{black}{imported from an mzML file}
for serine and the same data after smoothing, centroiding and m/z
refinement, as illustrated in the code chunk below. Detailed execution
and description of these operations can be found in the \emph{MSnbase:
centroiding of profile-mode MS data} \texttt{MSnbase} vignette.
<<serine_processing, echo=TRUE, eval=FALSE>>=
serine_mz <- 106.049871
serine_proc <- ms %>%
smooth(method = "SavitzkyGolay", halfWindowSize = 4L) %>%
pickPeaks(refineMz = "descendPeak") %>%
filterMz(c(serine_mz - 0.01, serine_mz + 0.01)) %>%
filterRt(c(175, 187))
@
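The processed object can then be inspected visually. The following
sketch is not the code used to produce Figure~\ref{fig:cent}; it
assumes the \texttt{serine\_proc} object created above and relies on
the default plotting methods provided by \texttt{MSnbase}.

<<serine_plot, echo = TRUE, eval = FALSE>>=
## Signal in the retention time -- m/z space, with the base peak
## chromatogram on top
plot(serine_proc, type = "XIC")
## Base peak chromatogram as a separate object
bpc <- chromatogram(serine_proc, aggregationFun = "max")
plot(bpc)
@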
\begin{figure}
\centering
\includegraphics[width=\linewidth]{./figure/centroiding.pdf}
\caption{Visualisation of data smoothing and m/z refinement using
\texttt{MSnbase}. (a) Raw MS profile data for serine. Upper panel
shows the base peak chromatogram (BPC), lower panel the individual
signals in the retention time -- m/z space. The horizontal dashed
red line indicates the theoretical m/z of the [M+H]+ adduct of
serine. (b) Smoothed and centroided data for serine with m/z
refinement. The horizontal red dashed line indicates the
theoretical m/z for the [M+H]+ ion and the vertical red dotted
line the position of the maximum signal. }
\label{fig:cent}
\end{figure}
\subsection{\textcolor{black}{Package maintenance and governance}}
The first public commit to the \texttt{MSnbase} GitHub repository was
in October 2010. Since then, the package has benefited from 12
contributors\cite{contribs} that added various features, some
particularly significant ones such as the on-disk backend described
herein. \textcolor{black}{Contributions to the package are explicitly
encouraged, rewarded by an official contributor status and governed
by a code of conduct.}
According to \texttt{MSnbase}'s Bioconductor page, there are 36
Bioconductor packages that depend on, import or suggest it. Among these
are \texttt{pRoloc}~\cite{Gatto:2014a} to analyse mass
spectrometry-based spatial proteomics data,
\texttt{msmsTests}~\cite{msmsTests}, \texttt{DEP}~\cite{Zhang:2018},
\texttt{DAPAR} and \texttt{ProStaR}~\cite{Wieczorek:2017} for the
statistical analysis \textcolor{black}{of} quantitative proteomics data,
\texttt{RMassBank}~\cite{Stravs:2013} to process metabolomics tandem
MS files and build MassBank records,
\texttt{MSstatsQC}~\cite{Dogu:2017} for longitudinal system
suitability monitoring and quality control of targeted proteomic
experiments and the widely used \texttt{xcms}~\cite{Smith:2006}
package for the processing and analysis of metabolomics
data. \texttt{MSnbase} is also used in non-R/Bioconductor software,
such as IsoProt~\cite{Griss:2019}, which provides a
reproducible workflow for iTRAQ/TMT experiments. \textcolor{black}{The
BioContainers~\cite{da_Veiga_Leprevost:2017} project offers a
dedicated container for the \texttt{MSnbase} package, thus
facilitating the reuse of the package in third-party pipelines.}
\texttt{MSnbase} currently ranks 101 out of 1823 packages based on the
monthly downloads from unique IP addresses, tallying over 1000
downloads from unique IP addresses each month.
As is custom with Bioconductor packages, \texttt{MSnbase} comes with
ample documentation. Every user-accessible function is documented in a
dedicated manual page. In addition, the package offers 5 vignettes,
including one aimed at developers. The package is checked nightly on
the Bioconductor servers: it implements unit tests covering 72\% of
the code base and, through its vignettes, also provides integration
testing. Questions from users and developers are answered on the
Bioconductor support forum as well as on the package GitHub page. The
package provides several sample and benchmarking datasets, and relies
on other dedicated \textit{experiment packages} such as
\texttt{msdata}~\cite{msdata} for raw data or
\texttt{pRolocdata}~\cite{Gatto:2014a} for quantitative
data. \texttt{MSnbase} is available on Windows, Mac OS and Linux under
the open source Artistic 2.0 license and easily installable using
standard installation procedures.
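For completeness, the standard Bioconductor installation procedure is
sketched below; it assumes a working R installation with internet
access.

<<install, echo = TRUE, eval = FALSE>>=
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("MSnbase")
browseVignettes("MSnbase")   ## list the package vignettes
@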
\textcolor{black}{The growth of \texttt{MSnbase} and the user support
provided over the years attest to the core maintainers' commitment to
long-term development, and the quality and maintainability of the
code base.}
\section{Discussion}
We have presented here some important functionality of
\texttt{MSnbase} version 2. The new on-disk infrastructure enables
large scale data analyses~\cite{Nothias:2020}, either using
\texttt{MSnbase} directly or through packages that rely on it, such as
\texttt{xcms}. We have also illustrated how \texttt{MSnbase} can be
used for standard data analysis and visualisation, and how it can be
used for method development and prototyping.
<<version>>=
v <- packageVersion("MSnbase")
@
The version of \texttt{MSnbase} used in this manuscript is
\Sexpr{v}. The main features presented here have been available since
version 2.0. The code to reproduce the analyses and figures in this
article is available at
\url{https://github.com/lgatto/2020-msnbase-v2/}.
\section{Associated Content}
Supplementary file 1: script documenting the processing of 1182 mzXML
files (5773464 spectra) using \texttt{MSnbase}.
\begin{acknowledgement}
The authors thank the various contributors and users who have provided
constructive input and feedback that have helped improve the package
over the years. The authors declare no conflict of
interest.
\end{acknowledgement}
%% \begin{suppinfo} %% Supporting Information
%%
%% A listing of the contents of each file supplied as Supporting Information
%% should be included. For instructions on what should be included in the
%% Supporting Information as well as how to prepare this material for
%% publications, refer to the journal's Instructions for Authors.
%%
%% The following files are available free of charge.
%% \begin{itemize}
%% \item Filename: brief description
%% \item Filename: brief description
%% \end{itemize}
%% \end{suppinfo}
\bibliography{refs}
\end{document}