msnbase2.Rnw

\documentclass[journal=jacsat,manuscript=article]{achemso}

\usepackage[]{graphicx}
\usepackage[]{color}

\usepackage[]{xcolor}
\usepackage[normalem]{ulem}

%% maxwidth is the original width if it is less than linewidth
%% otherwise use linewidth (to make sure the graphics do not exceed the margin)
\makeatletter
\def\maxwidth{ %
  \ifdim\Gin@nat@width>\linewidth
    \linewidth
  \else
    \Gin@nat@width
  \fi
}
\makeatother


\usepackage{subfig}
\usepackage{graphicx}
\usepackage{alltt}

\usepackage{chemformula} % Formula subscripts using \ch{}
\usepackage[T1]{fontenc} % Use modern font encodings

%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%% If issues arise when submitting your manuscript, you may want to
%% un-comment the next line.  This provides information on the
%% version of every file you have used.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%\listfiles

\newcommand*\mycommand[1]{\texttt{\emph{#1}}}

\author{Laurent Gatto}
\email{laurent.gatto@uclouvain.be}
\affiliation[UCLouvain]{Computational Biology Unit, de Duve Institute, Universit\'e catholique de Louvain, Brussels, Belgium}
\author{Sebastian Gibb}
\affiliation[University of Greifswald]{Department of Anaesthesiology and Intensive Care of the University Medicine Greifswald, Germany}
\author{Johannes Rainer}
\affiliation[Eurac Research]{Institute for Biomedicine, Eurac Research, Affiliated Institute of the University of L\"ubeck, Bolzano, Italy}


\title[MSnbase version 2]
  {\texttt{MSnbase}, efficient and elegant R-based processing and
    visualisation of raw mass spectrometry data}

\abbreviations{}
\keywords{Bioconductor, mass spectrometry, software, metabolomics, proteomics, visualisation, reproducible research} %% up to 10 keywords
\IfFileExists{upquote.sty}{\usepackage{upquote}}{}
\begin{document}

%% \begin{tocentry}
%% See achemso-demo.tex
%% \end{tocentry}

\begin{abstract} %% 200 words max
  We present version 2 of the \texttt{MSnbase} R/Bioconductor
  package. \texttt{MSnbase} provides infrastructure for the
  manipulation, processing and visualisation of mass spectrometry
  data. We focus on the new \textit{on-disk} infrastructure, that
  allows the handling of large raw mass spectrometry
  experiment\textcolor{black}{s} on commodity hardware and illustrate
  how the package is used for elegant data processing, method
  development, and visualisation.
\end{abstract}

Keywords: R, Bioconductor, mass spectrometry, software, metabolomics,
proteomics, visualisation, reproducible research

<<setup, include = FALSE, message = FALSE>>=
knitr::opts_chunk$set(echo = FALSE, cache = FALSE)
library("tidyverse")
library("patchwork")
@

\section{Introduction}

Mass spectrometry is a powerful technology to assay chemical and
biological samples. It is used in routine applications with well
characterised \textcolor{black}{protocols} \textcolor{black}{such as in
  clinical settings}, as well as a development platform,
\textcolor{black}{with the aim} to improve on existing
\textcolor{black}{protocols} and devise new ones. The complexity and
diversity of mass spectrometry yield complex \textcolor{black}{data} of
considerable size, that require non trivial processing before
producing interpretable results.  \textcolor{black}{The complexity and
  size of these data} constitute a significant challenge for
\textcolor{black}{protocol} development: in addition to the development
of sample processing and mass spectrometry methods
\textcolor{black}{that yield the raw data}, \textcolor{black}{it is
  essential} to process, analyse, interpret and assess these new data
to demonstrate the improvement in the technical, analytical and
computational workflows.

Practitioners have a diverse catalogue of software tools at their
disposal. These range from low level software libraries that are aimed
at programmers to enable \textcolor{black}{the} development of new
applications, to more user-oriented applications with graphical user
interfaces which provide a more limited set of functionalities to
address a defined scope. Examples of software libraries include
Java-based jmzML~\cite{Cote:2010} or C/C++-based
ProteoWizard~\cite{Chambers:2012}.  \textcolor{black}{Thermo Scientific
  Proteome Discoverer (Thermo Fisher Scientific)},
MaxQuant~\cite{Cox:2008} and PeptideShaker~\cite{Vaudel:2015} are
among the most widely used user-centric applications.

In this software note, we present version 2 of the
\texttt{MSnbase}~\cite{Gatto:2012} software, available from the
Bioconductor~\cite{Huber:2015} project. The package, like other
software such as Python-based {pyOpenMS}~\cite{Rost:2014},
spectrum\_utils~\cite{Bittremieux:2020} or
Pyteomics~\cite{Goloborodko:2013}, offers a platform that lies between
low level libraries and end-user software. \texttt{MSnbase} provides a
flexible R~\cite{R} command-line environment for metabolomics and
proteomics mass spectrometry-based applications. It lays out a sound
infrastructure to work with raw mass spectrometry \textcolor{black}{data
  from MS files in mzML, mzXML, mzData or ANDI-MS/netCDF format as
  well as} quantitative and proteomics identification data. The
package enables manipulation (for example subsetting, filtering, or
accessing specific parts thereof), detailed step-by-step processing
(for example smoothing and centroiding of \textcolor{black}{profile-mode
  MS} data, or normalisation and imputation of quantitative data),
analysis and visualisation of these data and the development of novel
computational mass spectrometry
methods~\cite{Stanstrup:2019}. \textcolor{black}{Extensive documentation
  and use cases are provided in \textit{package
    vignettes}~\cite{MSnbaseVignettes} and
  workflows~\cite{xcmsWorkflow}.} Here, we focus on the new
developments pertaining to raw mass spectrometry data handling and
processing.

\section{Infrastructure for raw data}

In \texttt{MSnbase}, mass spectrometry experiments are handled as
\texttt{MSnExp} objects. While the implementation is more complex, it
is useful to schematise a raw data experiment as being composed of raw
data, i.e. a collection of individual spectra, as well as
spectra-level metadata (Figure \ref{fig:raw}). Each spectrum is
composed of m/z values and associated intensities. The metadata are
represented by a \textcolor{black}{single} table with variables along
the columns and each row associated to a spectrum. Among the metadata
available for each spectrum, there are MS level, acquisition number,
retention time, precursor m/z and intensity (for MS level 2 and
above), and many more. \texttt{MSnbase} \textcolor{black}{relies on the
  \texttt{mzR} package~\cite{Chambers:2012} to import raw mass
  spectrometry data from one of the many community-maintained open
  standards formats (mzML, mzXML, mzData or ANDI-MS/netCDF) and}
provides a rich \textcolor{black}{and principled} interface to
manipulate such objects. The code chunk below illustrates such an
object as displayed in the R console and an enumeration of the
metadata fields.

\begin{figure}
  \centering
  \includegraphics[width=0.8\linewidth]{./figure/raw.png}
  \caption{\textcolor{black}{Schematic representation of what is
      referred to by \textit{raw data}: a collection of mass spectra
      and a table containing spectrum-level annotations along the
      lines. Raw data are imported from one of the many
      community-maintained open standards formats (mzML, mzXML, mzData
      or ANDI-MS/netCDF).} }
  \label{fig:raw}
\end{figure}


<<msnexp, echo = TRUE, eval = FALSE>>=
> show(ms)
MSn experiment data ("OnDiskMSnExp")
Object size in memory: 0.54 Mb
- - - Spectra data - - -
 MS level(s): 1 2 3
 Number of spectra: 994
 MSn retention times: 45:27 - 47:6 minutes
- - - Processing information - - -
Data loaded [Sun Apr 26 15:40:58 2020]
 MSnbase version: 2.13.6
- - - Meta data  - - -
phenoData
  rowNames: MS3TMT11.mzML
  varLabels: sampleNames
  varMetadata: labelDescription
Loaded from:
  MS3TMT11.mzML
protocolData: none
featureData
  featureNames: F1.S001 F1.S002 ... F1.S994 (994 total)
  fvarLabels: fileIdx spIdx ... spectrum (35 total)
  fvarMetadata: labelDescription
experimentData: use 'experimentData(object)'
> fvarLabels(ms)
 [1] "fileIdx"                    "spIdx"
 [3] "smoothed"                   "seqNum"
 [5] "acquisitionNum"             "msLevel"
 [7] "polarity"                   "originalPeaksCount"
 [9] "totIonCurrent"              "retentionTime"
[11] "basePeakMZ"                 "basePeakIntensity"
[13] "collisionEnergy"            "ionisationEnergy"
[15] "lowMZ"                      "highMZ"
[17] "precursorScanNum"           "precursorMZ"
[19] "precursorCharge"            "precursorIntensity"
[21] "mergedScan"                 "mergedResultScanNum"
[23] "mergedResultStartScanNum"   "mergedResultEndScanNum"
[25] "injectionTime"              "filterString"
[27] "spectrumId"                 "centroided"
[29] "ionMobilityDriftTime"       "isolationWindowTargetMZ"
[31] "isolationWindowLowerOffset" "isolationWindowUpperOffset"
[33] "scanWindowLowerLimit"       "scanWindowUpperLimit"
[35] "spectrum"
@

In the following sections, we describe \textcolor{black}{how
  \texttt{MSnbase}} can be used for data processing and
visualisation. \textcolor{black}{An example of its ability to also
  efficiently handle very large mass spectrometry experiments (in this
  case with 5,773,464 spectra in 1,182 mzXML files) is provided as
  supplementary information.}  We will also illustrate how it makes
use of the forward-pipe operator (\texttt{\%>\%}) defined in the
\texttt{magrittr} package. This operator has proved useful to develop
non-trivial analyses by combining individual functions into easily
readable \textcolor{black}{and elegant} pipelines.


\subsection{On-disk backend}

The main feature in version 2 of the \texttt{MSnbase} package was the
addition of different backends for raw data storage, namely
\textit{in-memory} and \textit{on-disk}. The following code chunk
demonstrates how to \textcolor{black}{import data from an mzML file} to
create two \texttt{MSnExp} objects that store the data either
in memory or on disk.

<<readMSData, eval = FALSE, echo = TRUE>>=
library("MSnbase")
raw_mem <- readMSData("file.mzML", mode = "inMemory")
raw_dsk <- readMSData("file.mzML", mode = "onDisk")
@

<<time_sz, message=FALSE>>=
load("bench_time_sz.rda")
load("bench_time_2.rda")

sz <- sapply(time_sz, function(x) x[, "sz"])
colnames(sz) <- c(1, 5, 10)
rownames(sz) <- c("in-memory", "on-disk")
sz <- round(sz/(1014^2), 2)

set.seed(1L) ## set jittering
p_read <-
    time_2 %>%
    magrittr::set_colnames(c("n", "in-memory", "on-disk")) %>%
    pivot_longer(names_to = "backend",
                 values_to = "time",
                 cols = -n) %>%
    ggplot(aes(x = factor(n), y = time, colour = backend)) +
    geom_jitter(width = 0.05, size = 2, alpha = 0.5) +
    geom_smooth(aes(group = backend),
                method = "loess",
                se = FALSE,
                size = 0.8) +
    ylab("Reading time [s]") +
    xlab("Number of files") +
    theme(legend.position = c(0.35, 0.86),
          legend.title = element_blank(),
          legend.text=element_text(size = 6))


p_sz <-
    data.frame(sz) %>%
    rownames_to_column() %>%
    pivot_longer(names_to = "files",
                 values_to = "size",
                 -rowname) %>%
    rename(backend = rowname) %>%
    mutate(files = as.numeric(sub("X", "", files))) %>%
    ggplot(aes(x = files, y = size, colour = backend)) +
    geom_point(size = 2, alpha = 0.5) +
    geom_line(size = 0.8) +
    scale_x_continuous(breaks=c(1, 5, 10)) +
    ylab("Object size [MB]") +
    xlab("Number of files") +
    theme(legend.position = "none")
@

<<filt>>=
load("bench_t_filt.rda")
p_filt <-
    tibble(time = microbenchmark:::convert_to_unit(t_filt$time, "ms"),
           expr = as.character(t_filt$expr)) %>%
    mutate(mode = if_else(grepl("mem", expr), "in-memory", "on-disk")) %>%
    ggplot(aes(x = mode, y = time, fill = mode)) +
    ggplot2::geom_violin() +
    ggplot2::scale_y_log10() +
    ylab("Filtering time [ms]") +
    xlab("Backend") +
    theme(legend.position = "none")

@

<<access>>=
load("bench_t_access.rda")
p_access <-
    tibble(time = microbenchmark:::convert_to_unit(t_access$time, "ms"),
           expr = as.character(t_access$expr)) %>%
    mutate(n = sub("^access_.+_", "", expr)) %>%
    mutate(n = sub("all", 6103, n)) %>%
    mutate(mode = if_else(grepl("mem", expr), "in-memory", "on-disk")) %>%
    ggplot(aes(x = n, y = time, fill = mode)) +
    ggplot2::geom_violin() +
    ggplot2::scale_y_log10() +
    ylab("Raw data access [ms]") +
    xlab("Number of spectra (out of 6103)") +
    facet_wrap(~ mode) +
    theme(legend.position = "none")
@

\begin{figure}[p]
  \centering
<<plot_bench, warning=FALSE, message=FALSE>>=
(p_read + p_sz + p_filt) / p_access +
    plot_layout(heights = c(1, 0.75)) +
    plot_annotation(tag_levels = 'a') &
    theme(axis.title = element_text(size = 10))
@
\caption{(a) Reading time (\textcolor{black}{triplicates,} in seconds)
  and (b) data size in memory (in MB) to read/store 1, 5 and 10 files
  containing 1431 MS1 (on-disk only) and 6103 MS2 (on-disk and
  in-memory) spectra. (c) Filtering benchmark assessed over 10
  interactions on in-memory and on-disk data containing 6103 MS2
  spectra.  (d) Access time to spectra for the in-memory (left) and
  on-disk (right) backends for 1, 10, 100 1000, 5000 and all 6103
  spectra. Benchmarks were performed on a Dell XPS laptop with an
  Intel i5-8250U processor 1.60 GHz (4 cores, 8 threads), 7.5 GB RAM
  running Ubuntu 18.04.4 LTS 64-bit \textcolor{black}{and an SSD
    drive}. \textcolor{black}{The data used for the benchmarking are a
    TMT 4-plex experiment acquired on a LTQ Orbitrap Velos (Thermo
    Fisher Scientific) available in the \texttt{msdata} package and
    described in \citep{Gatto:2014}.}}
\label{fig:bench}
\end{figure}

Both modes rely on the \texttt{mzR}~\cite{Chambers:2012} package to
access the spectra (using the \texttt{mzR::peaks()} function) and the
metadata (using the \texttt{mzR::header()} function) in the data
files. The former is the legacy storage mode, implemented in the first
version of the package, that loads all the raw data and the metadata
into memory upon creation of the in-memory \texttt{MSnExp}
object. This solution doesn't scale for modern large dataset, and was
complemented by the on-disk backend. The on-disk backend only loads
the metadata into memory when the on-disk \texttt{MSnExp} is created
and accesses the spectra data (i.e. m/z and intensity values) in the
original files on disk only when needed (see below and Figure
\ref{fig:bench} (d)), such as for example for plotting. There are two
direct benefits using the on-disk backend, namely faster reading and
reduced memory footprint. Figure \ref{fig:bench} shows 5-fold faster
reading times (a) and over a 10-fold reduction in memory usage (b).

\textcolor{black}{Because the on-disk backend does not hold all the
  spectra data in memory, direct manipulations of these data are not
  possible. We thus implemented a \textit{lazy processing} mechanism
  for this backend that caches any data manipulation operations in a
  processing queue in the object itself. These operations are then
  applied only when the user accesses m/z or intensity values. }
\textcolor{black}{As an additional advantage, operations on subsets of
  the data become much faster since data manipulations are applied
  only to data subsets instead of the full data set at once. Also,
  on-disk data access is parallelized by data file ensuring a higher
  performance of this backend over conventional in-memory data
  representations.}  As an example, the following short analysis
pipeline, that can equally be applied to in-memory or on-disk data,
retains MS2 spectra acquired between 1000 and 3000 seconds, extracts
the m/z range corresponding to the TMT 6-plex range and focuses on the
MS2 spectra with a precursor intensity greater than $11 \times 10^6$
(the median precursor intensity).

<<filter, eval = FALSE, echo = TRUE>>=
ms <- ms %>%
    filterRt(c(1000, 3000)) %>%
    filterMz(120, 135)
ms[precursorIntensity(ms) > 11e6, ]
@

As shown on Figure~\ref{fig:bench} (c), this lazy mechanism is
significantly faster than its application on in-memory data. The
advantageous reading and execution times and memory footprint of the
on-disk backend are possible by retrieving only spectra data from the
selected subset hence avoiding access to the full raw data. Once
access to the spectra m/z and intensity values become mandatory (for
example for plotting), then the in-memory backend becomes more
efficient, as illustrated on Figure~\ref{fig:bench} (d).
\textcolor{black}{The benefit of accessing data in memory is however
  reduced by underlying copies that are performed during the
  subsetting operation. When subsetting an in-memory \texttt{MSnExp}
  into a new, smaller in-memory \texttt{MSnExp} instance, the matrices
  that contain the spectra for the new object are copied, thus leading
  to increased execution time and (transient, if the original data are
  replaced) memory usage. Figure~\ref{fig:bench} (d) shows that the
  larger the subset, the smaller the benefits of an in-memory backend
  become. The example with the 6103 spectra, corresponding to the full
  data (i.e. all spectra are already in memory and there is no memory
  management overhead) is representative of memory access only and
  constitutes the best case scenario. } 


\textcolor{black}{The on-disk backend has become the preferred backend
  for large data, and the only viable alternative when the size of the
  data exceeds the available RAM and/or when several MS levels are to
  be loaded and handled simultaneously. The in-memory backend can
  still prove useful in cases when small MS2-only data are to be
  analysed, and will remain available in future versions of
  \texttt{MSnbase}.}


\subsection{Prototyping}

The \texttt{MSnExp} data structure and its interface constitute an
efficient prototyping environment for computational method
development. We illustrate this by demonstrating how to implement the
BoxCar\cite{Meier:2018} acquisition method. In a nutshell, BoxCar
acquisition aims at improving the detection of intact precursor ions
by distributing the charge capacity over multiple narrow m/z segments
and thus limiting the proportion of highly abundant precursors in each
segment. A full scan is reconstructed by combining the respective
adjacent segments of the BoxCar acquisitions. The
\texttt{MSnbaseBoxCar} package\cite{MSnbaseBoxCar} is a small package
that demonstrates this. The simple \textcolor{black}{pipeline} is
composed of three steps, described below, and illustrated with code
from \texttt{MSnbaseBoxCar} in the following code chunk.

\begin{enumerate}

\item Identify and filter the groups of spectra that represent
  adjacent BoxCar acquisitions (Figure~\ref{fig:bc}~(b)). This can be
  done using the \textit{filterString} metadata variable that
  identifies BoxCar spectra by their adjacent m/z segments with the
  \texttt{bc\_groups()} function and filtering relevant spectra with
  the \texttt{filterBoxCar()}.

\item Remove any signal outside the BoxCar segments using the
  \texttt{bc\_zero\_out\_box()} function from \texttt{MSnbaseBoxCar}
  (Figures~\ref{fig:bc}~(c) and (d)).


\item Using the \texttt{combineSpectra} function from the
  \texttt{MSnbase}, combine the cleaned BoxCar spectra into a new,
  full spectrum (Figure~\ref{fig:bc}~(e)).

\end{enumerate}

<<bc1, echo=TRUE, eval=FALSE>>=
bc <- readMSData("boxcar.mzML", mode = "onDisk") %>%
      bc_groups() %>%       ## identify BoxCar groups (creates 'bc_groups')
      filterBoxCar() %>%    ## keep only BoxCar spectra
      bc_zero_out_box() %>% ## remove signal outside of BoxCar segments
      combineSpectra(fcol = "bc_groups", ## reconstruct full spectrum
                     method = boxcarCombine)
@


After processing of the BoxCar data, the final object can either be
further analysed \textcolor{black}{using \texttt{MSnbase}} or written
back to disk as an \texttt{mzML} file using \texttt{writeMSData()} for
processing with other tools.

\begin{figure}[p]
  \centering
<<plot_boxcar, message=FALSE, warning=FALSE, fig.height=8>>=
load("boxcar.rda")
p4 <- p4 + xlab("m/z")
p1 + p2 + p3 + p4 +
    plot_layout(ncol = 1,
                heights = c(0.7, 1.2, 0.7, 0.7)) +
    plot_annotation(tag_levels = 'a') &
    theme(axis.title = element_text(size = 10),
          axis.text = element_text(size = 6),
          plot.margin = margin(0, 0, 0, 0, "mm"))
@
\caption{BoxCar processing with \texttt{MSnbase}. (a) Standard full
  scan with (b) three corresponding BoxCar scans showing the adjacent
  segments. Figure (c) shows the overlapping intact BoxCar segments
  and (d) the same segments after cleaning, i.e. where peaks outside
  of the segments were removed. The reconstructed full scan is shown
  on panel (e). Spectra visualisation, as shown here, rely on the
  \texttt{ggplot2} \cite{ggplot2} package. }
\label{fig:bc}
\end{figure}

All the functions for the processing of BoxCar spectra and segments in
\texttt{MSnbaseBoxCar} were developed using existing functionality
implemented in \texttt{MSnbase}, illustrating the flexibility and
adaptability of the \texttt{MSnbase} package for computational mass
spectrometry method development.


\subsection{Visualisation}

The R environment is well known for the quality of its visualisation
capacity. This also holds true for mass
spectrometry\cite{Gatto:2015,protViz,Pviz,Bemis:2015}. Here, we
conclude the overview of version 2 of the \texttt{MSnbase} package by
highlighting the flexibility of the software to visualise and assess
the efficiency of raw data processing. Figure \ref{fig:cent} compares
the raw MS profile data \textcolor{black}{imported from an mzML file}
for serine and the same data after smoothing, centroiding and m/z
refinement, as illustrated in the code chunk below. Detailed execution
and description of these operations can be found in the \emph{MSnbase:
  centroiding of profile-mode MS data} \texttt{MSnbase} vignette.

<<serine_processing, echo=TRUE, eval=FALSE>>=
serine_mz <- 106.049871
serine_proc <- ms %>%
    smooth(method = "SavitzkyGolay", halfWindowSize = 4L) %>%
    pickPeaks(refineMz = "descendPeak") %>%
    filterMz(c(serine_mz - 0.01, serine_mz + 0.01)) %>%
    filterRt(c(175, 187))
@

\begin{figure}
  \centering
  \includegraphics[width=\linewidth]{./figure/centroiding.pdf}
  \caption{Visualisation of data smoothing and m/z refinement using
    \texttt{MSnbase}. (a) Raw MS profile data for serine. Upper panel
    shows the base peak chromatogram (BPC), lower panel the individual
    signals in the retention time -- m/z space. The horizontal dashed
    red line indicates the theoretical m/z of the [M+H]+ adduct of
    serine. (b) Smoothed and centroided data for serine with m/z
    refinement. The horizontal red dashed line indicates the
    theoretical m/z for the [M+H]+ ion and the vertical red dotted
    line the position of the maximum signal. }
  \label{fig:cent}
\end{figure}


\subsection{\textcolor{black}{Package maintenance and governance}}

The first public commit to the \texttt{MSnbase} GitHub repository was
in October 2010. Since then, the package benefited from 12
contributors\cite{contribs} that added various features, some
particularly significant ones such as the on-disk backend described
herein. \textcolor{black}{Contributions to the package are explicitly
  encouraged, rewarded by an official contributor status and governed
  by a code of conduct.}

According to \texttt{MSnbase}'s Bioconductor page, there are 36
Bioconductor packages that depend, import or suggest it. Among these
are \texttt{pRoloc}~\cite{Gatto:2014a} to analyse mass
spectrometry-based spatial proteomics data,
\texttt{msmsTests}~\cite{msmsTests}, \texttt{DEP}~\cite{Zhang:2018},
\texttt{DAPAR} and \texttt{ProStaR}~\cite{Wieczorek:2017} for the
statistical analysis \textcolor{black}{of} quantitative proteomics data,
\texttt{RMassBank}~\cite{Stravs:2013} to process metabolomics tandem
MS files and build MassBank records,
\texttt{MSstatsQC}~\cite{Dogu:2017} for longitudinal system
suitability monitoring and quality control of targeted proteomic
experiments and the widely used \texttt{xcms}~\cite{Smith:2006}
package for the processing and analysis of metabolomics
data. \texttt{MSnbase} is also used in non-R/Bioconductor software,
such as for example IsoProt~\cite{Griss:2019}, that provides a
reproducible workflow for iTRAQ/TMT experiments. \textcolor{black}{The
  BioContainers~\cite{da_Veiga_Leprevost:2017} project offers a
  dedicated container for the \texttt{MSnbase} package, this
  facilitating the reuse of the package in third-party pipelines.}
\texttt{MSnbase} currently ranks 101 out of 1823 packages based on the
monthly downloads from unique IP addresses, tallying over 1000
downloads from unique IP addresses each months.

As is custom with Bioconductor packages, \texttt{MSnbase} comes with
ample documentation. Every user-accessible function is documented in a
dedicated manual page. In addition, the package offers 5 vignettes,
including one aimed at developers. The package is checked nightly on
the Bioconductor servers: it implements unit tests covering 72\% of
the code base and, through its vignettes, also provides integration
testing. Questions from users and developers are answered on the
Bioconductor support forum as well as on the package GitHub page. The
package provides several sample and benchmarking datasets, and relies
on other dedicated \textit{experiment packages} such as
\texttt{msdata}~\cite{msdata} for raw data or
\texttt{pRolocdata}~\cite{Gatto:2014a} for quantitative
data. \texttt{MSnbase} is available on Windows, Mac OS and Linux under
the open source Artistic 2.0 license and easily installable using
standard installation procedures.


\textcolor{black}{The growth of \texttt{MSnbase} and the user support
  provided over the years attest to the core maintainers commitment to
  long-term development, and the quality and maintainability of the
  code base.}

\section{Discussion}

We have presented here some important functionality of
\texttt{MSnbase} version 2. The new on-disk infrastructure enables
large scale data analyses~\cite{Nothias:2020}, either using
\texttt{MSnbase} directly or through packages that rely on it, such as
\texttt{xcms}. We have also illustrated how \texttt{MSnbase} can be
used for standard data analysis and visualisation, and how it can be
used for method development and prototyping.

<<version>>=
v <- packageVersion("MSnbase")
@

The version of \texttt{MSnbase} used in this manuscript is
\Sexpr{v}. The main features presented here were available since
version 2.0. The code to reproduce the analyses and figures in this
article is available at
\url{https://github.com/lgatto/2020-msnbase-v2/}.

\section{Associated Content}

Supplementary file 1: script documenting the processing of 1182 mzXML
files (5773464 spectra) using \texttt{MSnbase}.


\begin{acknowledgement}

The authors thank the various contributors and users who have provided
constructive input and feedback that have helped, over the years, the
improvement of the package. The authors declare no conflict of
interest.

\end{acknowledgement}

%% \begin{suppinfo} %% Supporting Information
%%
%% A listing of the contents of each file supplied as Supporting Information
%% should be included. For instructions on what should be included in the
%% Supporting Information as well as how to prepare this material for
%% publications, refer to the journal's Instructions for Authors.
%%
%% The following files are available free of charge.
%% \begin{itemize}
%%   \item Filename: brief description
%%   \item Filename: brief description
%% \end{itemize}
%% \end{suppinfo}


\bibliography{refs}

\end{document}