Morris Lab, Sloan Kettering Institute. R package for GenomeTrackSig.
To cite, please visit this link.
Please create an Issue or email [email protected] with questions.
R >= 3.3.3
This package imports the following other R packages:
-
reshape2 >= 1.4.3 (CRAN)
-
ggplot2 >= 3.2.0 (CRAN)
-
NMF >= 0.21.0 (CRAN)
-
assertthat >= 0.2.1 (CRAN)
-
BSgenome.Hsapiens.UCSC.hg19 >= 1.4.0 (Bioconductor)
-
GenomicRanges >= 1.26.4 (Bioconductor)
-
Biostrings >= 2.42.1 (Bioconductor)
-
SummarizedExperiment >= 1.4.0 (Bioconductor)
-
VariantAnnotation >= 1.20.3 (Bioconductor)
-
grid >= 3.3.3 (CRAN)
-
progress >= 1.2.2 (CRAN)
In addition, we recommend importing the following R packages which are required for parallelization or plotting genomic and epigenomic features alongside signature profiles.
-
cowplot >= 1.1.1 (CRAN)
-
foreach >= 1.5.1 (CRAN)
-
doParallel >= 1.0.16 (CRAN)
-
TxDb.Hsapiens.UCSC.hg19.knownGene >= 3.2.2 (Bioconducter)
-
GenomicFeatures >= 1.47.14 (Bioconducter)
To install the package, enter the following command into the R console:
devtools::install_github("morrislab/GenomeTrackSig")
Using the example data provided in extdata/
, the following code will
plot the signature profile, and return the fitted mixture of signatures
for each bin, the bins where changepoints were detected, and create the
ggplot object to visualize the signature profile with changepoints.
Two example datasets are provided, both of which contain genome-wide
mutation counts for an individual sample. We recommend using
Example_counts.csv
to test basic package functionality since it has
fewer mutations and will yield faster runtime. We recommend using
Example_counts_large.csv
, which has more mutations, to test
functionalities described in Section 5.
1. First, load the CSV dataset of mutation counts. In the example datasets provided, we classify mutations into one of 96 types based on the mutation type and trinucletotide context. We aggregate mutation counts by megabase, since 1 Mb is generally the smallest region across which GenomeTrackSig will demarcate with changepoints, but this step is not required. In this dataset, rows are 1 Mb genomic regions and columns represent region location (including starting chromosome, start position, ending chromosome, end position, region width) and mutation counts for each of 96 mutation types.
# load packages
library(GenomeTrackSig)
library(ggplot2)
# read in example datasets
counts = read.csv(system.file(package = "GenomeTrackSig",
"extdata/Example_counts.csv"))
large_counts = read.csv(system.file(package = "GenomeTrackSig",
"extdata/Example_counts_large.csv"))
2. Next, restrict the list of signatures to fit exposure for. This is
recommended for improving speed by making the model smaller. We
recommend manually specifying tissue-specific active signatures when
running GenomeTrackSig, but active signatures may also be determined
using the detectActiveSignatures()
function. To illustrate this
function, we use the default threshold
of 5%, meaning that signatures
with activity less than 5% across all genomic regions will not be fit.
set.seed(0811)
# restrict signatures
detectedSigs <- detectActiveSignatures(counts = large_counts,
threshold = 0.05,
binSize = 100)
2. Next, we demonstrate the simplest use case by computing the
signature activity profile across all regions of the genome. The minimal
input for GenomeTrackSig()
is a dataframe of mutation counts, a list
of active signatures to fit, and the desired number of mutations per
bin. We recommend a minimum bin size of 100 mutations. If signature
activities differ between two bins, then a changepoint will be placed at
the boundary between bins.
set.seed(0811)
# construct genome-wide signature activity profile
sigprofile <- GenomeTrackSig(counts = large_counts,
activeInSample = c("SBS1", "SBS2", "SBS13", "SBS40"),
binSize = 100)
3. Visualize the activity profile using the plotGenomeProfile
function. The chr_level
parameter specifies if segmentation was
performed on each chromosome individually or on the entire genome.
profilePlot <- plotGenomeProfile(profile = sigprofile,
chr_level = FALSE,
cutoff = 0)
4. Optionally, we can then plot the signature activity profile alongside contextual genomic information, including GC content, mutation density, gene density, and ChromHMM chromatin states.
# plot profile alongside GC content, mutation density, gene density, and ChromHMM states for each bin
profileContextPlot <- plotGenomeContext(counts = large_counts,
profile = sigprofile,
binSize = 100)
For samples with large numbers of mutations (>400 per chromosome),
performing segmentation on individual chromosomes can reduce runtime.
The number and positions of bins is the same whether segmentation is
performed on individual chromosomes or the entire genome. Users can
specify chromosome-wise segmentation by setting chr_level = TRUE
within the GenomeTrackSig()
, plotGenomeProfile()
, and
plotGenomeContext()
functions.
set.seed(0811)
# estimate signature activity profile
sigprofile_chromosomewise <- GenomeTrackSig(counts = large_counts,
chr_level = TRUE,
binSize = 100,
activeInSample = c("SBS1", "SBS2", "SBS13", "SBS40"))
# plot signature activity profile
profilePlotChrLevel <- plotGenomeProfile(profile = sigprofile_chromosomewise,
chr_level = TRUE,
cutoff = 0)
There are two bootstrapping options available within GenomeTrackSig()
,
controlled by the bootstrapMethod
parameter.
bootstrapMethod = "Mutation"
will resample with replacement from the
mutations within each bin boostrapSample
number of times.
bootstrapMethod = "Shuffle"
will resample the ordering of chromosomes
across the genome bootstrapSample
number of times (note that this
resampling method is not applicable when chr_level = TRUE
). Each
resulting changepoint has a bootstrap support value from 0-1, reflecting
the proportion of bootstrap samples that identified a changepoint at
that location. When visualizing profiles, changepoint opacity reflects
bootstrap support, and changepoints can be filtered by bootstrap support
by specifying a cutoff
> 0.
set.seed(0811)
# estimate signature activity profile with bootstrapMethod = "Mutation"
sigprofile_bootstrap1 <- GenomeTrackSig(counts = large_counts,
chr_level = TRUE,
binSize = 100,
activeInSample = c("SBS1", "SBS2", "SBS13", "SBS40"),
bootstrapMethod = "Mutation",
bootstrapSamples = 2)
# plot signature activity profile
profilePlotBootstrap1 <- plotGenomeProfile(profile = sigprofile_bootstrap1,
chr_level = TRUE,
cutoff = 0)
set.seed(0811)
# estimate signature activity profile with bootstrapMethod = "Shuffle"
sigprofile_bootstrap2 <- GenomeTrackSig(counts = large_counts,
chr_level = FALSE,
binSize = 100, activeInSample = c("SBS1", "SBS2", "SBS13", "SBS40"),
bootstrapMethod = "Shuffle",
bootstrapSamples = 2)
# plot signature activity profile
profilePlotBootstrap2 <- plotGenomeProfile(profile = sigprofile_bootstrap2,
chr_level = FALSE,
cutoff = 0)
For very large samples, running multiple bootstrap samples or multiple
chromosomes (when chr_level = TRUE
) in parallel will reduce runtime.
Setting parallelize = TRUE
within the GenomeTrackSig()
call will
utilize 70% of available cores on the machine to parallelize tasks.
set.seed(0811)
# estimate signature activity profile by running multiple chromosomes in parallel
sigprofile_parallel <- GenomeTrackSig(counts = large_counts,
chr_level = TRUE,
binSize = 100,
activeInSample = c("SBS1", "SBS2", "SBS13", "SBS40"),
parallelize = TRUE)
We do not recommend running GenomeTrackSig on samples with fewer than 1,000 mutations.