-
Notifications
You must be signed in to change notification settings - Fork 1
/
README.Rmd
161 lines (115 loc) · 5.1 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit the Rmd file only -->
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE, collapse = TRUE)
```
[![Build Status](https://travis-ci.org/UofABioinformaticsHub/strandCheckR.svg?branch=master)](https://travis-ci.org/UofABioinformaticsHub/strandCheckR)
[![Project Status: Active - The project has reached a stable, usable state and is being actively developed.](http://www.repostatus.org/badges/latest/active.svg)](http://www.repostatus.org/#active)
[![DOI](https://zenodo.org/badge/70646093.svg)](https://zenodo.org/badge/latestdoi/70646093)
strandCheckR
---
This package aims to check the strandedness of reads in a bam file,
enabling easy detection of any contaminating genomic DNA or other
unexpected sources of contamination.
It can be applied to quantify and remove reads which correspond to putative
double strand DNA within a strand-specific RNA sample.
The package uses a sliding window to scan a bam file and find the number of
positive/negative reads in each window.
It then provides method to plot the proportions of positive/negative stranded
alignments within all windows, which allow users to determine how much the
sample was contaminated, and to determine an appropriate threshold for filtering.
Finally, users can filter putative DNA contamination from any strand-specific
RNAseq sample using their selected threshold.
## Installation
To install the release version from Bioconductor:
```{r installBioc, eval = FALSE}
install.packages("BiocManager")
BiocManager::install("strandCheckR")
```
To install the development version on github (i.e. this version):
```{r installGit, eval=FALSE}
install.packages("BiocManager")
BiocManager::install("UofABioinformaticsHub/strandCheckR")
```
## Quick Usage Guide
Following are the main functions of the package.
- `getStrandFromBamFile()`
To get the number of +/- stranded reads of all sliding windows across a bam
file:
```{r, message=FALSE}
# Load the package and example bam files
library(strandCheckR)
files <- system.file(
"extdata", c("s1.sorted.bam", "s2.sorted.bam"),
package = "strandCheckR"
)
# Find the read proportions from chromosome 10 for the two files
win <- getStrandFromBamFile(files, sequences = "10")
# Tidy up the file name for prettier output
win$File <- basename(as.character(win$File))
win
```
- `plotHist()`
The histogram plot shows you the proportion of +/- stranded reads across all
windows.
```{r plotHist, message=FALSE}
plotHist(
windows = win,
groupBy = "File",
normalizeBy = "File",
scales = "free_y"
)
```
In this example, *s2.sorted.bam* seems to be contaminated with double stranded
DNA, as evidenced by many windows containing a roughly equal proportion of
reads on both strands, whilst *s1.sorted.bam* is cleaner.
- `plotWin()`
The output from `plotWin()` represents each window as a point.
This plot also has threshold lines which can be used to provide guidance as to
the best threshold to choose when filtering windows.
Given a suitable threshold, reads from a positive (resp. negative) window are
kept if and only if the proportion is above (resp. below) the corresponding
threshold line.
```{r plotWin, message=FALSE, warning=FALSE}
plotWin(win, groupBy = "File")
```
- `filterDNA()`
The function `filterDNA()` removes potential double stranded DNA from a bam
file using a selected threshold.
```{r win2, message=FALSE}
win2 <- filterDNA(
file = files[2],
destination = "s2.filtered.bam",
sequences = "10",
threshold = 0.7,
getWin = TRUE
)
```
Comparing the histogram plot of the file before and after filtering shows that
reads from the windows with roughly equal proportions of +/- stranded reads
have been removed.
```{r plotHistAfterFilter, message=FALSE}
win2$File <- basename(as.character(win2$File))
win2$File <- factor(win2$File, levels = c("s2.sorted.bam", "s2.filtered.bam"))
library(ggplot2)
plotHist(win2, groupBy = "File", normalizeBy = "File", scales = "free_y")
```
A more comprehensive vignette is available at https://bioconductor.org/packages/release/bioc/vignettes/strandCheckR/inst/doc/strandCheckR.html
## Support
We recommend that questions seeking support in using the software are posted to
the Bioconductor support forum - https://support.bioconductor.org/ - where they
will attract not only our attention but that of the wider Bioconductor community.
Code contributions, bug reports and feature requests are most welcome.
Please make any pull requests against the master branch at https://github.com/UofABioinformaticsHub/strandCheckR and file issues at https://github.com/UofABioinformaticsHub/strandCheckR/issues
## Author Contributions
- *Thu-Hien To* authored the vast majority of code within the package along with unit tests
- *Thu-Hien To* and *Stephen Pederson* worked closely together on the package design and
methodology
## License
`strandCheckR` is licensed under [GPL >= 2.0](https://www.r-project.org/Licenses/GPL-2)
```{r, echo=FALSE, results='hide'}
## Clean up the files generated by the above
file.remove("s2.filtered.bam", "s2.filtered.bam.bai", "out.stat")
```