-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathREADME.Rmd
227 lines (181 loc) · 11.5 KB
/
README.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
---
output: github_document
---
<!-- README.md is generated from README.Rmd. Please edit that file -->
```{r, include = FALSE}
knitr::opts_chunk$set(
collapse = TRUE,
comment = "#>",
fig.path = "man/figures/README-",
out.width = "100%"
)
```
# genRelateR
## Description
genRelateR is a tool for analyzing global human genomic relatedness across populations. The package addresses the need for efficient handling and analysis of large genomic datasets by providing specialized functions that streamline common tasks such as filtering samples, managing metadata, and performing population-based analyses. By integrating with established bioinformatics tools like VariantAnnotation and SummarizedExperiment, this package simplifies the process of working with genomic data in R. It empowers users to conduct exploratory analyses quickly and efficiently, reducing the complexity of data manipulation and making it easier to focus on higher-level analyses such as population structure exploration. The `genRelateR` package was developed using `R version 4.4.1 (2024-06-14 ucrt)`, `Platform: x86_64-w64-mingw32/x64` and `Running under: Windows 10 x64 (build 19045)`.
## Installation
To install the latest version of the package:
``` r
install.packages("devtools")
library(devtools)
options(repos = BiocManager::repositories())
devtools::install_github("bdls-jamal/genRelateR", build_vignettes = TRUE)
library(genRelateR)
```
To run the Shiny app:
``` r
rungenRelateR()
```
Note: Ensure the chromosome selected matches your input file. This should be done automatically.
## Overview
``` r
ls("package:genRelateR")
data(package = "genRelateR")
browseVignettes("genRelateR")
```
`genRelateR` contains 6 functions.
1. ***loadGeneticData()*** loads and formats genetic vcf data from various sources
2. ***filterPopulation()*** filters data based on population, geographic region, or genetic criteria
3. ***analyzePopulationStructure()*** performs PCA analysis on population data
4. ***plotPopulationPca()*** plots population structure using PCA
5. ***plotAncestryMap()*** maps the distribution of genetic ancestry components
6. ***rungenRelateR()*** generates an interactive shiny dashboard for exploring relatedness data
This package also contains sample 1000 genome project data found in the data folder. You may use these for initial analyses and then download other files from the project later.
Refer to package vignettes for more details. An overview of the package is illustrated below.
![Overview of Functions Output](vignettes/R_Image.png)
## Contributions
The author of this package is Kobi Jamal Schmalenberg. The author wrote all of the functions in this package. Generative AI tools were used in all functions for proofing, optimization suggestions, and to help easily replicate styling conventions.
This package leverages several widely-used R packages to perform key tasks. Here are the primary contributions:
- VariantAnnotation: Used for reading and processing VCF files.
- dplyr: Essential for manipulating population metadata, allowing flexible filtering of samples based on various criteria such as population and super-population.
- SummarizedExperiment: Utilized for handling collapsed VCF data and extracting genotype information efficiently.
- GenomicRanges: Employed for managing genomic regions, making it easier to specify regions of interest within large genomic datasets.
## References
- Becker, R. A., Chambers, J. M. and Wilks, A. R. (1988) The New S Language. Wadsworth & Brooks/Cole.
- Mardia, K. V., J. T. Kent, and J. M. Bibby (1979) Multivariate Analysis, London: Academic Press.
- R Core Team (2024). _R: A Language and Environment for Statistical
Computing_. R Foundation for Statistical Computing, Vienna, Austria.
<https://www.R-project.org/>.
- Susan Fairley, Ernesto Lowy-Gallego, Emily Perry, Paul Flicek, The International Genome Sample Resource (IGSR) collection of open human genomic variation resources, Nucleic Acids Research, Volume 48, Issue D1, 08 January 2020, Pages D941–D947, https://doi.org/10.1093/nar/gkz836
- The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature 526, 68–74 (2015). https://doi.org/10.1038/nature15393
- Venables, W. N. and B. D. Ripley (2002) Modern Applied Statistics with S, Springer-Verlag.
### Packages
- Barrett T, Dowle M, Srinivasan A, Gorecki J, Chirico M, Hocking T, Schwendinger B (2024). _data.table: Extension of `data.frame`_. R package version 1.16.0, <https://CRAN.R-project.org/package=data.table>.
- Becker OScbRA, Minka ARWRvbRBEbTP, Deckmyn. A (2023). _maps: Draw Geographical Maps_. R package
version 3.4.2, <https://CRAN.R-project.org/package=maps>.
- Csardi G, Nepusz T (2006). “The igraph software package for complex network research.”
_InterJournal_, *Complex Systems*, 1695. <https://igraph.org>.
- C. Sievert. Interactive Web-Based Data Visualization with R, plotly, and shiny. Chapman and Hall/CRC
Florida, 2020.
- Lawrence M, Huber W, Pag\`es H, Aboyoun P, Carlson M, et al. (2013) Software for Computing and
Annotating Genomic Ranges. PLoS Comput Biol 9(8): e1003118. doi:10.1371/journal.pcbi.1003118
- Neuwirth E (2022). _RColorBrewer: ColorBrewer Palettes_. R package version 1.1-3,
<https://CRAN.R-project.org/package=RColorBrewer>.
- Obenchain V, Lawrence M, Carey V, Gogarten S, Shannon P, Morgan M (2014). “VariantAnnotation: a
Bioconductor package for exploration and annotation of genetic variants.” _Bioinformatics_, *30*(14),
2076-2078. doi:10.1093/bioinformatics/btu168 <https://doi.org/10.1093/bioinformatics/btu168>.
- Simon Garnier, Noam Ross, Robert Rudis, Antônio P. Camargo, Marco Sciaini, and Cédric Scherer (2024).
viridis(Lite) - Colorblind-Friendly Color Maps for R. viridis package version 0.6.5.
- Slowikowski K (2024). _ggrepel: Automatically Position Non-Overlapping Text Labels with 'ggplot2'_. R
package version 0.9.6, <https://CRAN.R-project.org/package=ggrepel>.
- Wickham H, Averick M, Bryan J, Chang W, McGowan LD, François R, Grolemund G, Hayes A, Henry L, Hester
J, Kuhn M, Pedersen TL, Miller E, Bache SM, Müller K, Ooms J, Robinson D, Seidel DP, Spinu V,
Takahashi K, Vaughan D, Wilke C, Woo K, Yutani H (2019). “Welcome to the tidyverse.” _Journal of Open
Source Software_, *4*(43), 1686. doi:10.21105/joss.01686 <https://doi.org/10.21105/joss.01686>.
- Wickham H, François R, Henry L, Müller K, Vaughan D (2023). _dplyr: A Grammar of Data Manipulation_.
R package version 1.1.4, <https://CRAN.R-project.org/package=dplyr>.
- Wickham H. ggplot2: Elegant Graphics for Data Analysis. Springer-Verlag New York, 2016.
- Wickham H, Hester J, Bryan J (2024). _readr: Read Rectangular Text Data_. R package version 2.1.5,
<https://CRAN.R-project.org/package=readr>.
- Wickham H, Vaughan D, Girlich M (2024). _tidyr: Tidy Messy Data_. R package version 1.3.1,
<https://CRAN.R-project.org/package=tidyr>.
- Wickham H (2023). _stringr: Simple, Consistent Wrappers for Common String Operations_. R package
version 1.5.1, <https://CRAN.R-project.org/package=stringr>.
## Acknowledgements
This package was developed as part of an assessment for 2024 BCB410H: Applied Bioinformatics course at the University of Toronto, Toronto, Canada. `genRelateR` welcomes issues, enhancement requests, and other contributions. To submit an issue, use the [GitHub issues](https://github.com/bdls-jamal/genRelateR/issues).
## Example Usage
### Downloading VCF Data
``` r
# Data in example is from the 1000 genomes project. This package is for the project and may not work as intended with other data that does not have the same format as the project.
# Define the base FTP URL(you may choose a different release)
base_url <- "ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/"
# Specify the filenames for the VCF and .tbi files(example only below)
vcf_file <- "ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz"
tbi_file <- paste0(vcf_file, ".tbi")
# Specify the destination folder (create it if it doesn’t exist)
dest_folder <- "data/vcf_files"
if (!dir.exists(dest_folder)) {
dir.create(dest_folder, recursive = TRUE)
}
# Define the full destination paths
vcf_dest <- file.path(dest_folder, vcf_file)
tbi_dest <- file.path(dest_folder, tbi_file)
# Download the VCF and .tbi files
download.file(url = paste0(base_url, vcf_file), destfile = vcf_dest, mode = "wb")
download.file(url = paste0(base_url, tbi_file), destfile = tbi_dest, mode = "wb")
# Download population metadata files, or grab from github repo
pop_codes_url <- "https://raw.githubusercontent.com/bdls-jamal/genRelateR/main/inst/extdata/population_codes.txt"
pop_long_lat_url <- "https://raw.githubusercontent.com/bdls-jamal/genRelateR/main/inst/extdata/population_long_lat.txt"
pop_metadata_url <- "https://raw.githubusercontent.com/bdls-jamal/genRelateR/main/inst/extdata/population_metadata.txt"
dest_folder <- "data"
pop_codes_dest <- file.path(dest_folder, "population_codes.txt")
pop_long_lat_dest <- file.path(dest_folder, "population_long_lat.txt")
pop_metadata_dest <- file.path(dest_folder, "population_metadata.txt")
download.file(url = pop_codes_url, destfile = pop_codes_dest, mode = "wb")
download.file(url = pop_long_lat_url, destfile = pop_long_lat_dest, mode = "wb")
download.file(url = pop_metadata_url, destfile = pop_metadata_dest, mode = "wb")
```
### Function Calling
``` r
Run genRelateR::setupGeneticPackages() to install and load necessary packages
You may download the necessary txt files directly from the github repository or from the 1000 genomes project along with the vcf.gz/tbi files from the project.
# Load genetic data from compressed vcf file(i.e. data from 1000 genomes project)
# Ensure this data has an associated tbi file in the same folder, or generate it using Tabix
# Example: vcf_file <- "inst/extdata/vcf/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5b.20130502.genotypes.vcf.gz"
vcf_file <- "path/to/your/vcf.gz/data"
# Define a region(range) for analysis(required for very large data files)
chr1_region <- GRanges(
seqnames = "1",
ranges = IRanges(start = 20000000, end = 20001000)
)
# Call loadGeneticData function with specified region and vcf file
genetic_data <- genRelateR::loadGeneticData(vcf_file, regions = chr1_region)
# Create population metadata object from metadata file
# File is provided for 1000 genome project use, for other data you must use with your own data
# Example: pop_metadata <- "inst/extdata/population_metadata.txt"
pop_metadata <- "path/to/your/metadata.txt/file"
# Choose populations to filter by(all are selected in this example)
populations <- c(
"CHB", "JPT", "CHS", "CDX", "KHV",
"CEU", "TSI", "GBR", "FIN", "IBS",
"YRI", "LWK", "GWD", "MSL", "ESN",
"ASW", "ACB", "MXL", "PUR", "CLM",
"PEL", "GIH", "PJL", "BEB", "STU", "ITU"
)
# Call filterPopulation with loaded data, metadata, and specified populations
filtered_data <- genRelateR::filterPopulation(genetic_data, pop_metadata, populations)
# Perform population structure analysis by calling analyzePopulationStructure
pca_results <- genRelateR::analyzePopulationStructure(
filtered_data$vcf_data,
filtered_data$pop_metadata,
method = "pca"
)
# Create PCA data cluster visualization
# Call plotPopulationPca then print to view
pca_plot <- genRelateR::plotPopulationPca(
analysis_results = pca_results,
filtered_data$pop_metadata,
title = "Population Structure PCA",
ellipses = TRUE,
super_pop = TRUE
)
print(pca_plot)
# Create PCA data map visualization
# Call plotAncestryMap then print to view
ancestry_map <- genRelateR::plotAncestryMap(
pca_results,
filtered_data$pop_metadata,
title = "Global Population Distribution"
)
print(ancestry_map)
```