Skip to content

Commit

Permalink
**v.0.3.4**
Browse files Browse the repository at this point in the history
* `fst_NEI87`: very fast function that can compute: the overall and pairwise Nei's (1987) fst and f'st (prime).
Bootstrap resampling of markers is avalaible to build Confidence Intervals. The estimates are available as a data frame and a matrix with upper diagonal filled with Fst values and lower diagonal filled with the confidence intervals. Jost's D is also given ;)
  • Loading branch information
thierrygosselin committed Sep 26, 2016
1 parent 9142ae9 commit d0a8fae
Show file tree
Hide file tree
Showing 8 changed files with 1,062 additions and 72 deletions.
4 changes: 2 additions & 2 deletions DESCRIPTION
Original file line number Diff line number Diff line change
@@ -1,8 +1,8 @@
Package: assigner
Type: Package
Title: Assignment Analysis with GBS/RADseq Data using R
Version: 0.3.3
Date: 2016-09-22
Version: 0.3.4
Date: 2016-09-26
Encoding: UTF-8
Authors@R: c(
person("Thierry", "Gosselin", email = "[email protected]", role = c("aut", "cre")),
Expand Down
1 change: 1 addition & 0 deletions NAMESPACE
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ export(assignment_genodive)
export(assignment_mixture)
export(assignment_ngs)
export(dlr)
export(fst_NEI87)
export(fst_WC84)
export(gsi_sim_binary)
export(import_subsamples)
Expand Down
764 changes: 764 additions & 0 deletions R/fst_NEI87.R

Large diffs are not rendered by default.

103 changes: 50 additions & 53 deletions R/fst_WC84.R
Original file line number Diff line number Diff line change
Expand Up @@ -29,8 +29,9 @@
#' involves more than 2 populations that evolved more by neutral processes
#' (e.g. genetic drift) than by natural selection (see the vignette for more details).
#'
#' @note \strong{This function is dedicated to Louis Bernatchez!
#' In the hope that your students find this function fast and usefull.}
#' @note Negative Fst are technical artifact of the computation
#' (see Roesti el al. 2012) and are automatically replaced with zero inside
#' this function.

#' @param data A file in the working directory or object in the global environment
#' in wide or long (tidy) formats. To import, the function uses internally
Expand Down Expand Up @@ -150,10 +151,6 @@
#' \pkg{stackr} \code{\link{tidy_genomic_data}} can transform 6 genomic data formats
#' in a tidy data frame.





#' @export
#' @rdname fst_WC84
#' @import stringi
Expand Down Expand Up @@ -203,6 +200,9 @@
#' @references Weir BS, Cockerham CC (1984) Estimating F-Statistics for the
#' Analysis of Population Structure.
#' Evolution, 38, 1358-1370.
#' @references Roesti M, Salzburger W, Berner D. (2012)
#' Uninformative polymorphisms bias genome scans for signatures of selection.
#' BMC Evol Biol., 12:94. doi:10.1111/j.1365-294X.2012.05509.x

#' @seealso
#' From \href{http://www.bentleydrummer.nl/software/software/GenoDive.html}{GenoDive} manual:
Expand All @@ -229,31 +229,23 @@

#' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}

# required to pass the R CMD check and have 'no visible binding for global variable'
if (getRversion() >= "2.15.1"){
utils::globalVariables(
c("GENOTYPE", "FIS", "POP1", "POP2", "tsiga", "tsigb", "tsigw", "CI",
"CI_LOW", "CI_HIGH", "N_MARKERS","sigma.loc.alleles")
)
}


# Fst function: Weir & Cockerham 1984
fst_WC84 <- function(data,
pop.levels = NULL,
pop.labels = NULL,
strata = NULL,
holdout.samples = NULL,
pairwise = FALSE,
ci = FALSE,
iteration.ci = 100,
quantiles.ci = c(0.025,0.975),
digits = 9,
parallel.core = detectCores()-1,
verbose = FALSE,
...) {
fst_WC84 <- function(
data,
pop.levels = NULL,
pop.labels = NULL,
strata = NULL,
holdout.samples = NULL,
pairwise = FALSE,
ci = FALSE,
iteration.ci = 100,
quantiles.ci = c(0.025,0.975),
digits = 9,
parallel.core = detectCores() - 1,
verbose = FALSE,
...) {

if(verbose){
if (verbose) {
cat("#######################################################################\n")
cat("######################### assigner::fst_WC84 ##########################\n")
cat("#######################################################################\n")
Expand All @@ -266,21 +258,21 @@ fst_WC84 <- function(data,
if (!is.null(pop.labels) & is.null(pop.levels)) stop("pop.levels is required if you use pop.labels")

# Import data ---------------------------------------------------------------
if(verbose) message("Importing data")
if (verbose) message("Importing data")
input <- stackr::read_long_tidy_wide(data = data)

# switch LOCUS to MARKERS if found
if ("LOCUS" %in% colnames(input)) input <- rename(.data = input, MARKERS = LOCUS)

# population levels and strata ----------------------------------------------
if (is.null(strata)){ # no strata
if(is.null(pop.levels)) { # no pop.levels
if (is.null(strata)) { # no strata
if (is.null(pop.levels)) { # no pop.levels
if (is.factor(input$POP_ID)) {
input$POP_ID <- droplevels(x = input$POP_ID)
} else {
input$POP_ID <- factor(input$POP_ID)
}
} else { # with pop.levels
} else {# with pop.levels
input <- input %>%
mutate( # Make population ready
POP_ID = factor(
Expand All @@ -294,20 +286,20 @@ fst_WC84 <- function(data,
)
)
}
} else { # Make population ready with the strata provided
} else {# Make population ready with the strata provided
if (is.vector(strata)) {
strata.df <- read_tsv(file = strata, col_names = TRUE, col_types = "cc") %>%
rename(POP_ID = STRATA)
} else {
strata.df <- strata
}
if(is.null(pop.levels)) { # no pop.levels
if (is.null(pop.levels)) { # no pop.levels
input <- input %>%
select(-POP_ID) %>%
mutate(INDIVIDUALS = as.character(INDIVIDUALS)) %>%
left_join(strata.df, by = "INDIVIDUALS") %>%
mutate(POP_ID = factor(POP_ID))
} else { # with pop.levels
} else {# with pop.levels
input <- input %>%
select(-POP_ID) %>%
mutate(INDIVIDUALS = as.character(INDIVIDUALS)) %>%
Expand All @@ -333,7 +325,7 @@ fst_WC84 <- function(data,
if (is.null(holdout.samples)) { # use all the individuals
data.genotyped <- input %>%
filter(GT != "000000") # Check for df and plink...
} else { # if holdout set, removes individuals
} else {# if holdout set, removes individuals
message("removing holdout individuals")
data.genotyped <- input %>%
filter(GT != "000000") %>% # remove missing genotypes
Expand Down Expand Up @@ -367,7 +359,10 @@ fst_WC84 <- function(data,
FST = round(tsiga/(tsiga + tsigb + tsigw), digits),
FIS = round(tsigb/(tsigb + tsigw), digits)
) %>%
mutate(ITERATIONS = rep(x, n()))
mutate(
ITERATIONS = rep(x, n()),
FST = dplyr::if_else(FST < 0, true = 0, false = FST, missing = 0)
)
return(fst.fis.overall.iterations)
} # End boot_ci function

Expand Down Expand Up @@ -490,7 +485,7 @@ fst_WC84 <- function(data,
full_join(ind.count.locus, by = "MARKERS") %>%
mutate(nal_sq_sum_nt = (n - nal_sq_sum/n)) %>%
full_join(n.pop.locus, by = "MARKERS") %>%
mutate(ncal = nal_sq_sum_nt/(npl-1)) %>%
mutate(ncal = nal_sq_sum_nt/(npl - 1)) %>%
select(MARKERS, ncal)
)

Expand Down Expand Up @@ -587,9 +582,10 @@ fst_WC84 <- function(data,
summarise(
FST = round(lsiga/(lsiga + lsigb + lsigw), digits),
FIS = round(lsigb/(lsigb + lsigw), digits)
)
) %>%
mutate(FST = dplyr::if_else(FST < 0, true = 0, false = FST, missing = 0))

fst.fis.overall <-sigma.loc.alleles %>%
fst.fis.overall <- sigma.loc.alleles %>%
ungroup %>%
summarise(
tsiga = sum(siga, na.rm = TRUE),
Expand All @@ -599,13 +595,14 @@ fst_WC84 <- function(data,
summarise(
FST = round(tsiga/(tsiga + tsigb + tsigw), digits),
FIS = round(tsigb/(tsigb + tsigw), digits)
)
) %>%
mutate(FST = dplyr::if_else(FST < 0, true = 0, false = FST, missing = 0))
# add new column with number of markers
fst.fis.overall$N_MARKERS <- n.markers

# Confidence Intervals -----------------------------------------------------
# over loci for the overall Fst estimate
if (ci){
if (ci) {
# the function:
boot.fst.list <- purrr::map(.x = 1:iteration.ci, .f = boot_ci, sigma.loc.alleles = sigma.loc.alleles)
boot.fst <- bind_rows(boot.fst.list)
Expand Down Expand Up @@ -637,7 +634,7 @@ fst_WC84 <- function(data,
)

# Fst overall -------------------------------------------------------------
if (ci){
if (ci) {
fst.overall <- fst.fis.overall %>%
select(FST, N_MARKERS) %>%
bind_cols(boot.fst.summary)
Expand All @@ -655,10 +652,10 @@ fst_WC84 <- function(data,
fis.overall <- fst.fis.overall %>% select(FIS, N_MARKERS)

# Plot -----------------------------------------------------------------------
fst.plot <- ggplot(fst.markers, aes(x = FST, na.rm = T))+
geom_histogram(binwidth = 0.01)+
labs(x = "Fst (overall)")+
expand_limits(x = 0)+
fst.plot <- ggplot(fst.markers, aes(x = FST, na.rm = T)) +
geom_histogram(binwidth = 0.01) +
labs(x = "Fst (overall)") +
expand_limits(x = 0) +
theme(
legend.position = "none",
axis.title.x = element_text(size = 10, family = "Helvetica", face = "bold"),
Expand Down Expand Up @@ -703,7 +700,7 @@ fst_WC84 <- function(data,


# Compute global Fst ---------------------------------------------------------
if(verbose) message("Computing global fst")
if (verbose) message("Computing global fst")
res <- compute_fst(x = data.genotyped, ci = ci, iteration.ci = iteration.ci, quantiles.ci = quantiles.ci)

# Compute pairwise Fst -------------------------------------------------------
Expand Down Expand Up @@ -732,7 +729,7 @@ fst_WC84 <- function(data,
)
# Matrix--------------------------------------------------------------------
upper.mat.fst <- pairwise.fst %>%
select (POP1, POP2, FST) %>%
select(POP1, POP2, FST) %>%
tidyr::spread(data = ., POP2, FST, fill = "", drop = FALSE) %>%
rename(POP = POP1)
rn <- upper.mat.fst$POP # rownames
Expand All @@ -748,10 +745,10 @@ fst_WC84 <- function(data,
full.mat.fst[lower.tri(full.mat.fst)] <- lower.mat.fst[lower.tri(lower.mat.fst)]
diag(full.mat.fst) <- "0"

if (ci){
if (ci) {
# bind upper and lower diagonal of matrix
lower.mat.ci <- pairwise.fst %>%
select (POP1, POP2, CI_LOW, CI_HIGH) %>%
select(POP1, POP2, CI_LOW, CI_HIGH) %>%
tidyr::unite(data = ., CI, CI_LOW, CI_HIGH, sep = " - ") %>%
tidyr::spread(data = ., POP2, CI, fill = "", drop = FALSE) %>%
rename(POP = POP1)
Expand All @@ -778,7 +775,7 @@ fst_WC84 <- function(data,


# messages -------------------------------------------------------------------
if(verbose){
if (verbose) {
cat("############################### RESULTS ###############################\n")
if (ci) {
message(stri_paste("Fst (overall): ", res$fst.overall$FST, " [", res$fst.overall$CI_LOW, " - ", res$fst.overall$CI_HIGH, "]"))
Expand Down
7 changes: 6 additions & 1 deletion R/global_variables.R
Original file line number Diff line number Diff line change
Expand Up @@ -21,6 +21,11 @@ if (getRversion() >= "2.15.1") {
"COUNT", "MAX_COUNT_MARKERS", "hierarchy", "GT_VCF", "ANALYSIS",
"MEAN_ITERATIONS", "MEAN_SUBSAMPLE", "NUMBER_ITERATIONS",
"NUMBER_SUBSAMPLE", "TOTAL_ITERATIONS", "TOTAL_SUBSAMPLE", "X1", "X2",
"strata.df")
"strata.df", "GENOTYPE", "FIS", "POP1", "POP2", "tsiga", "tsigb", "tsigw", "CI",
"CI_LOW", "CI_HIGH", "N_MARKERS","sigma.loc.alleles", "DST", "DST_P",
"FIS_CI_HIGH", "FIS_CI_LOW", "HO", "HS", "HT", "HT_P", "JOST_D", "MHO",
"MN", "MP", "MP2", "MSP2", "NEI_FST", "NEI_FST_CI_HIGH", "NEI_FST_CI_LOW",
"NEI_FST_P", "NEI_FST_P_CI_HIGH", "NEI_FST_P_CI_LOW", "NP", "N_INV",
"SHO", "SP2", "n.markers", "JOST_D_CI_LOW", "JOST_D_CI_HIGH")
)
}
20 changes: 6 additions & 14 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -27,7 +27,7 @@ by Eric C. Anderson (see Anderson et al. 2008 and Anderson 2010) or [adegenet] (
* Markers can be randomly selected for a **classic LOO (Leave-One-Out) assignment** or
chosen based on **ranked Fst** (Weir & Cockerham, 1984) for a **THL (Training, Holdout, Leave-one-out) assignment analysis** (reviewed in Anderson 2010)
* Use `iteration.method` and/or `iteration.subsample` arguments to resample markers or individuals to get statistics!
* assigner provides a fast implementation of Weir and Cockerham (1984) Fst/Theta. Both **overall** and **pairwise Fst** can be estimated with **confidence intervals** based on bootstrap of markers (resampling with replacement).
* assigner provides a fast implementation of Weir and Cockerham (1984) Fst/Theta and Nei's fst (1987). Both **overall** and **pairwise Fst** can be estimated with **confidence intervals** based on bootstrap of markers (resampling with replacement).
* **Map-independent imputation** of missing genotype or alleles using **Random Forest** or the most frequent category is also available to test the impact of missing data on assignment analysis
* **Filters:**
+ Individuals, populations and markers can be **filtered** and/or selected in several ways using **blacklist,
Expand Down Expand Up @@ -113,24 +113,14 @@ The Amazon image can be imported into Google Cloud Compute Engine to start a new
## New features
Change log, version, new features and bug history now lives in the [NEWS.md file] (https://github.com/thierrygosselin/assigner/blob/master/NEWS.md)

**v.0.3.4**
* `fst_NEI87`: very fast function that can compute: the overall and pairwise Nei's (1987) fst and f'st (prime).
Bootstrap resampling of markers is avalaible to build Confidence Intervals. The estimates are available as a data frame and a matrix with upper diagonal filled with Fst values and lower diagonal filled with the confidence intervals. Jost's D is also given ;)

**v.0.3.3**
* `fst_WC84`: bug fix, the function was not properly configured for multi-allelic markers (e.g. microsatellite, and haplotype format from STACKS). Thanks to Craig McDougall for catching this.


**v.0.3.2**
* `assignment_mixture`: added a check that throws an error when pop.levels != the pop.id in strata

**v.0.3.1**
`assignment_mixture`:
* updated with latest modules from `stackr`.
* simplified the identification of mixture or unknown samples. See doc.

**v.0.3.0**
* updated vignettes
* major bug fix that involved dplyr new version (0.5.0) and mostly with
the use of dplyr::distinct

For previous news:
[NEWS.md file] (https://github.com/thierrygosselin/assigner/blob/master/NEWS.md)

Expand Down Expand Up @@ -216,6 +206,8 @@ Kavakiotis I, Triantafyllidis A, Ntelidou D et al. (2015) TRES: Identification o

Meirmans PG, Van Tienderen PH (2004) genotype and genodive: two programs for the analysis of genetic diversity of asexual organisms. Molecular Ecology Notes, 4, 792-794.

Nei M. (1987) Molecular Evolutionary Genetics. Columbia University Press.

Paetkau D, Slade R, Burden M, Estoup A (2004) Genetic assignment methods for the direct, real-time estimation of migration rate: a simulation-based exploration of accuracy and power. Molecular Ecology, 13, 55-65.

Paetkau D, Waits LP, Clarkson PL, Craighead L, Strobeck C (1997) An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics, 147, 1943-1957.
Expand Down
Loading

0 comments on commit d0a8fae

Please sign in to comment.