**v.0.3.4**

* `fst_NEI87`: very fast function that can compute: the overall and pairwise Nei's (1987) fst and f'st (prime). Bootstrap resampling of markers is avalaible to build Confidence Intervals. The estimates are available as a data frame and a matrix with upper diagonal filled with Fst values and lower diagonal filled with the confidence intervals. Jost's D is also given ;)
thierrygosselin · Sep 26, 2016 · d0a8fae · d0a8fae
1 parent 9142ae9
commit d0a8fae
Show file tree

Hide file tree

Showing 8 changed files with 1,062 additions and 72 deletions.
diff --git a/DESCRIPTION b/DESCRIPTION
@@ -1,8 +1,8 @@
 Package: assigner
 Type: Package
 Title: Assignment Analysis with GBS/RADseq Data using R
-Version: 0.3.3
-Date: 2016-09-22
+Version: 0.3.4
+Date: 2016-09-26
 Encoding: UTF-8
 Authors@R: c(
   person("Thierry", "Gosselin", email = "[email protected]", role = c("aut", "cre")),

diff --git a/NAMESPACE b/NAMESPACE
@@ -4,6 +4,7 @@ export(assignment_genodive)
 export(assignment_mixture)
 export(assignment_ngs)
 export(dlr)
+export(fst_NEI87)
 export(fst_WC84)
 export(gsi_sim_binary)
 export(import_subsamples)

diff --git a/R/fst_NEI87.R b/R/fst_NEI87.R
diff --git a/R/fst_WC84.R b/R/fst_WC84.R
@@ -29,8 +29,9 @@
 #' involves more than 2 populations that evolved more by neutral processes 
 #' (e.g. genetic drift) than by natural selection (see the vignette for more details).
 #' 
-#' @note \strong{This function is dedicated to Louis Bernatchez!
-#' In the hope that your students find this function fast and usefull.}
+#' @note Negative Fst are technical artifact of the computation 
+#' (see Roesti el al. 2012) and are automatically replaced with zero inside 
+#' this function.
 
 #' @param data A file in the working directory or object in the global environment 
 #' in wide or long (tidy) formats. To import, the function uses internally
@@ -150,10 +151,6 @@
 #' \pkg{stackr} \code{\link{tidy_genomic_data}} can transform 6 genomic data formats 
 #' in a tidy data frame.
 
-
-
-
-
 #' @export
 #' @rdname fst_WC84
 #' @import stringi
@@ -203,6 +200,9 @@
 #' @references Weir BS, Cockerham CC (1984) Estimating F-Statistics for the
 #' Analysis of Population Structure. 
 #' Evolution, 38, 1358-1370.
+#' @references Roesti M, Salzburger W, Berner D. (2012)
+#' Uninformative polymorphisms bias genome scans for signatures of selection. 
+#' BMC Evol Biol., 12:94. doi:10.1111/j.1365-294X.2012.05509.x
 
 #' @seealso
 #' From \href{http://www.bentleydrummer.nl/software/software/GenoDive.html}{GenoDive} manual:
@@ -229,31 +229,23 @@
 
 #' @author Thierry Gosselin \email{thierrygosselin@@icloud.com}
 
-# required to pass the R CMD check and have 'no visible binding for global variable'
-if (getRversion() >= "2.15.1"){
-  utils::globalVariables(
-    c("GENOTYPE", "FIS", "POP1", "POP2", "tsiga", "tsigb", "tsigw", "CI", 
-      "CI_LOW", "CI_HIGH", "N_MARKERS","sigma.loc.alleles")
-  )
-}
-
-
 # Fst function: Weir & Cockerham 1984
-fst_WC84 <- function(data,
-                     pop.levels = NULL, 
-                     pop.labels = NULL, 
-                     strata = NULL,
-                     holdout.samples = NULL,
-                     pairwise = FALSE,
-                     ci = FALSE,
-                     iteration.ci = 100,
-                     quantiles.ci = c(0.025,0.975),
-                     digits = 9,
-                     parallel.core = detectCores()-1,
-                     verbose = FALSE,
-                     ...) {
+fst_WC84 <- function(
+  data,
+  pop.levels = NULL, 
+  pop.labels = NULL, 
+  strata = NULL,
+  holdout.samples = NULL,
+  pairwise = FALSE,
+  ci = FALSE,
+  iteration.ci = 100,
+  quantiles.ci = c(0.025,0.975),
+  digits = 9,
+  parallel.core = detectCores() - 1,
+  verbose = FALSE,
+  ...) {
 
-  if(verbose){
+  if (verbose) {
     cat("#######################################################################\n")
     cat("######################### assigner::fst_WC84 ##########################\n")
     cat("#######################################################################\n")
@@ -266,21 +258,21 @@ fst_WC84 <- function(data,
   if (!is.null(pop.labels) & is.null(pop.levels)) stop("pop.levels is required if you use pop.labels")
 
   # Import data ---------------------------------------------------------------
-  if(verbose) message("Importing data")
+  if (verbose) message("Importing data")
   input <- stackr::read_long_tidy_wide(data = data)
 
   # switch LOCUS to MARKERS if found
   if ("LOCUS" %in% colnames(input)) input <- rename(.data = input, MARKERS = LOCUS)
 
   # population levels and strata  ----------------------------------------------
-  if (is.null(strata)){ # no strata
-    if(is.null(pop.levels)) { # no pop.levels
+  if (is.null(strata)) { # no strata
+    if (is.null(pop.levels)) { # no pop.levels
       if (is.factor(input$POP_ID)) {
         input$POP_ID <- droplevels(x = input$POP_ID)
       } else {
         input$POP_ID <- factor(input$POP_ID)
       }
-    } else { # with pop.levels
+    } else {# with pop.levels
       input <- input %>%
         mutate( # Make population ready
           POP_ID = factor(
@@ -294,20 +286,20 @@ fst_WC84 <- function(data,
           )
         )
     }
-  } else { # Make population ready with the strata provided
+  } else {# Make population ready with the strata provided
     if (is.vector(strata)) {
       strata.df <- read_tsv(file = strata, col_names = TRUE, col_types = "cc") %>% 
         rename(POP_ID = STRATA)
     } else {
       strata.df <- strata
     }
-    if(is.null(pop.levels)) { # no pop.levels
+    if (is.null(pop.levels)) { # no pop.levels
       input <- input %>%
         select(-POP_ID) %>% 
         mutate(INDIVIDUALS =  as.character(INDIVIDUALS)) %>% 
         left_join(strata.df, by = "INDIVIDUALS") %>% 
         mutate(POP_ID = factor(POP_ID))
-    } else { # with pop.levels
+    } else {# with pop.levels
       input <- input %>%
         select(-POP_ID) %>% 
         mutate(INDIVIDUALS =  as.character(INDIVIDUALS)) %>% 
@@ -333,7 +325,7 @@ fst_WC84 <- function(data,
   if (is.null(holdout.samples)) { # use all the individuals
     data.genotyped <- input %>%
       filter(GT != "000000") # Check for df and plink...
-  } else { # if holdout set, removes individuals
+  } else {# if holdout set, removes individuals
     message("removing holdout individuals")
     data.genotyped <- input %>%
       filter(GT != "000000") %>% # remove missing genotypes
@@ -367,7 +359,10 @@ fst_WC84 <- function(data,
         FST = round(tsiga/(tsiga + tsigb + tsigw), digits),
         FIS = round(tsigb/(tsigb + tsigw), digits)
       ) %>% 
-      mutate(ITERATIONS = rep(x, n()))
+      mutate(
+        ITERATIONS = rep(x, n()),
+        FST = dplyr::if_else(FST < 0, true = 0, false = FST, missing = 0)
+      )
     return(fst.fis.overall.iterations)
   } # End boot_ci function
 
@@ -490,7 +485,7 @@ fst_WC84 <- function(data,
         full_join(ind.count.locus, by = "MARKERS") %>%
         mutate(nal_sq_sum_nt = (n - nal_sq_sum/n)) %>%
         full_join(n.pop.locus, by = "MARKERS") %>%
-        mutate(ncal = nal_sq_sum_nt/(npl-1)) %>%
+        mutate(ncal = nal_sq_sum_nt/(npl - 1)) %>%
         select(MARKERS, ncal)
     )
 
@@ -587,9 +582,10 @@ fst_WC84 <- function(data,
       summarise(
         FST = round(lsiga/(lsiga + lsigb + lsigw), digits),
         FIS = round(lsigb/(lsigb + lsigw), digits)
-      )
+      ) %>% 
+      mutate(FST = dplyr::if_else(FST < 0, true = 0, false = FST, missing = 0))
 
-    fst.fis.overall <-sigma.loc.alleles %>% 
+    fst.fis.overall <- sigma.loc.alleles %>% 
       ungroup %>%
       summarise(
         tsiga = sum(siga, na.rm = TRUE),
@@ -599,13 +595,14 @@ fst_WC84 <- function(data,
       summarise(
         FST = round(tsiga/(tsiga + tsigb + tsigw), digits),
         FIS = round(tsigb/(tsigb + tsigw), digits)
-      )
+      ) %>% 
+      mutate(FST = dplyr::if_else(FST < 0, true = 0, false = FST, missing = 0))
     # add new column with number of markers
     fst.fis.overall$N_MARKERS <- n.markers
 
     # Confidence Intervals -----------------------------------------------------
     # over loci for the overall Fst estimate
-    if (ci){
+    if (ci) {
       # the function:
       boot.fst.list <- purrr::map(.x = 1:iteration.ci, .f = boot_ci, sigma.loc.alleles = sigma.loc.alleles)
       boot.fst <- bind_rows(boot.fst.list)
@@ -637,7 +634,7 @@ fst_WC84 <- function(data,
       )
 
     # Fst overall  -------------------------------------------------------------
-    if (ci){
+    if (ci) {
       fst.overall <- fst.fis.overall %>% 
         select(FST, N_MARKERS) %>% 
         bind_cols(boot.fst.summary)
@@ -655,10 +652,10 @@ fst_WC84 <- function(data,
     fis.overall <- fst.fis.overall %>% select(FIS, N_MARKERS)
 
     # Plot -----------------------------------------------------------------------
-    fst.plot <- ggplot(fst.markers, aes(x = FST, na.rm = T))+
-      geom_histogram(binwidth = 0.01)+
-      labs(x = "Fst (overall)")+
-      expand_limits(x = 0)+
+    fst.plot <- ggplot(fst.markers, aes(x = FST, na.rm = T)) +
+      geom_histogram(binwidth = 0.01) +
+      labs(x = "Fst (overall)") +
+      expand_limits(x = 0) +
       theme(
         legend.position = "none",
         axis.title.x = element_text(size = 10, family = "Helvetica", face = "bold"),
@@ -703,7 +700,7 @@ fst_WC84 <- function(data,
 
 
   # Compute global Fst ---------------------------------------------------------
-  if(verbose) message("Computing global fst")
+  if (verbose) message("Computing global fst")
   res <- compute_fst(x = data.genotyped, ci = ci, iteration.ci = iteration.ci, quantiles.ci = quantiles.ci)
 
   # Compute pairwise Fst -------------------------------------------------------
@@ -732,7 +729,7 @@ fst_WC84 <- function(data,
       )
     # Matrix--------------------------------------------------------------------
     upper.mat.fst <- pairwise.fst %>% 
-      select (POP1, POP2, FST) %>% 
+      select(POP1, POP2, FST) %>% 
       tidyr::spread(data = ., POP2, FST, fill = "", drop = FALSE) %>% 
       rename(POP = POP1)
     rn <- upper.mat.fst$POP # rownames
@@ -748,10 +745,10 @@ fst_WC84 <- function(data,
     full.mat.fst[lower.tri(full.mat.fst)] <- lower.mat.fst[lower.tri(lower.mat.fst)] 
     diag(full.mat.fst) <- "0"
 
-    if (ci){
+    if (ci) {
       # bind upper and lower diagonal of matrix
       lower.mat.ci <- pairwise.fst %>% 
-        select (POP1, POP2, CI_LOW, CI_HIGH) %>% 
+        select(POP1, POP2, CI_LOW, CI_HIGH) %>% 
         tidyr::unite(data = ., CI, CI_LOW, CI_HIGH, sep = " - ") %>% 
         tidyr::spread(data = ., POP2, CI, fill = "", drop = FALSE) %>% 
         rename(POP = POP1)
@@ -778,7 +775,7 @@ fst_WC84 <- function(data,
 
 
   # messages -------------------------------------------------------------------
-  if(verbose){
+  if (verbose) {
     cat("############################### RESULTS ###############################\n")
     if (ci) {
       message(stri_paste("Fst (overall): ", res$fst.overall$FST, " [", res$fst.overall$CI_LOW, " - ", res$fst.overall$CI_HIGH, "]"))

diff --git a/R/global_variables.R b/R/global_variables.R
@@ -21,6 +21,11 @@ if (getRversion() >= "2.15.1") {
       "COUNT", "MAX_COUNT_MARKERS", "hierarchy", "GT_VCF", "ANALYSIS",
       "MEAN_ITERATIONS", "MEAN_SUBSAMPLE", "NUMBER_ITERATIONS",
       "NUMBER_SUBSAMPLE", "TOTAL_ITERATIONS", "TOTAL_SUBSAMPLE", "X1", "X2",
-      "strata.df")
+      "strata.df", "GENOTYPE", "FIS", "POP1", "POP2", "tsiga", "tsigb", "tsigw", "CI", 
+      "CI_LOW", "CI_HIGH", "N_MARKERS","sigma.loc.alleles", "DST", "DST_P", 
+      "FIS_CI_HIGH", "FIS_CI_LOW", "HO", "HS", "HT", "HT_P", "JOST_D", "MHO", 
+      "MN", "MP", "MP2", "MSP2", "NEI_FST", "NEI_FST_CI_HIGH", "NEI_FST_CI_LOW",
+      "NEI_FST_P", "NEI_FST_P_CI_HIGH", "NEI_FST_P_CI_LOW", "NP", "N_INV", 
+      "SHO", "SP2", "n.markers", "JOST_D_CI_LOW", "JOST_D_CI_HIGH")
   )
 }
diff --git a/README.md b/README.md
@@ -27,7 +27,7 @@ by Eric C. Anderson (see Anderson et al. 2008 and Anderson 2010) or [adegenet] (
 * Markers can be randomly selected for a **classic LOO (Leave-One-Out) assignment** or 
 chosen based on **ranked Fst** (Weir & Cockerham, 1984) for a **THL (Training, Holdout, Leave-one-out) assignment analysis** (reviewed in Anderson 2010)
 * Use `iteration.method` and/or `iteration.subsample` arguments to resample markers or individuals to get statistics!
-* assigner provides a fast implementation of Weir and Cockerham (1984) Fst/Theta. Both **overall** and **pairwise Fst** can be estimated with **confidence intervals** based on bootstrap of markers (resampling with replacement).
+* assigner provides a fast implementation of Weir and Cockerham (1984) Fst/Theta and Nei's fst (1987). Both **overall** and **pairwise Fst** can be estimated with **confidence intervals** based on bootstrap of markers (resampling with replacement).
 * **Map-independent imputation** of missing genotype or alleles using **Random Forest** or the most frequent category is also available to test the impact of missing data on assignment analysis
 * **Filters:**
   + Individuals, populations and markers can be **filtered** and/or selected in several ways using **blacklist, 
@@ -113,24 +113,14 @@ The Amazon image can be imported into Google Cloud Compute Engine to start a new
 ## New features
 Change log, version, new features and bug history now lives in the [NEWS.md file] (https://github.com/thierrygosselin/assigner/blob/master/NEWS.md)
 
+**v.0.3.4**
+* `fst_NEI87`: very fast function that can compute: the overall and pairwise Nei's (1987) fst and f'st (prime). 
+Bootstrap resampling of markers is avalaible to build Confidence Intervals. The estimates are available as a data frame and a matrix with upper diagonal filled with Fst values and lower diagonal filled with the confidence intervals. Jost's D is also given ;)
 
 **v.0.3.3**
 * `fst_WC84`: bug fix, the function was not properly configured for multi-allelic markers (e.g. microsatellite, and haplotype format from STACKS). Thanks to Craig McDougall for catching this.
 
 
-**v.0.3.2**
-* `assignment_mixture`: added a check that throws an error when pop.levels != the pop.id in strata
-
-**v.0.3.1**
-`assignment_mixture`: 
-* updated with latest modules from `stackr`. 
-* simplified the identification of mixture or unknown samples. See doc.
-
-**v.0.3.0**
-* updated vignettes
-* major bug fix that involved dplyr new version (0.5.0) and mostly with 
-the use of dplyr::distinct
-
 For previous news:
 [NEWS.md file] (https://github.com/thierrygosselin/assigner/blob/master/NEWS.md)
 
@@ -216,6 +206,8 @@ Kavakiotis I, Triantafyllidis A, Ntelidou D et al. (2015) TRES: Identification o
 
 Meirmans PG, Van Tienderen PH (2004) genotype and genodive: two programs for the analysis of genetic diversity of asexual organisms. Molecular Ecology Notes, 4, 792-794.
 
+Nei M. (1987) Molecular Evolutionary Genetics. Columbia University Press.
+
 Paetkau D, Slade R, Burden M, Estoup A (2004) Genetic assignment methods for the direct, real-time estimation of migration rate: a simulation-based exploration of accuracy and power. Molecular Ecology, 13, 55-65.
 
 Paetkau D, Waits LP, Clarkson PL, Craighead L, Strobeck C (1997) An empirical evaluation of genetic distance statistics using microsatellite data from bear (Ursidae) populations. Genetics, 147, 1943-1957.