Too many outliers #81

sandyplus · 2023-04-26T08:15:55Z

Hello, everyone,

I used pcadapt to identify environment-related outliers, but I obtained an excessive number of them. Is there anything I overlooked?
Best regards,
Sandy

PS: The code I used:

library(pcadapt)
library(qvalue)
library(data.table)

# Read input file name and best K value from command line arguments
infile <- "imputation.merged.lfmm"
posfile <- "imputation.merged.pos"
bestk  <- 2
outpre <- gsub(".lfmm","",basename(infile))
outdir <- "output"
dir.create(outdir, showWarnings = FALSE)

# Read data using read.pcadapt function with "lfmm" type
ddl <- read.pcadapt(infile, type = "lfmm")

# Read "pos" column from the input file and set column name to NULL
pos <- fread(posfile, header = F)
print(paste("dim of this input is", paste(as.character(dim(ddl)), collapse = "x")))

# Plot the first screeplot of the first division
pca <- pcadapt(ddl, K = 10, method = "mahalanobis", min.maf = 0.05)
pdf(paste0(outdir, "/", "outplots_", outpre, ".screeplot", ".K10", ".pdf"))
plot(pca, option = "screeplot")
dev.off()

# Perform pcadapt analysis with the best K value
pca     <- pcadapt(ddl, K = bestk, method = "mahalanobis", min.maf = 0.05)
pca$pos <- pos
pca$qvalue <- qvalue(pca$pvalues)

# Plot the histogram of p-values
outpdf  <- paste0(outdir, "/", "outplots_", outpre, ".hist", paste0(".K", bestk), ".pdf")
pdf(outpdf)
hist(pca$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange")
dev.off()

# Plot the QQ plot
pdf("output_sandy_Ccar2.5/Ccar.qqplot_k2.pdf")
plot(pca, option = "qqplot")
dev.off()

# Plot the histogram of Dj statistic
pdf("output_sandy_Ccar2.5/Ccar.Dj_hist.pdf")
plot(pca, option = "stat.distribution")
dev.off()

# Plot the loading scores for the top 2 principal components
pdf("output_sandy_Ccar2.5/Ccar.loading.pdf")
par(mfrow = c(2, 1))
for (i in 1:2)
  plot(pca$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))
dev.off()

# Save the pcadapt results in outpca
# Merge pos, pvalues, and qvalues variables
out_dd <- data.frame("pos" = pca$pos, "pvalues" = pca$pvalues, "qvalues" = pca$qvalue$qvalues)
outpca = paste0(outdir, "/", "outplots_", outpre, paste0(".K", bestk), ".pcadapt.qvalue.txt")
write.table(out_dd, file = outpca, sep = "\t", quote = F, row.names = F, col.names = T)

saveRDS(pca, paste0(outpre, ".", bestk, ".RDS"))

Here is the R session info:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /opt/software/R/4.0.3/lib64/R/lib/libRblas.so
LAPACK: /opt/software/R/4.0.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] pcadapt_4.3.3      stringr_1.5.0      qvalue_2.22.0      RColorBrewer_1.1-2
[5] data.table_1.14.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       magrittr_2.0.3   splines_4.0.3    tidyselect_1.2.0
 [5] munsell_0.5.0    colorspace_2.0-0 R6_2.5.0         rlang_1.1.0
 [9] fansi_0.4.1      plyr_1.8.6       dplyr_1.1.1      tools_4.0.3
[13] grid_4.0.3       gtable_0.3.0     utf8_1.1.4       cli_3.6.1
[17] tibble_3.2.1     lifecycle_1.0.3  reshape2_1.4.4   ggplot2_3.4.2
[21] vctrs_0.6.1      glue_1.6.2       stringi_1.5.3    compiler_4.0.3
[25] pillar_1.9.0     generics_0.1.0   scales_1.2.1     pkgconfig_2.0.3

Here is the plot output. The blue and light blue colored points indicate the outliers, while the black and grey colored points represent the non-outliers. The grey dashed line represents the threshold q-value of less than 0.01, and the red line represents the threshold Bonferroni adjusted p-value of less than 0.05. Additionally, the purple points indicate outliers that were identified using other software, such as RDA or LFMM. The Y-axis represents the unadjusted p-value, which has been transformed using the -log10() function.
I have also checked these plots, and everything seems to be okay. The plots include a histogram of p-values, a QQ plot, a histogram of Dj statistic, and loading scores plots.

The text was updated successfully, but these errors were encountered:

privefl · 2023-04-26T08:32:19Z

How many variants do you have as input?
Have you checked #56?

sandyplus · 2023-04-26T09:12:10Z

@privefl Thanks for your quick response.

I identified 247,899 SNPs as outliers using a Q-value threshold of < 0.01, and 37,743 using the Bonferroni adjusted p-value threshold out of 7,856,218 SNPs. However, there was little overlap with RDA and LFMM with the Bonferroni method.

I reviewed the issue and confirmed that I have enough variants for pcadapt to calculate the mahalanobis distance. Let me know if you need more information.

Best regards,
Sandy

sandyplus · 2023-04-26T09:21:44Z

FYI, I divided the chromosome into 200 parts and merged them as input for pcadapt, which may have caused some misordering of SNPs. Should I sort the SNPs according to their respective chromosomes?

privefl · 2023-04-26T09:36:17Z

Ok, the number of variants is not the problem then.

The other obvious next issue can be LD. Are you capturing any LD in the PCA?
But you seem to be using only 2 PCs, so that seems unlikely..
You can always try to use some pruning.

privefl · 2023-05-11T09:59:09Z

Any update on this?

sandyplus · 2023-05-23T12:46:11Z

@privefl
I'm sorry for the delayed response. I was unable to detect any LD in the PCA. Additionally, I attempted to implement prune using command "pca <- pcadapt(ddl, K = bestk, method="mahalanobis",min.maf=0.05, LD.clumping = list(size = 200, thr = 0.1))", but the outcomes were almost identical with minor variations.

privefl · 2023-05-23T12:51:39Z

Then, maybe the results are fine.
How many samples do you have?

You might want to increase the size in LD.clumping to e.g. 10000, given the large number of variants you have.

sandyplus · 2023-05-23T12:56:26Z

I have 79 individuals. Alright, I will attempt to increase the size in LD.clumping and update you with the results. Thanks for your suggestion.

privefl · 2023-05-23T13:09:10Z

That's a very small sample.
Do you have populations that separate perfectly on the PC plot?

sandyplus · 2023-05-23T13:27:13Z

Yes, I have 16 populations which are separated by K = 2.
So I used K = 2 for pcadapt analysis.

privefl · 2023-05-23T14:19:05Z

They are seperated in 3 groups clearly.
Maybe results are just fine.

How does the histogram of pvalues look like?

sandyplus · 2023-05-24T13:54:17Z

It looks good.

privefl · 2023-08-29T09:55:28Z

What's the status on this issue?

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Too many outliers #81

Too many outliers #81

sandyplus commented Apr 26, 2023

privefl commented Apr 26, 2023

sandyplus commented Apr 26, 2023

sandyplus commented Apr 26, 2023 •

edited

Loading

privefl commented Apr 26, 2023 •

edited

Loading

privefl commented May 11, 2023

sandyplus commented May 23, 2023 •

edited

Loading

privefl commented May 23, 2023 •

edited

Loading

sandyplus commented May 23, 2023

privefl commented May 23, 2023

sandyplus commented May 23, 2023

privefl commented May 23, 2023

sandyplus commented May 24, 2023

privefl commented Aug 29, 2023 •

edited

Loading

Too many outliers #81

Too many outliers #81

Comments

sandyplus commented Apr 26, 2023

privefl commented Apr 26, 2023

sandyplus commented Apr 26, 2023

sandyplus commented Apr 26, 2023 • edited Loading

privefl commented Apr 26, 2023 • edited Loading

privefl commented May 11, 2023

sandyplus commented May 23, 2023 • edited Loading

privefl commented May 23, 2023 • edited Loading

sandyplus commented May 23, 2023

privefl commented May 23, 2023

sandyplus commented May 23, 2023

privefl commented May 23, 2023

sandyplus commented May 24, 2023

privefl commented Aug 29, 2023 • edited Loading

sandyplus commented Apr 26, 2023 •

edited

Loading

privefl commented Apr 26, 2023 •

edited

Loading

sandyplus commented May 23, 2023 •

edited

Loading

privefl commented May 23, 2023 •

edited

Loading

privefl commented Aug 29, 2023 •

edited

Loading