Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Too many outliers #81

Open
sandyplus opened this issue Apr 26, 2023 · 13 comments
Open

Too many outliers #81

sandyplus opened this issue Apr 26, 2023 · 13 comments

Comments

@sandyplus
Copy link

Hello, everyone,

I used pcadapt to identify environment-related outliers, but I obtained an excessive number of them. Is there anything I overlooked?
Best regards,
Sandy

PS: The code I used:

library(pcadapt)
library(qvalue)
library(data.table)

# Read input file name and best K value from command line arguments
infile <- "imputation.merged.lfmm"
posfile <- "imputation.merged.pos"
bestk  <- 2
outpre <- gsub(".lfmm","",basename(infile))
outdir <- "output"
dir.create(outdir, showWarnings = FALSE)

# Read data using read.pcadapt function with "lfmm" type
ddl <- read.pcadapt(infile, type = "lfmm")

# Read "pos" column from the input file and set column name to NULL
pos <- fread(posfile, header = F)
print(paste("dim of this input is", paste(as.character(dim(ddl)), collapse = "x")))

# Plot the first screeplot of the first division
pca <- pcadapt(ddl, K = 10, method = "mahalanobis", min.maf = 0.05)
pdf(paste0(outdir, "/", "outplots_", outpre, ".screeplot", ".K10", ".pdf"))
plot(pca, option = "screeplot")
dev.off()

# Perform pcadapt analysis with the best K value
pca     <- pcadapt(ddl, K = bestk, method = "mahalanobis", min.maf = 0.05)
pca$pos <- pos
pca$qvalue <- qvalue(pca$pvalues)

# Plot the histogram of p-values
outpdf  <- paste0(outdir, "/", "outplots_", outpre, ".hist", paste0(".K", bestk), ".pdf")
pdf(outpdf)
hist(pca$pvalues, xlab = "p-values", main = NULL, breaks = 50, col = "orange")
dev.off()

# Plot the QQ plot
pdf("output_sandy_Ccar2.5/Ccar.qqplot_k2.pdf")
plot(pca, option = "qqplot")
dev.off()

# Plot the histogram of Dj statistic
pdf("output_sandy_Ccar2.5/Ccar.Dj_hist.pdf")
plot(pca, option = "stat.distribution")
dev.off()

# Plot the loading scores for the top 2 principal components
pdf("output_sandy_Ccar2.5/Ccar.loading.pdf")
par(mfrow = c(2, 1))
for (i in 1:2)
  plot(pca$loadings[, i], pch = 19, cex = .3, ylab = paste0("Loadings PC", i))
dev.off()

# Save the pcadapt results in outpca
# Merge pos, pvalues, and qvalues variables
out_dd <- data.frame("pos" = pca$pos, "pvalues" = pca$pvalues, "qvalues" = pca$qvalue$qvalues)
outpca = paste0(outdir, "/", "outplots_", outpre, paste0(".K", bestk), ".pcadapt.qvalue.txt")
write.table(out_dd, file = outpca, sep = "\t", quote = F, row.names = F, col.names = T)

saveRDS(pca, paste0(outpre, ".", bestk, ".RDS"))

Here is the R session info:

> sessionInfo()
R version 4.0.3 (2020-10-10)
Platform: x86_64-pc-linux-gnu (64-bit)
Running under: CentOS Linux 7 (Core)

Matrix products: default
BLAS:   /opt/software/R/4.0.3/lib64/R/lib/libRblas.so
LAPACK: /opt/software/R/4.0.3/lib64/R/lib/libRlapack.so

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] pcadapt_4.3.3      stringr_1.5.0      qvalue_2.22.0      RColorBrewer_1.1-2
[5] data.table_1.14.8

loaded via a namespace (and not attached):
 [1] Rcpp_1.0.7       magrittr_2.0.3   splines_4.0.3    tidyselect_1.2.0
 [5] munsell_0.5.0    colorspace_2.0-0 R6_2.5.0         rlang_1.1.0
 [9] fansi_0.4.1      plyr_1.8.6       dplyr_1.1.1      tools_4.0.3
[13] grid_4.0.3       gtable_0.3.0     utf8_1.1.4       cli_3.6.1
[17] tibble_3.2.1     lifecycle_1.0.3  reshape2_1.4.4   ggplot2_3.4.2
[21] vctrs_0.6.1      glue_1.6.2       stringi_1.5.3    compiler_4.0.3
[25] pillar_1.9.0     generics_0.1.0   scales_1.2.1     pkgconfig_2.0.3

Here is the plot output. The blue and light blue colored points indicate the outliers, while the black and grey colored points represent the non-outliers. The grey dashed line represents the threshold q-value of less than 0.01, and the red line represents the threshold Bonferroni adjusted p-value of less than 0.05. Additionally, the purple points indicate outliers that were identified using other software, such as RDA or LFMM. The Y-axis represents the unadjusted p-value, which has been transformed using the -log10() function.
I have also checked these plots, and everything seems to be okay. The plots include a histogram of p-values, a QQ plot, a histogram of Dj statistic, and loading scores plots.

img

@privefl
Copy link
Member

privefl commented Apr 26, 2023

How many variants do you have as input?
Have you checked #56?

@sandyplus
Copy link
Author

@privefl Thanks for your quick response.

I identified 247,899 SNPs as outliers using a Q-value threshold of < 0.01, and 37,743 using the Bonferroni adjusted p-value threshold out of 7,856,218 SNPs. However, there was little overlap with RDA and LFMM with the Bonferroni method.

I reviewed the issue and confirmed that I have enough variants for pcadapt to calculate the mahalanobis distance. Let me know if you need more information.

Best regards,
Sandy

@sandyplus
Copy link
Author

sandyplus commented Apr 26, 2023

FYI, I divided the chromosome into 200 parts and merged them as input for pcadapt, which may have caused some misordering of SNPs. Should I sort the SNPs according to their respective chromosomes?

@privefl
Copy link
Member

privefl commented Apr 26, 2023

Ok, the number of variants is not the problem then.

The other obvious next issue can be LD. Are you capturing any LD in the PCA?
But you seem to be using only 2 PCs, so that seems unlikely..
You can always try to use some pruning.

@privefl
Copy link
Member

privefl commented May 11, 2023

Any update on this?

@sandyplus
Copy link
Author

sandyplus commented May 23, 2023

@privefl
I'm sorry for the delayed response. I was unable to detect any LD in the PCA. Additionally, I attempted to implement prune using command "pca <- pcadapt(ddl, K = bestk, method="mahalanobis",min.maf=0.05, LD.clumping = list(size = 200, thr = 0.1))", but the outcomes were almost identical with minor variations.

@privefl
Copy link
Member

privefl commented May 23, 2023

Then, maybe the results are fine.
How many samples do you have?

You might want to increase the size in LD.clumping to e.g. 10000, given the large number of variants you have.

@sandyplus
Copy link
Author

I have 79 individuals. Alright, I will attempt to increase the size in LD.clumping and update you with the results. Thanks for your suggestion.

@privefl
Copy link
Member

privefl commented May 23, 2023

That's a very small sample.
Do you have populations that separate perfectly on the PC plot?

@sandyplus
Copy link
Author

Yes, I have 16 populations which are separated by K = 2.
So I used K = 2 for pcadapt analysis.
image

@privefl
Copy link
Member

privefl commented May 23, 2023

They are seperated in 3 groups clearly.
Maybe results are just fine.

How does the histogram of pvalues look like?

@sandyplus
Copy link
Author

It looks good.
image

@privefl
Copy link
Member

privefl commented Aug 29, 2023

What's the status on this issue?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants