Ancestry clustering yielding incorrect result #333
Comments
Thanks for the bug report. We did recently change how variants are filtered for PCA. @smlmbrt will investigate after his holidays 🌴 Can I double check if you're deleting the cache each time you use a new VCF? |
I deleted the work directory and the results directory before running. Does removing the work directory delete the cache? I did not use --genotypes_cache. I did not attempt to do anything with NXF_SINGULARITY_CACHEDIR. I also tested this on an earlier, but still recent, version (I believe v2.0.0-beta) and had the same issue. Congratulations on the preprint, by the way. |
@Fiwx, are you applying the pipeline to individual samples? In pgsc_calc/modules/local/ancestry/intersect_variants.nf Lines 33 to 39 in 0f33b4c
For individual samples, reducing the MAF threshold to 0 should revert to the old behaviour. I think it's still sensible to apply the missingness filter to these variants though. It would be great if you could let us know if that fixes it, we may consider some sort of |
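To illustrate why a MAF threshold interacts badly with single-sample input, here is a minimal sketch. It assumes the allele frequency is computed from the target sample alone and that a variant is kept when its MAF is at or above the threshold; `minor_allele_frequency` and `passes_maf` are hypothetical helpers, not functions from pgsc_calc or PLINK.

```python
def minor_allele_frequency(alt_allele_counts):
    """MAF from per-sample ALT allele counts (0, 1, or 2) at a diploid site."""
    n_alleles = 2 * len(alt_allele_counts)
    alt_freq = sum(alt_allele_counts) / n_alleles
    return min(alt_freq, 1.0 - alt_freq)


def passes_maf(alt_allele_counts, maf_threshold):
    """Hypothetical filter: keep the variant when MAF >= threshold."""
    return minor_allele_frequency(alt_allele_counts) >= maf_threshold


# One sample: the three possible genotypes at a biallelic site.
single_sample = {"hom_ref": [0], "het": [1], "hom_alt": [2]}

# With one sample the MAF can only be 0.0 (homozygous) or 0.5 (heterozygous).
kept_at_10pct = {g: passes_maf(c, 0.1) for g, c in single_sample.items()}
kept_at_zero = {g: passes_maf(c, 0.0) for g, c in single_sample.items()}

print(kept_at_10pct)
print(kept_at_zero)
```

Under these assumptions a threshold of 0.1 keeps only heterozygous sites for a single sample, while a threshold of 0 keeps every genotype.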
Yes, I saw those changes, and I also thought they were sensible. I'm not sure why they would cause an issue. By the way, I think the documentation still mentions a 5% threshold for MAF: "minor allele frequency [MAF > 5%]." I am running the pipeline with single samples, yes. Is there a MAF threshold flag/config, or should I just edit --maf_target 0.1 in the source? |
@smlmbrt The results are biased towards extreme percentile values. Here are some examples from different scores:
It also shows up in an unusual location in the PCA chart, away from other samples in 1000G. Run information:
|
Are these the original results or after editing the module? Could you edit to |
This was without changing any parameters. I will try those parameters now. For maf_target, this will return the minor allele frequency at a position. I see the usefulness (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7077175/). Ancestry adjustment is done on the PC level for the Z_norms, so perhaps a large difference in one PC value could cause a large difference in score. For example, this figure shows a large eigenvalue change right above MAF 0.02: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4143691/figure/F1/, and shows a transient jump at around 0.1, where the maf_target is currently set. Others have various stances: https://onlinelibrary.wiley.com/doi/epdf/10.1111/1755-0998.12995.
For geno_miss, the PLINK .vmiss documentation is a bit unclear on this. What does geno_miss do? In other words, what counts as a missing dosage?
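The hypothesis above, that a large shift in one PC value could cause a large shift in an adjusted score, can be sketched with a toy example. This assumes the adjustment regresses the score on a single PC in the reference panel and standardises the residual. The reference values, `fit_line`, and `adjusted_z` are all made up for illustration and are not the pipeline's actual implementation.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def adjusted_z(score, pc, slope, intercept, resid_sd):
    """Residual of the score after removing the PC trend, in SD units."""
    return (score - (intercept + slope * pc)) / resid_sd


# Toy reference panel (made-up numbers): score rises with the PC.
ref_pc = [-2.0, -1.0, 0.0, 1.0, 2.0]
ref_score = [-1.1, -0.4, 0.1, 0.4, 1.0]

slope, intercept = fit_line(ref_pc, ref_score)
residuals = [s - (intercept + slope * p) for p, s in zip(ref_pc, ref_score)]
resid_sd = (sum(r * r for r in residuals) / (len(residuals) - 2)) ** 0.5

# Same raw score, projected inside vs. far outside the reference PC range.
z_inside = adjusted_z(0.2, 0.5, slope, intercept, resid_sd)
z_outlier = adjusted_z(0.2, 8.0, slope, intercept, resid_sd)
print(z_inside, z_outlier)
```

In this toy setup the same raw score yields a modest adjusted value when the PC lies within the reference range and an extreme one when the PC is projected far outside it, because the fitted trend is extrapolated.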
|
I ran it with the line changed to --maf_target 0 in modules/local/ancestry/intersect_variants.nf and got:
The logs show this command was run: The target_variants.txt.gz file only contains a header line with the expected 'CHR:POS:A0:A1' column, with no other lines. |
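A quick way to confirm that a gzipped variant list holds only a header is to count the lines after it. This is a minimal sketch; `count_data_rows` is a hypothetical helper, and the demo writes its own temporary file (including a made-up variant row) rather than reading real pipeline output.

```python
import gzip
import os
import tempfile


def count_data_rows(path):
    """Count non-empty lines after the header in a gzipped text file."""
    with gzip.open(path, "rt") as handle:
        next(handle, None)  # skip the header line, if there is one
        return sum(1 for line in handle if line.strip())


# Demo on a temporary file that mimics the header-only case described above.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "target_variants.txt.gz")

    with gzip.open(demo_path, "wt") as out:
        out.write("CHR:POS:A0:A1\n")
    header_only_rows = count_data_rows(demo_path)

    with gzip.open(demo_path, "wt") as out:
        out.write("CHR:POS:A0:A1\n1:1000:A:G\n")  # made-up variant row
    one_row = count_data_rows(demo_path)

print(header_only_rows, one_row)
```

A count of zero means no variant survived the upstream filters, which matches the symptom of a file containing just the 'CHR:POS:A0:A1' header.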
I changed it to 0.0 instead of "0" and it ran through. Here are the results: Before and after the change:
The random forest ancestry assignment probabilities were a bit more accurate after the change. The main issue of extreme percentile values has remained the same: a large percentage of the scores are very high (100.0) or low (0.0). Changing geno_miss to 1.0, I got the same PC results, and the same issue with the extreme percentile scores. Example for various scores, with --maf_target 0.0 and --geno_miss 1.0 set: Here is an example of a normal result: PGS002228_hmPOS_GRCh37
The below results seem to be off, and happen for most scores: PGS000758_hmPOS_GRCh37
PGS000892_hmPOS_GRCh37
PGS000921_hmPOS_GRCh37
PGS002308_hmPOS_GRCh37
PGS002764_hmPOS_GRCh37
PGS002771_hmPOS_GRCh37
PGS003725_hmPOS_GRCh37
|
See #343 for parallel discussion. Have fixed the filters so that low MAF variants can still be included and the next update of the calculator will allow for these changes (and should revert back to the pre-beta behaviour if the filters are removed). |
The new parameters are available in v2.0.0-beta.2 🥳 |
Description of the bug
On the most recent dev branch, I have noticed VCF files having incorrect ancestry results. That is, the PC values that are assigned are unusual and aren't near any other value in 1000G, and the ancestry assignment (closest population) is incorrect: it does not come close to matching the self-reported ancestry of the samples, whereas previous runs on these samples produced correct ancestry assignments (and normal PC values). I am using the run ancestry flag.
These are imputed VCFs. The match rate to the reference panel is about 1/3, and the match rate to the scorefile is just under 100%. The genome build used is correct.
I'm wondering if there are any ideas that I can explore to fix this or if this behavior has been seen before. Along with the failed ancestry analysis, the Z_norm2 values (which seem normal) are discrepant from the percentile in the most similar population values (which are at 100.0%, abnormally high).
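One rough way to flag the discrepancy described above is to compare each reported percentile with the percentile a standard normal distribution would imply for the Z value. This is only a crude sanity check: the two measures are computed differently, so exact agreement is not expected. `normal_percentile`, `is_discrepant`, and the 25-point tolerance are illustrative assumptions, not part of pgsc_calc.

```python
import math


def normal_percentile(z):
    """Percentile (0-100) of z under a standard normal distribution."""
    return 50.0 * (1.0 + math.erf(z / math.sqrt(2.0)))


def is_discrepant(z, reported_percentile, tolerance=25.0):
    """Flag a large gap between the implied and the reported percentile."""
    return abs(normal_percentile(z) - reported_percentile) > tolerance


# A Z value near zero implies a percentile near 50, so a reported 100.0
# would be flagged while a reported 55.0 would not.
print(normal_percentile(0.0))
print(is_discrepant(0.0, 100.0), is_discrepant(0.0, 55.0))
```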
I was able to replicate this across multiple VCFs, though each VCF was submitted one at a time.
Command used and terminal output
No response
Relevant files
No response
System information
Linux. -dev branch of pgsc_calc. Singularity.