Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ancestry clustering yielding incorrect result #333

Closed
Fiwx opened this issue Jul 6, 2024 · 11 comments
Closed

Ancestry clustering yielding incorrect result #333

Fiwx opened this issue Jul 6, 2024 · 11 comments
Assignees
Labels
bug Something isn't working documentation Improvements or additions to documentation user-query User queries & requests
Milestone

Comments

@Fiwx
Copy link

Fiwx commented Jul 6, 2024

Description of the bug

On the most recent dev branch, I have noticed VCF files having incorrect ancestry results. That is, the PC values that are assigned are unusual and aren't near any other value in 1000G, and the ancestry assignment (closest population) is incorrect: the ancestry does not (not even close) to matching the self-reported ancestry of the samples, and previous runs on these samples have resulted in correct ancestry assignments (and normal PC values). I am using the run ancestry flag.

These are imputed VCFs. The match rate to the reference panel is about 1/3, and the match rate to the scorefile is just under 100%. The genome build used is correct.

I'm wondering if there are any ideas that I can explore to fix this or if this behavior has been seen before. Along with the failed ancestry analysis, the Z_norm2 values (which seem normal) are discrepant from the percentile in the most similar population values (which are at 100.0%, abnormally high).

I was able to replicate this across multiple VCFs, though each VCF was submitted one at a time.

Command used and terminal output

No response

Relevant files

No response

System information

Linux. -dev branch of pgsc_calc. Singularity.

@Fiwx Fiwx added the bug Something isn't working label Jul 6, 2024
@nebfield
Copy link
Member

nebfield commented Jul 8, 2024

Thanks for the bug report. We did recently change how variants are filtered for PCA. @smlmbrt will investigate after his holidays 🌴

Can I double check if you're deleting the cache each time you use a new VCF?

@nebfield nebfield added this to the v2.0.0-beta.2 milestone Jul 8, 2024
@Fiwx
Copy link
Author

Fiwx commented Jul 8, 2024

I deleted the work directory and the results directory before running. Does removing the work directory delete the cache? I did not use –genotypes_cache. I did not attempt to do anything with NXF_SINGULARITY_CACHEDIR.

I also tested this on a earlier, but still recent version (I believe v2.0.0-beta), and had the same issue. Congratulations on the preprint, by the way.

@smlmbrt smlmbrt added the user-query User queries & requests label Jul 16, 2024
@smlmbrt
Copy link
Member

smlmbrt commented Jul 16, 2024

@Fiwx, are you applying the pipeline to individual samples? In v2.0.0-beta I added an additional filter for PCA variant inclusion that the MAF > 10% and the Missingness < 10% for variants in the target sample

pgscatalog-intersect --ref $ref_variants \
--target $variants \
--chrom $meta.chrom \
--maf_target 0.1 \
--geno_miss 0.1 \
--outdir . \
-v

For individual samples reducing the MAF threshold to 0 should revert to the old behaviour. I think it's still sensible to applying the missingness filter to these variants though. It would be great if you could let us know if that fixes it, we may consider some sort of --single_sample flag that would then change these filters depending on how many people the PGS is being calculated for.

@Fiwx
Copy link
Author

Fiwx commented Jul 16, 2024

Yes, I saw those changes, and I also thought they were sensible. I'm not sure why they would cause an issue. By the way, I think the documentation still mentions 5% threshold for MAF: "minor allele frequency [MAF > 5%]."

I am running the pipeline with single samples, yes. Is there a MAF threshold flag/config, or should I just edit --maf_target 0.1 in the source?

@Fiwx
Copy link
Author

Fiwx commented Jul 16, 2024

@smlmbrt The results are biased towards extreme percentile values. Here are some examples from different scores:

            "Z_norm2": 1.9385237149563703,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": 2.644385079008054,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": -2.3964278919739876,
            "percentile_MostSimilarPop": 0.0

            "Z_norm2": 1.7975167781748878,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": -3.0629493759113258,
            "percentile_MostSimilarPop": 0.0

            "Z_norm2": -2.5554729425082163,
            "percentile_MostSimilarPop": 0.0

It also shows up in an unusual location in the PCA chart, away from other samples in 1000G.

Run information:

    "run_ancestry": "/home/user/data/pgsc_1000G_v1.tar.zst",
    "geno_ref": 0.1,
    "mind_ref": 0.1,
    "maf_ref": 0.05,
    "hwe_ref": 0.0001,
    "indep_pairwise_ref": "1000 50 0.05",
    "projection_method": "oadp",
    "ancestry_method": "RandomForest",
    "ref_label": "SuperPop",
    "n_popcomp": 5,
    "normalization_method": "empirical mean mean+var",
    "n_normalization": 4,
    "liftover": false,
    "target_build": "GRCh37",
    "min_lift": 0.95,
    "min_overlap": 0.0,

@smlmbrt
Copy link
Member

smlmbrt commented Jul 17, 2024

@smlmbrt The results are biased towards extreme percentile values. Here are some examples from different scores:

            "Z_norm2": 1.9385237149563703,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": 2.644385079008054,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": -2.3964278919739876,
            "percentile_MostSimilarPop": 0.0

            "Z_norm2": 1.7975167781748878,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": -3.0629493759113258,
            "percentile_MostSimilarPop": 0.0

            "Z_norm2": -2.5554729425082163,
            "percentile_MostSimilarPop": 0.0

Are these the original results or after editing the module? Could you edit to --maf_target 0 --geno_miss 1 or --maf_target 0 --geno_miss 0.1 and re-running without any cacheing to see if that fixes it?

@Fiwx
Copy link
Author

Fiwx commented Jul 23, 2024

This was without changing any parameters. I will try those parameters now.

For maf_target, this will return the minor allele frequency at a position. I see the usefulness (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7077175/). Ancestry adjustment is done on the PC level for the Z_norms, so perhaps a large difference in one PC value could cause a large difference in score. For example, this figure shows large eigenvalue change right above MAF 0.02: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4143691/figure/F1/, and shows a transient jump at around 0.1, where the maf_target is currently set. Others have various stances: https://onlinelibrary.wiley.com/doi/epdf/10.1111/1755-0998.12995.

When seeking to exclude only singletons in alignments with missing data
(a ubiquitous problem for reduced‐representation library prepara‐
tion methods), it is preferable to filter by the count (rather than
frequency) of the minor allele, because variation in the amount of
missing data across an alignment will cause a static frequency cut‐
off to remove different SFS classes at different sites
https://github.com/cjbattey/LinckBattey2017_MAF_clustering/tree/master

For geno_miss, the PLINK .vmiss documentation is a bit unclear on this. What does geno_miss do? In other words, what counts as a missing dosage?
I can think of a few interpretations:

  • chrom:pos pair being present
  • position containing no missing genotype information
  • position containing the exact same genotype
  • position containing the exact same genotype with the same ref/alt assignment
  • how samples have missing dosage data for an allele, out of 2 * number of samples

@Fiwx
Copy link
Author

Fiwx commented Jul 23, 2024

I ran it with the line changed to --maf_target 0 in modules/local/ancestry/intersect_variants.nf and got:

Jul-23 16:53:32.703 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: RUNNING; exit: -; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 16:58:32.757 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: RUNNING; exit: -; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 17:03:32.814 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: RUNNING; exit: -; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 17:04:16.189 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: COMPLETED; exit: 1; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 17:04:16.194 [TaskFinalizer-6] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); work-dir=/home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce
  error [nextflow.exception.ProcessFailedException]: Process `PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL)` terminated with an error exit status (1)
Jul-23 17:04:16.223 [TaskFinalizer-6] ERROR nextflow.processor.TaskProcessor - Error executing process > 'PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL)'

Caused by:
  Process `PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL)` terminated with an error exit status (1)


Command executed:

  pgscatalog-intersect --ref GRCh37_1000G_ALL.pvar.zst         --target GRCh37_file_ALL.pvar.zst         --chrom ALL         --maf_target 0         --geno_miss 0.1         --outdir .         -v
  
  n_matched=$(sed -n '3p' intersect_counts_ALL.txt)
  
  if [ $n_matched == "0" ]
  then
      echo "ERROR: No variants in intersection"
      exit 1
  else
      mv matched_variants.txt.gz file_ALL_matched.txt.gz
  fi
  
  cat <<-END_VERSIONS > versions.yml
  INTERSECT_VARIANTS:
      pgscatalog.match: $(echo $(python -c 'import pgscatalog.match; print(pgscatalog.match.__version__)'))
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  pgscatalog.match.cli.intersect_cli: 2024-07-23 17:01:42 INFO     Processed 12000000 TARGET variants

[...]

  pgscatalog.match.cli.intersect_cli: 2024-07-23 17:04:16 INFO     Processed 27904794 TARGET variants
  pgscatalog.match.cli.intersect_cli: 2024-07-23 17:04:16 INFO     Outputting TARGET variants -> target_variants.txt.gz
  Traceback (most recent call last):
    File "/app/pgscatalog.utils/.venv/bin/pgscatalog-intersect", line 8, in <module>
      sys.exit(run_intersect())
               ^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/intersect_cli.py", line 172, in run_intersect
      for v in heapq.merge(
    File "/usr/local/lib/python3.11/heapq.py", line 376, in merge
      h_append([key(value), order * direction, value, next])
                ^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/intersect_cli.py", line 174, in <lambda>
      key=lambda v: (v["CHR:POS:A0:A1"], v["ID_TARGET"], v["REF_TARGET"]),
                     ~^^^^^^^^^^^^^^^^^
  KeyError: 'CHR:POS:A0:A1'
  cp: '.command.out' and '.command.out' are the same file
  cp: '.command.err' and '.command.err' are the same file
  cp: '.command.trace' and '.command.trace' are the same file
[...]
Jul-23 17:04:16.239 [TaskFinalizer-6] INFO  nextflow.Session - Execution cancelled -- Finishing pending tasks before exit
Jul-23 17:04:16.240 [Actor Thread 38] ERROR nextflow.Nextflow - ERROR: Matching subworkflow failed
Jul-23 17:04:16.251 [main] DEBUG nextflow.Session - Session await > all processes finished
Jul-23 17:04:16.252 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: local) - terminating tasks monitor poll loop
Jul-23 17:04:16.252 [main] DEBUG nextflow.Session - Session await > all barriers passed
Jul-23 17:04:16.260 [Actor Thread 35] ERROR nextflow.Nextflow - ERROR: No results report written!
Jul-23 17:04:16.268 [Actor Thread 42] ERROR nextflow.Nextflow - ERROR: No scores calculated!
Jul-23 17:04:16.275 [Actor Thread 45] ERROR nextflow.Nextflow - ERROR: Projection subworkflow failed

The logs show this command was ran:
pgscatalog-intersect --ref GRCh37_1000G_ALL.pvar.zst --target GRCh37_file_ALL.pvar.zst --chrom ALL --maf_target 0 --geno_miss 0.1 --outdir . -v

The target_variants.txt.gz file only contains a header line with the expected 'CHR:POS:A0:A1' column, with no other lines.

@Fiwx
Copy link
Author

Fiwx commented Jul 24, 2024

I changed it to 0.0 instead of "0" and it ran through. Here are the results:

Before and after the change:

PC DEFAULT WITH CHANGE
PC1 23.60 23.66
PC2 -30.01 -30.60
PC3 -0.00 0.22
PC4 8.52 8.74
PC5 0.37 -0.10
PC6 0.30 0.67
PC7 -1.00 0.19
PC8 -7.53 -7.64
PC9 1.29 0.87
PC10 -1.36 -1.48

The random forest ancestry assignment probabilities were a bit more accurate after the change. The main issue of extreme percentile values has remained the same: a large percentage of the scores are very high (100.0) or low (0.0).

Changing geno_miss to 1.0, I got the same PC results, and the same issue with the extreme percentile scores.

Example for various scores, with --maf_target 0.0 and --geno_miss 1.0 set:

Here is an example of a normal result:

PGS002228_hmPOS_GRCh37

  • SUM value from sample: 1.0793
  • Z-score for SUM value 1.0793: -0.1270 (compared to 1000G reference)
  • Percentile for SUM value 1.0793: 45.2610
  • Z_norm1: 0.1183
  • Z_norm2: 0.1164
  • percentile_MostSimilarPop: 52.3006

The below results seem to be off, and happen for most scores:

PGS000758_hmPOS_GRCh37

  • SUM value from sample: 0.4187
  • Z-score for SUM value 0.4187: 2.0613 (compared to 1000G reference)
  • Percentile for SUM value 0.4187: 97.0281
  • Z_norm1: 1.9390
  • Z_norm2: 1.7975
  • percentile_MostSimilarPop: 100.0000

PGS000892_hmPOS_GRCh37

  • SUM value from sample: -0.1183
  • Z-score for SUM value -0.1183: -0.5594 (compared to 1000G reference)
  • Percentile for SUM value -0.1183: 29.5984
  • Z_norm1: 0.0261
  • Z_norm2: 0.0244
  • percentile_MostSimilarPop: 80.2147

PGS000921_hmPOS_GRCh37

  • SUM value from sample: -5.0703
  • Z-score for SUM value -5.0703: -1.9863 (compared to 1000G reference)
  • Percentile for SUM value -5.0703: 2.8112
  • Z_norm1: -2.2731
  • Z_norm2: -2.1926
  • percentile_MostSimilarPop: 0.0000

PGS002308_hmPOS_GRCh37

  • SUM value from sample: 0.2243
  • Z-score for SUM value 0.2243: -1.2707 (compared to 1000G reference)
  • Percentile for SUM value 0.2243: 6.6265
  • Z_norm1: -3.0208
  • Z_norm2: -3.0629
  • percentile_MostSimilarPop: 0.0000

PGS002764_hmPOS_GRCh37

  • SUM value from sample: 0.1875
  • Z-score for SUM value 0.1875: 0.3131 (compared to 1000G reference)
  • Percentile for SUM value 0.1875: 57.9920
  • Z_norm1: 1.1029
  • Z_norm2: 1.0425
  • percentile_MostSimilarPop: 98.4663

PGS002771_hmPOS_GRCh37

  • SUM value from sample: -0.5765
  • Z-score for SUM value -0.5765: -1.5752 (compared to 1000G reference)
  • Percentile for SUM value -0.5765: 7.7510
  • Z_norm1: -2.5792
  • Z_norm2: -2.5555
  • percentile_MostSimilarPop: 0.0000

PGS003725_hmPOS_GRCh37

  • SUM value from sample: -7.5222
  • Z-score for SUM value -7.5222: -0.1440 (compared to 1000G reference)
  • Percentile for SUM value -7.5222: 45.4217
  • Z_norm1: -0.5261
  • Z_norm2: -0.5248
  • percentile_MostSimilarPop: 4.7546

@smlmbrt
Copy link
Member

smlmbrt commented Jul 29, 2024

See #343 for parallel discussion. Have fixed the filters so that low MAF variants can still be included and the next update of the calculator will allow for these changes (and should revert back to the pre-beta behaviour if the filters are removed).

@smlmbrt smlmbrt added the documentation Improvements or additions to documentation label Jul 29, 2024
@nebfield nebfield mentioned this issue Jul 30, 2024
@nebfield
Copy link
Member

The new parameters are available in v2.0.0-beta.2 🥳

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working documentation Improvements or additions to documentation user-query User queries & requests
Projects
None yet
Development

No branches or pull requests

3 participants