Ancestry clustering yielding incorrect result #333

Fiwx · 2024-07-06T02:14:39Z

Description of the bug

On the most recent dev branch, I have noticed VCF files having incorrect ancestry results. That is, the PC values that are assigned are unusual and aren't near any other value in 1000G, and the ancestry assignment (closest population) is incorrect: the ancestry does not (not even close) to matching the self-reported ancestry of the samples, and previous runs on these samples have resulted in correct ancestry assignments (and normal PC values). I am using the run ancestry flag.

These are imputed VCFs. The match rate to the reference panel is about 1/3, and the match rate to the scorefile is just under 100%. The genome build used is correct.

I'm wondering if there are any ideas that I can explore to fix this or if this behavior has been seen before. Along with the failed ancestry analysis, the Z_norm2 values (which seem normal) are discrepant from the percentile in the most similar population values (which are at 100.0%, abnormally high).

I was able to replicate this across multiple VCFs, though each VCF was submitted one at a time.

Command used and terminal output

No response

Relevant files

No response

System information

Linux. -dev branch of pgsc_calc. Singularity.

nebfield · 2024-07-08T14:10:07Z

Thanks for the bug report. We did recently change how variants are filtered for PCA. @smlmbrt will investigate after his holidays 🌴

Can I double check if you're deleting the cache each time you use a new VCF?

Fiwx · 2024-07-08T17:42:20Z

I deleted the work directory and the results directory before running. Does removing the work directory delete the cache? I did not use –genotypes_cache. I did not attempt to do anything with NXF_SINGULARITY_CACHEDIR.

I also tested this on a earlier, but still recent version (I believe v2.0.0-beta), and had the same issue. Congratulations on the preprint, by the way.

smlmbrt · 2024-07-16T10:46:38Z

@Fiwx, are you applying the pipeline to individual samples? In v2.0.0-beta I added an additional filter for PCA variant inclusion that the MAF > 10% and the Missingness < 10% for variants in the target sample

pgsc_calc/modules/local/ancestry/intersect_variants.nf

Lines 33 to 39 in 0f33b4c

    
               pgscatalog-intersect --ref $ref_variants \ 
        
                   --target $variants \ 
        
                   --chrom $meta.chrom \ 
        
                   --maf_target 0.1 \ 
        
                   --geno_miss 0.1 \ 
        
                   --outdir . \ 
        
                   -v

For individual samples reducing the MAF threshold to 0 should revert to the old behaviour. I think it's still sensible to applying the missingness filter to these variants though. It would be great if you could let us know if that fixes it, we may consider some sort of --single_sample flag that would then change these filters depending on how many people the PGS is being calculated for.

Fiwx · 2024-07-16T20:48:45Z

Yes, I saw those changes, and I also thought they were sensible. I'm not sure why they would cause an issue. By the way, I think the documentation still mentions 5% threshold for MAF: "minor allele frequency [MAF > 5%]."

I am running the pipeline with single samples, yes. Is there a MAF threshold flag/config, or should I just edit --maf_target 0.1 in the source?

Fiwx · 2024-07-16T21:54:58Z

@smlmbrt The results are biased towards extreme percentile values. Here are some examples from different scores:

            "Z_norm2": 1.9385237149563703,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": 2.644385079008054,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": -2.3964278919739876,
            "percentile_MostSimilarPop": 0.0

            "Z_norm2": 1.7975167781748878,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": -3.0629493759113258,
            "percentile_MostSimilarPop": 0.0

            "Z_norm2": -2.5554729425082163,
            "percentile_MostSimilarPop": 0.0

It also shows up in an unusual location in the PCA chart, away from other samples in 1000G.

Run information:

    "run_ancestry": "/home/user/data/pgsc_1000G_v1.tar.zst",
    "geno_ref": 0.1,
    "mind_ref": 0.1,
    "maf_ref": 0.05,
    "hwe_ref": 0.0001,
    "indep_pairwise_ref": "1000 50 0.05",
    "projection_method": "oadp",
    "ancestry_method": "RandomForest",
    "ref_label": "SuperPop",
    "n_popcomp": 5,
    "normalization_method": "empirical mean mean+var",
    "n_normalization": 4,
    "liftover": false,
    "target_build": "GRCh37",
    "min_lift": 0.95,
    "min_overlap": 0.0,

smlmbrt · 2024-07-17T10:29:55Z

@smlmbrt The results are biased towards extreme percentile values. Here are some examples from different scores:

            "Z_norm2": 1.9385237149563703,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": 2.644385079008054,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": -2.3964278919739876,
            "percentile_MostSimilarPop": 0.0

            "Z_norm2": 1.7975167781748878,
            "percentile_MostSimilarPop": 100.0

            "Z_norm2": -3.0629493759113258,
            "percentile_MostSimilarPop": 0.0

            "Z_norm2": -2.5554729425082163,
            "percentile_MostSimilarPop": 0.0

Are these the original results or after editing the module? Could you edit to --maf_target 0 --geno_miss 1 or --maf_target 0 --geno_miss 0.1 and re-running without any cacheing to see if that fixes it?

Fiwx · 2024-07-23T20:20:41Z

This was without changing any parameters. I will try those parameters now.

For maf_target, this will return the minor allele frequency at a position. I see the usefulness (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7077175/). Ancestry adjustment is done on the PC level for the Z_norms, so perhaps a large difference in one PC value could cause a large difference in score. For example, this figure shows large eigenvalue change right above MAF 0.02: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4143691/figure/F1/, and shows a transient jump at around 0.1, where the maf_target is currently set. Others have various stances: https://onlinelibrary.wiley.com/doi/epdf/10.1111/1755-0998.12995.

When seeking to exclude only singletons in alignments with missing data
(a ubiquitous problem for reduced‐representation library prepara‐
tion methods), it is preferable to filter by the count (rather than
frequency) of the minor allele, because variation in the amount of
missing data across an alignment will cause a static frequency cut‐
off to remove different SFS classes at different sites
https://github.com/cjbattey/LinckBattey2017_MAF_clustering/tree/master

For geno_miss, the PLINK .vmiss documentation is a bit unclear on this. What does geno_miss do? In other words, what counts as a missing dosage?
I can think of a few interpretations:

chrom:pos pair being present
position containing no missing genotype information
position containing the exact same genotype
position containing the exact same genotype with the same ref/alt assignment
how samples have missing dosage data for an allele, out of 2 * number of samples

Fiwx · 2024-07-23T21:15:31Z

I ran it with the line changed to --maf_target 0 in modules/local/ancestry/intersect_variants.nf and got:

Jul-23 16:53:32.703 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: RUNNING; exit: -; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 16:58:32.757 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: RUNNING; exit: -; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 17:03:32.814 [Task monitor] DEBUG n.processor.TaskPollingMonitor - !! executor local > tasks to be completed: 1 -- submitted tasks are shown below
~> TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: RUNNING; exit: -; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 17:04:16.189 [Task monitor] DEBUG n.processor.TaskPollingMonitor - Task completed > TaskHandler[id: 5; name: PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); status: COMPLETED; exit: 1; error: -; workDir: /home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce]
Jul-23 17:04:16.194 [TaskFinalizer-6] DEBUG nextflow.processor.TaskProcessor - Handling unexpected condition for
  task: name=PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL); work-dir=/home/ubuntu/user/script/test/file/work/21/2b7ed6cc085c2e89b7ce925e1d52ce
  error [nextflow.exception.ProcessFailedException]: Process `PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL)` terminated with an error exit status (1)
Jul-23 17:04:16.223 [TaskFinalizer-6] ERROR nextflow.processor.TaskProcessor - Error executing process > 'PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL)'

Caused by:
  Process `PGSCATALOG_PGSCCALC:PGSCCALC:ANCESTRY_PROJECT:INTERSECT_VARIANTS (file chromosome ALL)` terminated with an error exit status (1)


Command executed:

  pgscatalog-intersect --ref GRCh37_1000G_ALL.pvar.zst         --target GRCh37_file_ALL.pvar.zst         --chrom ALL         --maf_target 0         --geno_miss 0.1         --outdir .         -v
  
  n_matched=$(sed -n '3p' intersect_counts_ALL.txt)
  
  if [ $n_matched == "0" ]
  then
      echo "ERROR: No variants in intersection"
      exit 1
  else
      mv matched_variants.txt.gz file_ALL_matched.txt.gz
  fi
  
  cat <<-END_VERSIONS > versions.yml
  INTERSECT_VARIANTS:
      pgscatalog.match: $(echo $(python -c 'import pgscatalog.match; print(pgscatalog.match.__version__)'))
  END_VERSIONS

Command exit status:
  1

Command output:
  (empty)

Command error:
  pgscatalog.match.cli.intersect_cli: 2024-07-23 17:01:42 INFO     Processed 12000000 TARGET variants

[...]

  pgscatalog.match.cli.intersect_cli: 2024-07-23 17:04:16 INFO     Processed 27904794 TARGET variants
  pgscatalog.match.cli.intersect_cli: 2024-07-23 17:04:16 INFO     Outputting TARGET variants -> target_variants.txt.gz
  Traceback (most recent call last):
    File "/app/pgscatalog.utils/.venv/bin/pgscatalog-intersect", line 8, in <module>
      sys.exit(run_intersect())
               ^^^^^^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/intersect_cli.py", line 172, in run_intersect
      for v in heapq.merge(
    File "/usr/local/lib/python3.11/heapq.py", line 376, in merge
      h_append([key(value), order * direction, value, next])
                ^^^^^^^^^^
    File "/app/pgscatalog.utils/.venv/lib/python3.11/site-packages/pgscatalog/match/cli/intersect_cli.py", line 174, in <lambda>
      key=lambda v: (v["CHR:POS:A0:A1"], v["ID_TARGET"], v["REF_TARGET"]),
                     ~^^^^^^^^^^^^^^^^^
  KeyError: 'CHR:POS:A0:A1'
  cp: '.command.out' and '.command.out' are the same file
  cp: '.command.err' and '.command.err' are the same file
  cp: '.command.trace' and '.command.trace' are the same file
[...]
Jul-23 17:04:16.239 [TaskFinalizer-6] INFO  nextflow.Session - Execution cancelled -- Finishing pending tasks before exit
Jul-23 17:04:16.240 [Actor Thread 38] ERROR nextflow.Nextflow - ERROR: Matching subworkflow failed
Jul-23 17:04:16.251 [main] DEBUG nextflow.Session - Session await > all processes finished
Jul-23 17:04:16.252 [Task monitor] DEBUG n.processor.TaskPollingMonitor - <<< barrier arrives (monitor: local) - terminating tasks monitor poll loop
Jul-23 17:04:16.252 [main] DEBUG nextflow.Session - Session await > all barriers passed
Jul-23 17:04:16.260 [Actor Thread 35] ERROR nextflow.Nextflow - ERROR: No results report written!
Jul-23 17:04:16.268 [Actor Thread 42] ERROR nextflow.Nextflow - ERROR: No scores calculated!
Jul-23 17:04:16.275 [Actor Thread 45] ERROR nextflow.Nextflow - ERROR: Projection subworkflow failed

The logs show this command was ran:
pgscatalog-intersect --ref GRCh37_1000G_ALL.pvar.zst --target GRCh37_file_ALL.pvar.zst --chrom ALL --maf_target 0 --geno_miss 0.1 --outdir . -v

The target_variants.txt.gz file only contains a header line with the expected 'CHR:POS:A0:A1' column, with no other lines.

Fiwx · 2024-07-24T00:00:32Z

I changed it to 0.0 instead of "0" and it ran through. Here are the results:

Before and after the change:

PC	DEFAULT	WITH CHANGE
PC1	23.60	23.66
PC2	-30.01	-30.60
PC3	-0.00	0.22
PC4	8.52	8.74
PC5	0.37	-0.10
PC6	0.30	0.67
PC7	-1.00	0.19
PC8	-7.53	-7.64
PC9	1.29	0.87
PC10	-1.36	-1.48

The random forest ancestry assignment probabilities were a bit more accurate after the change. The main issue of extreme percentile values has remained the same: a large percentage of the scores are very high (100.0) or low (0.0).

Changing geno_miss to 1.0, I got the same PC results, and the same issue with the extreme percentile scores.

Example for various scores, with --maf_target 0.0 and --geno_miss 1.0 set:

Here is an example of a normal result:

PGS002228_hmPOS_GRCh37

SUM value from sample: 1.0793
Z-score for SUM value 1.0793: -0.1270 (compared to 1000G reference)
Percentile for SUM value 1.0793: 45.2610
Z_norm1: 0.1183
Z_norm2: 0.1164
percentile_MostSimilarPop: 52.3006

The below results seem to be off, and happen for most scores:

PGS000758_hmPOS_GRCh37

SUM value from sample: 0.4187
Z-score for SUM value 0.4187: 2.0613 (compared to 1000G reference)
Percentile for SUM value 0.4187: 97.0281
Z_norm1: 1.9390
Z_norm2: 1.7975
percentile_MostSimilarPop: 100.0000

PGS000892_hmPOS_GRCh37

SUM value from sample: -0.1183
Z-score for SUM value -0.1183: -0.5594 (compared to 1000G reference)
Percentile for SUM value -0.1183: 29.5984
Z_norm1: 0.0261
Z_norm2: 0.0244
percentile_MostSimilarPop: 80.2147

PGS000921_hmPOS_GRCh37

SUM value from sample: -5.0703
Z-score for SUM value -5.0703: -1.9863 (compared to 1000G reference)
Percentile for SUM value -5.0703: 2.8112
Z_norm1: -2.2731
Z_norm2: -2.1926
percentile_MostSimilarPop: 0.0000

PGS002308_hmPOS_GRCh37

SUM value from sample: 0.2243
Z-score for SUM value 0.2243: -1.2707 (compared to 1000G reference)
Percentile for SUM value 0.2243: 6.6265
Z_norm1: -3.0208
Z_norm2: -3.0629
percentile_MostSimilarPop: 0.0000

PGS002764_hmPOS_GRCh37

SUM value from sample: 0.1875
Z-score for SUM value 0.1875: 0.3131 (compared to 1000G reference)
Percentile for SUM value 0.1875: 57.9920
Z_norm1: 1.1029
Z_norm2: 1.0425
percentile_MostSimilarPop: 98.4663

PGS002771_hmPOS_GRCh37

SUM value from sample: -0.5765
Z-score for SUM value -0.5765: -1.5752 (compared to 1000G reference)
Percentile for SUM value -0.5765: 7.7510
Z_norm1: -2.5792
Z_norm2: -2.5555
percentile_MostSimilarPop: 0.0000

PGS003725_hmPOS_GRCh37

SUM value from sample: -7.5222
Z-score for SUM value -7.5222: -0.1440 (compared to 1000G reference)
Percentile for SUM value -7.5222: 45.4217
Z_norm1: -0.5261
Z_norm2: -0.5248
percentile_MostSimilarPop: 4.7546

smlmbrt · 2024-07-29T11:40:49Z

See #343 for parallel discussion. Have fixed the filters so that low MAF variants can still be included and the next update of the calculator will allow for these changes (and should revert back to the pre-beta behaviour if the filters are removed).

nebfield · 2024-07-31T12:30:59Z

The new parameters are available in v2.0.0-beta.2 🥳

Fiwx added the bug Something isn't working label Jul 6, 2024

nebfield assigned smlmbrt Jul 8, 2024

nebfield added this to the v2.0.0-beta.2 milestone Jul 8, 2024

smlmbrt added the user-query User queries & requests label Jul 16, 2024

This was referenced Jul 28, 2024

Inconsistent ancestry assignments in v2.0.0-beta vs. alpha #343

Closed

Fix intersection filters (default: no filtering) PGScatalog/pygscatalog#33

Merged

smlmbrt added the documentation Improvements or additions to documentation label Jul 29, 2024

nebfield mentioned this issue Jul 30, 2024

v2.0.0-beta.2 #341

Merged

nebfield closed this as completed Jul 31, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ancestry clustering yielding incorrect result #333

Ancestry clustering yielding incorrect result #333

Fiwx commented Jul 6, 2024

nebfield commented Jul 8, 2024

Fiwx commented Jul 8, 2024 •

edited

Loading

smlmbrt commented Jul 16, 2024

Fiwx commented Jul 16, 2024

Fiwx commented Jul 16, 2024 •

edited

Loading

smlmbrt commented Jul 17, 2024

Fiwx commented Jul 23, 2024

Fiwx commented Jul 23, 2024 •

edited

Loading

Fiwx commented Jul 24, 2024 •

edited

Loading

smlmbrt commented Jul 29, 2024 •

edited

Loading

nebfield commented Jul 31, 2024

Ancestry clustering yielding incorrect result #333

Ancestry clustering yielding incorrect result #333

Comments

Fiwx commented Jul 6, 2024

Description of the bug

Command used and terminal output

Relevant files

System information

nebfield commented Jul 8, 2024

Fiwx commented Jul 8, 2024 • edited Loading

smlmbrt commented Jul 16, 2024

Fiwx commented Jul 16, 2024

Fiwx commented Jul 16, 2024 • edited Loading

smlmbrt commented Jul 17, 2024

Fiwx commented Jul 23, 2024

Fiwx commented Jul 23, 2024 • edited Loading

Fiwx commented Jul 24, 2024 • edited Loading

smlmbrt commented Jul 29, 2024 • edited Loading

nebfield commented Jul 31, 2024

Fiwx commented Jul 8, 2024 •

edited

Loading

Fiwx commented Jul 16, 2024 •

edited

Loading

Fiwx commented Jul 23, 2024 •

edited

Loading

Fiwx commented Jul 24, 2024 •

edited

Loading

smlmbrt commented Jul 29, 2024 •

edited

Loading