Add XY filtration workflow #191

Open
wants to merge 67 commits into
base: main

@Faizal-Eeman Faizal-Eeman commented Dec 4, 2024

Description

Add XY filtration workflow.

Closes #190

Testing Results

  • N-T paired WGS (sample_sex = XY)

    • samples: NA24149, NA24143
    • input YAML: /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/NA24143.yaml
    • config: /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/filter-xy-gSNP.config
    • output: /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/call-gSNP-10.1.0-rc.1/NA24143/GATK-4.5.0.0/output/
  • N-T paired WGS (sample_sex = XX)

    • samples: NA24149, NA24143
    • input YAML: /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/NA24143.yaml
    • config: /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/test-XX/filter-xx-gSNP.config
    • output: /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/test-XX/call-gSNP-10.1.0-rc.1/NA24143/GATK-4.5.0.0/output/

Checklist

  • I have read the code review guidelines and the code review best practice on GitHub check-list.

  • I have reviewed the Nextflow pipeline standards.

  • The name of the branch is meaningful and well formatted following the standards, using [AD_username (or 5 letters of AD if AD is too long)]-[brief_description_of_branch].

  • I have set up or verified the branch protection rule following the github standards before opening this pull request.

  • I have added my name to the contributors listings in the manifest block in the nextflow.config as part of this pull request, am listed already, or do not wish to be listed. (This acknowledgement is optional.)

  • I have added the changes included in this pull request to the CHANGELOG.md under the next release version or unreleased, and updated the date.

  • I have updated the version number in the metadata.yaml and manifest block of the nextflow.config file following semver, or the version number has already been updated. (Leave it unchecked if you are unsure about new version number and discuss it with the infrastructure team in this PR.)

  • I have tested the pipeline on at least one A-mini sample.

@alkaZeltser

@Faizal-Eeman Would it be possible to add a line to the output VCF header documenting the XY filtration, similar to how bcftools appends every operation to the header? It would be a good way to maintain a record of what has been done to the file.

@alkaZeltser

alkaZeltser commented Jan 8, 2025

Test run complete.

XY filtered call-gSNP output - /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/call-gSNP-10.1.0-rc.1/NA24143/GATK-4.5.0.0/output/Hail-branch-mmootor-fix-spark-permission_GATK-4.5.0.0_TEST_NA24143_Hail-branch-mmootor-fix-spark-permission-GATK-4.5.0.0-TEST-XY-filtered.vcf.bgz

pipeline log - /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/xy-filter.log

I skimmed through the output file and everything looks as expected: PARs remain diploid in both X and Y, and non-PARs are haploid in X and Y. Missing genotypes are always in diploid notation (./.), but I don't think that's an issue. @Faizal-Eeman for completeness' sake, would be good to test a female sample? Unless one of the two is already female?
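
The spot check described above could be automated roughly as follows. This is a minimal sketch under stated assumptions, not the pipeline's code: `PAR_INTERVALS`, `in_par`, `gt_ploidy`, and `genotype_ok_for_xy` are hypothetical names, and the interval coordinates are placeholders rather than real PAR boundaries (the real ones would come from the PAR BED file).

```python
import re

# Placeholder PAR intervals (1-based, inclusive). These are NOT real
# GRCh38 coordinates; in practice they would be read from the PAR BED file.
PAR_INTERVALS = {
    "chrX": [(1, 1000), (9000, 10000)],
    "chrY": [(1, 1000), (5000, 6000)],
}


def in_par(chrom, pos):
    """True if the position falls inside a pseudoautosomal interval."""
    return any(start <= pos <= end for start, end in PAR_INTERVALS.get(chrom, []))


def gt_ploidy(gt):
    """Number of alleles in a VCF GT string such as '0/1', '1', or './.'."""
    return len(re.split(r"[/|]", gt))


def is_missing(gt):
    """True if every allele in the GT string is '.'."""
    return all(allele == "." for allele in re.split(r"[/|]", gt))


def genotype_ok_for_xy(chrom, pos, gt):
    """True if the genotype has the ploidy expected for an XY sample."""
    if is_missing(gt):
        # Missing calls were observed in diploid notation ('./.') and are accepted.
        return True
    if chrom in ("chrX", "chrY") and not in_par(chrom, pos):
        return gt_ploidy(gt) == 1  # non-PAR sex chromosomes: haploid
    return gt_ploidy(gt) == 2  # PARs and autosomes: diploid
```

Running `genotype_ok_for_xy` over every chrX and chrY record of the filtered VCF and collecting the failures would replace the manual skim with a repeatable check.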

@Faizal-Eeman
Author

> @Faizal-Eeman Would it be possible to add a line to the output VCF header documenting the XY filtration, similar to how bcftools appends every operation to the header? It would be a good way to maintain a record of what has been done to the file.

@alkaZeltser For now, I'm appending the script's command line to the VCF header, the way GATK does:

##source=ApplyVQSR
##source=CombineGVCFs
##XYFiltration=<CommandLine=script/filter_xy_call.py --sample_name Hail-0.2.133_GATK-4.5.0.0_TEST_NA24143 --input_vcf GATK-4.5.0.0_TEST_NA24143_VQSR-SNP-AND-INDEL.vcf.gz --vcf_source_file ./vcf_source.txt --sample_sex XY --par_bed pseudoautosomal_regions_hg38.bed --genome_build GRCh38 --output_dir .>

As to the steps in the workflow, I've added a document to ./docs in the repo and referenced it in call-gSNP README.
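
A header line in the shape shown above could be built and placed just before the `#CHROM` line roughly as follows. This is a minimal sketch, not the code in script/filter_xy_call.py; `provenance_header_line` and `add_header_line` are hypothetical helper names, and the example header is made up for illustration.

```python
def provenance_header_line(key, argv):
    """Build a '##key=<CommandLine=...>' meta-information line from an argv list."""
    return f"##{key}=<CommandLine={' '.join(argv)}>"


def add_header_line(vcf_lines, new_line):
    """Return the VCF lines with new_line inserted directly before '#CHROM'."""
    result = []
    for line in vcf_lines:
        if line.startswith("#CHROM"):
            result.append(new_line)
        result.append(line)
    return result


# Illustrative header only; a real VCF header carries many more lines.
header = [
    "##fileformat=VCFv4.2",
    "##source=ApplyVQSR",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO",
]
line = provenance_header_line(
    "XYFiltration",
    ["script/filter_xy_call.py", "--sample_sex", "XY", "--genome_build", "GRCh38"],
)
updated = add_header_line(header, line)
```

Keeping the new line last among the `##` lines preserves the order in which operations were applied, which is the record-keeping property asked for in the review.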

@Faizal-Eeman
Author

> for completeness' sake, would be good to test a female sample? Unless one of the two is already female?

The same sample was also run as an XX case:

  • output VCF - /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/test-XX/call-gSNP-10.1.0-rc.1/NA24143/GATK-4.5.0.0/output/Hail-0.2.133_GATK-4.5.0.0_TEST_NA24143_XY_filtered.vcf.bgz

By the time this test finished, I had updated the Python script to name the output file based on sample_sex. The latest Python script test output is:
/hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/script_test/multisample_XX_filtered.vcf.bgz

Collaborator

@yashpatel6 yashpatel6 left a comment


Looks good! Anything else to add from @alkaZeltser ?

@@ -115,6 +118,8 @@ For normal-only or tumor-only samples, exclude the fields for the other state.
|:----------------|:---------|:-----|:------------|
| `dataset_id` | Yes | string | Dataset ID |
| `blcds_registered_dataset` | Yes | boolean | Set to true when using BLCDS folder structure; use false for now |
| `genome_build` | Yes | string | Genome build, GRCh37 or GRCh38 |
| `sample_sex` | Yes | string | Sample Sex, XY or XX |


question (non-blocking):

@Faizal-Eeman @yashpatel6 We might have touched on this before, but have we tried adjusting the ploidy for the male X and Y chromosomes when running HaplotypeCaller (HC), before this filtering step?

--sample-ploidy 2 \

@tyamaguchi-ucla tyamaguchi-ucla self-assigned this Jan 16, 2025
Comment on lines +208 to +211
| `<Hail>_<GATK>_<dataset_id>_<patient_id>_<sample_sex>_filtered.vcf.bgz` | chrX/Y filtered SNP and INDEL recalibrated variants |
| `<Hail>_<GATK>_<dataset_id>_<patient_id>_<sample_sex>_filtered.vcf.bgz.sha512` | chrX/Y filtered SNP and INDEL recalibrated variants checksum |
| `<Hail>_<GATK>_<dataset_id>_<patient_id>_<sample_sex>_filtered.vcf.bgz.tbi` | chrX/Y filtered SNP and INDEL recalibrated variants index |
| `<Hail>_<GATK>_<dataset_id>_<patient_id>_<sample_sex>_filtered.vcf.bgz.tbi.sha512` | chrX/Y filtered SNP and INDEL recalibrated variants index checksum |


suggestion:

I recommend removing _filtered from the name. Many of our VCF outputs undergo some form of filtering, so the term might not add significant value here. Have we decided whether these outputs will be the final outputs or supplementary final outputs?
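
To make the naming question concrete, the four output names in the table above could be derived from one helper with the suffix as a parameter. This is a hypothetical sketch: `output_names` and its `suffix` argument are illustrative and are not taken from the pipeline.

```python
def output_names(hail, gatk, dataset_id, patient_id, sample_sex, suffix="_filtered"):
    """Return the VCF, index, and checksum file names for one filtered output."""
    if sample_sex not in ("XX", "XY"):
        raise ValueError(f"sample_sex must be 'XX' or 'XY', got {sample_sex!r}")
    vcf = f"{hail}_{gatk}_{dataset_id}_{patient_id}_{sample_sex}{suffix}.vcf.bgz"
    return [vcf, vcf + ".sha512", vcf + ".tbi", vcf + ".tbi.sha512"]


# Current naming, as in the documented output table.
current = output_names("Hail-0.2.133", "GATK-4.5.0.0", "TEST", "NA24143", "XY")

# Naming with the suggestion applied (no '_filtered' suffix).
suggested = output_names(
    "Hail-0.2.133", "GATK-4.5.0.0", "TEST", "NA24143", "XY", suffix=""
)
```

With the suffix isolated like this, adopting or rejecting the suggestion becomes a one-argument change rather than an edit to four name patterns.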

Comment on lines +81 to +82
### 9. Adjust chrX and chrY genotypes based on sample sex from recalibrated VCF
Apply XY filtration workflow to recalibrated VCF as described [here](docs/xy_filtration_workflow.md).

@tyamaguchi-ucla tyamaguchi-ucla Jan 16, 2025


question:

How does this process work with mouse samples (or other species)? Is this process optional?

@alkaZeltser

> The same sample was treated as an XX case,
> output VCF - /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/pipeline_test/test-XX/call-gSNP-10.1.0-rc.1/NA24143/GATK-4.5.0.0/output/Hail-0.2.133_GATK-4.5.0.0_TEST_NA24143_XY_filtered.vcf.bgz
> By the time this test finished, I updated the python script to name the output file based on the sample_sex. Latest python script test output is below,
> /hot/software/pipeline/pipeline-call-gSNP/Nextflow/development/pre-10.1.0-rc.1/mmootor-add-xy-filtration/script_test/multisample_XX_filtered.vcf.bgz

@Faizal-Eeman Hmm, I am still confused. I looked at `bcftools view -r chrY --no-header multisample_XX_filtered.vcf.bgz` and saw many chrY calls, including heterozygous genotypes. If the sample is treated as XX, there should be no chrY calls at all, right?

@Faizal-Eeman
Author

> @Faizal-Eeman hmm I am still confused. I looked at bcftools view -r chrY --no-header multisample_XX_filtered.vcf.bgz and saw many chrY calls, including heterozygous genotypes. If the sample is treated as XX, there should be no chrY calls at all right?

@alkaZeltser Just compared the chrY calls in the XX output and they are all PAR variants. Based on the XY filtration workflow, we only remove non-PAR chrY calls.
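
The XX rule stated above (drop chrY calls outside the PARs, keep everything else) can be sketched as follows. This is an illustration under stated assumptions, not the workflow's implementation: `CHR_Y_PAR`, `in_chr_y_par`, and `filter_xx` are hypothetical names, and the intervals are placeholders rather than real PAR coordinates.

```python
# Placeholder chrY PAR intervals (1-based, inclusive). These are NOT real
# coordinates; the real ones would come from the PAR BED file.
CHR_Y_PAR = [(1, 1000), (5000, 6000)]


def in_chr_y_par(pos):
    """True if a chrY position falls inside a placeholder PAR interval."""
    return any(start <= pos <= end for start, end in CHR_Y_PAR)


def filter_xx(vcf_lines):
    """Keep header lines and all records except non-PAR chrY records."""
    kept = []
    for line in vcf_lines:
        if line.startswith("#"):
            kept.append(line)
            continue
        chrom, pos = line.split("\t", 2)[:2]
        if chrom == "chrY" and not in_chr_y_par(int(pos)):
            continue  # non-PAR chrY call in an XX sample: removed
        kept.append(line)
    return kept


# Made-up records for illustration.
records = [
    "##fileformat=VCFv4.2",
    "chrX\t200\t.\tA\tG\t.\tPASS\t.",
    "chrY\t500\t.\tC\tT\t.\tPASS\t.",   # inside a placeholder PAR: kept
    "chrY\t3000\t.\tG\tA\t.\tPASS\t.",  # outside the PARs: dropped
]
filtered = filter_xx(records)
```

Under this rule, chrY records that survive in an XX output are by construction PAR calls, which matches what the comparison found.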

@alkaZeltser

> @alkaZeltser Just compared the chrY calls in the XX output and they are all PAR variants. Based on the XY filtration workflow, we only remove non-PAR chrY calls.

@Faizal-Eeman OK, but if the individual is truly XX, that means any reads that map to chrY regions, even PAR, are incorrectly mapped, right? They're being drawn away from chrX? And how do you interpret a PAR genotype in an XX individual that has chrY calls? Doesn't that technically result in a triploid XXY dosage?

Development

Successfully merging this pull request may close these issues.

Add XY filtration to correct chrX and chrY variant calls
4 participants