Question about ProVar and multiply mutated alleles #1

Closed
chad-hyer opened this issue Dec 10, 2024 · 11 comments

@chad-hyer

Hey, I'm currently performing a proteogenomic study on how mutations impact protein folding stability and came across ProVar. I am using ProVar to take phased VCFs from exome sequencing and generate FASTA files for use in FragPipe. Looking through the outputs of ProVar, I noticed that the sequences include 5' UTRs and 3' UTRs. Is there a setting to only take the translated regions? I could programmatically remove anything untranslated in a separate script, but I just wanted to see how ProVar handles this first.

I also had questions about how ProVar handles multiple mutations on a single allele. If a protein has multiple mutations, does ProVar separate them into separate entries or use phasing data to combine them into two entries (one for allele 1 and one for allele 2)? For my specific workflow, it is important to be able to search for separate isoforms to test for structural differences based on mutations. My searches in FragPipe require searching for a large series of variable PTMs, so it can be very computationally expensive to have extra entries in the search. I'd love to hear your suggestions on how to potentially use ProVar in my workflow.

Thanks,
Chad Hyer

@vasicek58
Collaborator

vasicek58 commented Dec 11, 2024

Hey Chad, thanks for reaching out!

First, I'd strongly recommend using ProHap if you have phased VCFs - this way, all your alleles in the same haplotype will end up in the same protein sequence, reducing the redundancy in your database.

ProHap removes UTR sequences by default. Please let me know if you still see them there; that should not happen with the default config.

We also support the multi-allelic format of the phased VCF, and if using ProHap, all the alleles that belong together will end up in the same protein sequence. For example, if you have these variants in your VCF file:

#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO         FORMAT  SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4
1       123456  var_1   G       A,C     .       .       AF=0.18,0.01 GT      0|1     1|0     2|1     1|1  
1       123476  var_2   C       T       .       .       AF=0.05      GT      0|0     0|1     1|0     0|0  
1       123490  var_3   G       T,A     .       .       AF=0.3,0.05  GT      1|1     1|0     0|2     1|1  

You will get the following combinations of alleles in your protein haplotypes:

  • G, C, T (Sample1 left)
  • A, C, T (Sample1 right, Sample2 left, Sample4 left, Sample4 right)
  • G, T, G (Sample2 right)
  • C, T, G (Sample3 left)
  • A, C, A (Sample3 right)
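
To make the list above easier to check, here is a minimal Python sketch that walks the phased genotypes from the example and collects the allele combination on each chromosome copy. It is only an illustration of how the genotype notation is read, not ProHap's actual implementation; the `RECORDS` structure and the `haplotypes` helper are hypothetical names invented for this example.

```python
# Example records copied from the VCF excerpt above:
# (ID, REF, list of ALT alleles, one phased GT string per sample).
RECORDS = [
    ("var_1", "G", ["A", "C"], ["0|1", "1|0", "2|1", "1|1"]),
    ("var_2", "C", ["T"],      ["0|0", "0|1", "1|0", "0|0"]),
    ("var_3", "G", ["T", "A"], ["1|1", "1|0", "0|2", "1|1"]),
]
SAMPLES = ["Sample1", "Sample2", "Sample3", "Sample4"]


def haplotypes(records, samples):
    """Map each distinct allele combination to the chromosome copies carrying it."""
    combos = {}
    for s_idx, sample in enumerate(samples):
        # Index 0 is the allele left of '|', index 1 the allele right of it.
        for side, label in enumerate(("left", "right")):
            alleles = []
            for _var_id, ref, alts, gts in records:
                allele_idx = int(gts[s_idx].split("|")[side])
                # 0 selects REF; k >= 1 selects the k-th ALT allele.
                alleles.append(ref if allele_idx == 0 else alts[allele_idx - 1])
            combos.setdefault(tuple(alleles), []).append(f"{sample} {label}")
    return combos


if __name__ == "__main__":
    for combo, carriers in haplotypes(RECORDS, SAMPLES).items():
        print(", ".join(combo), "->", "; ".join(carriers))
```

Running it on the four samples yields five distinct combinations, matching the list above.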

If you have VCF files specific to an individual, then you will of course get max. two alternative protein sequences per transcript (note that if you allow alternative splicing, you will still get multiple different sequences per gene).

Does that make sense? Let me know if you have any further questions!

@vasicek58
Collaborator

Note that if you have the two alleles at the same position as two separate entries in your VCF, it will work exactly the same. This file would be equivalent to the example in my previous comment:

#CHROM  POS     ID        REF     ALT   QUAL    FILTER  INFO     FORMAT  SAMPLE1 SAMPLE2 SAMPLE3 SAMPLE4
1       123456  var_1.1   G       A     .       .       AF=0.18  GT      0|1     1|0     0|1     1|1  
1       123456  var_1.2   G       C     .       .       AF=0.01  GT      0|0     0|0     1|0     0|0  
1       123476  var_2     C       T     .       .       AF=0.05  GT      0|0     0|1     1|0     0|0  
1       123490  var_3.1   G       T     .       .       AF=0.3   GT      1|1     1|0     0|0     1|1  
1       123490  var_3.2   G       A     .       .       AF=0.05  GT      0|0     0|0     0|1     0|0  

In other words, I interpret the 0 in the genotype notation as "this alternative allele isn't there", but not as "the reference allele is there".
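
The equivalence between the two files can be sketched in a few lines of Python: each ALT allele of a multi-allelic record becomes its own record, and a genotype index is rewritten to 1 only where it pointed at that ALT. This is a rough illustration under that reading of the genotype notation, not code from ProHap; `split_multiallelic` and the `var_id.k` naming are assumptions made for the example (the AF values in INFO would need to be split the same way and are left out here).

```python
def split_multiallelic(var_id, ref, alts, genotypes):
    """Split one multi-allelic record into one biallelic record per ALT allele.

    A genotype index equal to the ALT's 1-based position becomes 1; any other
    index (reference or a different ALT) becomes 0, i.e. "this ALT is absent".
    """
    records = []
    for k, alt in enumerate(alts, start=1):
        new_gts = []
        for gt in genotypes:
            left, right = gt.split("|")
            new_gts.append(
                "|".join("1" if int(x) == k else "0" for x in (left, right))
            )
        # Keep the original ID when there is nothing to split.
        new_id = var_id if len(alts) == 1 else f"{var_id}.{k}"
        records.append((new_id, ref, alt, new_gts))
    return records


if __name__ == "__main__":
    for rec in split_multiallelic("var_1", "G", ["A", "C"],
                                  ["0|1", "1|0", "2|1", "1|1"]):
        print(rec)
```

Applied to var_1 and var_3 from the first example, this reproduces the genotype columns of the split file above.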

@chad-hyer
Author

Thanks for getting back to me. I just tried out ProHap and ran into some errors that I may need some help unpacking. For reference, I used bcftools to split my combined VCF into individual VCF files for each chromosome and put them in a folder labeled inputs. I then set up ProHap to choose my input files and ran it with the attached config.yaml and igsr_samples.tsv. It gave me this error:

Error in rule filter_phased_vcf:
    jobid: 30
    input: inputs/split.4.vcf
    output: data/vcf/phased/chr4_phased_filtered.vcf
    shell:
        mkdir -p data/vcf/phased ; python3 src/vcf_filter_fix.py -i inputs/split.4.vcf -chr 4 -af 0.01 -af_field AF -o data/vcf/phased/chr4_phased_filtered.vcf
        (one of the commands exited with non-zero exit code; note that snakemake uses bash strict mode!)

I suspect there may be something wrong with my inputs. I have also attached one of the split files and a pre-split file for reference. Do I need to change how they are formatted to make them work with ProHap?

Thanks,
Chad

ProHap Files.zip

@vasicek58
Collaborator

Thanks for sharing the files, it indeed looks like a formatting error. What happened here is that the 1000 Genomes data I used followed an older VCF standard, while your files are up to date with the current standard here (https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format).

I will fix this shortly, so that we support both formats in ProHap. Thanks for bringing this up! I will let you know when the fix is ready.

@chad-hyer
Author

Awesome! Thank you!

@vasicek58
Collaborator

I've pushed the fix now, so I hope your format won't cause any more issues.

In your config file, please set phased_min_af to 0 in line 25, and specify the name of the final FASTA file in line 12. For FragPipe, I also recommend using the simplified fasta format (set this in the config line 13).

You will also need to specify the sex of the individuals in a separate file. This should be simple to make: please refer to https://github.com/ProGenNo/ProHap/wiki/Input-&-Usage#prohap, and give 'ALL' in the population and superpopulation columns. Provide the path to this file in the config line 21.

Then you should be good to go. Let me know if you run into any more issues!

After you've done your search with FragPipe, you can use this pipeline to annotate the peptides with other useful info: https://github.com/ProGenNo/ProHap_PeptideAnnotator

@vasicek58
Collaborator

Sorry, I got a little confused here - are you sure your data is phased? Because I only see two different values in the genotypes in your file: either 1/1, meaning homozygous, or 0/1 meaning heterozygous. I don't think you can distinguish which alleles come together at which copy of the gene. If the VCF was phased, I would expect to see 1/0 too in some cases.
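
A quick way to check this before running the pipeline is to look at the genotype separators: in the VCF format, `/` marks an unphased call and `|` a phased one. The sketch below counts both kinds; it is a rough helper rather than part of ProHap, `phasing_summary` is a hypothetical name, and the demo records are made up for illustration.

```python
def phasing_summary(vcf_lines):
    """Count phased ('|') and unphased ('/') genotype calls in VCF text lines."""
    counts = {"phased": 0, "unphased": 0}
    for line in vcf_lines:
        # Skip meta-information, the column header, and empty lines.
        if line.startswith("#") or not line.strip():
            continue
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT/sample columns on this record
        fmt_keys = fields[8].split(":")
        if "GT" not in fmt_keys:
            continue
        gt_idx = fmt_keys.index("GT")
        for sample in fields[9:]:
            gt = sample.split(":")[gt_idx]
            if "|" in gt:
                counts["phased"] += 1
            elif "/" in gt:
                counts["unphased"] += 1
    return counts


if __name__ == "__main__":
    # Made-up records: one unphased line and one phased line, two samples each.
    demo = [
        "##fileformat=VCFv4.2\n",
        "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tS1\tS2\n",
        "1\t123456\tvar_1\tG\tA\t.\t.\tAF=0.18\tGT\t0/1\t1/1\n",
        "1\t123476\tvar_2\tC\tT\t.\t.\tAF=0.05\tGT:DP\t0|1:12\t1|0:9\n",
    ]
    print(phasing_summary(demo))
```

For a real file, the lines can be passed straight from `open(path)`; a file that reports only unphased calls would need phasing before haplotypes can be built from it.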

@chad-hyer
Author

You are correct. I just realized I provided one of the only file examples that is not phased haha. I'll fix that and run it again. Most of my other VCFs came from the sequencing center phased, but this particular batch was not. Do you have a recommended tool for phasing these VCFs? I also have the raw data. My background is not in genomics, but I could certainly find one and figure it out if needed.

@vasicek58
Collaborator

My background isn't in genomics either, but it looks like Shapeit is a popular tool for phasing: https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html. Hope that helps!

@chad-hyer
Author

Thanks! I've looked a little more into our data, and it looks like it's a little more difficult to get a lot of high quality, phased data for exome sequencing at the depth we have been doing, so identifying the haplotype for some of our more important SNPs wasn't possible. For where we did have good, phased data, ProHap did a great job, but I'll have to handle cases where we don't have phased data in a different way. Good program overall though, and thanks for the good support!

Best,
Chad

@vasicek58
Collaborator

Alright, thanks for the feedback! ProVar could handle the variants one by one where phasing is missing, if that makes sense in your workflow.

Best of luck with the project!
Jakub
