Question about ProVar and multiply mutated alleles #1
Comments
Hey Chad, thanks for reaching out! First, I'd strongly recommend using ProHap if you have phased VCFs. This way, all the alleles in the same haplotype end up in the same protein sequence, which reduces redundancy in your database. ProHap removes UTR sequences by default. Please let me know if you still see them there, as that should not happen with the default config. We also support the multi-allelic format of the phased VCF, and if you use ProHap, all the alleles that belong together will end up in the same protein sequence. For example, if you have these variants in your VCF file:
You will get the following combinations of alleles in your protein haplotypes:
If you have VCF files specific to an individual, then you will of course get at most two alternative protein sequences per transcript (note that if you allow alternative splicing, you will still get multiple different sequences per gene). Does that make sense? Let me know if you have any further questions!
Note that if you have the two alleles at the same position as two separate entries in your VCF, it will work exactly the same. This file would be equivalent to the example in my previous comment:
In other words, I interpret the 0 in the genotype notation as "this alternative allele isn't there", but not as "the reference allele is there".
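The interpretation described above can be illustrated with a minimal sketch. This is not ProHap's code: the function name `alleles_per_haplotype` and the two example records are hypothetical (the example VCFs from the comments are not reproduced here), and the sketch assumes a tab-separated, phased VCF with a `GT` key in the FORMAT column. A `0` is simply skipped, so a multi-allelic record and the same alleles split over two records yield the same haplotypes.

```python
from collections import defaultdict


def alleles_per_haplotype(vcf_lines, sample_index=0):
    """Group the alternative alleles of one sample by haplotype (0 or 1).

    A genotype code of 0 is read as "this alternative allele is absent",
    so it adds nothing to the haplotype.
    """
    haplotypes = defaultdict(list)
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        chrom, pos, ref = fields[0], int(fields[1]), fields[3]
        alts = fields[4].split(",")
        format_keys = fields[8].split(":")
        sample_values = fields[9 + sample_index].split(":")
        genotype = sample_values[format_keys.index("GT")]
        if "|" not in genotype:
            continue  # unphased call: haplotype membership is unknown
        for haplotype, code in enumerate(genotype.split("|")):
            if code in ("0", "."):
                continue  # no alternative allele on this haplotype
            haplotypes[haplotype].append((chrom, pos, ref, alts[int(code) - 1]))
    return dict(haplotypes)


# Hypothetical records, made up for illustration only.
HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tSAMPLE1"
multi_allelic = [HEADER, "1\t100\t.\tA\tC,G\t.\tPASS\t.\tGT\t1|2"]
split_records = [
    HEADER,
    "1\t100\t.\tA\tC\t.\tPASS\t.\tGT\t1|0",
    "1\t100\t.\tA\tG\t.\tPASS\t.\tGT\t0|1",
]

same = alleles_per_haplotype(multi_allelic) == alleles_per_haplotype(split_records)
```

Here `same` is expected to be true: both inputs place the C allele on the first haplotype and the G allele on the second.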
Thanks for getting back to me. I just tried out ProHap and ran into some errors that I may need some help unpacking. For reference, I used bcftools to split my combined VCF into individual VCF files for each chromosome and put them in a folder labeled inputs. I then set up ProHap to choose my input files and ran it with the attached config.yaml and igsr_samples.tsv. It gave me this error:
I suspect there may be something wrong with my inputs. I have also attached one and a presplit file for reference. Do I need to change how they are formatted to make them work with ProHap? Thanks,
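For readers who want to reproduce the per-chromosome split without bcftools, a rough plain-Python sketch follows. It is an assumption-laden illustration, not the command used in this thread: the function name `split_vcf_by_chromosome` and the `<chromosome>.vcf` output naming are hypothetical, the outputs are uncompressed, and one file handle is kept open per chromosome, which is fine for a few dozen chromosomes but not for thousands of contigs.

```python
import gzip
from pathlib import Path


def split_vcf_by_chromosome(vcf_path, out_dir):
    """Write one uncompressed VCF per chromosome, each with the full header.

    Returns the sorted list of chromosome names that were written.
    """
    vcf_path = Path(vcf_path)
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    # Read gzip-compressed input transparently, plain text otherwise.
    opener = gzip.open if vcf_path.suffix == ".gz" else open
    header_lines = []
    handles = {}
    try:
        with opener(vcf_path, "rt") as source:
            for line in source:
                if line.startswith("#"):
                    header_lines.append(line)  # meta lines and column header
                    continue
                chrom = line.split("\t", 1)[0]
                if chrom not in handles:
                    # First record for this chromosome: start a new file
                    # and copy the header into it.
                    handles[chrom] = open(out_dir / f"{chrom}.vcf", "w")
                    handles[chrom].writelines(header_lines)
                handles[chrom].write(line)
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Calling `split_vcf_by_chromosome("combined.vcf", "inputs")` (both paths are placeholders) would fill the `inputs` folder with one file per chromosome found in the combined file.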
Thanks for sharing the files, it indeed looks like a formatting error. What happened here is that the 1000 Genomes data I used followed an older VCF standard, while your files are up to date with the current standard here (https://gatk.broadinstitute.org/hc/en-us/articles/360035531692-VCF-Variant-Call-Format). I will fix this shortly, so that we support both formats in ProHap. Thanks for bringing this up! I will let you know when the fix is ready.
Awesome! Thank you!
I've pushed the fix now, so I hope your format won't cause any more issues. In your config file, please set
You will also need to specify the sex of the individuals, which requires a separate file. It should be simple to make: please refer to https://github.com/ProGenNo/ProHap/wiki/Input-&-Usage#prohap, and give 'ALL' in the population and superpopulation columns. Provide the path to this file on line 21 of the config. Then you should be good to go. Let me know if you run into any more issues!
After you've done your search with FragPipe, you can use this pipeline to annotate the peptides with other useful info: https://github.com/ProGenNo/ProHap_PeptideAnnotator
Sorry, I got a little confused here - are you sure your data is phased? Because I only see two different values in the genotypes in your file: either
You are correct. I just realized I provided one of the only example files that is not phased, haha. I'll fix that and run it again. Most of my other VCFs came from the sequencing center phased, but this particular batch was not. Do you have a recommended tool for phasing these VCFs? I also have the raw data. My background is not in genomics, but I could certainly find one and figure it out if needed.
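Whether a file is phased can be checked from the genotype separator: in the VCF format, `|` separates phased alleles and `/` separates unphased ones. The sketch below counts both kinds of calls. The function name `phasing_summary` is hypothetical, and the code assumes tab-separated records with a FORMAT column containing `GT`; it is not part of ProHap.

```python
from collections import Counter


def phasing_summary(vcf_lines):
    """Count phased ('|') and unphased ('/') genotype calls in VCF lines."""
    counts = Counter()
    for line in vcf_lines:
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and header lines
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 10:
            continue  # no FORMAT or sample columns in this record
        format_keys = fields[8].split(":")
        if "GT" not in format_keys:
            continue  # record carries no genotype
        gt_index = format_keys.index("GT")
        for sample in fields[9:]:
            genotype = sample.split(":")[gt_index]
            if "|" in genotype:
                counts["phased"] += 1
            elif "/" in genotype:
                counts["unphased"] += 1
            else:
                counts["other"] += 1  # haploid or missing call
    return counts
```

Passing an open file handle works as well as a list of lines, for example `phasing_summary(open("sample.vcf"))` with a placeholder path. A file whose summary shows only unphased calls would need phasing before ProHap can assign alleles to haplotypes.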
My background isn't in genomics either, but it looks like Shapeit is a popular tool for phasing: https://mathgen.stats.ox.ac.uk/genetics_software/shapeit/shapeit.html. Hope that helps!
Thanks! I've looked a little more into our data, and it turns out to be difficult to get a lot of high-quality phased data for exome sequencing at the depth we have been doing, so identifying the haplotype for some of our more important SNPs wasn't possible. Where we did have good, phased data, ProHap did a great job, but I'll have to handle the cases where we don't have phased data in a different way. Good program overall though, and thanks for the good support! Best,
Alright, thanks for the feedback! ProVar could handle the variants one by one where phasing is missing, if that makes sense in your workflow. Best of luck with the project!
Hey, I'm currently performing a proteogenomic study on how mutations impact protein folding stability and came across ProVar. I am using ProVar to take phased VCFs from exome sequencing and generate FASTA files for use in FragPipe. Looking through the outputs of ProVar, I noticed that the sequences include 5' UTRs and 3' UTRs. Is there a setting to only take the translated regions? I could programmatically remove anything untranslated in a separate script, but I just wanted to see how ProVar handles this first.
I also had questions about how ProVar handles multiple mutations on a single allele. If a protein has multiple mutations, does ProVar separate them into separate entries or use phasing data to combine them into two entries (one for allele 1 and one for allele 2)? For my specific workflow, it is important to be able to search for separate isoforms to test for structural differences based on mutations. My searches in FragPipe require searching for a large series of variable PTMs, so it can be very computationally expensive to have extra entries in the search. I'd love to hear your suggestions on how to potentially use ProVar in my workflow.
Thanks,
Chad Hyer