Retrieval of phased sequences #2

Open
mossmatters opened this issue May 17, 2019 · 0 comments

Users who find that their sequences may contain paralogs and/or heterozygous sites (#1) may want to extract multiple sequences per gene for each sample. Recovering paralogous sequences is especially important for projects where some genes have an unknown duplication history. In that case, users will want to build gene trees that include all paralogs, identify where duplications took place, and then select orthologs for species-level phylogenetic analysis.

HybPiper does this with paralog_retriever.py: https://github.com/mossmatters/HybPiper/blob/master/paralog_retriever.py

However, HybPiper assembles multiple contigs, and the retriever can extract a sequence from each of them. The overlap assembler makes only one consensus sequence each time, so there are no alternative contigs to pull paralogs from.

One idea is to use a workflow similar to the "alleles_workflow" I used in a 2018 AJB paper to phase heterozygous sites. The workflow uses BWA (to map reads), Picard (to mark duplicate reads), GATK (to call variants within individuals), and WhatsHap (to phase SNPs using read data). I then have a script (haplonerate.py) that extracts phased FASTA sequences, based on a user decision about what to do with heterozygous sites outside of the largest phased block.

https://github.com/mossmatters/phyloscripts/tree/master/alleles_workflow
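To make the last step concrete, here is a minimal sketch of the idea of building two haplotype sequences from phased SNPs. It is not the code from haplonerate.py. The function name `phased_haplotypes`, the dictionary layout of each variant, and the two `outside` options are all assumptions made for illustration. It assumes biallelic SNPs only, 0-based positions, phased genotypes written with `|` as in a VCF, and a phase-set ID per variant (the `PS` field that WhatsHap writes). Heterozygous sites outside the largest phase block are either replaced with an IUPAC ambiguity code or left as the reference base.

```python
from collections import Counter

# IUPAC ambiguity codes for the six unordered pairs of bases.
IUPAC = {
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
}


def phased_haplotypes(reference, variants, outside="ambiguity"):
    """Return (hap1, hap2) for one gene of one sample.

    reference : consensus sequence the reads were mapped to
    variants  : list of dicts with keys (hypothetical layout):
                pos (0-based int), ref, alt, gt ("0|1", "1|0", "0/1", "1/1"),
                ps (phase-set ID, or None if unphased)
    outside   : what to do with heterozygous SNPs that are not in the
                largest phase block: "ambiguity" or "reference"
    """
    if outside not in ("ambiguity", "reference"):
        raise ValueError("outside must be 'ambiguity' or 'reference'")

    # The largest block is the phase set with the most phased SNPs.
    block_sizes = Counter(
        v["ps"] for v in variants if "|" in v["gt"] and v["ps"] is not None
    )
    largest = block_sizes.most_common(1)[0][0] if block_sizes else None

    hap1, hap2 = list(reference), list(reference)
    for v in variants:
        pos = v["pos"]
        if reference[pos] != v["ref"]:
            raise ValueError(f"reference mismatch at position {pos}")
        alleles = (v["ref"], v["alt"])
        sep = "|" if "|" in v["gt"] else "/"
        a, b = (int(x) for x in v["gt"].split(sep))

        if a == b:
            # Homozygous: both haplotypes carry the same allele.
            hap1[pos] = hap2[pos] = alleles[a]
        elif sep == "|" and largest is not None and v["ps"] == largest:
            # Phased and inside the largest block: split by haplotype.
            hap1[pos], hap2[pos] = alleles[a], alleles[b]
        elif outside == "ambiguity":
            # Phase unknown relative to the largest block.
            hap1[pos] = hap2[pos] = IUPAC[frozenset(alleles)]
        # outside == "reference": leave the reference base in place.

    return "".join(hap1), "".join(hap2)


example_variants = [
    {"pos": 1, "ref": "C", "alt": "T", "gt": "0|1", "ps": 1},
    {"pos": 3, "ref": "T", "alt": "G", "gt": "1|0", "ps": 1},
    {"pos": 6, "ref": "G", "alt": "A", "gt": "0|1", "ps": 7},
    {"pos": 8, "ref": "A", "alt": "C", "gt": "1/1", "ps": None},
]
h1, h2 = phased_haplotypes("ACGTACGTAC", example_variants)
print(">sample_h1\n" + h1)
print(">sample_h2\n" + h2)
```

In this toy example the two SNPs in phase set 1 are split between the haplotypes, while the SNP in the smaller phase set 7 becomes `R` in both. Writing an ambiguity code keeps the site visible as heterozygous without asserting a phase that the reads did not support.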
