Retrieval of phased sequences #2

mossmatters · 2019-05-17T18:29:01Z

Users who find that their sequences may contain paralogs and/or heterozygous sequences (#1) may want to extract multiple sequences per gene for each sample. Recovering paralogous sequences is especially important for projects where some genes have an unknown duplication history. Users will want to build gene trees that include multiple paralogs so they can identify where duplications took place and select orthologs for species-level phylogenetic analysis.

In HybPiper this is done by the paralog_retriever.py script: https://github.com/mossmatters/HybPiper/blob/master/paralog_retriever.py

However, HybPiper assembles multiple contigs, and the retriever can extract a sequence from each of them. The overlap assembler produces only one consensus sequence each time, so there are no alternative contigs to retrieve.

One idea is to use a workflow similar to the "alleles_workflow" I used in a 2018 AJB paper to phase heterozygous sites. The workflow uses BWA (to map reads), Picard (to mark duplicate reads), GATK (to call variants within individuals), and WhatsHap (to phase SNPs using read data). I then have a script (haplonerate.py) that extracts phased FASTA sequences based on a user decision about how to handle sites outside of the largest phased block.

https://github.com/mossmatters/phyloscripts/tree/master/alleles_workflow
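
To make the last step concrete, here is a rough sketch of how two haplotype sequences could be built from phased SNPs. This is not the code in haplonerate.py; it is only an illustration under several assumptions: variants are biallelic SNPs already parsed into dicts (the keys `pos`, `ref`, `alt`, `gt`, `ps` are made up for this example, loosely mirroring the VCF `GT` and `PS` fields), the "largest" phased block is taken to be the phase set with the most variants, and the `outside` option with its two values is a hypothetical stand-in for the user decision.

```python
from collections import Counter

# IUPAC codes for two-base ambiguities (keys are alphabetically sorted pairs).
IUPAC = {"AC": "M", "AG": "R", "AT": "W", "CG": "S", "CT": "Y", "GT": "K"}


def largest_phase_set(variants):
    """Return the phase-set ID shared by the most variants, or None."""
    counts = Counter(v["ps"] for v in variants if v["ps"] is not None)
    return counts.most_common(1)[0][0] if counts else None


def phased_haplotypes(reference, variants, outside="ambiguity"):
    """Build two haplotype strings from biallelic SNP records.

    Each variant is a dict with: pos (1-based), ref, alt, gt (a pair of
    0/1 allele indices) and ps (phase-set ID, or None if unphased).
    `outside` decides what happens to heterozygous sites that are not in
    the largest phase set: "ambiguity" writes an IUPAC code to both
    haplotypes, "reference" leaves the reference base in place.
    """
    if outside not in ("ambiguity", "reference"):
        raise ValueError(f"unknown option for outside: {outside!r}")
    keep = largest_phase_set(variants)
    hap1, hap2 = list(reference), list(reference)
    for v in variants:
        i = v["pos"] - 1
        if reference[i] != v["ref"]:
            raise ValueError(f"reference mismatch at position {v['pos']}")
        alleles = (v["ref"], v["alt"])
        a1, a2 = v["gt"]
        if a1 == a2:
            # Homozygous site: both haplotypes get the same allele.
            hap1[i] = hap2[i] = alleles[a1]
        elif keep is not None and v["ps"] == keep:
            # Heterozygous site inside the largest block: use the phase.
            hap1[i], hap2[i] = alleles[a1], alleles[a2]
        elif outside == "ambiguity":
            # Heterozygous site outside the block: phase is not comparable.
            hap1[i] = hap2[i] = IUPAC["".join(sorted(alleles))]
        # outside == "reference": leave the reference base untouched.
    return "".join(hap1), "".join(hap2)


reference = "ACGTACGTAC"
variants = [
    {"pos": 2, "ref": "C", "alt": "T", "gt": (0, 1), "ps": 2},
    {"pos": 5, "ref": "A", "alt": "G", "gt": (1, 0), "ps": 2},
    {"pos": 7, "ref": "G", "alt": "A", "gt": (1, 1), "ps": None},
    {"pos": 9, "ref": "A", "alt": "C", "gt": (0, 1), "ps": 9},
]
print(phased_haplotypes(reference, variants))
# ('ACGTGCATMC', 'ATGTACATMC')
```

In this toy input, positions 2 and 5 share a phase set and are written to separate haplotypes, position 7 is homozygous for the alternate allele, and position 9 sits in a smaller block, so it becomes an ambiguity code (or stays as the reference base with `outside="reference"`). Indels and multi-allelic sites would need extra handling.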
