The pipeline
Pavel V. Dimens edited this page Dec 6, 2021
So what does gust actually do? Well, a few things, so let's walk through them.
- FASTA format assemblies are converted to FASTQ format with dummy quality scores
- The FASTQ'd assemblies are "fragmented" by creating a sliding window that advances by 1 bp
- The fragmented assemblies are mapped against the reference genome
- The alignments are used to call SNPs with freebayes
- The raw SNPs are filtered to retain only the highest-quality sites:
  - no missing data
  - a bunch of quality filters
  - indels decomposed
  - all alleles in the reference sample must be the reference allele (anything else is a genotyping error)
- SNPs are thinned to retain x SNPs every y basepairs (to reduce data size and redundancy)
- VCF is converted into FASTA for multiple-sequence alignment ("MSA")
- MAFFT performs MSA to get the best possible alignment under multiple scenarios
- Run RAxML once on the best MSA (bootstrapped)
- Refine and optimize the mutation model, then rerun RAxML
- Basic plot of tree topology
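The first two steps (FASTA to FASTQ with dummy quality scores, then sliding-window fragmentation) can be illustrated with a minimal Python sketch. This is not gust's own code: the function name `fragment_to_fastq`, the window length of 150 bp, the dummy quality character `I`, and the read-naming scheme are all assumptions made for illustration. Only the 1 bp step comes from the description above.

```python
from typing import Iterator


def fragment_to_fastq(
    name: str,
    seq: str,
    window: int = 150,     # assumed fragment length, not taken from gust
    step: int = 1,         # the window advances by 1 bp, as described above
    qual_char: str = "I",  # assumed dummy quality character
) -> Iterator[str]:
    """Yield one FASTQ record per sliding-window fragment of `seq`."""
    # If the sequence is shorter than the window, emit it once as a whole
    # (an assumption about how short contigs could be handled).
    last_start = max(len(seq) - window, 0)
    for start in range(0, last_start + 1, step):
        frag = seq[start:start + window]
        # Four FASTQ lines: header, bases, separator, dummy qualities.
        yield f"@{name}_{start}\n{frag}\n+\n{qual_char * len(frag)}\n"


if __name__ == "__main__":
    for record in fragment_to_fastq("contig1", "ACGTACGT", window=5):
        print(record, end="")
```

With a step of 1 bp, a sequence of length n yields n - window + 1 overlapping fragments, so the fragmented files are much larger than the input assemblies.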
Gust likes to be verbose in the message prompts for every task, so it will be very clear what it's doing as it's doing it.
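The thinning step ("x SNPs every y basepairs") can be sketched in the same way. The version below assumes fixed, non-overlapping bins of y bp per chromosome and keeps the first x sites in each bin; gust may choose sites differently, and the function name `thin_snps` and the `(chrom, pos)` tuple layout are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Site = Tuple[str, int]  # (chromosome, 1-based position); layout is an assumption


def thin_snps(sites: Iterable[Site], x: int, y: int) -> List[Site]:
    """Keep at most `x` sites per non-overlapping `y`-bp bin on each chromosome."""
    kept: List[Site] = []
    counts: Dict[Tuple[str, int], int] = defaultdict(int)
    for chrom, pos in sorted(sites):
        # Positions 1..y fall in bin 0, y+1..2y in bin 1, and so on.
        key = (chrom, (pos - 1) // y)
        if counts[key] < x:
            counts[key] += 1
            kept.append((chrom, pos))
    return kept


if __name__ == "__main__":
    example = [("chr1", 5), ("chr1", 50), ("chr1", 90), ("chr1", 120), ("chr2", 10)]
    print(thin_snps(example, x=2, y=100))
```

In this example the third site in the first 100 bp bin of `chr1` is dropped, while sites in other bins or on other chromosomes are unaffected.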