Hijacking Sourmash

This repository contains scripts, information and data for the manuscript "Hijacking a rapid and scalable metagenomic method for plant comparative genomics highlights subgenome dynamics and evolution".

For the sourmash program itself, see the sourmash GitHub repository, documentation and paper.

Citation

If you use this work, please cite: Reynolds, G., Mumey, B., Strnadova-Neeley, V., & Lachowiec, J. (2024). Hijacking a rapid and scalable metagenomic method reveals subgenome dynamics and evolution in polyploid plants. Applications in Plant Sciences, e11581.

Questions and issues

If you have a question regarding the application of sourmash to polyploid genomics, please feel free to reach out to me by opening an issue. This ensures that others who may have the same question can read my responses. Please note that this repository is not monitored full time, but I will do my best to get back to you as soon as possible. If you have a question or issue regarding sourmash itself, please follow their guidelines for help.

Sourmash modification

As described in the publication, applying sourmash to polyploid genomes may benefit from changing sourmash's default single-linkage hierarchical clustering method. This can be done after installing sourmash by locating the file "commands.py" and modifying the clustering command, presently on line 316:

Y = sch.linkage(D, method='single')
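
As a minimal sketch of what changing this line does, the snippet below runs SciPy's linkage function on a small made-up matrix with two different "method" values. The matrix values and the choice of 'average' as the alternative are illustrative assumptions, not values taken from the publication; any method accepted by scipy.cluster.hierarchy.linkage (for example 'complete', 'average' or 'ward') can be substituted in the same position.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Toy 4x4 similarity matrix standing in for sourmash's comparison matrix
# (hypothetical values, for illustration only).
D = np.array([
    [1.0, 0.9, 0.2, 0.1],
    [0.9, 1.0, 0.3, 0.2],
    [0.2, 0.3, 1.0, 0.8],
    [0.1, 0.2, 0.8, 1.0],
])

# The default call quoted above. Given a 2-D array, linkage treats each
# row as an observation vector and clusters on Euclidean distance.
Y_single = sch.linkage(D, method='single')

# The same call with a different linkage method swapped in.
Y_average = sch.linkage(D, method='average')

# no_plot=True returns the dendrogram layout without drawing a figure.
order_single = sch.dendrogram(Y_single, no_plot=True)['leaves']
order_average = sch.dendrogram(Y_average, no_plot=True)['leaves']
print(order_single, order_average)
```

Only the "method" argument differs between the two calls, which mirrors the one-word edit to "commands.py" described above.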

One may also wish to modify the size of the plots generated by sourmash. This can be achieved by modifying line 293 of "commands.py" and line 40 of "fig.py" to figure dimensions of your choosing. The default for sourmash is:

 fig = pylab.figure(figsize=(8,5))

We used the following dimensions for nearly all single-genome figures:

 fig = pylab.figure(figsize=(8, 6))

And the following dimensions for multi-genome figures (i.e. progenitor plots) and large-genome figures (i.e. S. tuberosum):

fig = pylab.figure(figsize=(12, 10))

For legibility, one may wish to modify the font size of sourmash's output, especially if the figure size has been modified. This can be achieved by modifying the dendrogram commands in "commands.py" on line 317 and "fig.py" on line 50 to include a leaf font size. For example, the default dendrogram command in "commands.py" is:

sch.dendrogram(Y, orientation='right', labels=labeltext)

Adding the "leaf_font_size" argument, as below, produces a larger font size:

sch.dendrogram(Y, orientation='right', labels=labeltext, leaf_font_size=12)

Example sourmash commands

The following section contains the commands used to produce the results for the Hijacking Sourmash paper. Exact directories were provided wherever a placeholder in angle brackets, such as <DirectoryOfChromosomes>, is written. In sourmash plot, the labels originally produced by sourmash are overridden by the flag "--labeltext new_labels.txt", where "new_labels.txt" contained exactly the same chromosome labels, in exactly the same order, but with the directory structure that is included in the labels by default removed for legibility.
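
One way to build such a label file is sketched below. It assumes the default labels are file paths stored one per line in a text file. The input name "k21_freq_compare.labels.txt" mentioned in the comment is an assumption about what sourmash compare writes alongside its output, and strip_directories and write_new_labels are hypothetical helpers, not part of sourmash or of this repository.

```python
import os

def strip_directories(labels):
    """Return labels with leading directory structure removed, order preserved."""
    return [os.path.basename(label.strip()) for label in labels if label.strip()]

def write_new_labels(in_path, out_path="new_labels.txt"):
    """Read one label per line from in_path and write the shortened labels."""
    # in_path is assumed to be something like "k21_freq_compare.labels.txt".
    with open(in_path) as handle:
        labels = strip_directories(handle)
    with open(out_path, "w") as handle:
        handle.write("\n".join(labels) + "\n")
    return labels

# Illustrative paths only; real labels depend on where the chromosomes live.
example = ["/data/genomes/placed/chr1A.fa", "/data/genomes/placed/chr1B.fa"]
print(strip_directories(example))
```

Because the helper only shortens each line and never reorders or drops non-empty lines, the labels stay aligned with the rows of the comparison matrix.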

Example sourmash sketch command

This is the first command that will need to be run for each set of chromosomes. Theoretically, one could use a single FASTA file containing all of the chromosomes for the genome with the addition of the "--singleton" flag. However, for publicly deposited genomes, a genome file will contain numerous scaffolds and contigs alongside the sequences anchored to chromosomes, which makes the resulting signatures extremely large and visualisation difficult. Instead, one may wish to separate out the chromosomes first using either "SepChr.py" or "SepUnplaced.py", depending on the naming of the sequences in the file. Both programs place anchored chromosomal sequences within a "placed" directory, and the sketch command below can be pointed at the sequences within it.

sourmash sketch dna -p k=2,k=3,k=4,k=5,k=6,k=7,k=8,k=9,k=10,k=11,k=12,k=13,k=14,k=15,k=16,k=17,k=18,k=19,k=20,k=21,k=31,k=41,k=51,k=61,abund  <DirectoryOfChromosomes>/*.fa
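
The long "-p" parameter string is easy to mistype, so it can be generated rather than written by hand. The following is a small sketch that rebuilds the same string from the list of k-mer sizes; the variable names are illustrative, and the chromosome paths still need to be appended to the command.

```python
# k-mer sizes used in the command above: every k from 2 to 21,
# then 31 to 61 in steps of 10.
ksizes = list(range(2, 22)) + [31, 41, 51, 61]

# Join the k values into one comma-separated parameter string and
# append "abund" so that k-mer abundances are tracked.
param_string = ",".join(f"k={k}" for k in ksizes) + ",abund"

# The chromosome FASTA files would follow as further arguments.
command = ["sourmash", "sketch", "dna", "-p", param_string]
print(" ".join(command))
```

Editing the ksizes list is then enough to add or drop k-mer sizes from the sketch.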

Example sourmash compare command

Below is an example of the commands used to compare the sequence signatures generated by sourmash sketch. The k-mer size, the filenames, and the choice between k-mer frequency and composition can be modified as desired.

Frequency

sourmash compare -k=21 --output k21_freq_compare --csv k21_freq_compare.csv <LocationOfSigs>

Composition

sourmash compare -k=21 --ignore-abundance --output k21_comp_compare --csv k21_comp_compare.csv <LocationOfSigs>
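
Since the signatures hold many k-mer sizes, the two commands above are typically repeated per k. The sketch below builds both the frequency and composition commands for a list of k values, following the same filename pattern. compare_commands is a hypothetical helper, "sigs/" is a stand-in for <LocationOfSigs>, and the commands are only printed here, not executed.

```python
import shlex

def compare_commands(ksizes, sig_location):
    """Build frequency and composition `sourmash compare` commands for each k."""
    commands = []
    for k in ksizes:
        # Frequency: abundances are used, as in the first command above.
        freq = ["sourmash", "compare", f"-k={k}",
                "--output", f"k{k}_freq_compare",
                "--csv", f"k{k}_freq_compare.csv", sig_location]
        # Composition: abundances are ignored, as in the second command.
        comp = ["sourmash", "compare", f"-k={k}", "--ignore-abundance",
                "--output", f"k{k}_comp_compare",
                "--csv", f"k{k}_comp_compare.csv", sig_location]
        commands.extend([freq, comp])
    return commands

# "sigs/" is a placeholder directory, not a path from the repository.
for cmd in compare_commands([21, 31], "sigs/"):
    print(shlex.join(cmd))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to subprocess.run once the signature location is filled in.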

Example sourmash plot command

Below is an example of the sourmash plot command. Filenames and plot labels can be freely adjusted.

sourmash plot k21_freq_compare --labels --labeltext new_labels.txt --csv k21_plot.csv