R package for clustering and visualizing pairwise Mash distance data
Mash (Fast genome and metagenome distance estimation using MinHash) is a commonly used tool for fast distance estimation between sequences, including building a pairwise distance matrix for a set of sequences. MashClustR is a simple R package that uses the pairwise distance results from Mash to divide the sequences into clusters with the UPGMA hierarchical clustering method. It can also compare different clustering thresholds and generate visualizations of the clustering results.
To install MashClustR in your R environment, devtools must be installed.
MashClustR also requires the magrittr, dplyr, ggplot2 and gridExtra packages. If they are not already installed, they will be installed automatically.
install.packages("devtools")
devtools::install_github("davidtong28/MashClustR")
Currently, MashClustR offers these functions:
input(filepath)
: Reads a Mash pairwise distance file, parses it, and saves the result in the variable "mash_matrix" in the global environment. No other variable with the same name may exist in the global environment, or the package will not function (this is CRUCIAL). The input file should be generated by running the mash dist command on a sketch file against itself, which produces a distance for every pair of sequences in the set. For example:
mash sketch -o sequences_to_cluster.sketch /path/to/sequence/set/*.fa
mash dist sequences_to_cluster.sketch sequences_to_cluster.sketch > inputfile
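To make the expected input concrete, here is a minimal base R sketch of how such a file could be turned into a square distance matrix. It assumes the standard five-column, tab-separated output of mash dist (reference, query, distance, p-value, shared hashes). The rows, file names, and variable names below are made up for illustration, and this is not necessarily how input() parses the file internally.

```r
# Hypothetical example rows in the five-column, tab-separated layout that
# `mash dist` prints: reference, query, distance, p-value, shared hashes.
mash_lines <- c(
  "a.fa\ta.fa\t0\t0\t1000/1000",
  "a.fa\tb.fa\t0.02\t0\t450/1000",
  "b.fa\ta.fa\t0.02\t0\t450/1000",
  "b.fa\tb.fa\t0\t0\t1000/1000"
)

# For a real file, replace `text = mash_lines` with the path to the file.
pairs <- read.table(
  text = mash_lines, sep = "\t", stringsAsFactors = FALSE,
  col.names = c("reference", "query", "distance", "p_value", "shared_hashes")
)

# Pivot the long table into a square matrix indexed by sequence name.
ids <- unique(c(pairs$reference, pairs$query))
dist_matrix <- matrix(NA_real_, nrow = length(ids), ncol = length(ids),
                      dimnames = list(ids, ids))
dist_matrix[cbind(pairs$reference, pairs$query)] <- pairs$distance
```

Because the sketch file is compared against itself, every pair appears in both directions and the resulting matrix is symmetric with a zero diagonal.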
cluster_list_gen(cutoff)
: Generates a table of sequences clustered by UPGMA hierarchical clustering at the given Mash distance cutoff, which must be between 0 and 1. The first column lists the sequences in their original order; the second column gives the cluster assigned to each sequence at that cutoff. cutoff=0 means sequences will be divided whenever their Mash distance is above 0. cutoff=1 means all sequences will form one cluster.
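The idea behind cluster_list_gen(cutoff) can be sketched with base R alone: average-linkage hclust is UPGMA, and cutree with a height cuts the tree at a distance threshold. The matrix and variable names below are hypothetical, and the package's own implementation may differ in details such as how values exactly at the cutoff are treated.

```r
# Hypothetical 4x4 symmetric Mash distance matrix.
ids <- c("s1", "s2", "s3", "s4")
d <- matrix(c(0.00, 0.01, 0.20, 0.22,
              0.01, 0.00, 0.21, 0.20,
              0.20, 0.21, 0.00, 0.02,
              0.22, 0.20, 0.02, 0.00),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

# method = "average" is UPGMA in stats::hclust.
tree <- hclust(as.dist(d), method = "average")

# Cut the tree at a distance cutoff; sequences joined below it share a cluster.
cutoff <- 0.05
clusters <- cutree(tree, h = cutoff)

# Two-column table: sequences in original order, then assigned cluster.
cluster_table <- data.frame(sequence = names(clusters),
                            cluster = unname(clusters))
```

With this toy matrix, a cutoff of 0.05 keeps s1 with s2 and s3 with s4, while a cutoff of 1 merges everything into a single cluster.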
cluster_plot_gen(cutoff)
: Generates a histogram of cluster count distribution using the input cutoff value.
cluster_4_plot_gen(cutoff1,cutoff2,cutoff3,cutoff4)
: Generates 4 consecutive histograms (in a row) using cluster_plot_gen(cutoff), given 4 cutoff values. This is useful for choosing a preferred cutoff.
cluster_count_gen(list)
: Counts sequences in each assigned cluster. Input is the list generated by cluster_list_gen().
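As a rough illustration of what counting sequences per cluster involves, the following base R sketch tabulates a two-column table shaped like the one described for cluster_list_gen(). The column names and data are assumptions made for this example, not the package's actual output format.

```r
# Hypothetical table in the shape described for cluster_list_gen():
# one row per sequence, with its assigned cluster.
cluster_table <- data.frame(
  sequence = c("s1", "s2", "s3", "s4", "s5"),
  cluster = c(1, 1, 2, 2, 2)
)

# Count how many sequences fall into each cluster.
counts <- as.data.frame(table(cluster = cluster_table$cluster),
                        responseName = "n_sequences")

# Largest clusters first.
counts <- counts[order(-counts$n_sequences), ]
```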
draw_heatmap(cutoff)
: Generates a heatmap of Mash distances, reordered by UPGMA hierarchical clustering. Clusters are annotated by color. Currently, at most 50 clusters can be annotated. If there are more than 50 clusters, minor clusters will be ignored in the annotation. The cutoff needs to be adjusted accordingly for best results.
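The reordering step behind such a heatmap can be sketched in base R: the leaf order of the UPGMA tree is used to permute both rows and columns, so that members of a cluster sit next to each other. The data and names are hypothetical, and the sketch stops short of drawing or color annotation, which draw_heatmap() handles itself.

```r
# Hypothetical distance matrix with the two clusters interleaved.
ids <- c("s1", "s2", "s3", "s4")
d <- matrix(c(0.00, 0.20, 0.01, 0.22,
              0.20, 0.00, 0.21, 0.02,
              0.01, 0.21, 0.00, 0.20,
              0.22, 0.02, 0.20, 0.00),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

tree <- hclust(as.dist(d), method = "average")  # UPGMA
clusters <- cutree(tree, h = 0.05)

# Permute rows and columns into dendrogram leaf order.
ordered <- d[tree$order, tree$order]

# Cluster label of each row/column after reordering; members of the
# same cluster are now adjacent, which is what makes blocks visible.
cluster_in_order <- clusters[tree$order]
```

The reordered matrix could then be passed to a plotting function with its own reordering switched off, for example stats::heatmap with Rowv = NA and Colv = NA.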
total_num_list()
: Generates a chart of each possible total number of clusters and the minimum cutoff required to reach that number. No input is needed.
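One way to see where such a chart comes from: in a hierarchical tree, each merge height is a cutoff at which the total number of clusters drops by one. The base R sketch below derives the table from the merge heights of a hypothetical UPGMA tree. It assumes distinct merge heights and cutree's convention that a merge at exactly the cutoff is applied; total_num_list() may treat that boundary differently.

```r
# Hypothetical 4x4 symmetric Mash distance matrix.
ids <- c("s1", "s2", "s3", "s4")
d <- matrix(c(0.00, 0.01, 0.20, 0.22,
              0.01, 0.00, 0.21, 0.20,
              0.20, 0.21, 0.00, 0.02,
              0.22, 0.20, 0.02, 0.00),
            nrow = 4, byrow = TRUE, dimnames = list(ids, ids))

tree <- hclust(as.dist(d), method = "average")  # UPGMA
n <- length(tree$order)                         # number of sequences
heights <- sort(tree$height)                    # n - 1 merge heights

# Cutting below the first merge leaves n clusters; each merge height
# reached after that removes one cluster (assuming no tied heights).
cutoff_table <- data.frame(
  total_clusters = c(n, n - seq_along(heights)),
  min_cutoff = c(0, heights)
)
```

Each row can be checked by cutting the tree at min_cutoff with cutree(tree, h = ...) and counting the distinct cluster labels.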
total_num_plot(min=0.01,max=1,log=TRUE)
: Generates a geom_step plot showing the total number of clusters (y axis) as a function of the clustering cutoff (x axis). The inputs set the minimum and maximum cutoff to be plotted, plus a logical value (TRUE/FALSE) indicating whether a logarithmic scale should be used on the cutoff (x) axis.