Frequently Asked Questions

Frédéric Mahé edited this page Nov 27, 2022 · 13 revisions

How to deal with ambiguous nucleotides?

Swarm is designed and optimized to work with four nucleotides: A, C, G, and T (or U). If your sequences contain ambiguous nucleotides (i.e. anything other than these four), swarm will exit with an error message.

Users can either arbitrarily replace ambiguous nucleotides or completely remove sequences containing ambiguous nucleotides. The latter seems safer. Sequences with ambiguous nucleotides should normally be rare in high-throughput studies. Our experience with Roche 454, Illumina HiSeq and Illumina MiSeq suggests that removing these sequences has no noticeable impact on downstream analyses.

Here is an example command that drops sequences containing ambiguous nucleotides:

awk '{if (/^>/) {a = $0} else {if (/^[ACGT]*$/) {printf "%s\n%s\n", a, $0}}}' in.fas > out.fas

assuming your fasta entries are on two lines (one line for the header, one line for the sequence).
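If your fasta file wraps sequences over several lines, the command above would treat each line as a separate sequence. A minimal sketch of a workaround, assuming standard wrapped fasta input: the helper linearize_fasta below (a hypothetical name, not part of swarm) joins the wrapped lines so that each entry is on exactly two lines.

```shell
# Hypothetical helper (not part of swarm): join wrapped sequence lines so
# that each fasta entry is exactly two lines (header, then sequence).
# Reads fasta on stdin, writes fasta on stdout.
linearize_fasta() {
    awk '
        /^>/ {
            if (seq != "") print seq   # flush the previous sequence
            print                      # print the header as-is
            seq = ""
            next
        }
        { seq = seq $0 }               # accumulate wrapped sequence lines
        END { if (seq != "") print seq }
    '
}

# Possible usage, feeding the filter shown above:
# linearize_fasta < in.fas | awk '{if (/^>/) {a = $0} else {if (/^[ACGT]*$/) {printf "%s\n%s\n", a, $0}}}' > out.fas
```

The sequence is only printed once the next header (or the end of the file) is reached, so memory use stays limited to one entry at a time.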

When clustering with swarm, when is an appropriate time to check for chimeras?

After clustering.

As pointed out by Colin Brislawn, Robert Edgar recommends using uchime to remove chimeras after cluster picking with uparse. His recommendation is based, in part, on chimera removal heuristics built into uparse. When using uclust, Robert Edgar recommended reference-based chimera checking before cluster picking so that clusters were not influenced by chimeras. When using swarm with default parameters or low d values (i.e. high-resolution clustering), chimeras and non-chimeras should form independent clusters. It is therefore more efficient to perform chimera detection on cluster representatives: the reduced dataset size makes it possible to run both de novo and reference-based approaches.

An open question remains: can (similar) chimeras appear in independent PCR runs? In my own experience, it happens: the chimera formation process seems somewhat repeatable.

How to use the output of the --internal-structure option to produce network representations?

It is possible to use the pairwise relationships found by swarm to build a network representation of a cluster. The folder ./scripts/ contains a python script, graph_plot.py, that performs this task. The script requires the swarm results (i.e. the clusters), the structure file and a cluster number (first cluster by default) to produce the plot. It also requires the python module igraph (http://igraph.org/python/) and python 2.7 (not tested with python 3). Install or upgrade igraph as follows:

# as root
pip install --upgrade python-igraph

Note that network layout is slow and should only be applied to clusters containing a few thousand amplicons or fewer. Alternatively, the -d INTEGER option allows users to discard low-abundance amplicons, which greatly accelerates network drawing. Amplicons grafted with the --fastidious option will not be represented. See python graph_plot.py -h for some help.

Here is an example producing network visualizations for the first 20 clusters:

INPUT="amplicons"

# Swarm
swarm -d 1 \
      -w "${INPUT}_representatives.fas" \
      -i "${INPUT}.struct" \
      -s "${INPUT}.stats" \
      -o "${INPUT}.swarms" \
      "${INPUT}.fas"

# Produce the plots for the top 20 clusters
GRAPH_PLOT="/path/to/script/graph_plot.py"
SWARMS="${INPUT}.swarms"
STRUCT="${INPUT}.struct"
DROP=10  # do not draw sequences with n reads or less (use 0 to draw all)
for i in {1..20} ; do
    python "${GRAPH_PLOT}" -s "${SWARMS}" -i "${STRUCT}" -c ${i} -d ${DROP}
done

Does swarm require sequences to all be of the same length? No.

swarm does not require reads to be of the same length. However, you will obtain better results if your reads are homologous, i.e. covering the same region contained between a forward and a reverse primer. That homologous region can vary in length from one taxonomic clade to another; swarm doesn't care.

Now, if your reads do not reach your reverse primer, you can choose to truncate your reads to an arbitrary length. That's not ideal, but you should still get good results from swarm. With truncated reads, swarm will overestimate the cost of insertion-deletion events. For example:

initial reads:
ACGTACGTA
ACGTAACGTA

truncate to:
ACGTACGTA
ACGTAACGT

alignment:
ACGTA-CGTA
|||||-|||-
ACGTAACGT-

You can see in the above example that the cost of the insertion in read 2 is doubled (one internal gap + one terminal gap). Fortunately, Illumina sequencing produces far fewer indels than substitutions, so that overestimation problem is minor.
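The doubling can be checked with a plain edit-distance computation. This is only a sketch, assuming unit costs for substitutions, insertions and deletions (a textbook Levenshtein distance, not swarm's own optimized code); edit_distance is a hypothetical helper name.

```shell
# Hypothetical helper: textbook Levenshtein distance between two strings,
# with a unit cost for each substitution, insertion or deletion.
# Terminal gaps are counted like any other gap.
edit_distance() {
    awk -v a="$1" -v b="$2" 'BEGIN {
        la = length(a) ; lb = length(b)
        for (j = 0; j <= lb; j++) prev[j] = j       # first row: j insertions
        for (i = 1; i <= la; i++) {
            cur[0] = i                              # first column: i deletions
            for (j = 1; j <= lb; j++) {
                cost = (substr(a, i, 1) == substr(b, j, 1)) ? 0 : 1
                best = prev[j - 1] + cost                         # match or substitution
                if (prev[j] + 1 < best) best = prev[j] + 1        # gap in b
                if (cur[j - 1] + 1 < best) best = cur[j - 1] + 1  # gap in a
                cur[j] = best
            }
            for (j = 0; j <= lb; j++) prev[j] = cur[j]
        }
        print prev[lb]
    }'
}

edit_distance ACGTACGTA ACGTAACGTA   # full-length reads: prints 1
edit_distance ACGTACGTA ACGTAACGT    # truncated reads:   prints 2
```

With d = 1, the two full-length reads would be linked, whereas their truncated versions would not.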

And, no, we will not modify swarm to neglect terminal gaps as other software does. Neglecting terminal gaps is similar in spirit to prefix clustering, and it is in my opinion a mistake in most situations (see my rant on prefix clustering).

Can swarm work on short/long sequences? Yes.

Swarm has been successfully used to cluster very short sequences (median 65 bp), where traditional 97% global clustering failed. A user has also reported that swarm gives satisfactory results on long PacBio sequences (up to 1.6 kb).

Swarm's scalability (speed, memory)?

swarm -f is more than 150 times faster than vsearch v2.0 (97% threshold) on a dataset of 32.3 million reads (130 bp on average), for the same memory consumption (18.5 GB). Computations were multithreaded on 8 cores and run three times; time is in seconds and memory in kB (swarmf stands for swarm -f):

 method   time (s)  memory (kB)
--------------------------------
 vsearch    475339     18667748
 swarm        1083      8681160
 swarmf       2987     18565840
 vsearch    475604     18658496
 swarm        1091      8681352
 swarmf       2978     18565896
 vsearch    476601     18664924
 swarm        1066      8681336
 swarmf       2984     18565804
--------------------------------
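
The speed-up can be recomputed from the times in the table above. A small sketch, assuming the mean of the three runs is a fair summary; speedup is a hypothetical helper name and the input is simply the method and time columns copied from the table.

```shell
# Hypothetical helper: read "method time" pairs on stdin, average the
# times per method, and print the speed-up relative to vsearch.
speedup() {
    awk '
        { sum[$1] += $2 ; n[$1]++ }
        END {
            ref = sum["vsearch"] / n["vsearch"]   # mean vsearch time
            printf "swarm  %.1f\n", ref / (sum["swarm"] / n["swarm"])
            printf "swarmf %.1f\n", ref / (sum["swarmf"] / n["swarmf"])
        }
    '
}

# Times (in seconds) copied from the table
speedup <<'EOF'
vsearch 475339
swarm 1083
swarmf 2987
vsearch 475604
swarm 1091
swarmf 2978
vsearch 476601
swarm 1066
swarmf 2984
EOF
```

On these numbers, swarm -f comes out roughly 160 times faster than vsearch, and swarm without -f roughly 440 times faster.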

How to remove abundance values from identifiers in output files?

Abundance annotations are necessary for clustering with swarm, and for other operations such as de novo chimera detection with uchime. Nonetheless, if for some reason you need to remove abundance annotations (_INT or ;size=INT), that can be done with a simple shell command (assuming you are using the ;size=INT annotation):

sed -r 's/;size=[0-9]+;//g' in.swarm > out.swarm
# or in place (GNU sed); keep -i and -r separate, as -ir means "-i with backup suffix r"
sed -i -r 's/;size=[0-9]+;//g' in.swarm

You can even pipe swarm's output and avoid a temporary file:

swarm -z in.fas | sed -r 's/;size=[0-9]+;//g' > out.swarm
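
The commands above only cover the ;size=INT annotation. Here is a sketch that handles both styles, assuming a cluster file with space-separated identifiers on each line; strip_abundance is a hypothetical helper name, and the style must be chosen explicitly so that identifiers such as OTU_12;size=3; are not stripped twice.

```shell
# Hypothetical helper: remove abundance annotations from identifiers read
# on stdin. The first argument selects the annotation style:
#   size        ;size=INT (trailing semicolon optional), the default
#   underscore  _INT at the end of each identifier
# -E is the portable spelling of GNU sed's -r.
strip_abundance() {
    case "${1:-size}" in
        size)       sed -E 's/;size=[0-9]+;?//g' ;;
        underscore) sed -E 's/_[0-9]+( |$)/\1/g' ;;
        *)          echo "unknown style: $1" >&2 ; return 1 ;;
    esac
}

# Possible usage:
# strip_abundance underscore < in.swarm > out.swarm
```

In the underscore style, only a _INT suffix followed by a space or the end of the line is removed, so underscores inside an identifier are left alone.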