
Swarm as a compression method

Frédéric Mahé edited this page Nov 27, 2022 · 4 revisions

The paper by Patro & Kingsford (2015) shows that ordering sequences by similarity improves compression rates of FASTQ files.

On FASTA files, swarm can be used to perform the reordering of sequences (lossless compression), or to identify cluster representatives (lossy compression).
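For the lossy case, here is a minimal sketch. It assumes swarm's default output format (one cluster per line, space-separated amplicon names, the first name being the cluster seed) and a linearized FASTA file (one header line followed by one sequence line). The file names and the toy data are hypothetical, for illustration only:

```shell
# Toy inputs (hypothetical): a linearized fasta file and a swarm output file
WORKDIR=$(mktemp -d)
FASTA="${WORKDIR}/input.fas"
SWARMS="${WORKDIR}/input.swarms"
REPRESENTATIVES="${WORKDIR}/input_representatives.fas"

printf ">a_3\nACGT\n>b_1\nACGA\n>c_2\nTTTT\n" > "${FASTA}"
printf "a_3 b_1\nc_2\n" > "${SWARMS}"

# Keep only the first amplicon (the seed) of each cluster
awk 'NR == FNR {
         # First file (fasta): remember each header, then store its sequence
         if (/^>/) {name = substr($1, 2)} else {fasta[name] = $1}
         next
     }
     {# Second file (swarm output): print the seed and its sequence
      printf ">%s\n%s\n", $1, fasta[$1]
     }' "${FASTA}" "${SWARMS}" > "${REPRESENTATIVES}"

cat "${REPRESENTATIVES}"
```

On the toy data, only `a_3` and `c_2` are kept. Note that this sketch leaves the abundance annotations of the seeds unchanged. Swarm can also write cluster representatives itself (option `-w`, `--seeds`), which is probably the better choice in practice.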

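Let's check the effect of reordering on compression. Swarm outputs an ordered list of amplicons that can be used to reorder the input FASTA file. Here is an example, assuming that the FASTA file is linearized (each entry is one header line followed by one sequence line) and that swarm's output was saved as input.swarms: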

FASTA="input.fas"
MODIFIED_FASTA=$(mktemp)
SWARMS="${FASTA/.fas/.swarms}"
REORDERED_FASTA="${FASTA/.fas/_reordered.fas}"

# Prepare fasta file: one entry per line (header, tab, sequence), without ">"
paste - - < "${FASTA}" | tr -d ">" > "${MODIFIED_FASTA}"

# Reorder the fasta file
awk -v FASTA="${MODIFIED_FASTA}" \
    'BEGIN {FS = "\t"
            while ((getline < FASTA) > 0) {
                fasta[$1] = $2
            }
            close(FASTA)
            FS = " "
           }

     {# Parse the swarm file
      for (i = 1; i <= NF; i++) {
          printf ">%s\n%s\n", $i, fasta[$i]
          }
     }' "${SWARMS}" > "${REORDERED_FASTA}"

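# Compress the reordered file in the background
# (-9: best compression, -k: keep the input file)
bzip2 -9k "${REORDERED_FASTA}" &

# Clean up the temporary file
rm "${MODIFIED_FASTA}"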

On the different datasets tested, a gain of 15% for rRNA 18S V4 and 25% for rRNA 18S V9 was observed. The gain is comparable to the average gain of 28% described on FASTQ files by Patro & Kingsford (2015).
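To quantify the gain on a given dataset, the compressed sizes of the original and reordered files can be compared. Here is a minimal sketch, using hypothetical toy files so that it runs on its own. The gain is defined here as the relative size reduction, which is an assumption about how the percentages above were computed. The fallback to gzip is only there for systems without bzip2:

```shell
WORKDIR=$(mktemp -d)
FASTA="${WORKDIR}/input.fas"
REORDERED_FASTA="${WORKDIR}/input_reordered.fas"

# Toy data (hypothetical): the same entries in two different orders
printf ">a_1\nACGT\n>b_1\nTTTT\n>c_1\nACGA\n" > "${FASTA}"
printf ">a_1\nACGT\n>c_1\nACGA\n>b_1\nTTTT\n" > "${REORDERED_FASTA}"

# Compressor: bzip2 as in the example above, gzip if bzip2 is missing
COMPRESS="bzip2"
command -v bzip2 > /dev/null || COMPRESS="gzip"

# Compressed sizes in bytes (-c writes to stdout, inputs are untouched)
ORIGINAL_SIZE=$("${COMPRESS}" -9c "${FASTA}" | wc -c)
REORDERED_SIZE=$("${COMPRESS}" -9c "${REORDERED_FASTA}" | wc -c)

# Relative size reduction, in percent
GAIN=$(awk -v a="${ORIGINAL_SIZE}" -v b="${REORDERED_SIZE}" \
           'BEGIN {printf "%.1f", 100 * (a - b) / a}')
echo "original: ${ORIGINAL_SIZE} bytes, reordered: ${REORDERED_SIZE} bytes, gain: ${GAIN}%"

# Sanity check: reordering is lossless, both files hold the same entries
paste - - < "${FASTA}" | sort > "${WORKDIR}/a.txt"
paste - - < "${REORDERED_FASTA}" | sort > "${WORKDIR}/b.txt"
cmp "${WORKDIR}/a.txt" "${WORKDIR}/b.txt" && echo "same content"
```

On files this small the sizes are dominated by the compressor's overhead, so the reported gain is meaningless; the sketch only shows the procedure. Real amplicon datasets are needed to observe an effect.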

Swarm groups sequences by similarity, but clusters themselves are not ordered by similarity (swarm does not compute inter-cluster distances). Further compression gains might be obtained by ordering clusters.

In my tests, reordering based on usearch 8 clustering results (97% identity threshold) achieves higher compression levels than reordering based on swarm results (whatever sorting option is used). More tests are necessary to understand what happens here.