Creating expandable search databases

Milot Mirdita edited this page Nov 28, 2024 · 7 revisions

Starting with an existing clustering

mmseqs createdb "INPUT.fasta" seqdb

# Inspect seqdb.lookup: it maps each numeric key (first column) to a sequence accession (second column).
# Your clustering has to be expressed with these numeric keys instead of accessions.
# For example, given a .lookup containing
# 0 Q9HGP0 0
# 1 Q9HGP1 1
# ...
# create a two-column TSV with the cluster key in the first column and the member key in the second:
# 0 0
# 0 1
# 0 2
# 3 4
# ...
# Here cluster 0 (Q9HGP0) contains sequences 0 (Q9HGP0), 1 (Q9HGP1) and 2; cluster 3 contains sequence 4; and so on.

mmseqs tsv2db NEW_TSV clu --output-dbtype 6
# -e inf disables the E-value threshold, so every member of the clustering is accepted
mmseqs align seqdb seqdb clu aln -a -e inf
mmseqs result2profile seqdb seqdb aln prof
mmseqs profile2consensus prof cons
DBNAME=GIVE_THIS_A_GOOOD_NAME
mmseqs prefixid cons ${DBNAME}.tsv --tsv --threads 1
mmseqs prefixid seqdb ${DBNAME}_seq.tsv --tsv --threads 1
mmseqs prefixid seqdb_h ${DBNAME}_h.tsv --tsv --threads 1
mmseqs prefixid aln ${DBNAME}_aln.tsv --tsv --threads 1
DBDIR=YOUR_FINAL_DB_DIR
mmseqs tsv2exprofiledb ${DBNAME} ${DBDIR}/${DBNAME}
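
If the existing clustering is keyed by accession, the numeric TSV described in the comments above can be derived from `seqdb.lookup` with a small `awk` join. The following is a minimal sketch under stated assumptions: `clusters_by_accession.tsv` is a hypothetical tab-separated file holding a representative accession and a member accession per line, the accession `Q9HGP2` and all toy rows are made up for illustration, and the third `.lookup` column is ignored.

```shell
#!/bin/sh
# Sketch: translate an accession-based clustering into the numeric TSV for tsv2db.
# Assumption: clusters_by_accession.tsv is a hypothetical tab-separated file with
# "representative accession<TAB>member accession" on each line.
set -eu

workdir=$(mktemp -d)

# Toy stand-in for the seqdb.lookup written by createdb (key, accession, third column unused here)
printf '0\tQ9HGP0\t0\n1\tQ9HGP1\t1\n2\tQ9HGP2\t2\n' > "$workdir/seqdb.lookup"

# Toy clustering by accession: Q9HGP0 represents itself and Q9HGP1, Q9HGP2 is a singleton
printf 'Q9HGP0\tQ9HGP0\nQ9HGP0\tQ9HGP1\nQ9HGP2\tQ9HGP2\n' > "$workdir/clusters_by_accession.tsv"

awk -F '\t' -v OFS='\t' '
    # First file (.lookup): remember the numeric key of every accession
    NR == FNR { key[$2] = $1; next }
    # Second file: stop on accessions that are missing from the .lookup
    !($1 in key) || !($2 in key) {
        print "unknown accession on line " FNR > "/dev/stderr"
        exit 1
    }
    # Otherwise emit "cluster key<TAB>member key"
    { print key[$1], key[$2] }
' "$workdir/seqdb.lookup" "$workdir/clusters_by_accession.tsv" > "$workdir/NEW_TSV"

cat "$workdir/NEW_TSV"
```

With real data, point the two input paths at your own `seqdb.lookup` and clustering file; the resulting file plays the role of `NEW_TSV` in the `tsv2db` call. The script keeps the row order of the input clustering and aborts if an accession is not present in the `.lookup`.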

Starting from scratch

mmseqs createdb "INPUT.fasta" seqdb

# Parameter choice matters here: in general, cluster down to a low sequence identity but keep the coverage high.
# A search always has to match the cluster representative first. Without high coverage, the representative may lack a domain that its members contain, and that domain could then no longer be found in any of the members.
mmseqs cluster seqdb clu tmp --min-seq-id 0.3 -c 0.8
# -e inf disables the E-value threshold, so every member of the clustering is accepted
mmseqs align seqdb seqdb clu aln -a -e inf
mmseqs result2profile seqdb seqdb aln prof
mmseqs profile2consensus prof cons
DBNAME=GIVE_THIS_A_GOOOD_NAME
mmseqs prefixid cons ${DBNAME}.tsv --tsv --threads 1
mmseqs prefixid seqdb ${DBNAME}_seq.tsv --tsv --threads 1
mmseqs prefixid seqdb_h ${DBNAME}_h.tsv --tsv --threads 1
mmseqs prefixid aln ${DBNAME}_aln.tsv --tsv --threads 1
DBDIR=YOUR_FINAL_DB_DIR
mmseqs tsv2exprofiledb ${DBNAME} ${DBDIR}/${DBNAME}
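
Both workflows on this page run the same commands once a clustering exists. As a convenience, those shared steps could be wrapped in one shell function. This is only a sketch: `build_expandable_db` and the `MMSEQS` variable are hypothetical names introduced here, not part of MMseqs2, and the commands are the ones listed above with quoting added. Intermediate databases (`aln`, `prof`, `cons`) are still written to the current directory.

```shell
#!/bin/sh
# Sketch: the steps shared by both workflows, wrapped in one function.
# build_expandable_db and MMSEQS are hypothetical helpers, not part of MMseqs2.
set -eu

# Path or name of the mmseqs binary; override it to use a specific build
MMSEQS=${MMSEQS:-mmseqs}

# usage: build_expandable_db SEQDB CLU DBNAME DBDIR
build_expandable_db() {
    seqdb=$1 clu=$2 dbname=$3 dbdir=$4

    # Align all cluster members; -e inf accepts everything that was clustered
    "$MMSEQS" align "$seqdb" "$seqdb" "$clu" aln -a -e inf
    "$MMSEQS" result2profile "$seqdb" "$seqdb" aln prof
    "$MMSEQS" profile2consensus prof cons

    # Export consensus, sequences, headers and alignments as prefixed TSV files
    "$MMSEQS" prefixid cons "${dbname}.tsv" --tsv --threads 1
    "$MMSEQS" prefixid "$seqdb" "${dbname}_seq.tsv" --tsv --threads 1
    "$MMSEQS" prefixid "${seqdb}_h" "${dbname}_h.tsv" --tsv --threads 1
    "$MMSEQS" prefixid aln "${dbname}_aln.tsv" --tsv --threads 1

    # Assemble the final database from the TSV files
    "$MMSEQS" tsv2exprofiledb "$dbname" "${dbdir}/${dbname}"
}
```

After sourcing this file, a call such as `build_expandable_db seqdb clu "$DBNAME" "$DBDIR"` would run the whole sequence; because of `set -eu`, the first failing command stops the run.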