Compared to existing alignment-free sequence comparison methods, SAINT offers the following advantages:
- SAINT is faster and more memory-efficient because sequence data are processed in a compressed embedding space, which is quicker to search and more compact to store.
- SAINT is a weakly supervised learning method: the embedding function is learned automatically from easily acquired data. Unlike existing deep-learning-based alignment-free methods, SAINT does not require the laborious collection of accurate alignment distances for training.
- Version 1.0
  - This is the first version of the SAINT pipeline.
  - A demo of a SAINT run is given here.
- Prerequisite environment
  - Unix or Linux operating system.
  - A CPU is sufficient for the computation.
  - Python 3 or above.
  - The following Python modules must be available: sys, optparse, os, random, numpy, pandas, collections, keras and sklearn.
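
Before running SAINT, it can be useful to confirm that the interpreter and modules listed above are actually available. The following is a minimal sketch, not part of the SAINT code base: the `missing_packages` helper and the `REQUIRED` list name are hypothetical, and the check only tests whether each module can be located, not whether its version is compatible.

```python
import importlib.util
import sys

# Module names taken from the prerequisite list above.
REQUIRED = ["sys", "optparse", "os", "random", "numpy",
            "pandas", "collections", "keras", "sklearn"]


def missing_packages(names):
    """Return the subset of `names` that cannot be located for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Report the interpreter version and any modules that still need installing.
if sys.version_info < (3,):
    print("Python 3 or above is required")
missing = missing_packages(REQUIRED)
if missing:
    print("Missing modules: " + ", ".join(missing))
else:
    print("All required modules were found")
```

Note that the `sklearn` module is distributed on PyPI under the name `scikit-learn`; the other third-party modules (numpy, pandas, keras) are installed under their import names.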
- Detailed steps
  - Download the source code to a directory of your choice, e.g. '/home/user/SAINT'.
  - Enter that directory:
    $ cd /home/user/SAINT
  - Extract the zip file:
    $ unzip ./resource/kmer.zip
  - If your operating system has multiple Python versions installed, make sure the one you use is Python 3 or above.
The dataset was downloaded from NCBI. For the 232 bacterial genomes, SAINT uses the KMC tool to convert each FASTA file into a k-mer frequency file, provided here.
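
To illustrate what a k-mer frequency profile contains, the sketch below counts overlapping k-mers in a single sequence using only the Python standard library. It is an illustration of the concept only: it does not reproduce KMC's behaviour, options or output file format, and the `count_kmers` function name is hypothetical.

```python
from collections import Counter


def count_kmers(sequence, k):
    """Count every overlapping substring of length k in a DNA sequence.

    Windows containing characters other than A, C, G or T (for example N)
    are skipped.
    """
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if set(kmer) <= set("ACGT"):
            counts[kmer] += 1
    return counts


# Toy example: a six-base sequence has five overlapping 2-mers.
profile = count_kmers("ACGTAC", 2)
for kmer, n in sorted(profile.items()):
    print(kmer, n)
```

With k = 6, as in the example command in this README, there are 4^6 = 4096 possible k-mers over the four-letter DNA alphabet.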
Run SAINT
- Usage of SAINT
  - The main scripts are triplet_model.py and taxonomy_localization.py, which accept the following options:
    -h, --help: show this help message and exit.
    -i, --inputcsv: the taxonomy of the input data.
    -d, --kmer_frequency_dir: the directory containing the k-mer frequency files.
    -t, --test_name: the list of test names.
    -k, --kofKTuple: the value of k for the k-tuples.
    -e, --epochNum: the number of epochs.
    -o, --output: the output directory.
  - Run SAINT to train the model.
    Create a new folder for the model file:
    $ mkdir output
    Run triplet_model.py:
    $ python code/triplet_model.py -i resource/data.csv -d resource/kmer/ -t resource/test_name.txt -k 6 -e 30 -o output/
  - Predict the taxonomy of unknown species and calculate the performance of SAINT.
    Run taxonomy_localization.py:
    $ python code/taxonomy_localization.py -i resource/data.csv -d resource/kmer/ -t resource/test_name.txt -o output/
    The outputs are ./output/predict_taxonomy.txt and ./output/Accuracy.txt.
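
The options listed under "Usage of SAINT" map naturally onto Python's optparse module, which appears in the prerequisite list. The following is a minimal sketch of how such a parser could be declared. It is not taken from triplet_model.py or taxonomy_localization.py: the `build_parser` name, the help strings and the integer types for `-k` and `-e` are assumptions.

```python
from optparse import OptionParser


def build_parser():
    """Declare the command-line options documented in this README.

    optparse adds -h/--help automatically.
    """
    parser = OptionParser()
    parser.add_option("-i", "--inputcsv", dest="inputcsv",
                      help="the taxonomy of the input data")
    parser.add_option("-d", "--kmer_frequency_dir", dest="kmer_frequency_dir",
                      help="the directory containing the k-mer frequency files")
    parser.add_option("-t", "--test_name", dest="test_name",
                      help="the list of test names")
    parser.add_option("-k", "--kofKTuple", dest="kofKTuple", type="int",
                      help="the value of k for the k-tuples")
    parser.add_option("-e", "--epochNum", dest="epochNum", type="int",
                      help="the number of epochs")
    parser.add_option("-o", "--output", dest="output",
                      help="the output directory")
    return parser


# Parse the arguments of the training command shown above.
options, _ = build_parser().parse_args([
    "-i", "resource/data.csv", "-d", "resource/kmer/",
    "-t", "resource/test_name.txt", "-k", "6", "-e", "30", "-o", "output/",
])
print(options.kofKTuple, options.epochNum, options.output)
```

Passing an explicit argument list to `parse_args` keeps the example independent of how the script is invoked; called without arguments, `parse_args` reads from `sys.argv[1:]`.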