Comparing genomic data is critical for estimating evolutionary relationships, understanding microbiome dynamics, and exploring the developmental process of embryogenesis.
Both alignment-based and alignment-free methods play an important role in genomic comparison. However, most existing measures rely on a fixed dissimilarity function or model, which may not suit every scenario.
Therefore, in this study we propose MELT, a metric learning model with a triplet network that learns a suitable dissimilarity measure for hierarchical or longitudinal genomic and microbiome data. The construction of the training triplet dataset reflects the dissimilarity characteristics of the application scenario, so MELT offers a data-driven, scenario-oriented framework for adaptive metric comparison.
- MELT learns useful representations of genomic data from reference relationships of the form "data A is closer to data B than to data C", which reflect the dissimilarity characteristics of the application scenario.
- Instead of exact alignment distances, MELT requires only dissimilarity comparisons among A, B, and C for training; the embedding function is learned automatically from these comparisons (see the sketch below). MELT is therefore particularly suitable for scenarios without clear categorical information, such as hierarchical or longitudinal datasets.
Experiments on comparing genomic sequences, temporal microbiome samples, and gene expression profiles from scRNA-seq demonstrate MELT's strong performance across these three application scenarios.
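The core idea can be illustrated in a few lines of PyTorch. This is only a minimal sketch of triplet-based metric learning, not MELT's actual architecture: the input dimension (4,096 features for 6-mer frequencies), layer sizes, and margin below are illustrative assumptions.

```python
# Minimal triplet-network sketch (illustrative; not MELT's actual architecture).
# Anchor A, a "closer" sample B, and a "farther" sample C share one embedding
# network; the triplet loss pushes d(A, B) below d(A, C) by a margin.
import torch
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Maps a k-mer frequency vector to a low-dimensional embedding."""
    def __init__(self, in_dim=4096, emb_dim=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, emb_dim),
        )

    def forward(self, x):
        return self.net(x)

model = EmbeddingNet()
loss_fn = nn.TripletMarginLoss(margin=1.0)

# Dummy batch: 8 triplets of 4**6 = 4096-dimensional 6-mer frequency vectors.
a, b, c = (torch.rand(8, 4096) for _ in range(3))
loss = loss_fn(model(a), model(b), model(c))  # encourages d(a,b) + margin < d(a,c)
loss.backward()
print(loss.item())
```

In MELT, the triplets (A, B, C) are constructed so that their relative dissimilarities reflect the hierarchy or time ordering of the data, which is what makes the learned metric scenario-specific.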
Conda-pack is a command-line tool for creating archives of conda environments that can be installed on other systems and locations.
Here we use conda-pack to package our environments and provide them on the release page, so you can download them and run our code without pre-installing Anaconda or Python.
- Make sure your operating system is Linux.
- Run the following commands in a Linux terminal.
- Detailed steps:
- Download the source code to your directory.
$git clone https://github.com/Ying-Lab/MELT.git
$cd MELT/
- Download the environment .tar.gz files. This will take some time depending on your internet connection.
$wget https://github.com/Ying-Lab/MELT/releases/download/v1.0/MELT-pytorch-env.tar.gz
$wget https://github.com/Ying-Lab/MELT/releases/download/v1.0/MELT-tensorflow-env.tar.gz
- Extract each environment .tar.gz file into its own directory.
$mkdir MELT-pytorch-env
$tar -zxvf MELT-pytorch-env.tar.gz -C MELT-pytorch-env
$mkdir MELT-tensorflow-env
$tar -zxvf MELT-tensorflow-env.tar.gz -C MELT-tensorflow-env
- You have now successfully installed the environments. Use one of the following commands to activate an environment.
- activate pytorch env
$source MELT-pytorch-env/bin/activate
- activate tensorflow env
$source MELT-tensorflow-env/bin/activate
If the conda-pack archives are difficult to download, you can use this method to install the environments instead.
- Run the following commands in a Linux terminal.
- Detailed steps:
- Install Miniconda to manage your environment (you can skip this step if you already have Conda installed).
$wget -c https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
$bash Miniconda3-latest-Linux-x86_64.sh -b
- Restart your terminal so that conda is on your PATH.
- Download the source code to your directory.
$git clone https://github.com/Ying-Lab/MELT.git
- Create the environments from the .yml files; Conda will download the packages. The new environments are named 'MELT-pytorch-env' and 'MELT-tensorflow-env'.
$cd MELT/
$conda env create -f MELT-pytorch-env.yml
$conda env create -f MELT-tensorflow-env.yml
- Activate environment
- activate pytorch env
$conda activate MELT-pytorch-env
- activate tensorflow env
$conda activate MELT-tensorflow-env
Now you've successfully installed the environments.
- We provide a trained model for each of the three experiments (a loading sketch follows this list):
- Experiment1: MELT/demo_experiment1/resource/trained_model.h5
- Experiment2: MELT/demo_experiment2/trained_model.pth
- Experiment3: MELT/demo_experiment3/trained_model.pth
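If you want to inspect these checkpoints outside the demo scripts, the sketch below shows one way to load them. The save formats are assumptions (a full Keras model for the .h5; a full module or a state_dict for the .pth), so treat the demo scripts as the authoritative loading code.

```python
# Hedged checkpoint-loading sketch; each branch only needs its own environment.
try:
    import tensorflow as tf  # available in MELT-tensorflow-env
    # compile=False skips restoring the custom training loss at load time.
    m1 = tf.keras.models.load_model(
        "demo_experiment1/resource/trained_model.h5", compile=False)
    m1.summary()
except ImportError:
    pass

try:
    import torch  # available in MELT-pytorch-env
    # torch.load may return a full module or a state_dict, depending on how
    # the checkpoint was saved; inspect the result before using it.
    ckpt = torch.load("demo_experiment2/trained_model.pth", map_location="cpu")
    print(type(ckpt))
except ImportError:
    pass
```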
Three small datasets are used for demonstration purposes.
Please activate tensorflow-env first
$source MELT-tensorflow-env/bin/activate
or
$conda activate MELT-tensorflow-env
- Usage of MELT
The main command-line options are as follows:
-h, --help: show help information
-i, --inputcsv: the taxonomy of the input data
-d, --kmer_frequency_dir: the directory of k-mer frequency files (see the sketch after this list)
-t, --test_name: the file listing the test names
-k, --kofKTuple: the k-mer length (the value of k)
-e, --epochNum: the number of epochs
-o, --output: the output directory
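For orientation, the sketch below shows what a k-mer frequency representation encodes. It is illustrative only: the demo ships precomputed k-mer files in resource/kmer/, and the repository's exact file format may differ.

```python
# Illustrative k-mer frequency computation (the demo already provides
# precomputed k-mer files; this only shows what such a vector encodes).
from itertools import product

def kmer_frequencies(seq, k=6):
    """Return the normalized frequency of every DNA k-mer in seq."""
    counts = {"".join(p): 0 for p in product("ACGT", repeat=k)}
    total = 0
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # skip k-mers containing ambiguous bases such as N
            counts[kmer] += 1
            total += 1
    return [c / total for c in counts.values()] if total else list(counts.values())

vec = kmer_frequencies("ACGTACGTACGTTTGACATG" * 10, k=6)
print(len(vec))  # 4096 features for k = 6
```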
- Run MELT to train the model.
Extract the zip file:
$unzip demo_experiment1/resource/kmer.zip -d demo_experiment1/resource/
Create a new folder for the model output:
$mkdir demo_experiment1/output
Enter the demo_experiment1 directory:
$cd demo_experiment1/
Run triplet_model.py:
$python code/triplet_model.py -i resource/data.csv -d resource/kmer/ -t resource/test_name.txt -k 6 -e 30 -o output/
- Predict the taxonomy of unknown species.
Run taxonomy_localization.py:
$python code/taxonomy_localization.py -i resource/data.csv -d resource/kmer/ -t resource/test_name.txt -o output/
The output is written to ./output/predict_taxonomy.txt.
We also provide a trained model in resource/trained_model.h5.
- Go back to the MELT/ directory
$cd ..
Please activate pytorch-env first
$source MELT-pytorch-env/bin/activate
or
$conda activate MELT-pytorch-env
- Enter the demo_experiment2 directory
$cd demo_experiment2/
- Run MELT to train the model. By default it saves the model from every epoch during training to the ./model/ directory.
$python ./train.py
- Predict embeddings for the test data. Here we provide a trained model in ./trained_model.pth. This will produce two CSV files in the current directory (see the sketch after the command):
- train_embedding.csv
- test_embedding.csv
$python ./predict.py
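Once the CSV files are written, the learned dissimilarity between two samples is simply the distance between their embeddings. The sketch below is a hedged example of a nearest-neighbor query over the exported files; the column layout (one row per sample, numeric embedding columns) is an assumption, so check the CSV headers first. The same pattern applies to the experiment 3 outputs.

```python
# Hedged sketch: nearest-neighbor queries over the exported embeddings.
import numpy as np
import pandas as pd

train = pd.read_csv("train_embedding.csv")
test = pd.read_csv("test_embedding.csv")

# Keep only the numeric embedding columns (assumed layout).
tr = train.select_dtypes("number").to_numpy()
te = test.select_dtypes("number").to_numpy()

# Euclidean distance in the embedding space is the learned dissimilarity.
d = np.linalg.norm(te[:, None, :] - tr[None, :, :], axis=-1)  # (n_test, n_train)
print(d.argmin(axis=1))  # index of the nearest training sample per test sample
```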
- Go back to the MELT/ directory
$cd ..
Please activate pytorch-env first
$source MELT-pytorch-env/bin/activate
or
$conda activate MELT-pytorch-env
- Enter the demo_experiment3 directory
$cd ./demo_experiment3
- Run MELT to train the model. By default it saves the model from every epoch during training to the ./model/ directory.
$python ./train.py
- Predict embeddings for the test data. Here we provide a trained model in ./trained_model.pth. This will produce two CSV files in the current directory:
- all_data_embedding.csv
- test_data_embedding.csv
$python ./predict.py