SiWalk is a machine learning-based tool that predicts conserved siRNA production in Arabidopsis using a Random Forest classifier, complementing established methods. It refines predictions by incorporating phase boundaries and a standardized genomic contig index. SiWalk identifies siRNA precursors and active siRNAs based on structural and nucleotide patterns, while improving accuracy and minimizing false positives.
Copyright (c) 2024 Chao-Jung Wu, see License.
Please cite the paper for your work using this tool.
Title: Improving siRNA Prediction in Arabidopsis with Machine Learning
Cite: Chao-Jung Wu, Abdoulaye Baniré Diallo
BibTex:
@article{... ...
}
Link to retrieve the paper: http...
DOI for source code NARGAB release tag: 10.5281/zenodo.17258867
DOI for source codes and supporting files: 10.5281/zenodo.17258908
Languages: python/3.10.2 (scipy, statsmodels, sklearn, numpy)
Packages: spark/3.3.0, bowtie/1.3.0, samtools/1.17, viennarna/2.5.1, perl/5.30.1
Distributed with siWalk: miRanda, miRCheck
Environments: Users should run siWalk on servers that support parallel computing with Spark. This program was developed on Compute Canada clusters, primarily on Narval.
Follow this manual to download and execute the siWalk project on a server. You can use Compute Canada clusters or any other server infrastructure.
The siWalk project is available on GitHub at github/JulieCJWu/siWalk.
To access the release version associated with the paper, visit: siWalk/releases/241106testlargefile.
- Download the assets and install the source code.
- Place the file background.tsv in
/path/siWalk/dbs/folder. - Place the file TAIR10_genomic.fa in
/path/siWalk/dbs/TAIR10folder.
3.1. Generate the RFAs100 model (described in the paper but too large to be distributed with siWalk) using the following command. Note that this step takes approximately 30 minutes on a personal computer.
cd /path/siWalk/src/
annotated_training_file=../dbs/background.tsv
python siWalk_pickle_localization.py $annotated_training_file
3.2. Rename the resulting model file to RFAs100.pkl and move it to /path/siWalk/model/.
Supporting files, including data for Arabidopsis, are bundled with siWalk. Use these files to quickly start analyzing Arabidopsis sRNA-seq data.
- background.tsv: This file served as the Arabidopsis training set in the paper and contains a total of 185539 segment samples.
- TAIR10_genomic.fa: The Arabidopsis genome is provided for your convenience.
- dbs/mature.fa: miRNAs from miRBase v22.1
- dbs/GSM1087998_C0RC1.txt: an example expression library in readcount format, sourced from the GSE44622 series
- dbs/feature_definition.txt: definitions of variables that may appear in the analysis
- model/GBA100.pkl: machine learning model using Gradient Boosting with AdaBoost, based on the top 100 features as described in the paper
- model/GBAs100.pkl: machine learning model using Gradient Boosting with AdaBoost, based on the top 100 structural features as described in the paper
- model/RFAs100.pkl: machine learning model using Random Forest with AdaBoost, based on the top 100 structural features as described in the paper. This file is generated by users following the steps in M2.
- model/Arabidopsis_feature_importance_n_correlation.tsv: the feature evaluation datafile to pair with
GBA100.pkl - model/Arabidopsis_strcture_feature_importance_n_correlation.tsv: the feature evaluation datafile to pair with
GBAs100.pklandRFAs100.pkl
- submit_siWalk_precursor_features.sh: see details in A2.
- siWalk_classify_precursors.py: see details in A3.
- siWalk_predict_siRNA_location.py: see details in B1.
- siWalk_pickle_precursor.py: see details in X2.
- siWalk_pickle_localization.py: see details in X2.
To support the prediction of precursors from expression libraries aligned in SAM format (A1), two distinct scripts are provided: one for computing precursor features (A2) and another for classifying precursors (A3). These steps remain separate to allow for custom precursor labeling before classification, providing greater flexibility in the process.
For effector prediction from a precursor, a script scans the precursor sequence to identify and report the top six predicted siRNA candidates (B1).
To support the creation of a machine learning model using a custom training set (X2), two scripts are provided: one for generating a model using all feature types, and another for generating a model using only the structural features.
siWalk can be executed on various server infrastructures, such as Compute Canada or Microsoft Azure. The setup involves modifying the job submission script according to the server environment.
-
Modify the Submission File: siWalk provides a template submission file for Narval on Compute Canada. To use it:
- Update your credentials in the template.
- Customize job parameters (e.g., CPUs, memory) based on your computational needs.
-
Submit the Job:
Thesubmit_siWalk_precursor_features.shscript requires SLURM and Spark. After customizing the submission file, submit it using the SLURM scheduling command. See details in A2.
The scripts siWalk_classify_precursors.py, siWalk_predict_siRNA_location.py, and siWalk_pickle_precursor.py, along with the steps to create a SAM file, are relatively simple tasks that can be executed on a personal computer with a properly set up environment, should you prefer that approach.
For running these tasks on a Compute Canada cluster, set up the environment as follows: First, nevigate to a subfolder of siWalk,
cd /path/siWalk/workdir/
Then,
module load StdEnv/2020 viennarna/2.5.1 gcc/9.3.0
module load blast+/2.14.0 samtools/1.17 bowtie/1.3.0 scipy-stack
chmod +x ../lib/miranda
virtualenv --no-download env
source env/bin/activate
pip install --no-index --upgrade pip
pip install --no-index statsmodels sklearn
For servers like Microsoft Azure, follow their specific instructions for job submission. Although this has not been explicitly tested by the authors, the general steps are as follows:
- Upload the siWalk files to the server.
- Set up the environment by installing the necessary dependencies and configuring any required settings.
- Submit the job using Azure's task scheduling tools.
siWalk requires SAM files as input for prediction. To create these alignment files, follow the steps below to align sRNA-seq libraries to the appropriate genome reference. While siWalk does not directly support this step, you can easily achieve it by:
- Define the path to the
genome_file. - Set the prefix for the genome using
index_prefix. - Build
bowtiegenome reference index. - Create genome
.faiindex.
Example set up:
genome_file=/path/TAIR10_genomic.fa
index_prefix=/path/bowtie_index/TAIR10_genomic
Then run:
bowtie-build -f --quiet $genome_file $index_prefix
samtools faidx $genome_file
- Prepare the library in
readcountorrawformat:- Ensure the library is in
readcountformat (each line contains a sequence and its occurrence) orrawformat (each line represents an occurrence of a sequence). - Use the provided helper script to convert
readcounttorawformat.
- Ensure the library is in
- Align the library using
bowtie:- Align your sRNA-seq library to the genome using
bowtieand output the results in.samformat. - Depending on the size of the genome and library, processing time may vary (e.g., 20 minutes for Arabidopsis, 20 hours for wheat using 64 CPUs).
- Align your sRNA-seq library to the genome using
- Obtain associated sorted
.bamand index.bai/.csifiles.
Example set up:
path_siWalk=/path/siWalk
filename=GSM1087998_C0RC1
infile=$path_siWalk/dbs/$filename.txt # in readcount format
Then run:
mkdir $path_siWalk/input
raw=$path_siWalk/input/$filename.raw
python $path_siWalk/src/rc2raw.py $infile > $raw
samfile=$path_siWalk/input/$filename.sam
bamfile=$path_siWalk/input/sorted_$filename.bam
bowtie -r $raw -x $index_prefix --threads 64 -v 0 --mm -a --sam --no-unal > $samfile
samtools sort $samfile > $bamfile
samtools index $bamfile #\\ or \\ samtools index -c $bamfile # for large genome
- Enter your credentials in the
submit_siWalk_precursor_features.shsubmission file.
Edit the following variables in the file to specify the input path for the.sam,.bam, and.baifiles, as well as the file path to the reference genome.
INPUT_SAM_PATH=/path/siWalk/input
genome_file=/path/TAIR10_genomic.fa
- Submit the job to the queue from the
workdirfolder, and take note of the job ID.
cd /path/siWalk/workdir/
sbatch submit_siWalk_precursor_features.sh
- Locate output files:
- Output files, including the
annotation_fileof expression and structural features ($filename.contig_features.tsv), will be available in the folder/path/siWalk/output/$jobID/$filename/.
- Output files, including the
Check that the output files have been generated:
ls /path/siWalk/output/$jobID/$filename/
annotation_file=/path/siWalk/output/$jobID/$filename/$filename.contig_features.tsv
head -n 3 $annotation_file
- Run precursor classification:
- Use the script
siWalk_classify_precursors.pyto predict precursor segments by passing theannotation_filefrom the previous step.
- Use the script
cd /path/siWalk/src/
python siWalk_classify_precursors.py $annotation_file
- Retrieve prediction results:
- The output, including the predicted siRNA-generating loci, will be available as
prefix.prediction.tsvin the same folder as theannotation_file. - Key columns include:
- CONTIG: 5250-nt genomic region defined by coordinates (e.g.,
3__5860000_5865250on chromosome 3). - eff_strand: coordinate of the most abundant effector siRNA allowing drift.
- eff_seq: sequence of the most abundant effector siRNA.
- segment: predicted start and end positions of the siRNA-generating locus.
- Predicted_Class: classification of the locus as an siRNA precursor (
trueorfalse) predicted by the model.
- CONTIG: 5250-nt genomic region defined by coordinates (e.g.,
- The output, including the predicted siRNA-generating loci, will be available as
Check the output file:
file=/path/siWalk/output/$jobID/$filename/*prediction.tsv
head -n 3 $file
- Run the siWalk script to scan the precursor sequence and identify the most likely effector siRNA location:
- The script
siWalk_predict_siRNA_location.pyrequires an input sequence (priseq) and the DicerCall parameter to define the dicing interval (e.g., 21 nt). - For optimal performance, it is recommended to limit precursor sequences to a maximum of 200 nucleotides to ensure efficient identification of local siRNA candidates, although there is no limitation on sequence length.
- The script
cd /path/siWalk/src/
priseq=CTTGACCTTGTAAGGCCTTTTCTTGACCTTGTAAGACCCCATCTCTTTCTAAACGTTTTATTATTTTCTCGTTTTACAGATTCTATTCTATCTCTTCTCAATATAGAATAGATATCTATCT
DicerCall=21
python siWalk_predict_siRNA_location.py $priseq $DicerCall
B2. Obtain the recommendations of the most likely start and end positions of effector siRNA on a precursor
- Retrieve the result file from the
/path/siWalk/output/timestamp/folder. You will find the following output files:- yourPrecursor.effector_localization_indication.tsv: Contains the predicted effector localization information.
- yourPrecursor.effector_localization_indication.png: A graphical representation of the predicted localization.
- yourPrecursor.effector_localization_indication_top6_recommendation.tsv: Contains the top 6 predicted siRNA candidates from the precursor.
The most likely effector positions will also be displayed in the standard output. Stdout example:
> yourPrecursor, predicted siRNA start: 38, end: 59, score: 749.860201
-
Construct the training dataset
- For each precursor candidate, compute the relevant features as outlined in A2.
-
Annotate the dataset with class labels
- Add a
consistentcolumn, where each instance is labeled as True or False, indicating whether the candidate corresponds to a true precursor (True) or a non-precursor (False). - Ensure that the dataset contains a sufficient number of positive and negative examples.
- This labeling process is user-defined and should be tailored according to the specific dataset and classification requirements.
- Add a
-
The resulting file will be the
annotated_training_filefor the next step.
- Considering top 100 overall attributes, run the script
siWalk_pickle_precursor.pyto train the machine learning model:- This script will produce several output files, including:
- output/prefix_GradientBoostingWithAdaBoost_model.pkl: trained model with a timestamp for version control.
- output/prefix_feature_importance_n_correlation.tsv: feature evaluation file listing the ranked importance of features and their correlation with precursor classes.
- The model generated by this script, trained on the provided
background.tsvfile, is the GBA100 model described in the paper, which is also included with the siWalk distribution. - Models in this category are intended solely for precursor predictions.
- This script will produce several output files, including:
cd /path/siWalk/src/
annotated_training_file=../dbs/background.tsv #\\ or \\ annotated_training_file=from_X1
python siWalk_pickle_precursor.py $annotated_training_file
- Considering top 100 structural attributes, run the script
siWalk_pickle_localization.pyto train the machine learning model:- This script will produce several output files, including:
- output/prefix_RandomForestWithAdaBoost_model.pkl: trained model with a timestamp for version control.
- output/prefix_strcture_feature_importance_n_correlation.tsv: feature evaluation file listing the ranked importance of structural features and their correlation with precursor classes.
- The model generated by this script, trained on the provided
background.tsvfile, is the RFAs100 model (also detailed in M2). - Models in this category are suitable for both precursor and effector predictions.
- This script will produce several output files, including:
cd /path/siWalk/src/
annotated_training_file=../dbs/background.tsv #\\ or \\ annotated_training_file=from_X1
python siWalk_pickle_localization.py $annotated_training_file