Skip to content

bioinfoUQAM/siWalk

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

31 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

siWalk

SiWalk is a machine learning-based tool that predicts conserved siRNA production in Arabidopsis using a Random Forest classifier, complementing established methods. It refines predictions by incorporating phase boundaries and a standardized genomic contig index. SiWalk identifies siRNA precursors and active siRNAs based on structural and nucleotide patterns, while improving accuracy and minimizing false positives.

Copyright (c) 2024 Chao-Jung Wu, see License.

Citation:

Please cite the paper for your work using this tool.

Title: Improving siRNA Prediction in Arabidopsis with Machine Learning

Cite: Chao-Jung Wu, Abdoulaye Baniré Diallo

BibTex:

@article{... ...
}

Link to retrieve the paper: http...

DOI for source code NARGAB release tag: 10.5281/zenodo.17258867

DOI for source codes and supporting files: 10.5281/zenodo.17258908

Dependencies:

Languages: python/3.10.2 (scipy, statsmodels, sklearn, numpy)

Packages: spark/3.3.0, bowtie/1.3.0, samtools/1.17, viennarna/2.5.1, perl/5.30.1

Distributed with siWalk: miRanda, miRCheck

Environments: Users should run siWalk on servers that support parallel computing with Spark. This program was developed on Compute Canada clusters, primarily on Narval.

Manual

Follow this manual to download and execute the siWalk project on a server. You can use Compute Canada clusters or any other server infrastructure.

M1. Download siWalk from GitHub

The siWalk project is available on GitHub at github/JulieCJWu/siWalk.

To access the release version associated with the paper, visit: siWalk/releases/241106testlargefile.

  • Download the assets and install the source code.
  • Place the file background.tsv in /path/siWalk/dbs/ folder.
  • Place the file TAIR10_genomic.fa in /path/siWalk/dbs/TAIR10 folder.

M2. Generate the RFAs100 Model

3.1. Generate the RFAs100 model (described in the paper but too large to be distributed with siWalk) using the following command. Note that this step takes approximately 30 minutes on a personal computer.

cd /path/siWalk/src/
annotated_training_file=../dbs/background.tsv
python siWalk_pickle_localization.py $annotated_training_file

3.2. Rename the resulting model file to RFAs100.pkl and move it to /path/siWalk/model/.

M3. Description of supporting files

Supporting files, including data for Arabidopsis, are bundled with siWalk. Use these files to quickly start analyzing Arabidopsis sRNA-seq data.

  • background.tsv: This file served as the Arabidopsis training set in the paper and contains a total of 185539 segment samples.
  • TAIR10_genomic.fa: The Arabidopsis genome is provided for your convenience.
  • dbs/mature.fa: miRNAs from miRBase v22.1
  • dbs/GSM1087998_C0RC1.txt: an example expression library in readcount format, sourced from the GSE44622 series
  • dbs/feature_definition.txt: definitions of variables that may appear in the analysis
  • model/GBA100.pkl: machine learning model using Gradient Boosting with AdaBoost, based on the top 100 features as described in the paper
  • model/GBAs100.pkl: machine learning model using Gradient Boosting with AdaBoost, based on the top 100 structural features as described in the paper
  • model/RFAs100.pkl: machine learning model using Random Forest with AdaBoost, based on the top 100 structural features as described in the paper. This file is generated by users following the steps in M2.
  • model/Arabidopsis_feature_importance_n_correlation.tsv: the feature evaluation datafile to pair with GBA100.pkl
  • model/Arabidopsis_strcture_feature_importance_n_correlation.tsv: the feature evaluation datafile to pair with GBAs100.pkl and RFAs100.pkl

M4. Overview of core scripts

  • submit_siWalk_precursor_features.sh: see details in A2.
  • siWalk_classify_precursors.py: see details in A3.
  • siWalk_predict_siRNA_location.py: see details in B1.
  • siWalk_pickle_precursor.py: see details in X2.
  • siWalk_pickle_localization.py: see details in X2.

To support the prediction of precursors from expression libraries aligned in SAM format (A1), two distinct scripts are provided: one for computing precursor features (A2) and another for classifying precursors (A3). These steps remain separate to allow for custom precursor labeling before classification, providing greater flexibility in the process.

For effector prediction from a precursor, a script scans the precursor sequence to identify and report the top six predicted siRNA candidates (B1).

To support the creation of a machine learning model using a custom training set (X2), two scripts are provided: one for generating a model using all feature types, and another for generating a model using only the structural features.

M5. Run siWalk on a server

siWalk can be executed on various server infrastructures, such as Compute Canada or Microsoft Azure. The setup involves modifying the job submission script according to the server environment.

5a. Use Compute Canada clusters (e.g., Narval)

  • Modify the Submission File: siWalk provides a template submission file for Narval on Compute Canada. To use it:

    • Update your credentials in the template.
    • Customize job parameters (e.g., CPUs, memory) based on your computational needs.
  • Submit the Job:
    The submit_siWalk_precursor_features.sh script requires SLURM and Spark. After customizing the submission file, submit it using the SLURM scheduling command. See details in A2.

5b. Use a Login Node for simple tasks on Compute Canada clusters

The scripts siWalk_classify_precursors.py, siWalk_predict_siRNA_location.py, and siWalk_pickle_precursor.py, along with the steps to create a SAM file, are relatively simple tasks that can be executed on a personal computer with a properly set up environment, should you prefer that approach.

For running these tasks on a Compute Canada cluster, set up the environment as follows: First, nevigate to a subfolder of siWalk,

cd /path/siWalk/workdir/

Then,

module load StdEnv/2020 viennarna/2.5.1 gcc/9.3.0 
module load blast+/2.14.0 samtools/1.17 bowtie/1.3.0 scipy-stack
chmod +x ../lib/miranda
virtualenv --no-download env
source env/bin/activate
pip install --no-index --upgrade pip
pip install --no-index statsmodels sklearn

5c. Use other servers

For servers like Microsoft Azure, follow their specific instructions for job submission. Although this has not been explicitly tested by the authors, the general steps are as follows:

  • Upload the siWalk files to the server.
  • Set up the environment by installing the necessary dependencies and configuring any required settings.
  • Submit the job using Azure's task scheduling tools.

Module A. precursor prediction

A1. Create SAM file

siWalk requires SAM files as input for prediction. To create these alignment files, follow the steps below to align sRNA-seq libraries to the appropriate genome reference. While siWalk does not directly support this step, you can easily achieve it by:

  1. Define the path to the genome_file.
  2. Set the prefix for the genome using index_prefix.
  3. Build bowtie genome reference index.
  4. Create genome .fai index.

Example set up:

genome_file=/path/TAIR10_genomic.fa
index_prefix=/path/bowtie_index/TAIR10_genomic

Then run:

bowtie-build -f --quiet $genome_file $index_prefix
samtools faidx $genome_file
  1. Prepare the library in readcount or raw format:
    • Ensure the library is in readcount format (each line contains a sequence and its occurrence) or raw format (each line represents an occurrence of a sequence).
    • Use the provided helper script to convert readcount to raw format.
  2. Align the library using bowtie:
    • Align your sRNA-seq library to the genome using bowtie and output the results in .sam format.
    • Depending on the size of the genome and library, processing time may vary (e.g., 20 minutes for Arabidopsis, 20 hours for wheat using 64 CPUs).
  3. Obtain associated sorted .bam and index .bai/.csi files.

Example set up:

path_siWalk=/path/siWalk
filename=GSM1087998_C0RC1
infile=$path_siWalk/dbs/$filename.txt # in readcount format

Then run:

mkdir $path_siWalk/input
raw=$path_siWalk/input/$filename.raw
python $path_siWalk/src/rc2raw.py $infile > $raw
samfile=$path_siWalk/input/$filename.sam
bamfile=$path_siWalk/input/sorted_$filename.bam
bowtie -r $raw -x $index_prefix --threads 64 -v 0 --mm -a --sam --no-unal > $samfile
samtools sort $samfile > $bamfile
samtools index $bamfile #\\ or \\ samtools index -c $bamfile # for large genome

A2. Generate features

  1. Enter your credentials in the submit_siWalk_precursor_features.sh submission file.
    Edit the following variables in the file to specify the input path for the .sam, .bam, and .bai files, as well as the file path to the reference genome.
INPUT_SAM_PATH=/path/siWalk/input
genome_file=/path/TAIR10_genomic.fa
  1. Submit the job to the queue from the workdir folder, and take note of the job ID.
cd /path/siWalk/workdir/
sbatch submit_siWalk_precursor_features.sh
  1. Locate output files:
    • Output files, including the annotation_file of expression and structural features ($filename.contig_features.tsv), will be available in the folder /path/siWalk/output/$jobID/$filename/.

Check that the output files have been generated:

ls /path/siWalk/output/$jobID/$filename/
annotation_file=/path/siWalk/output/$jobID/$filename/$filename.contig_features.tsv
head -n 3 $annotation_file

A3. Predict precursor segments

  1. Run precursor classification:
    • Use the script siWalk_classify_precursors.py to predict precursor segments by passing the annotation_file from the previous step.
cd /path/siWalk/src/
python siWalk_classify_precursors.py $annotation_file
  1. Retrieve prediction results:
    • The output, including the predicted siRNA-generating loci, will be available as prefix.prediction.tsv in the same folder as the annotation_file.
    • Key columns include:
      • CONTIG: 5250-nt genomic region defined by coordinates (e.g., 3__5860000_5865250 on chromosome 3).
      • eff_strand: coordinate of the most abundant effector siRNA allowing drift.
      • eff_seq: sequence of the most abundant effector siRNA.
      • segment: predicted start and end positions of the siRNA-generating locus.
      • Predicted_Class: classification of the locus as an siRNA precursor (true or false) predicted by the model.

Check the output file:

file=/path/siWalk/output/$jobID/$filename/*prediction.tsv
head -n 3 $file

Module B. effector prediction from a precursor

B1. Provide your precursor sequence, and generate structural features

  1. Run the siWalk script to scan the precursor sequence and identify the most likely effector siRNA location:
    • The script siWalk_predict_siRNA_location.py requires an input sequence (priseq) and the DicerCall parameter to define the dicing interval (e.g., 21 nt).
    • For optimal performance, it is recommended to limit precursor sequences to a maximum of 200 nucleotides to ensure efficient identification of local siRNA candidates, although there is no limitation on sequence length.
cd /path/siWalk/src/
priseq=CTTGACCTTGTAAGGCCTTTTCTTGACCTTGTAAGACCCCATCTCTTTCTAAACGTTTTATTATTTTCTCGTTTTACAGATTCTATTCTATCTCTTCTCAATATAGAATAGATATCTATCT
DicerCall=21
python siWalk_predict_siRNA_location.py $priseq $DicerCall

B2. Obtain the recommendations of the most likely start and end positions of effector siRNA on a precursor

  1. Retrieve the result file from the /path/siWalk/output/timestamp/ folder. You will find the following output files:
    • yourPrecursor.effector_localization_indication.tsv: Contains the predicted effector localization information.
    • yourPrecursor.effector_localization_indication.png: A graphical representation of the predicted localization.
    • yourPrecursor.effector_localization_indication_top6_recommendation.tsv: Contains the top 6 predicted siRNA candidates from the precursor.

The most likely effector positions will also be displayed in the standard output. Stdout example:

> yourPrecursor, predicted siRNA start: 38, end: 59, score: 749.860201

Module X. train a machine learning model

X1. Create the training dataset

  1. Construct the training dataset

    • For each precursor candidate, compute the relevant features as outlined in A2.
  2. Annotate the dataset with class labels

    • Add a consistent column, where each instance is labeled as True or False, indicating whether the candidate corresponds to a true precursor (True) or a non-precursor (False).
    • Ensure that the dataset contains a sufficient number of positive and negative examples.
    • This labeling process is user-defined and should be tailored according to the specific dataset and classification requirements.
  3. The resulting file will be the annotated_training_file for the next step.

X2. Generate the model

  1. Considering top 100 overall attributes, run the script siWalk_pickle_precursor.py to train the machine learning model:
    • This script will produce several output files, including:
      • output/prefix_GradientBoostingWithAdaBoost_model.pkl: trained model with a timestamp for version control.
      • output/prefix_feature_importance_n_correlation.tsv: feature evaluation file listing the ranked importance of features and their correlation with precursor classes.
    • The model generated by this script, trained on the provided background.tsv file, is the GBA100 model described in the paper, which is also included with the siWalk distribution.
    • Models in this category are intended solely for precursor predictions.
cd /path/siWalk/src/
annotated_training_file=../dbs/background.tsv #\\ or \\ annotated_training_file=from_X1
python siWalk_pickle_precursor.py $annotated_training_file
  1. Considering top 100 structural attributes, run the script siWalk_pickle_localization.py to train the machine learning model:
    • This script will produce several output files, including:
      • output/prefix_RandomForestWithAdaBoost_model.pkl: trained model with a timestamp for version control.
      • output/prefix_strcture_feature_importance_n_correlation.tsv: feature evaluation file listing the ranked importance of structural features and their correlation with precursor classes.
    • The model generated by this script, trained on the provided background.tsv file, is the RFAs100 model (also detailed in M2).
    • Models in this category are suitable for both precursor and effector predictions.
cd /path/siWalk/src/
annotated_training_file=../dbs/background.tsv #\\ or \\ annotated_training_file=from_X1
python siWalk_pickle_localization.py $annotated_training_file

About

No description, website, or topics provided.

Resources

License

Stars

Watchers

Forks

Packages

No packages published