IM-ML

A machine learning workflow that predicts the regulon membership of genes from promoter sequence features. It focuses on top-down regulons derived from an Independent Component Analysis (ICA) of the PRECISE E. coli RNA-seq database.

What is Independent Component Analysis?

To learn about ICA, how ICA components are computed, and what they can tell you, please visit https://imodulondb.org/about.html

Workflow outline

  1. Generate SigmaFactor PSSMs
  2. Feature Matrix Generation (generates a ~200 MB file required by the machine learning steps)
  3. Feature Engineering
  4. Machine learning: model training and hyperparameter optimization
  5. Adding ArcA direct repeat motifs to improve model performance
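
For readers unfamiliar with position-specific scoring matrices (step 1), the idea can be sketched in a few lines of plain Python. This is a minimal illustration and not the code used in this repository: the function names `build_pssm` and `best_match`, the pseudocount, the uniform background frequency, and the example hexamers are all assumptions made for the example.

```python
import math

BASES = "ACGT"

def build_pssm(sites, pseudocount=1.0, background=0.25):
    """Build a log-odds PSSM (one dict per position) from aligned, equal-length sites."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pssm = []
    for pos in range(length):
        # Start every base at the pseudocount so unseen bases get a finite score.
        counts = {base: pseudocount for base in BASES}
        for site in sites:
            counts[site[pos]] += 1  # raises KeyError on non-ACGT characters
        total = sum(counts.values())
        # Log-odds of the observed frequency against a uniform background (assumption).
        pssm.append({b: math.log2((counts[b] / total) / background) for b in BASES})
    return pssm

def best_match(pssm, sequence):
    """Slide the PSSM along the sequence and return (best_score, start_index)."""
    width = len(pssm)
    best_score, best_start = float("-inf"), -1
    for start in range(len(sequence) - width + 1):
        window = sequence[start:start + width]
        score = sum(column[base] for column, base in zip(pssm, window))
        if score > best_score:
            best_score, best_start = score, start
    return best_score, best_start

# Hypothetical -10 box-like hexamers, for illustration only (not data from this repository).
example_sites = ["TATAAT", "TATAAT", "TACAAT", "TATGAT"]
pssm = build_pssm(example_sites)
score, position = best_match(pssm, "GGCGTATAATGCC")
```

The best-scoring window and its position are the kind of per-promoter values that could feed a feature matrix. The actual features used by the workflow are defined in the repository's own code.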

Dependencies

The workflow depends on:

  1. bitome: https://github.com/SBRG/bitome
  2. pymodulon: https://github.com/SBRG/pymodulon
  3. DNAshapeR: https://github.com/TsuPeiChiu/DNAshapeR
  4. scikit-learn: https://scikit-learn.org/stable/
  5. seaborn (statistical data visualization): https://seaborn.pydata.org/index.html

Recommended package versions are:
Python==3.8
seaborn==0.12.2
numpy==1.24.3
matplotlib==3.7.1
pandas==1.5.3
biopython==1.78
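
To compare an existing environment against the recommended versions above, something like the following sketch may help. It uses only the standard library (`importlib.metadata`, available from Python 3.8). The helper name `check_versions` is hypothetical and not part of this repository.

```python
from importlib.metadata import PackageNotFoundError, version

# Pins copied from the list of recommended package versions.
RECOMMENDED = {
    "seaborn": "0.12.2",
    "numpy": "1.24.3",
    "matplotlib": "3.7.1",
    "pandas": "1.5.3",
    "biopython": "1.78",
}

def check_versions(recommended):
    """Return {package: (recommended, installed_or_None)} for every mismatch."""
    mismatches = {}
    for name, wanted in recommended.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = (wanted, installed)
    return mismatches

if __name__ == "__main__":
    for name, (wanted, installed) in check_versions(RECOMMENDED).items():
        print(f"{name}: recommended {wanted}, found {installed or 'not installed'}")
```

An empty result means every listed package matches its recommended version. A mismatch does not necessarily mean the workflow will fail, since the versions are recommendations.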

Citation

Qiu, S., Lamoureux, C., Akbari, A., Palsson, B. O., & Zielinski, D. C. (2022). Quantitative sequence basis for the E. coli transcriptional regulatory network. https://doi.org/10.1101/2022.02.20.481200