qiyuanhuakai/MetaCSST

 
 


###############################################################################################
##Package: Metagenomic Complex Sequence Scanning Tool (MetaCSST)                             ##
##Developer: Fazhe Yan                                                                       ##
##Email: fazheyan33@163.com ; ccwei@sjtu.edu.cn                                              ##
##Department: Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University  ##
###############################################################################################

##################
## Introduction ##
##################
Metagenomic Complex Sequence Scanning Tool (MetaCSST) predicts diversity-generating retroelements (DGRs) in sequenced genomes and in metagenomic datasets. It is based on a generalized hidden Markov model (GHMM) and uses motif patterns to identify the elements of a DGR: the template region (TR), the variable region (VR) and the reverse transcriptase (RT).

###############
## Copyright ##
###############
This software is free for personal, academic and non-profit use. It is available from GitHub at https://github.com/fzyan/MetaCSST
For commercial users, please contact <ccwei@sjtu.edu.cn>.

#########################
## System requirements ##
#########################
Linux operating system, with about 2 GB of memory when using multiple threads.
Python 3.9+ (with numpy, numba and pyfastx) and gcc with C++20 support.

###########
## Usage ##
###########
1>Identify sub structures (TR, VR or RT) in DGRs:
    ./MetaCSSTsub -build config.json -in $fa [-out $out_dir] [-thread $thread]

    # config.json : single-file GHMM config (see "Files")
    # $fa : input file in FASTA format
    # $out_dir : output directory. If not given, the default output directory is "out_metacsst"
    # $thread : number of threads, default 1

2>DGR prediction
    Step1: ./MetaCSSTmain -build config.json -in $fa [-out $out_dir1] [-thread $thread]
           # Identification of the sub structures using GHMM
    Step2: python3 src/call_vr.py $out_dir1/raw.gtf $fa $out-DGR
           # calling VRs and removing duplicate TR-VR pairs

    # Note: legacy *.config files are no longer supported.
    #       Use single-file config.json / config.toml / config.yaml only.
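
The two DGR prediction steps above can be chained from a script. The following is a minimal sketch, assuming it is run from the repository root. The function names (build_step1, build_step2, predict_dgrs) and the example paths are illustrative and not part of MetaCSST. The third argument of call_vr.py is passed through unchanged, since the usage line does not say whether it is a prefix or a full file name.

```python
import subprocess
from typing import List


def build_step1(fasta: str, out_dir: str, threads: int = 1,
                config: str = "config.json") -> List[str]:
    """Assemble the Step1 command line as documented in the Usage section."""
    return ["./MetaCSSTmain", "-build", config, "-in", fasta,
            "-out", out_dir, "-thread", str(threads)]


def build_step2(out_dir: str, fasta: str, dgr_out: str) -> List[str]:
    """Assemble the Step2 command line; raw.gtf is read from the Step1 output directory."""
    return ["python3", "src/call_vr.py", f"{out_dir}/raw.gtf", fasta, dgr_out]


def predict_dgrs(fasta: str, out_dir: str, dgr_out: str,
                 threads: int = 1, run: bool = False) -> List[List[str]]:
    """Return both commands in order; execute them only when run=True."""
    steps = [build_step1(fasta, out_dir, threads),
             build_step2(out_dir, fasta, dgr_out)]
    if run:
        for cmd in steps:
            # check=True stops the pipeline if a step exits with an error
            subprocess.run(cmd, check=True)
    return steps


if __name__ == "__main__":
    # Dry run: print the commands without executing them
    for cmd in predict_dgrs("genome.fa", "out_dir1", "out-DGR", threads=4):
        print(" ".join(cmd))
```

Calling predict_dgrs(..., run=True) executes Step1 and then Step2, so Step2 only starts once raw.gtf has been written.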

###############
## OUT files ##
###############
1>Identify sub structures (TR, VR or RT) in DGRs:
    out_dir/out.txt   : Identified sub structures
    out_dir/align.txt : count matrix for each position, used to build PWMs
    out_dir/score.txt : PWMs (scoring matrices)
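
To illustrate how a per-position count matrix relates to a scoring matrix, here is a minimal sketch of one common construction: log-odds scores with a pseudocount and a uniform background. This is an assumption made for illustration. The exact formula, pseudocount and file layout that MetaCSST uses for align.txt and score.txt are not described here, so the example works on an in-memory list, and the names counts_to_pwm and score_site are hypothetical.

```python
import math
from typing import Dict, List

BASES = "ACGT"


def counts_to_pwm(counts: List[Dict[str, int]],
                  pseudocount: float = 1.0,
                  background: float = 0.25) -> List[Dict[str, float]]:
    """Turn per-position base counts into log2 odds scores.

    Assumed scheme (not taken from MetaCSST): add a pseudocount to every
    base, normalise to a frequency, divide by a uniform background.
    """
    pwm = []
    for column in counts:
        total = sum(column.get(b, 0) for b in BASES) + pseudocount * len(BASES)
        pwm.append({
            b: math.log2((column.get(b, 0) + pseudocount) / total / background)
            for b in BASES
        })
    return pwm


def score_site(pwm: List[Dict[str, float]], site: str) -> float:
    """Sum the per-position scores of a candidate site of the same length."""
    if len(site) != len(pwm):
        raise ValueError("site length must match the number of PWM columns")
    return sum(column[base] for column, base in zip(pwm, site))


if __name__ == "__main__":
    # Toy counts: position 1 is always A, position 2 is uninformative
    toy_counts = [{"A": 8, "C": 0, "G": 0, "T": 0},
                  {"A": 2, "C": 2, "G": 2, "T": 2}]
    toy_pwm = counts_to_pwm(toy_counts)
    print(score_site(toy_pwm, "AA"), score_site(toy_pwm, "CA"))
```

A site that matches the conserved position scores higher than one that does not, which is the property a motif-based scan relies on.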

2>DGR prediction
    Step1:
        out_dir1/raw.gtf : TRs and RTs identified.
    Step2:
        out-DGR.gtf : Final DGR output generated by call_vr.py
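
The GTF outputs can be inspected with a few lines of Python. This is a minimal sketch, assuming the files follow the standard nine-column, tab-separated GTF layout (seqname, source, feature, start, end, score, strand, frame, attributes). The function name count_features is hypothetical, and the feature labels in the sample lines are made up for illustration and may differ from what MetaCSST writes.

```python
from collections import Counter
from typing import Iterable


def count_features(lines: Iterable[str]) -> Counter:
    """Tally the third GTF column (feature type) over all records."""
    counts: Counter = Counter()
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and comment/header lines
        if not line or line.startswith("#"):
            continue
        fields = line.split("\t")
        # Ignore lines that do not have the nine GTF columns
        if len(fields) < 9:
            continue
        counts[fields[2]] += 1
    return counts


if __name__ == "__main__":
    # Made-up records; with real data, pass an open file handle instead,
    # e.g. open("out-DGR.gtf")
    sample = [
        "# illustrative records only",
        "contig1\tMetaCSST\tTR\t100\t220\t.\t+\t.\tid \"1\";",
        "contig1\tMetaCSST\tRT\t400\t1500\t.\t+\t.\tid \"1\";",
        "contig2\tMetaCSST\tTR\t50\t170\t.\t-\t.\tid \"2\";",
    ]
    for feature, n in sorted(count_features(sample).items()):
        print(feature, n)
```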

###########
## Files ##
###########
    |-MetaCSSTmain                                      executable program to predict DGRs
    |-MetaCSSTsub                                       executable program to identify TRs, VRs or RTs
    |-config.json / config.toml / config.yaml           single-file configuration for the GHMM
    |-align/*align                                      alignment matrices used to develop the GHMM
    |-src/main_modern.cpp                               source code to build MetaCSSTmain
    |-src/sub_modern.cpp                                source code to build MetaCSSTsub
    |-src/ghmm_modern.hpp                               GHMM core
    |-src/fun_modern.hpp                                utility functions
    |-src/config_modern.hpp                             config parsing utilities
    |-src/call_vr.py                                    VR calling + duplicate removal
    |-addition/*                                        collected/training/test data
    |-example and example.sh                            example pipeline to identify DGRs

##################
## Installation ##
##################
MetaCSSTmain and MetaCSSTsub are executable programs.
To recompile them after modifying the source code:
   g++ -std=c++20 -O2 -Wall -Wextra -pthread src/main_modern.cpp -o MetaCSSTmain
   g++ -std=c++20 -O2 -Wall -Wextra -pthread src/sub_modern.cpp -o MetaCSSTsub

#############
## Contact ##
#############
If you have any questions, feel free to contact us:
   fazheyan33@163.com
   ccwei@sjtu.edu.cn

