In this page, we provide examples illustrating the different options offered by the platform.
metagm_build: "Quality control on reference genomes was done using checkm (ref), filtering out assemblies with more than 5% contamination or less than 90% completeness. High-quality assemblies were used to build a Kraken2 (ref) database. A Bracken (ref) database was also build."
metagm_classify: "Paired-end read files were classified using Kraken2 (ref) custom database (include description? n species?). Bracken (ref) was then applied to obtain species-level metagenomics profiles."
- Currently, it is preferable to run the scripts on
farm3
, as there is no large memory queue elsewhere (especiallymetagm_build
). - To use all the scripts described in this page, users need to add the following to their
~/.profile
file:
# add the metagm library to your path
export PATH=/nfs/team162/kv4/github/metagm/metagm/wrapper:$PATH
Then: depending of the server you use (farm3 or farm5):
-
optional only if working on farm5 or pcs6 , load useful modules:
module load /nfs/users/nfs_k/kv4/modules/metagm
-
Add the following line in your
~/.profile
:alias metagm_env='source /nfs/team162/kv4/bin/metagm_env/bin/activate'
-
Add the following line in your
~/.profile
:alias metagm_env_farm5='source /nfs/team162/kv4/bin/metagm_env_farm5/bin/activate'
-
Source it:
source ~/.profile
-
(farm3) Run
metagm_env
in the terminal it should slightly change the prompt (metagm_env
) in front of it -
(farm5) Run
metagm_env_farm5
in the terminal it should slightly change the prompt (metagm_env
) in front of it
Note: to leave the virtual environment, type deactivate
This section describes the process of building databases for different softwares (Kraken, Bracken, Mash, ...) using a list of reference genomes. It also takes care of submitting all the jobs on farm and manage job dependencies.
There is two mandatory positional arguments for this function:
- a
genomes
list (text file):- mandatory first column contains the absolute paths to genome assemblies (
.fa
or.fna
).- For internal genomes, please use symlinks path not the pathogen system path. The reason being that all assemblies are named
contigs.fa
in the system and therefore are not unique names. - User can get symlinks from
pf assembly -t file -i YOUR_LIST_OF_LANEIDS.txt -l MY_SYMLINK_FOLDER
- For internal genomes, please use symlinks path not the pathogen system path. The reason being that all assemblies are named
- second column contains the genome names (if not provided, the file name will be used)
- third column contains the taxids (if not provided, a taxonomic assignment step is performed)
- mandatory first column contains the absolute paths to genome assemblies (
- an
output
folder to store all the files produced by the script
The script metagm_build.py
also offers options in the building database process:
-h
: help display-v
: verbose mode for additional details on each step--QC
: run quality control on the list of genomes before building any database. If not done, the scirpt assumes that all the genomes have already been cheked.--taxoAssign
: run taxonomic assignment step on all the genomes. It will automatically be done if Kraken/Bracken databases are built.--KrakenDB
: build a Kraken2 database using the list of genomes--BrackenDB
: build a Bracken database using the list of genomes. Requires a Kraken database to exist, and will therefore automatically creates one--MashDB
: create a Mash sketch of all the genomes.--ncbi
: rely on NCBI taxonomy instead of gtdb (default: false)-b
: define the number of genomes to be analyzed in each batch (default: 10)-t
: define the number of threads used in each job (default: 2)-q
: define to which queue the jobs are submitted (default: long)-m
: define the amount of memory requested for each job (default: 64)--maxcontamination
: define the maximal value on checkm contamination to filter out a genome during QC (default: 5)--mincompleteness
: define the minimal value on checkm completeness to filter out a genome during QC (default: 90)
The output
folder contains a directory for each task that is performed:
output/merge_final
contains quality control (QC) results for all the genomesoutput/merge_final/ValidatedGenomes.txt
is the list of all genomes that pass QCoutput/merge_final/FilteredGenomes.txt
is the list of all genomes that fail QCoutput/merge_final/log.txt
provides details on why a genome failed QC
output/tmp$i
folders contain the checkm lineage_wf output for the batchi
of genomes (if parallelized)output/Kraken
contains the Kraken2 database, as well as the Bracken files (if requested)output/Mash
contains the Mash sketch fileoutput/genome_with_[gtdb|ncbi]_taxid.txt
contains the list of taxids assigned to the genomes list if a taxonomic assignment step is performed
The following examples illustrate various features from the metagm_build.py
script. Depending how busy farm is, these examples can take some time to run.
metagm_build.py /nfs/team162/kv4/bin/list_example_pipeline.txt ./test --QC --taxoAssign -m 150
The command applies:
- Quality control on 13 genomes.
- according to
./merge_final/ValidatedGenomes.txt
, there is 12 genomes that passed QC - according to
./merge_final/FilteredGenomes.txt
, there is 1 genome that failed QC - according to
./merge_final/log.txt
, the genome was filtered because of XYZ
- taxonomic assignment on validated genomes only, using GTDB taxonomy (default).
- the taxonomic assignment can be found in
./genome_with_gtdb_taxid.txt
- The taxonomic assignment step is high in memory requirement (~150Gb).
This example achieves the same task than previously, except it takes advantage of job parallelization. Better than running jobs on 10 genomes at a time (2 jobs in previous example), we now do batches of 2 genomes (-b
flag), therefore submitting 7 smaller jobs.
#delete previous example folder
rm -rf ./test
# run the faster command
metagm_build.py /nfs/team162/kv4/bin/list_example_pipeline.txt ./test --QC --taxoAssign -m 150 -b 2
It is possible to skip the quality control and directly create Kraken database if user already checked the genomes. Additionally, the taxonomic assignment step can be ignored if user provides a taxid for each genome (third column in input list). Here, we use the curated genome list generated in the previous example.
#Run on a queue with 'hugemem' (e.g., farm3)
metagm_build.py ./genome_with_gtdb_taxid.txt ./test --KrakenDB --BrackenDB
The command builds:
- Kraken2 database on XYZ genomes using GTDB taxonomy.
- generate multiple files described in Kraken2 manual
- generate multiple files described in Bracken manual
It is possible to run all the previous steps in one single command: first, genomes are QC'ed, and then validated genomes are assigned taxonomically, finally a Kraken/Bracken database is built.
#delete previous example folder
rm -rf ./test
#Run on a queue with 'hugemem' (e.g., farm3)
metagm_build.py /nfs/team162/kv4/bin/list_example_pipeline.txt ./test --QC --KrakenDB --BrackenDB -m 100 -b 2
The command runs:
- Quality control on 13 genomes.
- according to
./merge_final/ValidatedGenomes.txt
, there is 12 genomes that passed QC - according to
./merge_final/FilteredGenomes.txt
, there is 1 genome that failed QC - according to
./merge_final/log.txt
, the genome was filtered because of XYZ
- according to
- taxonomic assignment on validated genomes only, using GTDB taxonomy (default).
- the taxonomic assignment can be found in
./genome_with_gtdb_taxid.txt
- the taxonomic assignment can be found in
- Kraken2 database on XYZ genomes using GTDB taxonomy.
- generate multiple files described in Kraken2 manual
- Bracken files.
- generate multiple files described in Bracken manual
- You can access all the information in the command help
metagm_build.py -h
- Building a Kraken database might require a large memory amount (>200Gb). The function will automatically submit the final
kraken_build
job onhugemem
queue which means users need to run it on farm3 (or farm4/5 when this queue will be available). - If neither
--QC
,--Kraken
,--Bracken
, or--Mash
is provided, themetagm_build.py
is not going to do anything. - The taxonomic assignment step is high in memory requirement (~100Gb).
This section describes the process of classifying metagenomics sequencing reads to get both taxonomic and functional profiles. It also takes care of submitting all the jobs on farm and manage job dependencies.
There is two mandatory positional arguments for this function:
- a
metagenomes
list (one-column text file):- paired-end data(gzipped): only provide path without extension, like
/lustre/scratch118/infgen/team162/kv4/healthy/ERR209654
. The script will automatically seek for/lustre/scratch118/infgen/team162/kv4/healthy/ERR209654_1.fastq.gz
and/lustre/scratch118/infgen/team162/kv4/healthy/ERR209654_2.fastq.gz
. - single-end data: provide whole path to
fastq
files - the script will automatically detect if data are gzipped
- paired-end data(gzipped): only provide path without extension, like
- an
output
folder to store all the results produced by the script
The script metagm_classify.py
offers software options for classification:
-h
: help display-v
: verbose mode for additional details on each step--KrakenDB
: path to a Kraken2 database folder (example:/nfs/pathogen005/team162/Kraken1019
)--BrackenDB
: path to a Kraken2 database folder with Bracken files in it (example:/nfs/pathogen005/team162/Kraken1019
)--BrackenRank
: taxonomic rank for Bracken output (default: S for species) [options: D,P,C,O,F,G,S]--MashDB
: path to a.msh
Mash sketch (example/nfs/pathogen005/team162/RefSeq94n.msh
)--functional
: run a functional characterization of the metagenomes rather than a taxonomic one (default: false)--noMetaPhlan
: skipMetaPhlan2
step in functional analysis (default: false). will only work if these files already exist.-b
: define the number of metagenomes to be analyzed in each batch (default: 10)-t
: define the number of threads used in each job (default: 2)-q
: define to which queue the jobs are submitted (default: long)-m
: define the amount of memory requested for each job (default: 64)--merge
: will automatically merge all the output files from one method in a single table file (default: true)- for
Kraken
, it relies on the/nfs/team162/kv4/Kraken2Table.pl
script to generate two merged tables: a raw count and a normalized abundance - for
Bracken
, it relies on thecombine_bracken_outputs.py
script to generate a single table mixing raw and relative abundance - for
Mash
, it relies on the/nfs/team162/kv4/bin/parse_mash.R
script to report high-confidence hits for each sample - for
functional
analysis, data will always be merged in tables at different levels (gene families, pathways, GO categories, ...)
- for
The output
folder contains a directory for each task that is performed:
output/Kraken
contains the Kraken2 reportoutput/Bracken
contains the Bracken filesoutput/Mash
contains the Mash sketch fileoutput/MetaPhlan2
contains all the output files generated byMetaPhlan2
during the functional analysisoutput/Humann2
contains all the output files generated byHumann2
during the functional analysis
- The
--BrackenDB
flag will automatically trigger--KrakenDB
with the same database, asBracken
requiresKraken
output files to create its output files.
* __current__ Kraken1019: using the Mgnify curated collection (31,555 genomes) from bacteria and archea. The taxon IDs have been assigned using GTDB.
* __current__ Kraken1019_with_nonBact_nonArch: same as Kraken1019, but also contains 131 gut-associated fungi, small eukaryota and other parasites.
* Kraken0419_kraken2_taxo_resolved: build with Kraken2, using the HBC + BGI collections and ~300 public genomes and the taxon IDs have been manually curated using GTDB rather than the default NCBI.
* InternalKraken_OCT2017: HBC + public genomes (@Sam), relied on the old QC for reference genomes
* RefSeq94n.msh: Mash sketch of the entire RefSeq (version 94)
* taxonomy/ : custom taxonomy built from GTDB for Bacteria and Archea, and saved in a NCBI format for Kraken. It also contains few viruses and eukaryotes.
The following examples illustrate various features from the metagm_classify.py
script. Depending how busy farm is, these examples can take some time to run.
metagm_classify.py --BrackenDB /nfs/pathogen005/team162/Kraken1019 /nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt ./test_classify_pipeline -b 2
The command returns:
- in
./test_classify_pipeline/Bracken/*S.bracken
, Bracken files on 6 metagenomes at species level (default).- also contains the merged table with all samples
./test_classify_pipeline/Bracken/merged_S.bracken
- also contains the merged table with all samples
- in
./test_classify_pipeline/Kraken/*.out
, Kraken2 files on 6 metagenomes.- also contains the two merged tables with all samples from Kraken output:
- the raw read counts:
./test_classify_pipeline/Kraken/merged_raw.txt
- the normalized abundance
./test_classify_pipeline/Kraken/merged_norm.txt
- the raw read counts:
- also contains the two merged tables with all samples from Kraken output:
- in
./test_classify_pipeline/Kraken/*bracken.out
, Kraken2 files on 6 metagenomes, without unclassified reads.
Very similar command to the previous example, except we introduce the --BrackenRank
to specify the rank Bracken should use:
metagm_classify.py --BrackenDB /nfs/pathogen005/team162/Kraken1019 /nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt ./test_classify_pipeline_genus -b 2 --BrackenRank G
The command returns:
- in
./test_classify_pipeline_genus/Kraken/*.out
, Kraken2 files on 6 metagenomes.- also contains the two merged tables with all samples from Kraken output: the raw read counts:
./test_classify_pipeline_genus/Kraken/merged_raw.txt
and the normalized abundance./test_classify_pipeline_genus/Kraken/merged_norm.txt
- please note that these files hsould be exactly the same that in
./test_classify_pipeline/Kraken/
as the Kraken step considers all ranks
- also contains the two merged tables with all samples from Kraken output: the raw read counts:
- in
./test_classify_pipeline_genus/Bracken/*G.bracken
, Bracken files on 6 metagenomes at species level (default).- also contains the merged table with all samples
./test_classify_pipeline_genus/Bracken/merged_G.bracken
- also contains the merged table with all samples
For this example, Refseq genomes are detected in metagenome samples. Please note that this exercise is not standard and still experimental, so be careful on the way you use the results. Likely, the detected genomes are from abundant organisms too.
metagm_classify.py --MashDB /nfs/pathogen005/team162/RefSeq94n.msh /nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt ./test_classify_pipeline_mash -b 2
The command returns:
- in
./test_classify_pipeline_mash/Mash/*_screen.tab
, Mash sketch on 6 metagenomes. - in
./test_classify_pipeline_mash/Mash/merged_mash.txt
, merged best hits detected with some confidence in metagenomes
This pipeline feature relies on
metagm_classify.py --functional /nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt ./test_classify_pipeline_functional -b 2
The command returns:
-
In this section, we describe the process used to assign a given genome to the current taxonomy. As mentioned in the section 'metagm_build module', users have the option to rely on either the gtdb taxonomy done with
gtdb-tk classify_wf
function. -
Current taxonomic tree build from GTDB metadata is stored here:
/nfs/pathogen005/team162/taxonomy
-
Non bacterial and non archeal organisms cannot be assigned using gtdb and need to be manually annotated using
/nfs/pathogen005/team162/taxonomy/names.dmp
file
Kraken software relies on taxonomic information presented in the 'NCBI format' (a pair of nodes.dmp
and names.dmp
).
Unfortunately, gtdb does not provide (yet?) its taxonomy in such format. Therefore, we need to build a GTDB-based taxonomy that can be saved in the NCBI format.
- download all the gtdb archeal metadata and bacterial metadata containing all taxonomic paths for every genome found in the database.
- extract unique paths in the metadata files to make the following steps faster
# merge bacterial and archeal metadata
cat ar122_metadata.tsv bac120_metadata.tsv > bac_ar_metadata.tsv
# get unique taxonomic paths
cut -f17 bac_ar_metadata.tsv | sort -u > tmp.txt
# remove last line that contains header
head -n -1 tmp.txt > taxo_path_gtdb_bac120_ar122_unique.txt
#remove temporary file
rm tmp.txt
- create a Taxonomy tree in Python
#import library
import pickle # save binary object, like trees
import sys # system command library
sys.path.append('/nfs/team162/kv4/github/metagm') # add the metagm library to your session
from metagm.phylogeny.TaxonomyTree import TaxonomyTree # class to generate trees
# this class only needs a list of taxonomic paths and will generate a taxonomy in NCBI format
tree = TaxonomyTree('taxo_path_gtdb_bac120_ar122_unique.txt')
# save the tree in NCBI format at the given location
tree.saveNCBIFormat('taxonomy_bac_ar')
# also save the tree in a Pickle format (Python binary) for later
tree.savePickleFormat('/nfs/team162/kv4/bin/gtdb_metadata/taxonomy_bac_ar/taxo.pyc')
If a tree needs to be updated by adding new nodes, it is not necessary to re-build it from scratch. User can provide a Pickle tree and new taxonomic paths:
#import library
import pickle # load binary object, like trees
import csv # read csv files
import sys # system command library
sys.path.append('/nfs/team162/kv4/github/metagm') # add the metagm library to your session
from metagm.phylogeny.TaxonomyTree import TaxonomyTree # class to generate trees
#load existing tree to add nodes
with open('taxonomy_bac_ar/taxo.pyc', "rb") as input_file:
tree = pickle.load(input_file)
#Here we want to add non bacterial and non archeal nodes (fungi, virus and other eukaryotes)
extraNodes = '/nfs/team162/kv4/Kraken1019/nonBacterial_taxopath.txt'
with open(extraNodes) as tsvfile:
reader = csv.reader(tsvfile, delimiter='\t')
for i, row in enumerate(reader):
print(row[0])
# add a node if it is new
tree.add_node(row[0])
# save the updated tree in NCBI format at the given location
tree.saveNCBIFormat('taxonomy')
# also save the tree in a Pickle format (Python binary) for later
tree.savePickleFormat('taxonomy/taxo.pyc')
This section provides some R snippets to do analysis using the metagm_classify output files.
More details are given in the R vignette '.Rmd', also found in this repository.
- This section describes how to log in the lab mySQL database.
- The screenshots refer to the DbVisualizer software, free to install.
- The free version gives you access to read-only features (no query).
- For obvious security reasons, the password is not provided here and needs to be requested to Nick/Hilary.
- open the software and in the left margin
Databases
, right-click forCreate Database Connection
- Select
Use Wizard
, then choose a name for the connection (example:microbiota_db
), then pressNext
- Select
MySQL
as Database driver, then pressNext
- Fill all the informations as shown in the screenshot (password not included), then press
Finish
- Then, user can explore the different tables in the database (example:
isolates
)
All the genome assemblies can be found in /lustre/scratch118/infgen/team162/kv4/working_list_genomes_kraken
.
This section describes the content of quality control applied to genome assemblies in metagm_build module. Here are the different steps with their corresponding default values:
- the total length of the assembly has to be lower than 8Mbp
- the total number of contigs has to be lower 400
- checkM is then run on the assembly if it passed the previous two steps:
- more than 90% completeness is required
- less than 5% contamination is required
If you want to run PanPhlan tool for a specific species and get strain-level resolution, you might want to check if they provide a database for this species here.
If they do, we already have them downloaded in /lustre/scratch118/infgen/team162/ys4/panphlan/panphlan_db
.
If not, you could build your own folowwing this tutorial.
There is a wrapper script for rnammer found in /nfs/team162/kv4/bin/run_rnammer.sh
.
It takes a list of paths to assembly in a text file as well as a output directory to store the results.
Example:
sh /nfs/team162/kv4/bin/run_rnammer.sh /nfs/team162/kv4/bin/test_rnammer.txt ./16S_extract/
This section describes the built-in classes created in the metagm
library. They can be used to improve further Python script development or add features in an efficient manner.
This class creates a genome object from a sequence file and stores various information about it:
genomename
: name given to the genome (default: ``)genomefilepath
: location of the sequence file (default: ``)tmpdir
: where temporary results are stored (default: tmp)qc
: if true, will run quality control on this genome (default: true)seqrecord
:fasta
sequence efficently stored in python session (default: none)taxid
: known taxonomic ID for this genome (default: none)valid
: whether this genome passed quality control (default: none)VALIDATED_GENOMES
: text file containing genomes that passed QC (default: ValidatedGenomes.txt)FILTERED_GENOMES
: text file containing genomes that failed QC (default: FilteredGenomes.txt)LOG
: text file containing reasons genomes failed QC (default: log.txt)MAX_CONTIGS
: maximum number of contigs to pass QC (default: 400)MAX_SIZE
: maximum assembly size to pass QC (default: 8000000)COMPLETENESS_THRESHOLD
: minimum assembly completeness to pass QC (default: 90)CONTAMINATION_THRESHOLD
: maximum assembly contamination to pass QC (default: 5)QUEUE
: LSF queue to be used when dealing with this genome (default: long)NTHREADS
: number of threads to be used when dealing with this genome (default: 8)MEMORY
: memory (Gb) to be used when dealing with this genome (default: 8)ASSEMBLER
: default assembler to be used (default: spades)DEFAULT_ROARY_CUTOFF
: threshold for Roary operation (default: 0.99)annotation
: location of an annotation file for this genome (default: none)seq16s
: existing 16S sequence for this genome (default: none)filehash
: location of hash file (default: none)
#in python3
#load libraries
import sys
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.sequences.BacterialGenome import BacterialGenome
# create a genome using its sequence
g = BacterialGenome(genomefilepath='/lustre/scratch118/infgen/team162/kv4/working_list_genomes_kraken/GCA_900129655.1_IMG-taxon_2695420967_annotated_assembly_genomic.fna', genomename='Bacteroides clarus GCF_900129655', taxid='626929')
# it will run QC on it first (default)
# qc=False can be used in the previous command to avoid running QC
# then, asking if the genome passed QC
g.is_valid()
# True means the genome passed QC
# then, extract the 16S sequence of this genome (rnammer)
g.get_16s_sequence()
g.seq16s # it is a SeqRecord object
len(g.seq16s) # one 16S copy
#get the assocaited 16S fasta sequence
print(g.seq16s["rRNA_FQWK01000014.1_260-1774_DIR+"].format("fasta"))
This class creates a object made of multiple BacterialGenome and makes it easy to serialize analyses. Its components are:
genomelist
: a list of BacterialGenome objects (default: list())taxidlist
: a list of taxids associated with genomes (default: None)taxidfile
: a file where taxids are stored (default: None)genomefilepathlist
: a list of paths where genome assemblies are located (default: {})inputfile
: a text file containing the paths of all genomes (default: None)QUEUE
: LSF queue to be used when dealing with this genome (default: long)NTHREADS
: number of threads to be used when dealing with this genome (default: 4)MEMORY
: memory (Gb) to be used when dealing with this genome (default: 8)TMPDIR
: where temporary results are stored (default: tmp)
#in python3.6
#load libraries
import csv
import sys
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.sequences.BacterialGenome import BacterialGenome
from metagm.sequences.GenomeList import GenomeList
# create list of genomes from an input file
print('Loading list of genomes...')
glist = GenomeList(inputfile='/nfs/team162/kv4/bin/test_list_genomes.txt')
with open(glist.inputfile) as tsvfile:
reader = csv.reader(tsvfile, delimiter='\t')
for i, row in enumerate(reader):
g = BacterialGenome(genomefilepath=row[0], genomename=row[1], qc=False)
# append to existing list
glist.append(g, checkvalid=False)
#get list length -->10
len(glist)
#get name of first genome in list
glist.genomelist[0].genomename
#get path to third genome in list
glist.genomelist[3].genomefilepath
#remove the first genome in the list using its name
glist.remove(genomeidentifier=glist.genomelist[0].genomename)
#get list length --> 9
len(glist)
#get name of first genome in list --> different than previously
glist.genomelist[0].genomename
#get gtdb taxid for a list of genomes (submit 1 job on long queue, might take some time)
glist.list_assign_taxo_gtdb()
# create a 'myTaxids.txt' file with the results and store the results in the list also
glist.taxidfile
# get the list of taxids attached to the genomes
glist.taxidlist
This class stores a metagenome sample: Its components are:
metafastqfile1
: a fastq file (default: None)metafastqfile2
: a fastq file, used only if data are paired (default: None)pairedFlag
: are the data paired-end? (default: False)zippedFlag
: are the data zipped? (default: False)krakenReport
: a path to Kraken2 report (default: None)brackenReport
: a path to Bracken report (default: None)mashReport
: a path to Mash report (default: None)metaphlanReport
: a path to MetaPhlan2 report (default: None)QUEUE
: LSF queue to be used when dealing with this genome (default: long)NTHREADS
: number of threads to be used when dealing with this genome (default: 4)MEMORY
: memory (Gb) to be used when dealing with this genome (default: 8)TMPDIR
: where temporary results are stored (default: tmp)
#in python3.6
#load libraries
import sys
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.sequences.Metagenome import Metagenome
# create a metagenome object from paired fastq
mg = Metagenome(metagenomename='ERR2835563', metafastqfile1='/lustre/scratch118/infgen/pathogen/pathpipe/pathogen_prok_external/seq-pipelines/human/gut_metagenome/TRACKING/235/PS.170.S4.5.metagenomic/SLX/ERX2842313/ERR2835563/ERR2835563_1.fastq.gz',metafastqfile2='/lustre/scratch118/infgen/pathogen/pathpipe/pathogen_prok_external/seq-pipelines/human/gut_metagenome/TRACKING/235/PS.170.S4.5.metagenomic/SLX/ERX2842313/ERR2835563/ERR2835563_2.fastq.gz')
# builder automatically detect if data are paired-end
mg.pairedFlag
# builder automatically detect if data are zipped
mg.zippedFlag
This class creates a object made of multiple Metagenome and makes it easy to serialize analyses. Its components are:
metagenomelist
: a list of Metagenome objects (default: list())metaphlanFlag
: Has MetaPhlan2 already been run on these metagenomes? (default: False)brackenrank
: taxonomic rank for Bracken prediction (default: S for species)metaphlanID
: a list of MetaPhlan2 job IDs to manage Humann2 dependencies (default: None)outputDir
: where output files are stored (default: tmp)brackenOutputList
: list of files created by Bracken. It is used for the merge step (default: list())
#in python3.6
#load libraries
import csv
import sys
import os
import re
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.sequences.Metagenome import Metagenome
from metagm.sequences.MetagenomeList import MetagenomeList
# create list of metagenomes from an input file
mglist = MetagenomeList()
with open('/nfs/team162/kv4/bin/test_list_metagenomes_bangladesh.txt') as tsvfile:
reader = csv.reader(tsvfile, delimiter='\t')
for i, row in enumerate(reader):
# retrieve the paired-end fastq from folder path only
files = [f for f in os.listdir(os.path.dirname(row[0])) if re.match(os.path.basename(row[0])+'_[12].fastq.gz$', f)]
mg=Metagenome(metagenomename=os.path.basename(row[0]),metafastqfile1=os.path.dirname(row[0])+'/'+files[0], metafastqfile2=os.path.dirname(row[0])+'/'+files[1])
# append
mglist.append(mg)
#get list length --> 5
len(mglist)
#classify these samples with Kraken (submit jobs only and return the job IDs)
mglist.classify_kraken('/nfs/pathogen005/team162/Kraken0419_kraken2_taxo_resolved')
#results are stored in './tmp'
This class stores a tree structure using the ETE3 toolkit. Its components are:
tree
: a ETE3 tree object (default: Tree())maxtaxid
: largest taxid used in the tree (default: 1)
#in python3.6
#load libraries
import pickle
import csv
import sys
sys.path.append('/nfs/team162/kv4/github/metagm')
from metagm.phylogeny.TaxonomyTree import TaxonomyTree
# create a tree object from a list of 10 paths
tree = TaxonomyTree('/nfs/team162/kv4/bin/test_list_taxopaths.txt')
tree.maxtaxid
#51 nodes were created
#print the tree in the terminal (ASCII format)
tree.printAscii()
#save tree in NCBI format
tree.saveNCBIFormat('./')
#save tree in Pickle format (can be loaded in any Python session)
tree.savePickleFormat('./taxo.pyc')
This section includes features that are either under-development, or not implemented yet:
- use Sanger lane IDs (e.g., 13470_2#55) instead of absolute paths to assemblies
- improve
metagm_classify.py
data loding step, by usingsplit
command to generate batch files instead of the existing 'for' loop. IT will also rmeove the extra loop currently used for the last batch.