CAMISIM metatranscriptomics
Note that this is still a work in progress.
CAMISIM has been extended with a module for the simulation of metatranscriptomic reads.
As in CAMISIM, user-defined reference genomes are used for the community design, and the genomes are distributed in one of four modes. In addition, gene annotation files are used to distribute the genes, likewise in one of four modes. The resulting genome and gene abundances per sample are used to calculate gene expression levels. During read simulation, the reads and the associated mapping files are generated with one of three read simulators. In post-processing, gold standard assemblies are created and anonymisation or binning is carried out, depending on user requirements.
UML diagram of the CAMISIM metaT workflow
metatranscriptomic:
>> nextflow run main.nf --pipeline metatranscriptomic
All configuration parameters have default values set.
This configuration file (metatranscriptomic.config) holds the general settings.
-
outdir
The directory to save all output files to. If the directory does not exist, it will be created.
-
size
Size of a single sample in Gigabasepairs (Gbp). The actual size, including mapping files, might be larger.
-
type
The read simulator to use.
Choose from "art"/"pbsim3"/"nanosim3".
Note that NanoSim treats the transcript abundances calculated by CAMISIM as an approximation, so the simulated output will not contain exactly the corresponding number of reads.
For more information, see the Trans-NanoSim paper (https://academic.oup.com/gigascience/article/9/6/giaa061/5855462).
-
feature_type
Type of feature from the gene annotation file to distribute and simulate reads from.
Only the sequences with the same value in the type attribute of the gene annotation file will be used for simulation.
Defaults are: feature_type = "gene" or feature_type = "mRNA". Check the input annotation file for the correct value.
-
child_feature_type
The child feature type from the gene annotation file to distribute and simulate reads from.
Only the sequences with the same value in the type attribute of the gene annotation file will be used for simulation.
The child needs to reference the parent in the gene annotation file.
All children of the given feature type are concatenated. Feature sequences extracted from the minus strand are reverse complemented.
For type "CDS", the phase (8th field of the annotation file) is checked and the corresponding number of bases is removed from the feature.
Defaults are: child_feature_type = "CDS" or child_feature_type = "exon". Check the input annotation file for the correct value.
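The following is a minimal Python sketch of the idea behind feature_type and child_feature_type: children that reference a parent are concatenated, and the result is reverse complemented for the minus strand. It is an illustration only, not the CAMISIM implementation. The function names (build_transcripts, parse_attributes, reverse_complement) are hypothetical, the genome is assumed to be an in-memory dict of sequences, and phase trimming for "CDS" features is deliberately left out.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def parse_attributes(field):
    """Turn the 9th GFF3 column (e.g. 'ID=x;Parent=y') into a dict."""
    pairs = (item.split("=", 1) for item in field.strip().split(";") if "=" in item)
    return {key: value for key, value in pairs}


def build_transcripts(gff_lines, genome, feature_type="mRNA", child_feature_type="CDS"):
    """Concatenate child features per parent (hypothetical helper).

    genome maps sequence IDs to sequences. Returns {parent_id: sequence}.
    Phase handling is omitted in this sketch.
    """
    parents = {}                  # parent ID -> (sequence ID, strand)
    children = defaultdict(list)  # parent ID -> [(start, end), ...]
    for line in gff_lines:
        if not line.strip() or line.startswith("#"):
            continue
        columns = line.rstrip("\n").split("\t")
        if len(columns) != 9:
            continue  # skip malformed lines
        seqid, _source, ftype, start, end, _score, strand, _phase, attrs = columns
        attributes = parse_attributes(attrs)
        if ftype == feature_type and "ID" in attributes:
            parents[attributes["ID"]] = (seqid, strand)
        elif ftype == child_feature_type:
            # A child may list several parents, separated by commas.
            for parent_id in filter(None, attributes.get("Parent", "").split(",")):
                children[parent_id].append((int(start), int(end)))

    transcripts = {}
    for parent_id, (seqid, strand) in parents.items():
        parts = sorted(children.get(parent_id, []))
        if not parts:
            continue
        # GFF3 coordinates are 1-based and inclusive.
        sequence = "".join(genome[seqid][start - 1:end] for start, end in parts)
        if strand == "-":
            sequence = reverse_complement(sequence)
        transcripts[parent_id] = sequence
    return transcripts
```

Children are joined in ascending genomic order and the joined sequence is reverse complemented once for the minus strand, which gives the same result as reverse complementing each child and joining them in descending order.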
-
number_of_samples
The number of samples to simulate reads for.
-
seed
An optional seed to get consistent results.
If None is used, a random seed is chosen.
-
genome_locations_file
A tab-separated file holding a genome ID and the path to the reference fasta file of that genome.
The path to the reference fasta file can be absolute or relative.
Note that if a relative path is used, CAMISIM will expect the reference fasta file(s) to be in the project directory.
-
gene_annotations_file
A tab-separated file containing a genome ID and the path to the corresponding gene annotation file in gff3 file format.
-
gsa
This parameter specifies whether to do the gold standard assembly per sample.
-
pooled_gsa
The samples to include in the pooled gold standard assembly, given as a list (for example "[0,1]").
If the list is empty ("[]") the pooled gsa will be skipped.
If the parameter is set to true, all samples will be considered.
-
anonymization
This parameter specifies whether the reads should be anonymized.
-
metadata_file
Path to the input metadata tsv file.
The file has to be in the format: genome_ID\tOTU\tNCBI_ID\tnovelty_category.
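As a rough illustration of the metadata format, the Python sketch below reads such a tsv file and checks that every row has the four expected columns. It assumes that the first line is a header row with exactly these column names, which is not stated above. The function name read_metadata is hypothetical and not part of CAMISIM.

```python
import csv

EXPECTED_COLUMNS = ["genome_ID", "OTU", "NCBI_ID", "novelty_category"]


def read_metadata(handle):
    """Read a metadata tsv file and return a list of row dicts.

    Assumption: the first line is a header with the four column names.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {header!r}")
    rows = []
    for line_number, row in enumerate(reader, start=2):
        if not row:
            continue  # tolerate empty lines
        if len(row) != len(EXPECTED_COLUMNS):
            raise ValueError(f"line {line_number}: expected 4 columns, got {len(row)}")
        rows.append(dict(zip(EXPECTED_COLUMNS, row)))
    return rows
```

The function accepts any open text handle, for example open("metadata.tsv", newline=""), where the file name is a placeholder.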
This configuration file configures ART, one of the three read simulators. It is used if the parameter "type" in the metatranscriptomic.config file is set to "art".
-
profile_read_length
The length of reads to be simulated.
-
fragment_size_mean
The mean size of the fragments to be created.
-
fragment_size_sd
The standard deviation of fragments to be created.
-
base_profile_name
Folder containing error profiles for the read simulators.
This configuration file configures PBSIM3, one of the three read simulators. It is used if the parameter "type" in the metatranscriptomic.config file is set to "pbsim3".
For more information on this simulator, see the paper (https://academic.oup.com/nargab/article/4/4/lqac092/6855700) or the GitHub page (https://github.com/yukiteruono/pbsim3).
-
method
Type of model-based simulation.
Use "qshmm" for model-based simulation with a quality score model or "errhmm" for model-based simulation with an error model.
All quality codes of reads simulated with the error model are "!".
-
model
Directory containing either the quality score model ("QSHMM-.model") or the error model ("ERRHMM-.model").
PBSIM3 provides models on their GitHub page.
The choice of model determines the type of reads.
For PacBio RS II CLR: "ERRHMM-RSII.model"/"QSHMM-RSII.model"
For ONT: "ERRHMM-ONT.model"/"ERRHMM-ONT-HQ.model"/"QSHMM-ONT.model"/"QSHMM-ONT-HQ.model"
For PacBio Sequel CLR (only error model): "ERRHMM-SEQUEL.model"
-
difference_ratio
The difference (error) ratio to use (substitution:insertion:deletion).
Each value must be 0-1000.
PBSIM3 recommends as default value "6:55:39" for PacBio RS II, "39:24:36" for ONT and "22:45:33" for PacBio Sequel.
-
fragments_size_mean
Mean size (bp) of the simulated fragments (the read length depends on the error profile).
-
read_length
-
fragment_size_standard_deviation
Standard deviation (bp) of the simulated fragments.
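To make the difference_ratio format described above more concrete, here is a small Python sketch that checks a "substitution:insertion:deletion" string against the stated 0-1000 range and converts it into fractions. The helper parse_difference_ratio is hypothetical. It is not part of CAMISIM or PBSIM3, and the normalisation to fractions is only meant to show how the three values relate to each other.

```python
def parse_difference_ratio(text):
    """Validate a 'substitution:insertion:deletion' string (hypothetical helper).

    Returns the share of each error type as a fraction of all errors.
    """
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError("expected three values separated by ':'")
    values = [int(part) for part in parts]
    if any(value < 0 or value > 1000 for value in values):
        raise ValueError("each value must be between 0 and 1000")
    total = sum(values)
    if total == 0:
        raise ValueError("at least one value must be greater than 0")
    names = ("substitution", "insertion", "deletion")
    return {name: value / total for name, value in zip(names, values)}
```

For the PacBio RS II ratio "6:55:39", the three values sum to 100, so insertions make up 55% of the simulated errors.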
This configuration file configures NanoSim, one of the three read simulators. It is used if the parameter "type" in the metatranscriptomic.config file is set to "nanosim3".
For more information on this simulator, see the paper (https://academic.oup.com/gigascience/article/9/6/giaa061/5855462?login=true) or the GitHub page (https://github.com/bcgsc/NanoSim).
Please note that the reads simulated with nanosim are not yet reproducible. For more information see: https://github.com/bcgsc/NanoSim/issues/163 .
-
base_profile_name
The profile to use.
Trans-NanoSim provides profiles on their GitHub page.
-
read_length
The default value is: 1125
-
basecaller
The basecaller to use for the read simulation in fastq format.
It has to be one of: albacore, guppy, guppy-flipflop.
This configuration file configures the creation of the abundance distribution.
-
mode
Mode for changing the gene abundances in different samples
Has to be one of differential/timeseries/replicate
-
mu
Mean of the used log-normal distribution
-
sigma
Standard deviation of the used log-normal distribution
-
gauss_mu
Mean of the used normal distribution
Relevant for the "timeseries" mode
-
gauss_sigma
Standard deviation of the used normal distribution
Relevant for the "timeseries" mode
-
gene_sigma
Relevant for the "replicate" mode
-
genome_distribution_files
The paths to the genome distribution files to use for the read simulation.
Use a wildcard or a list of paths to the distribution files ([path1, path2]).
If this parameter is empty, CAMISIM will calculate new distributions.
-
genome_mode
Mode for changing the abundances in different samples
Has to be one of replicates/timeseries_lognormal/timeseries_normal/differential
-
genome_log_mu
Mean of the used log-normal distribution
1 is an empirically good mean
-
genome_log_sigma
Standard deviation of the used log-normal distribution
2 is an empirically good sd
-
genome_gauss_mu
Mean of the used normal distribution
-
genome_gauss_sigma
Standard deviation of the used normal distribution
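The sketch below illustrates in Python what drawing genome abundances from a log-normal distribution with genome_log_mu and genome_log_sigma can look like. It is an assumption-laden illustration and not the CAMISIM code: the function name draw_relative_abundances is hypothetical, and the normalisation so that the abundances sum to 1 is an assumption made for the example.

```python
import random


def draw_relative_abundances(genome_ids, log_mu=1.0, log_sigma=2.0, seed=None):
    """Draw one log-normal value per genome and normalise to relative abundances.

    Hypothetical helper. The defaults mirror the empirically good values
    mentioned for genome_log_mu (1) and genome_log_sigma (2).
    """
    rng = random.Random(seed)  # a fixed seed gives reproducible draws
    raw = {genome_id: rng.lognormvariate(log_mu, log_sigma) for genome_id in genome_ids}
    total = sum(raw.values())
    return {genome_id: value / total for genome_id, value in raw.items()}
```

With a sigma of 2 the draws are strongly skewed, so a few genomes typically dominate the sample while most remain rare.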
-
conda.enabled
This parameter enables conda to install and configure all needed software packages.
-
conda.useMamba
This parameter lets CAMISIM use mamba instead of conda to install and configure all needed software packages.
This is recommended for performance reasons.
-
conda.cacheDir
This parameter can be used to define a custom conda cache.
If none is defined, nextflow will create it in the work directory.
Note that the directory has to exist before running the pipeline.
'{i}' is the index for each sample that is to be generated.
- Column 1: genome_ID
- Column 2: abundance
'genome_ID' is the identifier of the genomes used.
'abundance' is the relative abundance of a genome to be simulated. It reflects the number of genome copies, not the amount of genetic data contributed by a genome.
For example, in a set of two genomes that both have an abundance of 0.5, where one genome is twice the size of the other, the larger genome accounts for about 67% (two thirds) of the genetic data in the simulated metagenome.
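The relationship between abundance and the amount of genetic data can be checked with a few lines of Python. This is only a worked example of the arithmetic. The function name genetic_data_share and the genome sizes are made up for the illustration.

```python
def genetic_data_share(abundances, genome_sizes):
    """Return each genome's share of the genetic data.

    The share is proportional to abundance times genome size.
    """
    weighted = {genome_id: abundances[genome_id] * genome_sizes[genome_id]
                for genome_id in abundances}
    total = sum(weighted.values())
    return {genome_id: weight / total for genome_id, weight in weighted.items()}


# Two genomes with equal abundance, one twice as large as the other.
shares = genetic_data_share(
    {"small": 0.5, "large": 0.5},
    {"small": 2_000_000, "large": 4_000_000},
)
```

Here the larger genome contributes 4,000,000 of 6,000,000 weighted base pairs, which is two thirds of the genetic data.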
'{i}' is the index for each sample that is to be generated.
- Column 1: gene_ID
- Column 2: abundance
'gene_ID' is the identifier of the selected parent feature type in the gene annotation file.
'abundance' is the relative abundance of a gene in that sample's genome.
'{i}' is the index for each sample that is to be generated.
- Column 1: gene_ID
- Column 2: expression_value
'gene_ID' is the identifier of the selected parent feature type in the gene annotation file.
'expression_value' is the expression value of that transcript in the given sample.
This file is present if the metagenome is simulated from a profile. It lists the paths to the genome copies in the 'genomes' folder of the output directory.
- Column 1: genome_ID
- Column 2: file path
The BAM files generated from the reads produced by the read simulator.
The simulated reads.
A file containing all reads of this sample.
If anonymization is done, this will be the only fastq file.
Mapping of reads for evaluation
- Column 1: anonymous read id
- Column 2: genome id
- Column 3: taxonomic id
- Column 4: read id
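As an example of how the read mapping can be used for evaluation, the Python sketch below counts the reads per genome from the four columns listed above. It assumes a tab-separated file and skips lines that start with "#", on the assumption that a header line may look like that. Neither detail is stated above, and the function name reads_per_genome is hypothetical.

```python
import csv
from collections import Counter


def reads_per_genome(handle):
    """Count reads per genome ID in a read mapping file (hypothetical helper).

    Assumptions: tab-separated columns in the order anonymous read id,
    genome id, taxonomic id, read id. Lines starting with '#' are skipped.
    """
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue
        if len(row) < 4:
            raise ValueError(f"expected 4 columns, got {len(row)}: {row!r}")
        _anonymous_read_id, genome_id, _taxonomic_id, _read_id = row[:4]
        counts[genome_id] += 1
    return counts
```

Comparing these counts with the expected abundances gives a quick sanity check of a simulated sample.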
Fasta file with perfect assembly of reads of this sample
Mapping of contigs for evaluation
- Column 1: anonymous contig id
- Column 2: genome id
- Column 3: taxonomic id
- Column 4: sequence id of the original genome (in 'source_genomes' folder)
- Column 5: number of reads used in the contig
- Column 6: start position
- Column 7: end position
Fasta file with perfect assembly of reads from all samples
Mapping of contigs from pooled reads for evaluation.
- Column 1: anonymous_contig_id
- Column 2: genome id
- Column 3: taxonomic id
- Column 4: sequence id of the original genome (in 'source_genomes' folder)
- Column 5: number of reads used in the contig
- Column 6: start position
- Column 7: end position
The seeds used for the simulation. If the results of the simulation need to be reproduced, use the used_initial_seed from the {out}/seed/seed.txt in the next run. The seed can be configured in the nextflow.config file.