CAMISIM metatranscriptomics

Note that this is still a work in progress.

CAMISIM has been extended by a module for the simulation of metatranscriptomic reads.

As in CAMISIM, user-defined reference genomes are used for the community design, and their abundances are distributed in one of four modes. In addition, gene annotation files are used to distribute the gene abundances, also in one of four modes. The resulting genome and gene abundances per sample are used to calculate gene expression levels. During read simulation, the reads and the associated mapping files are generated with one of three read simulators. In post-processing, gold standard assemblies are created and anonymisation or binning is carried out, depending on user requirements.

UML diagram of the CAMISIM metaT workflow

USAGE:

metatranscriptomic:

>> nextflow run main.nf --pipeline metatranscriptomic

Configuration Files:

All configuration parameters have default values set.

pipelines/metatranscriptomic/config/metatranscriptomic.config:

This configuration file contains the general settings.

outdir

The directory to save all output files to. If the directory does not exist, it will be created.
size

Size of a single sample in Gigabasepairs (Gbp). The actual size, including mapping files, might be larger.
type

Type of the used read simulator.

Choose from "art"/"pbsim3"/"nanosim3"

Note that NanoSim uses the transcript abundances calculated by CAMISIM only as an approximate estimate, so the number of simulated reads will not match these abundances exactly.

For more information, see the Trans-NanoSim paper (https://academic.oup.com/gigascience/article/9/6/giaa061/5855462).
feature_type

Type of feature from the gene annotation file to distribute and simulate reads from.

Only the sequences with the same value in the type attribute of the gene annotation file will be used for simulation.

Defaults are: feature_type = "gene" or feature_type = "mRNA". Check the input annotation file for the correct value.
child_feature_type

The child feature type from the gene annotation file to distribute and simulate reads from.

Only the sequences with the same value in the type attribute of the gene annotation file will be used for simulation.

The child needs to reference the parent in the gene annotation file.

All children of the given feature type are concatenated. Feature sequences extracted from the minus strand are reverse complemented.

For the type "CDS", the phase attribute (8th field) is checked and the corresponding number of bases is removed from this feature.

Defaults are: child_feature_type = "CDS" or child_feature_type = "exon". Check the input annotation file for the correct value.
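
The child handling described above can be sketched as follows. This is a minimal illustration and not the CAMISIM implementation: the names `reverse_complement` and `build_transcript` are hypothetical, coordinates are assumed to be 1-based and inclusive as in GFF3, and the sketch assumes that only the phase of the first child in transcript order is trimmed, which this page does not specify.

```python
# Hypothetical helpers for illustration; not taken from the CAMISIM code base.
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_transcript(genome_seq, children, strand, trim_phase=False):
    """Concatenate child features into one transcript sequence.

    children: list of (start, end, phase) tuples with 1-based inclusive
    coordinates; phase is an integer (0, 1 or 2).
    """
    # Join the children in ascending genomic order.
    ordered = sorted(children, key=lambda child: child[0])
    joined = "".join(genome_seq[start - 1:end] for start, end, _ in ordered)
    if strand == "-":
        # On the minus strand the transcript is the reverse complement,
        # so the last genomic child is the first one in transcript order.
        joined = reverse_complement(joined)
        first_child = ordered[-1]
    else:
        first_child = ordered[0]
    if trim_phase:
        # Assumption: only the phase of the first child is removed.
        joined = joined[first_child[2]:]
    return joined


genome = "AAACCCGGGTTT"
plus_transcript = build_transcript(genome, [(1, 3, 0), (7, 9, 0)], "+")
minus_transcript = build_transcript(genome, [(1, 3, 0), (7, 9, 0)], "-")
```

Sorting by start position before joining keeps the sketch independent of the order in which the children appear in the annotation file.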
number_of_samples

The number of samples to simulate reads for.
seed

An optional seed to get consistent results.

If None is used, a random seed is chosen.
genome_locations_file

A tab-separated file holding a genome ID and the path to the reference fasta file of the genome.

Use either an absolute or a relative path to the reference fasta file.

Note that if a relative path is used, CAMISIM will expect the reference fasta file(s) to be in the project directory.
gene_annotations_file

A tab-separated file containing a genome ID and the path to the corresponding gene annotation file in gff3 file format.
gsa

This parameter specifies whether to do the gold standard assembly per sample.
pooled_gsa

The samples to include in the pooled gold standard assembly (for example "[0,1]").

If the list is empty ("[]") the pooled gsa will be skipped.

If the parameter is set to true, all samples will be considered.
anonymization

This parameter specifies whether the reads should be anonymized.
metadata_file

Path to the input metadata tsv file.

The file has to be in the format: genome_ID\tOTU\tNCBI_ID\tnovelty_category.
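
A quick check of a metadata file against this format could look like the sketch below. It assumes that the first line of the file is the header shown above; the function name `read_metadata` and the example row (including its values) are made up for illustration and are not part of CAMISIM.

```python
import csv
import io

EXPECTED_COLUMNS = ["genome_ID", "OTU", "NCBI_ID", "novelty_category"]


def read_metadata(handle):
    """Return the metadata rows as dicts; raise ValueError on malformed lines."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader)
    if header != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected header: {header}")
    rows = []
    for line_number, fields in enumerate(reader, start=2):
        if not fields:
            continue  # skip blank lines
        if len(fields) != len(EXPECTED_COLUMNS):
            raise ValueError(
                f"line {line_number}: expected {len(EXPECTED_COLUMNS)} columns, "
                f"got {len(fields)}"
            )
        rows.append(dict(zip(EXPECTED_COLUMNS, fields)))
    return rows


# Made-up example content; real files are read with open(path, newline="").
example = io.StringIO(
    "genome_ID\tOTU\tNCBI_ID\tnovelty_category\n"
    "Genome1\t1\t562\tknown_strain\n"
)
metadata_rows = read_metadata(example)
```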

pipelines/metatranscriptomic/config/art.config:

This configuration file configures one of the three read simulators. If the parameter "type" in the metatranscriptomic.config file is set to "art", this file is used.

profile_read_length

The length of reads to be simulated.
fragment_size_mean

The mean size of the fragments to be created.
fragment_size_sd

The standard deviation of the size of the fragments to be created.
base_profile_name

Folder containing error profiles for the read simulators.

pipelines/metatranscriptomic/config/pbsim3.config:

This configuration file configures one of the three read simulators. If the parameter "type" in the metatranscriptomic.config file is set to "pbsim3", this file is used.

For more information on this simulator, see the paper (https://academic.oup.com/nargab/article/4/4/lqac092/6855700) or the GitHub page (https://github.com/yukiteruono/pbsim3).

method

Type of model-based simulation.

Use "qshmm" for model-based simulation with a quality score model or "errhmm" for model-based simulation with an error model.

All quality codes of reads simulated with the error model are "!".
model

Directory containing either the quality score model ("QSHMM-*.model") or the error model ("ERRHMM-*.model").

PBSIM3 provides models on their GitHub page.

The choice of model determines the type of reads.

For PacBio RS II CLR: "ERRHMM-RSII.model"/"QSHMM-RSII.model"

For ONT: "ERRHMM-ONT.model"/"ERRHMM-ONT-HQ.model"/"QSHMM-ONT.model"/"QSHMM-ONT-HQ.model"

For PacBio Sequel CLR (error model only): "ERRHMM-SEQUEL.model"
difference_ratio

The difference (error) ratio to use (substitution:insertion:deletion).

Each value must be 0-1000.

PBSIM3 recommends as default value "6:55:39" for PacBio RS II, "39:24:36" for ONT and "22:45:33" for PacBio Sequel.
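
The constraints on this parameter can be checked before starting a run. The following is a minimal sketch; the function names `parse_difference_ratio` and `as_fractions` are hypothetical and belong to neither CAMISIM nor PBSIM3.

```python
def parse_difference_ratio(value):
    """Parse 'substitution:insertion:deletion' into a tuple of three ints."""
    parts = value.split(":")
    if len(parts) != 3:
        raise ValueError(f"expected three colon-separated values, got {value!r}")
    ratio = tuple(int(part) for part in parts)
    if any(not 0 <= number <= 1000 for number in ratio):
        raise ValueError(f"each value must be between 0 and 1000, got {value!r}")
    return ratio


def as_fractions(ratio):
    """Convert the ratio into fractions that sum to 1."""
    total = sum(ratio)
    if total == 0:
        raise ValueError("at least one value must be greater than 0")
    return tuple(number / total for number in ratio)


rs2_ratio = parse_difference_ratio("6:55:39")
rs2_fractions = as_fractions(rs2_ratio)
```

For "6:55:39" the three values happen to sum to 100, so they can be read directly as percentages of substitutions, insertions and deletions.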
fragments_size_mean

Mean size (bp) of the simulated fragments (the read length depends on the error profile)
read_length
fragment_size_standard_deviation

Standard deviation (bp) of the size of the simulated fragments

pipelines/metatranscriptomic/config/nanosim.config:

This configuration file configures one of the three read simulators. If the parameter "type" in the metatranscriptomic.config file is set to "nanosim3", this file is used.

For more information on this simulator, see the paper (https://academic.oup.com/gigascience/article/9/6/giaa061/5855462?login=true) or the GitHub page (https://github.com/bcgsc/NanoSim).

Please note that the reads simulated with nanosim are not yet reproducible. For more information see: https://github.com/bcgsc/NanoSim/issues/163 .

base_profile_name

The profile to use.

Trans-Nanosim provides profiles on their GitHub page.
read_length

The default value is: 1125
basecaller

The basecaller to use for the read simulation in fastq format.

It has to be one of: {albacore,guppy,guppy-flipflop}

pipelines/metatranscriptomic/config/distribution.config:

This configuration file configures the creation of the abundance distribution.

mode

Mode for changing the gene abundances in different samples

Has to be one of differential/timeseries/replicate
mu

Mean of the used log-normal distribution
sigma

Standard deviation of the used log-normal distribution
gauss_mu

Mean of the used normal distribution

Relevant for the "timeseries" mode
gauss_sigma

Standard deviation of the used normal distribution

Relevant for the "timeseries" mode
gene_sigma

Relevant for the "replicate" mode
genome_distribution_files

The paths to the genome distribution files to use for the read simulation.

Use wildcard or a list of paths to the distribution files ([path1, path2]).

If this parameter is empty, CAMISIM will calculate new distributions.
genome_mode

Mode for changing the abundances in different samples

Has to be one of replicates/timeseries_lognormal/timeseries_normal/differential
genome_log_mu

Mean of the used log-normal distribution

1 is an empirically good mean
genome_log_sigma

Standard deviation of the used log-normal distribution

2 is an empirically good sd
genome_gauss_mu

Mean of the used normal distribution
genome_gauss_sigma

Standard deviation of the used normal distribution
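
To give an intuition for the log-normal parameters, the sketch below draws one abundance per genome from a log-normal distribution and normalises the values to relative abundances. It only illustrates the idea and is not CAMISIM's sampling code; the function name `lognormal_abundances` and the genome IDs are hypothetical, and the defaults use the values named above as empirically good.

```python
import random


def lognormal_abundances(genome_ids, mu=1.0, sigma=2.0, seed=None):
    """Draw log-normal values per genome and normalise them to sum to 1."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    raw = {genome_id: rng.lognormvariate(mu, sigma) for genome_id in genome_ids}
    total = sum(raw.values())
    return {genome_id: value / total for genome_id, value in raw.items()}


abundances = lognormal_abundances(["Genome1", "Genome2", "Genome3"], seed=42)
for genome_id, abundance in abundances.items():
    print(f"{genome_id}\t{abundance:.4f}")
```

A larger sigma spreads the drawn values further apart, so a few genomes dominate the sample; a smaller sigma yields a more even community.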

pipelines/shared/config/conda.config:

conda.enabled

This parameter enables conda to install and configure all needed software packages.
conda.useMamba

This parameter lets CAMISIM use mamba instead of conda to install and configure all needed software packages.

This is recommended for performance reasons.
conda.cacheDir

This parameter can be used to define a custom conda cache.

If none is defined, nextflow will create it in the work directory.

Note that the directory has to exist before running the pipeline.

Output:

{out}/distributions/genome_distributions/distribution_{i}.txt

'{i}' is the index for each sample that is to be generated.

Column 1: genome_ID
Column 2: abundance

'genome_ID' is the identifier of the genomes used.
'abundance' is the relative abundance of a genome to be simulated. It reflects the number of genome copies, not the amount of genetic data of a genome.
In a set of two genomes that both have an abundance of 0.5, where one genome is twice the size of the other, the bigger genome will make up about 67% (two thirds) of the genetic data in the simulated metagenome.
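
The arithmetic behind this example can be reproduced with a few lines. The function name `genetic_data_share`, the genome labels and the genome sizes are made up for illustration.

```python
def genetic_data_share(abundances, genome_sizes):
    """Weight each genome's abundance by its size and normalise to 1."""
    weighted = {
        genome_id: abundances[genome_id] * genome_sizes[genome_id]
        for genome_id in abundances
    }
    total = sum(weighted.values())
    return {genome_id: weight / total for genome_id, weight in weighted.items()}


# Two genomes with equal abundance; one is twice as large as the other.
shares = genetic_data_share(
    {"small": 0.5, "large": 0.5},
    {"small": 1_000_000, "large": 2_000_000},
)
print(f"large: {shares['large']:.3f}, small: {shares['small']:.3f}")
```

The larger genome contributes 0.5 * 2 / (0.5 * 2 + 0.5 * 1) = 2/3 of the bases, although both genomes are present in equal copy numbers.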

{out}/distributions/gene_distributions/distribution_{genome_ID}sample{i}.txt

'{i}' is the index for each sample that is to be generated.

Column 1: gene_ID
Column 2: abundance

'gene_ID' is the identifier of the selected parent feature type from the gene annotation file.
'abundance' is the relative abundance of a gene in that sample's genome.

{out}/distributions/final_distributions/{genome_ID}__{i}_final_distribution.txt

'{i}' is the index for each sample that is to be generated.

Column 1: gene_ID
Column 2: expression_value

'gene_ID' is the identifier of the selected parent feature type from the gene annotation file.
'expression_value' is the expression value of that transcript in the given sample.

{out}/internal/genome_to_id.tsv

This file is present if the metagenome is simulated from a profile. It lists the genomes and the paths to their copies in the 'genomes' folder of the output directory.

Column 1: genome_ID
Column 2: file path

{out}/sample_{i}/bam/sample{i}_{genome_id}.bam

The bam files created from the reads generated by the read simulator.

{out}/sample_{i}/reads/sample{i}*{genome_id}**.fq

The simulated reads.

{out}/sample_{i}/reads/sample{i}_*.fq

A file containing all reads of this sample.

{out}/sample_{i}/reads/anonymous_reads.fq

If anonymization is done, this will be the only fastq file.

{out}/sample_{i}/reads/reads_mapping.tsv

Mapping of reads for evaluation

Column 1: anonymous read id
Column 2: genome id
Column 3: taxonomic id
Column 4: read id
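
For evaluation, the mapping can be loaded and summarised per genome. The sketch below assumes the four tab-separated columns listed above and skips lines starting with "#", on the assumption that a header line, if present, is marked this way. The function name `count_reads_per_genome` and the example IDs are hypothetical.

```python
import csv
import io
from collections import Counter


def count_reads_per_genome(handle):
    """Count the reads assigned to each genome in a reads_mapping.tsv file."""
    counts = Counter()
    for fields in csv.reader(handle, delimiter="\t"):
        # Skip blank, incomplete and header/comment lines.
        if len(fields) < 4 or fields[0].startswith("#"):
            continue
        anonymous_read_id, genome_id, taxonomic_id, read_id = fields[:4]
        counts[genome_id] += 1
    return counts


# Made-up example content; real files are read with open(path, newline="").
example = io.StringIO(
    "#anonymous_read_id\tgenome_id\ttax_id\tread_id\n"
    "S0R0\tGenome1\t562\tGenome1-1\n"
    "S0R1\tGenome2\t1280\tGenome2-1\n"
    "S0R2\tGenome1\t562\tGenome1-2\n"
)
reads_per_genome = count_reads_per_genome(example)
```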

{out}/sample_{i}/contigs/anonymous_gsa.fasta

Fasta file with perfect assembly of reads of this sample

{out}/sample_{i}/contigs/gsa_mapping.tsv

Mapping of contigs for evaluation

Column 1: anonymous contig id
Column 2: genome id
Column 3: taxonomic id
Column 4: sequence id of the original genome (in 'source_genomes' folder)
Column 5: number of reads used in the contig
Column 6: start position
Column 7: end position

{out}/anonymous_gsa_pooled.fasta

Fasta file with perfect assembly of reads from all samples

{out}/gsa_pooled_mapping.tsv

Mapping of contigs from pooled reads for evaluation.

Column 1: anonymous_contig_id
Column 2: genome id
Column 3: taxonomic id
Column 4: sequence id of the original genome (in 'source_genomes' folder)
Column 5: number of reads used in the contig
Column 6: start position
Column 7: end position

{out}/seed/seed*.txt

The seeds used for the simulation. To reproduce the results of a simulation, use the used_initial_seed from {out}/seed/seed.txt as the seed in the next run. The seed can be configured in the nextflow.config file.
