CAMISIM metatranscriptomics

joyceFunk123 edited this page Aug 9, 2024 · 1 revision

Note that this is still a work in progress.

CAMISIM has been extended by a module for the simulation of metatranscriptomic reads.

As in CAMISIM, user-defined reference genomes are used to design a community, with the genome abundances distributed in one of four modes. In addition, gene annotation files are used to distribute gene abundances, likewise in one of four modes. The resulting genome and gene abundances per sample are used to calculate gene expression levels. During read simulation, the reads and the associated mapping files are generated with one of three read simulators. In post-processing, gold standard assemblies and anonymisation or binning are carried out, depending on user requirements.

UML diagram of the CAMISIM metaT workflow

USAGE:

metatranscriptomic:

>> nextflow run main.nf --pipeline metatranscriptomic

Configuration Files:

All configuration parameters have default values set.

pipelines/metatranscriptomic/config/metatranscriptomic.config:

This configuration file allows you to make some general settings.

  • outdir

    The directory to save all output files to. If the directory does not exist, it will be created.

  • size

    Size of a single sample in Gigabasepairs (Gbp). The actual size, including mapping files, might be larger.

  • type

    Type of the used read simulator.

    Choose from "art"/"pbsim3"/"nanosim3"

    Note that NanoSim uses the transcript abundances calculated by CAMISIM only as an approximate estimate, so the number of simulated reads will not match them exactly.

    For more information, see the Trans-NanoSim paper (https://academic.oup.com/gigascience/article/9/6/giaa061/5855462).

  • feature_type

    Type of feature from the gene annotation file to distribute and simulate reads from.

    Only the sequences with the same value in the type attribute of the gene annotation file will be used for simulation.

    Defaults are: feature_type = "gene" or feature_type = "mRNA". Check the input annotation file for the correct value.

  • child_feature_type

    The child feature type from the gene annotation file to distribute and simulate reads from.

    Only the sequences with the same value in the type attribute of the gene annotation file will be used for simulation.

    The child needs to reference the parent in the gene annotation file.

    All children of the given feature type are concatenated. Feature sequences extracted from the minus strand are reverse complemented.

    For the type "CDS", the phase (8th field of the gene annotation file) is checked and the corresponding number of bases is removed from the feature.

    Defaults are: child_feature_type = "CDS" or child_feature_type = "exon". Check the input annotation file for the correct value.

  • number_of_samples

    The number of samples to simulate reads for.

  • seed

    An optional seed to get consistent results.

    If None is used, a random seed is chosen.

  • genome_locations_file

    A tab-separated file holding a genome ID and the path to the reference fasta file of that genome.

    Use either an absolute or a relative path to the reference fasta file.

    Note that if a relative path is used, CAMISIM will expect the reference fasta file(s) to be in the project directory.

  • gene_annotations_file

    A tab-separated file containing a genome ID and the path to the corresponding gene annotation file in gff3 file format.

  • gsa

    This parameter specifies whether to do the gold standard assembly per sample.

  • pooled_gsa

    The samples to pool for the gold standard assembly, given as a list of sample indices (for example "[0,1]").

    If the list is empty ("[]") the pooled gsa will be skipped.

    If the parameter is set to true, all samples will be considered.

  • anonymization

    This parameter specifies whether the reads should be anonymized.

  • metadata_file

    Path to the input metadata tsv file.

    The file has to be in the format: genome_ID\tOTU\tNCBI_ID\tnovelty_category.
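
The child_feature_type description above (children of one parent are joined, minus-strand sequences are reverse complemented) can be illustrated with a minimal Python sketch. The function names, the toy contig and the coordinates are hypothetical and not taken from CAMISIM's code. Phase trimming for "CDS" features is left out because the exact rule is not spelled out here.

```python
# Hypothetical sketch, not CAMISIM's implementation.
COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(COMPLEMENT)[::-1]


def build_transcript(contig, children, strand):
    """Join the child features of one parent into a single sequence.

    contig   -- reference sequence as a string
    children -- list of (start, end) tuples, 1-based and inclusive as in GFF3
    strand   -- "+" or "-"
    """
    # Concatenate the children in ascending genomic order.
    joined = "".join(contig[start - 1:end] for start, end in sorted(children))
    # A feature on the minus strand is read from the opposite strand.
    return reverse_complement(joined) if strand == "-" else joined


# Toy data, invented for illustration.
contig = "AAACCCGGGTTT"
exons = [(7, 9), (1, 3)]
plus_transcript = build_transcript(contig, exons, "+")
minus_transcript = build_transcript(contig, exons, "-")
print(plus_transcript, minus_transcript)
```

Joining first and reverse complementing once afterwards gives the same result as reverse complementing each child and joining them in descending order.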

pipelines/metatranscriptomic/config/art.config:

This configuration file configures one of the three read simulators. If the parameter "type" in the metatranscriptomic.config file is set to "art", this file is used.

  • profile_read_length

    The length of reads to be simulated.

  • fragment_size_mean

    The mean size of the fragments to be created.

  • fragment_size_sd

    The standard deviation of fragments to be created.

  • base_profile_name

    Folder containing error profiles for the read simulators.

pipelines/metatranscriptomic/config/pbsim3.config:

This configuration file configures one of the three read simulators. If the parameter "type" in the metatranscriptomic.config file is set to "pbsim3", this file is used.

For more information on this simulator, see the paper (https://academic.oup.com/nargab/article/4/4/lqac092/6855700) or the GitHub page (https://github.com/yukiteruono/pbsim3).

  • method

    Type of model-based simulation.

    Use "qshmm" for model-based simulation with a quality score model or "errhmm" for model-based simulation with an error model.

    Reads simulated with the error model all have the quality code "!".

  • model

    Directory containing either the quality score model ("QSHMM-*.model") or the error model ("ERRHMM-*.model").

    PBSIM3 provides models on their GitHub page.

    The choice of model determines the type of reads.

    For PacBio RS II CLR: "ERRHMM-RSII.model"/"QSHMM-RSII.model"

    For Oxford Nanopore (ONT): "ERRHMM-ONT.model"/"ERRHMM-ONT-HQ.model"/"QSHMM-ONT.model"/"QSHMM-ONT-HQ.model"

    For PacBio Sequel CLR (only error model): "ERRHMM-SEQUEL.model"

  • difference_ratio

    The difference (error) ratio to use (substitution:insertion:deletion).

    Each value must be 0-1000.

    PBSIM3 recommends as default value "6:55:39" for PacBio RS II, "39:24:36" for ONT and "22:45:33" for PacBio Sequel.

  • fragments_size_mean

    Mean size (bp) of the simulated fragments (the read length depends on the error profile).

  • read_length

  • fragment_size_standard_deviation

    Standard deviation (bp) of the simulated fragments.
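
To make the difference_ratio format more concrete, the following Python sketch checks a "substitution:insertion:deletion" string against the 0-1000 range stated above and converts it into fractions. The function name is hypothetical, and reading the three values as relative proportions is an assumption based on the word "ratio", not a documented part of this pipeline.

```python
# Hypothetical helper, not part of CAMISIM or PBSIM3.
def parse_difference_ratio(text):
    """Parse 'substitution:insertion:deletion' into fractions that sum to 1."""
    parts = text.split(":")
    if len(parts) != 3:
        raise ValueError("expected three colon-separated values")
    values = [int(part) for part in parts]
    # Each value must lie in the documented range 0-1000.
    if any(value < 0 or value > 1000 for value in values):
        raise ValueError("each value must be between 0 and 1000")
    total = sum(values)
    if total == 0:
        raise ValueError("at least one value must be greater than 0")
    names = ("substitution", "insertion", "deletion")
    # Assumption: the values are relative proportions of the error types.
    return {name: value / total for name, value in zip(names, values)}


rs2 = parse_difference_ratio("6:55:39")
print(rs2)
```

With the value recommended for PacBio RS II, insertions account for a little more than half of all errors under this reading.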

pipelines/metatranscriptomic/config/nanosim.config:

This configuration file configures one of the three read simulators. If the parameter "type" in the metatranscriptomic.config file is set to "nanosim3", this file is used.

For more information on this simulator, see the paper (https://academic.oup.com/gigascience/article/9/6/giaa061/5855462?login=true) or the GitHub page (https://github.com/bcgsc/NanoSim).

Please note that the reads simulated with NanoSim are not yet reproducible. For more information, see https://github.com/bcgsc/NanoSim/issues/163.

  • base_profile_name

    The profile to use.

    Trans-Nanosim provides profiles on their GitHub page.

  • read_length

    The default value is: 1125

  • basecaller

    The basecaller to use for the read simulation in fastq format.

    It has to be one of: albacore, guppy, guppy-flipflop

pipelines/metatranscriptomic/config/distribution.config:

This configuration file configures the creation of the abundance distribution.

  • mode

    Mode for changing the gene abundances in different samples

    Has to be one of differential/timeseries/replicate

  • mu

    Mean of the used log-normal distribution

  • sigma

    Standard deviation of the used log-normal distribution

  • gauss_mu

    Mean of the used normal distribution

    Relevant for the "timeseries" mode

  • gauss_sigma

    Standard deviation of the used normal distribution

    Relevant for the "timeseries" mode

  • gene_sigma

    Relevant for the "replicate" mode

  • genome_distribution_files

    The paths to the genome distribution files to use for the read simulation.

    Use a wildcard or a list of paths to the distribution files ([path1, path2]).

    If this parameter is empty, CAMISIM will calculate new distributions.

  • genome_mode

    Mode for changing the abundances in different samples

    Has to be one of replicates/timeseries_lognormal/timeseries_normal/differential

  • genome_log_mu

    Mean of the used log-normal distribution

    1 is an empirically good mean

  • genome_log_sigma

    Standard deviation of the used log-normal distribution

    2 is an empirically good sd

  • genome_gauss_mu

    Mean of the used normal distribution

  • genome_gauss_sigma

    Standard deviation of the used normal distribution
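
As a rough illustration of what genome_log_mu and genome_log_sigma control, the Python sketch below draws one log-normal value per genome and rescales the values to relative abundances. The function name, the genome IDs and the rescaling step are assumptions made for this example; the way CAMISIM itself derives the distributions for the different modes is not shown here.

```python
import random


# Hypothetical sketch, not CAMISIM's implementation.
def draw_relative_abundances(genome_ids, mu=1.0, sigma=2.0, seed=None):
    """Draw one log-normal value per genome and rescale to sum to 1."""
    rng = random.Random(seed)  # a fixed seed makes the draw repeatable
    raw = {gid: rng.lognormvariate(mu, sigma) for gid in genome_ids}
    total = sum(raw.values())
    # Assumption: abundances are reported relative to the total.
    return {gid: value / total for gid, value in raw.items()}


# Invented genome IDs; mu=1 and sigma=2 are the values suggested above.
abundances = draw_relative_abundances(
    ["genome_A", "genome_B", "genome_C"], mu=1.0, sigma=2.0, seed=42
)
for gid, value in abundances.items():
    print(f"{gid}\t{value:.4f}")
```

A larger sigma spreads the drawn values over more orders of magnitude, so a few genomes dominate the community.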

pipelines/shared/config/conda.config:

  • conda.enabled

    This parameter enables conda to install and configure all needed software packages.

  • conda.useMamba

    This parameter lets CAMISIM use mamba instead of conda to install and configure all needed software packages.

    This is recommended for performance reasons.

  • conda.cacheDir

    This parameter can be used to define a custom conda cache.

    If none is defined, Nextflow will create it in the work directory.

    Note that the directory has to exist before running the pipeline.

Output:

{out}/distributions/genome_distributions/distribution_{i}.txt

'{i}' is the index for each sample that is to be generated.

  • Column 1: genome_ID
  • Column 2: abundance

'genome_ID' is the identifier of the genomes used.
'abundance' is the relative abundance of a genome to be simulated. It reflects the number of genome copies, not the amount of genetic data of a genome.
For example, in a set of two genomes that both have an abundance of 0.5, where one genome is twice the size of the other, the bigger genome will make up about 67% (two thirds) of the genetic data in the simulated metagenome.
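
The relation between abundance and the share of genetic data can be checked with a few lines of Python. The function name, genome IDs and genome sizes are made up for this example; the calculation simply weights each abundance by the genome size.

```python
# Hypothetical helper for illustration only.
def data_share(abundances, genome_sizes):
    """Return each genome's share of the total genetic data."""
    # Amount of sequence contributed = abundance (copies) * genome size.
    weighted = {gid: abundances[gid] * genome_sizes[gid] for gid in abundances}
    total = sum(weighted.values())
    return {gid: value / total for gid, value in weighted.items()}


# Two genomes with equal abundance, one twice as large as the other.
shares = data_share(
    {"small": 0.5, "large": 0.5},
    {"small": 2_000_000, "large": 4_000_000},
)
print(shares)
```

Here the larger genome accounts for two thirds of the genetic data although both genomes have the same abundance.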

{out}/distributions/gene_distributions/distribution_{genome_ID}sample{i}.txt

'{i}' is the index for each sample that is to be generated.

  • Column 1: gene_ID
  • Column 2: abundance

'gene_ID' is the identifier of the selected parent feature type in the gene annotation file.
'abundance' is the relative abundance of a gene in that sample's genome.

{out}/distributions/final_distributions/{genome_ID}__{i}_final_distribution.txt

'{i}' is the index for each sample that is to be generated.

  • Column 1: gene_ID
  • Column 2: expression_value

'gene_ID' is the identifier of the selected parent feature type in the gene annotation file.
'expression_value' is the expression value of that transcript in the given sample.

{out}/internal/genome_to_id.tsv

This file is present if the metagenome is simulated from a profile. It lists, for each genome, the path to its copy in the 'genomes' folder of the output directory.

  • Column 1: genome_ID
  • Column 2: file path

{out}/sample_{i}/bam/sample{i}_{genome_id}.bam

The BAM files generated from the reads produced by the read simulator.

{out}/sample_{i}/reads/sample{i}{genome_id}*.fq

The simulated reads.

{out}/sample_{i}/reads/sample{i}_*.fq

A file containing all reads of this sample.

{out}/sample_{i}/reads/anonymous_reads.fq

If anonymization is done, this will be the only fastq file.

{out}/sample_{i}/reads/reads_mapping.tsv

Mapping of reads for evaluation

  • Column 1: anonymous read id
  • Column 2: genome id
  • Column 3: taxonomic id
  • Column 4: read id
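
For evaluation, the four-column layout above can be read with Python's csv module. The sketch below counts reads per genome ID. The function name and the example rows are invented, and skipping a header line that starts with '#' is an assumption about the file, not something stated on this page.

```python
import csv
import io
from collections import Counter


# Hypothetical helper, not part of CAMISIM.
def count_reads_per_genome(handle):
    """Count reads per genome id (column 2) in a reads_mapping.tsv stream."""
    counts = Counter()
    for row in csv.reader(handle, delimiter="\t"):
        # Skip blank lines and a possible '#'-prefixed header (assumption).
        if not row or row[0].startswith("#"):
            continue
        counts[row[1]] += 1
    return counts


# Invented rows, for illustration only.
example = io.StringIO(
    "#anonymous_read_id\tgenome_id\ttax_id\tread_id\n"
    "S0R0\tGenome1\t562\tread-0\n"
    "S0R1\tGenome2\t1423\tread-1\n"
    "S0R2\tGenome1\t562\tread-2\n"
)
read_counts = count_reads_per_genome(example)
print(read_counts)
```

For a real file, the same function can be called with an open file handle, for example `open(path, newline="")`, where `path` points to the reads_mapping.tsv of one sample.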

{out}/sample_{i}/contigs/anonymous_gsa.fasta

Fasta file with perfect assembly of reads of this sample

{out}/sample_{i}/contigs/gsa_mapping.tsv

Mapping of contigs for evaluation

  • Column 1: anonymous contig id
  • Column 2: genome id
  • Column 3: taxonomic id
  • Column 4: sequence id of the original genome (in 'source_genomes' folder)
  • Column 5: number of reads used in the contig
  • Column 6: start position
  • Column 7: end position

{out}/anonymous_gsa_pooled.fasta

Fasta file with perfect assembly of reads from all samples

{out}/gsa_pooled_mapping.tsv

Mapping of contigs from pooled reads for evaluation.

  • Column 1: anonymous_contig_id
  • Column 2: genome id
  • Column 3: taxonomic id
  • Column 4: sequence id of the original genome (in 'source_genomes' folder)
  • Column 5: number of reads used in the contig
  • Column 6: start position
  • Column 7: end position

{out}/seed/seed*.txt

The seeds used for the simulation. To reproduce the results of a simulation, use the used_initial_seed value from {out}/seed/seed.txt as the seed for the next run. The seed can be configured in the nextflow.config file.