Configuration File Options

joyceFunk123 edited this page Aug 9, 2024 · 23 revisions

CAMISIM is configured via various configuration files. Initially, these are filled with default data. The individual configuration files and their parameters are documented below. Optional arguments are written in italic.

To facilitate the switch from CAMISIM version 1.3 to CAMISIM 2.0, it is possible to convert outdated configuration files. To do this, use the script convert_config.py like this:

>> python convert_config.py ./defaults/mini_config.ini

The configuration file must be structured like the defaults/mini_config.ini. You can then use the converted configuration file like this:

>> nextflow run workflow.nf -c converted_nextflow.config

pipelines/metagenomic/config/metagenomic.config:

This configuration file contains the general settings.

  • outdir
    The directory to save all output files to.
    If the directory does not exist, it will be created.

  • size
    Size of a single sample in gigabase pairs (Gbp).
    The actual size, including mapping files, might be larger.

  • type
    Type of the used read simulator. Choose from "art"/"nanosim3"/"wgsim".

  • number_of_samples
    The number of samples to simulate reads for.

  • gsa
    This boolean parameter specifies whether to create a gold standard assembly for each sample.

  • pooled_gsa
    The samples to include in the pooled gold standard assembly, given as a list (for example "[0,1]").
    If the list is empty ("[]"), the pooled gold standard assembly is skipped.
    If the parameter is set to true, all samples are considered.

  • anonymization
    This boolean parameter specifies whether the data should be anonymized.

  • seed
    An optional seed to get consistent results.
    If None is used, a random seed is chosen.

  • biom_profile
    If a BIOM file is given, the metagenome simulation is executed from a profile (for example biom_profile="${projectDir}/defaults/mini.biom"). For details on this format, please consult the link given above. Also make sure that the id column of your OTUs does not contain special characters (for example &, | or ;) and that the taxonomy field is in the "greengenes taxonomy format", e.g. ["g__Escherichia"," s__Escherichia coli"].

  • genome_locations_file
    A tab-separated file mapping genome ids to the file paths of the genomes.

    • Column 1: Genome id
    • Column 2: File path. Use either an absolute or a relative path to the reference fasta file.
      Note that if a relative path is used, CAMISIM expects the reference fasta file(s) to be in the project directory.

  • ncbi_taxdump_file (optional)
    Taxonomy dump from the NCBI. If no taxonomy dump is set, CAMISIM will download a new one.

  • metadata_file
    Path to the tab-separated metadata file (TSV).
    It maps genome ids to additional information on their classification:

    • Row 1 (header): genome_ID\tOTU\tNCBI_ID\tnovelty_category
    • Column 1: Genome id
    • Column 2: Operational taxonomic unit (OTU) - membership in some taxonomic unit
    • Column 3: NCBI taxonomy identifier
    • Column 4: novelty category - if a genome is not in the database, how "new" is it in comparison to genomes in the NCBI (new_strain, new_species, new_genus, ...)
      For more information on this file, see: Genome selection
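
The metadata file layout described above can be checked before the pipeline is started. The following is only a minimal sketch and not part of CAMISIM: the helper name `check_metadata` and the example rows are hypothetical, and it assumes that the header line matches the four column names listed above exactly.

```python
import csv
import io

# Header as described for the metadata file above.
EXPECTED_HEADER = ["genome_ID", "OTU", "NCBI_ID", "novelty_category"]


def check_metadata(handle):
    """Return the genome ids found in a tab-separated metadata file.

    Raises ValueError if the header or any row does not have the four
    expected columns. Hypothetical helper, not shipped with CAMISIM.
    """
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    if header != EXPECTED_HEADER:
        raise ValueError(f"unexpected header: {header!r}")
    genome_ids = []
    for line_no, row in enumerate(reader, start=2):
        if not row:  # tolerate blank lines
            continue
        if len(row) != 4:
            raise ValueError(f"line {line_no}: expected 4 columns, got {len(row)}")
        genome_ids.append(row[0])
    return genome_ids


# Made-up example content, for illustration only.
example = io.StringIO(
    "genome_ID\tOTU\tNCBI_ID\tnovelty_category\n"
    "Genome1\t1\t562\tnew_strain\n"
    "Genome2\t2\t1280\tnew_species\n"
)
print(check_metadata(example))
```

The same genome ids are expected to appear in the first column of the genome_locations_file, so the returned list could be compared against that file as an additional check.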

pipelines/metagenomic/config/art.config:

This configuration file configures one of the three read simulators. If the parameter "type" in the nextflow.config file is set to "art", this file is used.

  • profile_read_length
    The length of reads to be simulated.

  • fragment_size_mean
    The mean size of the fragments to be created.

  • fragment_size_sd
    The standard deviation of the size of the fragments to be created.

  • base_profile_name
    Folder containing error profiles for the read simulators.

pipelines/metagenomic/config/nanosim.config:

This configuration file configures one of the three read simulators. If the parameter "type" in the nextflow.config file is set to "nanosim3", this file is used.

  • base_profile_name
    Folder containing error profiles for the read simulators.

  • read_length
    This parameter is optional. The default value is 4508. If the parameter is not set, CAMISIM will calculate the read length.

  • simulate_fastq_directly
    This parameter indicates whether nanosim simulates reads in fastq format directly with "real" quality scores. If false, nanosim simulates reads in fasta format and CAMISIM converts them to fastq. If true, a basecaller must be selected. Note that directly simulating fastq may take longer.

  • basecaller
    The basecaller to use for the read simulation in fastq format. Has to be one of "albacore"/"guppy"/"guppy-flipflop".
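
The dependency between simulate_fastq_directly and basecaller can be expressed as a small pre-flight check. This is a sketch under assumptions, not CAMISIM's own validation: the function name `check_nanosim_settings` is hypothetical, and how CAMISIM itself reacts to an invalid combination is not described here.

```python
# Basecaller names as listed for the basecaller parameter above.
ALLOWED_BASECALLERS = {"albacore", "guppy", "guppy-flipflop"}


def check_nanosim_settings(simulate_fastq_directly, basecaller=None):
    """Raise ValueError if direct fastq simulation lacks a valid basecaller.

    Hypothetical helper. When simulate_fastq_directly is false, the
    basecaller value is not inspected, because the description above only
    requires a basecaller for direct fastq simulation.
    """
    if simulate_fastq_directly and basecaller not in ALLOWED_BASECALLERS:
        allowed = ", ".join(sorted(ALLOWED_BASECALLERS))
        raise ValueError(
            f"simulate_fastq_directly is true, so basecaller must be one of: "
            f"{allowed} (got {basecaller!r})"
        )


check_nanosim_settings(False)            # fasta output, no basecaller needed
check_nanosim_settings(True, "guppy")    # direct fastq with a valid basecaller
```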

pipelines/metagenomic/config/wgsim.config:

This configuration file configures one of the three read simulators. If the parameter "type" in the nextflow.config file is set to "wgsim", this file is used.

  • profile_read_length
    The length of reads to be simulated.

  • fragment_size_mean
    The mean size of the fragments to be created.

  • fragment_size_sd
    The standard deviation of the size of the fragments to be created.

  • base_error_rate
    The base error rate.

  • create_cigar
    This flag indicates whether to create a "real" CIGAR string in the SAM file. Note that this may take a while. If this is set to false, the CIGAR will be a default one: "M".

pipelines/metagenomic/config/profile.config:

This configuration file configures the metagenome simulation from a profile. This file is used when biom_profile is set in the nextflow.config file.

  • max_strains_per_otu
    Maximum number of strains drawn from genomes belonging to a single OTU. The OTU is taken from the metadata file. The value must be greater than or equal to 2.

  • reference_genomes
    File listing the reference genomes in the format: NCBI id\tScientific name\tNCBI ftp address of full genome.

  • no_replace
    Use sampling without replacement, so each genome is used for exactly one OTU (decreases accuracy).

  • fill_up
    If no genomes are found for certain OTUs, fill up with previously unused genomes.

  • additional_references
    This parameter is optional.
    File containing additional reference genomes, mapped to OTUs from the input profile.

  • verbose
    Show the used distribution of genomes before simulating.
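
The three-column layout of the reference_genomes file can be read with a few lines of Python. This is a minimal sketch, not CAMISIM code: `ReferenceGenome` and `parse_reference_genomes` are hypothetical names, the example line uses a placeholder address, and the sketch assumes the file has no header line.

```python
from typing import NamedTuple


class ReferenceGenome(NamedTuple):
    """One line of the reference_genomes file (hypothetical container)."""
    ncbi_id: str
    scientific_name: str
    ftp_address: str


def parse_reference_genomes(lines):
    """Parse lines of the form: NCBI id<TAB>Scientific name<TAB>ftp address.

    Assumes there is no header line. Raises ValueError for lines that do
    not have exactly three tab-separated fields.
    """
    records = []
    for line_no, line in enumerate(lines, start=1):
        line = line.rstrip("\r\n")
        if not line:  # skip empty lines
            continue
        fields = line.split("\t")
        if len(fields) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(fields)}")
        records.append(ReferenceGenome(*fields))
    return records


# Placeholder content for illustration; the address is not a real download link.
example_lines = ["562\tEscherichia coli\tftp://example.org/placeholder.fna.gz\n"]
for record in parse_reference_genomes(example_lines):
    print(record.ncbi_id, record.scientific_name)
```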

pipelines/metagenomic/config/distribution.config:

This configuration file configures the creation of the abundance distribution.

  • distribution_files
    The paths to the distribution files to use for the read simulation. Use a wildcard or a list of paths to the distribution files ([path1, path2]). If this parameter is empty, CAMISIM will calculate new distributions.
    The file names must look like this: "<name>_<sample_id>". The sample ID must be numerical, starting at 0 and in ascending order (example: "distribution_0", "distribution_1", "distribution_2", ...).
    Each file maps genome ids to their abundance in one sample; one file per sample is required. The individual files per sample should be comma-separated.

    • Column 1: Genome id
    • Column 2: Abundance (float)

  • just_community_design
    If a community design is performed (see parameter distribution_files), the pipeline can be stopped after the community design.
    If this parameter is set to true, the simulation stops after the community design and outputs the distribution files. The user can inspect and modify them and pass them in again.
    If this parameter is set to false, the pipeline executes all steps.

  • mode
    Mode for changing the abundances in different samples. Has to be one of "replicates"/"timeseries_lognormal"/"timeseries_normal"/"differential".

  • log_mu
    Mean of the used log-normal distribution.
    1 is an empirically good mean.

  • log_sigma
    Standard deviation of the used log-normal distribution.
    2 is an empirically good sd.

  • gauss_mu
    Mean of the used normal distribution.

  • gauss_sigma
    Standard deviation of the used normal distribution.

  • genomes_total
    Total number of simulated genomes. The difference between genomes_total and genomes_real is simulated by sgEvolver. Needs to be greater than or equal to genomes_real. If it is equal to genomes_real, there will be no strain simulation.

  • genomes_real
    Number of genomes used from the input genomes.

  • id_to_gff_file
    Optional file used by sgEvolver, mapping to gene annotations of the input genomes.

  • strain_simulation_template
    Path to a template.tree for sgEvolver from the Mauve suite. An example tree is shipped with sgEvolver itself within CAMISIM.
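
The naming rule for distribution_files ("<name>_<sample_id>", with numerical sample IDs starting at 0) can be verified before the files are passed to the pipeline. The following is a sketch under assumptions and not part of CAMISIM: `order_distribution_files` is a hypothetical helper, and it assumes that the file names carry no extension after the sample ID, as in the example names above.

```python
import re
from pathlib import Path


def order_distribution_files(paths):
    """Return the given paths sorted by sample ID.

    Raises ValueError if a file name does not look like "<name>_<sample_id>",
    if a sample ID occurs twice, or if the IDs are not exactly 0..n-1.
    Hypothetical helper; assumes no file extension after the sample ID.
    """
    by_id = {}
    for path in paths:
        match = re.fullmatch(r"(.+)_(\d+)", Path(path).name)
        if match is None:
            raise ValueError(f"not of the form <name>_<sample_id>: {path}")
        sample_id = int(match.group(2))
        if sample_id in by_id:
            raise ValueError(f"sample ID {sample_id} occurs more than once")
        by_id[sample_id] = str(path)
    expected = list(range(len(by_id)))
    if sorted(by_id) != expected:
        raise ValueError(f"sample IDs must be {expected}, got {sorted(by_id)}")
    return [by_id[sample_id] for sample_id in expected]


print(order_distribution_files(["out/distribution_1", "out/distribution_0"]))
```

A similar pre-flight check could cover the constraint that genomes_total must be greater than or equal to genomes_real.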

pipelines/shared/config/conda.config:

  • conda.enabled
    This parameter enables conda to install and configure all required software packages.

  • conda.useMamba
    This parameter lets CAMISIM use mamba instead of conda to install and configure all required software packages. This is recommended for performance reasons.

  • conda.cacheDir
    This parameter can be used to define a custom conda cache directory. If none is defined, nextflow will create it in the work directory. Note that the directory has to exist before running the pipeline.