Configuration File Options
CAMISIM is configured via various configuration files. Initially, these are filled with default data. The individual configuration files and their parameters are documented below. Optional arguments are written in italic.
To facilitate the switch from CAMISIM version 1.3 to CAMISIM 2.0, it is possible to convert outdated configuration files. To do this, use the script convert_config.py like this:
>> python convert_config.py ./defaults/mini_config.ini
The configuration file must be structured like the defaults/mini_config.ini. You can then use the converted configuration file like this:
>> nextflow run workflow.nf -c converted_nextflow.config
This configuration file (nextflow.config) holds the general settings.
-
outdir
The directory to save all output files to.
If the directory does not exist, it will be created. -
size
Size of a single sample in Gigabasepairs (Gbp).
The actual size, including mapping files, might be larger. -
type
Type of the used read simulator. Choose from "art"/"nanosim3"/"wgsim". -
number_of_samples
The number of samples to simulate reads for. -
gsa
This boolean parameter specifies whether to do the gold standard assembly per sample. -
pooled_gsa
The samples to do the gold standard assembly for ("[0,1]").
If the list is empty ("[]") the pooled gsa will be skipped.
If the parameter is set to true, all samples will be considered. -
anonymization
This boolean parameter specifies whether the data should be anonymized. -
seed
An optional seed to get consistent results.
If None is used, a random seed is chosen. -
biom_profile
If a BIOM file is given, the metagenome simulation will be executed from profile (biom_profile="${projectDir}/defaults/mini.biom"). For details on this format, please consult the link given above. Also make sure that the id column of your OTUs does not contain special characters (for example &, | or ;) and that the taxonomy field is in the "greengenes taxonomy format", e.g. ["g__Escherichia"," s__Escherichia coli"] -
genome_locations_file
A tab-separated file mapping genome ids to the file paths of the genomes.
- Column 1: Genome id
- Column 2: file path
Use either an absolute path to the reference fasta file or a relative one.
Note that if a relative path is used, CAMISIM will expect the reference fasta file(s) to be in the project directory.
-
_ncbi_taxdump_file
Taxonomy dump from the NCBI. If no Taxonomy dump is set, CAMISIM will download a new one. -
metadata_file
Path to the tab-separated input metadata tsv file.
It maps genome ids to additional information on their classification:
- Row 1 (header): genome_ID\tOTU\tNCBI_ID\tnovelty_category
- Column 1: Genome id
- Column 2: Operational taxonomic unit (OTU) - membership in some taxonomic unit
- Column 3: NCBI taxonomy identifier
- Column 4: novelty category - if a genome is not in the database, how "new" is it in comparison to genomes in the NCBI (new_strain, new_species, new_genus, ...)
Also see more information on this file here: Genome selection
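The two tab-separated input files described above (genome_locations_file and metadata_file) can be generated with a few lines of Python. The following is only a minimal sketch: the function names, genome ids, paths and taxonomy values are hypothetical and not taken from the CAMISIM defaults, and it assumes that the genome locations file has no header row, since none is documented.

```python
import csv
import io

# Header row of the metadata file, as documented above.
METADATA_HEADER = ["genome_ID", "OTU", "NCBI_ID", "novelty_category"]


def genome_locations_tsv(locations):
    """Render {genome_id: fasta_path} as two tab-separated columns (no header assumed)."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    for genome_id, path in locations.items():
        writer.writerow([genome_id, path])
    return buf.getvalue()


def metadata_tsv(rows):
    """Render (genome id, OTU, NCBI id, novelty category) rows below the header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, delimiter="\t", lineterminator="\n")
    writer.writerow(METADATA_HEADER)
    writer.writerows(rows)
    return buf.getvalue()


# Hypothetical example values, for illustration only.
locations_text = genome_locations_tsv({"Genome1": "genomes/genome1.fasta"})
metadata_text = metadata_tsv([("Genome1", 1, 562, "new_strain")])
```

Writing `locations_text` and `metadata_text` to disk gives files whose paths can then be used for genome_locations_file and metadata_file.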
This configuration file configures one of the three read simulators. If the parameter "type" in the nextflow.config file is set to "art", this file is used.
-
profile_read_length
The length of reads to be simulated. -
fragment_size_mean
The mean size of the fragments to be created. -
fragment_size_sd
The standard deviation of fragments to be created. -
base_profile_name
Folder containing error profiles for the read simulators.
This configuration file configures one of the three read simulators. If the parameter "type" in the nextflow.config file is set to "nanosim3", this file is used.
-
base_profile_name
Folder containing error profiles for the read simulators.
-
read_length
This parameter is optional. The default value is 4508. If the parameter is not set, CAMISIM will calculate the read length. -
simulate_fastq_directly
This parameter indicates whether nanosim simulates reads in fastq format directly with "real" quality scores. If false, nanosim will simulate reads in fasta format and CAMISIM will convert them to fastq. If true, a basecaller must be selected. Note that directly simulating fastq may take longer. -
basecaller
The basecaller to use for the read simulation in fastq format. It has to be one of these: {albacore,guppy,guppy-flipflop}
This configuration file configures one of the three read simulators. If the parameter "type" in the nextflow.config file is set to "wgsim", this file is used.
-
profile_read_length
The length of reads to be simulated. -
fragment_size_mean
The mean size of the fragments to be created. -
fragment_size_sd
The standard deviation of fragments to be created. -
base_error_rate
The base error rate. -
create_cigar
This flag indicates whether to create a "real" CIGAR string in the sam file. Note that this may take a while. If this is set to false, the CIGAR will be a default one: "M"
This configuration file configures the metagenome simulation from profile. This file is used when a biom_profile is set in the nextflow.config.
-
max_strains_per_otu
Maximum number of strains drawn from genomes belonging to a single OTU. The OTU is taken from the metadata file. Max strains needs to be greater than or equal to 2. -
reference_genomes
File pointing to reference genomes of the format: NCBI id\tScientific name\tNCBI ftp address of full genome. -
no_replace
Use sampling without replacement, so that each genome is used for exactly one OTU (decreases accuracy). -
fill_up
If no genomes are found for certain OTUs, fill up with previously unused genomes. -
additional_references
This parameter is optional.
File containing additional reference genomes, mapped to OTUs from the input profile. -
verbose
Show the used distribution of genomes before simulating.
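Since the simulation from profile depends on a well-formed BIOM file, it can help to check the OTU ids and taxonomy entries before starting a run. The sketch below is an assumption-laden illustration, not part of CAMISIM: the function names are hypothetical, the forbidden characters are only the three examples named for biom_profile (the real list may be longer), and the taxonomy check merely tests for the usual greengenes rank prefixes such as "g__" and "s__".

```python
# Characters named as examples for biom_profile; the full list may be longer (assumption).
FORBIDDEN_CHARS = set("&|;")

# Single-letter rank prefixes used by the greengenes taxonomy format.
GREENGENES_RANKS = {"k", "p", "c", "o", "f", "g", "s"}


def problematic_otu_ids(otu_ids):
    """Return the OTU ids that contain at least one of the forbidden characters."""
    return [otu_id for otu_id in otu_ids if FORBIDDEN_CHARS & set(otu_id)]


def looks_like_greengenes(taxonomy):
    """Rough check: every entry starts with a rank prefix such as 'g__' or 's__'."""
    for entry in taxonomy:
        entry = entry.strip()  # entries may carry a leading space
        if len(entry) < 3 or entry[0] not in GREENGENES_RANKS or entry[1:3] != "__":
            return False
    return True


# Hypothetical OTU ids; the taxonomy list is the example given for biom_profile.
bad_ids = problematic_otu_ids(["OTU_1", "OTU|2", "OTU;3"])
taxonomy_ok = looks_like_greengenes(["g__Escherichia", " s__Escherichia coli"])
```

Here `bad_ids` collects the ids that would need renaming, and `taxonomy_ok` is a quick sanity flag rather than a full validation of the BIOM file.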
This configuration file configures the creation of the abundance distribution.
-
distribution_files
The paths to the distribution files to use for the read simulation. Use a wildcard or a list of paths to the distribution files ([path1, path2]). If this parameter is empty, CAMISIM will calculate new distributions.
The file names must look like this: "<name>_<sample_id>". The sample ID must be numerical, starting at 0 in ascending order (example: "distribution_0", "distribution_1", "distribution_2", ...).
Each file has to map genome ids to their abundance in a certain sample; one file is required for each sample. The individual files per sample should be comma-separated:
- Column 1: Genome id
- Column 2: Abundance (float)
-
just_community_design
In case a community design is performed (see parameter distribution_files), it is possible to stop the pipeline after the community design.
If this parameter is set to true, the simulation will stop after the community design and output the distribution files. The user can inspect and modify those and input them again.
If this parameter is set to false, the pipeline will execute all steps.
-
mode
Mode for changing the abundances in different samples. Has to be one of "replicates"/"timeseries_lognormal"/"timeseries_normal"/"differential". -
log_mu
Mean of the used log-normal distribution.
1 is an empirically good mean. -
log_sigma
Standard deviation of the used log-normal distribution.
2 is an empirically good sd. -
gauss_mu
Mean of the used normal distribution. -
gauss_sigma
Standard deviation of the used normal distribution. -
genomes_total
Total number of simulated genomes. The difference between genomes_total and genomes_real is simulated by sgEvolver. Needs to be greater than or equal to genomes_real. If it is equal to genomes_real, there will be no strain simulation.
-
genomes_real
Number of genomes used from the input genomes.
-
id_to_gff_file
Optional file used by the sgEvolver, mapping to gene annotations of the input genomes.
-
strain_simulation_template
Path to a template.tree for the sgEvolver from the Mauve suite. An example tree is shipped with the sgEvolver itself within CAMISIM.
-
conda.enabled
This parameter enables conda to install and configure all needed software packages. -
conda.useMamba
This parameter lets CAMISIM use mamba instead of conda to install and configure all needed software packages. This is recommended for performance reasons. -
conda.cacheDir
This parameter can be used to define a custom conda cache. If none is defined, nextflow will create it in the work directory. Note that the directory has to exist before running the pipeline.
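Returning to the distribution_files parameter described further up: its naming and layout rules are easy to get wrong when the files are written by hand. The sketch below shows one possible way to write and check such files in Python. The function names, genome ids and abundance values are hypothetical, and two points are assumptions: that sample ids must run from 0 without gaps, and that file names carry no extension (as in the "distribution_0" example).

```python
import csv
import re
import tempfile
from pathlib import Path

# <name>_<sample_id>, with a purely numerical sample id at the end.
NAME_PATTERN = re.compile(r"^(?P<name>.+)_(?P<sample_id>\d+)$")


def sample_ids_are_valid(file_names):
    """True if all names match <name>_<sample_id> and the ids are 0, 1, 2, ... (no gaps assumed)."""
    ids = []
    for file_name in file_names:
        match = NAME_PATTERN.match(file_name)
        if match is None:
            return False
        ids.append(int(match.group("sample_id")))
    return sorted(ids) == list(range(len(ids)))


def write_distribution(path, abundances):
    """Write one comma-separated line per genome: genome id, abundance (float)."""
    with open(path, "w", newline="") as handle:
        writer = csv.writer(handle, lineterminator="\n")
        for genome_id, abundance in abundances.items():
            writer.writerow([genome_id, float(abundance)])


# Hypothetical abundances for two samples, written to a temporary directory.
samples = [{"Genome1": 0.7, "Genome2": 0.3}, {"Genome1": 0.4, "Genome2": 0.6}]
with tempfile.TemporaryDirectory() as tmp:
    for sample_id, abundances in enumerate(samples):
        write_distribution(Path(tmp) / f"distribution_{sample_id}", abundances)
    written = sorted(p.name for p in Path(tmp).iterdir())
    first_file = (Path(tmp) / "distribution_0").read_text()
```

Running `sample_ids_are_valid(written)` before passing the paths to distribution_files catches misnamed or missing sample files early.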