-
Notifications
You must be signed in to change notification settings - Fork 37
CAMISIM 1.3
With CAMISIM you can create simulated metagenome data sets out of taxonomic profiles or de novo from a list of genomes. If a taxonomic profile is used as input, the output data set is created from the NCBI complete genomes, reflecting the input profile as closely as possible and will contain the same number of samples as the input profile, if not specified otherwise. If the community is designed de novo, a user-defined number of the complete genomes is used for creating a community which maximizes genome novelty as well as phylogenetic spread.
CAMISIM can generate four types of simulated metagenome samples de novo:
- Individual simulate metagenome sample.
Uses a taxonomic profile sampled from a log-normal distribution - A time series of simulated metagenome samples.
Uses a taxonomic profile sampled from a log-normal distribution with Gaussian noise, a normal distribution added to successively generated samples. - A set of replicate simulated metagenome samples.
Uses a taxonomic profile sampled from a log-normal distribution, with Gaussian noise repeatedly added to the original log-normal values. - Differential abundance metagenome samples.
Uses a taxonomic profile sampled from a log-normal distribution.
The output of CAMISIM is the read files in fastq format, a mapping of these reads to the used genomes in BAM format, "gold standards" for assembly (fasta), binning (CAMI format) and profiling (CAMI format). All files are described here.
Download CAMISIM from the github repository into any directory and make sure the following Dependencies are fullfilled. The metagenome simulation pipeline can be downloaded using git:
git clone https://github.com/CAMI-challenge/CAMISIM
Please make sure that a software is of the specified version, as outdated (or sadly sometimes too new) software can cause errors. The path to the binaries of those tools can be set in the configuration file.
You can also use Docker image with necessary tools installed:
docker pull cami/camisim:latest
Please see usage section for usage instructions.
The pipeline has only been tested on UNIX systems and might fail to run on any other systems (and no support for these will be provided).
The following software is required for the simulation pipeline to run:
Python is the programming language most scripts were written in.
Python package used in the Pipeline for reading/writing sequence files.
Python package for the handling of BIOM files
Python package offering methods for scientific computing.
Python package offering methods for plotting graphs.
Perl 5 and the library XML::Simple
Perl 5 is a programming language, used in the script that generates a perfect assembly from bam files. The perl library XML::Simple can be installed from the unix package 'libxml-simple-perl'
Read simulator which offers error-free and uniform error rates. wgsim is also shipped within CAMISIM and does not have to be installed manually.
Read simulator for the generation of Oxford Nanopore Technologies (ONT) reads. Does not have a random seed in the original version, so cloning and installing the fork by abremges is advised if NanoPore reads should be simulated.
Read simulator for generating Pacific Biosciences (PacBio) reads. Has to be cloned/installed manually if PacBio reads should be simulated.
Microbial ecology software.
The simulation will be conducted primarily in the temporary folder declared in the configuration file or system tmp location by default. The results will then be moved to the output directory. Be sure to have enough space at both locations. Required hard drive space and RAM can vary a lot. Very small simulations can be run on a laptop with 4GB RAM, while realistic simulations of several hundred genomes, given a realistic metagenome size, can easily require several hundreds of gigabyte of both RAM and HD space, the chosen metagenome size being the relevant factor.
A database dump of the NCBI taxonomy is included, current versions can be downloaded from the NCBI FTP-Server.
If the community design should be performed de novo, genomes in fasta format to be sampled from are needed. Otherwise they will be downloaded from the NCBI complete genomes.
The de novo community design needs three files to run:
-
A file containing, tab separated, a genome identifier and that path to the file of the genome.
-
A file containing, tab separated, a genome identifier and that path to the gen annotation of genome. This one is uses in case strains are simulated based on a genome
-
A [[meta data file|meta-data-file-format] that contains, tab separated and with header, genome identifier, novelty categorization, otu assignment and a taxonomic classification.
At minimum the following files are required: "nodes.dmp", "merged.dmp", "names.dmp"
from_profile:
>> python metagenome_from_profile.py -p defaults/mini.biom
or de novo:
>> python metagenomesimulation.py defaults/mini_config.ini
To check CAMISIM is working properly, you can perform a test run using the second command above:
It takes about one hour and creates a ~2.4GB folder "out/" in your CAMISIM directory of a small 10 samples, 24 genomes data set. The configuration file as well as the used mapping files genome_to_id and metadata are available in the defaults
directory.
You can also use the camisim docker image
Example call:
docker run -it -v "/path/to/input/directory:/input:rw" -v "/path/to/output/directory:/output:rw" cami/camisim:latest metagenome_from_profile.py -p /input/mini.biom -o /output
where
-
/path/to/input/directory contains a biom file.
-
/path/to/output/directory is the output directory.
Note! You can replace the latest tag with any release number listed on the releases page.
CAMISIM proceeds in the following four steps:
Multiple kinds of simulated metagenome data sets can be created, either using an existing 16S rRNA profile or de novo: Single sample, differential abundance or time series data sets with
different insert sizes and complexities.
All data sets can be simulated as paired-end sequencing from Illumina HiSeq or other machines as well as other technologies (like ONT or PacBio) and error rates.
For generation of data sets and their samples, the following steps are undertaken:
After genome data validation (step 1), the community composition is designed according to specified criteria or the given taxonomic profile (step 2) and the metagenome data sets simulated (step 3).
Creation of the gold standards (step 4) represents the last part of the pipeline.
For the de novo metagenome simulation, all sequences shorter than 1000 base pairs are removed from the provided genome assemblies and sequences validated to contain only valid characters. The input genomes can be draft genomes in fasta format and as such contain beside the bases ACGT also ambiguous DNA characters such as 'RYWSMKHBVDN'.
If the input is a taxonomic profile in CAMI or BIOM format, the taxa appearing in it, are matched to closely related, complete genomes from the NCBI RefSeq or additional reference files with e.g. new genoems, such that the simulated metagenome (genomes and their abundances) reflects the input profile as closely as possible. This is done via the matching of scientific names in the input profile. Given the taxonomic names of the OTUs in the BIOM profile (this is required), a NCBI taxonomy ID is inferred. For all reference genomes, a taxonomy ID is also required. In the next step, the lineages between the OTUs and the genomes are compared and an OTU is mapped to a genome which has the smallest path between OTU taxonomy ID and genome taxonomy ID. Since the shortest path might be really long, a threshold is set to family level: An OTU will not get mapped to a genome, which is not in the same family, i.e. for every mapped genome it is guaranteed that they are at least of the same family. Usually this leaves some OTUs unassigned, either because their scientific name did not yield a NCBI taxonomy ID or because for the taxonomy ID no complete genomes are available. It is possible to assign "random" genomes from the reference collection(s) to the unmapped OTUs to increase complexity via an extra option.
For the de novo community design, a specified number of sequences (either circular elements or genomes) is selected according to certain criteria, see below, from all input sequences (e.g. circular elements or genomes) to generate a specific data set. The random selection of genomes for a particular data set is guided based on their membership in novelty category (Taxonomic annotation step 4.) and OTUs: To increase taxonomic spread, for a specified overall number of genomes to be included in a particular data set, the selection algorithm attempts to draw the same number of genomes from each novelty category (new strain, new species, new genus, new family, new order). If a category does not have enough genomes to thus enable coming up with the specified genome number, more genomes are drawn from the other categories, in equal amounts, if possible. This is repeated until the specified number of genomes has been sampled. To model strain level diversity within a species, for every OTU, all available genomes are included in one particular data set up to five genomes at most. More than five genomes per OTU are only included, if the specified data set size requires more genomes to be added, that are not available otherwise.
To increase strain level diversity in the data sets, a number of additional genomes can be simulated with sgEvolver12, representing related strains to the genomes selected above (see below). To this end, a genome is randomly chosen, and a strain number is drawn from a geometric distribution (p=0.3), which represented the number of related strains to be created for that particular genome. This is repeated, until the specified overall number strains to be generated is reached. For each of these genomes, sgEvolver simulates 40 strains, with an increasing genetic distance (in steps of 0.1%) to the preceding ones, up to a total distance of 4%. From the 40 strains, the specified strain number is randomly selected and added to the genome set.
For data set simulation, genome and circular element abundance distributions are required for every data set. For a single sample data set, the abundance distribution is created by sampling from a log-normal distribution with mu set to 1 and sigma to 2 by default. To reflect a differential abundance experiment, two or more abundance samples can be drawn independently from a log-normal distribution with the same parameter settings. For consecutive samples of a time series, abundances for the next sample are calculated by sampling new values from the log-normal distribution, adding these to the previous value and dividing by two. The absolute abundance of the circular elements can be set to be like 15 times as high as the abundance of the microbial genomes, to replicate high natural plasmid abundances. For each sample, a taxonomic profile for higher ranking taxa is calculated based on the genome abundance values and their taxonomic IDs.
Based on the genome abundance distribution, the genome sizes and the specified metagenome sample output size, for every genome, the coverage in the simulated samples is calculated. The chosen read simulator then generates reads, using the technology and properties (like read length, insert size or error rates if applicable) as specified by the user, sampling each genome proportionally to the specified abundance in the community. A read simulation of a dataset can be run a second time using a different mean insert size using the same genome abundances as before to replicate the use of a mix of technologies on the same samples. The size of every sample of each dataset can be set to any size.
Outputs of this step are for every sample of a data set, a FASTQ and a BAM file, which specifies the location of every read in the input sequences.
A gold standard assembly is made with SAMtools for each sample individually and for all samples of each dataset together (pooled gold standard), which included every sequence position with a coverage of at least one, respectively. Specifically, the SAM file obtained from the read simulation specifying the location of each read on the reference genomes is used with SAMtools for calculation of the per-site coverage. Sequences are then broken into gold standard contig sequences at zero coverage sites. The contig sequences are then labeled with a unique dataset and sample identifying prefix and an increasing index as suffix. In the last step, sequences of the simulated data sets are shuffled using the unix command 'shuf' and anonymized, thus creating the challenge data sets.
The following file format definitions are used in data exchange between stages of the CAMI pipeline.
-
BIOM file
For running the from_profile mode, a BIOM file is required. For details on this format, please consult the link given above. Also make sure that the id column of your OTUs do not contain special characters (for example &, | or ;) and that the taxonomy field is in the "greengenes taxonomy format", e.g. ["g__Escherichia"," s__Escherichia coli"] -
[[Configuration File|CAMISIM-1.3#Configuration-File-Options]
This contains the options and values required for the pipeline to run and a path to this file is a required program argument for the de novo simulation and, if not the default config is supposed to be used, also for the from profile mode. The following arguments from the config file expect a file path: -
id_to_genome_file
Tab separated data table. It maps genome ids with the file path to genomes. If strain simulation is used, the paths have to be absolute- Column 1: Genome id
- Column 2: file path
-
metadata Tab separated data table It maps genome ids with additional information of their classification
- Row 1 (header): genome_ID\tOTU\tNCBI_ID\tnovelty_category
- Column 1: Genome id
- Column 2: Operational taxonomic unit (OTU) - membership in some taxonomic unit
- Column 3: NCBI taxonomy identifier
- Column 4: novelty category - if a genome is not in the database, how "new" is it in comparison to genomes in the NCBI (new_strain, new_species, new_genus, ...)
Also see more information on this file here: Genome selection
-
id_to_gff_file
Optional, tab separated data table. It maps genome ids with the file path to the gene annotation of a genome. Absolute paths are required.- Column 1: Genome id
- Column 2: file path
-
distributions_file_path Path to optional, tab separated data tables. If this option is not blank, no distribution will be drawn, but the abundance values provided within this file are used. It maps genome ids to their abundance in a certain sample, one file for each sample is required. The individual files per sample should be comma-separated
- Column 1: Genome id
- Column 2: Abundance (float)
'{i}' is the index for each sample that is to be generated.
- Column 1: genome_ID
- Column 2: abundance
'genome_ID' is the identifier of the genomes used.
'abundance' is the relative abundance of a genome to be simulated. 'abundance' does not reflect the amount of genetic data of a genome, but the amount of genomes.
In a set of two genomes, with both having a abundance of 0.5 but one genome is double the size of the other, the bigger genome will be 66% of the genetic data in the simulated metagenome.
All given genomes will be copied and placed in this folder. Doing this, sequence names are made sure to be unique and renamed if required. Comments and descriptions of sequences are removed. Also index files *.fai will be present after simulation
This file contains a list of replaced sequence ids, may be empty if none where replaced.
- Column 1: genome_ID
- Column 2: original sequence id
- Column 3: new sequence id
List of genomes paths to the copies in the output directory in the 'source_genomes' folder.
- Column 1: genome_ID
- Column 2: file path
Merged meta data of genomes of each community that are actually used for the simulation.
Unused meta data of genomes of every community.
bam files generated based on reads generated from the read simulator
If no anonymization is not done in which case the original fastq files will be here.
If anonymization is done, this will be the only fastq file.
Mapping of reads for evaluation
- Column 1: anonymous read id
- Column 2: genome id
- Column 3: taxonomic id
- Column 4: read id
Fasta file with perfect assembly of reads of this sample
Mapping of contigs for evaluation
- Column 1: anonymous contig id
- Column 2: genome id
- Column 3: taxonomic id
- Column 4: sequence id of the original genome (in 'source_genomes' folder)
- Column 5: number of reads used in the contig
- Column 6: start position
- Column 7: end position
Fasta file with perfect assembly of reads from all samples
Mapping of contigs from pooled reads for evaluation.
- Column 1: anonymous_contig_id
- Column 2: genome id
- Column 3: taxonomic id
- Column 4: sequence id of the original genome (in 'source_genomes' folder)
- Column 5: number of reads used in the contig
- Column 6: start position
- Column 7: end position
Taxonomic profile for each sample
These is a list of all configuration options that are required for CAMISIM
to run. Optional arguments are written in italic. The sections, denoted by the words in [brackets] are also required.
[Main]
seed=
- An optional seed to get consistent results, if None is used, random seed is chosen
phase=
- Starting point of the simulation
0: Full run
1: Only community design
2: Start with read simulation
max_processors=
- Maximum number of processors used
dataset_id=
- Name of the created sample
output_directory=
- Output directory
- Will be created if it does not exist.
- Make sure directory is empty before running
CAMISIM
temp_directory=
- Path for storing temporary files of
CAMISIM
- Make sure enough space is available
gsa=
- Whether a gold standard assembly should be created
- May be True/False or Yes/No
pooled_gsa=
- Whether a pooled gold standard over all samples is created
- True/False/Yes/No
anonymous=
- Whether the output is anonymized
- Also True/False/Yes/No
- Can be computationally expensive
compress=
- Whether the data should be compressed or not
- 0 is for no comrepssion, 9 is maximum comporession
- recommended is using compression level 1 at least
[ReadSimulator]
readsim=tools/art_illumina-2.3.6/art_illumina
- path to the executable of the chosen read simulator
- ART is shipped within
CAMISIM
at the above relative path
error_profiles=tools/art_illumina-2.3.6/profiles
- Folder containing error profiles for the read simulators
- Might be emoty for some simulators (e.g. NanoSim)
samtools=tools/samtools-1.3/samtools
- Path to samtools executable
- Version 1.3 (recommended) is shipped within
CAMISIM
profile=mbarc
- used read simulator error profile
- mbarc is recommended for bacterial communities
- choose for ART: mi/hi/hi150/mbarc
- ignored by other read simulators
size=
- size of a single sample in Gigabasepairs (Gbp)
- actual size, including mapping files, might be larger
type=
- type of the used read simulator
- choose from art/wgsim/nanosim/pbsim
fragments_size_mean=
- mean size of the fragments to be created
- will be ignored by NanoSim
fragment_size_standard_deviation=
- standard deviation of fragments to be created
- will be ignored by NanoSim
[CommunityDesign]
distribution_file_paths=
- optional path to input abundance files
- files are tsv files of the format genome
Tab
abundance - One per sample to be created
ncbi_taxdump=tools/ncbi-taxonomy_20170222.tar.gz
- Taxonomy dump form the NCBI
- Working version can be found at the relative path above
strain_simulation_template=scripts/StrainSimulationWrapper/sgEvolver/simulation_dir
- Path to a template.tree for the sgEvolver from the mauve suite
- Example tree is shipped along the sgEvolver itself within
CAMISIM
number_of_samples=
- Number of samples to be created
[community0]
metadata=
- Path of the input metadata tsv file
- file has to be in the format: genome_ID
Tab
OTUTab
NCBI_IDTab
novelty_category - And this line as a header
id_to_genome_file=
- Path to the input genome_to_id tsv file
- format is: genome_ID
Tab
genome path - No header
id_to_gff_file=
- Optional file used by the sgEvolver, mapping togene annotations of the input genomes
genomes_total=
- Total number of simulated genomes
- Difference between
genomes_total
andgenomes_real
are simulated by sgEvolver
genomes_real=
- Number of genomes used from the input genomes
max_strains_per_otu=
- Maximum number of strains drawn from genomes belonging to a single OTU
- OTU is taken from the metadata file
ratio=
- ratio between different communities
- default: 1, i.e. if only one community is present
mode=
- mode for changing the abundances in different samples
- one of replicates/timeseries_lognormal/timeseries_normal/differential
log_mu=
- Mean of the used log-normal distribution
- 1 is an empirically good mean
log_sigma=
- standard deviation of the used log-normal distribution
- 2 is an empirically good sd
gauss_mu=
- mean of the used normal distribution
gauss_sigma=
- standard deviation of the used normal distribution
view=
- Show the used distribution of genomes before simulating
- default: False