Benchmarking Analysis of MicroBiome Inference methods
This repository contains the scripts to perform a large-scale benchmarking effort of differential abundance testing methods for microbiome data that is described in our preprint.
The project relies heavily on SIMBA, an R package to simulate microbiome data with various methods, evaluate how well the simulated data mimic reality, apply differential abundance testing methods to these simulations, and finally evaluate the results from these differential abundance testing methods. This repository contains the code that we used to perform the different benchmarks through SIMBA, relying on the batchtools package for automation and parallelization of computing jobs.
For reproduction purposes, we additionally make use of the renv package, which provides reproducible environments for R.
Since BAMBI is not an R package, the setup is a bit more involved. Reproducibility is ensured by using renv, and batchtools handles the communication with a high-performance computing infrastructure. We ran our analyses on a SLURM cluster and on a Grid Engine architecture, but batchtools supports a wide range of architectures, including local machines and manually-defined computing clusters. We therefore encourage you to check out the batchtools documentation if you are using a different architecture.
- You will need R version 4.0.0 installed.
- Clone the SIMBA package, which contains the functions for simulating data and applying differential abundance tests.
- Install the renv package (v0.12.5) from CRAN with install.packages('renv') to bootstrap the project-local library contained in renv.lock.
- Restore the environment via renv, thereby creating the same environment (in terms of packages used and their versions) that we used in our analyses; you will be prompted by R the first time you open the terminal (see the short sketch below this list).
- Configure a batchtools template in the cluster_config subdirectory.
  - The default configuration is for a SLURM HPC; details for other HPC environments are available here.
- Adjust the .Rprofile file within this repository (it is currently configured for our computing setup).
  - The RENV_PATHS_ROOT parameter indicates where renv stores packages.
  - The simba.loc parameter indicates where the SIMBA package is located.
  - The temp.loc parameter indicates where temporary files should be stored.
- Download the requisite datasets to the data subdirectory by executing Rscript ./src/download_data.R
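As a short sketch, the renv steps amount to the following (run in R from the repository root, where renv.lock is located):
install.packages('renv')   # we used renv v0.12.5
renv::restore()            # recreates the package versions recorded in renv.lock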
The scripts work on a per-simulation basis, meaning that one script is usually executed per simulation file. The scripts to create all the simulations included in our analyses are stored in the create_simulations subdirectory.
In the following code chunks, we will use the simulations from McMurdie and Holmes (MMH) as an example.
The simulations will be saved to simulations/sim_MMH.h5 in a hierarchical data format (HDF5):
Rscript ./create_simulations/sim_MMH.R
or (on a SLURM cluster)
cd create_simulations
sbatch -J MMH create_sim.sh sim_MMH.R
Expected output:
Loading SIMBA
+ Start checking data
++ BMI is interpreted as body mass index:
will be converted to underweight/normal/overweight/obese
++ Age will be converted to a factor via quartiles
++ Library_Size will be converted to a factor via quartiles
++ meta-variables:
Group
++ have only a single value across all samples and will be removed
++ meta-variables:
Sample_ID
++ have too many values across samples and will be removed
++ Removing repeated measurements for individuals
+ Finished checking data
+ Start checking the simulation location
+ Finished checking the simulation location
+ Start filtering the data
++ Performing prevalence and abundance filtering
++ After filtering, the dataset contains 844 features and 880 samples
+ Finished filtering the data
+ Start checking the simulation parameters
+ Finished checking the simulation parameters
+ Save original data in the h5 file
+ Finished saving original data in the h5 file
+ Starting data generation using the method: McMurdie&Holmes
++ Create simulation template from data
++ Finished creating simulation template
+ Finished data generation
+ Starting to check parameters for test idx creation
++ Remove test idx, if present
++ Test idx do not exist yet
+ Start test idx creation
+ Finished test idx creation
Estimated time to create this specific simulation on a desktop computer: 3 minutes.
Please note that the time to create different simulations can vary dramatically between methods. Also note that some simulations (for example sim_negbin_cor) require multiple cores, since the underlying code will call SPIEC-EASI to estimate the underlying correlation structure, which is prohibitively slow without parallelization.
Please also check out the SIMBA vignette for more information about how the data are stored within the .h5 file produced by this script.
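If you want to inspect the simulation file yourself, one option (an assumption on our side; any HDF5 reader works) is the rhdf5 Bioconductor package:
library(rhdf5)                    # Bioconductor package for reading HDF5 files
h5ls('./simulations/sim_MMH.h5')  # list the groups and datasets stored in the file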
In a second step, the simulated data are compared to the real data using various measures, for example sparsity or feature variance. Additionally, PCoA projections are computed to visualize the differences between real and simulated samples.
To do so, several jobs are created with batchtools when you call this script with the simulation ID as parameter:
Rscript ./src/reality_checks_cluster.R sim_Weiss
The script also submits the jobs to the cluster (since it is usually only a handful of jobs, as many as there are different effect size combinations). After all the jobs have been completed, you can combine the results by calling:
Rscript ./src/reality_checks_combination.R sim_Weiss
Since this step demands quite a bit of memory (usually more than is available on a submit node), we again provide a SLURM-specific script to send this compute job to the cluster:
cd ./src
sbatch -J Weiss_reality reality_checks.sh
The results will be found in the subdirectory './reality_checks/sim_Weiss'
├── reality_checks
│   ├── sim_MMH                            <- Reality check results from sim_MMH
│   │   ├── auc_all.tsv                    <- AUCs from machine learning and PERMANOVA
│   │   ├── auc_plot.pdf                   <- AUC plotted as heatmap
│   │   ├── effect_size.pdf                <- Effect size measures plotted as scatter plot
│   │   ├── pco_plots.pdf                  <- PCoA plots for different effect sizes
│   │   ├── sparsity_plot.pdf              <- Sparsity plotted as heatmap
│   │   ├── sparsity.tsv                   <- Sparsity measures as table
│   │   ├── variance.tsv                   <- Variance and effect size measures for all features
│   │   ├── variance_by_group.pdf          <- Variance plotted as heatmap
│   │   ├── variance_by_group_and_type.pdf <- Variance plotted as heatmap
│   │   └── variance_scatter_plots.pdf     <- Variance plotted as scatter plot
│   │
│   └── other_sim                          <- Other simulations will get their own folder
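The tabular outputs can be read back into R in the usual way, for example:
library(tidyverse)
sparsity.MMH <- read_tsv('./reality_checks/sim_MMH/sparsity.tsv')
variance.MMH <- read_tsv('./reality_checks/sim_MMH/variance.tsv')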
In our benchmark, we applied various differential abundance testing methods to
the simulations, using the test indices created in step 1, so that each test
is applied to exactly the same data. Since there are many different simulated
or implanted effect sizes, several simulated sample sizes, and various methods
that all need a different amount of time, we end up with many different jobs to
send to the cluster. Here, batchtools comes to the rescue.
We first create a batchtools registry, in which we initialize all the jobs
that we want to have computed. Then, we send the jobs to the cluster (this part
requires a bit of manual oversight to check if some jobs have failed or if
all have finished).
For these tasks, we have two R scripts. First, we prepare the registry:
Rscript ./src/prepare_registry.R sim_MMH
Then, we can run the jobs with:
Rscript ./src/run_tests.R sim_MMH
Alternatively, to run the tests of a single method only (for example ANCOM, since there might be many jobs for this specific method), we can also type:
Rscript ./src/run_tests.R sim_MMH ANCOM
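Under the hood, this follows the standard batchtools workflow. A minimal sketch (with a hypothetical registry path, job function, and parameter values, not the exact ones used in our scripts) could look like this:
library(batchtools)

# create a registry that stores all job definitions and results
reg <- makeRegistry(file.dir = './registries/sim_MMH_demo', seed = 2022)

# placeholder for the SIMBA routine that applies one test to one simulation subset
run.test <- function(test, group, subset) {
  message('Applying ', test, ' to group ', group, ', subset ', subset)
}

# one job per combination of test, effect-size group, and sample-size subset
batchMap(fun = run.test,
         args = expand.grid(test   = c('limma', 'wilcoxon'),
                            group  = c('group_1', 'group_2'),
                            subset = c('subset_1', 'subset_2'),
                            stringsAsFactors = FALSE))

submitJobs()   # submit via the template configured in cluster_config
getStatus()    # check how many jobs are queued, running, done, or have failed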
Since the different tests have different time requirements, the jobs are chunked differently across tests: faster tests are bundled into fewer jobs.
Therefore, for the sim_MMH simulation, we will have to run the following numbers of jobs per test:
ALDEx2 980
ANCOM 980
corncob 980
distinct 980
ZIBSeq 980
ZIBSeq-sqrt 980
ANCOMBC 140
DESeq2 140
edgeR 140
metagenomeSeq 140
metagenomeSeq2 140
ZINQ 140
KS 70
limma 70
lm 70
wilcoxon 42
For the implantation simulations, we have more repeats and also more effect sizes; therefore, we also have to run more tests.
For the confounded simulations, the setup is again slightly different, since the resampling bias creates an additional parameter to be varied. Therefore, there are dedicated scripts to prepare the batchtools registries for those simulations, but the application of tests can be achieved with the same script (run_tests.sh).
Lastly, we can evaluate the performance of the differential abundance testing methods. For this, the P-values returned by the various tools are compared with the information about which features were actually used to carry a differential effect, and the numbers of true positive discoveries (TP), false positives (FP), true negatives (TN), and false negatives (FN) are computed with P=0.05 as cutoff. From those measures, the precision (PR), the recall (R), and the false discovery rate (FDR) are derived. Additionally, the AUROC for the separation between true and background features is computed using the P-values as predictor.
This can be achieved by:
Rscript ./src/eval_tests.R sim_MMH
Here, you can again evaluate by single testing methods (in this case ANCOM):
Rscript ./src/eval_tests.R sim_MMH ANCOM
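To illustrate what these metrics mean, here is a toy computation in base R with made-up P-values (the actual computation happens inside SIMBA's evaluation functions):
# hypothetical P-values from one test and the ground truth of which features carry an effect
p.values  <- c(0.001, 0.20, 0.03, 0.76, 0.04)
is.marker <- c(TRUE,  TRUE, FALSE, FALSE, TRUE)
alpha     <- 0.05

TP <- sum(p.values <  alpha &  is.marker)
FP <- sum(p.values <  alpha & !is.marker)
TN <- sum(p.values >= alpha & !is.marker)
FN <- sum(p.values >= alpha &  is.marker)

PR  <- TP / (TP + FP)   # precision
R   <- TP / (TP + FN)   # recall
FDR <- FP / (TP + FP)   # false discovery rate

# AUROC for separating true from background features, using the P-values as predictor
# (rank-based formulation: probability that a true feature has a smaller P-value)
p.true <- p.values[is.marker]
p.bg   <- p.values[!is.marker]
auroc  <- mean(outer(p.true, p.bg, '<')) + 0.5 * mean(outer(p.true, p.bg, '=='))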
The evaluation metrics are then stored in the subdirectory test_results and can be read into R again:
library(tidyverse)
test.results.MMH <- read_tsv('./test_results/sim_MMH.tsv')
head(test.results.MMH)
# A tibble: 6 x 17
rep auroc TP FP TN FN PR R FDR group subset job.id problem test norm
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <chr> <chr> <dbl> <chr> <chr> <chr>
1 1 0.5 0 0 761 84 0 0 0 ab1_… subse… 1 sim_MMH limma TSS
2 2 0.501 0 0 761 84 0 0 0 ab1_… subse… 1 sim_MMH limma TSS
3 3 0.5 0 0 761 84 0 0 0 ab1_… subse… 1 sim_MMH limma TSS
4 4 0.499 0 0 761 84 0 0 0 ab1_… subse… 1 sim_MMH limma TSS
5 5 0.5 0 0 761 84 0 0 0 ab1_… subse… 1 sim_MMH limma TSS
6 6 0.5 0 0 761 84 0 0 0 ab1_… subse… 1 sim_MMH limma TSS
# … with 2 more variables: time.running <dbl>, mem.used <dbl>
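From there, the results can be summarized with standard tidyverse verbs; for example, the mean FDR and recall per test, effect-size group, and sample-size subset (column names as in the table above):
test.results.MMH %>%
  group_by(test, group, subset) %>%
  summarise(mean.fdr = mean(FDR), mean.recall = mean(R), .groups = 'drop')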
We additionally provide the raw files from our analyses in a Zenodo repository, so that you can download those files (either the simulation files in .h5 format or the results from the differential abundance benchmarking) and run additional tests on them or compare them to the results we obtained.
If you have any questions or comments, or if you run into any issues, please:
- create an issue in this repository or
- email Morgan Essex or Jakob Wirbel.
BAMBI is distributed under the GPL-3 license.
If you use BAMBI, please cite us via:
Wirbel J, Essex M, Forslund SK, Zeller G. Evaluation of microbiome association models under realistic and confounded conditions. bioRxiv (2022). https://doi.org/10.1101/2022.05.09.491139