Data and analyses for “Local adaptation and host specificity to copepod intermediate hosts by the Schistocephalus solidus tapeworm”
This repository contains the data and scripts for the Bayesian hurdle model ensemble analysis.
The main data file is `data/chapter_2_copepod_for_bayes.csv`. It has the following columns:
- `number`: Sequential row number
- `cop.lake`: Copepod lake of origin (factor: lau, ech, rob, gos, boo)
- `worm.fam`: Worm family, nested within `worm.lake` (factor)
- `worm.lake`: Worm lake of origin (factor: boo, gos, ech)
- `plate`: Experimental block identifier (factor)
- `numb.worm`: Number of worms present in the copepod (integer)
- `native`: Does `cop.lake == worm.lake`? (logical)
- `genus`: Worm genus (factor: M, A)
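As a quick sanity check, the header row can be previewed directly from the shell:

```bash
# Show the column names and the first few records
head -n 5 data/chapter_2_copepod_for_bayes.csv
```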
The hurdle models were originally run on the Texas Advanced Computing Center's (TACC) Stampede2 supercomputer. The workflow could probably be adapted to other HPC systems. To reproduce the analysis:
- Create a directory in `$SCRATCH` for the project, then copy the `R` and `data` folders to it. Make a folder called `logs` with `mkdir` (see the setup sketch after this list).
- Download the docker image.
  a. Make the folder `$WORK/singularity` and go there.
  b. Open an `idev` session.
  c. Run `ml tacc-singularity`.
  d. Run `singularity pull docker://crpeters/docker-stan:21.1.2`.
  e. This will download a docker image that has all the analysis packages in it.
  f. Run `ls *.sif` and copy the name of the sif file.
- Configure the `docker_stan` shell script.
  a. Copy the `docker_stan` file into your local bin folder (this could be `$HOME/bin` or `$WORK/apps/bin`, or wherever you keep executables on your `$PATH`).
  b. Edit the file with a text editor such as `nano`; replace the sif file it references with the one you copied in the previous step, then save.
  c. Make the file executable (e.g., with `chmod +x`).
- Submit the hurdle model batch job with `sbatch slurm/run_hurdle.slurm`. This will create a list of all possible hurdle model candidates and submit them to TACC. You may need to edit the slurm file's parameters (cores, nodes, etc.) to suit your HPC system (see the submission sketch after this list).
- If all of the tasks aren't completed, run `sbatch slurm/continue_hurdle.slurm` to continue the run. Repeat this until all models have been run.
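A consolidated sketch of the first three (setup) steps; the project directory name `copepod_hurdle` is hypothetical, the `idev` session is interactive, and the exact `.sif` filename depends on what `singularity pull` produces:

```bash
# Step 1: set up the project directory on $SCRATCH
mkdir -p $SCRATCH/copepod_hurdle       # directory name is hypothetical
cp -r R data $SCRATCH/copepod_hurdle/
mkdir $SCRATCH/copepod_hurdle/logs

# Step 2: pull the docker image as a singularity .sif file
mkdir -p $WORK/singularity
cd $WORK/singularity
idev                                   # interactive session; run the next lines inside it
ml tacc-singularity
singularity pull docker://crpeters/docker-stan:21.1.2
ls *.sif                               # note the resulting .sif filename

# Step 3: install the docker_stan wrapper on your PATH
mkdir -p $HOME/bin
cp docker_stan $HOME/bin/              # assumes docker_stan ships with the repository
nano $HOME/bin/docker_stan             # point it at the .sif file noted above
chmod +x $HOME/bin/docker_stan
```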
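Submitting and, if needed, resuming the runs then looks like the following; this assumes the `slurm` folder sits alongside `R` and `data` in the project directory:

```bash
# Steps 4-5: submit the batch job, then resubmit until all tasks finish
cd $SCRATCH/copepod_hurdle             # hypothetical directory from the setup sketch
sbatch slurm/run_hurdle.slurm
# ...after the job ends, if tasks remain, continue the run (repeat as needed):
sbatch slurm/continue_hurdle.slurm
```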
For each model, there should be two output files: the posterior distribution and the leave-one-out (LOO) log predictive density.
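One hedged way to check whether every model finished, assuming the outputs land in a single directory with a common extension (both the `output/` directory name and the `.rds` extension here are assumptions, not the repository's documented layout):

```bash
# With two output files per model, a complete run should contain exactly
# twice as many files as there are candidate models.
ls output/*.rds | wc -l
```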
Model stacking is a three-step process, which is fully handled by the `slurm/model_stacking.slurm` script. If it times out, the easiest option is to give the job more time and then re-run it (see the sketch below).
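The wall-clock limit can be raised via the `#SBATCH` time directive in the slurm file before resubmitting; the 48-hour value below is purely illustrative:

```bash
# Inside slurm/model_stacking.slurm, raise the time limit, e.g.:
#SBATCH -t 48:00:00    # illustrative value; adjust to your system's limits

# then resubmit the job
sbatch slurm/model_stacking.slurm
```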
Several other scripts are useful for analyzing the stacking posterior:
- `get_ev.r` provides expected values
- `get_effect_sizes.R` calculates effect sizes
- `make_effect_size_plots.r` visualizes the effect sizes
- `copepod_corrrelation_plot.r` creates figure S1 in the manuscript

All of these are probably best run on the HPC system used for the other analyses; a sketch follows.
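A hedged way to run them, assuming the scripts live in the `R` folder copied during setup and that the `docker_stan` wrapper passes its arguments through to the container (its exact invocation may differ):

```bash
# Run the post-processing scripts inside the analysis container.
# Assumes R/ holds the scripts and docker_stan forwards its arguments;
# adjust paths and the wrapper call to match your setup.
docker_stan Rscript R/get_ev.r
docker_stan Rscript R/get_effect_sizes.R
docker_stan Rscript R/make_effect_size_plots.r
docker_stan Rscript R/copepod_corrrelation_plot.r
```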