Data and analyses for “Local adaptation and host specificity to copepod intermediate hosts by the Schistocephalus solidus tapeworm”
This repository contains the data and scripts for the Bayesian hurdle model ensemble analysis.
The main data file is `data/chapter_2_copepod_for_bayes.csv`. It has the following columns:
- `number`: Sequential row number
- `cop.lake`: Copepod lake of origin (factor: lau, ech, rob, gos, boo)
- `worm.fam`: Worm family, nested within `worm.lake` (factor)
- `worm.lake`: Worm lake of origin (factor: boo, gos, ech)
- `plate`: Experimental block identifier (factor)
- `numb.worm`: Number of worms present in the copepod (integer)
- `native`: Does `cop.lake == worm.lake`? (logical)
- `genus`: Worm genus (factor: M, A)
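As a quick sanity check, the header row can be previewed directly from the shell:

```bash
# Show the column names and the first few records
head -n 5 data/chapter_2_copepod_for_bayes.csv
```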
The hurdle models were originally run on the Texas Advanced Computing Center's (TACC) Stampede2 supercomputer. The workflow could probably be adapted to other HPC systems. To reproduce the analysis:
- Create a directory in `$SCRATCH` for the project, then copy the `R` and `data` folders to it. Make a folder called `logs` with `mkdir` (see the setup sketch after this list).
- Download the docker image.
  a. Make the folder `$WORK/singularity` and go there.
  b. Open an `idev` session.
  c. Run `ml tacc-singularity`.
  d. Run `singularity pull docker://crpeters/docker-stan:21.1.2`.
  e. This will download a docker image that has all the analysis packages in it.
  f. Run `ls *.sif` and copy the name of the sif file.
- Configure the `docker_stan` shell script.
  a. Copy the `docker_stan` file into your local bin folder (this could be `$HOME/bin` or `$WORK/apps/bin`, or wherever you keep executables on your `$PATH`).
  b. Edit the file with a text editor such as `nano`; replace the sif file it references with the one you copied in the previous step, then save.
  c. Make the file executable (e.g., with `chmod +x`).
- Submit the hurdle model batch job with `sbatch slurm/run_hurdle.slurm`. This will create a list of all possible hurdle model candidates and submit them to TACC. You may need to edit the slurm file's parameters (cores, nodes, etc.) to suit your HPC system (see the submission sketch after this list).
- If all of the tasks aren't completed, run `sbatch slurm/continue_hurdle.slurm` to continue the run. Repeat this until all models have been run.
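A consolidated sketch of the first three (setup) steps; the project directory name `copepod_hurdle` is hypothetical, the `idev` session is interactive, and the exact `.sif` filename depends on what `singularity pull` produces:

```bash
# Step 1: set up the project directory on $SCRATCH
mkdir -p $SCRATCH/copepod_hurdle       # directory name is hypothetical
cp -r R data $SCRATCH/copepod_hurdle/
mkdir $SCRATCH/copepod_hurdle/logs

# Step 2: pull the docker image as a singularity .sif file
mkdir -p $WORK/singularity
cd $WORK/singularity
idev                                   # interactive session; run the next lines inside it
ml tacc-singularity
singularity pull docker://crpeters/docker-stan:21.1.2
ls *.sif                               # note the resulting .sif filename

# Step 3: install the docker_stan wrapper on your PATH
mkdir -p $HOME/bin
cp docker_stan $HOME/bin/              # assumes docker_stan ships with the repository
nano $HOME/bin/docker_stan             # point it at the .sif file noted above
chmod +x $HOME/bin/docker_stan
```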
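Submitting and, if needed, resuming the runs then looks like the following; this assumes the `slurm` folder sits alongside `R` and `data` in the project directory:

```bash
# Steps 4-5: submit the batch job, then resubmit until all tasks finish
cd $SCRATCH/copepod_hurdle             # hypothetical directory from the setup sketch
sbatch slurm/run_hurdle.slurm
# ...after the job ends, if tasks remain, continue the run (repeat as needed):
sbatch slurm/continue_hurdle.slurm
```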
For each model, there should be two output files: the posterior distribution and the leave-one-out (LOO) log predictive density.
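One hedged way to check whether every model finished, assuming the outputs land in a single directory with a common extension (both the `output/` directory name and the `.rds` extension here are assumptions, not the repository's documented layout):

```bash
# With two output files per model, a complete run should contain exactly
# twice as many files as there are candidate models.
ls output/*.rds | wc -l
```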
Model stacking is a three-step process, which is fully handled by the `slurm/model_stacking.slurm` script. If it times out, the easiest option is to give the job more time and then re-run it (see the sketch below).
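The wall-clock limit can be raised via the `#SBATCH` time directive in the slurm file before resubmitting; the 48-hour value below is purely illustrative:

```bash
# Inside slurm/model_stacking.slurm, raise the time limit, e.g.:
#SBATCH -t 48:00:00    # illustrative value; adjust to your system's limits

# then resubmit the job
sbatch slurm/model_stacking.slurm
```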
Several other scripts are useful for analyzing the stacking posterior:
- `get_ev.r` provides expected values
- `get_effect_sizes.R` calculates effect sizes
- `make_effect_size_plots.r` visualizes the effect sizes
- `copepod_corrrelation_plot.r` creates figure S1 in the manuscript

All of these are probably best run on the HPC system used for the other analyses; a sketch follows.
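A hedged way to run them, assuming the scripts live in the `R` folder copied during setup and that the `docker_stan` wrapper passes its arguments through to the container (its exact invocation may differ):

```bash
# Run the post-processing scripts inside the analysis container.
# Assumes R/ holds the scripts and docker_stan forwards its arguments;
# adjust paths and the wrapper call to match your setup.
docker_stan Rscript R/get_ev.r
docker_stan Rscript R/get_effect_sizes.R
docker_stan Rscript R/make_effect_size_plots.r
docker_stan Rscript R/copepod_corrrelation_plot.r
```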