Analysis of inference methods on standard population models including selection. Here's how to get going if you'd like to run this analysis.
Start by cloning the analysis2 repo and installing its dependencies:
$ git clone [email protected]:popsim-consortium/analysis2.git
$ cd analysis2/
We recommend you start by creating a new conda environment for the analysis. This can be done using the command below, which will create a new conda env called analysis2. Currently the workflow is targeted to run on Python 3.9.
$ conda env create -f environment.yml
$ conda activate analysis2
With the environment in place, the next step is to set the workflow parameters using a config file. analysis2 currently ships with three example config files, each found in config/snakemake/: tiny_config.yaml, config.yaml, and production-config.yaml. Respectively, these represent a very small run, a small run, and the final production settings used for the paper (TBD).
The workflow can be pointed at one of these config files by editing the following line in the Snakefile:
configfile: "workflows/config/snakemake/tiny_config.yaml"
The file can also be overridden by passing a different YAML file via snakemake's --configfile flag.
Having set the config file, and perhaps edited it to your wishes, you are ready to try a dry run of the complete workflow to make sure that things are working:
$ snakemake -c1 all -n
If the dry run checks out, you should be set to run the complete workflow. This will consist broadly of: 1) simulating the chromosomes of interest, 2) downloading and installing the tools to be used in the analysis of the simulated data, 3) analysis of the simulated tree sequences using the aforementioned tools, and 4) summarization of the analyses into figures.
The Snakemake workflow has a number of targets:
- all -- run the complete analysis
- clean_all -- removes all simulations, downloads, and analysis
- clean_ext -- removes all downloaded external tools
- clean_output -- removes all simulation and analysis
To run the complete workflow on M cores use the following:
$ snakemake -c M all
One can run the clean_* targets of the workflow similarly.
Sometimes the user only wants to run a subsection of the workflow. This is possible using Snakemake's --snakefile option along with the component workflows we have included. For instance, to perform just the simulation steps of the workflow using 10 CPUs, the user can run
$ snakemake -c 10 --snakefile workflows/simulation.snake
and only that part of the analysis pipeline will run.
We currently have 3 sub-workflows: simulations.snake, which does the simulations; n_t.snake, which performs N(t) type analyses (e.g. msmc); and dfe.snake, which houses the portion of the workflow that does estimation of the DFE.
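The invocations described so far differ only in a handful of flags (-c, --snakefile, --configfile, -n) and an optional target. As a minimal sketch, and not part of the analysis2 repo, the hypothetical helper `snakemake_cmd` below assembles such a command line in Python so it can be inspected before anything is run. The flag names and file paths are the ones used in this document; the function name and its parameters are assumptions made for illustration.

```python
import shlex


def snakemake_cmd(cores, target=None, snakefile=None, configfile=None, dry_run=False):
    """Assemble a snakemake command line as a list of arguments.

    Hypothetical helper for illustration only; it builds the command
    but does not run it.
    """
    cmd = ["snakemake", "-c", str(cores)]
    # The target is placed right after the core count so that it is not
    # mistaken for an extra value of a later option such as --configfile.
    if target is not None:
        cmd.append(target)
    if snakefile is not None:
        cmd += ["--snakefile", snakefile]
    if configfile is not None:
        cmd += ["--configfile", configfile]
    if dry_run:
        cmd.append("-n")
    return cmd


# Dry run of the complete workflow on one core with the tiny example config.
dry = snakemake_cmd(
    1,
    target="all",
    configfile="workflows/config/snakemake/tiny_config.yaml",
    dry_run=True,
)
print(shlex.join(dry))

# Only the simulation sub-workflow on 10 cores.
sim = snakemake_cmd(10, snakefile="workflows/simulation.snake")
print(shlex.join(sim))
```

The returned list could be handed to `subprocess.run` from inside the activated analysis2 environment, but printing it first is a cheap way to check the command before launching a long run.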
Currently we have provided two example Snakemake profiles that allow a user to run the analysis2 workflow on a cluster quite easily. These can be found in workflows/config/snakemake/oregon_profile and workflows/config/snakemake/arizona_profile.
Each of those directories contains a single file, config.yaml, that lays out the cluster-specific settings needed to launch jobs. For instance, oregon_profile/config.yaml, which is meant to run on a cluster with a slurm scheduler, looks like this:
cluster:
  mkdir -p logs/{rule} &&
  sbatch
    --partition=kern,kerngpu
    --account=kernlab
    --cpus-per-task={threads}
    --mem={resources.mem_mb}
    --time={resources.time}
    --job-name=smk-{rule}-{wildcards}
    --output=logs/{rule}/{rule}-{wildcards}-%j.out
default-resources:
  - time=60
  - mem_mb=5000
  - threads=1
restart-times: 3
max-jobs-per-second: 10
max-status-checks-per-second: 1
local-cores: 1
latency-wait: 60
jobs: 500
keep-going: True
rerun-incomplete: True
printshellcmds: True
scheduler: greedy
use-conda: True
To adapt this to a new slurm cluster a user would have to:
- change the partition value to appropriately named partitions
- change the account name
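The two changes listed above amount to a pair of text substitutions in the profile's config.yaml. The sketch below shows one way to do that with Python's `re` module on an excerpt of the oregon profile. The function `adapt_profile` and the values `general` and `mylab` are hypothetical placeholders, not names from any real cluster; editing the file by hand works just as well.

```python
import re

# Excerpt of the oregon_profile config.yaml shown above.
PROFILE_SNIPPET = """\
cluster:
  mkdir -p logs/{rule} &&
  sbatch
    --partition=kern,kerngpu
    --account=kernlab
    --cpus-per-task={threads}
"""


def adapt_profile(text, partition, account):
    """Return the profile text with the sbatch partition and account replaced.

    Hypothetical helper for illustration; a callable replacement is used so
    the new values are inserted literally, without escape processing.
    """
    text = re.sub(r"--partition=\S+", lambda m: f"--partition={partition}", text)
    text = re.sub(r"--account=\S+", lambda m: f"--account={account}", text)
    return text


# "general" and "mylab" are placeholder names for a new cluster.
adapted = adapt_profile(PROFILE_SNIPPET, partition="general", account="mylab")
print(adapted)
```

Saving the adapted text as config.yaml in a new profile directory, rather than overwriting an existing profile, keeps the shipped examples intact.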
At the command line this eases things tremendously. We can launch the entire workflow simply with
$ snakemake --profile workflows/config/snakemake/oregon_profile/
Snakemake will then take care of all of the communication with the cluster, launching jobs, and monitoring them for completion.
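To make the placeholders in the profile's cluster entry more concrete, the sketch below fills in the oregon profile's submission template for a single job. It uses Python's `str.format` only to mimic the substitution; Snakemake performs its own formatting internally, and the exact rendering of `{wildcards}` may differ. The rule name `simulate` and the wildcard string `seed=1` are hypothetical, while the default resource values come from the default-resources entry of the profile.

```python
from types import SimpleNamespace

# Submission template from the oregon_profile, flattened to one line.
TEMPLATE = (
    "mkdir -p logs/{rule} && sbatch "
    "--partition=kern,kerngpu --account=kernlab "
    "--cpus-per-task={threads} --mem={resources.mem_mb} "
    "--time={resources.time} --job-name=smk-{rule}-{wildcards} "
    "--output=logs/{rule}/{rule}-{wildcards}-%j.out"
)


def render_submission(rule, wildcards, threads=1, mem_mb=5000, time=60):
    """Fill the template for one job, using the profile's default resources.

    Illustration only: this mimics, but is not, Snakemake's own substitution.
    """
    resources = SimpleNamespace(mem_mb=mem_mb, time=time)
    return TEMPLATE.format(
        rule=rule, wildcards=wildcards, threads=threads, resources=resources
    )


# Hypothetical rule name and wildcard values.
cmd = render_submission(rule="simulate", wildcards="seed=1")
print(cmd)
```

The `%j` in the output path is left untouched here because it is expanded by slurm, which replaces it with the job ID when the log file is created.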
There are a lot of great examples of how to set up profiles for running workflows on various cluster architectures. One excellent resource is this repository of publicly available profiles: https://github.com/snakemake-profiles/doc