This repository contains the code to simulate a database of phylogenetic trees that will be used in a machine learning project in which a neural network will be trained to solve phylodynamics problems.
To generate the database, run the main script:
python main.py <path/to/config.json>
If you want a small example to test this out, try using the
config/debugging.json
file. The whole simulation is configured by
the JSON file provided at the command line.
Making a large database takes a while. There is a tool in
src/monitor.py
which you can run and it will report on the progress
of a current database construction and give a (rough) estimate of the
remaining time for each stage of the simulation.
python src/monitor.py <path/to/config.json>
The way in which a dataset is simulated is configured with a JSON file. There is a schema for valid configurations described here. There are some example configurations provided:
Additional information about these datasets is given here.
Two scripts, visualisation.py
and visualisation_temporal.py
can be
used to visualise the output of a simulation.
python visualisation.py <path/to/config.json>
python visualisation_temporal.py <path/to/config.json>
Note that the latter only applies for simulations which are configured
to report temporal data (that is, report_temporal_data
is set to
true
in the config).
The database is an HDF5 file. Each simulation is represented with a
group with a name of the form record_xxxxxx
, e.g. record_000123
.
The data from each simulation is split into two groups: input
and
output
.
The input
group has the following datasets:
present
- the time since the origin of the last sequenced sample
tree_height
- the time between the $T\text{MRCA}$ and the present
tree
- a binary blob which is the pickled reconstructed tree of the sequenced samples in the simulation.
The output
group contains a lot of measurements, but the most
important is the temporal_measurements
dataset. The
temporal_measurements
dataset has the following columns:
measurement_times
(float)- the (forward) time since the origin of the measurements
prevalence
(int)- the number of infected individuals
cumulative
(int)- the cumulative number of infections
reproductive_number
(float)- the reproduction number
The following demonstrates how to use the database in Python. Don’t forget to close the database connection after using it! The following script reads in the tree and measurements from a simulation and produces this CSV file and the figure below.
from Bio import Phylo
import h5py
import pickle
import matplotlib.pyplot as plt
import numpy as np
hdf5_file = "../out/sim-charmander/dataset-charmander.hdf5"
db_conn = h5py.File(hdf5_file)
demo_tree = pickle.loads(db_conn['record_000001/input/tree'][...].tobytes())
fig, ax = plt.subplots()
Phylo.draw(demo_tree, do_show=False, axes=ax)
fig.savefig('../out/sim-charmander/plots/demo-tree.png')
measurements = db_conn['record_000001/output/parameters/temporal_measurements'][...]
column_names = measurements.dtype.names
np.savetxt('../out/sim-charmander/demo-measurements.csv',
measurements, delimiter=',',
header=','.join(column_names))
db_conn.close()
If you want a GUI to inspect the output HDF5 file, the HDFCompass tool provides a simple way to inspect the data that has been generated. There is some basic information about the simulation stored as attributes in the HDF5 file. This includes the date of creation and the size of the dataset.
A conda environment to run this simulation can be created from the
environment.yaml
file by running the following command:
conda env create -f environment.yaml
This environment will have all the correct packages for running the simulations.
BEAST2 is used to simulate the data. If you don’t have BEAST2
installed, there is a script scr/setuplib.sh
which will download and
install this for you. Once you have BEAST2 installed, you will need to
install remaster through BEAUti.