\documentclass{article}
\usepackage[round]{natbib}
\usepackage{listings}
\usepackage[english]{babel}%
\usepackage[T1]{fontenc}%
\usepackage[utf8]{inputenc}%
\usepackage{amsmath,amssymb,amsfonts}%
\usepackage{geometry}%
\usepackage{color}
\usepackage{graphicx}
\usepackage{dsfont}
\usepackage{verbatim}%
\usepackage{environ}%
\usepackage[right]{lineno}%
\usepackage{nameref}
\usepackage{tikz}
%\usepackage{showkeys}
% local definitions
\newcommand{\msprime}[0]{\texttt{msprime}}
\newcommand{\tskit}[0]{\texttt{tskit}}
\newcommand{\ms}[0]{\texttt{ms}}
\newcommand{\scrm}[0]{\texttt{scrm}}
\newcommand{\SLiM}[0]{\texttt{SLiM}}
\newcommand{\fwdpy}[0]{\texttt{fwdpy11}}
\newcommand{\stdpopsim}[0]{\texttt{stdpopsim}}
\newcommand{\SeqGen}[0]{\texttt{Seq-Gen}}
\newcommand{\Pyvolve}[0]{\texttt{Pyvolve}}
\newcommand{\phastsim}[0]{\texttt{phastsim}}
\newcommand{\discoal}[0]{\texttt{discoal}}
\newcommand{\jkcomment}[1]{\textcolor{red}{#1}}
\newcommand{\plrcomment}[1]{\textcolor{cyan}{#1}}
\begin{document}
\title{Msprime 1.0: an efficient and extensible coalescent simulation framework}
\author{Author list to be filled in
}
% \address{
%%Affiliations:
% Franz Baumdicker:
% Cluster of Excellence "Controlling Microbes to Fight Infections", Mathematical and Computational Population Genetics, University of Tübingen, Germany
% \section*{Contact:} \href{[email protected]}{[email protected]}
\maketitle
\begin{abstract}
Coalescent simulation is a key tool in population genetics and
has been integral to theoretical developments since its beginnings.
Because of the ease with which the basic coalescent model can be simulated,
a large number of different simulators have been developed. However,
the coalescent with recombination is far more challenging to simulate,
and, until recently, it was not possible to simulate efficiently.
The \msprime\ software has revolutionised
coalescent simulation, making it possible to simulate millions
of whole genomes for the first time. We summarise the many features
of \msprime\ 1.0, which is built around an efficient implementation of
recombination via the succinct tree sequence data
structure. Moreover, \msprime\ is a community-oriented open-source development
project, intended to reduce duplication of effort and increase
the quality of key simulation software.
\end{abstract}
\textbf{Keywords:} Coalescent simulation, Python
\section*{Introduction}
The coalescent
process~\citep{kingman1982coalescent,hudson1983testing,tajima1983evolutionary}
models the ancestry of a set of sampled genomes. Specifically,
the coalescent provides a mathematical description of the
genealogical tree that relates sampled genomes to one another.
The coalescent has proved to be a powerful model,
and it is now central to
population genetics~\citep{hudson1990gene,hein2004gene,wakely2008coalescent}.
Among the most impactful contributions of the coalescent since its
origins~\citep{hudson1983testing} is that it provides a path towards more efficient
simulation of population genetic samples than is possible using traditional
forwards-in-time simulation.
As a result of this great utility, dozens of different simulation tools have been
developed over the decades~\citep{carvajal2008simulation,liu2008survey,
arenas2012simulation,yuan2012overview,hoban2012computer,yang2014critical,
peng2015genetic}.
Two technological developments of recent years, however,
pose major challenges to the majority of these methods.
% Some citations here? Darwin tree of life maybe?
Firstly, whole-genome sequencing---and in particular fourth-generation
sequencing technologies---has made chromosome-level assemblies
commonplace. Thus, modelling genetic variation data as a series of unlinked loci,
a previously reasonable assumption, is no longer defensible, and
we must fully account for recombination.
While a genealogical tree relating $n$ samples in the single-locus coalescent
can be simulated in $O(n)$ time~\citep{hudson1990gene},
the coalescent with recombination, which generates multiple correlated trees, is
far more challenging.
% JK will need to revise this sentence, in light of what we'll be saying
% down further about needing the SMC for very large rho.
Programs such as Hudson's classical \ms~\citep{hudson2002generating}
can only simulate short segments under the influence of recombination,
and, until recently, approximate models of
recombination~\citep{mcvean2005approximating,staab2015scrm}
were necessary to simulate whole chromosomes.
The second major challenge to existing coalescent simulators is the
exponential increase in the sample sizes of genetic studies. Human datasets such as
the UK Biobank~\citep{bycroft2018genome} and
gnomAD~\citep{karczewski2020mutational} now consist of hundreds of
thousands of genomes and many other datasets on a similar scale
are becoming available~\citep{tanjo2021practical}.
Classical simulators such as \ms\ and even fast approximate methods
such as \scrm\ simply cannot cope with such a large number of samples.
The \msprime\ simulator~\citep{kelleher2016efficient,kelleher2020coalescent}
has greatly increased the scope of coalescent simulations,
and it is now straightforward to simulate millions of whole chromosomes.
The ``succinct tree sequence'' data
structure~\citep{kelleher2016efficient,kelleher2018efficient,kelleher2019inferring,
wohns2021unified},
introduced as part of \msprime, makes it possible to store
such huge simulations in a few gigabytes, several orders
of magnitude smaller than commonly used formats.
The succinct tree sequence has also led to major advances in forwards-time
simulation~\citep{kelleher2018efficient,haller2018tree},
ancestry inference~\citep{kelleher2019inferring,wohns2021unified}
and calculation of population genetic
statistics~\citep{kelleher2016efficient,ralph2020efficiently}.
Through a rigorous open-source community development process,
\msprime\ has gained a large number of features since its introduction,
making it a highly efficient and flexible platform for population
genetic simulation.
This paper marks the release of \msprime\ 1.0, a coalescent simulator that is
state of the art in both efficiency and breadth of features.
Here we provide an overview of the extensive features of \msprime\ 1.0,
demonstrate its advantages over current alternative software packages through
numerous performance comparisons, and lay out plans for
further open-source development of what we hope will become an industry-standard
piece of computational infrastructure for the genetics community.
\section*{Results}
\jkcomment{Give a quick roadmap,
like, we discuss the underlying tree sequence object,
then mutations (because it's quite separate from the rest)
and then a bunch of new features for ancestry simulation. There's no
point in listing it out section by section.}
% JK: commented out this text as it's misleading and would confuse
% people. We should adapt it when writing the roadmap when the structure
% has settled down.
% In this section we describe the main features of \msprime\ 1.0. Methodologically,
% there exists a logical separation between simulating coalescent genealogies and
% adding neutral mutations onto these genealogies. Simulating genetic ancestry
% requires methods that model the underlying reproductive process (e.g. Moran,
% Wright-Fisher, etc.) as well as accounting for recombination of various sorts
% (including gene conversion), demography, and selection. These methods are
% discussed in the first N subsections below. One of the strengths of the
% approach taken in msprime is that neutral mutations can be retrospectively
% added to simulated genealogies. Methods to generate mutations, and thereby
% generate genetic variation, are therefore discussed in a separate and subsequent
% subsection. Finally, we describe the \msprime\ API, which is designed to enforce
% the distinction between simulating genealogies and mutations; discuss the
% interoperability provided by the use of \tskit\, the tree sequence toolkit;
% and highlight the open source development process which allows for \msprime\
% to be extended by addition of new features while ensuring robustness and
% maintainability of the core codebase.
% The data model
\subsection*{Tree sequences}
\label{sec-ts}
\begin{figure}
\begin{center}
\includegraphics[width=6cm]{illustrations/example_tree_sequence}
\end{center}
\caption{\label{fig-ts-example} Example tree sequence.
An example tree sequence describing genealogies and genotypes
for four samples at ten sites.
}
\end{figure}
One of the key reasons for \msprime's substantial performance advantage
over other simulators~\citep{kelleher2016efficient}
is its use of the ``succinct tree sequence''
data structure to represent simulation results.
The succinct tree sequence (usually abbreviated to ``tree sequence'')
is a data structure for concisely encoding
genetic ancestry and sequence variation.
The basic concepts were introduced by \cite{kelleher2016efficient}
and originally developed as part of \msprime. We subsequently
extracted the tree sequence code that was independent of simulations
from \msprime\ into
a standalone Python and C library, called \tskit~\citep{tskit2021tskit}.
While a full discussion of tree sequences and the capabilities
of \tskit\ is beyond the scope of this article, we summarise
some aspects that are important for simulation here.
% Encoding ancestry
Given a set of sample chromosomes, the topology
of their genetic ancestry is described by a set of \emph{nodes}
and \emph{edges} (Fig.~\ref{fig-ts-example}).
A node in tree sequence terminology corresponds
to a single chromosome strand, associated with its birth time
and other optional information. A node may be
associated with an \emph{individual}, so that, for example, a diploid
individual is associated with two separate nodes (we omit
the individual table in Fig.~\ref{fig-ts-example} as it is optional).
Nodes define haplotypes that exist at a specific time within
some individual, and edges define genetic inheritance
between these nodes. An \emph{edge} consists of a
parent node, a child node and the left and right coordinates
of the contiguous genomic interval over which
the child inherited genetic material from the parent.
Critically, edges can span multiple marginal trees: for
example, in Fig.~\ref{fig-ts-example} the branch from node
0 to 4 is present in both marginal trees, and represented
by a single edge (the first row in the edge table).
This simple device, of explicitly associating tree nodes
with specific ancestral chromosomes and recording the
genomic intervals over which parent/child relationships exist,
generalises the original
``coalescence records'' concept~\citep{kelleher2016efficient},
and is the key to the efficiency of tree
sequences~\citep{
kelleher2018efficient,kelleher2019inferring,ralph2020efficiently}.
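As a minimal illustration of this encoding, the following sketch uses the
\tskit\ Python API to inspect the node and edge tables of a small simulated
tree sequence (all parameter values here are arbitrary):
\begin{lstlisting}[language=Python]
# Minimal sketch: inspect the tables underlying a simulated tree sequence.
import msprime

ts = msprime.sim_ancestry(
    samples=2,              # two diploid individuals = four sample nodes
    sequence_length=1e4,
    recombination_rate=1e-8,
    population_size=1e4,
    random_seed=42,
)
# Each node records a birth time; each edge records a parent, a child and
# the left/right genomic interval over which the child inherits.
print(ts.tables.nodes)
print(ts.tables.edges)
# A single edge can span several marginal trees along the genome.
print(ts.num_trees, "trees,", ts.num_edges, "edges")
\end{lstlisting}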
The tree sequence encoding of ancestry as a set of nodes and edges,
as just described, is equivalent to a graph in which edges
are annotated with a set of disjoint genomic intervals.
This is closely related to the Ancestral Recombination Graph (ARG),
a term that is often loosely understood
to mean some representation of genetic
ancestry as a graph~\citep[e.g.][]{mathieson2020ancestry}.
In this sense a succinct tree sequence \emph{is} an ARG.
More technically, an ARG is a particular
representation of genetic ancestry as a graph based on the
flow of ancestral material through recombination and common
ancestor events~\citep{griffiths1991two,griffiths1997ancestral}.
In this more technical sense, tree sequences can be used to \emph{encode}
ARGs without loss of information (although
the default minimal tree sequence encoding is
much more concise; see the \nameref{sec-arg} section).
It should be noted that, despite standardisation
efforts~\citep{cardona2008extended,mcgill2013graphml}, there is
no shared format for interchanging ARG data, and no supporting software
libraries.
% OK, back to tree sequences: what about sequence variation?
The final output of most population genetic simulations is some representation
of sequence variation among the specified samples. For coalescent
simulations, we usually have three steps: (1) simulate
the genetic ancestry, and optionally output the resulting marginal trees;
(2) simulate sequence evolution conditioned on this ancestry by generating
mutations (see the \nameref{sec-mutations} section); and (3) output the
resulting sequences by percolating the effects of the mutations through
the trees. Information about the mutations
themselves---e.g., where they have occurred on the
trees---is usually not retained or made available for subsequent analysis.
In \msprime, however, we skip step (3), instead using \tskit's
combined data model of ancestry and mutations
to represent the simulated sequences.
As illustrated in Fig.~\ref{fig-ts-example}, mutations are a
fully integrated part of \tskit's tree sequence data model, and genetic
variation is encoded by recording sites at which mutations
have occurred, and where each mutation at those sites
has occurred on the marginal tree.
Crucially, the genetic sequences themselves are never stored, or indeed directly
represented in memory (unless explicitly decoded using \tskit\ APIs).
It may at first seem inconvenient to have only this indirect representation
of the genome sequences, but it is extremely powerful.
Firstly, the storage space required for simulations is dramatically
reduced. For a simulation of $n$ samples with $m$ variant sites,
we would require $O(nm)$ space to store the sequence data as a variant
matrix. However, if this simulation was of a recombining genome
with $t$ trees, then the \tskit\ tree sequence encoding
requires $O(n + t + m)$ space,
assuming we have $O(1)$ mutations at each site~\citep{kelleher2016efficient}.
For large sample sizes, this difference is profound, making it
conceivable, for example, to store the genetic ancestry
and variation data for the entire human population on a
laptop~\citep{kelleher2019inferring}. As well as the huge
difference in storage efficiency, it is often far more efficient
to compute statistics of the sequence data from the
trees and mutations than it is to work with the sequences themselves.
For example, computing Tajima's $D$ from simulated data stored in the \tskit\ format
is several orders of magnitude faster than efficient variant matrix
libraries for large sample sizes~\citep{ralph2020efficiently}.
The vast genomic datasets produced during the SARS-CoV-2 pandemic
have highlighted the advantages of storing genetic variation data using the underlying trees.
\cite{turakhia2021ultrafast} propose the
Mutation Annotated Tree (MAT)
format (consisting of a Newick tree
and associated mutations in a binary format) and
the \texttt{matUtils} program as an efficient way to store
and process large viral datasets~\citep{mcbroome2021daily},
achieving excellent compression and processing performance.
Similarly, \phastsim~\citep{demaio2021phastsim}
was developed to simulate sequence evolution on such large SARS-CoV-2
phylogenies, and also outputs a Newick tree annotated with mutations (not
in MAT format) to avoid the bottleneck of generating
and storing the simulated sequences.
While these methods illustrate the advantages of the general
approach of storing ancestry and mutations rather than sequences,
they do not generalise beyond their immediate settings, and
no software library support is available.
% \msprime, on the other hand, provides a fast, powerful, and well-tested
% generic interface for generating and storing mutations on trees.
The software ecosystem built around \tskit\ is stable and mature;
simulators such as
\fwdpy~\citep{thornton2014cpp},
\SLiM~\citep{haller2019slim},
\stdpopsim~\citep{adrion2020community},
\texttt{Geonomics}~\citep{de2021geonomics}
and
\texttt{GSpace}~\citep{virgoulay2021gspace},
and inference methods such as
\texttt{tsinfer}~\citep{kelleher2019inferring},
\texttt{tsdate}~\citep{wohns2021unified}
and \texttt{Relate}~\citep{speidel2019method}
use either the Python or C
APIs to support outputting results in tree sequence format.
Tree sequences are stored in an efficient binary file format, and
are fully portable across operating systems and processor
architectures. The \tskit\ library ensures interoperability
between programs by having strict definitions of how the
information in each of the tables is interpreted, and
stringent checks for the internal consistency of the data model.
% The input - how do we run simulations?
\subsection*{User interface}
\label{sec-sim-interface}
The majority of simulation packages are controlled either through
a command line interface~\citep[e.g.][]{hudson2002generating,kern2016discoal},
a text-based input file
format~\citep[e.g.][]{guillaume2006nemo,excoffier2011fastsimcoal,shlyakhter2014cosi2},
or a mixture of both.
Command line interfaces make it very easy to run simple
simulations, but as model complexity and the number of parameters specified increase,
they become difficult to understand and
error-prone~\citep{ragsdale2020lessons,gower2021demes}.
Specifying parameters through a text file alleviates this problem to a degree,
but lacks flexibility, for example, when running simulations with parameters
drawn from some distribution. In practice, for any reproducible simulation
project users will write a script
to generate the required command lines or input parameter files,
invoke the simulation engine, and process the results in some way.
This process is cumbersome and labour intensive, and
a number of packages have been developed
to allow simulations to be run directly in a high-level
scripting language~\citep{staab2016coala,parobek2017skelesim,gladstein2018simprily}.
The more recent trend has been to move away from this file and command-line
driven approach and to instead provide direct interfaces to the simulation
engines via an Application Programming Interface (API)~\citep[e.g.][]{
thornton2014cpp,kelleher2016efficient,becheler2019quetzal,haller2019slim}.
The primary interface for \msprime\ is through a thoroughly documented and
stable Python
API, which has encouraged the development of an ecosystem of downstream
% Found these through the Dependent packages on GitHub
tools~\citep{terhorst2017robust,chan2018likelihood,spence2019inference,
adrion2020community,adrion2020predicting, kamm2020efficiently,
mckenzie2020ipcoal, montinaro2020revisiting,
de2021geonomics,rivera2021simulation}.
As well as providing a stable and efficient platform for building
downstream applications, \msprime's Python API makes it much easier to
build reproducible simulation pipelines, as the entire workflow can
be encapsulated in a single script, and package and version
dependencies explicitly stated using the \texttt{pip}
or \texttt{conda} package managers.
For example, the errors made
in the influential simulation analysis of
\cite{martin2017human} were only detected because the pipeline
could be easily run and reanalysed~\citep{ragsdale2020lessons,martin2020erratum}.
A major change for the \msprime\ 1.0 release is the introduction of a new set of APIs,
designed in part to avoid sources of error (see the Demography section) but
also to provide more appropriate defaults while keeping compatibility with
existing code. In the new APIs, ancestry and mutation simulation are fully
separated, with the \texttt{sim\_ancestry} and \texttt{sim\_mutations}
functions replacing the legacy \texttt{simulate} function. Among other changes,
the new APIs default to discrete genome coordinates and finite sites mutations,
making the default settings more realistic and resolving a major source of confusion and error.
The previous APIs are fully
supported and tested, and will be maintained for the foreseeable future.
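A minimal sketch of this workflow (with illustrative parameter values) is:
\begin{lstlisting}[language=Python]
# Sketch of the 1.0 Python API: ancestry and mutations are simulated
# in two separate steps. Parameter values are illustrative only.
import msprime

ts = msprime.sim_ancestry(
    samples=10, sequence_length=1e6, recombination_rate=1e-8,
    population_size=1e4, random_seed=1234)
mts = msprime.sim_mutations(ts, rate=1e-8, random_seed=5678)
print(mts.num_sites, "variant sites for", mts.num_samples, "sample genomes")
\end{lstlisting}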
We also provide an \ms-compatible
command line interface to support existing workflows. The
\texttt{msp} program has been extended to include new commands
for simulating ancestry and mutations separately. A particularly useful
feature is the ability to specify demographic models in
Demes format~\citep{gower2021demes} from the command line,
making it straightforward to simulate complex demographic models.
% The output - what do we get out?
\subsection*{Data analysis}
\label{sec-data-analysis}
The point of running simulations is to generate data, and there
are two important ways in which the \tskit\ tree sequences output by
\msprime\ are more useful than classical methods:
they are more efficient and they contain more information.
We will consider these points in turn.
The standard way of representing simulation data is
to render the results in a text format. For example,
\ms~\citep{hudson2002generating} outputs a set of sequences
and can also optionally output the marginal trees along
the genome in Newick format,
and variants of this approach are used by many simulators.
Text files have many advantages, but are slow to process
at scale.
The ability to efficiently process simulation results is
particularly important in simulation-based inference methods
such as Approximate Bayesian Computation
(ABC)~\citep{beaumont2002approximate,csillery2010approximate,wegmann2010abctoolbox}
and machine learning based
approaches~\citep{sheehan2016deep,chan2018likelihood,schrider2018supervised,
flagel2019unreasonable,sanchez2020deep}. Clearly, simulation efficiency is
crucial, since the size and number of simulations that can be performed determine
the depth to which one can sample from the model and parameter space.
Equally important,
however, is the efficiency with which the simulation results can be
transformed into the specific input required by the inference method.
In the case of ABC, this is usually a set of summary statistics of the sequence
data, and methods avoid the bottleneck of parsing
text-based file formats to compute these statistics
by either developing their own
simulators~\citep[e.g.][]{cornuet2008inferring,lopes2009popabc}
or creating forked versions (i.e., modified copies) of existing
simulators~\cite[e.g.][]{thornton2006approximate,
hickerson2007msbayes,pavlidis2010msabc,huang2011mtml,quinto2018modeling},
tightly integrated with the inference method.
Modern approaches to ABC such as
ABC-RF~\citep{raynal2019abc,pudlo2016abc} and
ABC-NN~\citep{csillery2012abc,blum2010abc} use large
numbers of weakly informative statistics,
making the need to efficiently compute statistics from simulation
results all the more acute.
Classical text-based output formats like those of \ms\ are not only inefficient to
process, but also lack a great deal of important information.
The tree-by-tree topology information output by simulators in Newick
format lacks any concept of node identity,
which means that we cannot reliably infer information about ancestors
from the output. Because Newick stores branch lengths rather
than node times, numerical precision issues also arise when summing
over many short branches~\citep{mcgill2013graphml}.
Numerous forks of simulators have been created to access information not provided
in the output. For example, \ms\ has been forked to
output information about migrating segments~\citep{rosenzweig2016powerful},
information about ancestral lineages~\citep{chen2013asymptotic},
an array of population genetic statistics~\citep{ramos2007mlcoalsim}
% Any more?
and \ms's fork \texttt{msHOT}~\citep{hellenthal2007mshot}
has in turn been forked to output information on local
ancestry~\citep{racimo2017archaic}.
All of this information
is either directly available by default in \msprime, or can be optionally
stored via options such as \texttt{record\_migrations} or
\texttt{record\_full\_arg} (see the \nameref{sec-arg} section) and
can be efficiently and conveniently processed via \tskit\ APIs.
These problems of inefficient and limited access to the simulated
data are comprehensively solved by \msprime\ and \tskit. Complete
information about ancestry and mutation in simulations is immediately
available to the user in a single integrated data structure
(rather than multiple independent files), with no intermediate
processing required.
Users can subsequently compute precisely the information that they
require using the powerful suite of population genetic statistics
and other methods available in \tskit.
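For example, summary statistics can be computed directly from a simulated
tree sequence via the \tskit\ Python API, without ever materialising a
variant matrix or parsing text output; the following is a minimal sketch
with illustrative parameter values:
\begin{lstlisting}[language=Python]
# Sketch: computing summary statistics directly from the tree sequence.
import msprime

ts = msprime.sim_ancestry(
    samples=100, sequence_length=1e7, recombination_rate=1e-8,
    population_size=1e4, random_seed=7)
ts = msprime.sim_mutations(ts, rate=1e-8, random_seed=8)
print("nucleotide diversity:", ts.diversity())
print("Tajima's D:", ts.Tajimas_D())
print("segregating sites:", ts.segregating_sites(span_normalise=False))
\end{lstlisting}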
% Stuff we simulate: mutations
\subsection*{Simulating mutations}
\label{sec-mutations}
% Explain the background: what have people done classically?
Because coalescent simulations are usually concerned with
neutral evolution (however, see the \nameref{sec-selection} section),
the problem of generating synthetic genetic variation can be decomposed into
two independent steps:
firstly, simulating genetic ancestry (the trees), then subsequently simulating
variation by superimposing mutation processes on those trees
(see Fig.~\ref{fig-mutated-trees}).
A number of programs exist to place mutations on trees: for instance,
the classical \SeqGen\ program~\citep{rambaut1997seq}
supports a range of different models of sequence evolution,
and various extensions to the basic
models have been proposed~\citep[e.g.][]{cartwright2005dna,fletcher2009indelible}.
Partly for efficiency and partly in the interest of
simplicity for users (i.e., to avoid intermediate text format conversions),
population genetic simulators have tended to
include their own implementations of mutation simulation, with
most supporting the infinite sites
model~\citep[e.g.][]{hudson2002generating}
but with several supporting a wide range of different models of sequence
evolution~\citep[e.g.][]{mailund2005coasim,excoffier2011fastsimcoal,
virgoulay2021gspace}. Thus, despite the logical separation between
the tasks of simulating ancestry and neutral sequence evolution,
the two have been conflated in practice.
\begin{figure}
\includegraphics{illustrations/unmutated_tree}
\includegraphics{illustrations/mutated_tree}
\caption{
\label{fig-mutated-trees}
Visualization of the separation between ancestry and mutation simulation:
\texttt{sim\_ancestry} produces ancestry
(\textbf{left;} shown are two trees along a 1Kb chunk of genome
relating 200 samples),
and \texttt{sim\_mutations} adds mutations
(\textbf{right;} fourteen nucleotide mutations have been added).
Graphics produced with \tskit's \texttt{draw\_svg} method.
\plrcomment{TODO: fix up the axis labels, if we like this figure?}
\jkcomment{Why 200 samples? Seems like far too many to me, how about 5?}
}
\end{figure}
Part of the reason for this poor record of software reuse and modularity is the
lack of standardised file formats, and in particular, the absence of common
library infrastructure to abstract the details of interchanging simulation
data. Although \msprime\ also supports simulating both ancestry and mutations,
the two aspects are functionally independent within the software.
Both ancestry and mutation simulators are present in \msprime\
for reasons of convenience and history,
and could be split into separate packages.
The mature and efficient C and Python interfaces for
\tskit\ make it straightforward to add further information to
an existing file, and because of its efficient data interchange mechanisms,
there is no performance penalty for performing additional operations
in a different Python package.
Thanks to this interoperability, \msprime's
mutation generator can work with \emph{any} \tskit\ tree sequence,
be it simulated using \SLiM~\citep{haller2019slim} or
\fwdpy~\citep{thornton2014cpp}, or estimated from real
data~\citep{kelleher2019inferring,speidel2019method,wohns2021unified}.
It is a modular
component intended to fit into a larger software ecosystem, and
is in no way dependent on \msprime's ancestry simulator.
As well as providing a new API (see the \nameref{sec-sim-interface} section)
which emphasises the logical split between ancestry and mutation simulation,
we have greatly extended the sophistication of
\msprime's mutation generation engine for version 1.0,
achieving near feature-parity with \SeqGen.
We support a large number of mutation models, including the
JC69~\citep{jukes1969evolution},
F84~\citep{felsenstein1996hidden},
and GTR~\citep{tavare1986some} nucleotide models
and the BLOSUM62~\citep{henikoff1992amino}
and PAM~\citep{dayhoff1978} amino acid models.
A general model accepts an arbitrary transition matrix between
any number of alleles (which can be arbitrary unicode strings),
so that other models, such as the Kimura two and three
parameter models~\citep{kimura1980simple,kimura1981estimation},
can be easily and efficiently defined in user code.
Mutation rates can vary along the genome, and multiple mutation
models can be imposed on a tree sequence by overlaying mutations
in multiple passes.
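For instance, a Kimura two-parameter model can be defined from user code via
the general matrix model; the following sketch assumes the
\texttt{MatrixMutationModel} interface and uses illustrative rates and seeds:
\begin{lstlisting}[language=Python]
# Sketch: a Kimura two-parameter model in which transitions are twice
# as likely as each transversion, defined via the general matrix model.
import msprime

alleles = ["A", "C", "G", "T"]
k80 = msprime.MatrixMutationModel(
    alleles,
    root_distribution=[0.25, 0.25, 0.25, 0.25],
    transition_matrix=[
        # to:  A     C     G     T
        [0.00, 0.25, 0.50, 0.25],  # from A
        [0.25, 0.00, 0.25, 0.50],  # from C
        [0.50, 0.25, 0.00, 0.25],  # from G
        [0.25, 0.50, 0.25, 0.00],  # from T
    ],
)
ts = msprime.sim_ancestry(5, sequence_length=1e4, random_seed=1)
mts = msprime.sim_mutations(ts, rate=1e-7, model=k80, random_seed=2)
\end{lstlisting}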
We have extensively validated the results of mutation simulations
against both theoretical expectations and output from
\SeqGen~\citep{rambaut1997seq} and
\Pyvolve~\citep{spielman2015pyvolve}.
\begin{figure}
\includegraphics[width=\textwidth]{figures/mutations-perf}
\caption{\label{fig-mutations-perf} Time required to run
\texttt{sim\_mutations} on tree sequences generated
by \texttt{sim\_ancestry} (with a population size of $10^4$
and recombination rate of $10^{-8}$) for varying (haploid) sample
size and sequence length. We run 10 replicate mutation simulations
each for three different mutation rates, and report the average
CPU time required (Intel Core i7-9700).
(A) Holding sequence length fixed at 10 megabases and varying the
number of samples (tree tips) from 10 to 100,000.
(B) Holding number of samples fixed at 1000, and varying the sequence
length from 1 to 100 megabases.}
\end{figure}
Simulating mutations in \msprime\ is efficient.
Fig.~\ref{fig-mutations-perf} shows the time required to generate
mutations (using the default JC69 model) on
simulated tree sequences for a variety of mutation
rates as we vary the number of samples
(Fig.~\ref{fig-mutations-perf}A) and the sequence
length (Fig.~\ref{fig-mutations-perf}B).
For example, the longest running simulation in
Fig.~\ref{fig-mutations-perf}B required less than 2 seconds to
generate an average of 1.5 million mutations over 137,081 trees
in a tree sequence with 508,125 edges.
This efficiency for large numbers of trees is possible because
the tree sequence encoding allows us to generate mutations
on an edge-by-edge basis
(see Fig.~\ref{fig-ts-example} and the~\nameref{app-mutation-algorithm}
appendix),
rather than tree-by-tree and branch-by-branch as would otherwise be required.
In the above example from Fig.~\ref{fig-mutations-perf}B,
if we generated mutations tree-by-tree, we would have to iterate over 273,887,838 branches
(since there are 137,081 trees and 1,998 branches in each
tree) rather than 508,125 edges, resulting in $\sim500$ times more work.
% python3 evaluation/generate_mutations_perf_data.py benchmark-single-tree
Even if we have a tree sequence consisting of a single tree
(negating the advantage of working edge-by-edge),
\msprime's mutation generator is still very efficient.
For example, we simulated mutations under the BLOSUM62 amino
acid model for a tree with $10^6$ leaves over $10^4$ sites (resulting
in $\sim$260,000 mutations) in about $0.8$ seconds, including
the time required for file input and output.
We do not attempt a systematic benchmarking of \msprime's
mutation generation code against other methods, because at this scale it is
difficult to disentangle the effects of inefficient input and
output formats from the mutation generation algorithms.
Given these timings, it seems quite unlikely
that generating mutations with \msprime\ would be a bottleneck in any
realistic analysis.
There are many ways in which the mutation generation code
in \msprime\ could be extended. For example, we intend to add support for
microsatellites~\citep{mailund2005coasim},
codon models~\citep{arenas2007recodon}
and indels~\citep{cartwright2005dna,fletcher2009indelible},
although some may also require upstream additions to \tskit's data model,
which is currently based on the assumption of independent sites.
\subsection*{Recombination}
Crossover recombination is implemented in \msprime\ using Hudson's algorithm, which
works backwards in time, tracking the
effects of common ancestor and recombination
events~\citep{hudson1983properties,hudson1990gene,kelleher2016efficient}.
Common ancestor events combine the ancestral material of two lineages, and may
result in coalescences in the marginal trees. Recombination events
split the ancestral material for some lineage, creating two independent
lineages. Using the appropriate data structures~\citep{kelleher2016efficient},
this process is much more efficient to simulate than either the
methods based on the Ancestral Recombination Graph
formulation~\citep{griffiths1991two,griffiths1997ancestral}
or the left-to-right approach~\citep{wiuf1999recombination,wiuf1999ancestry}.
In \msprime\ 1.0 recombination rates can vary along a chromosome, allowing
us to simulate recombination hotspots and patterns
of recombination from empirical maps.
The implementation of recombination in \msprime\ is extensively validated
against analytical results~\citep{hudson1983properties,kaplan1985use}
and simulations by \ms.
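Variable recombination rates are specified via a rate map; the following
minimal sketch defines a single artificial hotspot (all values illustrative):
\begin{lstlisting}[language=Python]
# Sketch: variable recombination rate along a 3 Mb genome, with an
# elevated rate between 1 Mb and 2 Mb.
import msprime

rate_map = msprime.RateMap(
    position=[0, 1e6, 2e6, 3e6],
    rate=[1e-8, 1e-6, 1e-8],
)
ts = msprime.sim_ancestry(
    samples=10, recombination_rate=rate_map,
    population_size=1e4, random_seed=3)
\end{lstlisting}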
The Sequentially Markovian Coalescent (SMC) is an approximation of the
coalescent with recombination~\citep{mcvean2005approximating,marjoram2006fast},
and was primarily motivated by the need to
simulate longer genomes than was possible using tools like \ms.
The SMC is a good approximation to the
coalescent with recombination when we have four or fewer sampled
genomes~\citep{hobolth2014markovian,wilton2015smc}, but the
effects of the approximation are less well understood for larger
sample sizes, and several approaches have been proposed
that allow simulations to more closely approximate the coalescent
with recombination~\citep{chen2009fast,wang2014new,staab2015scrm}.
The SMC and SMC' models are available
in \msprime\ 1.0. However, they are currently implemented using a
naive rejection sampling approach, and are somewhat slower
to simulate than the exact coalescent with recombination. These
models are therefore currently only appropriate for studying the
SMC approximations themselves, although we intend to
implement them more efficiently in future versions.
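These approximate models are selected via the \texttt{model} argument to
\texttt{sim\_ancestry}; for example (a sketch with illustrative parameters):
\begin{lstlisting}[language=Python]
# Sketch: running the SMC' approximation instead of the default
# exact coalescent with recombination.
import msprime

ts = msprime.sim_ancestry(
    samples=10, sequence_length=1e6, recombination_rate=1e-8,
    population_size=1e4, model="smc_prime", random_seed=4)
\end{lstlisting}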
\begin{figure}
\begin{center}
\includegraphics{figures/ancestry-perf}
\end{center}
\caption{\label{fig-ancestry-perf}
Approximate run time for \msprime\ simulations,
for (A) small and (B) larger simulations.
Each point is the runtime of one simulation,
for various values of effective population size ($N$),
genome length ($L$), and number of samples.
Runtime scales quadratically with the product of $N$ and $L$;
since recombination rate in these simulations was $10^{-8}$,
the scaled recombination rate here is $\rho = NL/10^8$,
shown on the horizontal axis.
The black lines are quadratic fits separately in each panel
and sample size.
}
\end{figure}
In human-like parameter regimes and for large sample sizes,
\msprime's implementation of the
exact coalescent with recombination comprehensively outperforms
all other simulators, including those based on SMC
approximations~\citep{kelleher2016efficient}. However, it is important
to note that although the implementation of Hudson's algorithm is
very efficient, it is still quadratic in the scaled recombination rate
$\rho = 4 N_e L$, where $L$ is the length of the genome in units of recombination distance.
This is because Hudson's algorithm tracks recombinations not only
in segments ancestral to the sample, but also between ancestral segments.
In other words, lineages may ``coalesce'' even if the pieces of ancestral material they carry
do not overlap---representing an ancestral chromosome
that is a genetic ancestor to each of two lineages,
but not the most recent common genetic ancestor at any location along the genome.
When this happens, the resulting merged lineage carries ``trapped'' genetic material
that is not ancestral to any samples;
recombinations in the trapped material will cause the two segments of ancestry
to split apart into separate lineages again.
For large $\rho$, recombination events
in this trapped material will dominate.
\citet[Eq.~5.10]{hein2004gene} gave
$\rho (\rho + 1) \left( \sum_{i=1}^{n-1} \frac{1}{i} \right)^2$
as an upper bound
on the number of recombination events within trapped ancestral material
in Hudson's algorithm.
Since the sum is well approximated by $\log{n}$,
a natural conjecture is that the time complexity of Hudson's
algorithm is $O(\rho^2 \log^2 n)$.
This turns out to have the correct dependence on $\rho$---although
caution is needed; see below---but the dependence on $n$ is less accurate.
Figure~\ref{fig-ancestry-perf} gives a rough idea of the runtime of
a given simulation of a constant size population,
and verifies that runtime scales quadratically with $\rho$,
and depends much less strongly on the number of samples.
Taking a typical chromosome to be 1 Morgan in length,
these plots show, roughly, that simulating chromosome-length
samples from a population of thousands of individuals takes seconds,
while samples from a population of tens of thousands take minutes.
However, it is important to note that the quadratic curves in the two
panels are different---predicting the runtimes of days-long simulations
using the timing of minutes-long runs is unlikely to be very accurate.
Further experiments show that adding a $\log^2 n$ factor
to predict runtimes across different sample sizes does not do well
except possibly at very large values of $\rho$.
% Although \citet{hein2004gene} refer to the bound in \eqref{eqn-hsw-bound}
% as ``crude'', it predicts the scaling of run time quite well for large genomes
% Figure~\ref{fig-expected-perf} shows the observed runtime
% (from the same simulations as Figure~\ref{fig-ancestry-perf})
% plotted against equation~\eqref{eqn-hsw-bound},
% showing that for values of $\rho$ above $4 \times 10^4$ or so
% (i.e., chromosome-length simulations in populations above $10^4$),
% the ratio of observed time to the value from \eqref{eqn-hsw-bound} is constant.
% This implies the runtime for a large sample
% can be predicted from the runtime for a small sample.
% For smaller values of $\rho$, however, the prediction
% severely underestimates runtime.
What about simulations with changing population size?
To understand how runtime depends on demography
it helps to consider why runtime is quadratic in $\rho$.
At any point in time, \msprime\ must keep track of some number of lineages,
each of which contains some number of chunks of genetic material.
Common ancestor events reduce the number of lineages,
and recombination events increase their number.
However, with long genomes
only a small fraction of the common ancestor events
will involve overlapping segments of ancestry,
and lead to coalescence in the marginal trees.
Such disjoint segments are often far apart (on average, about distance $L/2$),
and so recombine apart again immediately;
it is these large numbers of rapid yet inconsequential events that
lead to the quadratic runtime.
\plrcomment{(this is discussed also in the ARG section; check for overlap)}
The maximum number of lineages occurs when
the increase and decrease in numbers of lineages due to
common ancestor and recombination events balance out;
to get an idea of runtime we can estimate when this occurs.
Suppose that the maximal number of lineages is $M$;
at this time the rate of common ancestor events is $M (M-1) / (4 N_e)$
and the total rate of recombination is $M \ell$,
where $\ell$ is the mean length of genome carried by each lineage
(including ``trapped'' non-ancestral material).
It is known that the maximum number of lineages is reached
very quickly~\citep{nelson2020accounting},
and at this maximum, recombination and coalescence rates are equal.
Therefore, many (roughly half) of these lineages have at
least two distant segments of ancestral material
separated by a long segment of genome, implying that $\ell \approx L / 4$.
This implies that $M$ is roughly $L N_e$.
The number of lineages decreases gradually,
on the coalescent time scale---so, over roughly $2 N_e$ generations.
Since the total rate of events while the number of lineages is near its maximum
is roughly $L^2 N_e / 4$,
the total number of events is proportional to $(L N_e)^2$---i.e.,
proportional to $\rho^2$.
What does this tell us about time-varying population sizes?
The argument above implies that the work is spread out relatively
evenly on the coalescent time scale: in other words,
more events happen in the time period from 0 to $N_e/2$ generations
than from $N_e/2$ to $N_e$ generations,
but the runtime is not dominated by the more recent time period.
Suppose that population size today is $N_1$,
while more than $T$ generations ago it was $N_2$:
does the runtime depend more on $4 N_1 L$ or $4 N_2 L$?
The answer depends on how $T$ compares to $N_1$: if $T/N_1$ is large,
then runtime will be similar to a population of size $N_1$;
while if $T/N_1$ is small, it will be similar to a population of size $N_2$.
For instance, in many agricultural species $N_1 \approx 100$, while $N_2 \approx 10^5$,
and the runtime will depend critically on $T$---in other words,
simulation will be quick in a species with a strong domestication bottleneck,
and slow otherwise.
In practice, we recommend that users estimate runtime by
measuring runtime in a series of simulations with short genome lengths,
possibly varying other parameters to check for interactions,
and then predict runtime by fitting a quadratic curve to genome length.
In these experiments, sample sizes should be roughly
those desired; this is not a problem in practice because of
the weak dependence on sample size:
notice that in Figure~\ref{fig-ancestry-perf} increasing sample size 100-fold
only increases runtime by a factor of two or so.
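A rough sketch of this procedure (with illustrative parameter values; in
practice one would also average over replicates) is:
\begin{lstlisting}[language=Python]
# Sketch: time a series of short-genome runs and fit a quadratic in
# genome length to extrapolate runtime to a full chromosome.
import time
import numpy as np
import msprime

lengths = np.array([1e6, 2e6, 4e6, 8e6])
times = []
for L in lengths:
    start = time.perf_counter()
    msprime.sim_ancestry(
        samples=1000, sequence_length=L, recombination_rate=1e-8,
        population_size=1e4, random_seed=1)
    times.append(time.perf_counter() - start)

coeffs = np.polyfit(lengths, times, 2)   # runtime ~ a*L^2 + b*L + c
predicted = np.polyval(coeffs, 1e8)      # extrapolate to 100 Mb
print(f"predicted runtime for 100 Mb: {predicted:.1f} s")
\end{lstlisting}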
\subsection*{Gene conversion}
%introduction
Gene conversion is a form of recombination that results in unidirectional transfer
of a short segment of genetic material,
for example between homologous chromosomes~\citep{chen2007gene}.
%description of ancestral process
Just as in crossover recombination, gene conversion events split the genetic material
inherited along some lineage, creating two independent lineages.
In contrast to crossover recombination, where the ancestral
material to the left and right of a breakpoint follows two
different ancestral lineages, a gene conversion event assigns the genetic material
of a short segment to one ancestral lineage and the flanking material to the left
and right of this segment to another lineage.
%Franz Baumdicker: Not sure if this next sentence is necessary,
% but it might be of interest for some people?
The resulting distribution of ancestral material after a gene conversion event
equals the distribution after two recombination events at the left and right position
of the gene conversion tract, followed by a common ancestor event.
%Impact on LD
However, gene conversion affects much shorter segments than crossover recombination,
typically below 1\,kb, and thus affects linkage patterns differently to
crossover recombination~\citep{korunes2017gene}.
% other simulators
Hudson's \ms\ optionally includes gene conversion, as does
MaCS~\citep{chen2009fast}, while \scrm~\citep{staab2015scrm} does not
support gene conversion.
As bacterial recombination can be modelled analogously to eukaryotic allelic
gene conversion, simulators for microbial evolution often include gene conversion
instead of crossover recombination.
These include the backwards-in-time simulator SimBac~\citep{brown2016simbac},
the SMC-based approach used in FastSimBac~\citep{demaio2017the},
and forward-in-time tools within the \SLiM\ framework~\citep{cury2020simulation}.
In \msprime\ 1.0 we implemented the coalescent with gene conversion as defined by
\cite{wiuf2000coalescent}, where gene conversion events are initiated between any
two sites in the sequence at a constant rate $\gamma$ and
the length of the transferred segment is given by a geometrically distributed
random variable with mean length $L$.
% Above I used the discrete case, but we could easily switch to a description for the
% continuous case.
We have validated the output of this model against
\ms\ and analytical results~\citep{wiuf2000coalescent}.
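Gene conversion is specified directly through \texttt{sim\_ancestry}; a
minimal sketch with illustrative, bacterial-style parameter values is:
\begin{lstlisting}[language=Python]
# Sketch: gene conversion without crossover recombination
# (no recombination_rate is given). Values are illustrative only.
import msprime

ts = msprime.sim_ancestry(
    samples=50,
    sequence_length=1e6,
    gene_conversion_rate=1e-8,         # per-site initiation rate
    gene_conversion_tract_length=500,  # mean tract length (geometric)
    population_size=1e4,
    random_seed=6,
)
\end{lstlisting}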
% In terms of file size, as expected from the discussion above, we observe that gene
% conversion is equivalent to double the rate of
% crossover recombination, so that a tree sequence produced with a `recombination_rate` of $r$
% and a `gene_conversion_rate` of $g$ is roughly the same size as a tree sequence with no gene
% conversion and a `recombination_rate` of $r+2g$. However, because gene conversion tends to
% affect only short regions of the genome, each gene conversion event usually creates
% less trapped material, hence simulations using mammalian-like gene conversion parameters
% run faster than simulations in which an equivalent amount of crossover recombination is
% imposed.
\begin{figure}
\begin{center}
\includegraphics[width=7cm]{figures/gc-perf}
\end{center}
\caption{\label{fig-gc-perf}Comparison of the running time
for \textit{E.\ coli} simulations using \msprime\ and SimBac for varying
sample sizes. The parameters used are chosen to mimic the simulation
of \textit{E.\ coli} genomes. The genome length was set to 4.5 Mb, the scaled per-site
gene conversion rate was set to $N_e g = \gamma = 0.015$, and
the mean gene conversion tract length was set to $L = 500$, which is close to
estimated parameters for \textit{E.\ coli}, i.e.\ an effective population size
of $N_e = 1.8 \times 10^8$ and a per-site, per-generation conversion rate of
$g = 8.9 \times 10^{-11}$ with mean tract length $L = 542$~\citep{lapierre2016the}.}
\end{figure}
Gene conversion is particularly useful to model homologous recombination in
bacterial evolution,
so we compare the performance of \msprime\ with gene conversion to
SimBac~\citep{brown2016simbac}, a simulator developed specifically to
simulate bacterial evolution. Although \msprime\ is missing some features
relevant to this case, e.g.\ the simulation of circular genomes, it is far
more efficient at simulating the basic processes.
Furthermore, gene conversion can be used in combination
with other current and future features of \msprime\ 1.0,
including varying recombination rates and complex demographies.
An important avenue for future work
will be to add further features to model bacterial
evolution. In particular, we aim to include bacterial gene transfer and utilize
the succinct tree sequence structure to represent the resulting ancestral gene
transfer graph~\citep{baumdicker2014AGTG}.
\subsection*{Demography}
% TODO should we put in some citations here for the models? Hardly seems
% necessary.
One of the key applications of population genetic simulations is to generate
data for complex demographies. Beyond idealised cases such as stepping-stone or
island models, or specialised cases such as isolation-with-migration models,
analytical results are rarely possible. Simulation is therefore integral to the
development and evaluation of methods for demographic inference. The demography
model in \msprime\ is directly derived from the approach used in \ms, and
supports an arbitrary number of randomly mating populations exchanging migrants
at specified rates. A range of demographic events are supported, which allow for
varying population sizes and growth rates, changing migration rates over time,
as well as population splits, admixtures and pulse migrations.
The location of sampled lineages
can be tracked through time in as much detail as required: each tree node is
automatically associated with the population in which it arose, the location of
lineages can be recorded at any given time via census events, or every lineage
migration can be recorded.
A major change for \msprime\ 1.0 is the introduction of the new Demography API,
designed to address a flaw in the \msprime\ 0.x interface, which led to
a number of avoidable errors in downstream
simulations~\citep{ragsdale2020lessons}. Briefly, the 0.x API required three
separate parameters to be provided to the \texttt{simulate} function to describe a
demographic model, making it easy to accidentally omit information. The 1.0 API
resolves this issue by creating a new \texttt{Demography} class, which
encapsulates all information about the demographic model, and fully decouples
the definition from other simulation details. An instance of this class is then
provided as a parameter to the new \texttt{sim\_ancestry} function,
substantially reducing the potential for error.
Another improvement over the
0.x APIs is the introduction of explicit population split and admixture events,
and a population state machine that ensures that lineages cannot migrate into
(or be sampled from) inactive populations. This demography model is compatible
with the Demes standard~\citep{gower2021demes}, and the \texttt{Demography}
class supports importing and exporting Demes models.
Models previously constructed using the 0.x API can be seamlessly imported into
the \texttt{Demography} class, and we also support importing demographic
models in the form of Newick species tree descriptions, as well as those
produced by programs such as *BEAST~\citep{heled2009bayesian}.
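A minimal sketch of the new API, in which two populations split from a common
ancestral population (names, sizes and times are illustrative only):
\begin{lstlisting}[language=Python]
# Sketch: a two-population split model with symmetric migration.
import msprime

demography = msprime.Demography()
demography.add_population(name="A", initial_size=5_000)
demography.add_population(name="B", initial_size=2_000)
demography.add_population(name="ANC", initial_size=10_000)
demography.set_symmetric_migration_rate(["A", "B"], 1e-5)
demography.add_population_split(time=1000, derived=["A", "B"], ancestral="ANC")

ts = msprime.sim_ancestry(
    samples={"A": 5, "B": 5}, demography=demography,
    sequence_length=1e6, recombination_rate=1e-8, random_seed=9)
\end{lstlisting}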
The \texttt{DemographyDebugger} can output detailed information for debugging
demographic models, but also provides
methods to compute a suite of numerical predictions about the location of
lineages over time, for example the coalescence rates for two or more lineages
drawn from populations at specified times in the past,
which can be inverted to obtain the ``inverse instantaneous coalescence rate'',
or IICR of \citet{chikhi2018iicr}.
Many popular approaches in population genetics use the distribution of
coalescence rates between pairs of lineages in one or more populations to infer
effective population sizes over
time~\citep{li2011inference,sheehan2013estimating,schiffels2014inferring} or
split times and subsequent migration rates between
populations~\citep{wang2020tracking}.
The \texttt{DemographyDebugger} provides a valuable
ground-truth when evaluating such inference methods, as illustrated by
\cite{adrion2020community}.
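For example (a minimal sketch with illustrative values; the printed summary
describes the model epoch by epoch, and further methods of the returned
debugger object provide the numerical predictions described above):
\begin{lstlisting}[language=Python]
# Sketch: inspecting a simple size-change model with the debugger.
import msprime

demography = msprime.Demography()
demography.add_population(name="X", initial_size=10_000)
demography.add_population_parameters_change(
    time=1000, population="X", initial_size=1_000)
debug = demography.debug()
print(debug)   # epoch-by-epoch summary of the demographic model
\end{lstlisting}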
\subsection*{Instantaneous bottlenecks}
A common approach to modelling the effect of demographic history on genealogies
is to assume that effective population size ($N_e$) changes in discrete steps
which define a series of epochs~\citep{griffiths1994sampling, marth2004allele,
keightley2007joint,li2011inference}. In this setting of piecewise constant
$N_e$, capturing a population bottleneck requires three epochs: $N_e$ is
reduced by some fraction $b$ at the start of the bottleneck, $T_{start}$, and
recovers to its initial value at time $T_{end}$~\citep{marth2004allele}. If
bottlenecks are short both on the timescale of coalescence and mutations,
there may be little information about the duration of a bottleneck
($T_{end}-T_{start}$) in sequence data. Thus a simpler, alternative model is to
assume that bottlenecks are instantaneous ($T_{end}-T_{start} \rightarrow 0$)
and generate a sudden burst of coalescence events (a multiple merger event) in
the genealogy. The strength of the bottleneck $B$ can be thought of as an
(imaginary) time period during which coalescence events are collapsed,
i.e.\ there is no growth in genealogical branches during $B$ and the probability that a
single pair of lineages entering the bottleneck coalesces during the bottleneck
is $1-e^{-B}$. Although this simple two-parameter model of bottlenecks is
attractive and both analytic results and empirical
inference~\citep{griffiths1994sampling, galtier2000detecting,
bunnefeld2015inferring} have been developed under this model, there has
been no software available to simulate data under instantaneous
bottleneck histories.
We have implemented instantaneous bottlenecks in \msprime~1.0
using a variant of Hudson's linear time single-locus coalescent
algorithm~\citep{hudson1990gene}. Instantaneous bottlenecks are specified
by adding events to the Demography object (see the Demography section)
and can be used in combination with any other demographic modelling
features. We have validated the results of these simulations by comparing
against analytical expectations for coalescence times and the
site frequency spectrum~\citep{bunnefeld2015inferring}.
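A minimal sketch, assuming the \texttt{add\_instantaneous\_bottleneck} event
method and using illustrative parameter values:
\begin{lstlisting}[language=Python]
# Sketch: a single population experiencing an instantaneous bottleneck
# 500 generations ago; the strength parameter controls the intensity
# of the induced burst of coalescence.
import msprime

demography = msprime.Demography()
demography.add_population(name="pop0", initial_size=10_000)
demography.add_instantaneous_bottleneck(
    time=500, strength=5_000, population="pop0")

ts = msprime.sim_ancestry(
    samples={"pop0": 10}, demography=demography,
    sequence_length=1e6, recombination_rate=1e-8, random_seed=10)
\end{lstlisting}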
\subsection*{Multiple merger coalescents}
Kingman's coalescent assumes that only two ancestral lineages can merge at
each merger event. Although this is generally a reasonable approximation there