This repository has been archived by the owner on Aug 24, 2022. It is now read-only.
-
Notifications
You must be signed in to change notification settings - Fork 21
/
paper.tex
2149 lines (1989 loc) · 107 KB
/
paper.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
\documentclass[9pt,twocolumn,twoside]{gsajnl}
\articletype{inv} % article type
\usepackage{amsmath,amssymb}%
\usepackage{graphicx}
\usepackage{authblk}
\usepackage{dsfont}
\usepackage{environ}%
\usepackage{caption}
\usepackage{subcaption}
\usepackage{tikz}
\usetikzlibrary{calc,positioning}
\usepackage{booktabs}
% local definitions
\newcommand{\msprime}[0]{\texttt{msprime}}
\newcommand{\tskit}[0]{\texttt{tskit}}
\newcommand{\ms}[0]{\texttt{ms}}
\newcommand{\msms}[0]{\texttt{msms}}
\newcommand{\cosiTwo}[0]{\texttt{cosi2}}
\newcommand{\msHOT}[0]{\texttt{msHOT}}
\newcommand{\scrm}[0]{\texttt{scrm}}
\newcommand{\SLiM}[0]{\texttt{SLiM}}
\newcommand{\fwdpy}[0]{\texttt{fwdpy11}}
\newcommand{\stdpopsim}[0]{\texttt{stdpopsim}}
\newcommand{\SeqGen}[0]{\texttt{Seq-Gen}}
\newcommand{\Pyvolve}[0]{\texttt{Pyvolve}}
\newcommand{\phastsim}[0]{\texttt{phastsim}}
\newcommand{\discoal}[0]{\texttt{discoal}}
\newcommand{\ARGON}[0]{\texttt{ARGON}}
\newcommand{\SimBac}[0]{\texttt{SimBac}}
\newcommand{\FastSimBac}[0]{\texttt{fastSimBac}}
\newcommand{\jkcomment}[1]{\textcolor{red}{#1}}
\newcommand{\plrcomment}[1]{\textcolor{cyan}{#1}}
\newcommand{\grgcomment}[1]{\textcolor{yellow!60!red}{#1}}
\newcommand{\adkcomment}[1]{\textcolor{blue}{#1}}
%% local definitions for section multiple merger coalescents
\newcommand{\be}{\begin{equation}}
\newcommand{\ee}{\end{equation}}
\newcommand{\bd}{\begin{displaymath}}
\newcommand{\ed}{\end{displaymath}}
\newcommand{\IN}{\ensuremath{\mathds{N}}}%
\newcommand{\EE}[1]{\ensuremath{\mathds{E}\left[ #1 \right]}}%
\newcommand{\one}[1]{\ensuremath{\mathds{1}_{\left\{ #1 \right\}}}}%
\newcommand{\prb}[1]{\ensuremath{\mathds{P}\left( #1 \right) } }%
\NewEnviron{esplit}[1]{%
\begin{equation}
\label{#1}
\begin{split}
\BODY
\end{split}\end{equation}
}
\title{Efficient ancestry and mutation simulation with msprime 1.0}
% First authors
\author[1,$\star$]{Franz Baumdicker}
\author[2,$\star$]{Gertjan Bisschop}
\author[3,4,$\star$]{Daniel Goldstein}
\author[5,$\star$]{Graham Gower}
\author[6,$\star$]{Aaron P. Ragsdale}
\author[7,$\star$]{Georgia Tsambos}
\author[8,$\star$]{Sha Zhu}
% Middle Authors
\author[9]{Bjarki Eldon}
\author[10]{E. Castedo Ellerman}
\author[11,12]{Jared G. Galloway}
\author[13,14]{Ariella L. Gladstein}
\author[15]{Gregor Gorjanc}
\author[16]{Bing Guo}
\author[8]{Ben Jeffery}
\author[17]{Warren W. Kretzschmar}
\author[2]{Konrad Lohse}
\author[18]{Michael Matschiner}
\author[19]{Dominic Nelson}
\author[20]{Nathaniel S. Pope}
\author[21]{Consuelo D. Quinto-Cort\'es}
\author[11]{Murillo F. Rodrigues}
\author[22]{Kumar Saunack}
\author[23]{Thibaut Sellinger}
\author[24]{Kevin Thornton}
\author[25]{Hugo van Kemenade}
\author[4,8]{Anthony W. Wohns}
\author[8]{Yan Wong}
% Senior Authors
\author[19,$\dagger$]{Simon Gravel}
\author[11,$\dagger$]{Andrew D. Kern}
\author[26,$\dagger$]{Jere Koskela}
\author[11,27,$\dagger$]{Peter L. Ralph}
% Corresponding
\author[8,$\ddagger$]{Jerome Kelleher}
% Franz
\affil[1]{Cluster of Excellence ``Controlling Microbes to Fight Infections'', Mathematical and Computational Population Genetics, University of T\"ubingen, 72076 T\"ubingen, Germany}
% Gertjan, Konrad
\affil[2]{Institute of Evolutionary Biology, The University of Edinburgh, EH9 3FL, UK}
% Daniel
\affil[3]{Khoury College of Computer Sciences, Northeastern University, MA 02115, USA}
% Daniel, Wilder
\affil[4]{Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA}
% Graham
\affil[5]{Lundbeck GeoGenetics Centre, Globe Institute, University of Copenhagen, 1350 Copenhagen K, Denmark}
% Aaron
\affil[6]{Department of Integrative Biology, University of Wisconsin--Madison, WI 53706, USA}
% Georgia
\affil[7]{Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Victoria, 3010, Australia}
% Jerome, Yan, Wilder, Joe
\affil[8]{Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, OX3 7LF, UK}
% Bjarki
\affil[9]{Leibniz Institute for Evolution and Biodiversity Science, Museum f\"ur Naturkunde Berlin, 10115, Germany }
% Castedo
\affil[10]{Fresh Pond Research Institute, Cambridge, MA 02140, USA}
% Jared, Murillo, Peter, Andy
\affil[11]{Institute of Ecology and Evolution, Department of Biology, University of Oregon, OR 97403-5289, USA}
% Jared
\affil[12]{Computational Biology Program, Fred Hutchinson Cancer Research Center, Seattle, WA 98102, USA}
% Ariella
\affil[13]{Department of Genetics, University of North Carolina at Chapel Hill, NC 27599-7264, USA}
\affil[14]{Embark Veterinary, Inc., Boston, MA 02111, USA}
% Gregor
\affil[15]{The Roslin Institute and Royal (Dick) School of Veterinary Studies, University of Edinburgh, EH25 9RG, UK}
% Bing
\affil[16]{Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, MD, 21201, USA}
% Winni
\affil[17]{Center for Hematology and Regenerative Medicine, Karolinska Institute, 141 83 Huddinge, Sweden}
% Michael
\affil[18]{Natural History Museum, University of Oslo, Blindern 0318 Oslo, Norway}
% Simon, Dom, Ivan
\affil[19]{Department of Human Genetics, McGill University, Montr\'eal, QC H3A 0C7, Canada}
% Nathaniel
\affil[20]{Department of Entomology, Pennsylvania State University, PA 16802, USA}
% Consuelo
\affil[21]{National Laboratory of Genomics for Biodiversity (LANGEBIO), Unit of Advanced Genomics, CINVESTAV, Irapuato, Mexico}
% Saunack
\affil[22]{IIT Bombay, Powai, Mumbai 400 076, Maharashtra, India}
% Thibaut
\affil[23]{Professorship for Population Genetics, Department of Life Science Systems, Technical University of Munich, 85354 Freising, Germany}
% Kevin
\affil[24]{Ecology and Evolutionary Biology, University of California, Irvine, CA 92697, USA}
% Hugo
\affil[25]{Independent Researcher}
% Jere
\affil[26]{Department of Statistics, University of Warwick, CV4 7AL, UK}
% Peter (second affil)
\affil[27]{Department of Mathematics, University of Oregon, OR 97403-5289 USA}
% % \section*{Contact:} \href{[email protected]}{[email protected]}
% \maketitle
\keywords{Simulation, Coalescent, Mutations, Ancestral Recombination Graphs}
\runningtitle{Efficient simulation with msprime 1.0}
\runningauthor{Baumdicker \textit{et al.}}
\begin{abstract}
Stochastic simulation is a key tool in population genetics,
since the models involved are often analytically intractable
and simulation is usually the only way of obtaining ground-truth data to
evaluate inferences.
Because of this, a large number
of specialized simulation programs have been developed, each
filling a particular niche, but with largely overlapping functionality
and a substantial duplication of effort.
Here, we introduce \msprime\ version 1.0, which
efficiently implements ancestry and mutation simulations based on
the succinct tree sequence data structure and the \tskit\ library.
We summarize \msprime's many features, and show that
its performance is excellent, often many times faster
and more memory efficient than specialized alternatives.
These high-performance features have been thoroughly
tested and validated, and built using a collaborative,
open source development model,
which reduces duplication of effort and promotes
software quality via community engagement.
% "Bazaar" development model: https://en.wikipedia.org/wiki/The_Cathedral_and_the_Bazaar
\end{abstract}
\begin{document}
\maketitle
\thispagestyle{firststyle}
\marginmark
\firstpagefootnote
% Use the \equalcontrib command to mark authors with equal
% contributions, using the relevant superscript numbers
% \equalcontrib{1}
% \equalcontrib{2}
\makeatletter
\newcommand{\equalcontribfirst}[1]{\@authfootnote{#1}{These authors contributed equally to this work, as joint first authors.}}
\newcommand{\equalcontriblast}[1]{\@authfootnote{#1}{These authors contributed equally to this work, as joint last authors.}}
\makeatother
\equalcontribfirst{$\star$} % {Denotes shared first authorship, listed alphabetically}
\equalcontriblast{$\dagger$} % {Denotes shared senior authorship, listed alphabetically}
% \equalcontrib{$\ddagger$}{Denotes corresponding author}
\correspondingauthoraffiliation{$\ddagger$}{\url{[email protected]}}
\vspace{-33pt}% Only used for adjusting extra space in the left column of the first page
\section{Introduction}
The coalescent
process~\citep{kingman1982genealogy,kingman1982coalescent,hudson1983testing,
tajima1983evolutionary}
models the ancestry of a set of sampled genomes,
providing a mathematical description of the
genealogical tree that relates the samples to one another.
It has proved to be a powerful model,
and is now central to
population genetics~\citep{hudson1990gene,hein2004gene,wakely2008coalescent}.
The coalescent is an efficient framework for population genetic
simulation, because it allows us to simulate the genetic ancestry for
a sample from an idealized population model, without explicitly representing
the population in memory or stepping through the generations.
Indeed, \cite{hudson1983testing} independently derived the
% Anyone disagree here? That's the impression I get from the paper.
coalescent \emph{in order to} efficiently simulate data,
and used these simulations to characterize an analytically
intractable distribution.
This inherent efficiency, and the great utility of simulations for a wide
range of purposes, has led to dozens of different tools
being developed over the decades~\citep{carvajal2008simulation,liu2008survey,
arenas2012simulation,yuan2012overview,hoban2012computer,yang2014critical,
peng2015genetic}.
Two technological developments of recent years, however,
pose major challenges to most existing simulation methods.
Firstly, fourth-generation sequencing technologies have made
complete chromosome-level assemblies possible~\citep{miga2020telomere},
% Some citations here? Darwin tree of life maybe?
and high quality assemblies are now available for many species.
Thus, modeling genetic variation data as a series of unlinked
non-recombining loci
is no longer a reasonable approximation, and
we must fully account for recombination.
However, while a genealogical tree relating $n$ samples in the single-locus coalescent
can be simulated in $O(n)$ time~\citep{hudson1990gene},
the coalescent with recombination is far more complex,
and programs such as
Hudson's classical \ms~\citep{hudson2002generating}
can only simulate short segments under the influence of recombination.
The second challenge facing simulation methods is that
sample sizes in genetic studies have grown very quickly
in recent years, enabled by the precipitous fall in genome sequencing costs.
Human datasets like the
UK Biobank~\citep{bycroft2018genome} and
gnomAD~\citep{karczewski2020mutational} now consist of hundreds of
thousands of genomes and many other datasets on a similar scale
are becoming available~\citep{tanjo2021practical}.
Classical simulators such as \ms\ and even fast approximate methods
such as \scrm~\citep{staab2015scrm} simply cannot
cope with such a large number of samples.
The \msprime\ simulator~\citep{kelleher2016efficient,kelleher2020coalescent}
has greatly increased the scope of coalescent simulations,
and it is now straightforward to simulate millions of whole chromosomes
for a wide range of organisms.
The ``succinct tree sequence'' data
structure~\citep{kelleher2016efficient,kelleher2018efficient,kelleher2019inferring,
wohns2021unified},
originally introduced as part of \msprime, makes it possible to store
such large simulations in a few gigabytes, several orders
of magnitude smaller than commonly used formats.
The succinct tree sequence has also led to major advances in forwards-time
simulation~\citep{kelleher2018efficient,haller2018tree},
ancestry inference~\citep{kelleher2019inferring,wohns2021unified}
and calculation of population genetic
statistics~\citep{kelleher2016efficient,ralph2020efficiently}.
Through a rigorous open-source community development process,
\msprime\ has gained a large number of features since its introduction,
making it a highly efficient and flexible platform for population
genetic simulation.
This paper marks the release of \msprime\ 1.0.
We provide an overview of its extensive features,
demonstrate its performance advantages over alternative software,
and discuss opportunities for ongoing
open-source community-based development.
The efficiency of coalescent simulations depends
crucially on the assumption of neutrality, and it is important
to note that there are many situations in which this will be a poor approximation
of biological reality~\citep{johri2021statistical}.
In particular, background selection has been shown to affect genome wide
sequence variation in a wide range of
species~\citep{charlesworth1993effect,charlesworth1995pattern,charlesworth2021effects}.
Thus care must be taken to ensure that the results of purely neutral simulations are
appropriate for the question and genomic partition under study.
A major strength of \msprime, however, is that it can be used in conjunction
with forwards-time simulators, enabling the simulation of
more realistic models than otherwise
possible~\citep{kelleher2018efficient,haller2018tree}.
\section*{Results}
In the following sections we describe the main features of \msprime\ 1.0,
focusing on the aspects that are either new for this version, or in which
our approach differs significantly from classical methods
(summarized in Table~\ref{tab-v1-summary}).
Where appropriate,
we benchmark \msprime\ against other simulators, but the comparisons are
illustrative and not intended to be systematic or exhaustive. Please
see~\cite{kelleher2016efficient} for a performance comparison of
\msprime\ against simulators such as \ms, \msms, and \scrm.
\begin{table}
\small{
{ % begin box to localize effect of arraystretch change
\renewcommand{\arraystretch}{1.5}
\begin{tabular}{lp{6.5cm}}
Interface
& Separation of ancestry and mutation simulations.
Ability to store arbitrary metadata along with simulation
results, and automatic recording of provenance information
for reproducibility.
Jupyter notebook~\citep{kluyver2016jupyter} integration.
Rich suite of analytical and visualization methods via the \tskit\ library. \\
Ancestry
& SMC, SMC', Beta- and Dirac-coalescent, discrete time Wright-Fisher,
and selective sweep models. Instantaneous bottlenecks.
Discrete or continuous genomic coordinates, arbitrary ploidy,
gene conversion.
Output full ARG with recombination nodes, ARG likelihood calculations.
Record full migration history and census
events. Improved performance for large numbers of populations.
Integration with forward simulators such as \SLiM\ and \fwdpy\
(``recapitation'').
\\
Demography
&
Improved interface with integrated metadata and referencing populations by
name. Import from Newick species tree, *BEAST~\citep{heled2009bayesian},
and Demes~\citep{gower2021demes}. Numerical methods to compute
coalescence rates.\\
Mutations
& JC69, HKY, F84, GTR, BLOSUM62, PAM, infinite alleles, SLiM and
general matrix mutation models.
Varying rates along the genome, recurrent/back mutations,
discrete or continuous genomic coordinates, overlaying multiple layers
of mutations, exact times associated with mutations.
\end{tabular}
}
}
\caption{\label{tab-v1-summary}Major features
of \msprime\ 1.0 added since version 0.3.0 \citep{kelleher2016efficient}.}
\end{table}
% The input - how do we run simulations?
\subsection*{User interface}
\label{sec-sim-interface}
The majority of simulation packages are controlled either through
a command line interface~\citep[e.g.][]{hudson2002generating,kern2016discoal},
a text-based input file
format~\citep[e.g.][]{guillaume2006nemo,excoffier2011fastsimcoal,shlyakhter2014cosi2},
or a mixture of both.
Command line interfaces make it easy to run simple
simulations, but as model complexity and the number of parameters increase,
they become difficult to understand and
error-prone~\citep{ragsdale2020lessons,gower2021demes}.
Specifying parameters through a text file alleviates this problem to a degree,
but lacks flexibility, for example, when running simulations with parameters
drawn from a distribution. In practice, for any reproducible simulation
project users will write a script
to generate the required command lines or input parameter files,
invoke the simulation engine, and process the results in some way.
This process is cumbersome and labor intensive, and
a number of packages have been developed
to allow simulations to be run directly in a high-level
scripting language~\citep{staab2016coala,parobek2017skelesim,gladstein2018simprily}.
The more recent trend has been to move away from this file and command-line
driven approach and to instead provide direct interfaces to the simulation
engines via an Application Programming Interface (API)~\citep[e.g.][]{
thornton2014cpp,kelleher2016efficient,becheler2019quetzal,haller2019slim}.
The primary interface for \msprime\ is through a thoroughly documented
Python
API, which has encouraged the development of an ecosystem of downstream
% Found these through the Dependent packages on GitHub
tools~\citep{terhorst2017robust,chan2018likelihood,spence2019inference,
adrion2020community,adrion2020predicting, kamm2020efficiently,
mckenzie2020ipcoal, montinaro2020revisiting,
de2021geonomics,rivera2021simulation}.
As well as providing a stable and efficient platform for building
downstream applications, \msprime's Python API makes it much easier to
build reproducible simulation pipelines, as the entire workflow can
be encapsulated in a single script, and package and version
dependencies explicitly stated using the \texttt{pip}
or \texttt{conda} package managers.
For example, the errors made
in the influential simulation analysis of
\cite{martin2017human} were only detected because the pipeline
could be easily run and reanalyzed~\citep{ragsdale2020lessons,martin2020erratum}.
\begin{figure}
\begin{center}
\includegraphics{illustrations/mutated_tree}
\end{center}
\caption{
\label{fig-mutated-trees}
Visualization of the separation between ancestry and mutation
simulation. (A) The result of an invocation of \texttt{sim\_ancestry}
is two trees along a 1kb chunk of genome relating three diploid samples.
Each diploid individual consists of two genomes (or nodes), indicated
by color.
(B) This ancestry is provided as the input to \texttt{sim\_mutations},
which adds mutations.
Graphics produced using \tskit's \texttt{draw\_svg} method.
}
\end{figure}
A major change for the \msprime\ 1.0 release is the introduction of a new set of APIs,
designed in part to avoid sources of error (see the Demography section) but
also to provide more appropriate defaults while keeping compatibility with
existing code. In the new APIs, ancestry and mutation simulation are fully
separated (see Fig.~\ref{fig-mutated-trees}),
with the \texttt{sim\_ancestry} and \texttt{sim\_mutations}
functions replacing the legacy \texttt{simulate} function. Among other changes,
the new APIs default to discrete genome coordinates and finite sites mutations,
making the default settings more realistic and resolving a major source of confusion and error.
The previous APIs are fully
supported and tested, and will be maintained for the foreseeable future.
The \texttt{msp} program (a command line interface to the library)
has been extended to include new commands
for simulating ancestry and mutations separately. A particularly useful
feature is the ability to specify demographic models in
Demes format~\citep{gower2021demes} from the command line,
making simulation of complex demographies straightforward.
We also provide an \ms\ compatible
command line interface to support existing workflows.
% The data model
\subsection*{Tree sequences}
\label{sec-ts}
\begin{figure*}
\begin{center}
\includegraphics{illustrations/example_tree_sequence}
\end{center}
\caption{\label{fig-ts-example} An example tree sequence describing
genealogies and sequence variation for four samples at ten sites
on a chromosome of twenty bases long.
Information is stored in a set of tables
(the tables shown here include only essential columns, and
much more information can be associated with the various entities).
The node table
stores information about sampled and ancestral genomes. The
edge table describes how these genomes are related along a
chromosome, and defines the genealogical tree at each position.
The site and mutation tables together describe sequence variation
among the samples. The genotype matrix and tree topologies shown
on the left are derived from these tables.
}
\end{figure*}
One of the key reasons for \msprime's substantial performance advantage
over other simulators~\citep{kelleher2016efficient}
is its use of the ``succinct tree sequence''
data structure to represent simulation results.
The succinct tree sequence (usually abbreviated to ``tree sequence'')
was introduced by \cite{kelleher2016efficient}
to concisely encode genetic ancestry and sequence variation
and was originally implemented as part of \msprime.
We subsequently extracted the core tree sequence functionality
from \msprime\ to create the \tskit\ library,
which provides a large suite of tools for processing genetic ancestry
and variation data via APIs
in the Python and C languages~\citep{tskit2021tskit}.
The availability of \tskit\ as a liberally licensed (MIT) open source
toolkit has enabled several other
projects~\citep[e.g.][]{kelleher2019inferring,haller2019slim,wohns2021unified,
de2021geonomics} to take advantage
of the same efficient data structures used in \msprime,
and we hope that many more will follow.
While a full discussion of tree sequences and the capabilities
of \tskit\ is beyond the scope of this article, we summarize
some aspects that are important for simulation.
% Encoding ancestry
% JK: we can say this some other way, this just seemed like the
% most concise way. People may be irritated by being told what
% the definition of a "genome" is, but I think we need to be explicit.
Let us define a genome as the complete set of genetic material
that a child inherits from one parent. Thus, a diploid individual
has two (monoploid) genomes, one inherited from each parent. Since
each diploid individual lies at the end of two distinct lineages
of descent, they will be represented by \emph{two} places (nodes)
in any genealogical tree. In the tree sequence encoding a
\emph{node} therefore corresponds to a single genome, which
is associated with its creation time (and other optional information),
and recorded in a simple tabular format (Fig.~\ref{fig-ts-example}).
Genetic inheritance between genomes (nodes) % delete?
is defined by edges.
An \emph{edge} consists of a
parent node, a child node and the left and right coordinates
% JK: This is potentially confusing. Changed "genomic interval"
% to "chromosome" interval to avoid talking about "the" genome.
of the contiguous chromosomal segment over which
the child genome inherited genetic material from the parent genome.
Parent and child nodes may correspond to ancestor and descendant
genomes separated by many generations.
Critically, edges can span multiple trees along the
genome (usually referred to as ``marginal'' trees),
and identical node IDs across different trees corresponds
to the same ancestral genome.
For example, in Fig.~\ref{fig-ts-example} the branch from node
0 to 4 is present in both marginal trees, and represented
by a single edge (the first row in the edge table).
This simple device, of explicitly associating tree nodes
with specific ancestral genomes and recording the
contiguous segments over which parent-child relationships exist,
generalizes the original
``coalescence records'' concept~\citep{kelleher2016efficient},
and is the key to the efficiency of tree
sequences~\citep{
kelleher2018efficient,kelleher2019inferring,ralph2020efficiently}.
Note that this formulation is fully compatible with
the concept of an Ancestral Recombination Graph (ARG)
and any ARG topology can be fully and efficiently encoded
in the node and edge tables illustrated in Fig.~\ref{fig-ts-example};
see the Ancestral Recombination Graphs section below for
more details.
% OK, back to tree sequences: what about sequence variation?
The final output of most population genetic simulations is some representation
of sequence variation among the specified samples. For coalescent
simulations, we usually have three steps: (1) simulate
the genetic ancestry, and optionally output the resulting marginal trees;
(2) simulate sequence evolution conditioned on this ancestry by generating
mutations (see the Simulating mutations section); and (3) output the
resulting nucleotide sequences by percolating the effects of the mutations through
the trees. Information about the mutations
themselves---e.g., where they have occurred on the
trees---is usually not retained or made available for subsequent analysis.
In \msprime, however, we skip step (3), instead using \tskit's
combined data model of ancestry and mutations
to represent the simulated sequences.
As illustrated in Fig.~\ref{fig-ts-example}, mutations are a
fully integrated part of \tskit's tree sequence data model, and genetic
variation is encoded by recording sites at which mutations
have occurred, and where each mutation at those sites
has occurred on the marginal tree.
Crucially, the genome sequences themselves are never stored, or indeed directly
represented in memory (although \tskit\ can output the variant matrix
in various formats, if required).
It may at first seem inconvenient to have only this indirect representation
of the genome sequences, but it is extremely powerful.
Firstly, the storage space required for simulations is dramatically
reduced. For a simulation of $n$ samples with $m$ variant sites,
we would require $O(nm)$ space to store the sequence data as a variant
matrix. However, if this simulation was of a recombining genome
with $t$ trees, then the \tskit\ tree sequence encoding
requires $O(n + t + m)$ space,
assuming we have $O(1)$ mutations at each site~\citep{kelleher2016efficient}.
For large sample sizes, this difference is profound, making it
conceivable, for example, to store the genetic ancestry
and variation data for the entire human population on a
laptop~\citep{kelleher2019inferring}. As well as the huge
difference in storage efficiency, it is often far more efficient
to compute statistics of the sequence data from the
trees and mutations than it is to work with the sequences themselves.
For example, computing Tajima's $D$ from simulated data stored in the \tskit\ format
is several orders of magnitude faster than efficient variant matrix
libraries for large sample sizes~\citep{ralph2020efficiently}.
The vast genomic datasets produced during the SARS-CoV-2 pandemic
have highlighted the advantages of storing genetic variation data using the underlying trees.
\cite{turakhia2021ultrafast} propose the
Mutation Annotated Tree (MAT)
format (consisting of a Newick tree
and associated mutations in a binary format) and
the \texttt{matUtils} program as an efficient way to store
and process large viral datasets~\citep{mcbroome2021daily},
achieving excellent compression and processing performance.
Similarly, \phastsim~\citep{demaio2021phastsim}
was developed to simulate sequence evolution on such large SARS-CoV-2
phylogenies, and also outputs a Newick tree annotated with mutations (not
in MAT format) to avoid the bottleneck of generating
and storing the simulated sequences.
While these methods illustrate the advantages of the general
approach of storing ancestry and mutations rather than sequences,
they do not generalize beyond their immediate settings, and
no software library support is available.
% \msprime, on the other hand, provides a fast, powerful, and well-tested
% generic interface for generating and storing mutations on trees.
The software ecosystem built around \tskit\ is stable, mature and
rapidly growing.
Simulators such as
\fwdpy~\citep{thornton2014cpp},
\SLiM~\citep{haller2019slim},
\stdpopsim~\citep{adrion2020community},
\texttt{Geonomics}~\citep{de2021geonomics}
and
\texttt{GSpace}~\citep{virgoulay2021gspace},
and inference methods such as
\texttt{tsinfer}~\citep{kelleher2019inferring},
\texttt{tsdate}~\citep{,wohns2021unified}
and \texttt{Relate}~\citep{speidel2019method}
use either the Python or C
APIs to support outputting results in tree sequence format.
Tree sequences are stored in an efficient binary file format, and
are fully portable across operating systems and processor
architectures. The \tskit\ library ensures interoperability
between programs by having strict definitions of how the
information in each of the tables is interpreted, and
stringent checks for the internal consistency of the data model.
% The output - what do we get out?
\subsection*{Data analysis}
\label{sec-data-analysis}
% We could drop this first sentence?
% The point of running simulations is to generate data, and there
% are two important ways in which the \tskit\ tree sequences output by
% \msprime\ are more useful than classical methods:
% they are more efficient and they contain more information.
% We will consider these points in turn.
The standard way of representing simulation data is
to render the results in a text format, which must subsequently
be parsed and processed as part of some analysis pipeline. For example,
\ms\ outputs a set of sequences
and can also optionally output the marginal trees along
the genome in Newick format,
and variants of this approach are used by many simulators.
Text files have many advantages, but are slow to process at scale.
The ability to efficiently process simulation results is
particularly important in simulation-based inference methods
such as Approximate Bayesian Computation
(ABC)~\citep{beaumont2002approximate,csillery2010approximate,wegmann2010abctoolbox}
and machine learning based
approaches~\citep{sheehan2016deep,chan2018likelihood,schrider2018supervised,
flagel2019unreasonable,sanchez2020deep}. Clearly, simulation efficiency is
crucial since the size and number of simulations that can be performed determines
the depth to which one can sample from the model and parameter space.
Equally important,
however, is the efficiency with which the simulation results can be
transformed into the specific input required by the inference method.
In the case of ABC, this is usually a set of summary statistics of the sequence
data, and methods avoid the bottleneck of parsing
text-based file formats to compute these statistics
by either developing their own
simulators~\citep[e.g.][]{cornuet2008inferring,lopes2009popabc}
or creating forked versions (i.e., modified copies) of existing
simulators~\cite[e.g.][]{thornton2006approximate,
hickerson2007msbayes,pavlidis2010msabc,huang2011mtml,quinto2018modeling},
tightly integrated with the inference method.
Modern approaches to ABC such as
ABC-RF~\citep{raynal2019abc,pudlo2016abc} and
ABC-NN~\citep{csillery2012abc,blum2010abc} use large
numbers of weakly informative statistics,
making the need to efficiently compute statistics from simulation
results all the more acute.
By using the stable APIs and efficient data interchange mechanisms
provided by \tskit,
the results of an \msprime\ simulation can be immediately processed,
without format conversion overhead.
The \tskit\ library has a rich suite of population
genetic statistics and other utilities, and is in many cases
orders of magnitude faster than matrix-based methods for
large sample sizes~\citep{ralph2020efficiently}. Thus, the combination
of \msprime\ and \tskit\ substantially increases the overall
efficiency of many simulation analysis pipelines.
Classical text based output formats like \ms\ are inefficient to process,
but also lack a great deal of important information about the simulated
process.
The tree-by-tree topology information output by simulators in Newick
format lacks any concept of node identity,
and means that we cannot reliably infer information about ancestors
from the output. Because Newick stores branch lengths rather
than node times, numerical precision issues also arise for
large trees~\citep{mcgill2013graphml}.
Numerous forks of simulators have been created to access information not provided
in the output. For example, \ms\ has been forked to
output information about migrating segments~\citep{rosenzweig2016powerful},
ancestral lineages~\citep{chen2013asymptotic},
% leaving this out because it's not about hidden information
% an array of population genetic statistics~\citep{ramos2007mlcoalsim}
% Any more?
and \ms's fork \msHOT~\citep{hellenthal2007mshot}
has in turn been forked to output information on local
ancestry~\citep{racimo2017archaic}.
All of this information
is either directly available by default in \msprime, or can be optionally
stored via options such as \texttt{record\_migrations} or
\texttt{record\_full\_arg} (see the Ancestral Recombination Graphs section) and
can be efficiently and conveniently processed via \tskit\ APIs.
% These problems of inefficient and limited access to the simulated
% data are comprehensively solved by \msprime\ and \tskit. Complete
% information about ancestry and mutation in simulations is immediately
% available to the user in a single integrated data structure
% (rather than multiple independent files), with no intermediate
% processing required.
% Users can subsequently compute precisely the information that they
% require using the powerful suite of population genetic statistics
% and other methods available in \tskit.
% Stuff we simulate: mutations
\subsection*{Simulating mutations}
\label{sec-mutations}
% Explain the background: what have people done classically?
Because coalescent simulations are usually concerned with
neutral evolution (see the Selective Sweeps section, however)
the problem of generating synthetic genetic variation can be decomposed into
two independent steps:
firstly, simulating genetic ancestry (the trees), then subsequently simulating
variation by superimposing mutation processes on those trees
(see Fig.~\ref{fig-mutated-trees}).
A number of programs exist to place mutations on trees: for instance,
the classical \SeqGen\ program~\citep{rambaut1997seq}
supports a range of different models of sequence evolution,
and various extensions to the basic
models have been proposed~\citep[e.g.][]{cartwright2005dna,fletcher2009indelible}.
Partly for efficiency and partly in the interest of
simplicity for users (i.e., to avoid intermediate text format conversions),
population genetic simulators have tended to
include their own implementations of mutation simulation, with
most supporting the infinite sites
model~\citep[e.g.][]{hudson2002generating}
but with several supporting a wide range of different models of sequence
evolution~\citep[e.g.][]{mailund2005coasim,excoffier2011fastsimcoal,
virgoulay2021gspace}. Thus, despite the logical separation between
the tasks of simulating ancestry and neutral sequence evolution,
the two have been conflated in practice.
Part of the reason for this poor record of software reuse and modularity is the
lack of standardized file formats, and in particular, the absence of common
library infrastructure to abstract the details of interchanging simulation
data. Although \msprime\ also supports simulating both ancestry and mutations,
the two aspects are functionally independent within the software; both ancestry and mutation simulators are present in \msprime\
for reasons of convenience and history,
and could be split into separate packages.
The efficient C and Python interfaces for
\tskit\ make it straightforward to add further information to
an existing file, and because of its efficient data interchange mechanisms,
there is no performance penalty for operations being performed
in a different software package.
Thanks to this interoperability, \msprime's
mutation generator can work with \emph{any} \tskit\ tree sequence,
be it simulated using \SLiM~\citep{haller2019slim} or
\fwdpy~\citep{thornton2014cpp}, or estimated from real
data~\citep{kelleher2019inferring,speidel2019method,wohns2021unified}.
It is a modular
component intended to fit into a larger software ecosystem, and
is in no way dependent on \msprime's ancestry simulator.
We have greatly extended the sophistication of
\msprime's mutation generation engine for version 1.0,
achieving near feature-parity with \SeqGen.
We support a large number of mutation models, including the
JC69~\citep{jukes1969evolution},
F84~\citep{felsenstein1996hidden},
and GTR~\citep{tavare1986some} nucleotide models
and the BLOSUM62~\citep{henikoff1992amino}
and PAM~\citep{dayhoff1978} amino acid models.
Other models, such as the Kimura two and three
parameter models~\citep{kimura1980simple,kimura1981estimation},
can be defined easily and efficiently in
user code by specifying a transition matrix between
any number of alleles.
Mutation rates can vary along the genome, and multiple mutation
models can be imposed on a tree sequence by overlaying mutations
in multiple passes.
We have extensively validated the results of mutation simulations
against both theoretical expectations and output from
\SeqGen~\citep{rambaut1997seq} and
\Pyvolve~\citep{spielman2015pyvolve}.
\begin{figure}
\includegraphics{figures/mutations-perf}
\caption{\label{fig-mutations-perf} Time required to run
\texttt{sim\_mutations} on tree sequences generated
by \texttt{sim\_ancestry} (with a population size of $10^4$
and recombination rate of $10^{-8}$) for varying (haploid) sample
size and sequence length. We ran 10 replicate mutation simulations
each for three different mutation rates, and report the average
CPU time required (Intel Core i7-9700).
(A) Holding sequence length fixed at 10 megabases and varying the
number of samples (tree tips) from 10 to 100,000.
(B) Holding number of samples fixed at 1000, and varying the sequence
length from 1 to 100 megabases.
}
\end{figure}
Simulating mutations in \msprime\ is efficient.
Fig.~\ref{fig-mutations-perf} shows the time required to generate
mutations (using the default JC69 model) on
simulated tree sequences for a variety of mutation
rates as we vary the number of samples
(Fig.~\ref{fig-mutations-perf}A) and the sequence
length (Fig.~\ref{fig-mutations-perf}B).
For example, the longest running simulation in
Fig.~\ref{fig-mutations-perf}B required less than 2 seconds to
generate an average of 1.5 million mutations over 137,081 trees
in a tree sequence with 508,125 edges.
This efficiency for large numbers of trees is possible because
the tree sequence encoding allows us to generate mutations
on an edge-by-edge basis
(see Fig.~\ref{fig-ts-example} and the Mutation Generation
appendix),
rather than tree-by-tree and branch-by-branch as would otherwise be required.
Simulating mutations on a single tree is also very efficient;
for example,
we simulated mutations under the BLOSUM62 amino
acid model for a tree with $10^6$ leaves over $10^4$ sites (resulting
in $\sim$260,000 mutations) in about $0.8$ seconds, including
the time required for file input and output.
We do not attempt a systematic benchmarking of \msprime's
mutation generation code against other methods, because at this scale it is
difficult to disentangle the effects of inefficient input and
output formats from the mutation generation algorithms.
Given the above timings, it seems unlikely
that generating mutations with \msprime\ would be a bottleneck in any
realistic analysis.
There are many ways in which the mutation generation code
in \msprime\ could be extended. For example, we intend to add support for
microsatellites~\citep{mailund2005coasim},
codon models~\citep{arenas2007recodon}
and indels~\citep{cartwright2005dna,fletcher2009indelible},
although changes may be required to \tskit's data model
which is currently based on the assumption of independent sites.
\subsection*{Recombination}
\label{sec-recombination}
Crossover recombination is implemented in \msprime\ using Hudson's algorithm, which
works backwards in time, generating common ancestor and recombination
events and tracking their effects on segments of ancestral
material inherited from the
sample~\citep{hudson1983properties,hudson1990gene,kelleher2016efficient}.
Common ancestor events merge the ancestral material of two lineages,
and result in coalescences in the marginal trees when ancestral
segments overlap.
Recombination events split the ancestral material for some lineage
at a breakpoint, creating two independent
lineages. Using the appropriate data structures~\citep{kelleher2016efficient},
this process is much more efficient to simulate
than the equivalent left-to-right
approach~\citep{wiuf1999recombination,wiuf1999ancestry}.
In \msprime\ 1.0, recombination rates can vary along a chromosome, allowing
us to simulate recombination hotspots and patterns
of recombination from empirical maps.
The implementation of recombination in \msprime\ is extensively validated
against analytical results~\citep{hudson1983properties,kaplan1985use}
and simulations by \ms, \msHOT\ and \SLiM.
The Sequentially Markovian Coalescent (SMC) is an approximation of the
coalescent with recombination~\citep{mcvean2005approximating,marjoram2006fast},
and was primarily motivated by the need to
simulate longer genomes than was possible using tools like \ms.
The SMC is a good approximation to the
coalescent with recombination when we have fewer than five sampled
genomes~\citep{hobolth2014markovian,wilton2015smc}, but the
effects of the approximation are less well understood for larger
sample sizes, and several approaches have been proposed
that allow simulations to more closely approximate the coalescent
with recombination~\citep{chen2009fast,wang2014new,staab2015scrm}.
The SMC and SMC' models are supported
in \msprime\ 1.0. However, they are currently implemented using a
naive rejection sampling approach, and are somewhat slower
to simulate than the exact coalescent with recombination. These
models are therefore currently only appropriate for studying the
SMC approximations themselves, although we intend to
implement them more efficiently in future versions.
In human-like parameter regimes and for large sample sizes,
\msprime's implementation of the
exact coalescent with recombination comprehensively outperforms
all other simulators, including those based on SMC
approximations~\citep{kelleher2016efficient}. However, it is important
to note that although the implementation of Hudson's algorithm is
very efficient, it is still quadratic in the population scaled recombination rate
$\rho = 4 N_e L$, where $L$ is the length of the genome in units of recombination distance.
This is because Hudson's algorithm tracks recombinations not only
in segments ancestral to the sample, but also between ancestral segments.
As mentioned above, common ancestor events in which the ancestral material
of two lineages is merged only result in coalescences in the marginal trees
if their ancestral segments overlap. If there is no overlap, the merged
segments represent an ancestral chromosome that is a genetic ancestor
of the two lineages, but not the most recent common genetic ancestor
at any location along the genome.
When this happens, the merged lineage carries ``trapped'' genetic material
that is not ancestral to any samples, but where recombinations can still
occur~\citep{wiuf1999recombination}.
For large $\rho$, recombination events in trapped
ancestral material will dominate, and so we can use this as a proxy for the
overall number of events in Hudson's algorithm.
\citet[Eq.~5.10]{hein2004gene} gave
\begin{equation}\label{eqn-hsw-bound}
\rho (\rho + 1) \left( \sum_{i=1}^{n-1} \frac{1}{i} \right)^2
\end{equation}
as an upper bound
on the number of recombination events within trapped ancestral material
for $n$ samples.
As discussed in the Time complexity of Hudson's algorithm appendix, the quadratic
dependence of simulation running time on $\rho$ implied
by \eqref{eqn-hsw-bound} is well supported
by observations, and provides a useful means of predicting how long
a particular simulation might require.
\subsection*{Gene conversion}
Gene conversion is a form of recombination that results in the transfer
of a short segment of genetic material,
for example between homologous chromosomes~\citep{chen2007gene}.
Since gene conversion impacts much shorter segments than
crossover recombination (typically below 1kb) it
affects patterns of linkage disequilibrium differently~\citep{korunes2017gene}.
\cite{wiuf2000coalescent} modeled gene conversion in the coalescent
via a rate at which gene conversion events are initiated
along the genome and a geometrically distributed tract length.
In terms of the ancestral process, gene conversion differs from
crossover recombination (as described in the previous section)
in that it extracts a short tract of ancestry into
an independent lineage, rather than splitting ancestry
to the left and right of a given breakpoint.
We have implemented this model of gene
conversion in \msprime\ 1.0, and validated the output against
\ms\ and analytical results~\citep{wiuf2000coalescent}.
\begin{figure}
\begin{center}
\includegraphics{figures/gc-perf}
\end{center}
\caption{\label{fig-gc-perf}Comparison of simulation performance
using \msprime\ (\texttt{sim\_ancestry}), \SimBac, and \FastSimBac\ for varying
(haploid) sample sizes,
and the current estimates
for \textit{E.~coli} parameters~\citep{lapierre2016the}:
a 4.6Mb genome, $N_e = 1.8\times 10^8$, gene conversion rate of $8.9\times 10^{-11}$
per base and mean tract length of $542$.
We report (A) the total CPU time and (B) maximum memory usage
averaged over 5 replicates (Intel Xeon E5-2680 CPU).
We did not run \SimBac\ beyond first two data points because
of the very long running times.}
\end{figure}
Gene conversion is particularly useful to model homologous recombination in
bacterial evolution,
and so we compare the performance of \msprime\ with gene conversion to
two specialized bacterial simulators,
\SimBac~\citep{brown2016simbac} and \FastSimBac~\citep{demaio2017the}.
Figure~\ref{fig-gc-perf}A shows that \msprime\ is far more efficient than
both \SimBac\ and the SMC-based approximation \FastSimBac.
Figure~\ref{fig-gc-perf}B shows that \msprime\ requires somewhat more memory
than \FastSimBac, (as expected since \FastSimBac\ uses a left-to-right
SMC approximation) but is still reasonably modest at around 1GiB for a simulation
of 500 whole \emph{E.~coli} genomes.
However, \msprime\ is currently lacking many of the specialized features
required to model bacteria, and so an important avenue for future work
is to add features such as circular genomes
and bacterial gene transfer~\citep{baumdicker2014AGTG}.
\subsection*{Demography}
\label{sec-demography}
One of the key applications of population genetic simulations is to generate
data for complex demographies. Beyond idealized cases such as stepping-stone or
island models, or specialized cases such as isolation-with-migration models,
analytical results are rarely possible. Simulation is therefore integral to the
development and evaluation of methods for demographic inference. The demography
model in \msprime\ is directly derived from the approach used in \ms, and
supports an arbitrary number of randomly mating populations exchanging migrants
at specified rates. A range of demographic events are supported, which allow for
varying population sizes and growth rates, changing migration rates over time,
as well as population splits, admixtures and pulse migrations.
A major change for \msprime\ 1.0 is the introduction of the new Demography API,
designed to address a design flaw in the \msprime\ 0.x interface which led to
avoidable errors in downstream simulations~\citep{ragsdale2020lessons}.
The new API is more user-friendly, providing the ability, for example,
to refer to populations by name rather than their integer identifiers.
We also provide numerical methods to compute
the coalescence rates for two or more lineages
which can be inverted to obtain the
``inverse instantaneous coalescence rate'' of \citet{chikhi2018iicr}.
Many popular approaches in population genetics use the distribution of
coalescence rates between pairs of lineages to infer
effective population sizes over
time~\citep{li2011inference,sheehan2013estimating,schiffels2014inferring} or
split times and subsequent migration rates between
populations~\citep{wang2020tracking}.
These numerical methods provide a valuable
ground-truth when evaluating such inference methods, as illustrated by
\cite{adrion2020community}.