% Template for PLoS
% Version 3.1 February 2015
%
% To compile to pdf, run:
% latex plos.template
% bibtex plos.template
% latex plos.template
% latex plos.template
% dvipdf plos.template
%
% % % % % % % % % % % % % % % % % % % % % %
%
% -- IMPORTANT NOTE
%
% This template contains comments intended
% to minimize problems and delays during our production
% process. Please follow the template instructions
% whenever possible.
%
% % % % % % % % % % % % % % % % % % % % % % %
%
% Once your paper is accepted for publication,
% PLEASE REMOVE ALL TRACKED CHANGES in this file and leave only
% the final text of your manuscript.
%
% There are no restrictions on package use within the LaTeX files except that
% no packages listed in the template may be deleted.
%
% Please do not include colors or graphics in the text.
%
% Please do not create a heading level below \subsection. For 3rd level headings, use \paragraph{}.
%
% % % % % % % % % % % % % % % % % % % % % % %
%
% -- FIGURES AND TABLES
%
% Please include tables/figure captions directly after the paragraph where they are first cited in the text.
%
% DO NOT INCLUDE GRAPHICS IN YOUR MANUSCRIPT
% - Figures should be uploaded separately from your manuscript file.
% - Figures generated using LaTeX should be extracted and removed from the PDF before submission.
% - Figures containing multiple panels/subfigures must be combined into one image file before submission.
% For figure citations, please use "Fig." instead of "Figure".
% See http://www.plosone.org/static/figureGuidelines for PLOS figure guidelines.
%
% Tables should be cell-based and may not contain:
% - tabs/spacing/line breaks within cells to alter layout or alignment
% - vertically-merged cells (no tabular environments within tabular environments, do not use \multirow)
% - colors, shading, or graphic objects
% See http://www.plosone.org/static/figureGuidelines#tables for table guidelines.
%
% For tables that exceed the width of the text column, use the adjustwidth environment as illustrated in the example table in text below.
%
% % % % % % % % % % % % % % % % % % % % % % % %
%
% -- EQUATIONS, MATH SYMBOLS, SUBSCRIPTS, AND SUPERSCRIPTS
%
% IMPORTANT
% Below are a few tips to help format your equations and other special characters according to our specifications. For more tips to help reduce the possibility of formatting errors during conversion, please see our LaTeX guidelines at http://www.plosone.org/static/latexGuidelines
%
% Please be sure to include all portions of an equation in the math environment.
%
% Do not include text that is not math in the math environment. For example, CO2 will be CO\textsubscript{2}.
%
% Please add line breaks to long display equations when possible in order to fit size of the column.
%
% For inline equations, please do not include punctuation (commas, etc) within the math environment unless this is part of the equation.
%
% % % % % % % % % % % % % % % % % % % % % % % %
%
% Please contact [email protected] with any questions.
%
% % % % % % % % % % % % % % % % % % % % % % % %
\documentclass[10pt,letterpaper]{article}
\usepackage[top=0.85in,left=2.75in,footskip=0.75in]{geometry}
% Use adjustwidth environment to exceed column width (see example table in text)
\usepackage{changepage}
% Use Unicode characters when possible
\usepackage[utf8]{inputenc}
% textcomp package and marvosym package for additional characters
\usepackage{textcomp,marvosym}
% fixltx2e package for \textsubscript
\usepackage{fixltx2e}
% amsmath and amssymb packages, useful for mathematical formulas and symbols
\usepackage{amsmath,amssymb}
% cite package, to clean up citations in the main text. Do not remove.
\usepackage{cite}
% Use nameref to cite supporting information files (see Supporting Information section for more info)
\usepackage{nameref,hyperref}
% line numbers
\usepackage[right]{lineno}
% ligatures disabled
\usepackage{microtype}
\DisableLigatures[f]{encoding = *, family = * }
% rotating package for sideways tables
\usepackage{rotating}
% Remove comment for double spacing
%\usepackage{setspace}
%\doublespacing
% Text layout
\raggedright
\setlength{\parindent}{0.5cm}
\textwidth 5.25in
\textheight 8.75in
% Bold the 'Figure #' in the caption and separate it from the title/caption with a period
% Captions will be left justified
\usepackage[aboveskip=1pt,labelfont=bf,labelsep=period,justification=raggedright,singlelinecheck=off]{caption}
% Use the PLoS provided BiBTeX style
\bibliographystyle{plos2015}
% Remove brackets from numbering in List of References
\makeatletter
\renewcommand{\@biblabel}[1]{\quad#1.}
\makeatother
% Leave date blank
\date{}
% Header and Footer with logo
\usepackage{lastpage,fancyhdr,graphicx}
\usepackage{epstopdf}
\pagestyle{myheadings}
\pagestyle{fancy}
\fancyhf{}
\lhead{\includegraphics[width=2.0in]{PLOS-submission.eps}}
\rfoot{\thepage/\pageref{LastPage}}
\renewcommand{\footrule}{\hrule height 2pt \vspace{2mm}}
\fancyheadoffset[L]{2.25in}
\fancyfootoffset[L]{2.25in}
\lfoot{\sf PLOS}
%% Include all macros below
%%%%%%%%%%%%%%%%%%%%%%%%%% JK MACROS and includes %%%%%%%%%%%%%%%%%%%%%%%%
% \usepackage{amsmath,amssymb,amsthm}
% \usepackage{graphicx}
% \usepackage[round]{natbib}
\usepackage[square,numbers,compress]{natbib}
\usepackage{url}
\usepackage{caption} % Used to change labels for SI figures.
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%%%% Local definitions
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\newcommand{\E}[0]{\ensuremath{\mathbb{E}}}
\newcommand{\cardinality}[1]{\ensuremath{\left|#1\right|}}
\newcommand{\vect}[1]{\ensuremath{\mathbf{#1}}}
\newcommand{\randomuniform}[0]{\mathcal{R}_U}
\newcommand{\randomexponential}[0]{\mathcal{R}_E}
\newcommand{\randomdiscrete}[0]{\mathcal{R}_D}
\newcommand{\algref}[1]{#1}
% Notation for the algorithm descriptions
\newcommand{\attrin}[0]{\ensuremath{\mbox{in}}}
\newcommand{\attrout}[0]{\ensuremath{\mbox{out}}}
\newcommand{\attrparent}[0]{\ensuremath{\mbox{parent}}}
\newcommand{\attrchildren}[0]{\ensuremath{\mbox{children}}}
\newcommand{\attrnode}[0]{\ensuremath{\mbox{node}}}
\newcommand{\attrtime}[0]{\ensuremath{\mbox{time}}}
\newcommand{\attrleft}[0]{\ensuremath{\mbox{left}}}
\newcommand{\attrright}[0]{\ensuremath{\mbox{right}}}
\newcommand{\attrroot}[0]{\ensuremath{\mbox{root}}}
% Notation for Hudson's algorithm
\DeclareMathOperator{\segright}{right}
\DeclareMathOperator{\segleft}{left}
\DeclareMathOperator{\segnode}{node}
\DeclareMathOperator{\segnext}{next}
\DeclareMathOperator{\segprev}{prev}
\DeclareMathOperator{\segment}{Segment}
% Binary Indexed Tree operations
\DeclareMathOperator{\bittotal}{total}
\DeclareMathOperator{\bitfind}{find}
% AVL Tree operations
\DeclareMathOperator{\avlsearch}{search}
\DeclareMathOperator{\avlnextkey}{nextkey}
% Index vectors
\newcommand{\indexin}[0]{\ensuremath{\mathcal{I}}}
\newcommand{\indexout}[0]{\ensuremath{\mathcal{O}}}
\newcommand{\ms}[0]{\texttt{ms}}
\newcommand{\msms}[0]{\texttt{msms}}
\newcommand{\msprime}[0]{\texttt{msprime}}
\newcommand{\mspms}[0]{\texttt{mspms}}
\newcommand{\scrm}[0]{\texttt{scrm}}
\newcommand{\cosi}[0]{\texttt{cosi2}}
\newcommand{\MaCS}[0]{\texttt{MaCS}}
\newcommand{\fastsimcoal}[0]{\texttt{fastsimcoal2}}
\newcommand{\plink}[0]{\texttt{plink}}
\newcommand{\dadi}[0]{\texttt{$\partial a \partial i$}}
% These macros are borrowed from TAOCPMAC.tex
\newcommand{\slug}{\hbox{\kern1.5pt\vrule width2.5pt height6pt depth1.5pt\kern1.5pt}}
\def\xskip{\hskip 7pt plus 3pt minus 4pt}
\newdimen\algindent
\newif\ifitempar \itempartrue % normally true unless briefly set false
\def\algindentset#1{\setbox0\hbox{{\bf #1.\kern.25em}}\algindent=\wd0\relax}
\def\algbegin #1 #2{\algindentset{#21}\alg #1 #2} % when steps all have 1 digit
\def\aalgbegin #1 #2{\algindentset{#211}\alg #1 #2} % when 10 or more steps
\def\alg#1(#2). {\medbreak % Usage: \algbegin Algorithm A (algname). This...
\noindent{\bf#1}({\it#2\/}).\xskip\ignorespaces}
\def\kalgstep#1.{\ifitempar\smallskip\noindent\else\itempartrue
\hskip-\parindent\fi
\hbox to\algindent{\bf\hfil #1.\kern.25em}%
\hangindent=\algindent\hangafter=1\ignorespaces}
\newcommand{\algstep}[3]{\kalgstep #1 [#2] #3 }
\newenvironment{taocpalg}[3]{%
\vspace{1em}%
\algbegin Algorithm #1. ({#2}). #3 }
{\vspace{1em}}
%%%%%%%%%%%%%%%%%%%%%%%%%% END JK MACROS and includes %%%%%%%%%%%%%%%%%%%%%%%%
%% END MACROS SECTION
\begin{document}
\vspace*{0.35in}
% Title must be 250 characters or less.
% Please capitalize all terms in the title except conjunctions, prepositions, and articles.
\begin{flushleft}
{\Large
\textbf\newline{Efficient coalescent simulation and genealogical
analysis for large sample sizes}
}
\newline
% Insert author names, affiliations and corresponding author email
\\
Jerome Kelleher\textsuperscript{1},
Alison M Etheridge\textsuperscript{2},
Gilean McVean\textsuperscript{1}
\\
\bigskip
\bf{1} Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
\\
\bf{2} Department of Statistics, University of Oxford, Oxford, UK
\\
\bigskip
% Use the asterisk to denote corresponding authorship and provide email address in note below.
\end{flushleft}
% Please keep the abstract below 300 words
\section*{Abstract}
A central challenge in the analysis of genetic variation is to
provide realistic genome simulation across millions of samples. Present day
coalescent simulations do not scale well, or use approximations that fail to
capture important long-range linkage properties. Analysing the results of
simulations also presents a substantial challenge, as current methods to store
genealogies consume a great deal of space, are slow to parse and do not
take advantage of shared structure in correlated trees. We solve these problems
by introducing sparse trees and coalescence records as the key units of
genealogical analysis. Using these tools, exact simulation of the coalescent
with recombination for chromosome-sized regions over hundreds of thousands of
samples is possible, and substantially faster than present-day approximate
methods. We can also analyse the results orders of magnitude more quickly than
with existing methods.
% Please keep the Author Summary between 150 and 200 words Use first person.
% PLOS ONE authors please skip this step. Author Summary not valid for PLOS ONE
% submissions.
\section*{Author Summary}
Our understanding of the distribution of genetic variation in natural
populations has been driven by mathematical models of the underlying biological
and demographic processes. A key strength of such coalescent models is that
they enable efficient simulation of data we might see under a variety of
evolutionary scenarios. However, current methods are not well suited to
simulating genome-scale data sets on hundreds of thousands of samples, which is
essential if we are to understand the data generated by population-scale
sequencing projects. Similarly, processing the results of large simulations
also presents researchers with a major challenge, as it can take many days just
to read the data files. In this paper we solve these problems by introducing a
new way to represent information about the ancestral process. This new
representation leads to huge gains in simulation speed and storage efficiency
so that large simulations complete in minutes and the output files can be
processed in seconds.
\linenumbers
\section*{Introduction}
\label{sec-introduction}
The coalescent process~\citep{k82,h83a} underlies much of modern population
genetics and is fundamental to our understanding of molecular evolution. The
coalescent describes the ancestry of a sample of $n$ genes in the absence of
recombination, selection, population structure and other complicating factors.
The model has proved to be highly extensible, and these and many other
complexities required to model real populations have successfully been
incorporated~\citep{w08}. Simulation has played a key role in coalescent theory
since its beginnings~\citep{h83a}, partly due to the ease with which it can be
simulated: for a sample of $n$ genes, we require only $O(n)$ time and space to
simulate a genealogy~\citep{h90}.
Soon after the single locus coalescent was derived, Hudson defined an algorithm
to simulate the coalescent with recombination~\citep{h83b}. However, after some
early successes in characterising this process~\citep{hk85,kh85} little
progress was made because of the complex distribution of blocks of ancestral
material among ancestors. Some years after Hudson's pioneering work, the study
of recombination in the coalescent was recast in the framework of the Ancestral
Recombination Graph~\citep{g91,gm97}. In the ARG, nodes are events (either
recombination or common ancestor) and the edges are ancestral chromosomes. A
recombination event results in a single ancestral chromosome splitting into two
chromosomes, and a common ancestor event results in two chromosomes merging
into a common ancestor. Analytically, the ARG is a considerable simplification
of Hudson's earlier work as it models all recombination events that occurred in
the history of a sample and not just those that can potentially affect the
genealogies. Many important results have been derived using this framework, one
of which is particularly significant for our purposes here. Ethier and
Griffiths~\citep{eg90} proved that the expected number of recombination events
back to the Grand MRCA of a sample of $n$ individuals grows like $e^\rho$ as
$\rho \rightarrow \infty$, where $\rho$ is the population scaled recombination
rate. In this paper we consider a diploid model in which we have a sequence of
$m$ discrete sites that are indexed from zero. Recombination occurs between
adjacent sites at rate $r$ per generation, and therefore $\rho = 4 N_e r(m -
1)$. The Ethier and Griffiths result implies that the time required to simulate
an ARG grows exponentially with the sequence length, and we can only ever hope
to simulate ARGs for the shortest of sequences.
This result, coupled with the observed poor scaling of coalescent simulators
such as the seminal \ms\ program~\citep{h02}, seems to imply that
simulating the coalescent with recombination over chromosome scales is
hopeless, and researchers have therefore sought alternatives. The sequentially
Markov coalescent (SMC) approximation~\citep{mc05,mw06} underlies the majority
of present day genome scale simulation~\citep{cmw09,ef11,szml14} and inference
methods~\citep{ld11,sd14,rhgs14}. The SMC simplifies the process of simulating
genealogies by assuming that each marginal tree depends only on its immediate
predecessor as we move from left-to-right across the sequence. As a
consequence, the time required to simulate genealogies scales linearly with
increasing sequence length. In practice, SMC based simulators such as
\MaCS~\citep{cmw09} and \scrm~\citep{szml14} are many times faster than \ms.
The SMC has disadvantages, however. Firstly, the SMC discards all long range
linkage information and therefore can be a poor approximation when modelling
features such as the length of admixture blocks~\citep{ln14}. Improving the
accuracy of the SMC can also be difficult.
For example, \MaCS\ has a parameter to increase the
number of previous trees on which a marginal tree can depend.
Counter-intuitively, increasing this parameter beyond a certain limit
results in a \emph{worse} approximation to the coalescent with
recombination~\citep{szml14}. (The \scrm\ simulator
provides a similar parameter that does not exhibit this unfortunate
behaviour, however.)
Incorporating complexities such as population structure~\citep{emm09},
intra-codon recombination~\citep{ap10} and inversions~\citep{pkgk13} is
non-trivial and can be substantially more complex than the corresponding
modification to the exact coalescent model. Also, while SMC based methods scale
well in terms of increasing sequence length, currently available simulators do
not scale well in terms of sample size.
We solve these problems by introducing sparse trees and coalescence records as
the fundamental units of genealogical analysis. By creating a concrete
formalisation of the genealogies generated by the coalescent process in terms
of an integer vector, we greatly increase the efficiency of simulating the
exact coalescent with recombination. In the section \textbf{\nameref{sec-simulation}},
we
discuss how Hudson's classical simulation algorithm can be defined in terms of
these sparse trees, and why this leads to substantial gains in terms of the
simulation speed and memory usage. We show that our implementation of the exact
coalescent, \msprime, is competitive with approximate simulators for small
sample sizes, and is faster than all other simulators for large sample sizes.
This is possible because Hudson's algorithm does not traverse the entire ARG,
but rather a small subset of it. The ARG contains a large number of nodes that
do not affect the genealogies of the sample~\citep{wh99}, and Hudson's algorithm saves time
by not visiting these nodes. This subset of the ARG (sometimes known as the
`little' ARG) has not been well characterised, which makes
analysis of Hudson's algorithm difficult. However, we show some numerical
results indicating that the number of nodes in the little ARG may be a quadratic
function of the scaled recombination rate $\rho$ rather than an exponential.
Generating simulated data is of little use if the results cannot be processed
in an efficient and convenient manner. Currently available methods for storing
and processing genealogies perform very poorly on trees with hundreds of thousands of
nodes. In the section \textbf{\nameref{sec-analysis}}, we show how the encoding of the
correlated trees output by our simulations leads
to an extremely compact method of storing these genealogies. For large
simulations, the representation can be thousands of times smaller than the most
compact tree serialisation format currently available. Our encoding also leads
to very efficient tree processing algorithms; for example, sequential access to
trees is several orders of magnitude faster than existing methods.
The advantages of faster and more accurate simulation over huge sample sizes,
and the ability to quickly process very large result sets may enable
applications that were not previously feasible. In the \textbf{\nameref{sec-discussion}}
we conclude by considering some of these applications and other uses of our
novel encoding of genealogies.
The methods developed in this paper allow us to simulate the coalescent for
very large sample sizes, where the underlying assumptions of the model
may be violated~\citep{wt03,msbw11,bcs14}. Addressing these issues is beyond
the scope of this work, but we note that the majority of our results
can be applied to simulations of any retrospective population model.
\section*{Methods}
\subsection*{Efficient coalescent simulation}
\label{sec-simulation}
In this section we define our encoding of coalescent genealogies, and
show how this leads to very efficient simulations. There are many different
simulation packages, and so we begin with a brief review of the state-of-the-art before
defining our encoding and analysing the resulting algorithm in the
following subsections.
Two basic approaches exist to simulate the coalescent with recombination. The
first approach was defined by Hudson~\citep{h83b}, and works by applying the effects of
recombination and common ancestor events to the ancestors of the sample as we
go backwards in time. Events occur at a rate that depends only on the
state of the extant ancestors, and so we can generate the waiting times to
these events efficiently without considering the intervening generations.
This contrasts with time-reversed generation-by-generation
methods~\citep{ens00,le04,arch05,lza07} which are more flexible but
also considerably less efficient.
The first simulation program published based on Hudson's algorithm was
\ms~\citep{h02}. After this, many programs were published to simulate various
evolutionary complexities not handled by \ms, such as
selection~\citep{sc04,ti09,eh10,sss14}, recombination hotspots~\citep{hs07},
codon models~\citep{ap07}, intra-codon recombination~\citep{ap10}
and models of species with a skewed offspring distribution~\citep{zdge15}. Others
developed user interfaces to facilitate easier analysis~\citep{mspmms05,rm07}.
The second fundamental method of simulating the coalescent with recombination
is due to Wiuf and Hein~\citep{wh99}. In Wiuf and Hein's algorithm we begin by generating a
coalescent tree for the left-most locus and then move across the sequence,
updating the genealogy to account for recombination events. This process is
considerably more complex than Hudson's algorithm because the relationship
between trees as we move across the genome is non-Markovian: each tree depends
on all previously generated trees. Because of this complexity, exact simulators
based on Wiuf and Hein's algorithm are significantly less efficient than
\ms~\citep{szml14,wzlclmx14}. However, Wiuf and Hein's algorithm has provided the
basis for the SMC approximation~\citep{mc05,mw06}, and programs based on
this approach~\citep{cmw09,ef11,szml14} can simulate long sequences far more
efficiently than exact methods such as \ms. Very roughly, we can think of
Wiuf and Hein's algorithm performing a depth-first traversal of the ARG, and
Hudson's algorithm a breadth-first traversal. Neither explore the full ARG,
but instead traverse the subset required to construct all marginal genealogies.
Recently, Hudson's algorithm has been utilised in \cosi~\citep{sss14}, which
takes a novel approach to simulating sequences under the coalescent. The
majority of simulators first generate genealogies and then throw down mutations
in a separate process. In \cosi\ these two processes are merged, so that
mutations are generated during traversal of the ARG. Instead of associating a
partial genealogy with each ancestral segment, \cosi\ maps ancestral segments
directly to the set of sampled individuals at the leaves of this tree. When a
coalescence between two overlapping segments occurs, we then have sufficient
information to generate mutations and map them to the affected samples. This
strategy, coupled with the use of sophisticated data structures, makes \cosi\
many times faster than competing simulators such as \msms~\citep{eh10}.
The disadvantage of
combining the mutation process with ARG traversal, however, is that the
underlying genealogies are not available, and \cosi\ cannot directly output
coalescent trees.
Many reviews are available to compare the various coalescent simulators in
terms of their features~\citep{c08,law08,a12,ymzhw12,hbg12,ydn14}. Little
information is available, however, about their relative efficiencies.
Hudson's \ms\ is widely regarded as the most efficient implementation of
the exact coalescent and is the benchmark against which other programs are
measured~\citep{mw06,cmw09,ef11,szml14,ydn14,wzlclmx14}. However,
for larger sample sizes and long sequence lengths, \msms\
is much faster than \ms. Also, for these larger sequence lengths and sample
sizes, \ms\ is unreliable and crashes~\citep{ef11,ydn14}. Thus,
\msms\ is a much more suitable baseline against which to judge performance.
The \scrm\ simulator is the most efficient SMC based
method currently available~\citep{szml14}.
\paragraph*{Hudson's algorithm with sparse trees}
\label{sec-simulation-algorithm}
An oriented tree~\cite[p.\ 461]{k11} is a sequence of integers
$\pi_1\pi_2\dots$, such that $\pi_u$ is the parent of node
$u$ and $u$ is a root if $\pi_u = 0$. For example, the trees
\begin{equation*}\label{eqn-otex}
\begin{array}{c}
\includegraphics[width=5cm]{figures/otex}
\end{array}
\end{equation*}
are defined by the sequences $\langle 5,4,4,5,0\rangle$,
$\langle 4,4,4,0\rangle$ and $\langle 4,4,5,5,0\rangle$, respectively.
Oriented trees provide a concise and efficient method of representing
genealogies, and have been used in coalescent simulations of a spatial
continuum model~\citep{kbe13,keb14}. These simulations adopted the convention
that the individuals in the sample (leaf nodes) are mapped to the integers $1,
\dots, n$. For every internal node $u$ we have $n < u < 2n$ and (for a binary
tree) the root is $2n - 1$. We refer to such trees as dense because the $2n -
2$ non-zero entries of the (binary) tree $\pi$ occur at $u = 1, \dots, 2n - 2$. A
sparse oriented tree (or more concisely, sparse tree) is an oriented tree $\pi$
in which the leaf nodes are $1, \dots, n$ as before, but internal nodes can be
any integer $> n$. For example, the oriented trees $\langle 5,4,4,5,0\rangle$ and
$\langle 6,5,5,0,6,0\rangle$ are topologically equivalent, but the
former is dense and the latter sparse.
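As a concrete illustration (ours, not part of the paper's implementation), an oriented tree can be held as a flat integer list in which index $u$ stores $\pi_u$; the helper name \texttt{path\_to\_root} is hypothetical:

```python
# An oriented tree as a flat integer list: pi[u] is the parent of node u,
# and pi[u] == 0 marks a root. Index 0 is padding so nodes count from 1.

def path_to_root(pi, u):
    """Collect the nodes on the path from u up to its root."""
    path = [u]
    while pi[u] != 0:
        u = pi[u]
        path.append(u)
    return path

# The dense tree <5,4,4,5,0> from the text:
dense = [0, 5, 4, 4, 5, 0]
# The topologically equivalent sparse tree <6,5,5,0,6,0>,
# in which the internal node label 4 is skipped:
sparse = [0, 6, 5, 5, 0, 6, 0]
```

For example, leaf 2 ascends through one internal node to the root in both encodings, even though the internal node labels differ.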
In our simulations, ancestral nodes are numbered sequentially from $n + 1$,
and a new node is created when a coalescence occurs within one or more of
the marginal genealogies. Note that we make a distinction between
common ancestor events and coalescence events throughout. A common ancestor
event occurs when two ancestors merge to form a common ancestor. If these
ancestors have overlapping ancestral material, then there will also be at least
one coalescence event, which is defined as a single contiguous block of sequence
coalescing within a common ancestor. In Hudson's algorithm there are many common
ancestor events that do not result in coalescence, and it is important to
distinguish between them.
Let the tuple $(\ell, r, u)$ define a segment carrying ancestral material.
This segment represents the mapping of the half-closed genomic interval $[\ell,
r)$ to the tree node $u$. Each ancestor $a$ is defined by a set of
non-overlapping segments. Initially we have $n$ ancestors, each consisting of a
single segment $(0, m, u)$ for $1 \leq u \leq n$. The only other state required
by the algorithm is the time $t$ and the next available node $w$; initially, $t =
0$ and $w = n + 1$.
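This state can be sketched in Python as follows (the representation follows the text, but the function name and tuple encoding are our own, not \msprime's):

```python
# Each ancestor is a list of non-overlapping (left, right, node) segments,
# each mapping the half-open interval [left, right) to a tree node.

def initial_state(n, m):
    """n sample ancestors, each a single segment covering all m sites,
    together with the current time t and the next node label w."""
    P = [[(0, m, u)] for u in range(1, n + 1)]
    t = 0.0
    w = n + 1
    return P, t, w
```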
Let $P$ be the set of ancestors at a given time $t$. Recombination events
happen at rate $\rho L / (m - 1)$ where
\[
L = \sum_{a \in P}\left( \max_{(\ell, r, u) \in a}r
- \min_{(\ell, r, u) \in a}\ell - 1 \right)
\]
is the number of available `links' that may be broken. (We use a fixed
recombination rate here for simplicity, but an arbitrary recombination
map can be incorporated without difficulty.) We choose one of the
available breakpoints uniformly, and split the ancestry of the individual at that point
into two recombinant ancestors. If this breakpoint is at $k$, we assign all
segments with $r \leq k$ to one ancestor and all segments with $\ell \geq k$ to
the other. If there is a segment $(\ell, r, u)$ such that $\ell < k < r$,
then $k$ falls within this segment and it is split
such that the segment $(\ell, k, u)$ is assigned to one ancestor and
$(k, r, u)$ is assigned to the other.
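A recombination event can be sketched as follows (a simplified illustration with hypothetical names; \texttt{num\_links} computes the per-ancestor summand of $L$ from the equation above):

```python
def num_links(a):
    """Breakable links in ancestor a: max right - min left - 1."""
    return max(r for (_, r, _) in a) - min(l for (l, _, _) in a) - 1

def split_ancestor(a, k):
    """Split ancestor a (a list of (left, right, node) segments) at
    breakpoint k into two recombinant ancestors."""
    left_anc, right_anc = [], []
    for (l, r, u) in a:
        if r <= k:                      # segment lies wholly left of k
            left_anc.append((l, r, u))
        elif l >= k:                    # segment lies wholly right of k
            right_anc.append((l, r, u))
        else:                           # k falls inside: subdivide
            left_anc.append((l, k, u))
            right_anc.append((k, r, u))
    return left_anc, right_anc
```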
Common ancestor events occur at rate $|P|(|P| - 1)$. Two ancestors $a$ and $b$
are chosen and their ancestry merged to form their common ancestor. If their
segments do not overlap, the set of ancestral segments of the common ancestor is
the union of those of $a$ and $b$. If segments do overlap, we have coalescence
events which must be recorded. We define a coalescence event as the merging of two
segments over the interval $[\ell, r)$ into a single ancestral segment. In
general the coordinates of overlapping segments $x$ and $y$ will not exactly
coincide, in which case we create an equivalent set of segments by
subdividing into the intersections and `overhangs'. Suppose then that we
have two exactly intersecting segments $(\ell, r, u)$ and $(\ell, r, v)$ from $a$ and $b$
respectively; over the interval $[\ell, r)$ the nodes $u$ and $v$ coalesce into
a common ancestor, which we associate with the next available node $w$. We
record this information by storing the coalescence record $\left(\ell, r, w,
(u, v), t\right)$. As we see in the \textbf{\nameref{sec-generating-trees}}
section, these
records provide sufficient information to later recover all marginal trees.
After recording this coalescence, we then check if there are any other segments
in $P$ that intersect with $[\ell, r)$. If there are, the simulation of this
region is not yet complete and we insert the segment $(\ell, r, w)$ into the
ancestor of $a$ and $b$. On the other hand, if there is some subset of $[\ell,
r)$ such that there is no other segment in $P$ that intersects with it,
we know that the marginal
tree covering this interval is complete and therefore we do not need to trace
its history any further. If any other intervals overlap in $a$ and $b$, we
perform the same operations, and finally update the next available node by
incrementing $w$. In this way, all coalescing intervals within the same
ancestor map to the same node $w$, even if they are disjoint. Conversely, if
two disjoint marginal trees contain the same node, we know that this is
because multiple segments coalesced simultaneously within the same ancestor.
The algorithm continues generating recombination and common ancestor
events at the appropriate rates until $P$ is empty, and all marginal trees are
complete. This interpretation of Hudson's algorithm differs from the standard
formulations~\citep{h83b,h90,mc05} by concretely defining the representation of
ancestry and by introducing the idea of coalescence records. We have omitted
many important details here in the interest of brevity; see
\nameref{app-algorithm-listing} for a detailed listing of our
implementation of Hudson's algorithm, and
\nameref{app-algorithm-illustration} for an illustration of a complete
invocation of the algorithm.
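The subdivision of overlapping segments into intersections and `overhangs' described above can be sketched as follows (an illustrative helper on tuple-based segments; msprime's actual implementation operates on linked lists, as detailed in \nameref{app-algorithm-listing}):

```python
def subdivide(x, y):
    """Subdivide two overlapping segments x = (lx, rx, u) and y = (ly, ry, v)
    into a pair of exactly intersecting segments plus any non-overlapping
    'overhang' pieces, which remain distinct ancestral segments."""
    (lx, rx, u), (ly, ry, v) = x, y
    lo, hi = max(lx, ly), min(rx, ry)  # the intersection [lo, hi)
    assert lo < hi, "segments must overlap"
    overhangs = []
    if lx < lo:
        overhangs.append((lx, lo, u))
    if ly < lo:
        overhangs.append((ly, lo, v))
    if rx > hi:
        overhangs.append((hi, rx, u))
    if ry > hi:
        overhangs.append((hi, ry, v))
    return (lo, hi, u), (lo, hi, v), overhangs
```

The two exactly intersecting segments returned here are those over which the nodes $u$ and $v$ coalesce into a common ancestor.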
There are several advantages to our sparse tree representation of ancestry.
Firstly, we do not need to store partially built trees in memory, and the only
state we need to maintain is the set of ancestral segments. This leads to
substantial time and memory savings, since we no longer have to copy partially
built trees at recombination events or update them during coalescences. We can
also actively defragment the segments in memory. For example, suppose that as a
result of a common ancestor event we have two segments $(\ell, k, u)$ and $(k,
r, u)$ in an ancestor. We can replace these segments with the equivalent
segment $(\ell, r, u)$. Such defragmentation yields significant time and memory
savings.
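Defragmentation amounts to merging abutting segments that carry the same node; a minimal sketch, again assuming sorted $(\ell, r, u)$ tuples:

```python
def defragment(segments):
    """Merge adjacent ancestral segments carrying the same node, so that
    (l, k, u) followed by (k, r, u) becomes the single segment (l, r, u).
    Assumes segments are sorted by left coordinate."""
    out = []
    for seg in segments:
        if out and out[-1][1] == seg[0] and out[-1][2] == seg[2]:
            l, _, u = out[-1]
            out[-1] = (l, seg[1], u)  # extend the previous segment
        else:
            out.append(seg)
    return out
```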
We have developed an implementation of Hudson's algorithm called \msprime\
based on these ideas. This package (written in C and Python) provides an \ms\
compatible command line interface along with a Python API, and is freely
available under the terms of the GNU GPL at
\url{https://pypi.python.org/pypi/msprime}. The implementation uses a simple
linked-list based representation of ancestral segments, and uses a binary
indexed tree~\citep{f94,f95} so that the ancestral segment involved
in a recombination event can be chosen in logarithmic time. The implementation
of \msprime\ is based on the listings for Hudson's algorithm given in
\nameref{app-algorithm-listing}, which should provide sufficient detail to make
implementation in a variety of languages routine.
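The binary indexed tree supports logarithmic-time prefix sums and searches over the per-segment link counts. A minimal Python sketch of the data structure (not msprime's C implementation) is:

```python
class FenwickTree:
    """Minimal binary indexed (Fenwick) tree over values indexed 1..n,
    supporting point updates, prefix sums, and a prefix-sum search,
    all in O(log n) time. A structure of this kind allows the segment
    containing a uniformly chosen recombination link to be found quickly."""

    def __init__(self, n):
        self.n = n
        self.tree = [0] * (n + 1)

    def add(self, i, delta):
        """Add delta to the value at index i."""
        while i <= self.n:
            self.tree[i] += delta
            i += i & -i

    def prefix_sum(self, i):
        """Return the sum of values at indexes 1..i."""
        s = 0
        while i > 0:
            s += self.tree[i]
            i -= i & -i
        return s

    def find(self, target):
        """Return the smallest index i with prefix_sum(i) >= target."""
        i, bit = 0, 1 << self.n.bit_length()
        while bit:
            j = i + bit
            if j <= self.n and self.tree[j] < target:
                i = j
                target -= self.tree[j]
            bit >>= 1
        return i + 1
```

Storing each ancestor's link count in such a tree makes drawing a breakpoint uniformly over the $L$ available links a logarithmic-time operation.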
\paragraph*{Performance analysis}
\label{sec-simulation-performance}
\begin{figure}
\begin{center}
\includegraphics[width=10cm]{figures/num_events}
\end{center}
\caption{\label{fig-algorithm-complexity} The mean number of
recombination events in Hudson's algorithm over 100 replicates
for varying sequence length and sample size. In the left panel we fix
$n = 1000$ and vary the sequence length. Shown in dots is
a quadratic fitted to these data, which has a
leading coefficient of $8.4 \times 10^{-3}$.
In the right panel
we fix the sequence length at $50$ megabases and vary the sample
size. }
\end{figure}
Surprisingly little is known about the complexity of Hudson's algorithm. We do
not know, for example, what the expected maximum number of extant ancestors is,
nor the distribution of ancestral material among them. The most important
unknown value in terms of quantifying the complexity of the algorithm is the
expected number of events that must be generated. It is sufficient to consider
the recombination events as the number of common ancestor and recombination
events is approximately equal~\citep{wh99}. Hudson's algorithm traverses a
subset of the ARG as it generates the marginal genealogies in which we are
interested, and so we know that the expected number of recombination events we
encounter is less than $e^\rho$~\citep{eg90}. This subset of the ARG is
sometimes known as the `little' ARG, but the relationship between the `big' and
little ARGs has not been well characterised.
Fig.~\ref{fig-algorithm-complexity} plots the average number of
recombination events generated by Hudson's algorithm for varying sequence
lengths and sample sizes. In this plot we also show the results of fitting a
quadratic function to the number of recombination events as we increase the
scaled recombination rate $\rho$. The fit is excellent, suggesting that the
current upper bound of $e^\rho$ is far too pessimistic. Wiuf and Hein~\citep{wh99}
previously
noted that the observed number of events in Hudson's algorithm was
`subexponential' but did not suggest a quadratic bound. Another point to note is
that the rate at which the number of events grows as we increase the sample
size is extremely slow, suggesting that Hudson's algorithm should scale well
for large sample sizes.
\begin{figure}
\begin{center}
\includegraphics[width=10cm]{figures/tree_simulation_time}
\end{center}
\caption{\label{fig-tree-simulation-time} Comparison of the average
running time over 100 replicates for various coalescent simulators
with varying sequence length and sample size.
\msms~\citep{eh10} is the most efficient published simulator based on
Hudson's algorithm that can output genealogies.
\MaCS~\citep{cmw09} is a popular SMC based simulator, and
\scrm~\citep{szml14} is the most efficient sequential simulator
currently available. Both \MaCS\ and \scrm\ were run in SMC$'$ mode.
Two results are shown for \msprime; one
outputting Newick trees and another outputting the native HDF5 based format.
}
\end{figure}
These expectations are borne out well in observations of our implementation of
Hudson's algorithm in \msprime. Fig.~\ref{fig-tree-simulation-time} compares
the time required to simulate coalescent trees using a number of simulation
packages. As we increase the sequence length in the left-hand panel, the
running time of \msprime\ increases faster than linearly, but at quite a slow
rate. \msprime\ is faster than the SMC approximations (\MaCS\ and \scrm) until
$\rho$ is roughly $20000$, and the difference is minor for sequence lengths
greater than this. \msprime\ is far faster than \msms, the only other exact
simulator in the comparison (we did not include \ms\ in these comparisons as it
is too slow and unreliable for large sample sizes). As we increase the
sample size in the right-hand panel, we can see that \msprime\ is far faster
than any other simulator. Two versions of \msprime\ are shown in these plots:
one outputting Newick trees (to ensure that the comparison with other
simulators is fair), and another that outputs directly in \msprime's native
format. Conversion to Newick is an expensive process, particularly for larger
sample sizes. When we eliminate this bottleneck, simulation time grows
at quite a slow, approximately linear rate. The memory usage of \msprime\ is also
modest, with the simulations in Fig.~\ref{fig-tree-simulation-time} requiring
less than a gigabyte of RAM. Supporting Fig.~\ref{fig-tree-simulation-num-trees} shows
that the mean number of recombination breakpoints (i.e., the number
of recombination events within ancestral material) output by all these
simulators is identical, and matches Hudson and Kaplan's
prediction~\citep{hk85} very well, giving us some confidence in the correctness of the
results. Supporting Fig.~\ref{fig-small-n-simulation-time} shows the relative
performance of \msprime\ and \scrm\ for a small sample size, and also shows the
effect of increasing the size of \scrm's sliding window.
We are often interested in the haplotypes that result from imposing a mutation
process onto genealogies as well as the genealogies themselves. Supporting
Fig.~\ref{fig-haplotype-simulation-time} compares the time required to
generate haplotypes using \scrm, \msprime\ and \cosi. Simulation times are
similar in all three for a fixed sample size of $1000$ and increasing sequence
length. For increasing sample sizes, both \cosi\ and \msprime\ are
substantially faster than \scrm. However, \msprime\ is significantly faster
than \cosi\ (and uses less memory; see Supporting
Fig.~\ref{fig-haplotype-simulation-memory}), particularly when we remove the
large overhead of outputting the haplotypes in text form.
Performance statistics were measured on Intel Xeon E5-2680 processors
running Debian 8.2. All code required to run comparisons and
generate plots is available at
\url{https://github.com/jeromekelleher/msprime-paper}.
\subsection*{Efficient genealogical analysis}
\label{sec-analysis}
There has been much recent interest in the problem of representing large scale
genetic data in formats that facilitate efficient access and calculation of
statistics~\citep{d14,lkkq15,l15}. The use of
`succinct' data structures, which are highly compressed but also allow for
efficient queries, is becoming essential: the scale of the data available to
researchers is so large that naive methods simply no longer work.
Although genealogies are fundamental to biology, little attention has been
paid to the problem of encoding trees in a form that facilitates efficient
computation. The majority of research has focused on the accurate interchange
of tree structures and associated metadata. The most common format for
exchanging tree data is the Newick format~\citep{f89}, which although
ill-defined~\citep{vbch12} has become the de-facto standard. Newick is based on
the correspondence of tree structures with nested parentheses, and is a concise
method of expressing tree topologies. Because of this recursive structure,
specific extensions to the syntax are required to associate information with
tree nodes~\citep{msm97,ze01}. XML based formats~\citep{hz09,vbch12} are much
more flexible, but tend to require substantially more storage space than
Newick~\citep{vbch12}. Various extensions to Newick have been proposed to
incorporate more general graph structures~\citep{mm06,bn06,crv08,trn08},
as well as a GraphML extension to encode ARGs directly~\citep{mwk13}.
Because Newick stores branch lengths rather than node times, numerical
precision issues also arise when summing over many short branches~\citep{mwk13}.
General purpose Bioinformatics toolkits such as BioPerl~\citep{sbbb10} and
BioPython~\citep{cacc09} provide basic tools to import trees in the various
formats. More specific tree processing libraries such as DendroPy~\citep{sh10},
ETE~\citep{hdg10}, and APE~\citep{pcs04} provide more sophisticated tools such
as visualisation and tree comparison algorithms. None of these libraries are
designed to handle large collections of correlated trees, and cannot make use
of the shared structure within a sequence of correlated genealogies. The
methods employed rarely scale well to trees containing hundreds of thousands of
nodes.
In this section we introduce a new representation of the correlated trees
output by a coalescent simulation using coalescence records. In
the \textbf{\nameref{sec-tree-sequences}} subsection we discuss this structure and
show how it compares in practice to existing approaches in terms of storage
size. Then, the \textbf{\nameref{sec-generating-trees}} subsection
presents an algorithm to sequentially
generate the marginal genealogies from a tree sequence, which we compare with
existing Newick-based methods. Finally, in the \textbf{\nameref{sec-counting-leaves}}
subsection we
show how the algorithm to sequentially visit trees can be extended to
efficiently maintain the counts of leaves from a specific subset, and show how
this can be applied in a calculation commonly used in genome wide association
studies.
\paragraph*{Tree Sequences}
\label{sec-tree-sequences}
\begin{figure}
\begin{center}
\includegraphics{figures/tree-sequence-illustration}
\end{center}
\caption{\label{fig-tree-sequence-illustration} Coalescence records
and corresponding marginal trees. The $x$-axis represents genomic coordinates,
and $y$-axis represents time (with the present at the top).
Each line segment in the top section of the figure represents
a coalescence record; e.g., the first segment corresponds to the
coalescence record $(2, 10, 5, (3, 4), 0.071)$.
The lower section of the figure shows the
corresponding trees in pictorial and sparse tree
form. We have omitted commas and brackets from this sequence
representation for compactness.
}
\end{figure}
As described earlier, the output of our formulation of Hudson's algorithm
is a list of coalescence records.
Each coalescence record is a tuple $(\ell, r, u, c, t)$
describing the coalescence of a list of child nodes $c$ into the parent $u$ at
time $t$ over the half-closed genomic interval $[\ell, r)$. (Because only
binary trees are possible in the standard coalescent, we assume the child node
list $c$ is a $2$-tuple $(c_1, c_2)$ throughout. However, arbitrary numbers of
children can be accommodated without difficulty to support
common ancestor events in which more than two lineages
merge~\citep{gdb00,dk99,p99,s99}.)
We refer to this set of records as a
\emph{tree sequence}, as it is a compact encoding of the set of correlated
trees representing the genealogies of a sample.
Fig.~\ref{fig-tree-sequence-illustration} shows an illustration of
the tree sequence output by an example simulation (see
\nameref{app-algorithm-illustration} for a full trace of this simulation).
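For illustration, a coalescence record might be represented in Python as a namedtuple; the field names here are our own, and the values are the example record from Fig.~\ref{fig-tree-sequence-illustration}:

```python
from collections import namedtuple

# A coalescence record (l, r, u, c, t): the children c coalesce into the
# parent node u at time t over the half-open genomic interval [l, r).
CoalescenceRecord = namedtuple(
    "CoalescenceRecord", ["left", "right", "node", "children", "time"]
)

# The first record from the example tree sequence.
rec = CoalescenceRecord(left=2, right=10, node=5, children=(3, 4), time=0.071)
```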
The tree sequence provides a concise method of representing the correlated
genealogies generated by coalescent simulations because it stores node
assignments shared across adjacent trees exactly once. Consider node $7$ in
Fig.~\ref{fig-tree-sequence-illustration}. This node is shared in the first
two marginal trees, and in both cases it has two children, $1$ and $6$. Even though
the node spans two marginal trees, the node assignment is represented in one
coalescence record $(0, 7, 7, (1, 6), 0.170)$. Importantly, this holds true
even though the subtree beneath $6$ is different in these trees. Thus, any
assignment of a pair of children to a given parent that is shared across
adjacent trees will be represented by exactly one coalescence record.
Coalescence records provide a full history of the coalescence events
that occurred in our simulation.
(Recall that we distinguish between common ancestor events, which
may or may not result in marginal coalescences, and coalescence events which
are defined as a single contiguous block of genome merging within a common
ancestor.) The effects of recombination events are also stored indirectly in
this representation in the form of the left and right coordinate of each
record. For every distinct coordinate between $0$ and $m$, there must have been
at least one recombination event that occurred at that breakpoint. However,
there is no direct information about the times of these recombination events,
and many recombinations will happen that leave no trace in the set of
coalescence records. For example, if we have a recombination event that splits
the ancestry of a given lineage, and this is immediately followed by a common
ancestor event involving these two lineages, there will be no record of this
pair of events.
On the other hand, if we consider the records in order of their left and right coordinates
we can also see them as defining the way in which we transform the
marginal genealogies as we move across a chromosome. Because many
adjacent sites may share the same genealogy, we need only consider the
coordinates of our records in order to recover the distinct genealogies and
the coordinate ranges over which they are defined. To obtain the marginal
tree covering the interval $[0, 2)$, for example, we simply find all records
with left coordinate equal to $0$ and apply these to the empty sparse tree
$\pi$. To subsequently obtain the tree corresponding to the interval $[2, 7)$
we first remove the records that do not apply over this interval, which must
have right coordinate equal to $2$. In the example, this corresponds to
removing the assignments $(2,4) \rightarrow 6$ and $(3, 7) \rightarrow 9$.
Having removed the `stale' records that do not cover the current interval, we
must now apply the new records that have left coordinate $2$. In this case, we
have two node assignments $(3,4) \rightarrow 5$ and $(2, 5) \rightarrow 6$, and
applying these changes to the current tree completes the transformation of the
first marginal tree into the second.
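This removal and application of records can be sketched directly, assuming the sparse tree $\pi$ is stored as a mapping from child to parent, with $0$ denoting no parent. The coordinates and times in the records below are illustrative placeholders, not values from the figure:

```python
def transform(pi, stale, fresh):
    """Transform one marginal tree into the next: clear the node
    assignments of the stale records (those whose right coordinate ends
    at the current breakpoint), then apply the fresh records (those whose
    left coordinate begins there). Records are (l, r, u, children, t)."""
    for l, r, u, children, t in stale:
        for c in children:
            pi[c] = 0
    for l, r, u, children, t in fresh:
        for c in children:
            pi[c] = u
    return pi
```

Only the nodes touched by these records change; the rest of the tree is left exactly as it was.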
There is an important point here. As we moved from left to right across the
simulated chromosome we transitioned from one marginal tree to the next by
removing and applying only two records. Crucially, modifying the nodes that
were affected by this transition did not result in a relabelling of any
nodes that were not affected.
As Wiuf and Hein~\citep{wh99,wh99b} showed, the effect of a
recombination at a given point in the sequence is to cut the branch above some
node in the tree to the left of this point, and reattach it within another
branch. This process is known as a subtree-prune-and-regraft~\citep{s03,s06}
and requires a maximum of three records to express in our tree sequence
formulation.
\begin{figure}
\begin{center}
\includegraphics{figures/tree-transition}
\end{center}
\caption{\label{fig-tree-transition} A prune and regraft not involving the
root requires three records.
(i) We begin with two subtrees rooted at $x$ and $y$, and we
wish to prune the subtree rooted at $b$ and graft it in the branch
joining $e$ to $y$.
(ii) We remove the assignments $(a, b)\rightarrow \alpha$,
$(\alpha, c) \rightarrow x$ and $(d, e) \rightarrow y$. After this operation,
the subtrees $a,\dots,e$ are disconnected from the main tree. The main trunk of
the tree rooted at $z$ is unaffected, as are the subtrees below $a, \dots, e$.
(iii) We add the records $(a ,c) \rightarrow x$,
$(b, e) \rightarrow \beta$ and $(d, \beta) \rightarrow y$, completing the
transition.
}
\end{figure}
Prune-and-regraft operations that do not affect the root
require three records, as illustrated in Fig.~\ref{fig-tree-transition}.
Two other possibilities exist
for how the current tree can be edited as we move along the sequence. The first
case is when we have a prune and regraft that involves a change in root
node; this requires only two records and is illustrated in the first transition
in Fig.~\ref{fig-tree-sequence-illustration}. The other case that
can arise from a single recombination event is a simple root change in which
the only difference between the adjacent trees is the time of the MRCA. This
requires one record, and is illustrated in the second
transition in Fig.~\ref{fig-tree-sequence-illustration}. These three
possibilities are closely related to the three classes of
subtree-prune-and-regraft identified by Song~\citep{s03,s06}.
Knowing the maximum number of records arising from a single recombination event
provides us with a useful bound on the expected number of records in a tree
sequence. Because the expected number of recombination events within ancestral
material is approximately $\rho \log n$~\citep{hk85,wh99}, the expected number
of tree transitions is also approximately $\rho \log n$. The number of records we
require for these tree transitions is then clearly $\leq 3 \rho \log n$. We
also require $n - 1$ records to describe the first tree in the sequence, and so
the total number of records is $\leq n + 3 \rho \log n - 1$.
Storing a tree sequence as a set of coalescence records therefore requires $O(n
+ \rho\log n)$ space, whereas any representation that stores each tree
separately (such as Newick) must require $O(n \rho \log n)$ space. This
difference is substantial in practice. As an example of a practical simulation
of the sort currently being undertaken, we repeated the simulation run
by Layer et al.~\citep{lkkq15}, in which we simulate a $100$ megabase region with a
recombination rate of $10^{-3}$ per base per $4N_e$ generations for a sample of
100,000 individuals. This simulation required approximately $6$ minutes and
850MB of RAM to run using \msprime; the original simulation reportedly required
over $4$ weeks using \MaCS\ on similar hardware.
Outputting the results as coalescence records in a simple tab-delimited text
format resulted in a 173MB file (52MB when gzip compressed). In contrast,
writing the trees out in Newick form required around 3.5TB of space. Because plain
text is a poor format for storing structured numerical data~\citep{knh13},
\msprime\ provides a tree sequence storage file format based on the HDF5
standard~\citep{hdf5}. Using this storage format, the file size is reduced to
88MB (41MB using the transparent zlib compression provided by the HDF5 library).
To compare the efficiency of storing correlated trees as coalescence records
with the TreeZip compression algorithm~\citep{msw10} we output the first
1000 trees in Newick format, resulting in a 3.2GB text file (1.1GB gzip
compressed). The TreeZip compression algorithm required 10 hours to run
and resulted in an 882MB file (83MB gzip compressed). Unfortunately,
it was not feasible to run TreeZip on all 3.5TB of the Newick data,
but we can see that with only around $0.1\%$ of the input data, the
compressed representation is already larger than the simple text output
of the entire tree sequence when expressed as coalescence records.
Associating mutation information with a tree sequence is straightforward. For
example, to represent a mutation that occurs on the branch that
joins node $7$ to node $9$ at site $1$ in
Fig.~\ref{fig-tree-sequence-illustration}, we simply record the tuple $(7,
1)$. (Infinite sites mutations can be readily accommodated by assuming that
the coordinate space is continuous rather than discrete.)
Because only the associated node and position of each mutation needs to
be stored, this results in a very concise representation of the full
genealogical history and mutational state of a sample. Repeating the simulation
above with a scaled mutation rate of $10^{-3}$ per unit of sequence
length per $4N_e$ generations resulted in 1.2 million infinite sites
mutations. The total size of the
HDF5 representation of the tree sequence and mutations was 102MB (49MB using
HDF5's zlib compression). In contrast, the text-based haplotype strings
consumed 113GB (9.7GB gzip compressed). Converting to text haplotypes required
roughly 9 minutes and 14GB of RAM.
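Determining whether a given sample carries a mutation stored as a (node, site) tuple only requires walking from the sample towards the root of the marginal tree at that site; a hypothetical helper, using the child-to-parent vector $\pi$ described earlier:

```python
def carries_mutation(pi, sample, mut_node):
    """Return True if `sample` inherits a mutation that occurred on the
    branch above `mut_node`, by walking up the marginal tree at the
    mutation's site. pi[v] is the parent of node v, with 0 at the root."""
    v = sample
    while v != 0:
        if v == mut_node:
            return True
        v = pi[v]
    return False
```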
The PBWT~\citep{d14} represents binary haplotype data in a format that
is both highly compressed and enables efficient pattern matching algorithms. We
converted the mutation data above into PBWT form, which required 22MB
of storage. Thus, the PBWT is a more compact representation of a set of
haplotypes than the tree sequence. However, the PBWT does not contain any
genealogical data, and therefore contains less information than the tree
sequence.
\paragraph*{Generating trees}
\label{sec-generating-trees}
Coalescence records provide a very compact means of encoding correlated
genealogies. Compressed representations of data usually come at the cost of
increased decompression effort when we wish to access the information. In
contrast, we can recover the marginal trees from a set of coalescence records
orders of magnitude more quickly than is possible using existing methods. In
this section we define the basic algorithm required to sequentially generate
these marginal genealogies.
For algorithms involving tree sequences it is useful to regard the set of
coalescence records as a table and to index the columns independently (see
Supporting Table~\ref{tab-tree-sequence} for the table corresponding to
Fig.~\ref{fig-tree-sequence-illustration}). We therefore define a tree sequence
$T$ as a tuple of vectors $T = (\vect{l}, \vect{r}, \vect{u}, \vect{c},
\vect{t})$, such that for each index $1 \leq j \leq M$, $(\vect{l}_j,
\vect{r}_j, \vect{u}_j, \vect{c}_j, \vect{t}_j)$ corresponds to one coalescence
record output by Hudson's algorithm, and there are $M$ records in total. It is
also useful to impose an ordering among the children at a node, and so we
assert that $\vect{c}_{j, 1} < \vect{c}_{j, 2}$ for all $1 \leq j \leq M$.
If we wish to obtain the tree for a given site $x$, we simply find the $n - 1$
records that intersect with this point and build the tree by applying these
records. We begin by setting $\pi_j \leftarrow 0 $ for $1 \leq j \leq
\max(\vect{u})$, and then set $\pi_{\vect{c}_{j, 1}} \leftarrow \vect{u}_{j}$
and $\pi_{\vect{c}_{j, 2}} \leftarrow \vect{u}_{j}$ for all $j$ such that
$\vect{l}_j \leq x < \vect{r}_j$. Spatial indexing structures
such as the segment tree~\citep{s89} allow us to find all $k$ segments out
of a set of $N$ that intersect with a given point in $O(k + \log N)$ time.
Therefore, since the expected number of records is
$O(n + \rho \log n)$ as shown in the previous subsection,
the overall complexity of generating a single tree
is $O(n + \log(n + \rho \log n))$.
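The procedure just described can be sketched as follows, with records as $(\ell, r, u, c, t)$ tuples and a linear scan standing in for the segment tree:

```python
def tree_at(records, x, num_nodes):
    """Build the sparse tree pi for site x by applying every record whose
    interval [l, r) contains x. A linear scan is used here for clarity;
    a spatial index such as a segment tree would find the intersecting
    records in O(k + log N) time."""
    pi = [0] * (num_nodes + 1)  # pi[j] is the parent of node j; 0 = none
    for l, r, u, children, t in records:
        if l <= x < r:
            for c in children:
                pi[c] = u
    return pi
```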
A common requirement is to sequentially visit all trees in a tree sequence in
left-to-right order. One possible way to do this would be to find all of the
distinct left coordinates in the $\vect{l}$ vector and apply the process outlined
above. However, adjacent trees are highly correlated and share much of their
structure, and so this approach would be quite wasteful.
A more efficient approach is given in Algorithm T below. For this algorithm
we require two `index vectors' $\indexin$ and $\indexout$ which give
the indexes of the records in the order in which they are inserted
and removed, respectively. Records are applied in order of nondecreasing
left coordinate and increasing time, and records are removed in nondecreasing