Multichrom example #2665
Conversation
Great! I'll take a closer look in a bit.
Super cool. Good to know about helgrind, too; I will look into that too! :->
I guess this is relevant to #176, and might inform how we deal with multiple chromosomes.
As I understand it, we could choose to make the two things tied together – to make the separate tree sequences begin/end at chromosome boundaries – but we don't have to do that, and there are pretty strong reasons not to do that, perhaps. (For example, you want to be able to keep multiple tree sequences to speed up even single-chromosome models; and also, chromosomes will be of very different lengths, whereas you want to divide the work of simplification as evenly as you can across your multiple tree sequences; and also, the number of chromosomes you are simulating will often not match the number of threads you have available for simplification.) So I would tentatively lean towards not connecting these two things.
I think it's not exclusive. If you want to represent multiple chromosomes, then this gives a nice way to do it. But there's no reason why you couldn't also do it on a single chromosome, or indeed on part of a chromosome within a multi-chromosome set. But I do take your point that there's an elegance in keeping the two things completely independent.
This is really about low-level C API stuff for now - how we might (or might not) surface this to higher-level things like representing multiple chromosomes is something we should wait and see about (which is, I think, what we're all saying!).
Codecov Report: All modified and coverable lines are covered by tests ✅

Coverage diff (main vs #2665):
- Coverage: 89.76% → 89.76% (-0.01%)
- Files: 30 → 29 (-1)
- Lines: 29892 → 29886 (-6)
- Branches: 5803 → 5803
- Hits: 26832 → 26826 (-6)
- Misses: 1753 → 1753
- Partials: 1307 → 1307

Flags with carried forward coverage won't be shown.
@petrelharp @bhaller - I realised there was an easy way to join the tables by appending all the edges to the first table, sorting, and then squashing adjacent edges. This should be equivalent to the haploid_wright_fisher example now (but that'll need to be verified).
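The append-sort-squash step described above can be sketched in a few lines. This is a self-contained toy, not the PR's actual code: `edge_t` and `squash_edges` are hypothetical stand-ins, and the real code would operate on tskit's edge table (which provides `tsk_edge_table_squash` for exactly this merge of abutting, same-parent/child edges):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical minimal edge record; the real code uses tskit's edge table. */
typedef struct {
    double left, right;
    int parent, child;
} edge_t;

/* Squash adjacent edges in a sorted array: merge consecutive edges that
 * share parent and child and abut exactly. Returns the new length. */
static size_t
squash_edges(edge_t *edges, size_t n)
{
    size_t j, k = 0;

    if (n == 0) {
        return 0;
    }
    for (j = 1; j < n; j++) {
        if (edges[k].parent == edges[j].parent && edges[k].child == edges[j].child
            && edges[k].right == edges[j].left) {
            edges[k].right = edges[j].right; /* extend the existing run */
        } else {
            edges[++k] = edges[j]; /* start a new run */
        }
    }
    return k + 1;
}
```

After appending all per-chromosome edges to one table and sorting, a pass like this leaves a single edge wherever a parent transmitted a contiguous stretch across chromosome boundaries.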
Ah, probably need to set the sample flags for the final generation. |
One more update - I've marked the sample flags and updated the original WF example so that it also indexes the output, and I think the outputs are now identical up to node reordering:
We don't reorder the samples from 0 - n in the new version. If you do want this you can run simplify on it again and it'll renumber the output. (You could do this more cheaply too if you wanted to, for compatibility purposes.)
I think we can meaningfully compare the timings of this with haploid_wright_fisher now @petrelharp and @bhaller
I think I know what's going on with the disappointing scaling here. Here's an example from running the code:

The numbers for the multichromosome example are numbers of edges before and after simplification. Notice that each worker has about one-half the total number of edges: there were 20M edges for the whole simulation (2 edges * 10K individuals * 1K generations), and each of the 8 workers has 10M edges to deal with. This is because: suppose that a chromosome has been inherited in …

Now, what does simplify have to do under the hood? It's got a data structure that records the collection of ancestral segments that it's tracking; and then it goes through edges one at a time and deals with them. This means that if the cost of dealing with each edge is O(1), then our maximal speedup would be …

However, the work required by simplify's data structure does scale with the number of segments it is tracking, to some degree: although simplify has to look at all those edges, it only has to actually do something when it encounters an edge that some of its segments move along. If this were a substantial contribution to runtime then we might expect better speedups. The things we'd need to know are (a) what portion of simplify's work is "do something to the segments" versus "look at the next edge"; and (b) once we divide up a chromosome into chunks, how much does that reduce the proportion of its edges through which a segment of ancestry passes? I'm not very clear on that point, but I suspect the answer to (b) is "not much"; if so, then there's no hope for a speedup.

However, this approach would work just fine if each worker got literally one chromosome, since then edge breakpoints would usually (well, half the time) fall exactly on the divisions between workers. I may modify the code here to have a look at that case, and see if we have wins there.
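To make the "each worker sees about half the edges" observation concrete, here is a self-contained toy (all names hypothetical): an edge that spans from a recombination breakpoint to a chromosome end overlaps many of the W equal chunks, so splitting the genome into W pieces does not split the edge list into W pieces - every per-chunk table gets a copy of each edge that intersects its chunk.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    double left, right;
} interval_t;

/* Count how many of the given edge intervals overlap chunk `c` of `W`
 * equal chunks of the unit interval [0, 1). */
static size_t
edges_in_chunk(const interval_t *edges, size_t n, int W, int c)
{
    double lo = (double) c / W, hi = (double) (c + 1) / W;
    size_t j, count = 0;

    for (j = 0; j < n; j++) {
        if (edges[j].left < hi && edges[j].right > lo) {
            count++;
        }
    }
    return count;
}
```

With one crossover per gamete (two edges, each covering roughly half the genome), every one of the 8 chunks is overlapped by about half of the edges, matching the observed 10M-of-20M per-worker counts.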
Aha! That's great @petrelharp, it really clarifies things 👍
I'm thinking it's worth fixing up the multichrom example to merge in? @bhaller is enthusiastic about getting this into SLiM at some point. I think this will just require switching over …
OK, I'll fix it up when I get a chance. |
I've updated this so that it's a clean build against upstream now @petrelharp. I think it needs a bit of documentation though - I don't really follow what the algorithm is doing. Could you update it with some comments to help clarify what the model is, etc.?
I think the docs build failure is because of a stale cache. The simplest thing is probably to create a new PR. I've squashed my original commits down to one.
I'm away on vacation for a couple of weeks, but:
Hi @benjeffery and @jeromekelleher. Do you really want to close this? It remains a useful example; I'm going to be consulting it carefully in the multichromosome work I'm doing in SLiM right now, in fact, and I can imagine other folks finding it interesting and useful too. I'm not saying you need to merge it necessarily, now or perhaps ever, but if it's closed, it seems to me that a fair bit of very useful knowledge and discussion will become much harder to find. Reopening so this comment doesn't get lost, but if you really want to close it, I won't get into a close/reopen war with you. :-> |
It's fairly esoteric knowledge @bhaller which I'm not sure is worth keeping a PR open in perpetuity for. We've learned the key lessons from it, I think, and moved on? |
But anybody who wants to actually use tskit in a multithreaded manner (which would be me, and very probably others, sooner or later) will continue to find it useful. Which is the point of an example, isn't it? Why purge it from the knowledge pool? |
If it's useful we should merge it I guess - I didn't think too hard about it tbh. Keeping it here in limbo isn't an answer. Does anyone want to have a look over and make suggestions for what would be needed to make this a useful documentation example? |
I nominate @petrelharp :->. In all seriousness, he knows the tskit ecosystem way better than I do, knows your doc system, and has been involved in this all along. Can I volunteer you, Peter? :-> |
I think the proposal is to add something to the docs, here. I could do that, sure. I'm imagining just talking through, like "we can have the tables for each chromosome share the same node and individual tables; here's how we do that". |
I can finish it up too @petrelharp, if you'd prefer to have a look through and post some review comments. |
Go for it! |
for (j = 0; j < num_chroms; j++) {
    if (j > 0) {
        /* Must not double free node table columns. */
        memset(&tcs[j].nodes, 0, sizeof(tcs[j].nodes));
As I mentioned on Slack, it seems safer to keep the shared tables zeroed out all the time except when they need to be there, to avoid accidental modification of one of the shared tables, use of stale data, stale pointers, etc.; there are lots of risks here. For SLiM I wrote a method that copies the main table collection's shared tables (we're sharing node, individual, and population) into the other table collections, and another method that zeros out those copies. Then I bracket operations like simplification and writing to disk with those methods. All the rest of the time the copies of the shared tables are zeroed for safety. Just a suggestion; might point people in a safer direction, even though it doesn't matter for this model as it is here.
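The "keep shared tables zeroed except when needed" pattern suggested above can be sketched with a pair of bracketing helpers. This is a hypothetical, self-contained illustration: `node_table_t`, `share_nodes`, and `unshare_nodes` stand in for tskit's `tsk_node_table_t` and the SLiM methods described, and only the struct-copy/zero mechanics are shown:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical stand-in with the struct-of-columns shape that matters
 * for the sharing trick (the real type is tsk_node_table_t). */
typedef struct {
    double *time;
    size_t num_rows;
} node_table_t;

/* Before a bracketed operation (simplify, writing to disk), copy the main
 * collection's shared table struct into a per-chromosome collection. The
 * column pointers are shared, not duplicated. */
static void
share_nodes(node_table_t *dest, const node_table_t *src)
{
    memcpy(dest, src, sizeof(*dest));
}

/* Afterwards, zero the copy again so stale pointers cannot be used by
 * accident and the shared columns cannot be double-freed. */
static void
unshare_nodes(node_table_t *dest)
{
    memset(dest, 0, sizeof(*dest));
}
```

Bracketing simplification with `share_nodes(...)` / `unshare_nodes(...)` means the only window in which two collections alias the same columns is while the operation that needs the aliasing is actually running.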
printf("Simplify %lld nodes\n", (long long) tcs[0].nodes.num_rows);
sort_and_simplify_all(tcs, num_chroms, samples, N);
for (j = 0; j < num_nodes; j++) { |
rather than this loop, I used calloc() for keep_nodes; the kernel can provide zeroed pages faster than any loop can, even if the compiler turns it into memset()...
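The `calloc()` suggestion looks like the following in isolation (a minimal sketch; `alloc_keep_flags` and the flag type are hypothetical, the real code's array is `keep_nodes`):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>

/* calloc returns zero-initialised memory directly, so no explicit
 * clearing loop is needed: every node starts out marked "drop". */
static bool *
alloc_keep_flags(size_t num_nodes)
{
    return calloc(num_nodes, sizeof(bool));
}
```

The caller then only sets the flags for nodes it wants to keep, and frees the array when done.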
for (j = 0; j < num_chroms; j++) {
    edge_child = tcs[j].edges.child;
    edge_parent = tcs[j].edges.parent;
    for (k = 0; k < tcs[j].edges.num_rows; k++) {
Here and at line 160, I put `tcs[j].edges.num_rows` into a local temporary; I think the compiler might have to refetch the value from memory every time through, otherwise, since it might not be smart enough to realize that there is not a `restrict` issue here (i.e., that the loop's work is guaranteed not to change the value of `num_rows` through a pointer write)...
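The hoisting being suggested looks like this in isolation (a self-contained sketch with hypothetical types; the real loop walks `tcs[j].edges`):

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for the columnar edge table layout. */
typedef struct {
    int *child;
    size_t num_rows;
} edge_cols_t;

static long
sum_children(const edge_cols_t *edges)
{
    /* Hoist the row count into a local once, so the compiler need not
     * re-load edges->num_rows on every iteration in case a write through
     * `child` could alias it. */
    size_t k, n = edges->num_rows;
    long total = 0;

    for (k = 0; k < n; k++) {
        total += edges->child[k];
    }
    return total;
}
```

This is safe here precisely because the loop body is guaranteed not to change the row count; if the loop appended rows, the bound would have to be re-read.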
    ret = tsk_table_collection_check_integrity(&tcs[j], 0);
    check_tsk_error(ret);
}
for (j = 0; j < N; j++) {
Here I actually have a question, and maybe a comment would be useful. The question is: is the result of filtering the node table ourselves, with `tsk_node_table_keep_rows()` as done here, different from letting simplify do the filtering for us? Is the order of the nodes that results different? I ask because SLiM's current code assumes that the N samples are in positions 0:(N-1) in the node table after simplification. The code here kind of lumps the samples together with all of the nodes referenced by edges, and tells tskit "keep all of these", without distinguishing the two; so I'm guessing that SLiM can no longer make that 0:(N-1) assumption, and needs to remap the N samples as you do here, using `node_id_map`. But I'd like to understand why. Does simplification filter the node table in a different and more sophisticated way? Or is there some other reason why SLiM's assumption happened to be valid? @petrelharp wrote that code for SLiM, I think, so maybe he'll want to weigh in on this.
The reason for doing the node filtering ourselves here is that the different per-chromosome simplify operations will come up with different node mappings. That's the only real reason for using `tsk_node_table_keep_rows`. You can't assume that the samples map to 0,...,n-1 in this case, no (and you would need to do the remapping).
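The remapping being discussed can be illustrated with a small self-contained sketch (names hypothetical): filtering a node table gives an old-id to new-id map, with dropped rows mapping to -1, and sample ids must be pushed through that map. This mirrors, if I read the API right, the id map that `tsk_node_table_keep_rows` reports through its `id_map` argument.

```c
#include <assert.h>
#include <stddef.h>

/* Build the old-id -> new-id map produced by filtering a node table:
 * kept rows get consecutive new ids, dropped rows map to -1. */
static void
build_id_map(const int *keep, size_t num_nodes, int *id_map)
{
    size_t j;
    int next = 0;

    for (j = 0; j < num_nodes; j++) {
        id_map[j] = keep[j] ? next++ : -1;
    }
}
```

After filtering, a stored sample id `s` becomes `id_map[s]`; since the samples are not distinguished from other kept nodes, nothing guarantees they land in positions 0:(N-1).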
}

void
simplify_tables(tsk_table_collection_t *tcs, int num_chroms, int *samples, int N)
I'd suggest a name like `samples_count` instead of `N`, but maybe that's just me :->
Stacked on #2663 (and #2619).

@petrelharp @bhaller @molpopgen You might be interested in this: the new file c/examples/multichrom_wright_fisher.c gives a simple example of multi-chromosome simulations with parallel simplify (most of the other stuff is from other in-flight PRs).

This adds a simple example of running a multi-chromosome forwards simulation and running the sort/simplify steps in parallel using pthreads. This was really just to make sure that the basic idea of sharing the node table across the different table collections would work, and that we were getting the details right with the new TSK_SIMPLIFY_NO_FILTER_NODES and TSK_SIMPLIFY_NO_UPDATE_SAMPLE_FLAGS flags.
Running this through helgrind gives no errors (I think it's pretty good at picking up data races):
So, hooray, @molpopgen's ingenious idea works and with the infrastructure in #2663 and #2619 we can do it pretty easily!
Another thing I ended up doing here in 31c47e4 is to add a basic implementation of "delete rows", which is an operation we badly need in the API I think (mentioned here: #1034 (comment)). I think it's worth talking about that separately, so I'll open an issue for it.
We can probably let this sit for a while now anyway, and should try to clear up some other stuff first.