[C code] DTWF performance #878
Comments
@DomNelson, any thoughts?
@jeromekelleher absolutely, sounds like a good plan. As I remember it, we only updated the Fenwick tree to simplify hybrid simulation models, so if that can be done another way I don't see any reason to keep it. I imagine we'll want something similar for the pedigree sims too then. Any idea what kind of improvement we could see?
Dunno, we'll see once @daniel-goldstein has a chance to do a little benchmarking on it.
I fiddled with a few parameters to get some benchmarks, and here are some times looking at sample size. Will do some profiling to investigate perf fixes.
Good stuff, thanks @daniel-goldstein. What's your sequence length here? 1e-10 is quite a low recombination rate (1e-8 and Ne=10^4 is the usual human regime). I think a good benchmark would be Ne=10^4, r=1e-8 and L=100 megabases, and to run this for a fixed number of generations (say, 100 or 1000, depending on what's feasible). That's what we really want DTWF for, I think. (Also, for Ne=10^5 and 10^6)
Whoops, I had written down 1e-10 for recombination rate (and quite a short sequence length since I thought that didn't affect much). I'll rerun the sim with the above specs.
Also, I'm seeing ~70% of simulation time spent in malloc/free.
Great - sounds like some low-hanging fruit for perf then!
Adding the merge queue to the
Removing the unnecessary calls helped. However, DTWF simulations separately begin to struggle very quickly. I have not yet been able to find something in the implementation explaining the slowdown that is easily fixable, and profiling has shown a lot of time spent in merging ancestors and tracking overlap counts. @DomNelson, any thoughts on this? Are these numbers expected?
I did some work on this, and I think the culprit is a massive per-generation calloc. Still have some tests failing (zero_pop_size); will submit a pull request once that is fixed. Question: using avl as a generic container feels a little hacky. @DomNelson @daniel-goldstein
That's awesome @ivan-krukov! Do you have numbers from profiling of where time was being spent, or details of what other parameters you're running? I was hoping it would be some wasted-memory problem, but I wasn't able to uncover it in profiling.

An avl tree seems reasonable to me, because otherwise we would be dealing with a very sparse array of parents. Basically we want the dictionary-like behavior we have in Python, but we have to roll our own dictionaries, and this is how we'd do it. However, I do think you're right that there's a little unnecessary complexity in how you're using it currently. An avl node wraps a record (a key-value pair in our case) that knows nothing about the avl tree; the avl tree then manages the connections between the nodes that allow for quick querying/traversal. So I think your record type can be simpler than it currently is.

When you're iterating over the tree you can start at tree->head, which gives you an avl node with a pointer to the next node in the tree. Here's a small example of iterating over the nodes in an avl tree (lines 768 to 774 in ae6402c).
When updating the tree, I would recommend the strategy of search --> if there then update --> if not then insert new.
You're right, we don't seem to use that.
Awesome! I still need to do some refactoring, so your comments are on point. I lifted the iteration from the existing code. Wrt search first, insert after: makes sense. Thank you!
This is fantastic, thanks @ivan-krukov! Can you open a PR with your changes? It'll make it easier to see the diff and discuss. Don't worry if it's not fully ready for action.
Working on the benchmarks now. Also looks like I need to add a few more tests. As far as the code goes, I needed to add a new structure. If you think that it would be better to be more explicit about the data structures used, I won't mind adding a generic container type.
The runtime is less clean than I had expected. It seems that I am paying an AVL price to maintain the tree with a large sample size. However, with smaller sample sizes, it seems to be quite beneficial. The runtimes below are based on:
I am not sure why the runtime decreases with sample size here.
@ivan-krukov Interesting, and great for the small sample sizes! I also don't fully understand the drop, but have found that different configurations can go through the first 100 generations in very different ways. Are you able to see what the timings look like for the whole simulation? That might give some more insight. My guess would be that when the sample size is close to Ne there are more coalescences early on, which make those 100 generations take longer, but that's a rough guess.

As for why the avl tree is taking longer, maybe we're getting good cache behavior when Ne is small, so allocating the full array + quick access beats the overhead of managing the AVL tree. But that can only go so far. Is the calloc the primary factor in the spike from Ne=10^7 to 10^8? @jeromekelleher Do you think we're facing a tradeoff with the cache, or am I in the wrong ballpark here?
This looks like an excellent improvement, well done @ivan-krukov! I've no real idea what's going on with the drop in run time, but I guess there could be strong stochastic effects here, so maybe we should be wary of single realisations. I think we need to run a few examples through linux perf to get more insight. I'd be interested to see the profile of n=1000, Ne=10**7 for both (since they're about equal). This should give us a good view into the tradeoffs.
I wonder if the whole sims will be less informative, since their runtime will be dependent on the total coalescent time. But I will check. The calloc is not the primary issue on its own; it's iterating over all the elements that was the problem, I believe. Which is why this was not seen in the profiler: the runtime was spread over several calls. I think a line-based profiler would have picked this up. The scaling with respect to the population size looks vaguely logarithmic to me, but it remains quadratic with respect to the sample size. I've replotted the same numbers, but faceted on the population size this time:
Really interesting, thanks @ivan-krukov. I'd be surprised if the big uptick we're seeing for your code on large sample sizes is real (well, more accurately, something we can't easily fix). Let's run this through perf to get some insights. I'm guessing there's a crapton of little mallocs happening there because we're not currently using the local allocators for this yet. (One minor request: would you mind drawing lines on these plots if you're doing them again? For some reason I just can't see the trends as easily when it's just points!)
Dropping this one out of the 1.0 list; we don't have time to get this done before release now.
Run a performance analysis on the DTWF for a large simulation and see where the time is being spent. Since it seems like we're not using the Fenwick tree at all within DTWF simulations, it may be worthwhile 'turning off' Fenwick tree updates during a DTWF sim (which we should be able to do easily enough now, with the new abstractions).
If we do, we can update the Fenwick tree after we finish the DTWF phase so that the simulation state is correct (we want to be able to transition into a Hudson simulation if necessary).