Convert tsk_id_t/tsk_size_t to 64 bit. #343
Note: arguably we should go the whole hog here and typedef …
It seems like we need to allow bigger tables (see #402), but on the other hand, the implications for memory usage are pretty unfortunate. The least bad other solution I can think of is to have tables switch over from 32-bit to 64-bit once they get big, although that'd be quite annoying.
I think we'll probably just have to bite the bullet and make everything 64-bit. Having a tsk_table_t and a tsk_table64_t would be a huge amount of extra work. Let's do a prototype and see what the actual memory costs of going to 64-bit are. It'll probably make no practical difference for the vast majority of tree sequences. We should make the storage code more flexible, though, and only store in 64 bits if we need to. This will create a little bit of forward-backward incompatibility, but nothing we can't manage (the versioning infrastructure is in there just for this type of thing).
The memory impacts will be appreciable for larger sims. For example, if I modify the script from #402 to just be plain W-F:
And run the following command:
The timing is:
So, that's 32GB. Roughly, we can reason that the memory is mostly taken up by edges. The simulation has no mutation, so we'll assume the size of individual genomes is negligible. Changing the node integer to uint64_t bumps the total size of an edge up by 1/3, so we can guesstimate that the RAM use will go up accordingly, to about 40GB. The same model in fwdpy11:

```python
import fwdpy11
import numpy as np

N = 1000000
pop = fwdpy11.DiploidPopulation(N, 1.)
pdict = {'gvalue': fwdpy11.Multiplicative(1.),
         'nregions': [],
         'sregions': [],
         'recregions': [],
         'rates': (0, 0, 0),
         'demography': np.array([N]*500, dtype=np.int32)
         }
params = fwdpy11.ModelParams(**pdict)
rng = fwdpy11.GSLrng(42)
# the 500 is the time b/w simplifications.
# here, the parameter means once at the end.
fwdpy11.evolvets(rng, pop, params, 500)
```

The same type of command for timing gives:
So, pushing 70GB. It may be important that the back-end does not copy edge tables to sort; rather, the sorting is in-place. I'm not convinced, though, that that difference affects peak RAM use, as the memory would be freed prior to simplification. In any case, multiplying that number by 1.3 puts us close to 90GB. Simplifying every 100 generations is a big help:
I have not tested the data dump from the …
Quick update. I hacked fwdpp to use 64-bit signed integers for the tree sequences and reran the example from above. The memory use wasn't as bad as I'd guessed, so my loose approximation was too loose:
So, that is about a 10% performance hit, but I have to admit to laziness. I merely changed a typedef. There are probably loads of narrowing conversions, etc., going on. I was just playing.
Thanks @molpopgen, this is good info. The more I think about it, the more I wonder whether we should be using tables for forward simulations at all, though. (I'll send you an email about this.) Do we think that the simplified output of a forward simulation is ever likely to need 64-bit tables?
Tricky question. Personally, I am not interested in attempting to simulate "all humans", but others may be. More generally, … For this particular issue, we need offsets larger than the current type allows for metadata. For #402, we haven't totally established what is going on, or at least I'm not clear that we have.
Indeed!
Update here: #402 was indeed due to running out of space in the metadata column; other columns (e.g. the node ID column) are a factor of 10 smaller, but will still be hit pretty quickly in sims of size 1M. Some additional discussion there.
A user has just posted on slim-discuss (https://groups.google.com/g/slim-discuss/c/lCtps2FEBxI) with a model that cannot run under tree-sequence recording because it causes the error "tsk_mutation_table_add_row: Table column too large; cannot be more than 2**32 bytes." (For that user, it crashes, in fact; unclear why, have requested a crash log, but the table overrun seems to be the actual culprit, and running the model myself, I get the nice error message rather than a crash.) The model is quite simple: it creates a population of size 50,000 (diploid) and then loads a 2GB VCF file with a bunch of mutations. It runs fine with tree-sequence recording turned off. Since the table overflow happens immediately, during the load of the VCF file in the first generation, it is an example of a case where even the simplified form of a tree-seq forward simulation exceeds the table size limit.

The details of the use case are interesting, however. The model is not that big. 50,000 individuals, and the VCF file contains only 9760 mutations. The catch is that, because those mutations are loaded from a VCF file, there is no ancestral relationship between the various instances of each of those 9760 mutations. If a given mutation is possessed by 40,000 individuals, it gets put into the mutation table 40,000 times, because there is no ancestral edge to put it on from which the 40,000 individuals descend, so as far as tskit is concerned it is not one mutation, but 40,000 independent mutations. I guess that if the user were to run tsinfer on their VCF data, and then load the inferred tree sequence into SLiM rather than loading a VCF, then the model might be quite manageable, in fact (assuming the VCF data makes sense from an evolutionary perspective, i.e., a reasonable tree sequence can be inferred from it, I suppose). Anyway, the result is that the memory usage for the mutation table's metadata column overflows due to too many bytes. Given that this model really isn't that large, it would probably be easy to construct a similar example for which the number of rows in the mutation table overflowed as well, not just the number of bytes, while still being quite reasonable for an end user to want to be able to run. (FYI, the math: 50000 individuals * 9760 mutations * an average frequency of 0.5 * 20 bytes of SLiM metadata per mutation instance == 4.88e+09 bytes, which is greater than 2^32. So it is indeed quite plausible that this use case overflows the mutation metadata column.)

So, maybe this case is not a super-clear example of a need to move to 64-bit indexes, although I thought it was when I wrote my reply to the slim-discuss list (I'll follow up shortly over there). Maybe it is an example of a need to infer a tree sequence prior to starting simulation, in order to make the simulation tractable (which is interesting; that hasn't come up before, for me). On the other hand, it is also true that if tskit were using 64-bit ints for these things, this model would have loaded and run just fine; and I can see users perhaps not wanting to run tsinfer first, if they don't want their simulation to be conditioned upon a particular inference about the ancestral relationships of the individuals being simulated. So I think, on balance, this probably does argue for a need for 64 bits in tskit; the use of tsinfer would just be a workaround, assuming it does serve to fix the problem at all. Curious to hear all your thoughts about this!
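For anyone who wants to double-check that back-of-the-envelope figure, here it is as a tiny standalone C snippet. The 20 bytes of SLiM metadata per mutation instance is the figure quoted above, and the limit is the 2^32-byte column size from the error message:

```c
#include <stdio.h>

int
main(void)
{
    /* Figures from the model described above. */
    double individuals = 50000;
    double mutations = 9760;
    double mean_frequency = 0.5;
    double metadata_bytes_per_instance = 20;

    double metadata_bytes = individuals * mutations * mean_frequency
                            * metadata_bytes_per_instance;
    double column_limit = 4294967296.0; /* 2^32 bytes */

    printf("metadata bytes needed: %.3g\n", metadata_bytes); /* ~4.88e9 */
    printf("column size limit:     %.3g\n", column_limit);   /* ~4.29e9 */
    printf("overflows? %s\n", metadata_bytes > column_limit ? "yes" : "no");
    return 0;
}
```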
My thoughts... are (still) that this falls firmly in the category of things I don't want to think about, but ought to do something about. Thanks for the reminder. (I agree, tsinfer should fix this user's problem. And the mysterious error might be due to running out of memory on the cluster?)
He said he had a 500 GB memory limit on his cluster process, I think, so if that's true (open to question; he seems to be a newbie) he should have hit the error message, not crashed. Hard to guess/debug, and I'm not particularly worried about that aspect of it. :->
Thanks for the info here @bhaller, all very interesting and useful. I'm sort of with Peter here, in that I agree this is something we need to sort out, but I'm reluctant to actually take it on in the short term. In reality, it will take up a couple of months of dev time to resolve, and I feel like there's more urgent stuff that needs doing and that this is still a relatively remote corner case.

In this case, I think 64-bit tables would have allowed the model to run, but I'm not sure that the resulting tree sequence would have been much use - ~50K mutations per site is very ugly, and I suspect actually trying to do anything with the resulting tree sequence (like computing stats) would be very slow. If we're going to use this as a motivating example for taking on the BigTable work, then we should try it out to see what working with a tree sequence like this is actually like (i.e., on something just big enough to fit into the 32-bit space).

The idea of inferring a tree sequence from the data before running the simulation is very interesting! I think it would probably work quite well in this case, assuming the data isn't very noisy. We'd need to put some work into improving the toolchain to make this easy, though, if we did recommend this as an approach for users.

So, I'm not sure I have any answers here, but you asked for thoughts @bhaller, so there they are 😉 One question @bhaller:
Do you think this is a bug in tskit? What kind of crash, e.g., do we get a segfault or what?
This is a tricky one. It'll be useful to move to 64-bit, but it is a lot of work. On the other hand, it seems that simulations with tree sequence recording should be "all in", and that starting them from a VCF should at least trigger a warning.
Unfortunately I have only the output line "Command terminated by signal 11" and the user's claim that this error is a segfault. I've asked for a proper crash log, haven't gotten a reply to that yet. No idea whether it's a tskit bug (unlikely?) or a SLiM bug, probably from ignoring a tskit error return somewhere (likely?). Sorry.
Hmm; really? I see the objection, but it's not really wrong, just non-optimal, right? What do you think @petrelharp?
I view the node field as the evolutionary origin of the mutation, not simply the data structure origin. My view led me to disallow this entirely--if you want tree sequences, you either start w/one or start "empty".
Well, that's a philosophical point, I think. I think Ben's right in that it's not wrong per se, just not obviously a good idea either. I think it's worth having a look at the tree sequences produced all right though, as users might be confused if they are massive and really slow to process.
The user replied; he says his cluster admin says that there is no crash log. He has a 1.5 GB core dump file. I have no idea what to do with those. :-> I downloaded it, and it appears to be a big binary file. If you know how to get a symbolicated backtrace out of a core dump, the download link is https://drive.google.com/drive/folders/14KyP82kHqnaRbJsCEW0LbQGA17dlMTPU?usp=sharing. I don't know what the hardware/software platform is, but I could ask. The file appears to start with a tag, "ELF>", that indicates it is an ELF core file, whatever that is (https://stackoverflow.com/questions/5986366/elf-core-file-format I guess). :->
I agree - not wrong, but also not the right way to do things. I don't think it's worth chasing down this crash any further, either.
I think the user would need to do a GDB backtrace on the core file, like this SO post. We can't do it because you need the original executable.
Note that this has indeed shown up in the wild in MesserLab/SLiM#198 (resulting in #1509).
I gave it a whirl. Unfortunately, the naive approach produced a crash:
This is in expand_column(), at the call to realloc(). I tried looking at the crash with a debug build, but in that case, the crash no longer occurs. :-> In a release build, on the other hand, it seems to be 100% reproducible, but I can't tell what's going on since the debugger can't get values and such reliably. So, it looked like a bit of a headache to debug, and the crash might well be due to something that the compiler is already warning about anyway – no point in spending a bunch of time tracking that down. So I stopped. :->

So I think the next step is probably to make a test 64-bit branch of tskit, get the warnings fixed and the tests all passing on your end, and then we can try merging that version of the code into SLiM and trying it out.
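For what it's worth, this is the classic failure mode when 64-bit sizes are introduced by changing a typedef alone: some intermediate size arithmetic still happens in a 32-bit type, wraps for a big table, and realloc() then hands back a buffer that is too small, so the corruption only surfaces later (and often only in release builds). The following is purely an illustrative sketch of that class of bug; it is not tskit's actual expand_column() code, just the shape of a narrowing-conversion bug that the compiler warnings might be pointing at:

```c
#include <stdint.h>
#include <stdlib.h>

/* Illustrative only: a column-growing routine where the row count has been
 * widened to 64 bits, but the byte count is still computed in a 32-bit
 * temporary. */
static int
grow_column_sketch(void **column, uint64_t new_max_rows, size_t element_size)
{
    /* BUG: the multiplication is truncated to 32 bits, so for a large table
     * realloc() is asked for far fewer bytes than intended. */
    uint32_t new_size_bytes = (uint32_t) (new_max_rows * element_size);

    void *tmp = realloc(*column, new_size_bytes);
    if (tmp == NULL) {
        return -1; /* out of memory */
    }
    *column = tmp;
    /* Subsequent writes beyond new_size_bytes then corrupt the heap, and the
     * crash typically shows up later, inside a later realloc()/free() call. */
    return 0;
}
```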
My thinking is that we should grasp the nettle here and make a decision; now is the time to do this, as we would like C 1.0 as a Christmas present at the latest! I think getting a rough C-only branch and seeing how bad things are in SLiM is the first step, as @bhaller says. If performance is bad, we could consider only using 64-bit values for the offset columns, as that seems to be the first place that the limits are hit.

I've been trying to think through the implications for the Python side. The 64-bit LWT I think will quite happily accept a 32-bit LWT without any extra logic - the LWT code uses numpy "safe" casting to load from the dict. This means a 64-bit msprime could load a dict from a 32-bit tskit dict. However, heavily used things like …
OK, let's get the simple groundwork done anyway, so we can have a better view on what the actual complexities are.
OK, I merged in the 64-bit experimental PR to SLiM and ran some tests using a simple SLiM model that uses tree-seq recording, based on the model from MesserLab/SLiM#198 that caused 32-bit tskit to crash in #1509. Results:
So, the speed impact is quite small; I personally think that is not a big deal, and indicates that there's no need to jump through hoops to avoid it (supporting running in either 32-bit or 64-bit mode, etc.).

When simplification is off, the peak memory usage is actually larger for 32-bit than for 64-bit; that is presumably just due to stochasticity. (I used the same random seed for all of these runs, but memory usage of a process on macOS is stochastic based upon factors external to the process, for some reason; so I'd have to do multiple replicate runs to get a good mean value for peak memory usage.) But overall it looks like it is not a problem.

When simplification is on, the peak memory usage is quite a bit higher for 64-bit – enough higher that people might care. If these memory usage stats need pinning down, I'm happy to do replicate runs.
Hm, thanks! The speed impact is small, I agree. Memory is a substantial worry, because it constrains simulations in practice. In the second point, you're saying that the 64-bit version without simplification uses less memory? For the version with simplification, that's an increase of about 50%, pretty big - but, if you think there's substantial variation in that number, I think it's worth doing some more runs to nail that down.
Thanks for this @bhaller, this is super helpful. Thinking through the numbers again, it's getting harder for me to see going to 64-bit IDs as being worth it.

Assuming there's no metadata, one row of the node table requires 24 bytes (time=8 bytes; flags=4, individual=4, population=4, metadata_offset=4). If we want 2^31 of those rows, we need 48GiB. That's just the node table - if we wanted an actual meaningful tree sequence to go with this, we'd surely be talking about 100GiB of edges. Given that we make copies of the table data a few times during loads, etc., you would definitely need 1TB of RAM to work with a tree sequence this large, if not 2TB. (These numbers are for 2^31 rows, which we can already do.)

Where would such a tree sequence come from? Well, it would have to come from a forward simulation, I think. If it came from a forward simulation, then either it's raw unsimplified output (so, wouldn't you be better off simplifying it down, 99.9999% of the time?), or it is the simplified result of an absolutely gigantic simulation which was run on a machine with hundreds of terabytes of RAM. So, to me, the only time people are going to actually want to create tables with this many rows is (a) if they're making a mistake and should be simplifying more often, or (b) they're living 10 years in the future (and they've still probably made a mistake!).

The metadata is a different issue, and I can definitely see an argument for making all the offset columns 64-bit.
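To make that accounting easy to tweak, here is the same back-of-the-envelope calculation as a small C program. The per-row layouts follow the comment above (node row: time 8 + flags 4 + individual 4 + population 4 + metadata_offset 4 bytes; edge row: left 8 + right 8 + parent 4 + child 4); the 64-bit variants assume IDs and offsets widen to 8 bytes while times and flags stay as they are, which is an assumption of this sketch rather than a settled design:

```c
#include <stdint.h>
#include <stdio.h>

int
main(void)
{
    uint64_t rows = UINT64_C(1) << 31;        /* 2^31 rows */
    double gib = 1024.0 * 1024.0 * 1024.0;

    /* Bytes per row across the columnar storage, assuming no metadata. */
    uint64_t node_row_32 = 8 + 4 + 4 + 4 + 4; /* 24 bytes today */
    uint64_t node_row_64 = 8 + 4 + 8 + 8 + 8; /* 36 bytes if IDs/offsets widen */
    uint64_t edge_row_32 = 8 + 8 + 4 + 4;     /* 24 bytes today */
    uint64_t edge_row_64 = 8 + 8 + 8 + 8;     /* 32 bytes if IDs widen */

    printf("node table at 2^31 rows: %.0f GiB (32-bit) vs %.0f GiB (64-bit)\n",
        rows * node_row_32 / gib, rows * node_row_64 / gib);
    printf("edge table at 2^31 rows: %.0f GiB (32-bit) vs %.0f GiB (64-bit)\n",
        rows * edge_row_32 / gib, rows * edge_row_64 / gib);
    return 0;
}
```

This reproduces the 48GiB node-table figure; a meaningful tree sequence with that many nodes would typically carry several times as many edge rows, which is where the ~100GiB-of-edges estimate above comes from.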
Ah-ha. I think you're exactly right, @jeromekelleher.
No worries.
Well, I think it is not unreasonable to push it out further. However, I think "10 years in the future" is actually kinda arriving now. :-> There are certainly SLiM users already doing runs that require terabytes of memory; big clusters are commonplace now, already. There are certainly users wanting to do runs that involve more than a million diploid individuals per generation; that fills up 2^31 in only ~1000 generations without simplification, right? (I've had one person so far who wanted to do a billion diploid individuals per generation, which would fill up 2^31 in, uh, ~1 generation, but that seems a bit out of reach still :->.) And there are lots of people wanting to do runs that are unsimplified, or that "remember" so many individuals that they might as well be unsimplified, for various purposes; I don't know how many such people are doing this, but "wouldn't you be better off simplifying it down, 99.9999% of the time?" is definitely an overestimate :->. Also, tons of edges to go with the tons of nodes is not always the way it goes; there are lots of people simulating species that are entirely/mostly clonal, or that have a low recombination rate, or – perhaps even more commonly – they might just be simulating a short stretch of the genome rather than the whole thing (and thus blow up the node table without blowing up the edge table). All of this is happening now.

I think we will definitely want to move to 64-bit; the only question, in my mind, is whether to bite the bullet and do it now, or put it off until it is more of a forced move. Perfectly reasonable to put it off, but if the price paid is smallish it might be better to just do it early, for the 1.0 C API release (if I understand correctly that that is the choice, at the moment), and avoid growing pains later.

I agree that metadata is the place where 64-bit is presently necessary – the only place where this is really becoming urgent. I think all of the cases of table overflow that we've seen so far have been due to metadata, right @petrelharp? So if you want to just shift that to 64-bit and keep the rest 32-bit for now, I have no objection at all. Or if you want me to run more replicates to get a better estimate of impact, I can do that.
Right. And, my vote is with putting off entirely-64-bit (just doing offsets now), since a 50% memory increase would be pretty painful now, and it's not clear to me that the transition will be more painful later.
Hmmm, very interesting, thanks Ben. How would you feel about the final output of a forward simulation needing to have 32-bit IDs (for compatibility with the rest of the tskit world, at the moment), but having 64-bit IDs supported internally so that these big models could at least be simulated, and simplified down to something more manageable at the end? I don't think it would be very hard for us to support 32- and 64-bit builds of tskit C as a compile-time option, and that could be the right compromise?

So, to be concrete, I'm proposing that SLiM would build tskit in 64-bit mode, but the library would raise an error if someone tried to save a file having tables with more than 2**31 rows. This leaves the door open for all of tskit moving to 64-bit some day in the future, while hopefully enabling the very large simulations you want to do, without too much compromise. I guess the advice for users who want to run massive models like this is that they should simplify down/subset or something at the end of the simulation, so that they can analyse their results.
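A minimal sketch of what that compromise could look like at the C level. All names here (TSK_64BIT_IDS, sketch_id_t, check_tables_storable_32bit, TSK_ERR_TABLE_TOO_LARGE_FOR_32BIT) are hypothetical placeholders, not existing tskit API; sketch_id_t simply stands in for tsk_id_t:

```c
#include <stdint.h>

/* Hypothetical compile-time switch: SLiM's build would define TSK_64BIT_IDS. */
#ifdef TSK_64BIT_IDS
typedef int64_t sketch_id_t;
#else
typedef int32_t sketch_id_t;
#endif

/* Hypothetical error code, returned when a 64-bit build is asked to write a
 * file that a 32-bit build could not read back. */
#define TSK_ERR_TABLE_TOO_LARGE_FOR_32BIT (-1000)

/* Check applied at dump time: row IDs are signed, so anything beyond
 * INT32_MAX rows cannot be represented in the 32-bit format. */
static int
check_tables_storable_32bit(uint64_t num_rows)
{
    if (num_rows > (uint64_t) INT32_MAX) {
        return TSK_ERR_TABLE_TOO_LARGE_FOR_32BIT;
    }
    return 0;
}
```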
That's an interesting option - I don't like the memory consequences of always doing 64-bit tables in SLiM, though. Maybe an option to build 64-bit-table-SLiM?
I think this sounds like overcomplicating things. I don't want to have to support multiple builds of SLiM, and I don't imagine you folks want to support multiple builds of tskit either; just that much more testing and other tsuris. If nobody has yet hit this for anything other than metadata, and we're not sure how far in the future it will be before somebody does, and the memory impact is big enough to be worrisome, then maybe let's put it off. In the Future, using twice as much memory probably won't bother people as much as it does now, too. In the Future, everything will be easy and our problems will just melt away. :-> So let's wait for the Future to arrive (apart from 64-bit metadata). I'm sufficiently convinced.
Realizing that my previous comment might sound snarky, let me clarify that that is not my intent. :-> My point is: I think eventually we will want to go full 64-bit, but we are not there yet; apart from metadata, no user has hit this limit. How long it will take before a user does legitimately hit the limit is hard to guess, who knows. If we needed to do it now, then maybe separate 32/64-bit builds would be the right solution, because the memory penalty involved is presently pretty painful. If it turns out that we don't need to do it for 5 or 10 years, which is certainly possible (although my guess is sooner), the memory penalty would probably feel less painful, and we could simply abandon 32-bit and go right to 64-bit. That would be cleaner/easier/better. So let's kick the can down the road (apart from metadata), and see whether the Future solves our problems for us. :->
So I suggest repurposing this issue to represent "metadata should be 64-bit", and effectively closing the remaining "tskit should be 64-bit" issue as a side effect. We will revisit that larger issue when somebody brings us a real-world case that is not pathological. Sound good?
I've made #1526, which has 64-bit … The current uses of 64-bit IDs seem niche enough and "power user" enough that they could be supported by a compile-time flag in the C code, with no support in the Python, and no compatibility with 32-bit land. I'm not suggesting we do this now, but we could if there was demand. I think fixing the immediate metadata issue only is the right thing for C 1.0.
I will try it out in a SLiM build and report back here, thanks!
OK, so. Again, the test model without simplification runs to 50,000 generations with no problems, unlike with 32-bit tskit, so this PR still fixes the bug at hand. I did 5 replicates of each type of run this time, so you can get a picture of the variance. It's not enough replicates to really do great stats, but it's something. :-> Take-home point: times are statistically identical, memory usage looks to be ~10% higher. Details below. I think this is acceptable overhead for the benefit; I think we should do this.

For 32-bit sizes:
20000 generations, no simplification:
20000 generations, simplification every 500:

For 64-bit sizes:
20000 generations, no simplification:
20000 generations, simplification every 500:

Times are statistically identical between 32-bit and 64-bit, for both models (p=0.31, p=0.18). Memory usage is harder to say, because it is so oddly stochastic on macOS. I wish I knew what caused that. All of these replicates use the same random number seed, so they should be doing literally exactly the same thing every time, near as I can figure it, and yet their memory usage is widely divergent. Without simplification, the noise from that variance makes it pretty hard to see a pattern; the two highest-usage runs are for 64-bit sizes, but then again, the two lowest-usage runs are, too, and stats says there's no significant difference (p = 0.4055). With simplification, for whatever reason, the variance is lower, and we can see a significant (p = 4.012e-05) difference in memory usage, which appears to be about 10% higher for 64-bit.
Huh - memory usage is either 10% higher or 2x higher... ?? It's also kinda worrisome that your 32-bit tests have simplification being longer and taking more memory, but that's a separate issue...
As far as I can tell, the high variance is because there is, as well as stochasticity, a kind of granularity in the way macOS allocates memory to processes. Note this set of results, for example: "5314.6, 3157.5, 5168.0, 3139.7, 5152.6". Either the process takes ~3 GB, or it takes around 5 GB; it's bimodal. I've noticed this many times. When the memory usage of a process is close to a threshold, you see this pattern; either it stays below the threshold and registers a lower value (~3 GB in this case) or it crosses the threshold and registers a much higher value (~5 GB in this case). I assume this is due to some kind of kernel memory allocation procedure – mapping a bunch of new pages to a process whenever it exceeds a threshold, or something. Again, I don't understand the details, but I've been observing this for years. It's nothing to worry about.

The runs with simplification are apparently not close to such a threshold, and so their variance is much lower. I think those results are therefore more reliable and representative. Shrug. Maybe running these tests on Linux would produce results with lower variance, if the Linux kernel handles memory differently? I'm happy to share the scripts used, if you want to get into the weeds. :->
Nah, that doesn't worry me. I think it's just a sign that the simplification interval is non-optimal. We know that if simplification is too infrequent, it actually makes runtimes longer and increases peak memory usage. So the optimal interval is probably less than 500 generations. But that's not important here; the point here is to have a test that exercises the use of the tree sequence, not just the recording of it, so that we see the effect of 64-bit sizes on that. Overall, I think these results are pretty clear: the impact on speed is negligible and the impact on memory is ~10%, at least for this particular model.
OK, great, sounds like we're in agreement! Here's how I suggest we go forward:
So I think we can close this issue? I think we came to the right decision, thanks for the input everyone!
FYI, we've got someone hitting this, here: https://groups.google.com/g/slim-discuss/c/p5LL2SB1BTY/m/fN1quMiIAwAJ?utm_medium=email&utm_source=footer&pli=1
They don't, I don't think; they just turned off simplification completely, due to a misunderstanding, it looks like.
Is it worth us adding something to the error message to the effect of "This error is often caused by disabling simplification in simulations"?
I think that's a good idea. The error lives in the tskit code, I guess; can it be added there?
It seems that requiring metadata columns to be < 4GiB is an unwelcome limitation in forward simulation applications, so we should consider upgrading the metadata_offset columns (and other offset columns) to 64-bit integers. Here is what it would entail:

- Add a tsk_offset_t typedef and go through the C tables API making sure this is used for all offset columns (metadata_offset, ancestral_state_offset, etc.). Typedef this to uint64_t.

From a user perspective, this will mean that old versions of tskit won't be able to read newer files. New versions of tskit will continue to read older files without issues, as we're making the code more flexible in terms of expected types.
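As a rough illustration of the shape of that change (the struct below is a simplified stand-in, not the real tskit table definitions; only the tsk_offset_t typedef itself is what's being proposed):

```c
#include <stdint.h>

/* Proposed: a dedicated type for ragged-column offsets, typedef'd to 64 bits. */
typedef uint64_t tsk_offset_t;

/* Simplified stand-in for a table with a ragged metadata column; only the
 * offset column changes width, row IDs and flags are untouched. */
typedef struct {
    uint64_t num_rows;
    double *time;
    uint32_t *flags;
    int32_t *population;
    int32_t *individual;
    char *metadata;                /* packed metadata bytes for all rows */
    tsk_offset_t *metadata_offset; /* num_rows + 1 entries; row j's metadata is
                                      metadata[metadata_offset[j] ..
                                               metadata_offset[j + 1]) */
    uint64_t metadata_length;
} sketch_node_table_t;
```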
pinging @petrelharp and @molpopgen for opinions.