Convert tsk_id_t/tsk_size_t to 64 bit. #343

Closed
jeromekelleher opened this issue Aug 27, 2019 · 58 comments

@jeromekelleher
Member

It seems that requiring metadata columns to be < 4GiB is an unwelcome limitation in forward-simulation applications, so we should consider upgrading the metadata_offset columns (and the other offset columns) to 64-bit integers. Here is what it would entail:

  1. Create a tsk_offset_t typedef and go through the C tables API making sure this is used for all offset columns (metadata_offset, ancestral_state_offset, etc). Typedef this to uint64_t.
  2. Increment the file-format minor version. In the file-writing code, look at the last value in each offset array: if it's < UINT32_MAX, store it as a 32-bit value; otherwise, store it as 64-bit (see the sketch after this list). In the file-reading code, check which type was used for storage and convert accordingly. Note that this means we won't be able to keep the current zero-copy behaviour, where the memory in the kastore backs the arrays; we need to figure out the best approach here. (Using the kastore memory was motivated by mmap-based I/O, which we've since dropped as it's inherently dangerous and hard to do properly cross-platform.)
  3. Clean up _tskitmodule.c to use the correct new sizes for numpy arrays (possibly also needing to backport this over to msprime where we use the LightweightTableCollection for interchange).
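To make step 2 concrete, here is a minimal sketch of the write-time width selection; tsk_offset_t (from step 1) and store_offset_column() are hypothetical names for illustration, not existing tskit API:

#include <stdint.h>
#include <stdio.h>

typedef uint64_t tsk_offset_t;

/* Offset columns are cumulative, so the last entry is the largest;
   it alone determines whether 32 bits suffice on disk. */
static void
store_offset_column(const tsk_offset_t *col, size_t len)
{
    if (len == 0 || col[len - 1] < UINT32_MAX) {
        printf("storing %zu offsets as uint32\n", len);  /* downcast each value on write */
    } else {
        printf("storing %zu offsets as uint64\n", len);  /* older readers reject the file */
    }
}

int
main(void)
{
    tsk_offset_t small[] = { 0, 10, 20 };
    store_offset_column(small, 3);
    return 0;
}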

From a user perspective, this will mean that old versions of tskit won't be able to read newer files. New versions of tskit will continue to read older files without issues, as we're making the code more flexible in terms of expected types.

pinging @petrelharp and @molpopgen for opinions.

@jeromekelleher
Member Author

Note: arguably we should go the whole hog here and typedef tsk_size_t to uint64_t and tsk_id_t to int64_t. It's less clear that we really need more than 2**31 rows in a table, though, and this will definitely affect memory usage and (possibly) performance. Perhaps not as much as we might think, though, so since we're disrupting things anyway it might be worth trying this out to see what the perf implications really are.

@petrelharp
Contributor

It seems like we need to allow bigger tables (see #402), but on the other hand, the implications for memory usage are pretty unfortunate. The least bad other solution I can think of is to have tables switch over from 32-bit to 64-bit once they get big, although that'd be quite annoying.

@jeromekelleher
Member Author

It seems like we need to allow bigger tables (see #402), but on the other hand, the implications for memory usage are pretty unfortunate. The least bad other solution I can think of is to have tables switch over from 32-bit to 64-bit once they get big, although that'd be quite annoying.

I think we'll probably just have to bite the bullet and make everything 64bit. Having a tsk_table_t and a tsk_table64_t would be a huge amount of extra work. Let's do a prototype and see what the actual memory costs of going to 64 bit are. It'll probably make no practical difference for the vast majority of tree sequences.

We should make the storage code more flexible though, and only store in 64-bit if we need to. This will create a little bit of forward-backward incompatibility, but nothing we can't manage (the versioning infrastructure is there for exactly this type of thing).

@molpopgen
Member

molpopgen commented Nov 13, 2019

The memory impacts will be appreciable for larger sims. For example, if I modify the script from #402 to be a plain W-F (Wright-Fisher) model:

initialize() {
        // initializeSLiMModelType("nonWF");
        initializeTreeSeq(simplificationInterval=500);
        // defineConstant("Np", 50);
        // defineConstant("K", 1000000);        // total carrying capacity
        // defineConstant("Kp", asInteger(K / Np));

        initializeMutationType("m1", 0.5, "f", 0.0);
        m1.convertToSubstitution = T;

        initializeGenomicElementType("g1", m1, 1.0);
        initializeGenomicElement(g1, 0, 600);
        initializeMutationRate(0);
        initializeRecombinationRate(0);
}
// reproduction() {
//      subpop.addCrossed(individual, subpop.sampleIndividuals(1));
// }
1 early() {
        print(time());
    sim.addSubpop("p1",1000000);
        // for (i in 1:Np)
        //      sim.addSubpop(i, Kp);
}
// early() {
        // for (subpop in sim.subpopulations)
    //     print(subpop.individualCount);
        // for (subpop in sim.subpopulations)
        //      subpop.fitnessScaling = Kp / subpop.individualCount;
// }
500 late() {
        print(time());
        sim.outputFixedMutations();
}

And run the following command:

/usr/bin/time -f "%e %M" -o bigslim_time.txt ./slim bug2.slim

The timing is:

cat bigslim_time.txt 
Command exited with non-zero status 1
6636.40 31636984

So, that's 32GB. Roughly, we can reason that the memory is mostly taken up by edges. The simulation has no mutation, so we'll assume the size of individual genomes is negligible. Changing the node integer to uint64_t bumps the total size of an edge up by 1/3, so we can guesstimate that the RAM use will go up accordingly, to about 40GB.
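For intuition, the 1/3 figure falls out of the edge record layout: two double coordinates plus two node IDs. A quick sketch (these structs are illustrative stand-ins, not tskit's actual definitions):

#include <stdint.h>
#include <stdio.h>

/* Hypothetical edge rows with 32- vs 64-bit node IDs. */
typedef struct { double left, right; int32_t parent, child; } edge32_t;
typedef struct { double left, right; int64_t parent, child; } edge64_t;

int
main(void)
{
    printf("32-bit node IDs: %zu bytes per edge\n", sizeof(edge32_t)); /* 24 */
    printf("64-bit node IDs: %zu bytes per edge\n", sizeof(edge64_t)); /* 32 */
    return 0;
}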

The same model in fwdpy11 runs through to completion, and includes the simplification and the edge table indexing. The following is run against fwdpy11/dev:

import fwdpy11
import numpy as np

N = 1000000

pop = fwdpy11.DiploidPopulation(N, 1.)

pdict = {'gvalue': fwdpy11.Multiplicative(1.),
         'nregions': [],
         'sregions': [],
         'recregions': [],
         'rates': (0, 0, 0),
         'demography': np.array([N]*500, dtype=np.int32)
         }
params = fwdpy11.ModelParams(**pdict)

rng = fwdpy11.GSLrng(42)

# the 500 is the time b/w simplifications.
# here, the parameter means once at the end.
fwdpy11.evolvets(rng, pop, params, 500)

The same type of command for timing gives:

cat bigfp11_time.txt 
687.72 67584080

So, pushing 70GB. It may be important that the back-end does not copy edge tables to sort; rather, the sorting is in place. I'm not convinced, though, that that difference affects peak RAM use, as the memory would be freed prior to simplification. In any case, upping that number by 1.3 gets close to 90GB.

Simplifying every 100 generations is a big help:

614.79 14609144

I have not tested the data dump from the fwdpp table collection format to tskit, but I could if there is interest--maybe I'd get an overflow there? In any case, I'm not totally convinced that #402 is entirely due to overflow. Perhaps there's a subtle bug elsewhere?

@molpopgen
Member

Quick update. I hacked fwdpp to use 64-bit signed integers for the tree sequences and reran the example from above. The memory use wasn't as bad as I'd guessed, so my loose approximation was too loose:

cat bigfp11_time.txt 
738.32 79347220

So, that is about a 10% performance hit, but I have to admit to laziness: I merely changed a typedef. There are probably loads of narrowing conversions, etc., going on. I was just playing.

@jeromekelleher
Member Author

Thanks @molpopgen, this is good info. The more I think about it, the more I wonder whether we should be using tables for forward simulations at all though. (I'll send you an email about this)

Do we think that the simplified output of a forward simulation is ever likely to need 64 bit tables?

@molpopgen
Member

molpopgen commented Nov 15, 2019

Do we think that the simplified output of a forward simulation is ever likely to need 64 bit tables?

Tricky question. Personally, I am not interested in attempting to simulate "all humans", but others may be. More generally, for tsinfer: if you think about the rate at which adding a sample adds new nodes, what does a "Continent Biobank" do to the total number?

For this particular issue, we need offsets larger than the current limit for metadata. For #402, we haven't totally established what is going on, or at least I'm not clear that we have.

@jeromekelleher
Member Author

Tricky question.

Indeed!

@jeromekelleher jeromekelleher changed the title Allow ragged columns with > 2**32 entries Convert tskid_t/tsk_size_t to 64 bit. May 14, 2020
@jeromekelleher jeromekelleher added this to the C API 1.0.0 milestone May 14, 2020
@jeromekelleher jeromekelleher changed the title Convert tskid_t/tsk_size_t to 64 bit. Convert tsk_id_t/tsk_size_t to 64 bit. May 14, 2020
@petrelharp
Contributor

Update here: #402 was indeed due to running out of space in the metadata column; other columns (e.g. the node ID column) are a factor of 10 smaller, but will still be hit pretty quickly in sims of size 1M. Some additional discussion there.

@jeromekelleher jeromekelleher added the C API Issue is about the C API label Sep 29, 2020
@bhaller

bhaller commented Mar 14, 2021

Do we think that the simplified output of a forward simulation is ever likely to need 64 bit tables?

A user has just posted on slim-discuss (https://groups.google.com/g/slim-discuss/c/lCtps2FEBxI) with a model that cannot run under tree-sequence recording because it causes the error "tsk_mutation_table_add_row: Table column too large; cannot be more than 2**32 bytes." (For that user, it crashes, in fact; unclear why, have requested a crash log, but the table overrun seems to be the actual culprit, and running the model myself, I get the nice error message rather than a crash.) The model is quite simple: it creates a population of size 50,000 (diploid) and then loads a 2GB VCF file with a bunch of mutations. It runs fine with tree-sequence recording turned off. Since the table overflow happens immediately, during the load of the VCF file in the first generation, it is an example of a case where even the simplified form of a tree-seq forward simulation exceeds the table size limit.

The details of the use case are interesting, however. The model is not that big. 50,000 individuals, and the VCF file contains only 9760 mutations. The catch is that, because those mutations are loaded from a VCF file, there is no ancestral relationship between the various instances of each of those 9760 mutations. If a given mutation is possessed by 40,000 individuals, it gets put into the mutation table 40,000 times, because there is no ancestral edge to put it on from which the 40,000 individuals descend, so as far as tskit is concerned it is not one mutation, but 40,000 independent mutations. I guess that if the user were to run tsinfer on their VCF data, and then load the inferred tree sequence into SLiM rather than loading a VCF, then the model might be quite manageable, in fact (assuming the VCF data makes sense from an evolutionary perspective, i.e., a reasonable tree sequence can be inferred from it, I suppose).

Anyway, the result is that the memory usage for the mutation table's metadata column overflows due to too many bytes. Given that this model really isn't that large, it would probably be easy to construct a similar example for which the number of rows in the mutation table overflowed as well, not just the number of bytes, while still being quite reasonable for an end user to want to be able to run.

(FYI, the math: 50000 individuals * 9760 mutations * an average frequency of 0.5 * 20 bytes of SLiM metadata per mutation instance == 4.88e+09 bytes, which is greater than 2^32. So it is indeed quite plausible that this use case overflows the mutation metadata column.)
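For what it's worth, that arithmetic checks out; here is a throwaway check using only the numbers quoted above (nothing tskit-specific):

#include <stdio.h>

int
main(void)
{
    double bytes = 50000.0   /* individuals */
                 * 9760.0    /* mutations in the VCF */
                 * 0.5       /* average frequency */
                 * 20.0;     /* bytes of SLiM metadata per mutation instance */
    /* Prints 4.88e+09 vs 4.29e+09: the metadata column overflows. */
    printf("%.3g bytes needed vs 2^32 = %.3g\n", bytes, 4294967296.0);
    return 0;
}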

So, maybe this case is not a super-clear example of a need to move to 64-bit indexes, although I thought it was when I wrote my reply to the slim-discuss list (I'll follow up shortly over there). Maybe it is an example of a need to infer a tree sequence prior to starting simulation, in order to make the simulation tractable (which is interesting; that hasn't come up before, for me). On the other hand, it is also true that if tskit were using 64-bit ints for these things, this model would have loaded and run just fine; and I can see users perhaps not wanting to run tsinfer first, if they don't want their simulation to be conditioned upon a particular inference about the ancestral relationships of the individuals being simulated. So I think, on balance, this probably does argue for a need for 64 bits in tskit; the use of tsinfer would just be a workaround, assuming it does serve to fix the problem at all.

Curious to hear all your thoughts about this!

@petrelharp
Contributor

My thoughts... are (still) that this falls firmly in the category of things I don't want to think about, but ought to do something about. Thanks for the reminder.

(I agree, tsinfer should fix this user's problem. And the mysterious error might be due to running out of memory on the cluster?)

@bhaller

bhaller commented Mar 15, 2021

He said he had a 500 GB memory limit on his cluster process, I think, so if that's true (open to question; he seems to be a newbie) he should have hit the error message, not crashed. Hard to guess/debug, and I'm not particularly worried about that aspect of it. :->

@jeromekelleher
Member Author

Thanks for the info here @bhaller, all very interesting and useful.

I'm sort of with Peter here, in that I agree this is something we need to sort out, but I'm reluctant to actually take it on in the short term. In reality, it will take up a couple of months of dev time to resolve, and I feel like there's more urgent stuff that needs doing and that this is still a relatively remote corner case.

In this case, I think 64 bit tables would have allowed the model to run, but I'm not sure that the resulting tree sequence would have been much use - ~50K mutations per site is very ugly, and I suspect actually trying to do anything with the resulting tree sequence (like computing stats) would be very slow. If we're going to use this as a motivating example for taking on the BigTable work, then we should try it out to see what working with a tree sequence like this is actually like (i.e., on something just big enough to fit into the 32bit space).

The idea of inferring a tree sequence from the data before running the simulation is very interesting! I think it would probably work quite well in this case, assuming the data isn't very noisy. We'd need to put some work into improving the toolchain to make this easy though, if we did recommend this as an approach for users.

So, I'm not sure I have any answers here, but you asked for thoughts @bhaller, so there they are 😉

One question @bhaller:

(For that user, it crashes, in fact; unclear why, have requested a crash log, but the table overrun seems to be the actual culprit, and running the model myself, I get the nice error message rather than a crash.)

Do you think this is a bug in tskit? What kind of crash, e.g. do we get a segfault or what?

@molpopgen
Member

This is a tricky one. It'll be useful to move to 64 bit, but it is a lot of work. On the other hand, it seems that simulations with tree sequence recording should be "all in", and that starting them from a VCF should at least trigger a warning.

@molpopgen molpopgen reopened this Mar 15, 2021
@bhaller

bhaller commented Mar 15, 2021

One question @bhaller:

(For that user, it crashes, in fact; unclear why, have requested a crash log, but the table overrun seems to be the actual culprit, and running the model myself, I get the nice error message rather than a crash.)

Do you think this is a bug in tskit? What kind of crash, e.g. do we get a segfault or what?

Unfortunately I have only the output line "Command terminated by signal 11" and the user's claim that this error is a segfault. I've asked for a proper crash log, haven't gotten a reply to that yet. No idea whether it's a tskit bug (unlikely?) or a SLiM bug, probably from ignoring a tskit error return somewhere (likely?). Sorry.

@bhaller

bhaller commented Mar 15, 2021

This is a tricky one. It'll be useful to move to 64 bit, but it is a lot of work. On the other hand, it seems that simulations with tree sequence recording should be "all in", and that starting them from a VCF should at least trigger a warning.

Hmm; really? I see the objection, but it's not really wrong, just non-optimal, right? What do you think @petrelharp?

@molpopgen
Member

I view the node field as the evolutionary origin of the mutation, not simply the data structure origin. My view led me to disallow this entirely--if you want tree sequences, you either start w/one or start "empty".

@jeromekelleher
Member Author

Well, that's a philosophical point, I think. I think Ben's right in that it's not wrong per se, just not obviously a good idea either. It's worth having a look at the tree sequences produced, though, as users might be confused if they are massive and really slow to process.

@bhaller

bhaller commented Mar 15, 2021

@jeromekelleher

Do you think this is a bug in tskit? What kind of crash, e.g. do we get a segfault or what?

Unfortunately I have only the output line "Command terminated by signal 11" and the user's claim that this error is a segfault. I've asked for a proper crash log, haven't gotten a reply to that yet. No idea whether it's a tskit bug (unlikely?) or a SLiM bug, probably from ignoring a tskit error return somewhere (likely?). Sorry.

The user replied, says his cluster admin says that there is no crash log. He has a 1.5 GB core dump file. I have no idea what to do with those. :-> I downloaded it, and it appears to be a big binary file. If you know how to get a symbolicated backtrace out of a core dump, the download link is https://drive.google.com/drive/folders/14KyP82kHqnaRbJsCEW0LbQGA17dlMTPU?usp=sharing. I don't know what the hardware/software platform is, but I could ask. The file appears to start with a tag, "ELF>", that indicates it is an ELF core file, whatever that is (https://stackoverflow.com/questions/5986366/elf-core-file-format I guess). :->

@petrelharp
Contributor

I agree - not wrong but also not the right way to do things.

I don't think it's worth chasing down this crash any further, either.

@jeromekelleher
Member Author

I think the user would need to do a GDB backtrace on the core file, like this SO post. We can't do it because you need the original executable.

@petrelharp
Contributor

Note that this has indeed shown up in the wild in MesserLab/SLiM#198 (resulting in #1509).

@bhaller

bhaller commented Jun 22, 2021

You could try this out if you like @bhaller - you just change the typedefs for tsk_size_t and tsk_id_t, turn off your compiler warnings, and run a big SLiMulation before and after. Just don't try to interact with stuff on disk, as this will result in nasty things happening.

I gave it a whirl. Unfortunately, the naive approach produced a crash:

slim(64218,0x10079edc0) malloc: *** error for object 0x100000000: pointer being realloc'd was not allocated

This is in expand_column(), at the call to realloc(). I tried looking at the crash with a debug build, but in that case, the crash no longer occurs. :-> In a release build, on the other hand, it seems to be 100% reproducible, but I can't tell what's going on since the debugger can't get values and such reliably. So, it looked like a bit of a headache to debug, and the crash might well be due to something that the compiler is already warning about anyway – no point in spending a bunch of time tracking that down. So I stopped. :->

So I think the next step is probably to make a test 64-bit branch of tskit, get the warnings fixed and the tests all passing on your end, and then we can try merging that version of the code into SLiM and trying it out.

@benjeffery
Member

@benjeffery, what are your thoughts?

My thinking is that we should grasp the nettle here and make a decision; now is the time to do this, as we would like C 1.0 as a Christmas present at the latest! I think getting a rough C-only branch and seeing how bad things are in SLiM is the first step, as @bhaller says. If performance is bad, we could consider only using 64-bit values for the offset columns, as that seems to be the first place the limits are hit.

I've been trying to think through the implications for the Python side. The 64-bit LWT, I think, will quite happily accept a 32-bit LWT without any extra logic - the LWT code uses numpy "safe" casting to load from the dict. This means a 64-bit msprime could load a dict from a 32-bit tskit. However, heavily used things like copy_table in msprime need to go the other way, and that doesn't work, as it is an unsafe cast. It wouldn't be too bad to have an as64bitdict, with the original asdict erroring out if things were too big, but then other methods like column arrays are still going to be 64-bit. Making all of those optionally 64-bit seems like too much work and duplication.

0.4.0 with 64-bit only seems to give us the least forward maintenance burden and complexity, if the breakage isn't too bad. I assume I am missing some of the breakage, though.

@jeromekelleher
Member Author

OK, let's get the simple ground work done anyway, so we can have a better view on what the actual complexities are.

@bhaller

bhaller commented Jun 24, 2021

OK, I merged in the 64-bit experimental PR to SLiM and ran some tests using a simple SLiM model that uses tree-seq recording, based on the model from MesserLab/SLiM#198 that caused 32-bit tskit to crash in #1509. Results:

  • The unmodified model now runs out to generation 52,000, when I killed it. Seemed to be fine. This model crashed inside tskit around generation 21,000 with the 32-bit version of tskit (which I just confirmed with the same SLiM version I'm using for the 64-bit tests). So going to 64-bit does seem to fix that problem, as hoped.

  • A version of the model that runs out to generation 20,000 (thus avoiding the crash), without simplification, runs in 54.236 seconds for 64-bit (peak memory usage 3968MB), 47.3903 seconds for 32-bit (peak memory usage 4348MB).

  • A version of the model that simplifies every 500 generations, running out to generation 20,000, runs in 101.504 seconds for 64-bit (peak memory usage 8029MB), 97.4028 seconds for 32-bit (peak memory usage 5504MB).

So, the speed impact is quite small; I personally think that is not a big deal, and indicates that there's no need to jump through hoops to avoid it (supporting running in either 32-bit or 64-bit mode, etc.). When simplification is off, the peak memory usage is actually larger for 32-bit than for 64-bit; that is presumably just due to stochasticity. (I used the same random seed for all of these runs, but memory usage of a process on macOS is stochastic based upon factors external to the process, for some reason; so I'd have to do multiple replicate runs to get a good mean value for peak memory usage). But overall it looks like it is not a problem. When simplification is on, the peak memory usage is quite a bit higher for 64-bit – enough higher that people might care.

If these memory usage stats need pinning down, I'm happy to do replicate runs.

@petrelharp
Contributor

Hm, thanks! The speed impact is small, I agree. Memory is a substantial worry, because it constrains simulations in practice. In the second point, you're saying that the 64-bit version without simplification uses less memory? For the version with simplification, that's an increase of about 50%, pretty big - but, if you think there's substantial variation in that number, I think it's worth doing some more runs to nail that down.

@jeromekelleher
Member Author

jeromekelleher commented Jun 24, 2021

Thanks for this @bhaller, this is super helpful.

Thinking through the numbers again, it's getting harder for me to see going to 64 bit IDs as being worth it. Assuming there's no metadata, one row of the node table requires 24 bytes (time=8 bytes; flags=4, individual=4, population=4, metadata_offset=4). If we want 2^31 of those rows, we need 48GiB. That's just the node table - if we wanted an actual meaningful tree sequence to go with this, we'd surely be talking about 100GiB of edges. Given that we make copies of the table data a few times during loads, etc, you would definitely need 1TB of RAM to work with a tree sequence this large, if not 2TB. (These numbers are for 2^31 rows which we can already do.)
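(The 48GiB figure is just 2^31 rows times 24 bytes per row; a one-line check, nothing tskit-specific:)

#include <stdio.h>

int
main(void)
{
    unsigned long long bytes = (1ULL << 31) * 24;  /* rows * bytes per row */
    printf("%.1f GiB for the node table alone\n",
           (double) bytes / (1ULL << 30));         /* prints 48.0 */
    return 0;
}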

Where would such a tree sequence come from? Well, it would have to come from a forward simulation, I think. If it came from a forward simulation, then either it's raw unsimplified output (so, wouldn't you be better off simplifying it down, 99.9999% of the time?), or it is the simplified result of an absolutely gigantic simulation which was run on a machine with hundreds of terabytes of RAM.

So, to me, the only time people are going to actually want to create tables with this many rows is (a) if they're making a mistake and should be simplifying more often or (b) they're living 10 years in the future (and they've still probably made a mistake!).

The metadata is a different issue, and I can definitely see an argument for making all the x_offset columns 64 bit, and I think we should almost certainly do that. Making the ID type 64 bit seems like bad engineering to me, the numbers just don't add up. All we'd be doing is letting people make bigger mistakes, with certain downsides in terms of extra storage, cache pollution, etc etc.

@petrelharp
Contributor

petrelharp commented Jun 24, 2021

Ah-ha. I think you're exactly right, @jeromekelleher.

@bhaller

bhaller commented Jun 24, 2021

Thanks for this @bhaller, this is super helpful.

No worries.

...So, to me, the only time people are going to actually want to create tables with this many rows is (a) if they're making a mistake and should be simplifying more often or (b) they're living 10 years in the future (and they've still probably made a mistake!).

The metadata is a different issue, and I can definitely see an argument for making all the x_offset columns 64 bit, and I think we should almost certainly do that. Making the ID type 64 bit seems like bad engineering to me, the numbers just don't add up. All we'd be doing is letting people make bigger mistakes, with certain downsides in terms of extra storage, cache pollution, etc etc.

Well, I think it is not unreasonable to push it out further. However, I think "10 years in the future" is actually kinda arriving now. :-> There are certainly SLiM users already doing runs that require terabytes of memory; big clusters are commonplace now, already. There are certainly users wanting to do runs that involve more than a million diploid individuals per generation; that fills up 2^31 in only ~1000 generations without simplification, right? (I've had one person so far who wanted to do a billion diploid individuals per generation, which would fill up 2^31 in, uh, ~1 generation, but that seems a bit out of reach still :->). And there are lots of people wanting to do runs that are unsimplified, or that "remember" so many individuals that they might as well be unsimplified, for various purposes; I don't know how many such people are doing this, but "wouldn't you be better off simplifying it down, 99.9999% of the time?" is definitely an overestimate :->. Also, tons of edges to go with the tons of nodes is not always the way it goes; there are lots of people simulating species that are entirely/mostly clonal, or that have a low recombination rate, or – perhaps even more commonly – they might just be simulating a short stretch of the genome rather than the whole thing (and thus blow up the node table without blowing up the edge table). All of this is happening now.

I think we will definitely want to move to 64 bit; the only question, in my mind, is whether to bite the bullet and do it now, or put it off until it is more of a forced move. Perfectly reasonable to put it off, but if the price paid is smallish it might be better to just do it early, for the 1.0 C API release (if I understand correctly that that is the choice, at the moment), and avoid growing pains later.

I agree that metadata is the place where 64-bit is presently necessary – the only place where this is really becoming urgent. I think all of the cases of table overflow that we've seen so far have been due to metadata, right @petrelharp? So if you want to just shift that to 64-bit and keep the rest 32-bit for now, I have no objection at all. Or if you want me to run more replicates to get a better estimate of impact, I can do that.

@petrelharp
Contributor

petrelharp commented Jun 24, 2021

I think all of the cases of table overflow that we've seen so far have been due to metadata, right @petrelharp?

Right.

And, my vote is with putting off entirely-64-bit (just doing offsets now), since a 50% memory increase would be pretty painful now, and it's not clear to me that the transition will be more painful later.

@jeromekelleher
Member Author

Hmmm, very interesting, thanks Ben. How would you feel about the final output of a forward simulation needing to have 32 bit IDs (for compatibility with the rest of the tskit world, at the moment), but having 64 bit IDs supported internally so that these big models could at least be simulated, and simplified down to something more manageable at the end? I don't think it would be very hard for us to support 32 and 64 bit builds of tskit C as a compile time option, and that could be the right compromise?

So, to be concrete, I'm proposing that SLiM would build tskit in 64 bit mode, but the library would raise an error if someone tried to save a file having tables with more than 2**31 rows. This leaves the door open for all of tskit moving to 64 bit some day in the future, while hopefully enabling the very large simulations you want to do, without too much compromise. I guess the advice for users who want to run massive models like this is that they should simplify down/subset or something at the end of the simulation, so that they can analyse their results.
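A rough sketch of what that compile-time option might look like; the TSKIT_ID_64 macro and the row-count guard are hypothetical illustrations, not existing tskit build flags or API:

#include <stdint.h>

#if defined(TSKIT_ID_64)          /* hypothetical 64-bit build flag */
typedef int64_t tsk_id_t;
typedef uint64_t tsk_size_t;
#else
typedef int32_t tsk_id_t;
typedef uint32_t tsk_size_t;
#endif

/* On save, a 64-bit build would refuse to write any table whose row
   count a 32-bit reader could not index. */
#define TSK_32BIT_MAX_ROWS ((uint64_t) INT32_MAX)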

@petrelharp
Contributor

That's an interesting option - I don't like the memory consequences of always doing 64-bit tables in SLiM, though. Maybe an option to build 64-bit-table-SLiM?

@bhaller

bhaller commented Jun 24, 2021

I think this sounds like overcomplicating things. I don't want to have to support multiple builds of SLiM, and I don't imagine you folks want to support multiple builds of tskit either; just that much more testing and other tsuris. If nobody has yet hit this for anything other than metadata, and we're not sure how far in the future it will be before somebody does, and the memory impact is big enough to be worrisome, then maybe let's put it off. In the Future, using twice as much memory probably won't bother people as much as it does now, too. In the Future, everything will be easy and our problems will just melt away. :-> So let's wait for the Future to arrive (apart from 64-bit metadata). I'm sufficiently convinced.

@bhaller

bhaller commented Jun 24, 2021

Realizing that my previous comment might sound snarky, let me clarify that that is not my intent. :-> My point is: I think eventually we will want to go full 64-bit, but we are not there yet; apart from metadata, no user has hit this limit. How long it will take before a user does legitimately hit the limit is hard to guess, who knows. If we needed to do it now, then maybe separate 32/64-bit builds would be the right solution, because the memory penalty involved is presently pretty painful. If it turns out that we don't need to do it for 5 or 10 years, which is certainly possible (although my guess is sooner), the memory penalty would probably feel less painful, and we could simply abandon 32-bit and go right to 64-bit. That would be cleaner/easier/better. So let's kick the can down the road (apart from metadata), and see whether the Future solves our problems for us. :->

@bhaller

bhaller commented Jun 24, 2021

So I suggest repurposing this issue to represent "metadata should be 64-bit", and effectively closing the remaining "tskit should be 64-bit" issue as a side effect. We will revisit that larger issue when somebody brings us a real-world case that is not pathological. Sound good?

@benjeffery
Member

I've made #1526, which has a 64-bit tsk_size_t. This is more than just 64-bit metadata offsets, but it is the easiest thing to do that gets close. I assume perf will be much closer to all-32-bit.

The current uses of 64-bit IDs seem niche enough, and "power user" enough, that they could be supported by a compile-time flag in the C code, with no support in the Python and no compatibility with 32-bit land. I'm not suggesting we do this now, but we could if there was demand. I think fixing the immediate metadata issue only is the right thing for C 1.0.

@bhaller

bhaller commented Jun 24, 2021

I will try it out in a SLiM build and report back here, thanks!

@bhaller

bhaller commented Jun 25, 2021

OK, so. Again, the test model without simplification runs to 50,000 generations with no problems, unlike with 32-bit tskit, so this PR still fixes the bug at hand.

I did 5 replicates of each type of run this time, so you can get a picture of the variance. It's not enough replicates to really do great stats, but it's something. :-> Take-home point: times are statistically identical, memory usage looks to be ~10% higher. Details below. I think this is acceptable overhead for the benefit; I think we should do this.

For 32-bit sizes:

20000 generations, no simplification:
CPU time (sec.): 43.7, 42.1, 45.8, 47.1, 43.9
Peak memory (MB): 4327.7, 4190.7, 3938.3, 3684.5, 3357.6

20000 generations, simplification every 500:
CPU time (sec.): 91.0, 90.2, 91.6, 91.5, 91.4
Peak memory (MB): 5268.4, 5404.2, 5360.2, 5414.7, 5446.3

For 64-bit sizes:

20000 generations, no simplification:
CPU time (sec.): 44.6, 42.3, 43.1, 44.2, 43.0
Peak memory (MB): 5314.6, 3157.5, 5168.0, 3139.7, 5152.6

20000 generations, simplification every 500:
CPU time (sec.): 91.0, 89.0, 90.1, 91.1, 91.0
Peak memory (MB): 5913.4, 5904.4, 5930.4, 5690.0, 5906.8

Times are statistically identical between 32-bit and 64-bit, for both models (p=0.31, p=0.18). Memory usage is harder to say, because it is so oddly stochastic on macOS. I wish I knew what caused that. All of these replicates use the same random number seed, so they should be doing literally exactly the same thing every time, near as I can figure it, and yet their memory usage is widely divergent. Without simplification, the noise from that variance makes it pretty hard to see a pattern; the two highest-usage runs are for 64-bit sizes, but then again, the two lowest-usage runs are, too, and stats says there's no significant difference (p = 0.4055). With simplification, for whatever reason, the variance is lower, and we can see a significant (p = 4.012e-05) difference in memory usage, which appears to be about 10% higher for 64-bit.

@petrelharp
Contributor

Huh - memory usage is either 10% higher or 2x higher... ??

It's also kinda worrisome that your 32 bit tests have simplification being longer and taking more memory, but that's a separate issue...

@bhaller

bhaller commented Jun 25, 2021

Huh - memory usage is either 10% higher or 2x higher... ??

As far as I can tell, the high variance is because there is, as well as stochasticity, a kind of granularity in the way macOS allocates memory to processes. Note this set of results, for example: "5314.6, 3157.5, 5168.0, 3139.7, 5152.6". Either the process takes ~3 GB, or it takes around 5 GB; it's bimodal. I've noticed this many times. When the memory usage of a process is close to a threshold, you see this pattern; either it stays below the threshold and registers a lower value (~3 GB in this case) or it crosses the threshold and registers a much higher value (~5 GB in this case). I assume this is due to some kind of kernel memory allocation procedure – mapping a bunch of new pages to a process whenever it exceeds a threshold, or something. Again, I don't understand the details, but I've been observing this for years. It's nothing to worry about. The runs with simplification are apparently not close to such a threshold, and so their variance is much lower. I think those results are therefore more reliable and representative. Shrug. Maybe running these tests on Linux would produce results with lower variance, if the Linux kernel handles memory differently? I'm happy to share the scripts used, if you want to get into the weeds. :->

It's also kinda worrisome that your 32 bit tests have simplification being longer and taking more memory, but that's a separate issue...

Nah, that doesn't worry me. I think it's just a sign that the simplification interval is non-optimal. We know that if simplification is too infrequent, it actually makes runtimes longer and increases peak memory usage. So the optimal interval is probably less than 500 generations. But that's not important here; the point here is to have a test that exercises the use of the tree sequence, not just the recording of it, so that we see the effect of 64-bit sizes on that.

Overall, I think these results are pretty clear, that the impact on speed is negligible and the impact on memory is ~10%, at least for this particular model.

@jeromekelleher
Member Author

OK, great, sounds like we're in agreement on how to go forward, so I think we can close this issue? I think we came to the right decision, thanks for the input everyone!

@petrelharp
Contributor

FYI, we've got someone hitting this, here: https://groups.google.com/g/slim-discuss/c/p5LL2SB1BTY/m/fN1quMiIAwAJ?utm_medium=email&utm_source=footer&pli=1
... I haven't looked into the details yet (e.g. whether they actually need such big tables).

@bhaller

bhaller commented Nov 8, 2022

FYI, we've got someone hitting this, here: https://groups.google.com/g/slim-discuss/c/p5LL2SB1BTY/m/fN1quMiIAwAJ?utm_medium=email&utm_source=footer&pli=1
... I haven't looked into the details yet (e.g. whether they actually need such big tables).

They don't, I don't think; they just turned off simplification completely, due to a misunderstanding, it looks like.

@benjeffery
Member

Is it worth us adding something to the error message to the effect of "This error is often caused by disabling simplification in simulations"?

@bhaller

bhaller commented Nov 8, 2022

Is it worth us adding something to the error message to the effect of "This error is often caused by disabling simplification in simulations"?

I think that's a good idea. The error lives in the tskit code, I guess; can it be added there?

@benjeffery
Member

@bhaller How does #2630 sound?

@bhaller

bhaller commented Nov 8, 2022

@bhaller How does #2630 sound?

LGTM, thanks!
