Cannot handle columns > 2GiB #69

ShyamieG · 2023-04-05T17:17:50Z

I am experiencing an error when trying to compress certain tree sequence files:

ValueError: Codec does not support buffers of > 2147483647 bytes

It seems that this error is originating from some function in the zarr package related to chunking? This occurs with both python and command-line versions of tszip.

I can't upload an example here because even the gzipped version of my file is too large (72.3MB).

Any insight into why this is happening and how I might resolve it?

jeromekelleher · 2023-04-06T08:52:05Z

Very interesting @ShyamieG! Can you provide us with a bit more info please:

- The full stack trace from your ValueError above
- The output of tskit info for one of these ts files

ShyamieG · 2023-04-07T15:13:06Z

Hi there, sorry for the delay. Thanks for helping me with this!

Here is the result of running tszip on the command line.

And the result of tskit info on the offending ts.

jeromekelleher · 2023-04-11T08:47:16Z

Here's the tskit info from your file:

╔════════════════════════╗
║TreeSequence            ║
╠═══════════════╤════════╣
║Trees          │    7871║
╟───────────────┼────────╢
║Sequence Length│24214675║
╟───────────────┼────────╢
║Time Units     │   ticks║
╟───────────────┼────────╢
║Sample Nodes   │    5098║
╟───────────────┼────────╢
║Total Size     │ 3.1 GiB║
╚═══════════════╧════════╝
╔═══════════╤═════╤═════════╤════════════╗
║Table      │Rows │Size     │Has Metadata║
╠═══════════╪═════╪═════════╪════════════╣
║Edges      │37085│  1.1 MiB│          No║
╟───────────┼─────┼─────────┼────────────╢
║Individuals│ 4958│486.0 KiB│         Yes║
╟───────────┼─────┼─────────┼────────────╢
║Migrations │    0│  8 Bytes│          No║
╟───────────┼─────┼─────────┼────────────╢
║Mutations  │ 1093│ 65.9 KiB│         Yes║
╟───────────┼─────┼─────────┼────────────╢
║Nodes      │ 8152│303.2 KiB│         Yes║
╟───────────┼─────┼─────────┼────────────╢
║Populations│    4│  2.4 KiB│         Yes║
╟───────────┼─────┼─────────┼────────────╢
║Provenances│  232│  3.1 GiB│          No║
╟───────────┼─────┼─────────┼────────────╢
║Sites      │  680│ 16.0 KiB│          No║
╚═══════════╧═════╧═════════╧════════════╝

That's a heck of a lot of provenance data (3.1 GiB!) - I wonder what's going on there? It's very unlikely this is of any use, so it's worth figuring out where it came from and stopping it from happening.

I think that must be the tszip issue: the codec can't handle columns larger than 2 GiB.
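For what it's worth, the number in the error message is 2^31 - 1, the largest signed 32-bit integer. A quick arithmetic check, using the rounded 3.1 GiB figure from the table above (so the byte count is only approximate):

```python
# The limit quoted in the ValueError message.
CODEC_LIMIT = 2_147_483_647

# One GiB in bytes.
GIB = 1024 ** 3

# Approximate size of the provenance table, taken from the rounded
# "3.1 GiB" shown by tskit info (the exact byte count is not known here).
provenance_bytes = int(3.1 * GIB)

# The limit is exactly the maximum signed 32-bit integer.
is_int32_max = CODEC_LIMIT == 2**31 - 1

# The provenance data is well over the limit.
exceeds_limit = provenance_bytes > CODEC_LIMIT

print(is_int32_max, exceeds_limit)
```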

ShyamieG · 2023-04-11T13:45:37Z

Ah, okay, got it. Any tips on how I can go about figuring out what all of this stuff is?

jeromekelleher · 2023-04-11T15:18:18Z

I guess the first step would be to look at the provenances, e.g.

tskit provenances <file> | less -S

I'm not sure how well it'll deal with having a 2G record though.
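If the command-line output is too unwieldy, a rough alternative is to rank the provenance records by size from Python. This is only a sketch: `largest_records` is a hypothetical helper (not part of tskit) that takes the record strings, and the file path in the usage comment is made up.

```python
def largest_records(records, top=5):
    """Return (index, size_in_bytes) for the biggest provenance records.

    `records` is any iterable of record strings, for example the
    `.record` attribute of each row yielded by `ts.provenances()`.
    """
    sizes = [(i, len(r.encode("utf-8"))) for i, r in enumerate(records)]
    # Largest first; ties keep their original order.
    sizes.sort(key=lambda pair: pair[1], reverse=True)
    return sizes[:top]


# Usage with tskit (assumes tskit is installed; the path is hypothetical):
#
#   import tskit
#   ts = tskit.load("example.trees")
#   print(largest_records(p.record for p in ts.provenances()))
```

Note that loading the file this way still reads the whole provenance table into memory, so it may be slow on a file of this size.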

ShyamieG · 2023-04-11T18:25:11Z

Awesome, thank you! I took a look at one of my smaller files and saw that there is indeed a lot of redundant information. These trees are the result of merging several other trees, so that is part of it. I set record_provenance to False in my tskit.union() call to mitigate this somewhat.

However, the other issue is that my files are being produced as a result of passing a tree sequence from one SLiM script to another dozens to hundreds of times. This also results in a lot of redundant information being stored.

Is it possible to simply delete certain kinds of provenance information entirely? For example, I don't need to store information about the SLiM model or parameters with the ts file. Any problems with deleting this kind of information that I should be aware of?

For context, I'm still working on the same general problem that I described in this post some months ago.

jeromekelleher · 2023-04-11T22:00:15Z

Easiest thing to do is just drop the provenance info entirely by truncating the provenance table. It's unlikely to have any effect on things working, as code shouldn't really be depending on the contents.

The docs here might help https://tskit.dev/tskit/docs/stable/provenance.html
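A minimal sketch of what dropping the provenance could look like, assuming the file fits in memory. The helper names `drop_provenance` and `strip_file` and the file paths are made up for illustration; the tskit calls used are `tskit.load`, `TreeSequence.dump_tables`, the table `clear()` method and `TableCollection.dump`.

```python
def drop_provenance(tables):
    """Remove every row from the provenance table, in place.

    `tables` is expected to be a tskit.TableCollection (anything with a
    `provenances` table that supports `clear()` will work).
    """
    tables.provenances.clear()
    return tables


def strip_file(in_path, out_path):
    """Load a tree sequence, drop its provenance, and write it back out."""
    # Imported here so drop_provenance itself has no dependencies.
    import tskit

    tables = tskit.load(in_path).dump_tables()
    drop_provenance(tables)
    tables.dump(out_path)


# Usage (hypothetical paths):
#
#   strip_file("with_provenance.trees", "no_provenance.trees")
```

This throws away all provenance, including the SLiM model and parameter records, so keep a note of those elsewhere if you might need them later.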

jeromekelleher · 2023-06-19T15:04:06Z

I'm going to keep this one open @ShyamieG as it is genuinely a bug in tszip. At a minimum we should emit a better error message saying what the problem is.
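As a rough idea of what a better message could look like, here is a sketch of a size check that could run before a column is handed to the codec. This is not tszip's actual code: `check_column_size` and `CODEC_MAX_BYTES` are hypothetical names, and the limit is taken from the error message above.

```python
# Limit taken from the ValueError in the original report (2**31 - 1).
CODEC_MAX_BYTES = 2_147_483_647


def check_column_size(name, nbytes, limit=CODEC_MAX_BYTES):
    """Raise a descriptive error if a column is too large to compress.

    `name` is the column name and `nbytes` its size in bytes.
    """
    if nbytes > limit:
        raise ValueError(
            f"Column '{name}' is {nbytes} bytes, which exceeds the "
            f"{limit}-byte limit of the compression codec. "
            "Consider reducing the size of this column before compressing."
        )


# Example: a provenance column of roughly 3.1 GiB would be rejected
# with a message that names the offending column.
try:
    check_column_size("provenances/record", int(3.1 * 1024**3))
except ValueError as err:
    message = str(err)
```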

jeromekelleher added the bug label Apr 6, 2023

jeromekelleher changed the title ~~Error when compressing tree sequences~~ Cannot handle columns > 2GiB Apr 11, 2023

ShyamieG closed this as completed Jun 19, 2023

jeromekelleher reopened this Jun 19, 2023

ShyamieG closed this as completed Jun 23, 2023

ShyamieG reopened this Jun 23, 2023
