Cannot handle columns > 2GiB #69

Open
ShyamieG opened this issue Apr 5, 2023 · 8 comments
Labels
bug Something isn't working

Comments

@ShyamieG

ShyamieG commented Apr 5, 2023

I am experiencing an error when trying to compress certain tree sequence files:

ValueError: Codec does not support buffers of > 2147483647 bytes

It seems that this error is originating from some function in the zarr package related to chunking? This occurs with both python and command-line versions of tszip.

I can't upload an example here because even the gzipped version of my file is too large (72.3MB).

Any insight into why this is happening and how I might resolve it?

@jeromekelleher added the "bug" label on Apr 6, 2023
@jeromekelleher
Member

Very interesting @ShyamieG! Can you provide us with a bit more info please:

  • The full stack trace from your ValueError above
  • The output of tskit info for one of these ts files.

@ShyamieG
Author

ShyamieG commented Apr 7, 2023

Hi there, sorry for the delay. Thanks for helping me with this!

Here is the result of running tszip on the command line.

And the result of tskit info on the offending ts.

@jeromekelleher
Member

jeromekelleher commented Apr 11, 2023

Here's the tskit info from your file:

╔════════════════════════╗
║TreeSequence            ║
╠═══════════════╤════════╣
║Trees          │    7871║
╟───────────────┼────────╢
║Sequence Length│24214675║
╟───────────────┼────────╢
║Time Units     │   ticks║
╟───────────────┼────────╢
║Sample Nodes   │    5098║
╟───────────────┼────────╢
║Total Size     │ 3.1 GiB║
╚═══════════════╧════════╝
╔═══════════╤═════╤═════════╤════════════╗
║Table      │Rows │Size     │Has Metadata║
╠═══════════╪═════╪═════════╪════════════╣
║Edges      │37085│  1.1 MiB│          No║
╟───────────┼─────┼─────────┼────────────╢
║Individuals│ 4958│486.0 KiB│         Yes║
╟───────────┼─────┼─────────┼────────────╢
║Migrations │    0│  8 Bytes│          No║
╟───────────┼─────┼─────────┼────────────╢
║Mutations  │ 1093│ 65.9 KiB│         Yes║
╟───────────┼─────┼─────────┼────────────╢
║Nodes      │ 8152│303.2 KiB│         Yes║
╟───────────┼─────┼─────────┼────────────╢
║Populations│    4│  2.4 KiB│         Yes║
╟───────────┼─────┼─────────┼────────────╢
║Provenances│  232│  3.1 GiB│          No║
╟───────────┼─────┼─────────┼────────────╢
║Sites      │  680│ 16.0 KiB│          No║
╚═══════════╧═════╧═════════╧════════════╝

That's a heck of a lot of provenance data (3.1 GiB!) - I wonder what's going on there? It's very unlikely this is of any use, so it's worth figuring out where it came from and stopping it from happening.

I think that must be the tszip issue: the codec can't handle columns larger than 2 GiB.
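
For reference, the limit quoted in the error message is the largest signed 32-bit integer, one byte short of 2 GiB. A quick check of the arithmetic, assuming nearly all of the 3.1 GiB provenance table sits in a single column (the byte count is approximate because it is derived from the rounded figure in the table above):

```python
# The limit in the ValueError is the largest signed 32-bit integer.
CODEC_LIMIT = 2**31 - 1  # 2147483647 bytes, one byte short of 2 GiB

# Approximate provenance size, taken from the rounded "3.1 GiB" reported above.
provenance_bytes = int(3.1 * 2**30)

print(CODEC_LIMIT)
print(provenance_bytes > CODEC_LIMIT)
```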

@jeromekelleher changed the title from "Error when compressing tree sequences" to "Cannot handle columns > 2GiB" on Apr 11, 2023
@ShyamieG
Author

Ah, okay, got it. Any tips on how I can go about figuring out what all of this stuff is?

@jeromekelleher
Member

I guess the first step would be to look at the provenances, e.g.

tskit provenances <file> | less -S

I'm not sure how well it'll deal with having a 2G record though.
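
If the command-line viewer struggles with very large records, the same inspection could be done from Python. This is only a sketch: `provenance_record_sizes` is a hypothetical helper name, and it assumes the object passed in is a tskit tree sequence, whose `provenances()` iterator yields rows with `id`, `timestamp` and `record` attributes.

```python
def provenance_record_sizes(ts):
    """Return (row id, timestamp, record length in characters) for each
    provenance row, largest record first."""
    rows = [(p.id, p.timestamp, len(p.record)) for p in ts.provenances()]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

With `ts = tskit.load("<file>")`, printing the first few entries of `provenance_record_sizes(ts)` should show which steps contributed the bulk of the data without dumping the records themselves.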

@ShyamieG
Author

Awesome, thank you! I took a look at one of my smaller files and saw that there is indeed a lot of redundant information. These trees are the result of merging several other trees, so that is part of it. I set record_provenance to False in my tskit.union() call to mitigate this somewhat.

However, the other issue is that my files are being produced as a result of passing a tree sequence from one SLiM script to another dozens to hundreds of times. This also results in a lot of redundant information being stored.

Is it possible to simply delete certain kinds of provenance information entirely? For example, I don't need to store information about the SLiM model or parameters with the ts file. Any problems with deleting this kind of information that I should be aware of?

For context, I'm still working on the same general problem that I described in this post some months ago.

@jeromekelleher
Member

Easiest thing to do is just drop the provenance info entirely by truncating the provenance table. It's unlikely to have any effect on things working, as code shouldn't really be depending on the contents.

The docs here might help https://tskit.dev/tskit/docs/stable/provenance.html
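
A minimal sketch of what dropping the provenance table might look like in Python. The helper name `drop_provenance` is hypothetical; it assumes `ts` is a tskit tree sequence and relies on `dump_tables()`, the table `clear()` method and `tree_sequence()`.

```python
def drop_provenance(ts):
    """Return a copy of the tree sequence `ts` with an empty provenance table."""
    tables = ts.dump_tables()    # editable copy of the underlying tables
    tables.provenances.clear()   # delete every provenance row
    return tables.tree_sequence()
```

Something like `drop_provenance(tskit.load("<file>")).dump("<output file>")` would then write a copy without the provenance data. To keep the earliest records and discard only the later ones, `tables.provenances.truncate(n)` should keep the first `n` rows instead of clearing everything.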

@jeromekelleher
Member

I'm going to keep this one open @ShyamieG as it is genuinely a bug in tszip. At a minimum we should emit a better error message saying what the problem is.
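
As a rough illustration of the kind of check that could produce a clearer message, here is a sketch of a size check run before compression. It is not tszip's actual code: `check_column_sizes`, its input shape (a mapping from column name to size in bytes) and the wording of the message are all assumptions made for the example.

```python
CODEC_LIMIT = 2**31 - 1  # bytes; the figure from the ValueError in this issue


def check_column_sizes(column_nbytes):
    """Raise a descriptive ValueError if any column exceeds CODEC_LIMIT.

    `column_nbytes` maps a column name to its size in bytes (a hypothetical
    input shape, not tszip's internal representation).
    """
    too_big = {name: n for name, n in column_nbytes.items() if n > CODEC_LIMIT}
    if too_big:
        details = ", ".join(
            f"{name} ({n / 2**30:.1f} GiB)" for name, n in sorted(too_big.items())
        )
        raise ValueError(
            f"Cannot compress columns larger than {CODEC_LIMIT} bytes: {details}. "
            "Consider removing or shrinking this data before compressing."
        )
```

Naming the offending column and its size in the message would have pointed straight at the provenance table in this case.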

@ShyamieG ShyamieG reopened this Jun 23, 2023