
Come up with a clever way to store the data or interpolant in memory #16

Open
MichaelClerx opened this issue Oct 28, 2021 · 1 comment


@MichaelClerx
Member

Zip compression on this data set is amazing

The grid is 700 km by 1300 km, with points every 50 m. That gives you 700e3 * 1300e3 / 50^2 = 364e6 points.

Stored as 32-bit (4-byte) floats, that's 1456e6 bytes, or approx 1.46 GB. (So presumably that's roughly 1 GB of land and 0.4 GB of sea.)
My cached .npy numpy array is 1,456,000,128 bytes, so only 128 bytes of overhead from the header, which is pretty good.
Zip compression brings this down to 200,776,764 bytes (~200 MB).

Interestingly, the downloaded zip of ASC files plus metadata was only 161,575,097 bytes.
Inside the downloaded zip, the .asc files each contain 200 × 200 points in ASCII text, but with low precision where possible.
A randomly sampled file was ~200 kB, for an average of 5 bytes per number (e.g. "1.234"; sea points are e.g. "-0.1", hill points are e.g. "1.2345678").
But zip, which searches for repetition, seems to perform really well on this representation (approx 20% better than on the floats!).
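For reference, something like the sketch below reproduces the arithmetic above and lets numpy write the cached array zip-compressed (np.savez_compressed). The file names and the small, smooth synthetic grid are placeholders, not the real data set:

```python
import numpy as np

# Grid arithmetic from above: 700 km x 1300 km at 50 m spacing.
nx, ny = 700_000 // 50, 1_300_000 // 50
print(nx * ny)                  # 364_000_000 points
print(nx * ny * 4 / 1e9)        # ~1.456 GB as float32

# Placeholder grid (a smooth surface, so compression has something to exploit).
x = np.linspace(0, 1, 2600, dtype=np.float32)
y = np.linspace(0, 1, 1400, dtype=np.float32)
demo = np.sin(9 * x)[None, :] * np.cos(7 * y)[:, None]

np.save('demo.npy', demo)                   # raw: small header + 4 bytes per point
np.savez_compressed('demo.npz', grid=demo)  # zipped (DEFLATE), like the ~200 MB figure

with np.load('demo.npz') as f:              # loading decompresses the whole array again
    grid = f['grid']
```

The catch, of course, is that the compressed file still has to be fully decompressed before we can interpolate from it, so this only helps on disk, not in memory.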

What all of this means is that there is a lot of structure in the data that zip can exploit, and that we might be able to exploit ourselves for compact (and fast) storage in memory.
At the moment it's 1.6 GB in memory, which many (but not all) laptops can handle, but we need a spline in memory too.
If we want to try and get our hands on the 5 m-spaced data set (10^2 times more data), we'll need to solve this problem too (or have a machine that can easily hold 160 GB in memory).

A typical solution might be to calculate which block to load from disk, and then keep this cached until the user goes outside the block. But for a server situation we'd want all blocks in memory all the time...
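A minimal sketch of that block-caching idea, assuming the grid stays cached as a plain .npy file (the file name, shape and block size below are made up). np.load with mmap_mode leaves the data on disk, and only the sliced block actually gets copied into RAM:

```python
import numpy as np

GRID_PATH = 'height_grid.npy'   # hypothetical cached grid, float32, shape (14000, 26000)
BLOCK = 2000                    # 2000 x 2000 samples ~= 16 MB per float32 block

grid = np.load(GRID_PATH, mmap_mode='r')    # memory-mapped: nothing read into RAM yet
_cache = {}                                 # block index -> in-memory block

def get_block(i, j):
    """Return the block containing grid index (i, j), loading it on first use."""
    key = (i // BLOCK, j // BLOCK)
    if key not in _cache:
        bi, bj = key
        # Slicing the memmap and copying pulls just this block into RAM.
        _cache[key] = np.array(
            grid[bi * BLOCK:(bi + 1) * BLOCK, bj * BLOCK:(bj + 1) * BLOCK])
    return _cache[key]

def height(i, j):
    return get_block(i, j)[i % BLOCK, j % BLOCK]
```

For the server case the cache could simply be allowed to grow, or blocks could be evicted with an LRU policy, but neither gets around wanting everything resident eventually.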

I wonder if a wavelet approach (Haar wavelets!) might work?
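A tiny proof of concept of the Haar idea, using PyWavelets (not currently a dependency, so just an assumption) and a crude keep-only-the-largest-coefficients threshold; lossy and purely illustrative:

```python
import numpy as np
import pywt  # PyWavelets -- an assumed extra dependency

def haar_compress(block, level=4, keep=0.05):
    """Haar-transform a 2D block and zero all but the largest `keep`
    fraction of detail coefficients (lossy, illustrative only)."""
    coeffs = pywt.wavedec2(block, 'haar', level=level)
    out = [coeffs[0]]                         # keep the coarse approximation as-is
    for (ch, cv, cd) in coeffs[1:]:
        mags = np.abs(np.concatenate([ch.ravel(), cv.ravel(), cd.ravel()]))
        cut = np.quantile(mags, 1 - keep)
        out.append(tuple(pywt.threshold(c, cut, mode='hard') for c in (ch, cv, cd)))
    return out

def haar_decompress(coeffs):
    return pywt.waverec2(coeffs, 'haar')
```

The thresholded detail coefficients are mostly zeros, so they could be stored sparsely; the open question is whether the reconstruction error is acceptable for the interpolant.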

Separate issue, but a fun one to think about for a future student?

@MichaelClerx
Member Author

It strikes me that gaming people will have thought about this a lot :D

@MichaelClerx MichaelClerx changed the title Come up with a clever way to store the data (in memory) Come up with a clever way to store the data or interpolant in memory Dec 14, 2021