The grid is 700 km by 1300 km, with points every 50 m. That gives you 700e3 * 1300e3 / 50^2 = 364e6 points.
Stored as 32-bit (4-byte) floats, that's 1456e6 bytes, or approx. 1.4 GB. (So we've stored 1 GB and 0.4 sea, presumably.)
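The arithmetic above can be checked with a few lines of Python. This is only a sketch: the function names `grid_points` and `grid_bytes` are made up for illustration, and it assumes the grid divides evenly by the spacing.

```python
def grid_points(width_m, height_m, spacing_m):
    """Number of points on a rectangular grid sampled at a fixed spacing."""
    return (width_m // spacing_m) * (height_m // spacing_m)


def grid_bytes(width_m, height_m, spacing_m, bytes_per_value=4):
    """Raw storage needed, assuming one fixed-size value per point (4 bytes = float32)."""
    return grid_points(width_m, height_m, spacing_m) * bytes_per_value


points_50m = grid_points(700_000, 1_300_000, 50)
bytes_50m = grid_bytes(700_000, 1_300_000, 50)

print(points_50m)          # 364000000
print(bytes_50m)           # 1456000000
print(bytes_50m / 2**30)   # roughly 1.36 when counted in GiB instead of GB
```

The same functions can be called with a different spacing to see how quickly the raw size grows.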
My cached .npy NumPy array is 1,456,000,128 bytes, so only 128 bytes of overhead from headers, which is pretty good.
Zip compression brings this down to 200,776,764 bytes (~200 MB).
Interestingly, the downloaded zip of ASC files plus metadata was only 161,575,097 bytes.
Inside the downloaded zip, the ASC files each contain 200*200 points in ASCII text, but with low precision where possible.
A randomly sampled file was ~200 kB, for an average of 5 bytes per number (e.g. "1.234"; sea is e.g. "-0.1", hill points are e.g. "1.2345678").
But zip, which searches for repetition, seems to perform really well on this representation (approx. 20% better than on floats!).
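A rough way to poke at this is to compress the same numbers in both representations. The sketch below uses `zlib` (DEFLATE, the method zip files normally use) on made-up data: `synthetic_heights` is a hypothetical stand-in with flat "sea" at -0.1 and smooth hills, not the real data set, so the sizes it prints say nothing about the real numbers above.

```python
import math
import struct
import zlib


def synthetic_heights(n):
    """Made-up heights: flat 'sea' at -0.1 plus smooth hills rounded to 0.1 m."""
    heights = []
    for i in range(n):
        h = 40.0 * math.sin(i / 300.0) + 5.0 * math.sin(i / 17.0)
        heights.append(round(h, 1) if h > 0 else -0.1)
    return heights


heights = synthetic_heights(40_000)  # same count as one 200*200 tile

# Representation 1: little-endian 32-bit floats, 4 bytes per value.
as_floats = struct.pack(f"<{len(heights)}f", *heights)

# Representation 2: short ASCII numbers separated by spaces.
as_text = " ".join(f"{h:g}" for h in heights).encode("ascii")

sizes = {}
for label, raw in (("float32", as_floats), ("ascii", as_text)):
    packed = zlib.compress(raw, 9)
    sizes[label] = (len(raw), len(packed))
    print(label, "raw:", len(raw), "compressed:", len(packed))
```

Running this on real tiles instead of the synthetic row would show whether the text form really does compress better, and by how much.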
All of this means there is a lot of structure in the data that zip can exploit, and that we might be able to exploit ourselves for compact (and fast) storage in memory.
At the moment it's 1.6 GB in memory, which many but not all laptops can handle, but we need a spline in memory too.
If we want to get our hands on the 5 m spaced data set (10^2 times more data), we need to solve this problem too (or have a machine that can easily store 160 GB in memory).
A typical solution might be to calculate which block to load from disk, and then keep this cached until the user goes outside the block. But for a server situation we'd want all blocks in memory all the time...
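The load-a-block-and-cache-it idea could look something like the sketch below. Everything here is an assumption for illustration: `load_block` and `height_at` are hypothetical names, the loader returns placeholder values instead of reading from disk, and the block size is taken from the 200*200 tiles mentioned above.

```python
from functools import lru_cache

SPACING_M = 50           # grid spacing in metres
POINTS_PER_BLOCK = 200   # points per block edge, as in the 200*200 ASC tiles


@lru_cache(maxsize=16)   # keep at most 16 recently used blocks in memory
def load_block(bx, by):
    """Hypothetical loader; a real one would read block (bx, by) from disk."""
    value = float(bx * 1000 + by)  # placeholder so different blocks differ
    return [[value] * POINTS_PER_BLOCK for _ in range(POINTS_PER_BLOCK)]


def height_at(x_m, y_m):
    """Nearest-lower grid point lookup; loads the containing block on demand."""
    bx, ix = divmod(int(x_m) // SPACING_M, POINTS_PER_BLOCK)
    by, iy = divmod(int(y_m) // SPACING_M, POINTS_PER_BLOCK)
    return load_block(bx, by)[iy][ix]


print(height_at(25, 25))        # first access loads block (0, 0)
print(height_at(9_999, 9_999))  # same block, served from the cache
print(load_block.cache_info())
```

For the server case, `maxsize=None` would keep every block that has ever been requested, which brings back the original memory problem.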
I wonder if a wavelet approach (Haar wavelets!) might work?
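To make the wavelet idea a bit more concrete, here is one level of an (unnormalised) Haar transform on a single row of made-up heights. It is only a sketch of the basic step, not a claim that this is the right storage scheme: `haar_forward` and `haar_inverse` are illustrative names, and a 2D version would apply the same step along rows and then columns.

```python
def haar_forward(values):
    """One Haar level: pairwise means and half-differences (even-length input)."""
    pairs = list(zip(values[0::2], values[1::2]))
    means = [(a + b) / 2 for a, b in pairs]
    details = [(a - b) / 2 for a, b in pairs]
    return means, details


def haar_inverse(means, details):
    """Rebuild the original values from means and half-differences."""
    values = []
    for m, d in zip(means, details):
        values.extend((m + d, m - d))
    return values


# Made-up row: four 'sea' points followed by a rising slope.
row = [-0.1, -0.1, -0.1, -0.1, 1.5, 2.5, 4.0, 5.0]
means, details = haar_forward(row)

print(means)    # [-0.1, -0.1, 2.0, 4.5]
print(details)  # [0.0, 0.0, -0.5, -0.5]
print(haar_inverse(means, details) == row)  # True
```

Flat regions such as sea produce detail coefficients of exactly zero, and smooth slopes produce small ones, which is the kind of structure a compact representation could drop or store cheaply.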
Separate issue, but a fun one to think about for a future student?
MichaelClerx changed the title from "Come up with a clever way to store the data (in memory)" to "Come up with a clever way to store the data or interpolant in memory" on Dec 14, 2021
Zip compression on this data set is amazing