pickling very large maps fails #314

Open
jllanfranchi opened this issue Mar 29, 2017 · 2 comments
Labels
Outdated (Ancient issues that can be discarded)

Comments

jllanfranchi (Contributor) commented Mar 29, 2017

Not sure if this is fixable. Pickle seems like a bad way to store really large maps (HDF5, e.g., would make more sense), but it might be a bug...

Alternatively, could we integrate with the .npy binary file format somehow? https://docs.scipy.org/doc/numpy/neps/npy-format.html Do we need to abandon pickle altogether?
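For illustration, a minimal sketch of what the .npy route might look like for the bulky part of a map (the file name and the assumption that the expensive-to-pickle piece is a single large numpy array are hypothetical, not existing PISA API):

```python
# Hypothetical sketch: store a large array as .npy instead of pickling it.
# Plain numeric arrays are written as raw binary with a small header, so
# numpy's .npy format sidesteps pickle's protocol/size pitfalls for the
# bulky part of a map.
import numpy as np

hist = np.random.random((1000, 1000))  # stand-in for a very large map histogram

np.save('map_hist.npy', hist)          # no pickling involved for plain numeric arrays
loaded = np.load('map_hist.npy')
assert np.array_equal(hist, loaded)
```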

jllanfranchi (Contributor, Author) commented:

Other alternatives for this:

  • If we use .npy files, we would need to create a directory with each key as a filename and each value stored as that file's contents, either as a .npy file or as .json or somesuch. This gets ugly fast when translating a nested dict into a directory of files.
    • .npz can bundle multiple arrays in one file, but it doesn't help with arbitrary Python objects.
  • Google Flatbuffers... but the Python interface looks rather clunky and not well maintained, it is more stuff to install that requires compilation, and there doesn't seem to be an active community of Python users.
  • Apache Arrow... seems to work well with large arrays, can be memory-mapped (not necessary here, but nice), and is zero-copy (and fast) like Flatbuffers, though it is still a nascent project. It is pip-installable, which is nice. Could use the Feather file format, the native format, or Apache Parquet(?).
    • Since we already have serializable_state in many core objects, which produces a dict of simple Python datatypes (plus numpy types), it seems Arrow might be able to handle this as-is: http://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization
    • The spec is not promised to be stable across versions, so this should not be used for long-term storage. It could be used for caching, though, and if the version is stored alongside the data, files can still be read/interpreted correctly (though needing different versions of the same library to read different files gets hairy).
    • EDIT: Apache Arrow uses Google Flatbuffers under the hood for some pieces of its internal representation of data.
  • HDF5: this is good and only a little bad. We've used it before; it stores large arrays well and can hold fairly arbitrary structures. The library itself is big and bloated, carrying far more complexity than we need, but it works, is cross-platform, and is neither terribly slow nor does it produce terribly large files. (A rough sketch of this route is below.)
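As a rough illustration of the HDF5 option, here is a minimal sketch (using h5py; the function and the example state dict are made up for illustration, not existing PISA code) that writes a serializable_state-style nested dict of simple Python and numpy types into an HDF5 file:

```python
# Hypothetical sketch: recursively dump a nested dict of simple Python /
# numpy types (e.g. something like an object's serializable_state) to HDF5.
import numpy as np
import h5py

def save_state(state, group):
    """Write a dict into an h5py group: sub-dicts become groups, numpy
    arrays become datasets, and everything else becomes an attribute."""
    for key, val in state.items():
        if isinstance(val, dict):
            save_state(val, group.require_group(key))
        elif isinstance(val, np.ndarray):
            group.create_dataset(key, data=val, compression='gzip')
        else:
            group.attrs[key] = val

# Example usage with a made-up state dict
state = {
    'name': 'example_map',
    'hist': np.random.random((200, 200)),
    'binning': {
        'e_bins': np.logspace(0, 2, 201),
        'cz_bins': np.linspace(-1, 1, 201),
    },
}
with h5py.File('map_state.h5', 'w') as f:
    save_state(state, f)
```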

jllanfranchi (Contributor, Author) commented Dec 13, 2017

See also #26

@LeanderFischer LeanderFischer added this to the PISA 4.2 milestone Apr 24, 2024
@LeanderFischer LeanderFischer added the Outdated Ancient issues that can be discarded label Apr 24, 2024
@thehrh thehrh removed this from the PISA 4.2 milestone Feb 20, 2025