pickling very large maps fails #314

Open
jllanfranchi opened this issue Mar 29, 2017 · 2 comments
Labels
Outdated (Ancient issues that can be discarded)

Comments

jllanfranchi (Contributor) commented Mar 29, 2017

Not sure if this is fixable. Pickle seems like a bad way to store really large maps (HDF5, e.g., would make more sense), but it might be a bug...

Alternatively, could we integrate with the .npy binary file format somehow? https://docs.scipy.org/doc/numpy/neps/npy-format.html Do we need to abandon pickle altogether?
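For illustration, a minimal sketch of what the .npy route might look like for the bulky part of a map (the file name and the assumption that the expensive-to-pickle piece is a single large numpy array are hypothetical, not existing PISA API):

```python
# Hypothetical sketch: store a large array as .npy instead of pickling it.
# Plain numeric arrays are written as raw binary with a small header, so
# numpy's .npy format sidesteps pickle's protocol/size pitfalls for the
# bulky part of a map.
import numpy as np

hist = np.random.random((1000, 1000))  # stand-in for a very large map histogram

np.save('map_hist.npy', hist)          # no pickling involved for plain numeric arrays
loaded = np.load('map_hist.npy')
assert np.array_equal(hist, loaded)
```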

jllanfranchi (Contributor, Author) commented:

Other alternatives for this:

  • If we use .npy files, we would need to create a directory with each key as a filename and each value stored as that file's contents, either as a .npy file or as .json or somesuch. This gets ugly fast when translating a nested dict into a directory of files.
    • .npz can bundle multiple arrays in one file, but it doesn't help with arbitrary Python objects.
  • Google Flatbuffers... but the Python interface looks rather clunky and not well maintained, it is more stuff to install that requires compilation, and there doesn't seem to be an active community of Python users.
  • Apache Arrow... seems to work well with large arrays, can be memory-mapped (not necessary here, but nice), and is zero-copy (and fast) like Flatbuffers, though it is still a nascent project. It is pip-installable, which is nice. Could use the Feather file format, the native format, or Apache Parquet(?).
    • Since we already have serializable_state in many core objects, which produces a dict of simple Python datatypes (plus numpy types), it seems Arrow might be able to handle this as-is: http://arrow.apache.org/docs/python/ipc.html#arbitrary-object-serialization
    • The spec is not promised to be stable across versions, so this should not be used for long-term storage. It could be used for caching, though, and if the version is stored alongside the data, files can still be read/interpreted correctly (though needing different versions of the same library to read different files gets hairy).
    • EDIT: Apache Arrow uses Google Flatbuffers under the hood for some pieces of its internal representation of data.
  • HDF5: this is good and only a little bad. We've used it before; it stores large arrays well and can hold fairly arbitrary structures. The library itself is big and bloated, carrying far more complexity than we need, but it works, is cross-platform, and is neither terribly slow nor does it produce terribly large files. (A rough sketch of this route is below.)
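As a rough illustration of the HDF5 option, here is a minimal sketch (using h5py; the function and the example state dict are made up for illustration, not existing PISA code) that writes a serializable_state-style nested dict of simple Python and numpy types into an HDF5 file:

```python
# Hypothetical sketch: recursively dump a nested dict of simple Python /
# numpy types (e.g. something like an object's serializable_state) to HDF5.
import numpy as np
import h5py

def save_state(state, group):
    """Write a dict into an h5py group: sub-dicts become groups, numpy
    arrays become datasets, and everything else becomes an attribute."""
    for key, val in state.items():
        if isinstance(val, dict):
            save_state(val, group.require_group(key))
        elif isinstance(val, np.ndarray):
            group.create_dataset(key, data=val, compression='gzip')
        else:
            group.attrs[key] = val

# Example usage with a made-up state dict
state = {
    'name': 'example_map',
    'hist': np.random.random((200, 200)),
    'binning': {
        'e_bins': np.logspace(0, 2, 201),
        'cz_bins': np.linspace(-1, 1, 201),
    },
}
with h5py.File('map_state.h5', 'w') as f:
    save_state(state, f)
```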

jllanfranchi (Contributor, Author) commented Dec 13, 2017

See also #26

@LeanderFischer LeanderFischer added this to the PISA 4.2 milestone Apr 24, 2024
@LeanderFischer LeanderFischer added the Outdated Ancient issues that can be discarded label Apr 24, 2024
@thehrh thehrh removed this from the PISA 4.2 milestone Feb 20, 2025