-
Notifications
You must be signed in to change notification settings - Fork 71
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sparse histograms and maped-based storage #389
Comments
Yes, you can use any kind of map as a storage which supports the standard library map interface. Please see https://www.boost.org/doc/libs/develop/libs/histogram/doc/html/histogram/guide.html#histogram.guide.expert.user_defined_storage_class The map should preferably be a hash map. This looks like a good candidate https://github.com/greg7mdp/parallel-hashmap, specifically phmap::parallel_flat_hash_map. I haven't tested it, though. |
Thanks @HDembinski, I'll give that a try! |
I think more the point here is that this addition would be good to have as part of boost-histogram (or its bindings) so that we don't create a fine dust of packages. Lots of HEP users of this package and the python bindings would benefit from sparse storage. Given that the easiest route here probably doesn't fit in boost (given probably using parallel-hashmap), I guess we should open a PR extending https://github.com/scikit-hep/boost-histogram ? Thanks for the pointer to the parallel-hashmap, one thing we definitely want here is thread friendliness. |
@lgray Yes, please add an issue to https://github.com/scikit-hep/boost-histogram. I agree that if there is a clear use-case for sparse storages, then we should support them in the Python package. @henryiii It should be easy to add a hashmap storage to boost-histogram, right? |
Adding it should be pretty easy. The only tricky part on the C++ side is setting up the conditional parts for functions that have really different returns when using this sparse storage. Also this will increase the compile complexity - how many backends would be set up for sparse storage? None of this is insurmountable, it's just something that I don't currently have time to work on, but I'd be able to help someone work on it (like @btovar). I could possibly put it forward as a Fellows project, as well. Footnotes
|
Some spontaneous thoughts on the implementation in boost-histogram (Python lib).
|
Accessing the data without copying may be a requirement given memory constraints. However, if we can process one pair of (key, value) at a time it may work. Could the storage of one boost_histogram be a boost_histogram-like structure? E.g., have the first histogram be sparse categorical axes, and have its storage be a dense histogram. In python we could then return dictionaries that transparently show the dense data. There would be need to overload functions like h.axes so that it reflects the second histogram, etc. |
The accessors ( I think We do need the ability to modify the state of the storage even for a first pass, I think. Unpickling, for example, has to make and fill a storage directly. We also have to make and fill storages for some indexing operations, etc. Footnotes
|
I don't understand exactly what this means. If you have 3 categories, each with a dense storage, then that's already what a dense overall storage with 3 categories is. Boost-histogram supports growing category axes. If you mean each of the "dense" histograms inside the categories is different, then this already sounds complicated for only a small benefit, and it would be better to just solve this with a fully sparse storage that would allow more than one "special sparse" axis. Hist does almost support this (if I understood what you were asking), by the way, with "Stack"s. It just doesn't support filling stacks and miss-matched categories on categorical axes, but that could probably be added. So if you are thinking about "one special" axes, I'd look at possibly adding this to Hist's Stack's. |
Yes, I was thinking of having different histograms as the storage, which is what we currently do in a class purely implemented in python. We use several categorical axes to index the dense parts by implementing I looked at the histogram stacks, but indeed, there is no special categorical axes. (E.g. we may sum over any of the as needed.) I agree that solving it with a full sparse storage would be best. |
From this post from some years ago, there seems to be the implication that having a sparse histogram should be feasible using boost-histograms. Is there a map-based storage supported out of the box that we could use, or that would be easy to implement to target the scikit-hep python bindings?
For some context, we implemented some histograms using scikit-hep hist. These histograms have a handful of categorical axes, and ~300 regular/variable axes. The histograms are pretty sparse, with 97% of zeros of about a billion bins. In our HEP analysis, adding the results becomes impractical as they need > 120GB of memory. Using a subclass of hist that internally uses a dictionary of histograms [categorical keys -> dense hist], we were able to reduce the memory usage to about 20GB, which made the approach practical again in terms of memory and runtime. This subclass tries to mimic the h[...] access of the original hist, and uses awkward arrays where the original hist would return numpy arrays, etc. However, it is somewhat inefficient as the storage (in the sense of the function that tells which bin to update) is completely in python and has to use python dictionaries.
Ideally, we are looking for a storage that would enable spare histograms that then can be used in scikit-hep's boost-histograms and hist. We would appreciate any guidance on this topic.
The text was updated successfully, but these errors were encountered: