
Sparse storage for growable axes #235


Open

nsmith- opened this issue Nov 21, 2019 · 7 comments
Labels
enhancement New feature or request

Comments

@nsmith-
Member

nsmith- commented Nov 21, 2019

Copy of part 7 of #214

Regarding categorical axes, it looks like the storage holds the full outer product of the growable categories:

import boost_histogram as bh

h = bh.histogram(
  bh.axis.category([''], growth=True),
  bh.axis.category([''], growth=True),
  bh.axis.category([''], growth=True),
  bh.axis.regular(60, 60, 120),
)
h.fill(
  ['dataset1', 'dataset2', 'dataset2'],
  ['region1', 'region1', 'region2'],
  ['', '', 'JESup'],
  [90., 86., 92.],
)
import pickle
len(pickle.dumps(h))

returns 9290 (787 empty), while for comparison,

import pickle
import coffea.hist as hist

h = hist.Hist('events',
  hist.Cat('dataset', ''),
  hist.Cat('region', ''),
  hist.Cat('systematic', ''),
  hist.Bin('mass', '', 60, 60, 120),
)
h.fill(dataset='dataset1', region='region1', systematic='', mass=90.)
h.fill(dataset='dataset2', region='region1', systematic='', mass=86.)
h.fill(dataset='dataset2', region='region2', systematic='JESup', mass=92.)
len(pickle.dumps(h))

returns 4898 (2880 empty). Coffea histograms carry a fair bit of pickling overhead compared to boost-histogram, but the savings from sparseness quickly catch up.
Is there a way to request sparse bin storage here?

Comment from @HDembinski

Sparse storages are implemented in boost::histogram (the C++ library), but currently there are no wrappers for them in boost-histogram (the Python bindings). Histograms with sparse storage are neither performant nor memory efficient: a sparse storage carries significant memory overhead for each filled cell, and cell lookup is much slower, too. Not everything that sounds good in theory also works well in practice. Could you run your coffea histogram with sparse storage against boost-histogram with dense storage on a realistic data set?

To clarify: in coffea it is a mix of sparse and dense storage, where any fixed-size axes are stored in a dense format and each dense chunk is treated like a cell in the sparse product.
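
A minimal sketch of that idea, assuming a dict keyed by category tuples with one dense numpy array per key (illustrative only; not coffea's actual implementation, and SparseOfDense is a hypothetical name):

import numpy as np
from collections import defaultdict

class SparseOfDense:
    """Sparse outer product of categories; one dense block per key."""
    def __init__(self, n_dense_bins):
        self.n = n_dense_bins
        # A dense counts array is allocated only when a category
        # combination is first filled.
        self.blocks = defaultdict(lambda: np.zeros(self.n))

    def fill(self, categories, dense_index):
        self.blocks[tuple(categories)][dense_index] += 1

h = SparseOfDense(60)
h.fill(('dataset1', 'region1', ''), 30)       # mass = 90 -> bin 30 of 60
h.fill(('dataset2', 'region2', 'JESup'), 32)  # mass = 92 -> bin 32
# Only the 2 filled category combinations allocate a 60-bin block; the
# other 6 combinations of the 2x2x2 outer product cost nothing.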

@HDembinski
Member

Some notes:

  1. Due to the structure of Boost.Histogram, the storage size is always equal to the product of the axis sizes. By design, the storage does not know anything about the axis structure, so that storages and axes are fully decoupled, orthogonal components. This design has many obvious benefits, but because of it we cannot have a mix of sparse and dense storage as in coffea. We could still have a fully general sparse storage, but research has shown that it is slow to fill, and for a histogram that fills about half of its cells it is also less memory efficient than a dense storage. Since coffea mixes sparse and dense, it does not have to pay a large price in memory.
  2. We could easily make the pickled size smaller by "zero suppression", where we only write out entries which are "non-zero" (quotes because we also have accumulators, where the "zero" is the default-constructed entry). A zero-suppression format takes more space when the histogram is fairly well filled, so it should only be used when it is beneficial. Detecting whether to use this format requires two passes over the storage memory, and people who want to write as fast as possible may not like this trade-off. See the sketch after this list.
  3. Boost.Histogram was designed to be filled as fast as possible from arrays of data. I would guess it is faster than coffea, but it would be nice to compare.
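
A hedged sketch of the zero-suppression idea from point 2, operating on a plain numpy counts array rather than any real boost-histogram storage; pack and unpack are hypothetical helper names, and building both serializations here stands in for the two-pass detection described above:

import pickle
import numpy as np

def pack(counts):
    # First pass: find the filled cells.
    idx = np.flatnonzero(counts)
    dense = pickle.dumps(counts)
    sparse = pickle.dumps((counts.shape, idx, counts.flat[idx]))
    # Keep whichever format is actually smaller for this histogram;
    # a well-filled histogram will favor the dense dump.
    return min(dense, sparse, key=len)

def unpack(blob):
    obj = pickle.loads(blob)
    if isinstance(obj, np.ndarray):   # dense format
        return obj
    shape, idx, vals = obj            # zero-suppressed format
    counts = np.zeros(shape, dtype=vals.dtype)
    counts.flat[idx] = vals
    return counts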

@henryiii added the enhancement label Nov 22, 2019
@HDembinski
Member

One more word on sparse storage: we should benchmark with https://github.com/greg7mdp/sparsepp. It is supposed to be a drop-in replacement for unordered_map, so it should work out of the box with boost::histogram.

@HDembinski
Member

The author of sparsepp suggests this:
https://github.com/greg7mdp/parallel-hashmap

@henryiii
Member

parallel-hashmap looks nice and would be easy to add as a dep.

@btovar

btovar commented Aug 30, 2023

Per the discussion in boostorg/histogram#389, I can work on adding the map storage to the boost-histogram bindings, based on https://github.com/greg7mdp/parallel-hashmap.

@Superharz

Hello everybody,

What is the status of the sparse implementation in boost-histogram? I would like to use it because I am dealing with many dimensions and many empty bins (approx. 95% of the bins are empty).

@btovar

btovar commented Apr 16, 2025

@Superharz, we ended up not working on this one.
