
Sparse storage for growable axes #235


Open

nsmith- opened this issue Nov 21, 2019 · 7 comments
Labels
enhancement New feature or request

Comments

@nsmith-
Member

nsmith- commented Nov 21, 2019

Copy of part 7 of #214

Regarding categorical axes, it looks like the storage holds the full outer product of the growable categories:

import boost_histogram as bh

h = bh.histogram(
  bh.axis.category([''], growth=True),
  bh.axis.category([''], growth=True),
  bh.axis.category([''], growth=True),
  bh.axis.regular(60, 60, 120),
)
h.fill(
  ['dataset1', 'dataset2', 'dataset2'],
  ['region1', 'region1', 'region2'],
  ['', '', 'JESup'],
  [90., 86., 92.],
)
import pickle
len(pickle.dumps(h))

returns 9290 (787 empty), while for comparison,

import pickle
import coffea.hist as hist

h = hist.Hist('events',
  hist.Cat('dataset', ''),
  hist.Cat('region', ''),
  hist.Cat('systematic', ''),
  hist.Bin('mass', '', 60, 60, 120),
)
h.fill(dataset='dataset1', region='region1', systematic='', mass=90.)
h.fill(dataset='dataset2', region='region1', systematic='', mass=86.)
h.fill(dataset='dataset2', region='region2', systematic='JESup', mass=92.)
len(pickle.dumps(h))

returns 4898 (2880 empty). Coffea histograms carry a fair bit of pickling overhead compared to boost-histogram, but the savings from sparseness quickly catch up.
Is there a way to request sparse bin storage here?

Comment from @HDembinski

Sparse storages are implemented in boost::histogram (the C++ library), but currently there are no wrappers for them in boost-histogram (the Python bindings). Histograms with sparse storage are neither performant nor memory efficient: a sparse storage carries significant memory overhead for each filled cell, and cell lookup is much slower, too. Not everything that sounds good in theory also works well in practice. Could you run your coffea histogram with sparse storage against boost-histogram with dense storage on a realistic data set?

To clarify: in coffea it is a mix of sparse and dense storage, where any fixed-size axes are stored in a dense format and each dense chunk is treated like a cell in the sparse product.
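
A minimal sketch of that idea, assuming a dict keyed by category tuples with one dense numpy array per key (illustrative only; not coffea's actual implementation, and SparseOfDense is a hypothetical name):

import numpy as np
from collections import defaultdict

class SparseOfDense:
    """Sparse outer product of categories; one dense block per key."""
    def __init__(self, n_dense_bins):
        self.n = n_dense_bins
        # A dense counts array is allocated only when a category
        # combination is first filled.
        self.blocks = defaultdict(lambda: np.zeros(self.n))

    def fill(self, categories, dense_index):
        self.blocks[tuple(categories)][dense_index] += 1

h = SparseOfDense(60)
h.fill(('dataset1', 'region1', ''), 30)       # mass = 90 -> bin 30 of 60
h.fill(('dataset2', 'region2', 'JESup'), 32)  # mass = 92 -> bin 32
# Only the 2 filled category combinations allocate a 60-bin block; the
# other 6 combinations of the 2x2x2 outer product cost nothing.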

@HDembinski
Member

Some notes:

  1. Due to the structure of Boost.Histogram, the storage size is always equal to the product of the axis sizes. By design, the storage does not know anything about the axis structure, so that storages and axes are fully decoupled, orthogonal components. This design has many obvious benefits, but because of it we cannot have a mix of sparse and dense storage as in coffea. We could still have a fully general sparse storage, but research has shown that it is slow to fill, and for a histogram that fills about half of its cells it is also less memory efficient than a dense storage. Since coffea mixes sparse and dense, it does not have to pay a large price in memory.
  2. We could easily make the pickled size smaller by "zero suppression", where we only write out entries which are "non-zero" (quotes because we also have accumulators, where the "zero" is the default-constructed entry). A zero-suppression format takes more space when the histogram is fairly well filled, so it should only be used when it is beneficial. Detecting whether to use this format requires two passes over the storage memory, and people who want to write as fast as possible may not like this trade-off. See the sketch after this list.
  3. Boost.Histogram was designed to be filled as fast as possible from arrays of data. I would guess it is faster than coffea, but it would be nice to compare.
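
A hedged sketch of the zero-suppression idea from point 2, operating on a plain numpy counts array rather than any real boost-histogram storage; pack and unpack are hypothetical helper names, and building both serializations here stands in for the two-pass detection described above:

import pickle
import numpy as np

def pack(counts):
    # First pass: find the filled cells.
    idx = np.flatnonzero(counts)
    dense = pickle.dumps(counts)
    sparse = pickle.dumps((counts.shape, idx, counts.flat[idx]))
    # Keep whichever format is actually smaller for this histogram;
    # a well-filled histogram will favor the dense dump.
    return min(dense, sparse, key=len)

def unpack(blob):
    obj = pickle.loads(blob)
    if isinstance(obj, np.ndarray):   # dense format
        return obj
    shape, idx, vals = obj            # zero-suppressed format
    counts = np.zeros(shape, dtype=vals.dtype)
    counts.flat[idx] = vals
    return counts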

@henryiii added the enhancement label Nov 22, 2019
@HDembinski
Member

One more word on sparse storage: we should benchmark with https://github.com/greg7mdp/sparsepp. It is supposed to be a drop-in replacement for unordered_map, so it should work out of the box with boost::histogram.

@HDembinski
Member

The author of sparsepp suggests this:
https://github.com/greg7mdp/parallel-hashmap

@henryiii
Member

parallel-hashmap looks nice and would be easy to add as a dep.

@btovar

btovar commented Aug 30, 2023

Per the discussion in boostorg/histogram#389, I can work on adding the map storage to the boost-histogram bindings, based on https://github.com/greg7mdp/parallel-hashmap.

@Superharz

Hello everybody,

What is the status of the sparse implementation in boost-histogram? I would like to use it because I am dealing with many dimensions and many empty bins (approx. 95% of the bins are empty).

@btovar

btovar commented Apr 16, 2025

@Superharz, we ended up not working on this one.
