Add support for double weights #203

Open
EnricoMi opened this issue Oct 17, 2022 · 2 comments
Comments

@EnricoMi

What are your thoughts on supporting double weights instead of integer weights only? This would allow weights in (0, 1] to be used directly, which would be more convenient than mapping those weights to integers in user code.

This would require distinguishing the semantics of count from those of weight, which could be beneficial in other use cases as well, e.g. #198.

Obviously, this will introduce a breaking change to the API.

@tdunning
Owner

Yeah... there has been a fair bit of discussion on this.

The core question is what the t-digest invariant actually means with non-integer weights.

Do you have thoughts on that?

The key problems in the past include:

  • Violation of the invariant has been allowed for centroids with weight = 1. If weights < 1 are allowed, what is the status of this exemption? Is a weight less than one still an indivisible value?
  • If the exemption is removed so that all centroids must meet the invariant, we will require an infinite number of centroids for K=2 or K=3. How could that be resolved?
  • If we assume that any centroid that is added represents a single sample with variable weight, rather than the number of samples represented by the weight, then we could allow the exemption from the scale invariant for all samples with weight <= 1. What happens if we merge two such centroids and the weight is still < 1? Do we have to remember that this new centroid has more than one sample?

So, what do you think?

@EnricoMi
Author

EnricoMi commented Dec 13, 2022

Centroids would then have to maintain their cardinality (the number of samples they represent). The exemption can then be based on the cardinality, not the weight (in fact, weight used to act as a kind of cardinality, under the assumption of unit weights).
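A minimal sketch of that idea, in Python rather than the library's Java, and not the library's API: `Centroid`, `count`, `merge`, and `exempt` are hypothetical names used only for illustration. It shows that a merged centroid can still have weight < 1 while its cardinality records that it is no longer a single sample.

```python
from dataclasses import dataclass


@dataclass
class Centroid:
    mean: float     # weighted mean of the samples in this centroid
    weight: float   # sum of the (possibly non-integer) sample weights
    count: int = 1  # cardinality: number of samples merged into this centroid

    def merge(self, other: "Centroid") -> "Centroid":
        """Combine two centroids, summing both weight and cardinality."""
        total = self.weight + other.weight
        mean = (self.mean * self.weight + other.mean * other.weight) / total
        return Centroid(mean, total, self.count + other.count)

    @property
    def exempt(self) -> bool:
        """Assumed rule: only single-sample centroids may violate the invariant."""
        return self.count == 1


a = Centroid(2.0, 0.25)
b = Centroid(2.0, 0.2)
merged = a.merge(b)  # weight is still < 1, but count is now 2
```

Here `a` and `b` are each exempt, while `merged` is not, even though its weight stays below 1.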

With non-integer weights, the t-digest invariant $|C|_{k} = k(q_{right}) - k(q_{left}) \le 1$ requires $q_{left}$ and $q_{right}$ to be normalized by the sum of all weights, not by the number of all samples (the two used to coincide with unit weights). Here $W_{left}(C)$ is the total weight of the clusters to the left of $C$ and $|C|$ is the weight of $C$:

$$q_{left} = \frac{W_{left}(C)}{\sum w}$$

$$q_{right} = q_{left} + \frac{|C|}{\sum w}$$

Then, the invariant should behave identically to the one for equivalent integer weights.
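The normalization can be sketched as follows. This is an illustration under assumptions, not the library's code: it uses the $k_1$ scale function $k_1(q) = \frac{\delta}{2\pi} \sin^{-1}(2q - 1)$ from the t-digest paper, and `quantile_bounds`, `k_size`, and the value of `delta` are hypothetical choices.

```python
import math


def k1(q: float, delta: float) -> float:
    """Scale function k_1 from the t-digest paper (assumed here for illustration)."""
    return delta / (2 * math.pi) * math.asin(2 * q - 1)


def quantile_bounds(cluster_weights):
    """Return (q_left, q_right) per cluster, normalized by the sum of all weights."""
    total = math.fsum(cluster_weights)
    bounds, left = [], 0.0
    for w in cluster_weights:
        q_left = left / total
        q_right = min(1.0, q_left + w / total)  # clamp against rounding error
        bounds.append((q_left, q_right))
        left += w
    return bounds


def k_size(q_left: float, q_right: float, delta: float) -> float:
    """k-size of a cluster: k(q_right) - k(q_left)."""
    return k1(q_right, delta) - k1(q_left, delta)


delta = 100.0  # hypothetical compression parameter
fractional = quantile_bounds([0.1, 0.45])  # cluster weights as doubles
integer = quantile_bounds([2, 9])          # the same weights scaled by 20
```

Because both the cluster weights and their sum scale by the same factor, `fractional` and `integer` agree up to rounding, and so do the resulting k-sizes.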

For example, the non-integer weighted samples (sample value, sample weight)

(1, 0.1), (2, 0.25), (2, 0.2)

with quantiles $(q_{left}, q_{right})$ for the clusters $\{(1, 0.1)\}, \{(2, 0.45)\}$

(0, 0.1/0.55), (0.1/0.55, 1)

are equivalent to these integer-weighted samples:

(1, 2), (2, 9)

with quantiles $({q}_{left}, {q}_{right})$ for clusters $\{(1, 2)\}, \{(2, 9)\}$

(0, 2/11), (2/11, 1)

The difference is that the cluster $\{(1, 0.1)\}$ has cardinality 1, which is exempted from the invariant, while the cluster $\{(1, 2)\}$ has cardinality 2, which is not exempted from the invariant.
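The example can be checked with exact rational arithmetic. The sketch below is illustrative only: the `(value, weight, cardinality)` tuples and the `boundaries` helper are hypothetical, and the exemption rule (cardinality equal to 1) is the one proposed in this comment, not existing library behavior.

```python
from fractions import Fraction


def boundaries(clusters):
    """clusters: list of (value, weight, cardinality); returns exact (q_left, q_right) pairs."""
    total = sum(w for _, w, _ in clusters)
    out, left = [], Fraction(0)
    for _, w, _ in clusters:
        out.append((left / total, (left + w) / total))
        left += w
    return out


# Clusters from the example: (value, total weight, number of samples)
fractional = [(1, Fraction("0.1"), 1), (2, Fraction("0.45"), 2)]
integer = [(1, Fraction(2), 2), (2, Fraction(9), 9)]

same_bounds = boundaries(fractional) == boundaries(integer)

# Proposed rule: a cluster is exempt only if it holds a single sample.
exempt_fractional = [n == 1 for _, _, n in fractional]
exempt_integer = [n == 1 for _, _, n in integer]
```

Both cluster sets share the boundaries (0, 2/11) and (2/11, 1), but only the first fractional cluster qualifies for the exemption.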
