Skip to content

Optimize histogram bucket calculation based on case #18496

@pepijnve

Description

@pepijnve

Is your feature request related to a problem or challenge?

Histograms in SQL can be computed fairly easily using case and order by.

An example of this kind of query is:

select 
    case 
        when salary between 75000 and 90000 then '75000-90000' 
        when salary between 90000 and 120000 then '90000-120000'
        else '120000+'
    end as salary_band, 
    count(*) 
from employee_salary
group by 1

The default evaluation logic for case (which is equivalent to a chain of if/else statements) does not leverage the fact that the ranges being tested here are static and disjunct. In the example above it would be more optimal to compute the boundaries as an array [75000, 90000, 120000] and then try to find the appropriate bucket index by scanning that array.

It would be useful to have a dedicated case evaluation code path for this type of 'binning function' case expression that handles them more optimally. This is similar in idea to the work that's being done in #18183 for equality checks.

Describe the solution you'd like

In the case physical expression implementation, analyse the when conditions. If they are all range tests for the same input value using literals for the range boundaries and the order of the when expressions permits it, extract the range boundaries to an array and test the input value against the array to determine the appropriate then expression index.

If the then expressions are all literals, then there's also no need to actually evaluate those. Instead a lookup table can be used to map each row to the appropriate output value.

Describe alternatives you've considered

No response

Additional context

No response

Metadata

Metadata

Assignees

No one assigned

    Labels

    enhancementNew feature or request

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions