ARROW-11290: [Rust][DataFusion] Address hash aggregate performance issue with high number of groups
Currently, for every batch we loop over every key in the hashmap.
However, if we have a lot of groups in the group by expression (or receive sorted data, etc.), a given batch contains only a few of those keys, so we could create a lot of empty batches and call `update_batch` for each of the keys already in the hashmap.
In this PR we keep track of which keys we received in the batch and only update the accumulators for those keys instead of all accumulators.
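The idea can be sketched roughly as follows. This is a simplified illustration, not the actual DataFusion code: `SumAccumulator`, `GroupState`, `aggregate_batch`, and the use of string keys with `i64` values are all hypothetical stand-ins chosen to keep the example self-contained.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an accumulator: a running sum plus a count of
/// how many times `update_batch` was called on it.
#[derive(Debug, Default)]
struct SumAccumulator {
    sum: i64,
    updates: usize,
}

impl SumAccumulator {
    fn update_batch(&mut self, values: &[i64]) {
        self.sum += values.iter().sum::<i64>();
        self.updates += 1;
    }
}

/// Hypothetical per-group state: the accumulator and the row indices of the
/// current batch that belong to this group.
#[derive(Debug, Default)]
struct GroupState {
    accumulator: SumAccumulator,
    batch_indices: Vec<usize>,
}

/// Processes one batch and returns the number of groups that were updated.
/// Only groups whose key appears in this batch are visited in the update
/// step; groups that exist in the map but are absent from the batch are
/// never touched.
fn aggregate_batch(
    groups: &mut HashMap<String, GroupState>,
    keys: &[&str],
    values: &[i64],
) -> usize {
    assert_eq!(keys.len(), values.len());

    // Keys seen in this batch, in first-seen order.
    let mut batch_keys: Vec<String> = Vec::new();

    for (row, key) in keys.iter().enumerate() {
        let state = groups.entry((*key).to_string()).or_default();
        // An empty index list means this is the first row for the key
        // in the current batch.
        if state.batch_indices.is_empty() {
            batch_keys.push((*key).to_string());
        }
        state.batch_indices.push(row);
    }

    // Update only the accumulators for keys received in this batch,
    // instead of iterating over the whole map.
    for key in &batch_keys {
        let state = groups.get_mut(key).expect("key was inserted above");
        let rows: Vec<i64> = state.batch_indices.iter().map(|&i| values[i]).collect();
        state.accumulator.update_batch(&rows);
        // Reset for the next batch.
        state.batch_indices.clear();
    }

    batch_keys.len()
}
```

With this shape, the cost of the update step depends on the number of distinct keys in the batch rather than on the total number of groups accumulated so far, which is what matters when the number of groups is high.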
On the db-benchmark (h2oai/db-benchmark#182) this is the difference. The gain is mainly in q3 and q5, which each take roughly a third less time; the other differences seem to be noise. This doesn't seem to solve the problem completely, but it already reduces it quite a bit.
This PR:
```
q1 took 340 ms
q2 took 1768 ms
q3 took 10975 ms
q4 took 337 ms
q5 took 13529 ms
```
Master:
```
q1 took 330 ms
q2 took 1648 ms
q3 took 16408 ms
q4 took 335 ms
q5 took 21074 ms
```
Closes #9234 from Dandandan/hash_agg_speed2
Authored-by: Heres, Daniel <[email protected]>
Signed-off-by: Andrew Lamb <[email protected]>