|
| 1 | +--- |
| 2 | +title: Distinct Optimization |
| 3 | +summary: Introduce the `distinct` optimization in the TiDB query optimizer. |
| 4 | +--- |
| 5 | + |
| 6 | +# Distinct Optimization |
| 7 | + |
| 8 | +This document introduces the `distinct` optimization in the TiDB query optimizer, including `SELECT DISTINCT` and `DISTINCT` in the aggregate functions. |
| 9 | + |
| 10 | +## `DISTINCT` modifier in `SELECT` statements |
| 11 | + |
| 12 | +The `DISTINCT` modifier specifies removal of duplicate rows from the result set. `SELECT DISTINCT` is transformed to `GROUP BY`, for example: |
| 13 | + |
| 14 | +```sql |
| 15 | +mysql> explain SELECT DISTINCT a from t; |
| 16 | ++--------------------------+---------+-----------+---------------+-------------------------------------------------------+ |
| 17 | +| id | estRows | task | access object | operator info | |
| 18 | ++--------------------------+---------+-----------+---------------+-------------------------------------------------------+ |
| 19 | +| HashAgg_6 | 2.40 | root | | group by:test.t.a, funcs:firstrow(test.t.a)->test.t.a | |
| 20 | +| └─TableReader_11 | 3.00 | root | | data:TableFullScan_10 | |
| 21 | +| └─TableFullScan_10 | 3.00 | cop[tikv] | table:t | keep order:false, stats:pseudo | |
| 22 | ++--------------------------+---------+-----------+---------------+-------------------------------------------------------+ |
| 23 | +3 rows in set (0.00 sec) |
| 24 | +``` |
| 25 | + |
| 26 | +## `DISTINCT` option in aggregate functions |
| 27 | + |
| 28 | +Usually, aggregate functions with the `DISTINCT` option is executed in the TiDB layer in a single-threaded execution model. |
| 29 | + |
| 30 | +The [`tidb_opt_distinct_agg_push_down`](/system-variables.md#tidb_opt_distinct_agg_push_down) system variable or the [`distinct-agg-push-down`](/tidb-configuration-file.md#distinct-agg-push-down) configuration item in TiDB controls whether to rewrite the distinct aggregate queries and push them to the TiKV/TiFlash Coprocessor. |
| 31 | + |
| 32 | +Take the following queries as an example of this optimization. `tidb_opt_distinct_agg_push_down` is disabled by default, which means the aggregate functions are executed in the TiDB layer. After enabling this optimization by setting its value to `1`, the `distinct a` part of `count(distinct a)` is pushed to TiKV/TiFlash Coprocessor: there is a HashAgg_5 to remove the duplicated values on column a in the TiKV Coprocessor. It might reduce the computation overhead of `HashAgg_8` in the TiDB layer. |
| 33 | + |
| 34 | +```sql |
| 35 | +mysql> desc select count(distinct a) from test.t; |
| 36 | ++-------------------------+----------+-----------+---------------+------------------------------------------+ |
| 37 | +| id | estRows | task | access object | operator info | |
| 38 | ++-------------------------+----------+-----------+---------------+------------------------------------------+ |
| 39 | +| StreamAgg_6 | 1.00 | root | | funcs:count(distinct test.t.a)->Column#4 | |
| 40 | +| └─TableReader_10 | 10000.00 | root | | data:TableFullScan_9 | |
| 41 | +| └─TableFullScan_9 | 10000.00 | cop[tikv] | table:t | keep order:false, stats:pseudo | |
| 42 | ++-------------------------+----------+-----------+---------------+------------------------------------------+ |
| 43 | +3 rows in set (0.01 sec) |
| 44 | + |
| 45 | +mysql> set session tidb_opt_distinct_agg_push_down = 1; |
| 46 | +Query OK, 0 rows affected (0.00 sec) |
| 47 | + |
| 48 | +mysql> desc select count(distinct a) from test.t; |
| 49 | ++---------------------------+----------+-----------+---------------+------------------------------------------+ |
| 50 | +| id | estRows | task | access object | operator info | |
| 51 | ++---------------------------+----------+-----------+---------------+------------------------------------------+ |
| 52 | +| HashAgg_8 | 1.00 | root | | funcs:count(distinct test.t.a)->Column#3 | |
| 53 | +| └─TableReader_9 | 1.00 | root | | data:HashAgg_5 | |
| 54 | +| └─HashAgg_5 | 1.00 | cop[tikv] | | group by:test.t.a, | |
| 55 | +| └─TableFullScan_7 | 10000.00 | cop[tikv] | table:t | keep order:false, stats:pseudo | |
| 56 | ++---------------------------+----------+-----------+---------------+------------------------------------------+ |
| 57 | +4 rows in set (0.00 sec) |
| 58 | +``` |
0 commit comments