-
Notifications
You must be signed in to change notification settings - Fork 985
Description
Environment
python version
Python 3.12.8
sqlglot commit
cb9338ad8ed35eca80d57509c1de049b0110a9d8
Problem Description
The SQL optimizer incorrectly alters the logic of a query that uses CTEs and nested JOIN
operations.
The original query is designed to start with a core dataset (an INNER JOIN
of t0
and t1
), and then enrich it by LEFT JOIN
-ing two other data sources: t3
and a pre-joined result of t4
and t5
. The key intent is that records from the core dataset should be preserved even if they don't have a corresponding match in t3
or in the (t4 INNER JOIN t5)
subquery.
However, the optimized query flattens the JOIN
structure and incorrectly changes the final join to t5
into an INNER JOIN
(written as JOIN
). This causes any record where t4.id
is not found (resulting in a NULL
from the LEFT JOIN t4
) or where t4.country_code
does not exist in t5
to be filtered out from the final result set. This behavior contradicts the logic of the original query and leads to unintended data loss.
Fully reproducible code snippet
import sqlglot
import sqlglot.optimizer as sgopt
dialect = "spark"
sql = '''
WITH `res` AS (
SELECT
`t2`.`id`,
CAST(IF(NOT `t3`.`id` IS NULL, 1, 0) AS BIGINT) AS `id_active`,
`t6`.`country_id` AS `cid`,
`t2`.`uip`,
FROM (
SELECT
`t0`.`id`,
`t0`.`uip`,
FROM `t0`
INNER JOIN `t1`
ON `t0`.`id` = `t1`.`id`
) AS `t2`
LEFT OUTER JOIN `t3`
ON `t2`.`id` = `t3`.`id`
LEFT OUTER JOIN (
SELECT
`t4`.`id`,
`t5`.`country_code`,
`t5`.`country_id`
FROM `t4`
INNER JOIN `t5`
ON `t4`.`country_code` = `t5`.`country_code`
) AS `t6`
ON `t2`.`id` = `t6`.`id`
)
INSERT OVERWRITE TABLE `res_table`
SELECT
*
FROM `res`
'''
print(sgopt.optimize(sqlglot.parse_one(sql, dialect=dialect)).sql(dialect=dialect, pretty=True))
result is
INSERT OVERWRITE TABLE `res_table`
SELECT
`t0`.`id` AS `id`,
CAST(IF(NOT `t3`.`id` IS NULL, 1, 0) AS BIGINT) AS `id_active`,
`t5`.`country_id` AS `cid`,
`t0`.`uip` AS `uip`
FROM `t0` AS `t0`
JOIN `t1` AS `t1`
ON `t0`.`id` = `t1`.`id`
LEFT JOIN `t3` AS `t3`
ON `t0`.`id` = `t3`.`id`
LEFT JOIN `t4` AS `t4`
ON `t0`.`id` = `t4`.`id`
JOIN `t5` AS `t5`
ON `t4`.`country_code` = `t5`.`country_code`
however, in the optimized sql code. the third join condition should be left join
not inner join
. in the original SQL code, the condition should be (t0
inner join t1
) left join (t3
) left join (t4
inner join t5
). but the optimized code remove the brackets.
In the final result, the optimized code will remove all NULL
country_code
from the table, which is not expected.
here is the expected sql
SELECT
t0.id AS id,
CAST(IF(t3.id IS NOT NULL, 1, 0) AS BIGINT) AS id_active,
t5.country_id AS cid,
t0.uip AS uip
FROM t0
JOIN t1 ON t0.id = t1.id
LEFT JOIN t3 ON t0.id = t3.id
LEFT JOIN t4 ON t0.id = t4.id
-- CRITICAL FIX: This must be a LEFT JOIN to preserve records that don't match in t5.
LEFT JOIN t5 ON t4.country_code = t5.country_code
ORDER BY id;
Official Documentation
N/A