Skip to content

SQL optimizer convert join condition incorrectly #5969

@karta0807913

Description

@karta0807913

Environment
python version

Python 3.12.8

sqlglot commit

cb9338ad8ed35eca80d57509c1de049b0110a9d8

Problem Description
The SQL optimizer incorrectly alters the logic of a query that uses CTEs and nested JOIN operations.
The original query is designed to start with a core dataset (an INNER JOIN of t0 and t1), and then enrich it by LEFT JOIN-ing two other data sources: t3 and a pre-joined result of t4 and t5. The key intent is that records from the core dataset should be preserved even if they don't have a corresponding match in t3 or in the (t4 INNER JOIN t5) subquery.
However, the optimized query flattens the JOIN structure and incorrectly changes the final join to t5 into an INNER JOIN (written as JOIN). This causes any record where t4.id is not found (resulting in a NULL from the LEFT JOIN t4) or where t4.country_code does not exist in t5 to be filtered out from the final result set. This behavior contradicts the logic of the original query and leads to unintended data loss.

Fully reproducible code snippet

import sqlglot
import sqlglot.optimizer as sgopt

dialect = "spark"

sql = '''
WITH `res` AS (
  SELECT
    `t2`.`id`,
    CAST(IF(NOT `t3`.`id` IS NULL, 1, 0) AS BIGINT) AS `id_active`,
    `t6`.`country_id` AS `cid`,
    `t2`.`uip`,
  FROM (
    SELECT
      `t0`.`id`,
      `t0`.`uip`,
    FROM `t0`
    INNER JOIN `t1`
      ON `t0`.`id` = `t1`.`id`
  ) AS `t2`
  LEFT OUTER JOIN `t3`
    ON `t2`.`id` = `t3`.`id`
  LEFT OUTER JOIN (
    SELECT
      `t4`.`id`,
      `t5`.`country_code`,
      `t5`.`country_id`
    FROM `t4`
    INNER JOIN `t5`
      ON `t4`.`country_code` = `t5`.`country_code`
  ) AS `t6`
    ON `t2`.`id` = `t6`.`id`
)
INSERT OVERWRITE TABLE `res_table`
SELECT
  *
FROM `res`
'''

print(sgopt.optimize(sqlglot.parse_one(sql, dialect=dialect)).sql(dialect=dialect, pretty=True))

result is

INSERT OVERWRITE TABLE `res_table`
SELECT
  `t0`.`id` AS `id`,
  CAST(IF(NOT `t3`.`id` IS NULL, 1, 0) AS BIGINT) AS `id_active`,
  `t5`.`country_id` AS `cid`,
  `t0`.`uip` AS `uip`
FROM `t0` AS `t0`
JOIN `t1` AS `t1`
  ON `t0`.`id` = `t1`.`id`
LEFT JOIN `t3` AS `t3`
  ON `t0`.`id` = `t3`.`id`
LEFT JOIN `t4` AS `t4`
  ON `t0`.`id` = `t4`.`id`
JOIN `t5` AS `t5`
  ON `t4`.`country_code` = `t5`.`country_code`

however, in the optimized sql code. the third join condition should be left join not inner join. in the original SQL code, the condition should be (t0 inner join t1) left join (t3) left join (t4 inner join t5). but the optimized code remove the brackets.

In the final result, the optimized code will remove all NULL country_code from the table, which is not expected.

here is the expected sql

SELECT
  t0.id AS id,
  CAST(IF(t3.id IS NOT NULL, 1, 0) AS BIGINT) AS id_active,
  t5.country_id AS cid,
  t0.uip AS uip
FROM t0
JOIN t1 ON t0.id = t1.id
LEFT JOIN t3 ON t0.id = t3.id
LEFT JOIN t4 ON t0.id = t4.id
-- CRITICAL FIX: This must be a LEFT JOIN to preserve records that don't match in t5.
LEFT JOIN t5 ON t4.country_code = t5.country_code
ORDER BY id;

Official Documentation
N/A

Metadata

Metadata

Assignees

Labels

No labels
No labels

Projects

No projects

Milestone

No milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions