
Ducklake written delete files are correct by values but wrong by metadata #1173

@apatrida


What happens?

While working on my Trino and Doris ducklake plugins, I ran into this, and had an agent write it up with some (maybe annoying) details.

DuckLake position-delete parquet files declare file_path / pos as OPTIONAL; Iceberg-compatible readers expect REQUIRED (as per spec)

Affects: DuckDB 1.5.2 (8a5851971f) + ducklake extension. Reproduced
via both duckdb (Python pip 1.5.2) and org.duckdb:duckdb_jdbc:1.5.2.0.

Severity: blocks downstream Iceberg-compatible engines from honouring
DuckLake position deletes through the standard delete-file dispatch.

Summary

DuckLake's on-disk position-delete file format borrows the Iceberg
position-delete shape: same column names (file_path, pos), same
types (string, long), same row semantics (one row per deleted
position in the referenced data file). The Iceberg spec
(section "Position Delete Files") requires both columns to be
required (non-null). DuckLake writes them as OPTIONAL instead, with a
definition level of 1 for every row (every value is present, no
actual nulls).

Any downstream engine that selects its column reader based on the
declared Iceberg schema sees this as a parquet/spec inconsistency
and raises before reading any data.

Why this matters

The Iceberg position-delete spec is the de-facto delete-file format for
the broader lakehouse ecosystem. Many engines (Trino, Doris, Spark with
the Iceberg connector, etc.) already implement an Iceberg-shaped delete
dispatch and reuse it for compatible formats. DuckLake's column shape
matches that spec by intent, which should let these engines consume
DuckLake delete files via their existing Iceberg reader path.

The nullability deviation breaks that compatibility even though the
data itself is valid. A consumer that takes the not-nullable fast path
on an Iceberg schema (which says file_path and pos are required) sees
the parquet schema disagree and raises. Concrete example from Doris:

[INTERNAL_ERROR]Read parquet file …-delete.parquet failed,
reason = [CORRUPTION]Not nullable column has null values in parquet file

(There are no actual null values; the error is about the OPTIONAL
declaration on a column the reader treats as REQUIRED.)
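
To make the failure mode concrete, here is a minimal sketch of a strict reader that dispatches on the declared repetition type alone. It is not Doris's actual code: read_as_required and CorruptionError are hypothetical names used only for illustration, and the position value 42 is taken from the repro in this issue.

```python
class CorruptionError(Exception):
    """Hypothetical stand-in for a reader's corruption error."""


def read_as_required(column, declared_repetition, values):
    """Hypothetical strict reader for a column the table schema marks required.

    It takes the not-nullable path and rejects any file whose parquet schema
    declares the column as anything other than REQUIRED, before looking at
    a single value.
    """
    if declared_repetition != "REQUIRED":
        raise CorruptionError(
            f"column {column!r} is required by the table schema "
            f"but declared {declared_repetition} in the parquet file"
        )
    return list(values)


positions = [42]  # one deleted position, no nulls anywhere

# The file as written today: valid values, OPTIONAL declaration.
try:
    read_as_required("pos", "OPTIONAL", positions)
except CorruptionError as exc:
    print("rejected:", exc)

# The same values with a REQUIRED declaration are accepted.
print(read_as_required("pos", "REQUIRED", positions))
```

The values are identical in both calls; only the declared repetition type decides whether the read succeeds.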

Suggested fix

In the ducklake extension's position-delete parquet writer, mark the
file_path and pos columns as REQUIRED (not nullable) rather than
OPTIONAL. The columns are never null by construction (every row in
a position-delete file has a valid path and position), so the change
does not alter any stored value, only what the schema declares (the
all-ones definition levels become unnecessary).

Expected scope: one-line change at the parquet writer that emits these
columns.
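
As a rough illustration of why the values are unaffected, the sketch below models how Parquet encodes a flat (non-nested) column under each repetition type. encode_flat_column is a hypothetical helper written for this issue, not part of DuckDB or any parquet library; it only mirrors the rule that a top-level REQUIRED column has a maximum definition level of 0 and carries no definition levels.

```python
def encode_flat_column(values, repetition):
    """Model the encoding of a flat parquet column.

    Returns (max_definition_level, definition_levels, stored_values).
    """
    if repetition == "REQUIRED":
        if any(v is None for v in values):
            raise ValueError("a REQUIRED column cannot contain nulls")
        # Max definition level is 0, so no levels are stored at all.
        return 0, [], list(values)
    if repetition == "OPTIONAL":
        # Level 1 means "value present", level 0 means "null".
        levels = [0 if v is None else 1 for v in values]
        return 1, levels, [v for v in values if v is not None]
    raise ValueError(f"unsupported repetition type: {repetition}")


positions = [42, 57]  # illustrative deleted positions, never null

print(encode_flat_column(positions, "OPTIONAL"))
print(encode_flat_column(positions, "REQUIRED"))
```

With no nulls in the input, the stored values are the same in both encodings. The OPTIONAL form additionally carries a definition level of 1 per row, which is the redundant information a REQUIRED declaration removes.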

Notes for context

  • DuckLake's data-file parquet uses field-ids that align with what
    Iceberg-shape readers expect (sanity-check between Doris and DuckLake
    confirms data files are read transparently via the Iceberg reader
    with no changes). The delete-file nullability is the only schema-level
    deviation we've hit.
  • Workaround on the consumer side requires per-query rewriting of the
    delete parquet to flip the repetition_type, which is cost-prohibitive
    for any non-trivial workload.
  • A complementary fix on the Doris side is being tracked in parallel
    but the DuckLake-side fix is the most spec-aligned
    resolution and helps any downstream engine that ever takes the
    not-nullable fast path.
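
For context on what a consumer-side accommodation could look like without rewriting files, here is one possible shape, stated purely as an assumption and not as the fix being tracked for Doris. It relies on the optional null_count statistic that parquet column chunks may carry; safe_to_treat_as_required is a hypothetical helper, and the per-row-group counts are assumed to have been extracted from the file footer by the engine.

```python
def safe_to_treat_as_required(row_group_null_counts):
    """Decide whether an OPTIONAL column can be read on the not-nullable path.

    row_group_null_counts: one entry per row group, either an int taken from
    the column chunk statistics or None when the statistic is absent.
    A missing statistic is treated as unknown, so the answer is False.
    A file with no row groups has no values and therefore no nulls.
    """
    return all(
        count is not None and count == 0 for count in row_group_null_counts
    )


print(safe_to_treat_as_required([0, 0]))     # every row group reports no nulls
print(safe_to_treat_as_required([0, None]))  # one statistic missing
print(safe_to_treat_as_required([0, 3]))     # real nulls present
```

A check like this only reads footer metadata, but every engine would have to add it separately, which is why the writer-side fix remains the simpler resolution.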

To Reproduce

-- 1.5.2 default would inline this DELETE; force the file-based path
-- so we get an on-disk position-delete parquet.
CALL lake.set_option('data_inlining_row_limit', '0');

CREATE TABLE lake.demo.orders (id INTEGER);
INSERT INTO lake.demo.orders SELECT range FROM range(100);
DELETE FROM lake.demo.orders WHERE id = 42;

-- The ducklake_delete_file row points at a parquet file under DATA_PATH.
SELECT path FROM ducklake_delete_file
WHERE end_snapshot IS NULL;
-- → s3://.../demo/orders/ducklake-019e...-delete.parquet

Inspect the file's parquet schema:

SELECT name, type, repetition_type, converted_type
FROM parquet_schema('s3://.../demo/orders/ducklake-019e...-delete.parquet');

Output:

('duckdb_schema', NULL,         'REQUIRED', NULL)
('file_path',     'BYTE_ARRAY', 'OPTIONAL', 'UTF8')      // WRONG
('pos',           'INT64',      'OPTIONAL', 'INT_64')    // WRONG

The actual row data has no nulls (verified by reading the file: every
row has a valid (file_path, pos) pair). The discrepancy is purely in
the parquet schema metadata.
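
The inspection above can be scripted. The sketch below takes rows in the same column order as the parquet_schema query (name, type, repetition_type, converted_type) and reports deviations from the position-delete shape described in this issue. find_schema_deviations and EXPECTED are hypothetical names; the observed rows are copied from the output above.

```python
# Expected shape per this issue: both columns REQUIRED.
EXPECTED = {
    "file_path": ("BYTE_ARRAY", "REQUIRED"),
    "pos": ("INT64", "REQUIRED"),
}


def find_schema_deviations(schema_rows):
    """Return (column, field, expected, actual) tuples for each mismatch."""
    seen = {name: (ptype, rep) for name, ptype, rep, _ in schema_rows}
    deviations = []
    for name, (exp_type, exp_rep) in EXPECTED.items():
        if name not in seen:
            deviations.append((name, "presence", "present", "missing"))
            continue
        act_type, act_rep = seen[name]
        if act_type != exp_type:
            deviations.append((name, "type", exp_type, act_type))
        if act_rep != exp_rep:
            deviations.append((name, "repetition_type", exp_rep, act_rep))
    return deviations


# Rows copied from the parquet_schema output in this issue.
observed = [
    ("duckdb_schema", None, "REQUIRED", None),
    ("file_path", "BYTE_ARRAY", "OPTIONAL", "UTF8"),
    ("pos", "INT64", "OPTIONAL", "INT_64"),
]

for deviation in find_schema_deviations(observed):
    print(deviation)
```

Run against the observed rows, this flags repetition_type for both file_path and pos and nothing else, matching the claim that the physical types are already correct.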

OS:

all

DuckDB Version:

1.5.2.0

DuckLake Version:

latest

DuckDB Client:

Python, Java

Hardware:

No response

Full Name:

Jason Minard

Affiliation:

locals.com

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
