What happens?
While working on my Trino and Doris ducklake plugins, I ran into this, and had an agent write it up with some (maybe annoying) details.
DuckLake position-delete parquet files declare file_path / pos as OPTIONAL; Iceberg-compatible readers expect REQUIRED (as per spec)
Affects: DuckDB 1.5.2 (8a5851971f) + ducklake extension. Reproduced
via both duckdb (Python pip 1.5.2) and org.duckdb:duckdb_jdbc:1.5.2.0.
Severity: blocks downstream Iceberg-compatible engines from honouring
DuckLake position deletes through the standard delete-file dispatch.
Summary
DuckLake's on-disk position-delete file format borrows the Iceberg
position-delete shape — same column names (file_path, pos), same
types (string, long), same row semantics (one row per deleted
position in the referenced data file). The Iceberg spec
(Position-delete files)
requires both columns to be required (non-null). DuckLake writes them
as OPTIONAL instead, with definition-levels of 1 for every row
(every value is present, no actual nulls).
For any downstream engine that selects its column reader based on the
declared Iceberg schema, this looks like a parquet/spec inconsistency
and raises before reading any data.
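The mismatch can be illustrated with a small stdlib-only Python model. This is a sketch, not DuckLake or Doris code: `ParquetColumn`, `materialize`, and `strict_schema_check` are hypothetical names invented for illustration. It relies only on the Parquet rule that a flat OPTIONAL column has a maximum definition level of 1 (1 = value present, 0 = null), while a REQUIRED column stores no definition levels at all.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class ParquetColumn:
    """Hypothetical stand-in for one flat (non-nested) Parquet column chunk."""
    name: str
    repetition: str                          # "REQUIRED" or "OPTIONAL"
    values: List[object]                     # only non-null values are stored
    def_levels: Optional[List[int]] = None   # present only for OPTIONAL columns


def materialize(col: ParquetColumn) -> List[object]:
    """Expand stored values into rows, inserting None where a value is absent."""
    if col.repetition == "REQUIRED":
        # Max definition level is 0: no levels are stored, every row has a value.
        return list(col.values)
    # Flat OPTIONAL column: definition level 1 means present, 0 means null.
    stored = iter(col.values)
    return [next(stored) if level == 1 else None for level in col.def_levels]


def strict_schema_check(col: ParquetColumn, expected_repetition: str) -> None:
    """Model of a reader that compares declarations only, before touching data."""
    if col.repetition != expected_repetition:
        raise ValueError(
            f"column {col.name!r}: expected {expected_repetition}, "
            f"file declares {col.repetition}"
        )


# What DuckLake writes today versus what the Iceberg spec asks for.
as_written = ParquetColumn("pos", "OPTIONAL", [41, 42], def_levels=[1, 1])
as_specified = ParquetColumn("pos", "REQUIRED", [41, 42])

# The rows are identical either way; only the declaration differs.
print(materialize(as_written) == materialize(as_specified))

# A declaration-only check still rejects the file without reading a row.
try:
    strict_schema_check(as_written, "REQUIRED")
except ValueError as exc:
    print(exc)
```

The point of the model is that the two encodings carry the same rows, so a reader that fails here is failing on metadata alone.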
Why this matters
The Iceberg position-delete spec is the de-facto delete-file format for
the broader lakehouse ecosystem. Many engines (Trino, Doris, Spark with
the Iceberg connector, etc.) already implement an Iceberg-shaped delete
dispatch and reuse it for compatible formats. DuckLake's column shape
appears to match that spec by intent, which would let these engines
consume DuckLake delete files via their existing Iceberg reader path.
The nullability deviation breaks that compatibility silently. A
consumer that takes the not-nullable fast path on an Iceberg schema
(which says file_path and pos are required) sees the parquet
schema disagree and raises. Concrete example from Doris:
[INTERNAL_ERROR]Read parquet file …-delete.parquet failed,
reason = [CORRUPTION]Not nullable column has null values in parquet file
(There are no actual null values; the error is about the OPTIONAL
declaration on a column the reader treats as REQUIRED.)
Suggested fix
In the ducklake extension's position-delete parquet writer, mark the
file_path and pos columns as REQUIRED (not nullable) rather than
OPTIONAL. The columns are never null by construction — every row in
a position-delete file has a valid path and position — so the change
has no impact on what gets written, only on what the schema declares.
Expected scope: likely a one-line change in the parquet writer that
emits these columns.
Notes for context
- DuckLake's data-file parquet uses field-ids that align with what
Iceberg-shape readers expect (sanity-check between Doris and DuckLake
confirms data files are read transparently via the Iceberg reader
with no changes). The delete-file nullability is the only schema-level
deviation we've hit.
- Workaround on the consumer side requires per-query rewriting of the
delete parquet to flip the repetition_type, which is cost-prohibitive
for any non-trivial workload.
- A complementary fix on the Doris side is being tracked in parallel,
but the DuckLake-side fix is the most spec-aligned resolution and
helps any downstream engine that takes the not-nullable fast path.
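For comparison, one possible shape of a consumer-side tolerance is sketched below in Python. This is an assumption for illustration, not the change being tracked in Doris: the function name `can_use_not_nullable_path` is hypothetical, and it presumes the reader can see the column chunk's `null_count` statistic, which Parquet allows writers to record but does not require.

```python
from typing import Optional


def can_use_not_nullable_path(repetition: str, null_count: Optional[int]) -> bool:
    """Decide whether a column may be read through a non-nullable code path.

    `repetition` is the declared Parquet repetition type. `null_count` is the
    column-chunk statistic, or None when the writer did not record it.
    """
    if repetition == "REQUIRED":
        return True
    # OPTIONAL is acceptable only when the file itself proves there are no nulls.
    return repetition == "OPTIONAL" and null_count == 0


# Declared OPTIONAL, statistics say zero nulls: safe to treat as non-nullable.
print(can_use_not_nullable_path("OPTIONAL", 0))
# Declared OPTIONAL, no statistics: the reader cannot prove anything.
print(can_use_not_nullable_path("OPTIONAL", None))
```

A check like this avoids rewriting the file, but every engine would have to add it separately, which is why the writer-side fix is preferable.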
To Reproduce
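-- Assumes a DuckLake catalog is already attached as "lake" and that
-- the schema "demo" exists.
-- By default, 1.5.2 would inline this DELETE; force the file-based path
-- so we get an on-disk position-delete parquet.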
CALL lake.set_option('data_inlining_row_limit', '0');
CREATE TABLE lake.demo.orders (id INTEGER);
INSERT INTO lake.demo.orders SELECT range FROM range(100);
DELETE FROM lake.demo.orders WHERE id = 42;
-- The ducklake_delete_file row points at a parquet file under DATA_PATH.
SELECT path FROM ducklake_delete_file
WHERE end_snapshot IS NULL;
-- → s3://.../demo/orders/ducklake-019e...-delete.parquet
Inspect the file's parquet schema:
SELECT name, type, repetition_type, converted_type
FROM parquet_schema('s3://.../demo/orders/ducklake-019e...-delete.parquet');
Output:
('duckdb_schema', NULL, 'REQUIRED', NULL)
('file_path', 'BYTE_ARRAY', 'OPTIONAL', 'UTF8') // WRONG: should be REQUIRED
('pos', 'INT64', 'OPTIONAL', 'INT_64') // WRONG: should be REQUIRED
The actual row data has no nulls (verified by reading the file: every
row has a valid (file_path, pos) pair). The discrepancy is purely in
the parquet schema metadata.
OS:
all
DuckDB Version:
1.5.2.0
DuckLake Version:
latest
DuckDB Client:
Python, Java
Hardware:
No response
Full Name:
Jason Minard
Affiliation:
locals.com
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have tested with a stable release
Did you include all relevant data sets for reproducing the issue?
Yes
Did you include all code required to reproduce the issue?
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?