DuckLake-written delete files are correct by values but wrong by metadata #1173

### What happens?

While working on my Trino and Doris DuckLake plugins, I ran into this and had an agent write it up with some (maybe annoying) details.

# DuckLake position-delete parquet files declare `file_path` / `pos` as OPTIONAL; Iceberg-compatible readers expect REQUIRED (as per spec)

**Affects**: DuckDB 1.5.2 (`8a5851971f`) + ducklake extension. Reproduced
via both `duckdb` (Python pip 1.5.2) and `org.duckdb:duckdb_jdbc:1.5.2.0`.

**Severity**: blocks downstream Iceberg-compatible engines from honouring
DuckLake position deletes through the standard delete-file dispatch.

## Summary

DuckLake's on-disk position-delete file format borrows the Iceberg
position-delete shape — same column names (`file_path`, `pos`), same
types (`string`, `long`), same row semantics (one row per deleted
position in the referenced data file). The Iceberg spec
([Position-delete files](https://iceberg.apache.org/spec/#position-delete-files))
requires both columns to be `required` (non-null). DuckLake writes them
as `OPTIONAL` instead, with definition-levels of 1 for every row
(every value is present, no actual nulls).

For any downstream engine that selects its column reader based on the
declared Iceberg schema, this looks like a Parquet/spec inconsistency,
and the engine raises before reading any data.
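
To make the definition-level remark concrete, here is a minimal sketch of how nulls are counted in a flat (non-nested) OPTIONAL Parquet column. It assumes the usual encoding, where the maximum definition level for such a column is 1 and a level below that marks a null. The function name `null_count` and the sample levels are illustrative, not taken from DuckLake or any reader.

```python
def null_count(definition_levels, max_definition_level=1):
    """Count nulls in a flat column from its decoded definition levels.

    For a flat OPTIONAL column the maximum definition level is 1:
    level 1 means the value is present, level 0 means it is null.
    """
    return sum(1 for level in definition_levels if level < max_definition_level)


# Illustrative levels for a delete file with four rows, all values present.
levels_all_present = [1, 1, 1, 1]
print(null_count(levels_all_present))  # 0

# For contrast: the same column with one null in the third row.
levels_with_null = [1, 1, 0, 1]
print(null_count(levels_with_null))  # 1
```

A REQUIRED flat column has a maximum definition level of 0 and stores no definition levels at all, which is why a reader on the not-nullable path has nothing to consult when the file declares OPTIONAL.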


## Why this matters

The Iceberg position-delete spec is the de-facto delete-file format for
the broader lakehouse ecosystem. Many engines (Trino, Doris, Spark with
the Iceberg connector, etc.) already implement an Iceberg-shaped delete
dispatch and reuse it for compatible formats. DuckLake's column shape
matches that spec, apparently by intent, which should let these engines
consume DuckLake delete files via their existing Iceberg reader path.

The nullability deviation breaks that compatibility with no warning at write time. A
consumer that takes the not-nullable fast path on an Iceberg schema
(which says `file_path` and `pos` are `required`) sees the parquet
schema disagree and raises. Concrete example from Doris:

```
[INTERNAL_ERROR]Read parquet file …-delete.parquet failed,
reason = [CORRUPTION]Not nullable column has null values in parquet file
```

(There are no actual null values; the error is about the OPTIONAL
declaration on a column the reader treats as REQUIRED.)
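
The failure mode can be sketched as a schema-driven dispatch. This is a hypothetical illustration, not Doris or Trino code: `pick_column_reader`, `SchemaMismatch`, the `strict` flag and the returned reader names are all invented here to show where a strict reader rejects the file and what a lenient reader might do instead.

```python
class SchemaMismatch(Exception):
    """Raised when the file schema disagrees with the expected schema."""


def pick_column_reader(expected_required, file_repetition, strict=True):
    """Choose a reader for one column (hypothetical dispatch logic).

    expected_required: True when the table schema declares the column required.
    file_repetition:   the Parquet repetition type, "REQUIRED" or "OPTIONAL".
    """
    if not expected_required:
        return "nullable"
    if file_repetition == "REQUIRED":
        return "non_nullable"
    if strict:
        # A strict engine rejects the file here, before decoding any values.
        raise SchemaMismatch(
            f"expected REQUIRED column, file declares {file_repetition}"
        )
    # Lenient alternative: decode through the nullable path and verify
    # afterwards that no nulls were actually present.
    return "nullable_then_check_no_nulls"


print(pick_column_reader(True, "REQUIRED"))                # non_nullable
print(pick_column_reader(True, "OPTIONAL", strict=False))  # nullable_then_check_no_nulls
```

With `strict=True`, the `file_path` and `pos` columns of a DuckLake delete file would take the raising branch even though every value is present.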

## Suggested fix

In the ducklake extension's position-delete parquet writer, mark the
`file_path` and `pos` columns as `REQUIRED` (not nullable) rather than
`OPTIONAL`. The columns are never null by construction — every row in
a position-delete file has a valid path and position — so the change
has no impact on what gets written, only on what the schema declares.

Expected scope: one-line change at the parquet writer that emits these
columns.

## Notes for context

- DuckLake's data-file parquet uses field-ids that align with what
  Iceberg-shape readers expect (sanity-check between Doris and DuckLake
  confirms data files are read transparently via the Iceberg reader
  with no changes). The delete-file nullability is the only schema-level
  deviation we've hit.
- Workaround on the consumer side requires per-query rewriting of the
  delete parquet to flip the repetition_type, which is cost-prohibitive
  for any non-trivial workload.
- A complementary fix on the Doris side is being tracked in parallel,
  but the DuckLake-side fix is the most spec-aligned resolution and
  helps any downstream engine that ever takes the not-nullable fast path.



### To Reproduce

```sql
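-- Assumes a DuckLake catalog is already attached as `lake` and that the
-- `demo` schema exists.
-- The 1.5.2 default would inline this DELETE; force the file-based path
-- so we get an on-disk position-delete parquet.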
CALL lake.set_option('data_inlining_row_limit', '0');

CREATE TABLE lake.demo.orders (id INTEGER);
INSERT INTO lake.demo.orders SELECT range FROM range(100);
DELETE FROM lake.demo.orders WHERE id = 42;

-- The ducklake_delete_file row points at a parquet file under DATA_PATH.
SELECT path FROM ducklake_delete_file
WHERE end_snapshot IS NULL;
-- → s3://.../demo/orders/ducklake-019e...-delete.parquet
```

Inspect the file's parquet schema:

```sql
SELECT name, type, repetition_type, converted_type
FROM parquet_schema('s3://.../demo/orders/ducklake-019e...-delete.parquet');
```

Output:

```
('duckdb_schema', NULL,         'REQUIRED', NULL)
('file_path',     'BYTE_ARRAY', 'OPTIONAL', 'UTF8')      // WRONG: should be REQUIRED
('pos',           'INT64',      'OPTIONAL', 'INT_64')    // WRONG: should be REQUIRED
```

The actual row data has no nulls (verified by reading the file: every
row has a valid `(file_path, pos)` pair). The discrepancy is purely in
the parquet schema metadata.
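
As a possible regression check, the `parquet_schema` rows can be scanned for delete-file columns that are not declared REQUIRED. The sketch below is an assumption-laden helper, not part of DuckLake: the name `nullable_delete_columns` is invented here, and the input is the output shown above, copied in as plain tuples of `(name, type, repetition_type, converted_type)`.

```python
# Columns the Iceberg position-delete spec declares as required.
REQUIRED_DELETE_COLUMNS = ("file_path", "pos")


def nullable_delete_columns(schema_rows):
    """Return the delete-file columns whose repetition type is not REQUIRED.

    schema_rows: iterable of (name, type, repetition_type, converted_type)
    tuples, in the shape returned by the parquet_schema query above.
    """
    repetition_by_name = {
        name: repetition for name, _type, repetition, _converted in schema_rows
    }
    return [
        column
        for column in REQUIRED_DELETE_COLUMNS
        if repetition_by_name.get(column) != "REQUIRED"
    ]


# The rows reported in this issue.
rows = [
    ("duckdb_schema", None, "REQUIRED", None),
    ("file_path", "BYTE_ARRAY", "OPTIONAL", "UTF8"),
    ("pos", "INT64", "OPTIONAL", "INT_64"),
]
print(nullable_delete_columns(rows))  # ['file_path', 'pos']
```

After a writer-side fix, the same call on a freshly written delete file should return an empty list. A column missing from the schema is also reported, since it cannot satisfy the required declaration.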


### OS:

all

### DuckDB Version:

1.5.2.0

### DuckLake Version:

latest

### DuckDB Client:

Python, Java

### Hardware:

_No response_

### Full Name:

Jason Minard

### Affiliation:

locals.com

### What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

### Did you include all relevant data sets for reproducing the issue?

Yes

### Did you include all code required to reproduce the issue?

- [x] Yes, I have

### Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

- [x] Yes, I have
