Skip to content

Inconsistent reads from iceberg table #400

@ibenian

Description

@ibenian

What happens?

When reading iceberg table using direct metadata reference, we get inconsistent query results.

-- Shows value exists in distinct list
SELECT DISTINCT dv FROM iceberg_scan('s3://.../00292-....metadata.json') WHERE runid = '11b...'
-- Result: Includes 'e70...'

-- But direct filter returns empty
SELECT COUNT(*) FROM iceberg_scan('s3://.../00292-....metadata.json') WHERE dv = 'e70...'
-- Result: 0 rows

-- Even more specific combination fails
SELECT COUNT(*) FROM iceberg_scan('s3://.../00292-....metadata.json') WHERE runid = '11b...' AND dv = 'e70...'
-- Result: 0 rows
  • Physical files exist in S3*
aws s3 ls "s3://.../"

# Result: 3 parquet files from multiple updates
# 2025-07-31 12:14:32  7954805  00000-0-a61....parquet
# 2025-07-31 12:30:30  7949663  00000-45-27a....parquet  
# 2025-07-31 21:29:26  7957154  00000-0-9da....parquet

Both dv and runid are partition keys, so I suspect that it is able to discover partitions regardless of the metadata state. But somehow it is not able find the latest written parquet snapshot. When I checked from Athena which also points to this table, it was able to generate consistent outputs. This indicates that the metadata had enough information for Athena to read and use the latest data, but the DuckDB iceberg extension did not interpret it correctly.

Rewriting data on the same partition fixed the problem. So we now have a 4th parquet file, and duckdb can read it consistently.

I've also tried FORCE INSTALL iceberg FROM core_nightly; and reproduced the issue.

To Reproduce

This happened after a large data backfill. So I am not sure what caused it. I suspect that one of the processes that wrote to the table may have been killed before write completed.

OS:

MacOS Sequoia 15.5

DuckDB Version:

v1.3.1 (Ossivalis)

DuckDB Client:

Both python and cli

Hardware:

Both AMD based EC2 and Apple M3 Pro

Full Name:

Ilkay Benian

Affiliation:

Foursquare Labs

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have not tested with any build

Did you include all relevant data sets for reproducing the issue?

No - Other reason (please specify in the issue body)

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions