-
Notifications
You must be signed in to change notification settings - Fork 68
Description
What happens?
When reading iceberg table using direct metadata reference, we get inconsistent query results.
-- Shows value exists in distinct list
SELECT DISTINCT dv FROM iceberg_scan('s3://.../00292-....metadata.json') WHERE runid = '11b...'
-- Result: Includes 'e70...'
-- But direct filter returns empty
SELECT COUNT(*) FROM iceberg_scan('s3://.../00292-....metadata.json') WHERE dv = 'e70...'
-- Result: 0 rows
-- Even more specific combination fails
SELECT COUNT(*) FROM iceberg_scan('s3://.../00292-....metadata.json') WHERE runid = '11b...' AND dv = 'e70...'
-- Result: 0 rows
- Physical files exist in S3*
aws s3 ls "s3://.../"
# Result: 3 parquet files from multiple updates
# 2025-07-31 12:14:32 7954805 00000-0-a61....parquet
# 2025-07-31 12:30:30 7949663 00000-45-27a....parquet
# 2025-07-31 21:29:26 7957154 00000-0-9da....parquet
Both dv and runid are partition keys, so I suspect that it is able to discover partitions regardless of the metadata state. But somehow it is not able find the latest written parquet snapshot. When I checked from Athena which also points to this table, it was able to generate consistent outputs. This indicates that the metadata had enough information for Athena to read and use the latest data, but the DuckDB iceberg extension did not interpret it correctly.
Rewriting data on the same partition fixed the problem. So we now have a 4th parquet file, and duckdb can read it consistently.
I've also tried FORCE INSTALL iceberg FROM core_nightly;
and reproduced the issue.
To Reproduce
This happened after a large data backfill. So I am not sure what caused it. I suspect that one of the processes that wrote to the table may have been killed before write completed.
OS:
MacOS Sequoia 15.5
DuckDB Version:
v1.3.1 (Ossivalis)
DuckDB Client:
Both python and cli
Hardware:
Both AMD based EC2 and Apple M3 Pro
Full Name:
Ilkay Benian
Affiliation:
Foursquare Labs
What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.
I have not tested with any build
Did you include all relevant data sets for reproducing the issue?
No - Other reason (please specify in the issue body)
Did you include all code required to reproduce the issue?
- Yes, I have
Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?
- Yes, I have