iceberg_scan() is slower than read_parquet() if multithreaded #419

@alippai

Description

What happens?

Reading the same single file, iceberg_scan() is slower than read_parquet().

The file is about 4 GB with ~35 row groups.
A simple select * from iceberg_scan():

Run Time (s): real 8.250 user 4.033364 sys 4.176462

The same parquet file:

Run Time (s): real 0.918 user 6.176052 sys 12.617458

Only 40 rows are shown as usual; the CLI output is identical. Maybe it's as simple as not skipping row groups, i.e. not pushing down the limit+offset when displaying the sample.
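If the limit really isn't reaching the scan, it should be visible in the query plan. A minimal way to check (paths are placeholders, not the actual data from this report):

```sql
-- Does the LIMIT get pushed into the iceberg scan?
EXPLAIN ANALYZE
SELECT * FROM iceberg_scan('path/to/iceberg_table') LIMIT 40;

-- Compare against the plan for the raw parquet file:
EXPLAIN ANALYZE
SELECT * FROM read_parquet('path/to/file.parquet') LIMIT 40;
```

If the parquet plan shows the limit applied at the scan while the iceberg plan reads all row groups first, that would support the hypothesis above.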

To Reproduce

select * from iceberg_scan('an iceberg table with the same single parquet')
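A fuller repro sketch in the DuckDB CLI, assuming the Iceberg table's single data file lives under its `data/` directory (hypothetical paths; substitute the real locations):

```sql
-- Enable the iceberg extension and per-query wall-clock timing.
INSTALL iceberg;
LOAD iceberg;
.timer on

-- Scan via the Iceberg metadata:
SELECT * FROM iceberg_scan('path/to/iceberg_table');

-- Scan the same parquet file directly:
SELECT * FROM read_parquet('path/to/iceberg_table/data/*.parquet');
```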

OS:

linux

DuckDB Version:

1.3.2

DuckDB Client:

CLI

Hardware:

No response

Full Name:

Adam Lippai

Affiliation:

N/A

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a stable release

Did you include all relevant data sets for reproducing the issue?

Yes

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
