Quack metadata manager remote-executes parquet_full_metadata during ducklake_add_data_files #1162

@tobilg

What happens?

Summary

When using DuckLake with a Quack-backed metadata catalog, ducklake_add_data_files(...) appears to send the parquet_full_metadata(...) inspection query to the remote Quack catalog backend.

This differs from the Postgres metadata catalog path, where the Parquet inspection query is executed locally in the DuckDB process, and only metadata writes are sent to Postgres.

This means a remote Quack metadata service must also have access to the data files and object-store credentials, even though the DuckDB client already has those credentials.

Why this matters

For serverless Quack metadata backends, this creates an unexpected requirement:

  • The DuckDB client has the TYPE r2 / TYPE s3 secret needed to read/write DuckLake data files.
  • But ducklake_add_data_files(...) sends parquet_full_metadata('r2://...') through Quack.
  • Therefore the remote Quack backend also needs R2/S3 access, or must reimplement Parquet metadata extraction.

This is surprising because the metadata catalog backend should not necessarily need object-store credentials. For example, Postgres does not need to read Parquet files.

Code path observed

In ducklake_add_data_files.cpp, DuckLakeFileProcessor::ReadParquetFullMetadata(...) builds a query containing:

FROM parquet_full_metadata(...)

and calls:

transaction.Query(...)

For the base metadata manager path, DuckLakeMetadataManager::Query(...) eventually executes the query locally through transaction.ExecuteRaw(...).

For Postgres, PostgresMetadataManager::Query(...) falls back to the base implementation, so parquet_full_metadata(...) runs locally in DuckDB. Postgres only overrides Execute(...) for metadata writes via postgres_execute(...).

For Quack, QuackMetadataManager::Query(...) overrides this behavior and wraps the query in:

CALL system.main.quack_query_by_name(...)

As a result, parquet_full_metadata(...) is executed by the remote Quack endpoint instead of the local DuckDB client.
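The dispatch difference described above can be sketched with a toy model. This is not the DuckLake source: the class names `MetadataManager`, `PostgresLikeManager`, `QuackLikeManager`, the `Site` enum, and the `Trace` struct are hypothetical stand-ins that only record where a query would run, assuming the override structure reported in this issue.

```cpp
#include <cassert>
#include <string>

// Hypothetical, heavily simplified stand-ins for the classes named in this issue.
// They model only *where* a statement runs, not the real DuckLake signatures.
enum class Site { LocalDuckDB, RemotePostgres, RemoteQuack };

struct Trace {
    Site site;
    std::string sql;
};

class MetadataManager {
public:
    virtual ~MetadataManager() = default;
    // Base behavior: queries run in the local DuckDB process.
    virtual Trace Query(const std::string &sql) { return {Site::LocalDuckDB, sql}; }
    virtual Trace Execute(const std::string &sql) { return {Site::LocalDuckDB, sql}; }
};

class PostgresLikeManager : public MetadataManager {
public:
    // Only writes are forwarded; Query() is inherited and stays local.
    Trace Execute(const std::string &sql) override { return {Site::RemotePostgres, sql}; }
};

class QuackLikeManager : public MetadataManager {
public:
    // Query() is overridden, so every query, including file inspection, goes remote.
    // Execute() is left untouched here because the issue does not describe
    // Quack's write path.
    Trace Query(const std::string &sql) override { return {Site::RemoteQuack, sql}; }
};

// Mirrors the shape of the call reported in the issue: build the inspection
// query and hand it to whichever Query() the manager provides.
Trace ReadParquetFullMetadata(MetadataManager &manager, const std::string &path) {
    assert(!path.empty());
    return manager.Query("FROM parquet_full_metadata('" + path + "')");
}
```

In this model, calling `ReadParquetFullMetadata` with a `PostgresLikeManager` reports `Site::LocalDuckDB`, while the same call with a `QuackLikeManager` reports `Site::RemoteQuack`, which is the asymmetry this issue is about.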

Expected behavior

ducklake_add_data_files(...) should inspect Parquet files in the DuckDB client process, where the relevant filesystem extensions and secrets already exist.

The Quack metadata catalog should only receive the resulting metadata writes/reads that truly belong to the metadata catalog.

Actual behavior

With a Quack metadata catalog, parquet_full_metadata(...) is sent to the remote Quack backend. A remote backend that only implements the metadata catalog SQL cannot support ducklake_add_data_files(...) unless it also implements Parquet metadata extraction and has access to the same data files.

Reproduction shape

Using a Quack-backed DuckLake catalog:

LOAD httpfs;
LOAD quack;
LOAD ducklake;

CREATE SECRET (
  TYPE quack,
  TOKEN '...'
);

CREATE SECRET lake_r2 (
  TYPE r2,
  KEY_ID '...',
  SECRET '...',
  ACCOUNT_ID '...',
  SCOPE 'r2://bucket/lake/'
);

ATTACH 'ducklake:quack:<host>:443' AS lake (
  DATA_PATH 'r2://bucket/lake/'
);

CALL ducklake_add_data_files('lake', 'some_table', 'r2://bucket/path/file.parquet');

The remote Quack service receives a query involving parquet_full_metadata(...).

Suggested direction

One possible fix would be to avoid routing client-local file inspection table functions through QuackMetadataManager::Query(...).

For example, DuckLake could split ducklake_add_data_files(...) into:

  1. Local DuckDB phase:
    • run parquet_full_metadata(...)
    • validate schema/types/partition info
    • build DuckLakeDataFile metadata
  2. Metadata catalog phase:
    • write the resulting DuckLake metadata rows through the metadata manager
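A minimal sketch of the two-phase split, under stated assumptions: `DataFileInfo`, `LocalInspector`, `CatalogWriter`, `InspectLocally`, and `WriteToCatalog` are hypothetical names, the inspector callback stands in for running parquet_full_metadata(...) in the client, and the string passed to the writer is a placeholder rather than real DuckLake catalog SQL.

```cpp
#include <cassert>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical record; the real DuckLakeDataFile carries more fields.
struct DataFileInfo {
    std::string path;
    std::uint64_t row_count;
};

// Stand-in for running parquet_full_metadata(...) in the local DuckDB process.
using LocalInspector = std::function<std::uint64_t(const std::string &)>;
// Stand-in for the metadata manager's write path to the remote catalog.
using CatalogWriter = std::function<void(const std::string &)>;

// Phase 1: inspect files in the client, where secrets and filesystems exist.
std::vector<DataFileInfo> InspectLocally(const std::vector<std::string> &paths,
                                         const LocalInspector &inspect) {
    std::vector<DataFileInfo> files;
    for (const auto &path : paths) {
        assert(!path.empty());
        files.push_back(DataFileInfo{path, inspect(path)});
    }
    return files;
}

// Phase 2: send only the derived metadata to the catalog.
// The message format below is a placeholder, not real catalog SQL.
void WriteToCatalog(const std::vector<DataFileInfo> &files, const CatalogWriter &write) {
    for (const auto &file : files) {
        write("register_data_file path=" + file.path +
              " rows=" + std::to_string(file.row_count));
    }
}
```

The point of the split is that nothing passed to `CatalogWriter` references parquet_full_metadata(...), so the remote side never needs to open the data file.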

Alternatively, the Quack metadata manager could distinguish metadata-catalog SQL from client-local helper queries and execute the latter locally.
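One way this alternative could look, as a rough sketch only: `Route`, `kClientLocalFunctions`, and `ClassifyQuery` are hypothetical names, and the allowlist contains only parquet_full_metadata because that is the only helper named in this issue. The substring match is deliberately naive (it would also match inside string literals or comments); a real implementation would more likely tag the query at the call site than inspect SQL text.

```cpp
#include <cassert>
#include <string>

enum class Route { Local, Remote };

// Hypothetical allowlist of table functions that must run in the client.
static const char *const kClientLocalFunctions[] = {"parquet_full_metadata("};

// Decide where a query should run: client-local helpers stay in the DuckDB
// process, everything else is forwarded to the remote metadata catalog.
Route ClassifyQuery(const std::string &sql) {
    for (const char *fn : kClientLocalFunctions) {
        if (sql.find(fn) != std::string::npos) {
            return Route::Local;
        }
    }
    return Route::Remote;
}
```

With this routing, a query such as `FROM parquet_full_metadata('r2://bucket/path/file.parquet')` would be classified as `Route::Local`, while ordinary catalog reads would still be classified as `Route::Remote`.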

Impact

This would make Quack-backed DuckLake behavior align better with Postgres-backed DuckLake behavior and avoid requiring remote Quack metadata services to have data-file credentials or to reimplement DuckDB table functions such as parquet_full_metadata(...).

To Reproduce

See above

OS:

macOS

DuckDB Version:

1.5.2

DuckLake Version:

1.0

DuckDB Client:

CLI

Hardware:

No response

Full Name:

TobiLG

Affiliation:

None

What is the latest build you tested with? If possible, we recommend testing with the latest nightly build.

I have tested with a nightly build

Did you include all relevant data sets for reproducing the issue?

Not applicable - the reproduction does not require a data set

Did you include all code required to reproduce the issue?

  • Yes, I have

Did you include all relevant configuration (e.g., CPU architecture, Python version, Linux distribution) to reproduce the issue?

  • Yes, I have
