split GetOrphanFilesForCleanup by qsliu2017 · Pull Request #755 · duckdb/ducklake

qsliu2017 · 2026-02-09T13:14:20Z

Part of #731.

Split GetOrphanFilesForCleanup into two steps: GetActiveFiles (run on the metadata DB) and a listing of all files via DuckDB's read_blob.

pdet

Thanks for the PR. I have one comment!

pdet · 2026-02-11T15:32:06Z

-FROM read_blob({DATA_PATH} || '**')
-WHERE filename NOT IN (
+
+unordered_set<string> DuckLakeMetadataManager::GetActiveFiles(const string &separator) {


In general, we should avoid splitting up queries, as doing so increases round-trips to the DBMS catalog, which can decrease performance.

qsliu2017 · 2026-02-11 (edited)

@pdet

read_blob({DATA_PATH} || '**') is a DuckDB-specific function.

Actually, I think this split will eventually reduce round-trips:

Original query: contains read_blob, so it must be executed in DuckDB. DuckDB reads each related catalog table with one select * from table, so the number of round-trips equals the number of catalog tables in the query.

New query: can be executed as a single query in the catalog, so the number of round-trips is 1.
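The round-trip argument above can be written down as a toy cost model. This is only a sketch of the reasoning in the comment, not a measurement and not code from the PR; the function names are hypothetical, and it assumes one round-trip per catalog table scan in the original plan.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Toy model (assumption): when the query must run inside DuckDB, each catalog
// table it touches is fetched with its own "select * from table".
std::size_t RoundTripsWhenRunInDuckDB(const std::vector<std::string> &catalog_tables) {
    return catalog_tables.size();
}

// Toy model (assumption): when the metadata query contains no DuckDB-specific
// function, it is sent to the catalog as one statement.
std::size_t RoundTripsWhenPushedToCatalog(const std::vector<std::string> &catalog_tables) {
    return catalog_tables.empty() ? 0 : 1;
}
```

Under this model, a query over three catalog tables costs three round-trips in the original form and one after the split.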

qsliu2017 · 2026-02-26T08:39:37Z

Hi @pdet, just checking in on this PR when you have a moment. Let me know if there's anything else you need from my side. Thanks!

pdet · 2026-02-26T09:29:19Z

Hi, I'll have a look this afternoon, but I'm holding off on merges for a bit for the release!

Extract the metadata query (data files, delete files, scheduled-for-deletion files) into a separate GetActiveFiles method that returns an unordered_set of known file paths. GetOrphanFilesForCleanup now queries the filesystem independently and filters against the active file set. This separation allows the metadata query to run on the metadata connection while the filesystem scan runs on the transaction connection.
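The filtering step described in the commit message might look roughly like the sketch below. It is not the PR's implementation: FilterOrphanFiles and the example paths are hypothetical, and the file listing is passed in as a plain vector rather than being produced by read_blob.

```cpp
#include <cassert>
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper: keep only the listed files that the metadata catalog
// does not know about. active_files stands in for the set that GetActiveFiles
// returns (data files, delete files, scheduled-for-deletion files).
std::vector<std::string> FilterOrphanFiles(const std::vector<std::string> &files_on_disk,
                                           const std::unordered_set<std::string> &active_files) {
    std::vector<std::string> orphans;
    for (const auto &path : files_on_disk) {
        if (active_files.find(path) == active_files.end()) {
            orphans.push_back(path);
        }
    }
    return orphans;
}

// Usage sketch with made-up paths: one of the two listed files is active.
inline void ExampleUsage() {
    const std::unordered_set<std::string> active = {"data/a.parquet"};
    const std::vector<std::string> listed = {"data/a.parquet", "data/stale.parquet"};
    const auto orphans = FilterOrphanFiles(listed, active);
    assert(orphans.size() == 1 && orphans[0] == "data/stale.parquet");
}
```

Because the active set is a hash set, each lookup is constant time on average, so the filter is linear in the number of listed files.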

qsliu2017 mentioned this pull request Feb 9, 2026

Extends DuckLakeMetadataManager to be easier overridden #731

Draft

qsliu2017 changed the title ~~divide GetOrphanFilesForCleanup~~ split GetOrphanFilesForCleanup Feb 10, 2026

pdet reviewed Feb 11, 2026

qsliu2017 requested a review from pdet February 22, 2026 05:54

qsliu2017 force-pushed the divide-cleanup branch from 1d6396a to 5b976c0 Compare March 24, 2026 06:29

qsliu2017 changed the title ~~split GetOrphanFilesForCleanup~~ Move metadata queries to DuckLakeMetadataManager::Query/Execute Mar 24, 2026

qsliu2017 force-pushed the divide-cleanup branch from 5e1bd18 to 5b976c0 Compare March 24, 2026 09:54

qsliu2017 changed the title ~~Move metadata queries to DuckLakeMetadataManager::Query/Execute~~ split GetOrphanFilesForCleanup Mar 24, 2026

qsliu2017 force-pushed the divide-cleanup branch from 5b976c0 to e2b9b2b Compare March 26, 2026 09:36

pdet added the merge conflict and CI failure labels Apr 16, 2026

fuziontech mentioned this pull request Apr 29, 2026

delete_orphaned_files does unbounded full-bucket LIST, times out at scale #1090

Open

qsliu2017 wants to merge 1 commit into duckdb:main from qsliu2017:divide-cleanup