Skip to content

split GetOrphanFilesForCleanup#755

Open
qsliu2017 wants to merge 1 commit into
duckdb:mainfrom
qsliu2017:divide-cleanup
Open

split GetOrphanFilesForCleanup#755
qsliu2017 wants to merge 1 commit into
duckdb:mainfrom
qsliu2017:divide-cleanup

Conversation

@qsliu2017
Copy link
Copy Markdown
Contributor

@qsliu2017 qsliu2017 commented Feb 9, 2026

part of #731

split GetOrphanFilesForCleanup into GetActiveFiles(on metadata db) and get all files by DuckDB's read_blob.

@qsliu2017 qsliu2017 changed the title divide GetOrphanFilesForCleanup split GetOrphanFilesForCleanup Feb 10, 2026
Copy link
Copy Markdown
Member

@pdet pdet left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the PR. I have one comment!

FROM read_blob({DATA_PATH} || '**')
WHERE filename NOT IN (

unordered_set<string> DuckLakeMetadataManager::GetActiveFiles(const string &separator) {
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In general, we should avoid splitting up queries, as these will increase round-trips to the DBMS catalog, which can decrease performance.

Copy link
Copy Markdown
Contributor Author

@qsliu2017 qsliu2017 Feb 11, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@pdet

read_blob({DATA_PATH} || '**') is a duckdb specific function.

actually I think this split will eventually reduce round-trips:

  • original query: contains read_blob, must be executed in duckdb. duckdb reads each related catalog table by one select * from table. # of round-trips == # of catalog table in query
  • new query: can be executed as one query in catalog. # of round-trips == 1

@qsliu2017 qsliu2017 requested a review from pdet February 22, 2026 05:54
@qsliu2017
Copy link
Copy Markdown
Contributor Author

Hi @pdet , just checking in on this PR when you have a moment. Let me know if there's anything else you need from my side. Thanks!

@pdet
Copy link
Copy Markdown
Member

pdet commented Feb 26, 2026

Hi @pdet , just checking in on this PR when you have a moment. Let me know if there's anything else you need from my side. Thanks!

Hi, I'll have a look this afternoon, but I'm holding off on merges for a bit for the release!

@qsliu2017 qsliu2017 changed the title split GetOrphanFilesForCleanup Move metadata queries to DuckLakeMetadataManager::Query/Execute Mar 24, 2026
@qsliu2017 qsliu2017 changed the title Move metadata queries to DuckLakeMetadataManager::Query/Execute split GetOrphanFilesForCleanup Mar 24, 2026
Extract the metadata query (data files, delete files, scheduled-for-deletion
files) into a separate GetActiveFiles method that returns an unordered_set of
known file paths. GetOrphanFilesForCleanup now queries the filesystem
independently and filters against the active file set. This separation allows
the metadata query to run on the metadata connection while the filesystem
scan runs on the transaction connection.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants