split GetOrphanFilesForCleanup #755
pdet left a comment
Thanks for the PR. I have one comment!
FROM read_blob({DATA_PATH} || '**')
WHERE filename NOT IN (
unordered_set<string> DuckLakeMetadataManager::GetActiveFiles(const string &separator) {
In general, we should avoid splitting up queries, as doing so increases round-trips to the DBMS catalog, which can decrease performance.
read_blob({DATA_PATH} || '**') is a DuckDB-specific function.
Actually, I think this split will eventually reduce round-trips:
- original query: contains read_blob, so it must be executed in DuckDB. DuckDB reads each related catalog table with one select * from table, so # of round-trips == # of catalog tables in the query
- new query: can be executed as a single query in the catalog, so # of round-trips == 1
Hi @pdet, just checking in on this PR when you have a moment. Let me know if there's anything else you need from my side. Thanks!
Hi, I'll have a look this afternoon, but I'm holding off on merges for a bit for the release!
1d6396a to 5b976c0
5e1bd18 to 5b976c0
Extract the metadata query (data files, delete files, scheduled-for-deletion files) into a separate GetActiveFiles method that returns an unordered_set of known file paths. GetOrphanFilesForCleanup now queries the filesystem independently and filters against the active file set. This separation allows the metadata query to run on the metadata connection while the filesystem scan runs on the transaction connection.
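As a rough illustration of the filtering step described in this commit message, here is a minimal sketch. It assumes the active file set has already been fetched (the role GetActiveFiles plays in the PR) and that the filesystem scan has already produced a list of paths. `ActiveFileSet` and `FilterOrphanFiles` are hypothetical names used only for this example and are not taken from the PR.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical alias for the kind of set GetActiveFiles returns: every path
// the metadata catalog still references (data files, delete files, and files
// scheduled for deletion).
using ActiveFileSet = std::unordered_set<std::string>;

// Hypothetical helper, not part of the PR: keep only the paths found on disk
// that the metadata catalog does not know about.
std::vector<std::string> FilterOrphanFiles(const std::vector<std::string> &files_on_disk,
                                           const ActiveFileSet &active_files) {
	std::vector<std::string> orphans;
	for (const auto &path : files_on_disk) {
		// A path missing from the active set is treated as an orphan.
		if (active_files.find(path) == active_files.end()) {
			orphans.push_back(path);
		}
	}
	return orphans;
}
```

Because the lookup is a hash-set probe, the filter is linear in the number of files on disk on average, and neither input needs to come from the same connection as the other.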
5b976c0 to e2b9b2b
part of #731
split GetOrphanFilesForCleanup into GetActiveFiles (on metadata db) and get all files by DuckDB's read_blob.