
Add ducklake_static_backup#429

Draft
carlopi wants to merge 1 commit into duckdb:main from carlopi:ducklake_static_backup

@carlopi
Member

@carlopi carlopi commented Sep 9, 2025

For example, you can attach a given DuckLake in read/write mode like this:

ATTACH 'ducklake:postgres:<CONNECTION_STRING>' AS my_ducklake (DATA_PATH 's3://some_bucket', STATIC_BACKUP 's3://some_bucket/backup.ducklake');
--- some operations
CALL ducklake_static_backup('my_ducklake');
--- this will copy metadata to a duckdb file at s3://some_bucket/backup.ducklake
--- some more operations
CALL ducklake_static_backup('my_ducklake');

At the same time, a fully static, read-only backup will be accessible like this:

ATTACH 'ducklake:s3://some_bucket/backup.ducklake' (READ_ONLY); ---- DATA_PATH will be the same as the main ducklake instance

(Note that the backup will be up to date as of the most recent ducklake_static_backup call that happened BEFORE attaching it.)

@carlopi carlopi force-pushed the ducklake_static_backup branch from 9d456af to 524bff5 Compare September 9, 2025 07:24
@carlopi carlopi requested a review from pdet September 9, 2025 07:53

namespace duckdb {

struct BackupBindData : public TableFunctionData {

please wrap these in an anonymous namespace to avoid name collisions

namespace {

...

} // namespace
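
As a minimal sketch of what this suggestion looks like in practice (the namespace `duckdb_sketch` and the helpers `DescribeBackup` and `BackupSummary` are hypothetical stand-ins, not code from this PR), wrapping file-local structs in an anonymous namespace gives them internal linkage, so another translation unit can define its own `BackupBindData` without a clash:

```cpp
#include <string>

// Hypothetical stand-in for the extension's namespace; not the PR's code.
namespace duckdb_sketch {

namespace {

// Internal linkage: another translation unit may define its own
// BackupBindData without violating the one-definition rule.
struct BackupBindData {
	std::string catalog_name;
};

// Also file-local, so it cannot collide with a helper of the same name elsewhere.
std::string DescribeBackup(const BackupBindData &data) {
	return "backup of " + data.catalog_name;
}

} // namespace

// Only this function is visible to other translation units.
std::string BackupSummary(const std::string &catalog_name) {
	BackupBindData data{catalog_name};
	return DescribeBackup(data);
}

} // namespace duckdb_sketch
```

Only the entry point that has to be registered with the extension would stay outside the anonymous namespace.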


if (fs.FileExists(tmp_uuid) || fs.FileExists(tmp_uuid + ".wal")) {
throw BinderException(
"Temporary file \"%s\" is already in use, please cleanup files in the form \"ducklake_backup_file.*\"",

This is generated, we can just regenerate if this is the case, no?
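
A rough sketch of the regenerate-on-collision idea, under some assumptions: it uses `std::filesystem` as a stand-in for DuckDB's file system abstraction, and `RandomBackupName` and `PickUnusedBackupName` are hypothetical helpers, not functions in this PR. Instead of throwing on the first collision, it draws a new name and retries a bounded number of times:

```cpp
#include <filesystem>
#include <random>
#include <stdexcept>
#include <string>

// Hypothetical helper: build a random candidate name of the form
// "ducklake_backup_file.<16 hex digits>".
inline std::string RandomBackupName(std::mt19937_64 &rng) {
	static const char hex[] = "0123456789abcdef";
	std::string name = "ducklake_backup_file.";
	std::uniform_int_distribution<int> dist(0, 15);
	for (int i = 0; i < 16; i++) {
		name += hex[dist(rng)];
	}
	return name;
}

// Retry with a fresh name instead of failing on the first collision.
// Gives up after max_attempts so the loop is always bounded.
inline std::string PickUnusedBackupName(const std::filesystem::path &dir, int max_attempts = 8) {
	std::mt19937_64 rng(std::random_device{}());
	for (int attempt = 0; attempt < max_attempts; attempt++) {
		std::filesystem::path candidate = dir / RandomBackupName(rng);
		// Mirror the check in the diff: both the file and its ".wal" must be absent.
		if (!std::filesystem::exists(candidate) && !std::filesystem::exists(candidate.string() + ".wal")) {
			return candidate.string();
		}
	}
	throw std::runtime_error("could not find an unused temporary backup file name");
}
```

The existence check and the later file creation are still not atomic, so this narrows the window for a collision rather than closing it.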

string backup_location = ducklake_catalog.GetStaticBackup();

if (backup_location.empty()) {
throw InvalidInputException("static_backup not specified as attach option");

Is this strictly necessary? Is it possible to just accept a second parameter to the function and use that if it's not defined on the catalog?
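
One way the fallback could look, as a sketch only: `ResolveBackupLocation` is a hypothetical helper, and the resolution order (attach option first, then the explicit call argument) is an assumption based on the wording of the comment above, not behavior of this PR.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical resolution order: the STATIC_BACKUP attach option is used when set,
// otherwise the optional second argument of the call, otherwise the call is rejected.
inline std::string ResolveBackupLocation(const std::string &attach_option, const std::string &call_argument) {
	if (!attach_option.empty()) {
		return attach_option;
	}
	if (!call_argument.empty()) {
		return call_argument;
	}
	throw std::invalid_argument(
	    "no backup location: pass one as a second argument or set static_backup as an attach option");
}
```

With this shape, the error is only raised when neither source provides a location, so existing calls that rely on the attach option keep working.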

}

auto result = transaction.Query(
string("") + "ATTACH IF NOT EXISTS '" + tmp_uuid +
@Tishj Tishj Feb 19, 2026


Can we use:

StringUtil::Format(R"(
ATTACH IF NOT EXISTS '%s' AS {METADATA_CATALOG_NAME_IDENTIFIER_BACKUP} (STORAGE_VERSION 'v1.4.0');
COPY FROM DATABASE {METADATA_CATALOG_NAME_IDENTIFIER} TO {METADATA_CATALOG_NAME_IDENTIFIER_BACKUP};
DETACH {METADATA_CATALOG_NAME_IDENTIFIER_BACKUP};
COPY (SELECT content FROM read_blob('%s')) TO '%s' (FORMAT BLOB);
COPY (SELECT content FROM read_blob('%s.wal')) TO '%s.wal' (FORMAT BLOB);
)", ...);

DuckLakeBackupData() : offset(0), executed(false) {
}

idx_t offset;

unused

@Tishj
Member

Tishj commented Feb 19, 2026

This looks like it's missing the second half, a method to restore from the backup?

@carlopi
Member Author

carlopi commented Feb 19, 2026

> This looks like it's missing the second half, a method to restore from the backup?

Thanks for the review!

This is on purpose, since the semantics are not super clear: what happens to data you have inserted or removed in the meantime?
A restore can always be performed out-of-band, but I think offering functionality for it invites more misuse.

The backup is intended to simplify cases where the metadata catalog might, for various reasons, be unavailable to end users, while the backup can simply be a couple of files on remote storage.
For example, a Postgres metadata catalog is great for parallel read/write, but a read-only DuckDB file is more portable and more accessible (for example, from the DuckDB-Wasm client), so if one accepts being somewhat out of sync, the tradeoff might be worth it.

This is also meant to simplify quasi-frozen DuckLake architectures, where one might add data at some cadence, but most workloads are read-heavy.

@Tishj
Copy link
Copy Markdown
Member

Tishj commented Feb 23, 2026

I think if it's called a backup, I would expect it to be a proper backup.
This sounds like it's only a metadata backup and, like you said, performing deletes could leave the backup entirely broken if the deleted data gets garbage collected.

I think this would require proper branching support first, to be able to freeze a state of the table and prevent garbage collection of the data referenced by the backup.
