Add ducklake_static_backup #429
For example, you can attach a given DuckLake in read/write mode like this:
```sql
ATTACH 'ducklake:postgres:<CONNECTION_STRING>' AS my_ducklake (DATA_PATH 's3://some_bucket', STATIC_BACKUP 's3://some_bucket/backup.ducklake');
--- some operations
CALL ducklake_static_backup('my_ducklake');
--- this will copy metadata to s3://some_bucket/backup.ducklake
```
At the same time, a fully static backup will be accessible like this:
```sql
ATTACH 'ducklake:s3://some_bucket/backup.ducklake' (READ_ONLY);
```
(Note that the backup will be up to date as of the most recent `ducklake_static_backup` call that happened BEFORE attaching it.)
    namespace duckdb {

    struct BackupBindData : public TableFunctionData {
Please wrap these in an anonymous namespace to avoid name collisions:
namespace {
...
} // namespace
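
A minimal sketch of what this suggestion could look like, independent of DuckDB. The `duckdb_sketch` namespace, the `DescribeBackup` helper, and the `RunBackupSketch` entry point are hypothetical names used only for illustration, and the struct here does not derive from `TableFunctionData` as the one in the PR does:

```cpp
#include <string>

namespace duckdb_sketch { // hypothetical stand-in for the extension's namespace

namespace {

// Internal linkage: this type is only visible inside this translation unit,
// so another .cpp file may define its own BackupBindData without clashing.
struct BackupBindData {
    std::string backup_location;
};

// Helper functions placed here get internal linkage as well.
std::string DescribeBackup(const BackupBindData &data) {
    return "backup -> " + data.backup_location;
}

} // namespace

// Only the entry point that other files need stays outside the anonymous namespace.
std::string RunBackupSketch(const std::string &location) {
    BackupBindData data{location};
    return DescribeBackup(data);
}

} // namespace duckdb_sketch
```

Everything between `namespace {` and `} // namespace` stays private to the file, which is what prevents collisions with identically named types in other translation units.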
    if (fs.FileExists(tmp_uuid) || fs.FileExists(tmp_uuid + ".wal")) {
        throw BinderException(
            "Temporary file \"%s\" is already in use, please cleanup files in the form \"ducklake_backup_file.*\"",
This name is generated, so we can just regenerate it if this is the case, no?
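
One way the regenerate-on-collision idea could be sketched, outside of DuckDB. `MakeCandidateName` and `PickUnusedTempName` are hypothetical helpers, the random hex suffix merely stands in for the UUID the PR generates, and the `file_exists` callback stands in for `fs.FileExists`; the retry limit of 8 is an arbitrary assumption:

```cpp
#include <functional>
#include <random>
#include <stdexcept>
#include <string>

// Hypothetical helper: builds a name of the form "ducklake_backup_file.<16 hex digits>".
std::string MakeCandidateName(std::mt19937_64 &rng) {
    static const char hex_digits[] = "0123456789abcdef";
    std::uniform_int_distribution<int> pick(0, 15);
    std::string name = "ducklake_backup_file.";
    for (int i = 0; i < 16; i++) {
        name += hex_digits[pick(rng)];
    }
    return name;
}

// Keeps generating names until neither the file nor its ".wal" companion exists.
// Throws only after max_attempts consecutive collisions.
std::string PickUnusedTempName(const std::function<bool(const std::string &)> &file_exists,
                               int max_attempts = 8) {
    std::mt19937_64 rng(std::random_device{}());
    for (int attempt = 0; attempt < max_attempts; attempt++) {
        std::string candidate = MakeCandidateName(rng);
        if (!file_exists(candidate) && !file_exists(candidate + ".wal")) {
            return candidate;
        }
    }
    throw std::runtime_error("could not find an unused temporary file name");
}
```

The existence check and the later file creation are not atomic, so a sketch like this narrows the collision window rather than closing it.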
    string backup_location = ducklake_catalog.GetStaticBackup();

    if (backup_location.empty()) {
        throw InvalidInputException("static_backup not specified as attach option");
Is this strictly necessary? Is it possible to just accept a second parameter to the function and use that if it's not defined on the catalog?
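
The fallback asked about here might be sketched as follows. `ResolveBackupLocation` is a hypothetical helper, not part of the PR, and the precedence (attach option first, function argument second) is an assumption taken from the wording of the question:

```cpp
#include <stdexcept>
#include <string>

// Hypothetical resolution rule:
//   1. use the STATIC_BACKUP attach option when the catalog has one,
//   2. otherwise use the optional second argument of the function call,
//   3. otherwise report that no backup location is known.
std::string ResolveBackupLocation(const std::string &catalog_option, const std::string &call_argument) {
    if (!catalog_option.empty()) {
        return catalog_option;
    }
    if (!call_argument.empty()) {
        return call_argument;
    }
    throw std::invalid_argument("static_backup not specified as attach option or function argument");
}
```

With a rule like this, the error is raised only when both sources are empty, so existing calls that rely on the attach option would behave as before.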
    }

    auto result = transaction.Query(
        string("") + "ATTACH IF NOT EXISTS '" + tmp_uuid +
Can we use:
StringUtil::Format(R"(
ATTACH IF NOT EXISTS '%s' AS {METADATA_CATALOG_NAME_IDENTIFIER_BACKUP} (STORAGE_VERSION 'v1.4.0');
COPY FROM DATABASE {METADATA_CATALOG_NAME_IDENTIFIER} TO {METADATA_CATALOG_NAME_IDENTIFIER_BACKUP};
DETACH {METADATA_CATALOG_NAME_IDENTIFIER_BACKUP};
COPY (SELECT content FROM read_blob('%s')) TO '%s' (FORMAT BLOB);
COPY (SELECT content FROM read_blob('%s.wal')) TO '%s.wal' (FORMAT BLOB);
)", ...);

    DuckLakeBackupData() : offset(0), executed(false) {
    }

    idx_t offset;

This looks like it's missing the second half, a method to restore from the backup?
Thanks for the review! This is on purpose, since the semantics are not super clear: for example, what happens to data you have inserted or removed in the meantime? The backup is intended to simplify cases where the metadata catalog might, for various reasons, be unavailable to end users, while the backup can simply be a couple of files on remote storage. It is also meant to simplify quasi-frozen DuckLake architectures, where one might add data at some cadence but most workloads are read-heavy.
I think this would require proper branching support first, to be able to freeze a state of the table and prevent garbage collection of the data referenced by the backup.