CREATE and CTAS WITH clause for table options #1125
Open
Alex-Monahan wants to merge 9 commits into
GetNewPartitionKey was mutating partition_data->partition_id from the local TRANSACTION_LOCAL_ID_START sentinel to the freshly-allocated committed id at commit time. On a metadata-write conflict (concurrent CTAS), CommitChanges retries with a fresh commit_state. The retry would read the already-mutated committed id as if it were the local id and register the wrong key in committed_partition_ids; RemapPartitionId would then leak the unmodified local sentinel (2^63) into the data-file SQL. Postgres bigint rejects 2^63 (signed int64 max is 2^63 - 1); SQLite silently truncates, which masked the bug on the SQLite metadata backend.

The mutation never had any consumer: every reader of partition_data->partition_id runs at plan time of a current or future statement, and future statements load fresh state from metadata via the persisted partition_key.id. Removing the mutation makes GetNewPartitionKey idempotent across retries.

Apply the same removal to GetNewSortKey for symmetry: sort_id is currently harmless because no RemapSortId step exists, but the mutation pattern is the same latent foot-gun.
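The retry failure described above can be modeled in a few lines. This is a toy Python sketch, not the DuckLake C++ code: the snake_case names only mirror the identifiers in the commit message, the allocated ids 7 and 8 are arbitrary, and the assumption that an unmapped id passes through remapping unchanged is inferred from the description rather than taken from the source.

```python
TRANSACTION_LOCAL_ID_START = 2**63  # local sentinel, one past the signed int64 max
INT64_MAX = 2**63 - 1


class PartitionData:
    def __init__(self, partition_id):
        self.partition_id = partition_id


def get_new_partition_key(partition, committed_partition_ids, new_id, mutate):
    """Register local id -> committed id; optionally mutate the partition (the old behavior)."""
    local_id = partition.partition_id
    committed_partition_ids[local_id] = new_id
    if mutate:
        partition.partition_id = new_id  # the mutation the fix removes
    return new_id


def remap_partition_id(local_id, committed_partition_ids):
    # Assumption: an id with no mapping is passed through unchanged.
    return committed_partition_ids.get(local_id, local_id)


def commit_with_one_retry(mutate):
    partition = PartitionData(TRANSACTION_LOCAL_ID_START)
    # Data files written in this transaction still reference the local id.
    data_file_partition_id = TRANSACTION_LOCAL_ID_START

    # First attempt: allocates id 7, then hits a metadata-write conflict.
    commit_state = {}
    get_new_partition_key(partition, commit_state, 7, mutate)

    # Retry: fresh commit state, new id 8.
    commit_state = {}
    get_new_partition_key(partition, commit_state, 8, mutate)
    return remap_partition_id(data_file_partition_id, commit_state)


leaked = commit_with_one_retry(mutate=True)
fixed = commit_with_one_retry(mutate=False)
```

With the mutation, the retry registers the key under the already-committed id, so the data file's local sentinel finds no mapping and is emitted as-is, which exceeds the signed 64-bit range. Without the mutation, the retry maps the sentinel to the new id.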
Allow CREATE TABLE and CREATE TABLE AS SELECT statements to set per-table DuckLake config options through the existing WITH (key=value, ...) clause. Each key/value pair is validated using the same path as ducklake_set_option and persisted into ducklake_metadata at commit time, scoped to the new table. Options also take effect immediately for any data written within the same transaction (e.g. CTAS row group sizing) and survive a rename within that transaction.
…ed expressions
Naming pass — every WITH-clause-related identifier now uses options_in_create_with /
OptionsInCreateWith, replacing pending_table_options, GetPendingTableOptions,
SetPendingTableOptions, with_options, new_table_options, WriteNewTableConfigOptions,
ValidateCreateTableOptions, and create_with_options.
Architectural cleanup:
- Push WITH-clause validation+population into the schema-flavored DuckLakeCopyInput
constructor; PlanCreateTableAs no longer mutates copy_input post-construction.
- Replace the inlined get_option lambda in PlanCopyForInsert with a method,
DuckLakeCopyInput::GetEffectiveOption (the lambda was the only inlined helper in
DuckLake's src/ tree before this change).
- Mirror the partition_data / sort_data copy guard pattern on the alter constructors:
only copy options_in_create_with when the parent has any.
Functionality:
- Accept foldable expressions (constants, getvariable(...), upper('zstd'), etc.) as
WITH (...) values, mirroring upstream CREATE SECRET / ATTACH binding. Use
ConstantBinder + ExpressionExecutor::EvaluateScalar; reject unbindable expressions
(column refs, subqueries) and prepared-statement parameters.
- Lift the raw-options-map validation loop into a single shared free function
ValidateOptionsInCreateWith; both call sites collapse to one line each.
Tests:
- Add cases for getvariable() WITH values, function-call folding (upper('zstd')),
unbound column-ref rejection, and the explicit CREATE TABLE WITH + same-txn INSERT
scenario.
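The accept/reject rule for WITH values listed under Functionality can be illustrated with a small analogy. The sketch below is a Python stand-in, not DuckDB's ConstantBinder: it borrows Python's expression grammar in place of SQL, and the FUNCTIONS table, the VARIABLES map, and the name fold_option_value are all hypothetical.

```python
import ast

# Hypothetical session variables, standing in for values set via SET VARIABLE.
VARIABLES = {"codec": "zstd"}

# Hypothetical whitelist of scalar functions that can be folded.
FUNCTIONS = {
    "upper": lambda s: s.upper(),
    "lower": lambda s: s.lower(),
    "getvariable": lambda name: VARIABLES[name],
}


def _fold(node):
    """Recursively evaluate a node, accepting only constants and whitelisted calls."""
    if isinstance(node, ast.Constant):
        return node.value
    if isinstance(node, ast.Call) and isinstance(node.func, ast.Name) and not node.keywords:
        fn = FUNCTIONS.get(node.func.id)
        if fn is None:
            raise ValueError(f"unknown function: {node.func.id}")
        return fn(*[_fold(arg) for arg in node.args])
    if isinstance(node, ast.Name):
        # A bare name plays the role of a column reference: not foldable.
        raise ValueError(f"column reference is not a constant: {node.id}")
    raise ValueError(f"unsupported expression: {type(node).__name__}")


def fold_option_value(expr):
    """Fold an option-value expression down to a constant, or raise ValueError."""
    return _fold(ast.parse(expr, mode="eval").body)


def rejects(expr):
    try:
        fold_option_value(expr)
    except ValueError:
        return True
    return False
```

In this model, `fold_option_value("upper('zstd')")` folds to a plain string before any validation runs, while a bare identifier such as `some_col` is rejected because nothing binds it to a value.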
…le_with_options.test
Two new test cases (one plain CREATE TABLE, one CTAS) exercise all three clauses in the same statement.
Verifies:
- Parser accepts the three-clause combination (no syntax conflicts).
- WITH options (parquet_compression, parquet_row_group_size) persist to ducklake_metadata under the new table_id.
- PARTITIONED BY produces hive 'i=N/' directories in the data path; for 20 distinct values we see 20 partition directories.
- WITH compression overrides the default (gzip on the plain path, zstd on the CTAS path) and is observed in parquet_metadata().
SORT correctness is exercised here only by not-erroring; full SORT verification lives in test/sql/sorted_table/* and test/sql/create_table_inline_partition_sort/*_sort_verification.test.
…tests
Both the plain-INSERT and CTAS combined-clause tests now hardcode per-rowgroup
stats_min/max for the sort key, mirroring the technique in
create_table_inline_ctas_sort_verification.test.
Layout: 6144 rows, p = i // 3072 (two partitions), b = 6143 - i (input is b
DESCENDING). With parquet_row_group_size=2048 from WITH and
preserve_insertion_order=true, each partition file has exactly 2 rowgroups.
After SORTED BY (b), per-rowgroup b stats are disjoint and ascending:
Partition p=0 (b ∈ [3072, 6143]):
rowgroup 0: b ∈ [3072, 5119]
rowgroup 1: b ∈ [5120, 6143]
Partition p=1 (b ∈ [0, 3071]):
rowgroup 0: b ∈ [0, 2047]
rowgroup 1: b ∈ [2048, 3071]
Without the sort, each file's rowgroups would follow the input order (b
descending), so rowgroup 0 would hold that partition's largest b values rather
than its smallest. The disjoint-ascending shape is therefore a tight regression
test for 'sort actually reordered rows on disk' on both code paths.
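The hardcoded stats follow from simple arithmetic, which a short Python sketch can check. It models only the layout described in this commit message (partition by p, sort each partition by b, cut into rowgroups of 2048 rows); it does not run DuckLake or read Parquet, and the helper name rowgroup_stats is made up for illustration.

```python
ROWS = 6144
ROWS_PER_PARTITION = 3072
ROW_GROUP_SIZE = 2048


def rowgroup_stats(values, size):
    """Return (min, max) for each consecutive chunk of `size` values."""
    return [
        (min(values[k:k + size]), max(values[k:k + size]))
        for k in range(0, len(values), size)
    ]


# Build the input: p = i // 3072 and b = 6143 - i, so b arrives descending.
partitions = {}
for i in range(ROWS):
    p = i // ROWS_PER_PARTITION
    b = (ROWS - 1) - i
    partitions.setdefault(p, []).append(b)

# SORTED BY (b): sort within each partition, then split into rowgroups.
sorted_stats = {
    p: rowgroup_stats(sorted(bs), ROW_GROUP_SIZE) for p, bs in partitions.items()
}

for p, stats in sorted(sorted_stats.items()):
    for rg, (lo, hi) in enumerate(stats):
        print(f"p={p} rowgroup {rg}: b in [{lo}, {hi}]")
```

Each partition holds 3072 rows, so a 2048-row rowgroup size yields one full rowgroup and one 1024-row remainder per file, which is where the two-rowgroup shape comes from.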
Howdy folks!
This PR adds support in DuckLake for table-scoped settings to be set in the WITH clause of CREATE and CTAS statements. The goal is to make DuckLake much easier to use from dbt and other tools that rely heavily on CTAS syntax.

This builds on the DuckDB functionality added in duckdb/duckdb#20431 and duckdb/duckdb#20728.
For example:
It works by setting table-level options in the ducklake_metadata table, so the options are persisted. Option values can be any expression that simplifies down to a constant, so getvariable syntax and string manipulation functions are supported. Option names, however, must be plain text.

This PR builds on top of #1107 (it contains all of the changes in #1107) and also targets v1.5.
This was built with Claude's help, but I have done substantial refactoring as a part of writing the PR and I have read through the tests and added new ones to make sure there is adequate coverage.
Open to feedback of course!
CC @nicku33