fix: concurrent-commit correctness on the postgres metadata backend #1113

Open
hello-world-bfree wants to merge 1 commit into duckdb:main from hello-world-bfree:fix/incrementing-id-race

@hello-world-bfree
Contributor

Note

This is a follow-up to #1044 and addresses issue #1094. The original PR was closed by @pdet over concerns that conflict resolution was being bypassed. That feedback surfaced a window in which a concurrent committer could land changes between the check and the lock, because CheckForConflicts ran before AcquireCommitLock. This PR closes that window, adds definitive conflict-detection tests, and reframes the issue: it is a correctness problem, not a performance one (performance will require the stored proc implementation that @pdet mentioned). Given the effort and cost of the stored proc implementation (as I understand it, at least), this C++ implementation would stop the bleeding and allow incremental progress toward the stored proc implementation at minimal cost.

Problem

*_id allocation uses max(id) + 1 in C++ across two postgres round trips. Concurrent writers race and hit ducklake_snapshot_pkey violations. The retry succeeds, but logs fill at >1 MB/h per cluster, and correctness bugs are introduced.
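
The race described above can be reproduced in a few lines. This is a minimal single-threaded sketch, not the DuckLake code: `SnapshotTable`, `RacyAllocation`, `Sequence`, and `SequenceAllocation` are hypothetical names, the table is modeled as a bare set of primary keys, and the interleaving of the two writers is written out by hand so the collision is deterministic.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <set>

// Hypothetical stand-in for the snapshot table: only its primary keys.
struct SnapshotTable {
    std::set<int64_t> ids;

    // Round trip 1: the equivalent of SELECT max(id).
    int64_t MaxId() const { return ids.empty() ? 0 : *ids.rbegin(); }

    // Round trip 2: the equivalent of INSERT; false means a PK violation.
    bool Insert(int64_t id) { return ids.insert(id).second; }
};

// Two writers interleave: both read max(id) before either one inserts.
// Returns the number of primary-key violations observed.
int RacyAllocation(SnapshotTable &table) {
    int64_t a = table.MaxId() + 1; // writer A reads the max
    int64_t b = table.MaxId() + 1; // writer B reads the same max
    int violations = 0;
    if (!table.Insert(a)) violations++;
    if (!table.Insert(b)) violations++; // same id as A, so this collides
    return violations;
}

// Sequence-style allocation: every caller receives a distinct value in one
// atomic step, which is the property nextval() provides on a sequence.
struct Sequence {
    std::atomic<int64_t> last;

    explicit Sequence(int64_t start) : last(start) { assert(start >= 0); }

    int64_t NextVal() { return last.fetch_add(1) + 1; }
};

// Same two writers, but ids come from the sequence instead of max(id) + 1.
int SequenceAllocation(SnapshotTable &table, Sequence &seq) {
    int64_t a = seq.NextVal();
    int64_t b = seq.NextVal();
    int violations = 0;
    if (!table.Insert(a)) violations++;
    if (!table.Insert(b)) violations++;
    return violations;
}
```

Seeding `Sequence` from `table.MaxId()` mirrors bootstrapping from the current max: the first value handed out is one past the highest existing id.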

There are multiple silent-correctness failure modes on the postgres metadata backend under concurrent writers. PK-violation log spam (>1 MB/h per cluster) is the visible symptom; the underlying defects are worse:

  • Cross-action conflicts not detected. First-attempt commits skip CheckForConflicts and rely on ducklake_snapshot_pkey collision as the sole conflict signal. INSERT × ALTER, INSERT × DROP, ALTER × ALTER, COMPACT × ALTER, ALTER-view × DROP-view all silently both-commit. Catalog ends up referencing dropped tables, columns added to non-existent tables, etc.
  • No active-uniqueness invariants. Two concurrent CREATE TABLE (same name), two concurrent ADD COLUMN (same name), or two concurrent DELETEs on the same data file all both-commit. ducklake_table / ducklake_column / ducklake_delete_file end up with duplicate active rows; subsequent resolution is nondeterministic.
  • Stats overwritten silently. Concurrent UPDATE on ducklake_table_stats is last-writer-wins with no version tracking. Planner reads stale cardinalities.
  • Schema cache mis-keyed. GetSchemaForSnapshot was keyed on schema_version, but two snapshots can share schema_version and see different tables (filtered by begin_snapshot). AT(VERSION => N) can return wrong column sets.
  • Default isolation hides committers. postgres_scanner attaches at REPEATABLE_READ. Once a DuckLake tx issues any metadata read, its PG snapshot pins for the rest of the tx, so CheckForConflicts sees empty changes_made even when conflicts exist.
  • PK-collision retry storms. Every snapshot-id collision is a hard PG error producing the observed log noise plus retry backoff under load.
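
To make the schema-cache item above concrete, here is a small sketch of how a cache keyed on schema_version can return a stale table list. It assumes, as the description states, that two snapshots can carry the same schema_version while a table's begin_snapshot falls between them. All types and helpers (`TableRow`, `Snapshot`, `SchemaCache`, `GetByVersion`, `GetBySnapshot`) are hypothetical and far simpler than the real GetSchemaForSnapshot.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical catalog row: a table becomes visible at begin_snapshot.
struct TableRow {
    std::string name;
    int64_t begin_snapshot;
};

struct Snapshot {
    int64_t snapshot_id;
    int64_t schema_version;
};

using TableNames = std::vector<std::string>;

// Uncached lookup: the tables visible at a given snapshot.
TableNames LoadTables(const std::vector<TableRow> &rows, const Snapshot &snap) {
    assert(snap.snapshot_id >= 0);
    TableNames result;
    for (const auto &row : rows) {
        if (row.begin_snapshot <= snap.snapshot_id) {
            result.push_back(row.name);
        }
    }
    return result;
}

// A cache keyed by an integer; the caller decides which field is the key.
struct SchemaCache {
    std::map<int64_t, TableNames> entries;

    const TableNames &Get(int64_t key, const std::vector<TableRow> &rows, const Snapshot &snap) {
        auto it = entries.find(key);
        if (it == entries.end()) {
            it = entries.emplace(key, LoadTables(rows, snap)).first;
        }
        return it->second;
    }
};

// Keyed on schema_version: snapshots sharing a version share one entry.
TableNames GetByVersion(SchemaCache &cache, const std::vector<TableRow> &rows, const Snapshot &snap) {
    return cache.Get(snap.schema_version, rows, snap);
}

// Keyed on snapshot_id: every snapshot gets its own entry.
TableNames GetBySnapshot(SchemaCache &cache, const std::vector<TableRow> &rows, const Snapshot &snap) {
    return cache.Get(snap.snapshot_id, rows, snap);
}
```

With rows `{"t1", 1}` and `{"t2", 2}` and two snapshots that both report schema_version 0, `GetByVersion` serves the second snapshot the entry cached for the first and misses `t2`, while `GetBySnapshot` returns both tables.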

Change

Allocate ids with postgres sequences (nextval, cache 1). This is race-free, so there are no PK violations to log and no unnecessary log bloat. Sequences are bootstrapped on attach and setval()'d from the current max, so existing catalogs migrate without issue. Decoupling allocation from commit order requires:

  • pg_advisory_xact_lock (30 s timeout) around snapshot allocation + commit. SetCommittedSnapshotId becomes monotonic-max.
  • Partial unique indexes (WHERE end_snapshot IS NULL) on active rows of ducklake_schema(schema_name), ducklake_table(schema_id, table_name), ducklake_view(schema_id, view_name), ducklake_delete_file(data_file_id). These replace PK-violation as the conflict signal.
  • CheckForConflicts runs after the lock on every attempt. Authoritative under the commit critical section. can_retry gated so logical conflicts don't retry; transient lock errors do.
  • Stats CAS via new stats_version column on ducklake_table_stats plus post-commit SELECT to catch silent zero-row UPDATEs.
  • Schema cache keys on snapshot_id instead of schema_version.
  • isolation_level 'read committed' for PG metadata (overridable). Applies for both metadata_type == "postgres" and "postgres_scanner" — the original change missed the latter, which is what DBPathAndType::ExtractExtensionPrefix actually returns for the postgres: prefix.
  • Conflict-check matrix completion. Five ConflictCheck calls added covering altered_tables ↔ tables_merge_adjacent, altered_tables ↔ tables_rewrite_delete, altered_views ↔ dropped_views (both directions). Pre-existing matrix omissions, surfaced while building the test suite.
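
The ordering of lock, check, and commit in the list above can be sketched as follows. This is an illustrative model only: a `std::mutex` stands in for pg_advisory_xact_lock, the conflict rule is reduced to "another commit altered the same table after my start snapshot", and the snapshot id is derived under the lock rather than drawn from a sequence. `Catalog`, `Transaction`, `CommitResult`, `HasConflict`, `Commit`, and `CanRetry` are hypothetical names, not the PR's API.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <mutex>
#include <set>
#include <vector>

enum class CommitResult { COMMITTED, LOGICAL_CONFLICT, TRANSIENT_LOCK_ERROR };

// Tables altered by one committed snapshot.
struct CommittedChange {
    int64_t snapshot_id;
    std::set<int64_t> altered_tables;
};

struct Catalog {
    std::mutex commit_lock; // stands in for the advisory lock
    int64_t committed_snapshot = 0;
    std::vector<CommittedChange> changes;
};

struct Transaction {
    int64_t start_snapshot;
    std::set<int64_t> altered_tables;
};

// True if any snapshot committed after the transaction started
// altered a table that this transaction also alters.
bool HasConflict(const Catalog &catalog, const Transaction &tx) {
    for (const auto &change : catalog.changes) {
        if (change.snapshot_id <= tx.start_snapshot) {
            continue;
        }
        for (int64_t table_id : tx.altered_tables) {
            if (change.altered_tables.count(table_id) > 0) {
                return true;
            }
        }
    }
    return false;
}

// The check runs inside the critical section, so no other committer
// can land changes between the check and the commit.
CommitResult Commit(Catalog &catalog, const Transaction &tx) {
    std::lock_guard<std::mutex> guard(catalog.commit_lock);
    assert(tx.start_snapshot <= catalog.committed_snapshot);
    if (HasConflict(catalog, tx)) {
        return CommitResult::LOGICAL_CONFLICT;
    }
    int64_t id = catalog.committed_snapshot + 1;
    catalog.changes.push_back({id, tx.altered_tables});
    // Monotonic-max: the committed id never moves backwards.
    catalog.committed_snapshot = std::max(catalog.committed_snapshot, id);
    return CommitResult::COMMITTED;
}

// Retry gating: logical conflicts are final, transient lock errors are not.
bool CanRetry(CommitResult result) {
    return result == CommitResult::TRANSIENT_LOCK_ERROR;
}
```

In this model, two transactions that start at snapshot 0 and both alter table 7 cannot both commit: the second sees the first's change under the lock and gets `LOGICAL_CONFLICT`, which `CanRetry` rejects. The sketch never produces `TRANSIENT_LOCK_ERROR` itself; the value exists only to show the gating.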

Tests

test/sql/concurrent/concurrent_pg_conflict_detection.test (new); 17 deterministic scenarios + iso-level regression guard + smoke. Each scenario uses named connections + explicit BEGIN/COMMIT and asserts on the regex-matched error from the labelled ConflictCheck line, so passing means the path actually fired.

| Mechanism | Scenarios |
|---|---|
| Partial unique index | A (CREATE × CREATE), D (DELETE × DELETE same file) |
| CheckForConflicts post-lock | B (INSERT × ALTER), C (INSERT × DROP), F (ALTER × ALTER), G (DROP × ALTER), H (DROP × DROP), I (ALTER × DELETE), J (DROP × DELETE), L (ALTER view × ALTER view), M (DROP view × DROP view), N (COMPACT × DELETE), O (DELETE × COMPACT), P (COMPACT × COMPACT), Q (COMPACT × ALTER), R (ALTER × COMPACT), S (ALTER view × DROP view), T (DROP view × ALTER view) |
| Negative control | K (INSERT × INSERT: both commit, row count = 2) |
| Iso-level regression guard | reads current_setting('transaction_isolation') from internal metadata catalog, asserts read committed (catches reintroduction of the postgres_scanner branch typo class) |
| Smoke | 8-way concurrentloop CREATE same name converges to one active row |

Compact × INSERT remains by-design no-conflict, asserted by the existing test/sql/compaction/compaction_delete_conflict.test:119-138.

[concurrent] suite: 470 assertions / 8 files, all green. [compaction] suite green on modified paths (compaction_delete_conflict.test, 50 assertions).

Warning

CI does have failures, but they're upstream in duckdb:main:

  1. DataChunk::Verify - size mismatch: vector 0 (BIGINT) has size 0 but chunk has size 1
  2. SIGSEGV in test/sql/alter/struct_evolution_list_alter.test. This test has flipped back and forth between skipped and unskipped recently.

…t-order serialization. schemas keyed on snapshot_id, not schema_version. partial unique indexes to replace retry signal. optimistic concurrency stats. move CheckForConflicts post-lock. add conflict detection tests. handle all combinations of conflict, add test