feat(disttae,tae/logtail): per-account catalog memory isolation for CN nodes (#23777)#24262
feat(disttae,tae/logtail): per-account catalog memory isolation for CN nodes (#23777)#24262aptend wants to merge 11 commits intomatrixorigin:mainfrom
Conversation
- show_account.go: Change INNER JOIN to LEFT JOIN from mo_account to db_tbl_counts so freshly-created but not-yet-activated accounts still appear in 'show accounts'. TN-side lazy catalog publish filtering strips mo_tables/mo_database entries for non-activated accounts, so the INNER JOIN was hiding them. COALESCE handles NULL counts. - snapshot.go: Add activateAccountCatalogIfNeeded() before the first internal SQL in restore flows (doRestoreSnapshot, restoreAccountUsingClusterSnapshotToNew) so background SQL sessions that bypass AuthenticateUser/ActivateTenantCatalog can see catalog entries for the target account. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When restoring a dropped account (via cluster snapshot or account-level snapshot), the source account may have never been activated on this CN (e.g., no tenant session was created before the drop). Without activation, the source account's catalog entries are absent from both the catalog cache and the mo_tables partition state (filtered by TN publish), causing table resolution for mo_foreign_keys to fail with 'table does not exist'. Fix: activate the source (from) account's catalog before FK table resolution when from != to, in both the cluster restore path (restoreAccountUsingClusterSnapshotToNew) and the account-level restore path (doRestoreSnapshot). BVT: snapshot/cluster 3772/3772 (100%), snapshot/ 17597/17598 (99.99%, the 1 remaining failure is a pre-existing stale expected-result in snapshotRead.sql unrelated to this change) Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Regenerated proto-derived files after rebasing onto origin/main. Also includes gofmt formatting fixes from make pb. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Previously visitObjMeta used the global HasDropCommitted() flag to decide whether to skip an appendable object's in-memory data scan. This races with concurrent compaction: if the drop commits AFTER the scan-end of the current logtail window, the aobj was still alive at scan-end and its rows must be included in the response, but the global flag causes us to skip the scan. The newly created merged nobj has CreateNode > scan-end, so both objects fall outside the window and the rows are silently lost. This was the root cause of flaky failures in the high-concurrent CREATE/ DROP ACCOUNT tests: mo_columns aobj drops committed a few ms after the activation-pull scan-end, causing CN to receive an empty-column TableDef for information_schema.character_sets, which later surfaced as 'no such table information_schema.character_sets' during drop-account FK cleanup. CI evidence: mo-auto-test run 24869137344 job 72813171794 step 7 shows logtail.pull.skip.just.dropped.aobj firing on mo_columns ~1.5s before the client-visible 'no such table' error. Fix: skip the data scan only if DeleteNode is committed AND its End <= scan-end, so the decision is ts-relative to the scan window, matching the semantics already used by ForeachMVCCNodeInRange for metadata. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
When TN's HandleSyncLogTailReq hits a tail larger than Size90M it calls FlushTable and retries the same (from, end) range. The flush upgrades the appendable object's mnode to a pnode and writes a DeleteNode at flushTS > end. On retry visitObjMeta still emits the aobj meta (its CreateNode falls in range, its DeleteNode does not), but baseObject.ScanInMemory returned nil for any persisted node, dropping all of the aobj's row data from the retry tail. For lazy catalog (mo_database / mo_tables / mo_columns) this silently produced 0-row activation replays on CN, which later manifested as 'delete table ... failed' / 'no such table' panics during concurrent account creation. Fix: add persistedNode.ScanDataInRange, modeled after CollectObjectTombstoneInRange, that reads per-row data + commit_ts from the flushed aobj's block and filters by [start, end]. Wire it into ScanInMemory for appendable, non-tombstone persisted objects. Non-appendable persisted objects stay meta-only (unchanged). Lock the behavior down with TestCollectInsertAfterFlush: append three transactions into an aobj, flush it, then verify RangeScanInMemoryByObject still returns the same 6 / 9 / 3 / 6 row windows for [0,p1] / [0,p2] / (p1,p2] / (p1,p3] that TestCollectInsert checks before the flush. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ve DIAG- noise - Rename the diagnostic FIND_TABLE prefix used across loadTableFromStorage, replayCatalog* and a few error-injection sites to the more descriptive catalog-load tag, so the logs read as production observability rather than ad-hoc debug markers. - Drop the noisy DIAG-consumeEntry per-account DCA delayed Warn: the path fires on every per-account catch-up entry by design and added no signal beyond what the activation lifecycle logs already cover. - Replace the panic in loadDatabaseFromStorage's row-count guard with a proper InternalError return; the caller already handles errors and the panic was added for diagnosis, not as a real invariant. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
ⓘ You've reached your Qodo monthly free-tier limit. Reviews pause until next month — upgrade your plan to continue now, or link your paid account if you already have one. |
There was a problem hiding this comment.
Pull request overview
Implements tenant-aware (“lazy”) catalog loading so CN nodes only materialize mo_database/mo_tables/mo_columns for sys + activated accounts, reducing per-CN catalog memory and replay work in high-tenant deployments.
Changes:
- Extends logtail protocol to support lazy-catalog subscription metadata and per-account activation request/response.
- Adds TN-side row filtering for the three catalog tables and CN-side per-account activation/replay + serve gating.
- Adds frontend hooks (auth/restore/clone/subscription) and debug/status surfaces for catalog activation/cache/partition introspection.
Reviewed changes
Copilot reviewed 70 out of 70 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| test/distributed/cases/git4data/branch/metadata/branch_metadata_tenant.sql | Activates tenant catalog on CN for cross-account catalog queries in test case. |
| test/distributed/cases/git4data/branch/metadata/branch_metadata_tenant.result | Updates expected output for added activation steps. |
| proto/logtail.proto | Adds lazy-catalog subscribe fields + activation request/response messages. |
| proto/api.proto | Adds presence-safe account summary fields on api.Entry for lazy-catalog filtering. |
| pkg/vm/engine/types.go | Introduces TenantCatalogActivator optional engine interface. |
| pkg/vm/engine/tae/tables/pnode.go | Adds persisted appendable-object range scan to support lazy-catalog tail collection. |
| pkg/vm/engine/tae/tables/base.go | Routes flushed appendable objects to persisted range scan when needed. |
| pkg/vm/engine/tae/logtail/txn_handle_test.go | Tests entry account-summary annotation uses row account, not current account. |
| pkg/vm/engine/tae/logtail/txn_handle.go | Annotates lazy-catalog entries with account summary when single-account. |
| pkg/vm/engine/tae/logtail/service/session_test.go | Adds coverage for activation response send path. |
| pkg/vm/engine/tae/logtail/service/session.go | Adds activation response builder and clarifies filter behavior separation. |
| pkg/vm/engine/tae/logtail/service/server_test.go | Adds tests for activation queueing and early row filtering behavior. |
| pkg/vm/engine/tae/logtail/service/lazy_catalog_session_test.go | Tests lazy-catalog per-session filter/activation state bookkeeping. |
| pkg/vm/engine/tae/logtail/service/lazy_catalog_session.go | Implements per-session lazy-catalog active/activating account tracking + publish rewrite hook. |
| pkg/vm/engine/tae/logtail/service/client_test.go | Adds tests for new logtail client subscribe/activate APIs. |
| pkg/vm/engine/tae/logtail/service/client.go | Adds SubscribeCatalogTable and ActivateAccountForCatalog APIs + request cloning helpers. |
| pkg/vm/engine/tae/logtail/service/catalog_filter_test.go | Adds focused unit tests for TN-side catalog row filtering. |
| pkg/vm/engine/tae/logtail/service/catalog_filter.go | Implements TN-side lazy-catalog row-level filtering for subscribe/activation/publish paths. |
| pkg/vm/engine/tae/logtail/handle.go | Fixes appendable-object window logic to avoid missing rows across compaction timing. |
| pkg/vm/engine/tae/db/test/db_test.go | Adds regression test verifying insert collection still works after aobj flush. |
| pkg/vm/engine/entire_engine_test.go | Tests EntireEngine forwards ActivateTenantCatalog. |
| pkg/vm/engine/entire_engine.go | Forwards ActivateTenantCatalog when wrapped engine supports it. |
| pkg/vm/engine/disttae/txn_table.go | Minor refactor for in-memory block collection condition readability. |
| pkg/vm/engine/disttae/txn_database.go | Adds serve-gating by account, removes panics, improves logging, and avoids returning column-less table defs. |
| pkg/vm/engine/disttae/stats.go | Gates stats refresh/update on per-account catalog readiness. |
| pkg/vm/engine/disttae/mo_table_stats.go | Gates cache-based stats reads on per-account readiness and reuses cached handles. |
| pkg/vm/engine/disttae/logtailreplay/partition_state_test.go | Adds regression tests for duplicate DELETE (push vs activation) GC behavior. |
| pkg/vm/engine/disttae/logtailreplay/partition_state.go | Adjusts dead-object protection to still allow GC on repeated DELETEs. |
| pkg/vm/engine/disttae/logtailreplay/partition.go | Exposes checkpoint-consumed state accessor. |
| pkg/vm/engine/disttae/logtailreplay/debug.go | Adds partition-state debug summary helpers. |
| pkg/vm/engine/disttae/logtail_consumer_test.go | Adds test ensuring DCA drains nested delayed closures before replay. |
| pkg/vm/engine/disttae/logtail.go | Adds skipCatalogCache flag and per-account delayed catalog-cache apply routing. |
| pkg/vm/engine/disttae/lazy_catalog_cn_test.go | Adds unit tests for CN lazy-catalog state machine and helpers. |
| pkg/vm/engine/disttae/lazy_catalog_cn.go | Implements CN per-account activation state, DCA buffering, and response delivery plumbing. |
| pkg/vm/engine/disttae/engine.go | Adds per-account serve gating for DB/relation resolution and implements ActivateTenantCatalog. |
| pkg/vm/engine/disttae/debug_state_test.go | Tests debug-state rendering for lazy-catalog activation/cache readiness. |
| pkg/vm/engine/disttae/debug_state.go | Adds debug structs and endpoints backing for catalog/cache/activation/partitions introspection. |
| pkg/vm/engine/disttae/cache/debug_test.go | Tests catalog-cache debug account-contents summaries. |
| pkg/vm/engine/disttae/cache/debug.go | Adds catalog-cache debug summary + per-account contents rendering. |
| pkg/vm/engine/disttae/cache/catalog.go | Fixes cpkey-index delete correctness, adds duplicate-column diagnostics, and makes InsertColumns copy-on-write. |
| pkg/util/status/server_test.go | Adds tests for new debug/status HTTP endpoints. |
| pkg/util/status/server.go | Extends status server routing with catalog/cache/activation/partitions debug endpoints and CN selection. |
| pkg/tests/issues/lazy_catalog_reconnect_test.go | Adds embed-cluster reconnect integration tests for ready-account restoration. |
| pkg/sql/plan/make.go | Changes hidden-column lookup to return nil instead of panic. |
| pkg/sql/plan/function/ctl/types.go | Registers ActivateTenantCatalog ctl method. |
| pkg/sql/plan/function/ctl/cmd_activate_tenant_catalog.go | Implements sys-only CN ctl command to activate tenant catalog on current CN. |
| pkg/sql/colexec/aggexec/sumavg2.go | Fixes YEAR type specialization to use types.MoYear. |
| pkg/frontend/snapshot_restore_with_ts.go | Treats catalog-resolution errors as “no FK deps” under lazy-catalog restore edge cases. |
| pkg/frontend/snapshot.go | Activates relevant tenant catalogs for restore flows and softens FK dependency resolution failures. |
| pkg/frontend/show_account.go | Makes SHOW ACCOUNTS resilient to missing counts via LEFT JOIN + activates target account catalog for special-table stats. |
| pkg/frontend/session.go | Activates tenant catalog during authentication before tenant-context SQL. |
| pkg/frontend/predefined.go | Adjusts predefined schema (level varchar(10)). |
| pkg/frontend/output.go | Fixes YEAR extraction/slicing to use types.MoYear backing storage. |
| pkg/frontend/compiler_context.go | Activates publisher/target tenant catalogs when resolving subscription metadata or snapshots. |
| pkg/frontend/clone.go | Activates target account catalog for clone flows that run internal SQL under tenant context. |
| pkg/frontend/back_exec.go | Guards against nil parent context when creating back sessions. |
| pkg/frontend/authenticate.go | Activates target account catalog before ALTER/DROP ACCOUNT internal SQL runs under tenant context. |
| pkg/container/types/types.go | Extends Ints type set to include MoYear. |
| pkg/cnservice/distributed_tae.go | Registers disttae engine with status server for new debug endpoints. |
| pkg/catalog/types_test.go | Adds unit tests for new account-scoped catalog query builders. |
| pkg/catalog/types.go | Adds helpers to build account-scoped batch queries for the three catalog tables. |
| pkg/catalog/lazy_catalog.go | Adds lazy-catalog helpers for identifying catalog tables and extracting account IDs from entries/batches. |
| etc/docker-multi-cn-local-disk/start.sh | Exports repo-root for docker multi-CN local-disk harness. |
| etc/docker-multi-cn-local-disk/docker-compose.yml | Mounts test/ into containers to support BVT test data paths. |
| docs/design/lazy_catalog_load_for_cn.plan.md | Adds main design doc for lazy catalog load on CN. |
| docs/design/lazy_catalog_load_for_cn.flow.md | Adds flow doc describing startup/activation/reconnect semantics and invariants. |
| } | ||
| } | ||
| panic("Unable to find target column from predefined table columns") | ||
| return nil |
There was a problem hiding this comment.
GetColDefFromTable now returns nil when the column isn't found, but there are existing callers that use the returned pointer without a nil check (e.g. building composite PK/cluster-by defs). This change can turn a previously-fast-fail panic into a later nil-deref or invalid TableDef. Either keep the panic, or change all call sites to handle nil and propagate a structured error.
…eCBs golangci-lint flagged two issues introduced by the lazy-catalog branch: - pkg/vm/engine/disttae/logtail_consumer.go: forcedSubscribeTable is unused (its single caller was replaced by forcedSubscribeCatalogTable during the lazy-catalog refactor); remove it. - pkg/vm/engine/tae/logtail/service/server.go: pre-allocate the allCloseCBs slice with the known table count to satisfy prealloc. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Add unit tests for the server param-parsing helpers in pkg/util/status (optional/required uint, optional timestamp, intParam, selectCNInstance) plus the full handleActivateTenantCatalog branch matrix. These exercise the previously-uncovered error paths that drag down PR coverage. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
aunjgr
left a comment
There was a problem hiding this comment.
Review: Per-Account Catalog Memory Isolation (#23777)
Well-designed feature. The architecture is clean: TN filters rows by account before sending, CN tracks per-account readiness via a state machine, and the activation protocol ensures exactly-once catalog materialization per tenant. All CI green.
Architecture Assessment
Correctness of the core protocol:
- TN-side filtering (
catalog_filter.go) correctly handles insert/update (byaccount_idcolumn) and delete (by decoding cpkey[0]) paths, with proper all-pass zero-copy fast path whenselectedRows == bat.RowCount(). - CN-side state machine (
lazy_catalog_cn.go) has clean state transitions:inactive → catching_up → ready_draining → ready. ThebeginAccountReadyTransition+finishAccountReadysplit correctly ensures DCA closures are flushed undercatalogCacheMubefore publishing readyTS. - Inflight dedup via
sync.Map.LoadOrStoreprevents duplicate activations; waiters block oninf.donechannel. - Reconnect correctly collects
wantedAccountsbefore reset, passes them asinitial_active_accountson re-subscribe, and replays all at once at one snapshot TS.
TN sender serialization:
- Phase-1 pull happens in the bounded
activationReqChanworker pool. - Phase-2 completion and response send happen in the single
logtailSendergoroutine, preventing the activation response from racing with steady-state push. The account is only promoted to active AFTER the response enters the session FIFO — correct ordering guarantee.
Panic removal:
loadDatabaseFromStorageandloadTableFromStoragenow return errors instead of panicking — good defensive change for multi-tenant environments where dropped accounts may have compacted catalog.
Catalog cache race fix in getTableItem:
- The nil-Defs fallthrough to storage handles the InsertTable/InsertColumns split-apply window. This is a latent race fix independent of lazy loading — nice catch.
Questions / Minor Items (non-blocking)
-
delayedCacheApplyphase → atomic + mutex double-check: ThedcaTryDelaydoesphase.Load()outside the lock as a fast path, then re-checks inside. This is correct but thedcaConfirmAndApplyloop re-acquires the lock on each iteration. With many concurrent push entries during draining, this could cause brief contention. Not a real concern for 3 catalog tables with low push frequency. -
lazyCatalogSubscribeFilterMPis a package-level mpool: It's used for row copying during subscribe filtering on the TN side. Since filtering runs in the logtailSender (single goroutine for subscribe) and the activation worker pool (bounded), concurrent allocations from this pool should be fine as mpool is thread-safe. Just noting it's shared. -
Activation retry backoff (500ms → 4s, max 4 retries): Total worst-case latency is ~7.5s. For auth-path activation this means a login could block up to 7.5s during rapid reconnect storms. Acceptable given reconnects are rare and this is better than failing the login outright.
-
buildAccountScopedCatalogBatchQuerystring manipulation: Thestrings.Index(query, " order by ")to split and re-append ORDER BY is fragile if the base query ever uses uppercase or has subqueries with ORDER BY. Currently safe because the base queries are hard-coded constants. A comment noting this assumption would help future readers. -
Design docs committed (
docs/design/lazy_catalog_load_for_cn.{plan,flow}.md): Good to have the protocol documented alongside the code.
Summary
Solid implementation of a significant scalability improvement. The protocol correctly handles all lifecycle events (startup, activation, push, reconnect, restore). State machine transitions are well-guarded, and the test coverage (unit + integration reconnect test) is adequate. CI all green.
What type of PR is this?
Which issue(s) this PR fixes:
issue #23777
What this PR does / why we need it:
This PR introduces per-account (tenant-aware) catalog memory isolation for CN
nodes, addressing #23777. Previously every CN replayed and cached
mo_database/
mo_tables/mo_columnsrows for all tenants, which scales poorly withtenant count. With this change, each CN only materialises the catalog of the
sys account plus the tenants it actually serves.
Highlights
per-account subscribe / activation request / response messages dedicated to
the three system catalog tables (
mo_database,mo_tables,mo_columns).Normal-table subscribe / unsubscribe / publish flows are untouched.
and filters subscribe / activation / publish payloads by
account_id(ordecoded
cpkeyfor deletes) before sending. Row-level filtering uses aselection-list +
Batch.Union(sels)and lazily allocates so the all-passcase stays zero-copy.
(seq, ready, readyTS, wantedAccounts, accountDCA)and replay each newly activated tenantat a single
replayTSsnapshot. A delayed-cache-apply (DCA) buffer peraccount ensures live push-stream entries don't escape catalog cache before
replay rebuilds the baseline.
txnDatabase/stats consumers now check bothCatalogCache.CanServe(ts)andpClient.CanServeAccount(account, ts).Bare/unit-test engines that never enable lazy catalog keep their pre-existing
ungated behavior.
catalog tables with
wantedAccounts ∪ {0}and replays them at one snapshotTS, so previously-active tenants come back ready without per-account
re-activation. Newly cold tenants stay not-ready until first access.
immediately after
SetTenantID.RESTORE CLUSTERand single-account restorealso activate the impacted accounts so restored databases are immediately
visible.
GetSubscriptionMetaactivates the publisher's account sosubscription paths see publisher metadata even on cold CNs.
/debug/status/catalog,/debug/status/catalog-cache,/debug/status/catalog-activation,/debug/status/partitionsexposeper-account readiness, catalog-cache contents, recent activation lifecycle
events, and partition state for diagnosis.
Validation
go test -tags matrixone_test ./pkg/catalog ./pkg/vm/engine/disttae/... ./pkg/vm/engine/tae/logtail/...all green.(
pkg/tests/issues/lazy_catalog_reconnect_test.go) exercises realdisconnect / reconnect / per-account activation isolation.
docs/design/lazy_catalog_load_for_cn.{plan,flow}.mddescribes the design) validated independent activation, cross-CN cold
activation, CN-local restart rebuild, TN reconnect with divergent ready sets,
and checkpoint/in-memory mixed-state recovery.
Notes
not modified.
EntireEnginenow forwardsengine.TenantCatalogActivatorso frontend cantrigger activation through the wrapper.
consistent
catalog-loadlog prefix; transient debugDIAG-Warn and apanicinloadDatabaseFromStoragewere removed in favor of anInternalErrorreturn.Special notes for your reviewer:
pkg/vm/engine/tae/logtail/service/catalog_filter.gowith focused tests incatalog_filter_test.go.pkg/vm/engine/disttae/lazy_catalog_cn.go/logtail_consumer.go(functions taggedActivation/replayCatalogTableCache).proto/api.protoandproto/logtail.proto; theregenerated pb files are included.