Conversation

@aweisberg (Contributor):

This is some initial cleanup removing many references to transient replication from the sstable plumbing. I didn't remove the field from the format metadata yet, although we should do that as part of the next format change.

It also has the few tweaks needed to get single partition reads and writes working.

Known bugs:
Needs fixing now

  • ReplicaPlans.contactForRead doesn’t ensure any full replicas will be read from at ONE (see the sketch after this list)
  • Fix newly introduced witness bug where repair doesn’t correctly terminate on transient nodes and mark their sstables as repaired, preventing nodetool cleanup from cleaning transient data

Could be fixed later

  • Fix bug where partition range read command can send data requests to transient replicas because the Replica doesn’t match what is actually being queried by the read command
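For the first item, here is a minimal sketch of the kind of guard contactForRead could apply at ONE. The types below are simplified stand-ins for illustration only, not the actual ReplicaPlans API:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the real Replica type.
record Replica(String endpoint, boolean isFull) {}

final class ContactForReadSketch
{
    /**
     * Pick the contacts for a read at CL=ONE, ensuring at least one full
     * replica is included even when a transient/witness replica sorts first.
     * Assumes the candidate list is non-empty and sorted by proximity.
     */
    static List<Replica> contactForReadAtOne(List<Replica> candidatesSortedByProximity)
    {
        List<Replica> contacts = new ArrayList<>();
        Replica closest = candidatesSortedByProximity.get(0);
        contacts.add(closest);

        // If the closest replica is not full, also add the nearest full replica
        // so the read is guaranteed to see complete data.
        if (!closest.isFull())
        {
            candidatesSortedByProximity.stream()
                                       .filter(Replica::isFull)
                                       .findFirst()
                                       .ifPresent(contacts::add);
        }
        return contacts;
    }
}
```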

@aweisberg requested a review from @aratno on September 15, 2025 at 19:05
@aweisberg force-pushed the cep-46-witnesses-test branch from 65dcc37 to 8faaca0 on September 15, 2025 at 19:05
@aratno (Contributor) left a comment:

There are still a lot of references to transient replication - are you planning to remove those later on?

Could include a comment in NEWS that this breaks any existing users of the experimental transient replication features. Those users shouldn't have any expectation of feature stability, but a note in NEWS would be a kind gesture.

Comment on lines +169 to +174
// TODO (review): This can be removed right? ReadRepair is effectively not done anymore so the setting doesn't matter
// if (keyspace.replicationStrategy.hasTransientReplicas()
// && table.params.readRepair != ReadRepairStrategy.NONE)
// {
// throw ire("read_repair must be set to 'NONE' for transiently replicated keyspaces");
// }
@aratno (Contributor):

I think we should require BLOCKING for tracked keyspaces for now. Users could previously switch to NONE if they preferred write atomicity over read monotonicity and wanted slightly better performance, but with mutation tracking there wouldn't be a semantics tradeoff, and the performance of read reconciliation should be even better than read repair anyway.

The reason I think we should defer support for NONE is that we may decide to do an equivalent of full repair with range reads, and there isn't a strong justification to support both modes yet.
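For illustration, a sketch of the validation being described, adapted from the commented-out transient-replication check quoted above. The names hasTrackedReplication() and the ire helper are assumptions here, not confirmed API names:

```java
// Hypothetical sketch: reject tracked keyspaces whose tables disable read repair,
// mirroring the shape of the old transient-replication check above.
if (keyspace.replicationStrategy.hasTrackedReplication()
    && table.params.readRepair != ReadRepairStrategy.BLOCKING)
{
    throw ire("read_repair must be set to 'BLOCKING' for keyspaces with mutation tracking enabled");
}
```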

@aweisberg (Contributor, Author):

I'm going to punt on this for now. Requiring it would make existing schemas blow up when they may not need to, and it makes the feature harder to enable.

@aratno (Contributor):

Makes sense, I support moving forward with that option essentially ignored for now. We could also emit a client warning if a user has mutation tracking enabled for a keyspace with read repair set to NONE.
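A sketch of what that warning could look like, assuming it is raised during schema validation. ClientWarn is Cassandra's existing mechanism for returning warnings to clients; the hasTrackedReplication() check and the exact wording are assumptions:

```java
// Hypothetical placement: while validating/altering a table in a tracked keyspace.
if (keyspace.replicationStrategy.hasTrackedReplication()
    && table.params.readRepair == ReadRepairStrategy.NONE)
{
    ClientWarn.instance.warn(String.format(
        "Table %s.%s has read_repair = 'NONE'; this setting is ignored for keyspaces " +
        "with mutation tracking enabled, and reads use read reconciliation instead.",
        keyspace.name, table.name));
}
```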

@@ -107,6 +108,7 @@ else if (localComparisonEpoch.isAfter(readCommand.serializedAtEpoch()))

private ClusterMetadata checkTokenOwnership(ClusterMetadata metadata, Message<T> message)
{
boolean acceptsTransient = message.verb() != Verb.READ_REQ;
@aratno (Contributor):

Confirming this will include RANGE_REQ later on

@aweisberg (Contributor, Author):

Nothing reads from transient replicas via this path and verb handler anymore, since Blake reworked summary requests to be their own verb. I'll update this.

@aweisberg (Contributor, Author):

Actually, the summary requests do go through here, so I'll use the summary request verb for the check.
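A minimal sketch of what that check might become; SUMMARY_REQ below is a placeholder name, not the confirmed Verb constant:

```java
// Hypothetical sketch: instead of "everything except READ_REQ", only the read
// reconciliation summary verb is allowed to target transient/witness replicas.
boolean acceptsTransient = message.verb() == Verb.SUMMARY_REQ;
```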

@@ -1430,7 +1393,7 @@ private void serializeHeader(ReadCommand command, DataOutputPlus out, int versio
out.writeByte(
digestFlag(command.isDigestQuery())
| indexFlag(null != command.indexQueryPlan())
| acceptsTransientFlag(command.acceptsTransient())
// | acceptsTransientFlag(false) Deprecated flag, could be reused?
| needsReconciliationFlag(command.rowFilter().needsReconciliation())
@aratno (Contributor):

needsReconciliationFlag was added for non-strict filtering; we should probably rename it for clarity now that read reconciliation is its own concept.

@aweisberg (Contributor, Author):

Is this in scope for this change set?

@aratno (Contributor):

No, just a thought while I was reviewing

@@ -1388,7 +1371,7 @@ public CleanupSummary releaseRepairData(Collection<TimeUUID> sessions)
readLock.lock();
try
{
for (PendingRepairManager prm : Iterables.concat(pendingRepairs.getManagers(), transientRepairs.getManagers()))
for (PendingRepairManager prm : Iterables.concat(pendingRepairs.getManagers()))
@aratno (Contributor):

No need to concat anymore
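Since only one manager collection remains, the loop can likely just iterate it directly, a small sketch of the simplification being suggested:

```java
for (PendingRepairManager prm : pendingRepairs.getManagers())
```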

Comment on lines 425 to 427
// TODO (review): This should be removed right? Nothing depends on it?
// if (stats != null)
// field("IsTransient", stats.isTransient);
@aratno (Contributor):

I support removal 👍

@aweisberg (Contributor, Author):

The problem is that poorly written tools may see the field missing and start erroring out. I'll put it back in for now.
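Putting it back would look like the quoted snippet, uncommented, with a note on why it stays (whether stats still carries isTransient after this cleanup is an assumption):

```java
// Keep emitting the field so external tools that expect it don't start erroring
// out when it disappears; the metadata field itself is slated for removal in a
// future sstable format change.
if (stats != null)
    field("IsTransient", stats.isTransient);
```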

Comment on lines 635 to 637
// TODO (review): Cheap quorums are not a goal for now, would really require speculation for writes
// and would trigger reconciliation work later, potentially quite a lot so I don't think it makes sense.
// So just remove selector?
@aratno (Contributor):

I support removing support for cheap quorums. I could imagine a world where we deprioritize delivery of new writes to witnesses if reconciliation is behind, but even a change like that would require more justification.

@aweisberg (Contributor, Author):

I'm going to leave this in for now to avoid disruption.

Comment on lines 646 to 649
// TODO (review): Races and this being an out of range mutation, it's silently accepting the mutation to the log
// but not applying to the CF. This was already checked, but in a race what happens? Double check ack handling
// and whether the voting group is checked to match safely
for (Range<Token> r : localRanges.onlyFull().ranges())
@aratno (Contributor):

If we're losing the range (coordinator is behind and believes replica owns the range), we may write something to an SSTable that's not read, and will be purged on next cleanup.

If we're gaining the range (coordinator is ahead and believes replica owns the range), TCM should try to catch up before we process the message and get here.
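As a simplified illustration of the ownership check being discussed (stand-in types, not the real Range/Token classes, and ignoring wraparound), the replica only applies the mutation to the table when its token falls in a locally owned full range; otherwise the write is accepted into the log but not applied, which is the race noted above:

```java
import java.util.List;

// Simplified stand-ins for Token and Range; not the real Cassandra classes.
record Token(long value) {}
record Range(long left, long right)
{
    boolean contains(Token t) { return t.value() > left && t.value() <= right; }
}

final class OwnershipCheckSketch
{
    /**
     * Returns true only if the mutation's token is inside one of the locally
     * owned full ranges; otherwise the mutation lands in the log but is not
     * applied to the table.
     */
    static boolean shouldApplyLocally(Token mutationToken, List<Range> localFullRanges)
    {
        for (Range r : localFullRanges)
            if (r.contains(mutationToken))
                return true;
        return false;
    }
}
```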

@aweisberg force-pushed the cep-46-witnesses-test branch from c32e7e0 to 326fbf1 on September 22, 2025 at 20:52
@aweisberg (Contributor, Author):

I have a separate PR for the rename, #4174; I just haven't had a chance to push it through yet. I was also wondering whether we should merge the rename back to 4.0 to ease merge pain in the long run.
