Conversation

@aweisberg (Contributor):

This is some initial cleanup removing many references to transient replication from the sstable plumbing. I didn't remove the field from the format metadata yet, although we should do that as part of the next format change.

It also has the few tweaks needed to get single partition reads and writes working.

Known bugs:
Needs fixing now

  • ReplicaPlans.contactForRead doesn’t ensure any full replicas will be read from at ONE (see the sketch after this list)
  • Fix newly introduced witness bug where repair doesn’t correctly terminate on transient nodes and mark their sstables as repaired, preventing nodetool cleanup from cleaning transient data

Could be fixed later

  • Fix bug where partition range read command can send data requests to transient replicas because the Replica doesn’t match what is actually being queried by the read command
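For the first item, here is a minimal sketch of the kind of guard contactForRead could apply at ONE. The types below are simplified stand-ins for illustration only, not the actual ReplicaPlans API:

```java
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for the real Replica type.
record Replica(String endpoint, boolean isFull) {}

final class ContactForReadSketch
{
    /**
     * Pick the contacts for a read at CL=ONE, ensuring at least one full
     * replica is included even when a transient/witness replica sorts first.
     * Assumes the candidate list is non-empty and sorted by proximity.
     */
    static List<Replica> contactForReadAtOne(List<Replica> candidatesSortedByProximity)
    {
        List<Replica> contacts = new ArrayList<>();
        Replica closest = candidatesSortedByProximity.get(0);
        contacts.add(closest);

        // If the closest replica is not full, also add the nearest full replica
        // so the read is guaranteed to see complete data.
        if (!closest.isFull())
        {
            candidatesSortedByProximity.stream()
                                       .filter(Replica::isFull)
                                       .findFirst()
                                       .ifPresent(contacts::add);
        }
        return contacts;
    }
}
```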

@aweisberg requested a review from @aratno on September 15, 2025 at 19:05
@aweisberg force-pushed the cep-46-witnesses-test branch from 65dcc37 to 8faaca0 on September 15, 2025 at 19:05
@aratno (Contributor) left a comment:

There are still a lot of references to transient replication - are you planning to remove those later on?

Could include a comment in NEWS that this breaks any existing users of the experimental transient replication features. Those users shouldn't have any expectation of feature stability, but a note in NEWS would be a kind gesture.

Comment on lines +169 to +174
// TODO (review): This can be removed right? ReadRepair is effectively not done anymore so the setting doesn't matter
// if (keyspace.replicationStrategy.hasTransientReplicas()
// && table.params.readRepair != ReadRepairStrategy.NONE)
// {
// throw ire("read_repair must be set to 'NONE' for transiently replicated keyspaces");
// }
@aratno (Contributor):

I think we should require BLOCKING for tracked keyspaces for now. Users could previously switch to NONE if they preferred write atomicity over read monotonicity and wanted slightly better performance, but with mutation tracking there wouldn't be a semantics tradeoff, and the performance of read reconciliation should be even better than read repair anyway.

The reason I think we should defer support for NONE is that we may decide to do an equivalent of full repair with range reads, and there isn't a strong justification to support both modes yet.
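For illustration, a sketch of the validation being described, adapted from the commented-out transient-replication check quoted above. The names hasTrackedReplication() and the ire helper are assumptions here, not confirmed API names:

```java
// Hypothetical sketch: reject tracked keyspaces whose tables disable read repair,
// mirroring the shape of the old transient-replication check above.
if (keyspace.replicationStrategy.hasTrackedReplication()
    && table.params.readRepair != ReadRepairStrategy.BLOCKING)
{
    throw ire("read_repair must be set to 'BLOCKING' for keyspaces with mutation tracking enabled");
}
```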

@aweisberg (Contributor, Author):

I'm going to punt on this for now. Requiring it would make existing schemas blow up when they may not need to, and it makes the feature harder to enable.

@aratno (Contributor):

Makes sense, I support moving forward with that option essentially ignored for now. We could also emit a client warning if a user has mutation tracking enabled for a keyspace with read repair set to NONE.
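A sketch of what that warning could look like, assuming it is raised during schema validation. ClientWarn is Cassandra's existing mechanism for returning warnings to clients; the hasTrackedReplication() check and the exact wording are assumptions:

```java
// Hypothetical placement: while validating/altering a table in a tracked keyspace.
if (keyspace.replicationStrategy.hasTrackedReplication()
    && table.params.readRepair == ReadRepairStrategy.NONE)
{
    ClientWarn.instance.warn(String.format(
        "Table %s.%s has read_repair = 'NONE'; this setting is ignored for keyspaces " +
        "with mutation tracking enabled, and reads use read reconciliation instead.",
        keyspace.name, table.name));
}
```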

@@ -107,6 +108,7 @@ else if (localComparisonEpoch.isAfter(readCommand.serializedAtEpoch()))

private ClusterMetadata checkTokenOwnership(ClusterMetadata metadata, Message<T> message)
{
boolean acceptsTransient = message.verb() != Verb.READ_REQ;
@aratno (Contributor):

Confirming this will include RANGE_REQ later on

@aweisberg (Contributor, Author):

Nothing reads from transient replicas via this path and verb handler anymore, since Blake reworked summary requests to be their own verb. I'll update this.

@aweisberg (Contributor, Author):

Actually, the summary requests do go through here, so I'll use the summary request verb for the check.
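A minimal sketch of what that check might become; SUMMARY_REQ below is a placeholder name, not the confirmed Verb constant:

```java
// Hypothetical sketch: instead of "everything except READ_REQ", only the read
// reconciliation summary verb is allowed to target transient/witness replicas.
boolean acceptsTransient = message.verb() == Verb.SUMMARY_REQ;
```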

@@ -1430,7 +1393,7 @@ private void serializeHeader(ReadCommand command, DataOutputPlus out, int versio
out.writeByte(
digestFlag(command.isDigestQuery())
| indexFlag(null != command.indexQueryPlan())
| acceptsTransientFlag(command.acceptsTransient())
// | acceptsTransientFlag(false) Deprecated flag, could be reused?
| needsReconciliationFlag(command.rowFilter().needsReconciliation())
@aratno (Contributor):

needsReconciliationFlag was added for non-strict filtering; we should probably rename it for clarity now that read reconciliation is its own concept.

@aweisberg (Contributor, Author):

Is this in scope for this change set?

@aratno (Contributor):

No, just a thought while I was reviewing

@@ -1388,7 +1371,7 @@ public CleanupSummary releaseRepairData(Collection<TimeUUID> sessions)
readLock.lock();
try
{
for (PendingRepairManager prm : Iterables.concat(pendingRepairs.getManagers(), transientRepairs.getManagers()))
for (PendingRepairManager prm : Iterables.concat(pendingRepairs.getManagers()))
@aratno (Contributor):

No need to concat anymore
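Since only one manager collection remains, the loop can likely just iterate it directly, a small sketch of the simplification being suggested:

```java
for (PendingRepairManager prm : pendingRepairs.getManagers())
```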

Comment on lines 425 to 427
// TODO (review): This should be removed right? Nothing depends on it?
// if (stats != null)
// field("IsTransient", stats.isTransient);
@aratno (Contributor):

I support removal 👍

@aweisberg (Contributor, Author):

The problem is that poorly written tools may see the field missing and start erroring out. I'll put it back in for now.
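Putting it back would look like the quoted snippet, uncommented, with a note on why it stays (whether stats still carries isTransient after this cleanup is an assumption):

```java
// Keep emitting the field so external tools that expect it don't start erroring
// out when it disappears; the metadata field itself is slated for removal in a
// future sstable format change.
if (stats != null)
    field("IsTransient", stats.isTransient);
```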

Comment on lines 635 to 637
// TODO (review): Cheap quorums are not a goal for now, would really require speculation for writes
// and would trigger reconciliation work later, potentially quite a lot so I don't think it makes sense.
// So just remove selector?
@aratno (Contributor):

I support removing support for cheap quorums. I could imagine a world where we deprioritize delivery of new writes to witnesses if reconciliation is behind, but even a change like that would require more justification.

@aweisberg (Contributor, Author):

I'm going to leave this in for now to avoid disruption.

Comment on lines 646 to 649
// TODO (review): Races and this being an out of range mutation, it's silently accepting the mutation to the log
// but not applying to the CF. This was already checked, but in a race what happens? Double check ack handling
// and whether the voting group is checked to match safely
for (Range<Token> r : localRanges.onlyFull().ranges())
@aratno (Contributor):

If we're losing the range (coordinator is behind and believes replica owns the range), we may write something to an SSTable that's not read, and will be purged on next cleanup.

If we're gaining the range (coordinator is ahead and believes replica owns the range), TCM should try to catch up before we process the message and get here.
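As a simplified illustration of the ownership check being discussed (stand-in types, not the real Range/Token classes, and ignoring wraparound), the replica only applies the mutation to the table when its token falls in a locally owned full range; otherwise the write is accepted into the log but not applied, which is the race noted above:

```java
import java.util.List;

// Simplified stand-ins for Token and Range; not the real Cassandra classes.
record Token(long value) {}
record Range(long left, long right)
{
    boolean contains(Token t) { return t.value() > left && t.value() <= right; }
}

final class OwnershipCheckSketch
{
    /**
     * Returns true only if the mutation's token is inside one of the locally
     * owned full ranges; otherwise the mutation lands in the log but is not
     * applied to the table.
     */
    static boolean shouldApplyLocally(Token mutationToken, List<Range> localFullRanges)
    {
        for (Range r : localFullRanges)
            if (r.contains(mutationToken))
                return true;
        return false;
    }
}
```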

@aweisberg force-pushed the cep-46-witnesses-test branch from c32e7e0 to 326fbf1 on September 22, 2025 at 20:52
@aweisberg (Contributor, Author):

I have a separate PR for the rename, #4174; I just haven't had a chance to push it through yet. I was also wondering whether we should merge the rename back to 4.0 to ease merge pain in the long run.
