Skip to content

fix: slot redirection after takeover corrupted#7263

Open
kostasrim wants to merge 1 commit intomainfrom
reconcile
Open

fix: slot redirection after takeover corrupted#7263
kostasrim wants to merge 1 commit intomainfrom
reconcile

Conversation

@kostasrim
Copy link
Copy Markdown
Contributor

Fixes #7262

Signed-off-by: Kostas Kyrimis <kostas@dragonflydb.io>
Comment thread src/server/replica.cc
// since we are budy with the replication
RETURN_ON_ERR(SendCommandAndReadResponse(StrCat("REPLCONF CLIENT-ID ", id_)));
RETURN_ON_ERR(
SendCommandAndReadResponse(StrCat("REPLCONF CLIENT-ID ", service_.cluster_family().MyID())));
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MyId() falls back to replica id if the cluster_id flag is not set.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Fixes a cluster takeover edge case where slot redirection reconciliation can fail when --cluster_node_id is set by ensuring the replica reports the cluster node ID (instead of the replication run-id) during the Dragonfly replication handshake.

Changes:

  • Send REPLCONF CLIENT-ID using service_.cluster_family().MyID() rather than the replica’s id_ (run-id), aligning replica identity with --cluster_node_id when provided.

Comment thread src/server/replica.cc
Comment on lines +422 to +423
RETURN_ON_ERR(
SendCommandAndReadResponse(StrCat("REPLCONF CLIENT-ID ", service_.cluster_family().MyID())));
Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree. If cloud tests find this issue - we can cover this as well @kostasrim

@augmentcode
Copy link
Copy Markdown

augmentcode Bot commented May 5, 2026

🤖 Augment PR Summary

Summary: Updates the replication handshake to send the node’s cluster ID in REPLCONF CLIENT-ID instead of the server replication run-id.
Why: Keeps replica IDs consistent with cluster topology, preventing corrupted slot redirection after a takeover (fixes #7262).

🤖 Was this summary useful? React with 👍 or 👎

Copy link
Copy Markdown

@augmentcode augmentcode Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review completed. 1 suggestion posted.

Fix All in Augment

Comment augment review to trigger a new review at any time.

Comment thread src/server/replica.cc
Copy link
Copy Markdown
Collaborator

@romange romange left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

slot redirection after takeover failed

3 participants