@paulomach paulomach commented Aug 14, 2025

Issue

  • The quorum loss test does not actually provoke a quorum loss.
  • The corrected test exposes other issues in the implementation of the promote action and the rejoin procedure.
  • The charm's automated GR instance recovery breaks when units try to rejoin at the same time.
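
For context on the first point, here is a minimal sketch of the majority rule that Group Replication relies on: a group keeps quorum only while more than half of its members are reachable. The function name `has_quorum` is hypothetical and is not taken from the charm or the test suite.

```python
def has_quorum(group_size: int, reachable: int) -> bool:
    """Return True if `reachable` members form a strict majority of the group.

    Hypothetical helper for illustration only; not part of the charm.
    """
    if group_size <= 0 or not 0 <= reachable <= group_size:
        raise ValueError("expected 0 <= reachable <= group_size and group_size > 0")
    # A strict majority means more than half of the configured members.
    return reachable > group_size // 2


# Three-member group: losing one member keeps quorum, losing two does not.
one_down = has_quorum(group_size=3, reachable=2)
two_down = has_quorum(group_size=3, reachable=1)
```

With three members, the group has to lose two of them before the remaining one is left without a majority.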

Solution

  • Fix the test so that two of the three instances are killed at the same time (instead of one after the other), which causes quorum loss.
  • Fix the call to the shell API command that recovers from quorum loss.
  • Serialize the recovery of the GR instances to avoid a GR hang.
  • Tweak the status setting to better display quorum loss.
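
The first and third points can be illustrated with a small in-process sketch: the kills are issued concurrently, while the rejoins are forced to run one at a time. This is only an analogy under stated assumptions. `kill_unit` and `rejoin_unit` are hypothetical stand-ins, the unit names are made up, and the charm's real serialization mechanism across units is not described here (units are separate processes, so an `asyncio.Lock` would not work there as-is).

```python
import asyncio


async def kill_unit(name: str, killed: list) -> None:
    # Hypothetical stand-in for killing the database process on one unit.
    await asyncio.sleep(0)
    killed.append(name)


async def kill_simultaneously(names: list) -> list:
    # Start all kills together instead of awaiting them one after the other.
    killed = []
    await asyncio.gather(*(kill_unit(n, killed) for n in names))
    return killed


async def rejoin_unit(name: str, lock: asyncio.Lock, log: list) -> None:
    # Hypothetical stand-in for a unit rejoining the group.
    # The lock ensures only one rejoin is in progress at any time.
    async with lock:
        log.append(("start", name))
        await asyncio.sleep(0)  # yield control, as a real rejoin would while waiting
        log.append(("end", name))


async def rejoin_serialized(names: list) -> list:
    # All units attempt to rejoin at once, but the lock serializes them.
    lock = asyncio.Lock()
    log = []
    await asyncio.gather(*(rejoin_unit(n, lock, log) for n in names))
    return log


if __name__ == "__main__":
    units = ["mysql/1", "mysql/2"]  # made-up unit names
    print(asyncio.run(kill_simultaneously(units)))
    print(asyncio.run(rejoin_serialized(units)))
```

In the resulting log, every `start` entry is immediately followed by the `end` entry for the same unit, so the rejoin attempts never overlap.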

@paulomach paulomach added the "bug" (Something isn't working as expected) label Aug 14, 2025
@paulomach paulomach changed the title from "fix: quorum loss recovery" to "DPE-7404 fix: quorum loss recovery" Aug 15, 2025
@paulomach paulomach changed the title from "DPE-7404 fix: quorum loss recovery" to "fix: [DPE-7404] quorum loss recovery and test fixes" Aug 15, 2025
@paulomach paulomach merged commit d6efbab into main Aug 15, 2025
228 of 230 checks passed
@paulomach paulomach deleted the fix/promote-to-primary-call-and-test branch August 15, 2025 15:45
Labels
"bug" (Something isn't working as expected), "Libraries: Out of sync"