Skip to content

Conversation

atris
Copy link
Contributor

@atris atris commented Sep 22, 2025

Add RestoreSnapshotRequest:
alias_write_index_policy: preserve (default), strip_write_index, custom_suffix
alias_suffix for custom_suffix
Apply policy in RestoreService during alias restore (post-rename):
strip_write_index: force is_write_index=false on follower aliases
custom_suffix: append suffix and set is_write_index=false

This prevents multi-write alias conflicts on followers, unlocking bidirectional CCR with write aliases.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

 Add RestoreSnapshotRequest:
   alias_write_index_policy: preserve (default), strip_write_index, custom_suffix
   alias_suffix for custom_suffix
 Apply policy in RestoreService during alias restore (post-rename):
   strip_write_index: force is_write_index=false on follower aliases
   custom_suffix: append suffix and set is_write_index=false

This prevents multi-write alias conflicts on followers, unlocking bidirectional CCR with write aliases.

Signed-off-by: Atri Sharma <[email protected]>
@atris atris requested a review from a team as a code owner September 22, 2025 18:58
Signed-off-by: Atri Sharma <[email protected]>
Copy link
Contributor

✅ Gradle check result for ac856a1: SUCCESS

Copy link

codecov bot commented Sep 22, 2025

Codecov Report

❌ Patch coverage is 93.02326% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 72.96%. Comparing base (7abef8c) to head (e5b6100).
⚠️ Report is 9 commits behind head on main.

Files with missing lines Patch % Lines
...ster/snapshots/restore/RestoreSnapshotRequest.java 84.21% 0 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #19368      +/-   ##
============================================
+ Coverage     72.93%   72.96%   +0.03%     
- Complexity    69947    69952       +5     
============================================
  Files          5676     5676              
  Lines        321121   321158      +37     
  Branches      46427    46432       +5     
============================================
+ Hits         234195   234323     +128     
+ Misses        68032    67904     -128     
- Partials      18894    18931      +37     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andrross
Copy link
Member

@ankitkala Can you take a look and see if this matches what you had in mind?

@Mishail
Copy link
Contributor

Mishail commented Sep 24, 2025

I was thinking about trying to fix this on the CCR plugin side.
Not an expert here, but we already have renameAliasPattern/renameAliasReplacement in RestoreSnapshotRequest (see #16292), so we could introduce a plugin setting (for the case of bidirectional CCR) to use those for the initial snapshot restore. And then use the same setting to decide if we need to rename aliases (or even strip writeIndex from them) during subsequent updates.

@atris
Copy link
Contributor Author

atris commented Sep 24, 2025

I was thinking about trying to fix this on the CCR plugin side. Not an expert here, but we already have renameAliasPattern/renameAliasReplacement in RestoreSnapshotRequest (see #16292), so we could introduce a plugin setting (for the case of bidirectional CCR) to use those for the initial snapshot restore. And then use the same setting to decide if we need to rename aliases (or even strip writeIndex from them) during subsequent updates.

The rename approach doesn’t fix the writeIndex conflict — it only changes alias names. renameAliasPattern can’t modify alias properties. With bidirectional CCR, both clusters end up with writeIndex=true on the same alias. Renaming “products”. -> “products_dc2” still restores writeIndex=true and you hit the conflict.
Our change transforms the alias metadata during restore (writeIndex=true → false on followers), which prevents the conflict and keeps the same alias name — what most apps expect. Rename is a fallback for older clusters, but it solves a different problem.

@mch2
Copy link
Member

mch2 commented Sep 25, 2025

Can you help me understand the use case for custom_suffix on the alias? From my understanding of the issue strip_write_index would solve the dual write problem? When would we want to restore a follower under a modified alias/suffix and strip the write flag?

Also from your test testApplyAliasPolicyWithRenameAndStrip it seems we can achieve the same thing with AliasWriteIndexPolicy.STRIP_WRITE_INDEX and renameAlias?

@atris
Copy link
Contributor Author

atris commented Sep 25, 2025

Thanks for the review. I’ve simplified the change per your feedback:

  1. RestoreSnapshotRequest now only supports alias_write_index_policy: preserve (default) or strip_write_index. I removed custom_suffix.
  2. Any alias renaming is handled via the existing renameAliasPattern/renameAliasReplacement.
  3. RestoreService applies the policy when restoring aliases (after rename), so followers never get is_write_index=true from the leader.
  4. Wire/read/write is gated at 3.3 for compatibility.

This keeps restore focused on one job (toggling the write flag) and uses the existing rename fields when user wants suffixed aliases.

Copy link
Contributor

❌ Gradle check result for e5b6100: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for e5b6100: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@atris atris closed this Sep 26, 2025
@atris atris reopened this Sep 26, 2025
Copy link
Contributor

❌ Gradle check result for e5b6100:

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@atris atris closed this Sep 26, 2025
@atris atris reopened this Sep 26, 2025
Copy link
Contributor

❌ Gradle check result for e5b6100: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Member

@mch2 mch2 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This lgtm, once this is merged and pushed to maven thinking we should also add an IT here in ccr plugin to ensure this bug is fixed?

@ankitkala Wondering if you could pls make a pass here as well

Copy link
Contributor

❌ Gradle check result for e5b6100: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@atris atris closed this Sep 27, 2025
@atris atris reopened this Sep 27, 2025
Copy link
Contributor

✅ Gradle check result for e5b6100: SUCCESS

@atris
Copy link
Contributor Author

atris commented Sep 27, 2025

Thanks @mch2 . @ankitkala Please let me know if this looks ok. I am hoping to get it in 3.3.0

@ankitkala
Copy link
Member

Thanks. Took a brief look at it. overall approach looks good to me.

Are we planning to also expose this as a parameter when starting ccr for an index? Like @mch2 called out, let's add some CCR IT as well.

@atris atris merged commit 28689b7 into opensearch-project:main Sep 28, 2025
146 of 156 checks passed
@atris
Copy link
Contributor Author

atris commented Sep 28, 2025

Thanks. Took a brief look at it. overall approach looks good to me.

Are we planning to also expose this as a parameter when starting ccr for an index? Like @mch2 called out, let's add some CCR IT as well.

Ack. I will add it in a follow up PR.

karenyrx pushed a commit to karenyrx/OpenSearch that referenced this pull request Sep 29, 2025
…-project#19368)

Restore: enable safe bidirectional CCR via alias policy on restore
 Add RestoreSnapshotRequest:
   alias_write_index_policy: preserve (default), strip_write_index, custom_suffix
   alias_suffix for custom_suffix
 Apply policy in RestoreService during alias restore (post-rename):
   strip_write_index: force is_write_index=false on follower aliases
   custom_suffix: append suffix and set is_write_index=false

This prevents multi-write alias conflicts on followers, unlocking bidirectional CCR with write aliases.

Signed-off-by: Atri Sharma <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants