Search Replica Allocation and Recovery #17457

vinaykpud · 2025-02-25T21:37:39Z

Description

Modified the search replica allocation based on node attribute
In this PR we restrict Search replica to assign to only nodes with searchonly:true attribute
Also in this PR, we have made changes to treat search and regular replicas differently so unable to allocate one do not block the other: [RW Separation] Treat Regular and Search Replicas Separately to Prevent Allocation Blocking #17421
I also fixed the recovery of search replica when there is a node left scenario: [RW Separation] Search replica recovery flow breaks when search shard allocated to new node after node drop #17334

Related Issues

Resolves #17422
Resolves #17421
Resolves #17334
Related to #15306

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

github-actions · 2025-02-25T22:37:03Z

❌ Gradle check result for f9c54c5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

github-actions · 2025-02-26T06:09:53Z

❕ Gradle check result for 2d5b977: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

codecov · 2025-02-26T06:10:18Z

Codecov Report

Attention: Patch coverage is 88.88889% with 4 lines in your changes missing coverage. Please review.

Project coverage is 72.40%. Comparing base (0ffed5e) to head (6e3be41).
Report is 22 commits behind head on main.

Files with missing lines	Patch %	Lines
...java/org/opensearch/index/shard/StoreRecovery.java	84.00%	1 Missing and 3 partials ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main   #17457      +/-   ##
============================================
- Coverage     72.42%   72.40%   -0.02%     
- Complexity    65611    65660      +49     
============================================
  Files          5304     5306       +2     
  Lines        304743   304605     -138     
  Branches      44189    44169      -20     
============================================
- Hits         220701   220547     -154     
- Misses        65888    65997     +109     
+ Partials      18154    18061      -93

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

github-actions · 2025-03-06T00:37:53Z

❌ Gradle check result for b1a2c8c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-03-06T01:37:24Z

❌ Gradle check result for b982b40: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Bukhtawar · 2025-03-06T06:35:29Z

I am trying to understand what other options we have if you are not aligned on allocation filter rule approach, One other option we have is to have node role to deal with search replicas. But as this feature was already added as part of the #15445 we can discus this separately. We can discus as part of this thread: #17422 (comment)

The challenge is anytime there is an update the node ids have to be added explicitly. This espl gets tricky if there are concurrent modifications lets say there was a node removal and a new node addition concurrently. In such a case if the addition doesn't read the deleted state and overrides the node list with a new node, we might run into issues.

I think the filter logic should be used in cases where we actually need to add an exception(exclusion) or override the allocation logic on top of a generic node roles.

github-actions · 2025-03-06T20:48:37Z

❌ Gradle check result for 69f13dd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

github-actions · 2025-03-06T22:26:33Z

✅ Gradle check result for 6e3be41: SUCCESS

vinaykpud · 2025-03-06T23:58:44Z

s have to be added explicitly. This espl gets tricky if there are concurrent modifications lets say there was a node removal and a new node addition concurrently. In such a case if the addition doesn't read the deleted state and overrides the node list with a new node, we might run into issues.

Thanks @Bukhtawar

From your response, it seems the concern is about keeping the filter updated while handling concurrent node additions and removals.

However, in this implementation, users can define a set of nodes—let’s say a "Searcher fleet"—by assigning a custom attribute in the YAML configuration. This means that as long as there are nodes matching this attribute, search replicas will be assigned accordingly. Essentially, we are designating nodes dynamically as part of the "Searcher fleet."

For example, if I have 5 nodes and configure 3 of them with a custom attribute:

node.attr.rackid: "rack2"

I can then make them part of the "Searcher fleet" using the following API:

curl -X PUT "http://localhost:9200/_cluster/settings" \
-H "Content-Type: application/json" \
-d '{
  "persistent": {
    "cluster.routing.allocation.search.replica.dedicated.include.rackid": "rack2"
  }
}'

With this setup, any node with rackid: rack2 will be considered a search-dedicated node. This means that regardless of how many nodes are added or removed from the cluster, as long as they have the matching attribute, they will be automatically recognized as part of the "Reader fleet."

This approach provides flexibility for users to use any attribute-based selection for search nodes.

That said, I’d like to understand if there are any additional concerns, such as potential performance implications, with this approach.

I think the filter logic should be used in cases where we actually need to add an exception(exclusion) or override the allocation logic on top of a generic node roles.

Do you mean we are incorrectly utilizing the inclusion filter here, or is this approach misaligned with its intended purpose? Do you see any limitations or unintended behavior in using it this way?

Also please find my comment on the approach here : #17422 (comment)

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

github-actions · 2025-03-07T22:44:01Z

❌ Gradle check result for 858f984: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

github-actions · 2025-03-08T02:01:22Z

❌ Gradle check result for 8f94e31: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

github-actions · 2025-03-08T04:23:00Z

❌ Gradle check result for 66085c5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-03-10T17:38:03Z

❌ Gradle check result for 66085c5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

github-actions · 2025-03-10T20:02:44Z

❌ Gradle check result for 5836b20: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

vinaykpud added 2 commits February 25, 2025 13:19

Restrict Search Replicas to Allocate only to Search dedicated node

7ca48ac

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

fixed the javadoc

4c503c4

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

github-actions bot added bug Something isn't working Search:Performance labels Feb 25, 2025

added changelog

f9c54c5

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

fixed tests

2d5b977

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

opensearch-ci-bot mentioned this pull request Feb 26, 2025

[AUTOCUT] Gradle Check Flaky Test Report for MinimumClusterManagerNodesIT #14289

Open

vinaykpud marked this pull request as ready for review February 26, 2025 07:53

vinaykpud requested review from anasalkouz, andrross, ashking94, Bukhtawar, CEHENKLE, cwperks, dblock, dbwiddis, gbbafna, jainankitk, kotwanikunal, linuxpi, mch2, msfroh, nknize, owaiskazi19, reta, Rishikesh1159 and sachinpkale as code owners February 26, 2025 07:53

vinaykpud changed the title ~~Search Replicas to Allocation~~ Search Replica Allocation and Recovery Mar 6, 2025

vinaykpud force-pushed the rw/search-replica-alloc-filter branch from 7ba8b33 to b1a2c8c Compare March 6, 2025 00:00

vinaykpud force-pushed the rw/search-replica-alloc-filter branch from b982b40 to 69f13dd Compare March 6, 2025 19:48

Fixed nits after self review

6e3be41

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

vinaykpud force-pushed the rw/search-replica-alloc-filter branch from 69f13dd to 6e3be41 Compare March 6, 2025 21:22

vinaykpud marked this pull request as ready for review March 7, 2025 00:00

vinaykpud mentioned this pull request Mar 7, 2025

[RW Separation] Restrict Search Replica Allocation to Search-Dedicated Nodes #17422

Open

Modified the search replica allocation based on node attribute

858f984

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

Merge branch 'main' into rw/search-replica-alloc-filter

8f94e31

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

removed unwanted settings added during conflict resolve

66085c5

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

vinaykpud closed this Mar 10, 2025

vinaykpud reopened this Mar 10, 2025

Merge branch 'main' into rw/search-replica-alloc-filter

5836b20

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>

This was referenced Mar 7, 2025

[AUTOCUT] Gradle Check Flaky Test Report for SecureReactorNetty4HttpServerTransportTests #17486

Open

[AUTOCUT] Gradle Check Flaky Test Report for S3BlobStoreRepositoryTests #14299

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Search Replica Allocation and Recovery #17457

Search Replica Allocation and Recovery #17457

vinaykpud commented Feb 25, 2025 •

edited

Loading

github-actions bot commented Feb 25, 2025

github-actions bot commented Feb 26, 2025

codecov bot commented Feb 26, 2025 •

edited

Loading

github-actions bot commented Mar 6, 2025

github-actions bot commented Mar 6, 2025

Bukhtawar commented Mar 6, 2025

github-actions bot commented Mar 6, 2025

github-actions bot commented Mar 6, 2025

vinaykpud commented Mar 6, 2025 •

edited

Loading

github-actions bot commented Mar 7, 2025

github-actions bot commented Mar 8, 2025

github-actions bot commented Mar 8, 2025

github-actions bot commented Mar 10, 2025

github-actions bot commented Mar 10, 2025

Search Replica Allocation and Recovery #17457

Are you sure you want to change the base?

Search Replica Allocation and Recovery #17457

Conversation

vinaykpud commented Feb 25, 2025 • edited Loading

Description

Related Issues

Check List

github-actions bot commented Feb 25, 2025

github-actions bot commented Feb 26, 2025

codecov bot commented Feb 26, 2025 • edited Loading

Codecov Report

github-actions bot commented Mar 6, 2025

github-actions bot commented Mar 6, 2025

Bukhtawar commented Mar 6, 2025

github-actions bot commented Mar 6, 2025

github-actions bot commented Mar 6, 2025

vinaykpud commented Mar 6, 2025 • edited Loading

github-actions bot commented Mar 7, 2025

github-actions bot commented Mar 8, 2025

github-actions bot commented Mar 8, 2025

github-actions bot commented Mar 10, 2025

github-actions bot commented Mar 10, 2025

vinaykpud commented Feb 25, 2025 •

edited

Loading

codecov bot commented Feb 26, 2025 •

edited

Loading

vinaykpud commented Mar 6, 2025 •

edited

Loading