Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Search Replica Allocation and Recovery #17457

Open
wants to merge 14 commits into
base: main
Choose a base branch
from

Conversation

vinaykpud
Copy link
Contributor

@vinaykpud vinaykpud commented Feb 25, 2025

Description

Related Issues

Resolves #17422
Resolves #17421
Resolves #17334
Related to #15306

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added bug Something isn't working Search:Performance labels Feb 25, 2025
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
Copy link
Contributor

❌ Gradle check result for f9c54c5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
Copy link
Contributor

❕ Gradle check result for 2d5b977: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Copy link

codecov bot commented Feb 26, 2025

Codecov Report

Attention: Patch coverage is 88.88889% with 4 lines in your changes missing coverage. Please review.

Project coverage is 72.40%. Comparing base (0ffed5e) to head (6e3be41).
Report is 22 commits behind head on main.

Files with missing lines Patch % Lines
...java/org/opensearch/index/shard/StoreRecovery.java 84.00% 1 Missing and 3 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17457      +/-   ##
============================================
- Coverage     72.42%   72.40%   -0.02%     
- Complexity    65611    65660      +49     
============================================
  Files          5304     5306       +2     
  Lines        304743   304605     -138     
  Branches      44189    44169      -20     
============================================
- Hits         220701   220547     -154     
- Misses        65888    65997     +109     
+ Partials      18154    18061      -93     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@vinaykpud vinaykpud changed the title Search Replicas to Allocation Search Replica Allocation and Recovery Mar 6, 2025
@vinaykpud vinaykpud force-pushed the rw/search-replica-alloc-filter branch from 7ba8b33 to b1a2c8c Compare March 6, 2025 00:00
Copy link
Contributor

github-actions bot commented Mar 6, 2025

❌ Gradle check result for b1a2c8c: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Mar 6, 2025

❌ Gradle check result for b982b40: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@Bukhtawar
Copy link
Collaborator

I am trying to understand what other options we have if you are not aligned on allocation filter rule approach, One other option we have is to have node role to deal with search replicas. But as this feature was already added as part of the #15445 we can discus this separately. We can discus as part of this thread: #17422 (comment)

The challenge is anytime there is an update the node ids have to be added explicitly. This espl gets tricky if there are concurrent modifications lets say there was a node removal and a new node addition concurrently. In such a case if the addition doesn't read the deleted state and overrides the node list with a new node, we might run into issues.

I think the filter logic should be used in cases where we actually need to add an exception(exclusion) or override the allocation logic on top of a generic node roles.

@vinaykpud vinaykpud force-pushed the rw/search-replica-alloc-filter branch from b982b40 to 69f13dd Compare March 6, 2025 19:48
Copy link
Contributor

github-actions bot commented Mar 6, 2025

❌ Gradle check result for 69f13dd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
@vinaykpud vinaykpud force-pushed the rw/search-replica-alloc-filter branch from 69f13dd to 6e3be41 Compare March 6, 2025 21:22
Copy link
Contributor

github-actions bot commented Mar 6, 2025

✅ Gradle check result for 6e3be41: SUCCESS

@vinaykpud
Copy link
Contributor Author

vinaykpud commented Mar 6, 2025

s have to be added explicitly. This espl gets tricky if there are concurrent modifications lets say there was a node removal and a new node addition concurrently. In such a case if the addition doesn't read the deleted state and overrides the node list with a new node, we might run into issues.

Thanks @Bukhtawar

From your response, it seems the concern is about keeping the filter updated while handling concurrent node additions and removals.

However, in this implementation, users can define a set of nodes—let’s say a "Searcher fleet"—by assigning a custom attribute in the YAML configuration. This means that as long as there are nodes matching this attribute, search replicas will be assigned accordingly. Essentially, we are designating nodes dynamically as part of the "Searcher fleet."

For example, if I have 5 nodes and configure 3 of them with a custom attribute:

node.attr.rackid: "rack2"

I can then make them part of the "Searcher fleet" using the following API:

curl -X PUT "http://localhost:9200/_cluster/settings" \
-H "Content-Type: application/json" \
-d '{
  "persistent": {
    "cluster.routing.allocation.search.replica.dedicated.include.rackid": "rack2"
  }
}'

With this setup, any node with rackid: rack2 will be considered a search-dedicated node. This means that regardless of how many nodes are added or removed from the cluster, as long as they have the matching attribute, they will be automatically recognized as part of the "Reader fleet."

This approach provides flexibility for users to use any attribute-based selection for search nodes.

That said, I’d like to understand if there are any additional concerns, such as potential performance implications, with this approach.

I think the filter logic should be used in cases where we actually need to add an exception(exclusion) or override the allocation logic on top of a generic node roles.

Do you mean we are incorrectly utilizing the inclusion filter here, or is this approach misaligned with its intended purpose? Do you see any limitations or unintended behavior in using it this way?

Also please find my comment on the approach here : #17422 (comment)

Copy link
Contributor

github-actions bot commented Mar 7, 2025

❌ Gradle check result for 858f984: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Mar 8, 2025

❌ Gradle check result for 8f94e31: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

github-actions bot commented Mar 8, 2025

❌ Gradle check result for 66085c5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@vinaykpud vinaykpud closed this Mar 10, 2025
@vinaykpud vinaykpud reopened this Mar 10, 2025
Copy link
Contributor

❌ Gradle check result for 66085c5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Copy link
Contributor

❌ Gradle check result for 5836b20: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working enhancement Enhancement or improvement to existing feature or request Search:Performance
Projects
None yet
3 participants