Search Replica Allocation and Recovery #17457
base: main
Conversation
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
❌ Gradle check result for f9c54c5: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
❕ Gradle check result for 2d5b977: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.
Codecov Report
Attention: Patch coverage is …

Additional details and impacted files:

```diff
@@             Coverage Diff              @@
##               main   #17457      +/-   ##
============================================
- Coverage     72.42%   72.40%    -0.02%
- Complexity    65611    65660       +49
============================================
  Files          5304     5306        +2
  Lines        304743   304605      -138
  Branches      44189    44169       -20
============================================
- Hits         220701   220547      -154
- Misses        65888    65997      +109
+ Partials      18154    18061       -93
```
Force-pushed from 7ba8b33 to b1a2c8c
❌ Gradle check result for b1a2c8c: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
❌ Gradle check result for b982b40: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
The challenge is that any time there is an update, the node IDs have to be added explicitly. This gets especially tricky if there are concurrent modifications: let's say there was a node removal and a new node addition happening concurrently. In such a case, if the addition doesn't read the deleted state and overrides the node list with a new node, we might run into issues. I think the filter logic should be used in cases where we actually need to add an exception (exclusion) or override the allocation logic on top of generic node roles.
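(For context, a sketch of the explicit node-ID bookkeeping described above, using the standard `cluster.routing.allocation.exclude._id` filter setting; the node IDs are hypothetical. Because each call rewrites the whole list, two concurrent writers that each read the old list can silently drop one another's entries.)

```sh
# Rewrites the entire exclusion list in one shot; a concurrent update that
# read the previous list before this call will overwrite it on write-back.
curl -X PUT "http://localhost:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -d '{
    "persistent": {
      "cluster.routing.allocation.exclude._id": "node-id-1,node-id-3"
    }
  }'
```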
Force-pushed from b982b40 to 69f13dd
❌ Gradle check result for 69f13dd: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
Force-pushed from 69f13dd to 6e3be41
Thanks @Bukhtawar. From your response, it seems the concern is about keeping the filter updated while handling concurrent node additions and removals. However, in this implementation, users can define a set of nodes, let's say a "Searcher fleet", by assigning a custom attribute in the YAML configuration. This means that as long as there are nodes matching this attribute, search replicas will be assigned accordingly. Essentially, we are designating nodes dynamically as part of the "Searcher fleet."

For example, if I have 5 nodes and configure 3 of them with a custom attribute:

```yaml
node.attr.rackid: "rack2"
```

I can then make them part of the "Searcher fleet" using the following API:

```sh
curl -X PUT "http://localhost:9200/_cluster/settings" \
  -H "Content-Type: application/json" \
  -d '{
    "persistent": {
      "cluster.routing.allocation.search.replica.dedicated.include.rackid": "rack2"
    }
  }'
```

With this setup, any node with `rackid: "rack2"` becomes part of the "Searcher fleet", and search replicas are allocated only to those nodes.

This approach provides flexibility for users to use any attribute-based selection for search nodes. That said, I'd like to understand if there are any additional concerns, such as potential performance implications, with this approach.
Do you mean we are incorrectly utilizing the inclusion filter here, or is this approach misaligned with its intended purpose? Do you see any limitations or unintended behavior in using it this way? Also, please find my comment on the approach here: #17422 (comment)
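(As a quick way to verify the setup above, shard placement can be listed with the stock `_cat/shards` API; the columns below are standard cat-API headers, and whether search replicas get a distinct `prirep` marker is an assumption to check.)

```sh
# Lists each shard with its type and host node, so you can confirm that
# search replicas land only on nodes carrying the rack2 attribute.
curl -X GET "http://localhost:9200/_cat/shards?v&h=index,shard,prirep,state,node"
```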
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
❌ Gradle check result for 858f984: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
❌ Gradle check result for 8f94e31: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
❌ Gradle check result for 66085c5: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Signed-off-by: Vinay Krishna Pudyodu <[email protected]>
❌ Gradle check result for 5836b20: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?
Description
Modified the search replica allocation to be based on a node attribute.
In this PR we restrict search replicas so they are assigned only to nodes with the `searchonly:true` attribute.
Also in this PR, we have made changes to treat search and regular replicas differently, so that failing to allocate one does not block the other: [RW Separation] Treat Regular and Search Replicas Separately to Prevent Allocation Blocking #17421
I also fixed the recovery of search replicas in the scenario where a node leaves the cluster: [RW Separation] Search replica recovery flow breaks when search shard allocated to new node after node drop #17334
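(For illustration, a minimal sketch of the node-side configuration this description implies. The attribute name `searchonly` is taken from the PR text; treat the exact key format as an assumption rather than the final API.)

```yaml
# opensearch.yml on each node intended to host search replicas
# (attribute name from this PR's description; exact format is an assumption)
node.attr.searchonly: "true"
```

Nodes without this attribute would then be skipped when allocating search replicas, while primaries and regular replicas remain unaffected.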
Related Issues
Resolves #17422
Resolves #17421
Resolves #17334
Related to #15306
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.