Make reroute iteration time-bound for large shard allocations #14848

Merged: 35 commits, Jul 24, 2024

@Bukhtawar (Collaborator) commented Jul 21, 2024

Description

This PR time-boxes each reroute iteration so that it finishes within a specific timeout, which lets URGENT priority tasks run instead of queuing up behind a long reroute. This matters because, while such tasks are stuck in the queue, the cluster becomes unmanageable and admin API calls, including index creation, fail.
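The time-boxing idea described above can be illustrated with a minimal sketch. This is not the code from this PR: the class `TimeBoundReroute`, the method `allocateWithinBudget`, and the string shard names are hypothetical, and the injected clock exists only to make the example deterministic. It assumes that work left over when the budget runs out is simply picked up by a later pass.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import java.util.function.LongSupplier;

/** Hypothetical sketch of a time-boxed allocation pass; not the PR's actual code. */
public class TimeBoundReroute {

    /**
     * Takes shards from the queue until it is empty or the time budget is spent.
     * Shards that are not reached stay in the queue for a later pass.
     */
    static List<String> allocateWithinBudget(Deque<String> unassigned, long budgetNanos, LongSupplier nanoClock) {
        long start = nanoClock.getAsLong();
        List<String> allocated = new ArrayList<>();
        while (!unassigned.isEmpty()) {
            if (nanoClock.getAsLong() - start >= budgetNanos) {
                break; // budget exhausted: stop so that other queued tasks can run
            }
            allocated.add(unassigned.poll()); // stand-in for a real allocation decision
        }
        return allocated;
    }

    public static void main(String[] args) {
        Deque<String> unassigned = new ArrayDeque<>(List.of("shard-0", "shard-1", "shard-2", "shard-3"));
        // Fake clock that advances 10 ns on every reading (assumption for the demo only).
        long[] now = {0};
        LongSupplier fakeClock = () -> now[0] += 10;
        List<String> firstPass = allocateWithinBudget(unassigned, 25, fakeClock);
        System.out.println("allocated=" + firstPass + " remaining=" + unassigned);
    }
}
```

In production code the clock would be `System::nanoTime`; the point of the sketch is only that each pass does a bounded amount of work and leaves the rest for the next one.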

Without changes

  1. Pending tasks waiting too long
curl localhost:9200/_cat/pending_tasks?v
insertOrder timeInQueue priority source
1732617 12.8m HIGH cluster_reroute(async_shard_batch_fetch)
1733407 5.6m HIGH cluster_reroute(async_shard_fetch)
  2. A single reroute iteration taking 11.5m
[2024-05-18T19:28:06,950][WARN ][o.o.c.s.MasterService ] [b869f183befc74cff9f3b5572821ec21] took [11.5m], which is over [10s], to compute cluster state update for [cluster_reroute(post-join reroute)]
  3. Cluster settings updates timing out
[2024-05-18T18:44:03,287][DEBUG][o.o.a.a.c.s.TransportClusterUpdateSettingsAction] [b869f183befc74cff9f3b5572821ec21] #[org.opensearch.cluster.metadata.ProcessClusterEventTimeoutException]#failed to perform [cluster_update_settings]
ProcessClusterEventTimeoutException[failed to process cluster event (cluster_update_settings) within 1m]
        at org.opensearch.cluster.service.MasterService$Batcher.lambda$onTimeout$0(MasterService.java:200)
        at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
        at org.opensearch.cluster.service.MasterService$Batcher.lambda$onTimeout$1(MasterService.java:199)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:863)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
        at java.base/java.lang.Thread.run(Thread.java:840)

With changes

  1. No API timeouts
  2. Faster reroute loops
[2024-07-21T12:01:04,946][WARN ][o.o.c.s.MasterService    ] [c5a433c6dfc0b2d95c57cc3d9ba7bf57] took [44.4s], which is over [10s], to compute cluster state update for [cluster_reroute(async_shard_batch_fetch)]
[2024-07-21T12:02:44,792][WARN ][o.o.c.s.MasterService    ] [c5a433c6dfc0b2d95c57cc3d9ba7bf57] took [44.2s], which is over [10s], to compute cluster state update for [cluster_reroute(async_shard_batch_fetch)]
[2024-07-21T12:04:39,199][WARN ][o.o.c.s.MasterService    ] [c5a433c6dfc0b2d95c57cc3d9ba7bf57] took [41.5s], which is over [10s], to compute cluster state update for [cluster_reroute(async_shard_batch_fetch)]
[2024-07-21T12:06:10,229][WARN ][o.o.c.s.MasterService    ] [c5a433c6dfc0b2d95c57cc3d9ba7bf57] took [30.9s], which is over [10s], to compute cluster state update for [cluster_reroute(async_shard_batch_fetch)]
[2024-07-21T12:07:13,683][WARN ][o.o.c.s.MasterService    ] [c5a433c6dfc0b2d95c57cc3d9ba7bf57] took [23.9s], which is over [10s], to compute cluster state update for [cluster_reroute(async_shard_batch_fetch)]
[2024-07-21T12:08:14,126][WARN ][o.o.c.s.MasterService    ] [c5a433c6dfc0b2d95c57cc3d9ba7bf57] took [26.5s], which is over [10s], to compute cluster state update for [cluster_reroute(async_shard_batch_fetch)]
[2024-07-21T12:09:16,605][WARN ][o.o.c.s.MasterService    ] [c5a433c6dfc0b2d95c57cc3d9ba7bf57] took [25s], which is over [10s], to compute cluster state update for [cluster_reroute(async_shard_batch_fetch)]
[2024-07-21T12:10:22,437][WARN ][o.o.c.s.MasterService    ] [c5a433c6dfc0b2d95c57cc3d9ba7bf57] took [19.1s], which is over [10s], to compute cluster state update for [cluster_reroute(async_shard_batch_fetch)]
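To compare the warnings before and after the change numerically, the `took [...]` field can be extracted from each line. The helper below is an illustrative sketch and not part of this PR; the class name `RerouteLogDurations` and the method `tookSeconds` are hypothetical, and the pattern assumes the duration is printed as a number followed by `ms`, `s`, `m`, or `h`, as in the lines quoted here.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for reading the "took [..]" duration from a MasterService warning. */
public class RerouteLogDurations {

    // Matches e.g. "took [11.5m]", "took [44.4s]", "took [25s]".
    private static final Pattern TOOK = Pattern.compile("took \\[(\\d+(?:\\.\\d+)?)(ms|s|m|h)\\]");

    /** Returns the duration in seconds, or throws if the line has no "took [..]" field. */
    static double tookSeconds(String logLine) {
        Matcher m = TOOK.matcher(logLine);
        if (!m.find()) {
            throw new IllegalArgumentException("no 'took [..]' field in: " + logLine);
        }
        double value = Double.parseDouble(m.group(1));
        switch (m.group(2)) {
            case "ms": return value / 1000.0;
            case "s":  return value;
            case "m":  return value * 60.0;
            default:   return value * 3600.0; // "h"
        }
    }

    public static void main(String[] args) {
        String before = "took [11.5m], which is over [10s], to compute cluster state update for [cluster_reroute(post-join reroute)]";
        String after = "took [44.4s], which is over [10s], to compute cluster state update for [cluster_reroute(async_shard_batch_fetch)]";
        System.out.printf(Locale.ROOT, "before=%.1fs after=%.1fs%n", tookSeconds(before), tookSeconds(after));
    }
}
```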

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

❌ Gradle check result for 9b256fb: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for f9fdb3d: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

❌ Gradle check result for ef8232a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Bukhtawar Khan

❌ Gradle check result for bf9cf16: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishab Nahata

✅ Gradle check result for 21c4c1c: SUCCESS

✅ Gradle check result for 25ca951: SUCCESS

❌ Gradle check result for 07befa2: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Rishab Nahata

✅ Gradle check result for 6ddb155: SUCCESS

@Bukhtawar Bukhtawar merged commit 2a14c27 into opensearch-project:main Jul 24, 2024
32 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jul 24, 2024
* Make reroute iteration time-bound for large shard allocations

Signed-off-by: Bukhtawar Khan
Co-authored-by: Rishab Nahata
(cherry picked from commit 2a14c27)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Bukhtawar pushed a commit that referenced this pull request Jul 24, 2024
#14953)

* Make reroute iteration time-bound for large shard allocations

Signed-off-by: Bukhtawar Khan
Co-authored-by: Rishab Nahata
Bukhtawar pushed a commit that referenced this pull request Jul 24, 2024
#14954)

* Make reroute iteration time-bound for large shard allocations

Signed-off-by: Bukhtawar Khan
Co-authored-by: Rishab Nahata