@pyek-bot
Collaborator

@pyek-bot pyek-bot commented Oct 29, 2025

Description

During a blue/green (B/G) deployment of a very large cluster, the metrics job can be indexed multiple times. We only want to index this job once, on startup.

When this happens, multiple threads wait on the index action to complete. Because a ClusterState object is involved, those waiting threads can use up a lot of heap.

This PR therefore adds a check for whether the index has already been created, plus local checks within the same node. It also removes the condition for the offline batch polling task job, since that job is created on a batch predict.
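
For readers skimming the thread, the two guards described above can be sketched roughly as follows. This is a minimal illustration and not the code in this PR: the class `StatsJobStarter`, the `jobAlreadyIndexed` supplier, and the `indexJob` callback are hypothetical names standing in for the real cluster-state lookup and index request.

```java
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.BooleanSupplier;

// Hypothetical sketch: names are illustrative and do not come from ml-commons.
final class StatsJobStarter {
    // Local, per-node guard: only the first caller on this node gets past it.
    private final AtomicBoolean attemptedOnThisNode = new AtomicBoolean(false);
    // Assumed stand-in for "has the job index/document already been created?"
    private final BooleanSupplier jobAlreadyIndexed;
    // Assumed stand-in for the action that indexes the metrics job.
    private final Runnable indexJob;

    StatsJobStarter(BooleanSupplier jobAlreadyIndexed, Runnable indexJob) {
        this.jobAlreadyIndexed = jobAlreadyIndexed;
        this.indexJob = indexJob;
    }

    /** Returns true only if this call actually ran the index action. */
    boolean startIfNeeded() {
        // Guard 1: skip if this node has already attempted the start.
        if (!attemptedOnThisNode.compareAndSet(false, true)) {
            return false;
        }
        // Guard 2: skip if the job already exists cluster-wide.
        if (jobAlreadyIndexed.getAsBoolean()) {
            return false;
        }
        try {
            indexJob.run();
            return true;
        } catch (RuntimeException e) {
            // Allow a later retry if the index action failed.
            attemptedOnThisNode.set(false);
            throw e;
        }
    }
}
```

With guards like these, repeated cluster events on one node would reach the index action at most once, so fewer threads are left waiting on it. In the real plugin the existence check would presumably consult cluster metadata rather than a plain supplier.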

Check List

  • New functionality includes testing.
  • New functionality has been documented.
  • API changes companion pull request created.
  • Commits are signed per the DCO using --signoff.
  • Public documentation issue/PR created.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Collaborator

@Zhangxunmt Zhangxunmt left a comment

apply spotless?

@pyek-bot
Collaborator Author

Thanks @Zhangxunmt! Addressed the comments.

Signed-off-by: Pavan Yekbote <[email protected]>
@pyek-bot pyek-bot force-pushed the fix_metrics_collection branch from aebfc8d to 4cb83b8 Compare October 29, 2025 06:34
Zhangxunmt previously approved these changes Oct 29, 2025
Signed-off-by: Pavan Yekbote <[email protected]>
@pyek-bot pyek-bot had a problem deploying to ml-commons-cicd-env October 29, 2025 19:23 — with GitHub Actions Failure
@pyek-bot pyek-bot temporarily deployed to ml-commons-cicd-env October 29, 2025 19:23 — with GitHub Actions Inactive
@codecov

codecov bot commented Oct 29, 2025

Codecov Report

❌ Patch coverage is 54.54545% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 80.12%. Comparing base (5964268) to head (392ac58).
⚠️ Report is 5 commits behind head on main.

Files with missing lines | Patch % | Lines
...ain/java/org/opensearch/ml/task/MLTaskManager.java | 57.14% | 2 Missing and 1 partial ⚠️
...arch/ml/cluster/MLCommonsClusterEventListener.java | 50.00% | 0 Missing and 2 partials ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #4362      +/-   ##
============================================
+ Coverage     80.09%   80.12%   +0.03%     
- Complexity    10199    10212      +13     
============================================
  Files           855      855              
  Lines         44374    44413      +39     
  Branches       5135     5139       +4     
============================================
+ Hits          35540    35585      +45     
+ Misses         6670     6666       -4     
+ Partials       2164     2162       -2     
Flag | Coverage Δ
ml-commons | 80.12% <54.54%> (+0.03%) ⬆️


@pyek-bot pyek-bot temporarily deployed to ml-commons-cicd-env October 29, 2025 20:31 — with GitHub Actions Inactive
@pyek-bot pyek-bot temporarily deployed to ml-commons-cicd-env October 29, 2025 20:31 — with GitHub Actions Inactive
@brianf-aws
Contributor

Can we please merge this PR? It's very important for large clusters.

@pyek-bot pyek-bot temporarily deployed to ml-commons-cicd-env October 29, 2025 23:50 — with GitHub Actions Inactive
@pyek-bot
Collaborator Author

Yes, waiting on the required CI to pass! It is currently failing due to flaky tests and throttling.

@pyek-bot pyek-bot merged commit 5ac2984 into opensearch-project:main Oct 30, 2025
18 of 22 checks passed
@opensearch-trigger-bot
Contributor

The backport to 3.1 failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-3.1 3.1
# Navigate to the new working tree
cd .worktrees/backport-3.1
# Create a new branch
git switch --create backport/backport-4362-to-3.1
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 5ac2984bf59a2a0ede147262a8ab477d41859463
# Push it to GitHub
git push --set-upstream origin backport/backport-4362-to-3.1
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-3.1

Then, create a pull request where the base branch is 3.1 and the compare/head branch is backport/backport-4362-to-3.1.

opensearch-trigger-bot bot pushed a commit that referenced this pull request Oct 30, 2025
…collector (#4362)

* fix: add additional checks for initializing the stats job collector to minimize jvm usage

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: spotless apply and add debug logs

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: tests

Signed-off-by: Pavan Yekbote <[email protected]>

* fix: spotless

Signed-off-by: Pavan Yekbote <[email protected]>

---------

Signed-off-by: Pavan Yekbote <[email protected]>
(cherry picked from commit 5ac2984)
pyek-bot added a commit that referenced this pull request Oct 30, 2025
…collector (#4362) (#4378)

* fix: add additional checks for initializing the stats job collector to minimize jvm usage

* fix: spotless apply and add debug logs

* fix: tests

* fix: spotless

---------

(cherry picked from commit 5ac2984)

Signed-off-by: Pavan Yekbote <[email protected]>
Co-authored-by: Pavan Yekbote <[email protected]>