
[ACTION NEEDED] Fix flaky integration tests at distribution level #667

Closed
Tracked by #4588
gaiksaya opened this issue Apr 3, 2024 · 4 comments
Labels
bug Something isn't working v2.14.0

Comments

@gaiksaya
Member

gaiksaya commented Apr 3, 2024

What is the bug?
It was observed in 2.13.0 and in previous releases that this component was manually signed off for release despite failing integration tests. See opensearch-project/opensearch-build#4433 (comment).
The flakiness of the test runs takes a lot of the release team's time when collecting the go/no-go decision and significantly lowers confidence in the release bundles.

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Run the integration tests for this component and observe the failures.
  2. The issues can be reproduced using the steps described in the AUTOCUT issues for the failed integration tests.
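
As a rough illustration, a failing distribution-level run can usually be reproduced with the opensearch-build test runner. The manifest path and bundle URL below are placeholders, not values from this release; substitute the exact arguments listed in the corresponding AUTOCUT issue.

```shell
# Clone the build repo and run the component's integ tests against a bundle.
# The test manifest and the --paths URL are placeholders; copy the real ones
# from the AUTOCUT issue for the failing run.
git clone https://github.com/opensearch-project/opensearch-build.git
cd opensearch-build
./test.sh integ-test manifests/2.13.0/opensearch-2.13.0-test.yml \
  --paths opensearch=https://example.com/path/to/bundle
```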

What is the expected behavior?
Tests should be consistently passing.

Do you have any additional context?
Please note that this is a hard blocker for the 2.14.0 release, as per the discussion here.

@bbarani
Member

bbarani commented Apr 23, 2024

@martin-gaievski @vibrantvarun Can you please provide your inputs?

@martin-gaievski
Member

As per our deep dive, the inconsistency in the integ tests is caused by the implementation of the memory circuit breaker on the ml-commons side. In the scope of our tests we deploy/undeploy local models using the ml-commons API, and after multiple such calls the memory CB becomes open. I've opened an issue in ml-commons for this matter: opensearch-project/ml-commons#2308.
As a short-term mitigation for 2.14 we have optimized our tests (issues #683 and #689) to minimize the number of times the local model gets redeployed. As of now the integ tests consistently pass in our local copy of https://github.com/opensearch-project/opensearch-build/.
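
The mitigation described above can be sketched as caching the deployed model id so that tests reuse a single deployment instead of running a deploy/undeploy cycle each, which is what repeatedly pressured the ml-commons memory circuit breaker. The helper and `deploy_fn` below are hypothetical names for illustration, not the actual neural-search test code:

```python
# Minimal sketch (hypothetical helper, not the actual neural-search tests):
# deploy a local model once and hand the cached id to every test, so the
# ml-commons memory circuit breaker is not tripped by repeated deploys.

_cached_model_id = None

def get_or_deploy_model(deploy_fn):
    """Return the cached model id, calling deploy_fn only on first use."""
    global _cached_model_id
    if _cached_model_id is None:
        _cached_model_id = deploy_fn()  # e.g. one ml-commons deploy API call
    return _cached_model_id

if __name__ == "__main__":
    calls = []

    def fake_deploy():
        calls.append(1)
        return "model-123"

    # Three "tests" ask for the model, but it is deployed only once.
    ids = [get_or_deploy_model(fake_deploy) for _ in range(3)]
    print(ids, len(calls))  # ['model-123', 'model-123', 'model-123'] 1
```

In a real suite the same effect is usually achieved with a session-scoped test fixture rather than a module-level global.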

@gaiksaya
Member Author

Adding 2.14.0 release manager @rishabh6788

@martin-gaievski
Member

The implemented approach appears stable; there are no failing tests in the 2.14 release pipeline. Resolving this issue.
