Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Optimizing integ tests for less model upload calls #683

Conversation

martin-gaievski
Copy link
Member

@martin-gaievski martin-gaievski commented Apr 10, 2024

Description

Refactored integration tests to minimize number of local model uploads/deployments. This is a first step in attempt to solve test failures in distribution build pipeline. With multiple model redeployments at some point we're reaching the state when memory circuit breaker on ml-commons side opens and next model upload call is failing.

There is a separate issue opened for ml-commons for possible long term fix on their side opensearch-project/ml-commons#2308. As of now there are no plans to work on it, as per their suggestions I'm trying to optimize tests on neural-search side.

In this PR I'm doing following:

  • replacing knn vector search queries with other types of queries, mainly bm25 text searchers
  • merging simple tests into a single method with multiple search queries. This should help as such single test method uses one model, while for multiple smaller methods there will be many model uploads. Now number of integ test methods lowered 64 -> 57.
  • increasing threshold for amount of memory that can be allocated for JVM heap in ml-commons. Default is 85%, I'm setting it to 95%. I tested it with 100%, it gives worse results, I assume that's because memory cannot be allocated to some code paths with native code.

Tested in copy of infra environment, tests are passing:

2024-04-10 22:12:51 INFO     ===============================================
2024-04-10 22:12:51 INFO     Running integration tests for neural-search without-security
2024-04-10 22:12:51 INFO     ===============================================
2024-04-10 22:12:51 INFO     Executing "bash /local/home/dev/opensearch/opensearch-build/scripts/default/integtest.sh -b localhost -p 9200 -s false -v 3.0.0" in /tmp/tmpvo11uccc/neural-search
2024-04-10 22:16:48 INFO     Recording component test results for neural-search at /local/home/dev/opensearch/opensearch-build/test-results/32ef973a0cb743f5bf2f8c29df105e83/integ-test/neural-search/without-security
2024-04-10 22:16:48 INFO     Stderr reported for component: neural-search
2024-04-10 22:16:48 INFO     Note: /tmp/tmpvo11uccc/neural-search/src/main/java/org/opensearch/neuralsearch/processor/combination/ScoreCombinationUtil.java uses or overrides a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
Note: Some input files use unchecked or unsafe operations.
Note: Recompile with -Xlint:unchecked for details.
Apr 10, 2024 10:13:26 PM sun.util.locale.provider.LocaleProviderAdapter <clinit>
WARNING: COMPAT locale provider will be removed in a future release

2024-04-10 22:16:48 INFO     Sending SIGKILL to PID 2755
2024-04-10 22:16:48 INFO     Process killed with exit code None
2024-04-10 22:16:48 INFO     Recording local cluster logs for neural-search with test configuration as without-security at /local/home/dev/opensearch/opensearch-build/test-results/32ef973a0cb743f5bf2f8c29df105e83/integ-test/neural-search/without-security/local-cluster-logs/id-0
2024-04-10 22:16:48 INFO     Cleanup /tmp/tmpvo11uccc/1/local-test-cluster/opensearch-3.0.0 content after the test
2024-04-10 22:16:48 INFO     Walking tree from /local/home/dev/opensearch/opensearch-build/test-results/32ef973a0cb743f5bf2f8c29df105e83/integ-test/neural-search/without-security
2024-04-10 22:16:48 INFO     Removing /tmp/tmpvo11uccc
2024-04-10 22:16:48 INFO     | neural-search        | without-security     | PASS  |

Issues Resolved

#667

Check List

  • All tests pass
  • Commits are signed as per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@martin-gaievski martin-gaievski added Maintenance Add support for new versions of OpenSearch/Dashboards from upstream skip-changelog v2.14.0 labels Apr 10, 2024
@martin-gaievski martin-gaievski force-pushed the fix_model_upload_in_integ_tests branch 2 times, most recently from 5e25a0b to 072c50c Compare April 10, 2024 21:26
@martin-gaievski
Copy link
Member Author

martin-gaievski commented Apr 10, 2024

Rolling tests BWC are failing for Linux due to some issue in Lucene/core that is unrelated to this change. Currently same issue exists for main branch, created new issue to track those failures: #688

 [2024-04-10T23:36:50,728][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [neuralSearchBwcCluster-rolling-0] uncaught exception in thread [main]
  org.opensearch.bootstrap.StartupException: OpenSearchException[failed to bind service]; nested: IndexFormatTooNewException[Format version is not supported (resource BufferedChecksumIndexInput(NIOFSIndexInput(path="/local/home/gaievski/dev/opensearch/        neural-search/qa/rolling-upgrade/build/testclusters/neuralSearchBwcCluster-rolling-0/data/nodes/0/_state/_2o.cfs") [slice=_2o.fnm])): 1 (needs to be between 0 and 0)];
          at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:185) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.OpenSearch.execute(OpenSearch.java:172) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.cli.EnvironmentAwareCommand.execute(EnvironmentAwareCommand.java:104) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.cli.Command.mainWithoutErrorHandling(Command.java:138) ~[opensearch-cli-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.cli.Command.main(Command.java:101) ~[opensearch-cli-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:138) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.OpenSearch.main(OpenSearch.java:104) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
  Caused by: org.opensearch.OpenSearchException: failed to bind service
          at org.opensearch.node.Node.<init>(Node.java:1331) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.node.Node.<init>(Node.java:413) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.Bootstrap$5.<init>(Bootstrap.java:242) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.Bootstrap.setup(Bootstrap.java:242) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.Bootstrap.init(Bootstrap.java:404) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.bootstrap.OpenSearch.init(OpenSearch.java:181) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          ... 6 more
  Caused by: org.apache.lucene.index.IndexFormatTooNewException: Format version is not supported (resource BufferedChecksumIndexInput(NIOFSIndexInput(path="/local/home/gaievski/dev/opensearch/neural-search/qa/rolling-upgrade/build/testclusters/                neuralSearchBwcCluster-rolling-0/data/nodes/0/_state/_2o.cfs") [slice=_2o.fnm])): 1 (needs to be between 0 and 0)
          at org.apache.lucene.codecs.CodecUtil.checkHeaderNoMagic(CodecUtil.java:214) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.codecs.CodecUtil.checkHeader(CodecUtil.java:194) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.codecs.CodecUtil.checkIndexHeader(CodecUtil.java:254) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.codecs.lucene94.Lucene94FieldInfosFormat.read(Lucene94FieldInfosFormat.java:134) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.SegmentCoreReaders.<init>(SegmentCoreReaders.java:112) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.SegmentReader.<init>(SegmentReader.java:96) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:94) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.StandardDirectoryReader$1.doBody(StandardDirectoryReader.java:77) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.SegmentInfos$FindSegmentsFile.run(SegmentInfos.java:820) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:109) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.StandardDirectoryReader.open(StandardDirectoryReader.java:67) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.apache.lucene.index.DirectoryReader.open(DirectoryReader.java:60) ~[lucene-core-9.9.2.jar:9.9.2 a2939784c4ca60bc28bf488b5479c02fc2e5e22c - 2024-01-25 09:51:09]
          at org.opensearch.gateway.PersistedClusterStateService.nodeMetadata(PersistedClusterStateService.java:309) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.env.NodeEnvironment.loadNodeMetadata(NodeEnvironment.java:479) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:394) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
          at org.opensearch.env.NodeEnvironment.<init>(NodeEnvironment.java:301) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]

@vibrantvarun
Copy link
Member

First of all, Great work @martin-gaievski. Awesome job on optimization.

@vibrantvarun
Copy link
Member

vibrantvarun commented Apr 11, 2024

Overall looks good to me.

  1. I think we should add changelog of this as in future until we get permanent fix we have to be careful of uploading models in integ tests.

  2. Can you tests with security plugin as well?

cc: @martin-gaievski

@martin-gaievski
Copy link
Member Author

Overall looks good to me.

  1. I think we should add changelog of this as in future until we get permanent fix we have to be careful of uploading models in integ tests.
  2. Can you tests with security plugin as well?

cc: @martin-gaievski

For 1 - it makes sense, just I think readme file will work better than changelog. Will push commit for this.
Reg 2 - that setup I'm using - it doesn't support with security tests, cluster is not accessible. I see the risk as very low as this change is not related to security of the system or plugin

@martin-gaievski
Copy link
Member Author

Created new issue for a long term fix (switching from model per test case to a shared models strategy) #689

Signed-off-by: Martin Gaievski <[email protected]>
Copy link
Member

@ryanbogan ryanbogan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! This eliminates a major release blocker for release owners. For my understanding, the long term fix is so that we don't accidentally trigger the circuit breaker by adding more tests in the future right?

@martin-gaievski
Copy link
Member Author

LGTM! This eliminates a major release blocker for release owners. For my understanding, the long term fix is so that we don't accidentally trigger the circuit breaker by adding more tests in the future right?

For long term we're still deciding what should be the fix, idea is to switch from "one model per test case" to global models concept, so once deployed one type of model can be used for all tests that needs that model type. It should be possible as we only have 3 model types and models are immutable, so no need in sync of the state between tests

@martin-gaievski martin-gaievski added the backport 2.x Label will add auto workflow to backport PR to 2.x branch label Apr 12, 2024
@martin-gaievski martin-gaievski merged commit 7106d20 into opensearch-project:main Apr 12, 2024
62 of 68 checks passed
@opensearch-trigger-bot
Copy link
Contributor

The backport to 2.x failed:

The process '/usr/bin/git' failed with exit code 1

To backport manually, run these commands in your terminal:

# Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-683-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 7106d205a05404ab134beba595bb9da90592ba7b
# Push it to GitHub
git push --set-upstream origin backport/backport-683-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x

Then, create a pull request where the base branch is 2.x and the compare/head branch is backport/backport-683-to-2.x.

martin-gaievski added a commit to martin-gaievski/neural-search that referenced this pull request Apr 12, 2024
…t#683)

* Optimizing tests for less model upload calls

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 7106d20)
martin-gaievski added a commit to martin-gaievski/neural-search that referenced this pull request Apr 12, 2024
…t#683)

* Optimizing tests for less model upload calls

Signed-off-by: Martin Gaievski <[email protected]>
(cherry picked from commit 7106d20)
Signed-off-by: Martin Gaievski <[email protected]>
martin-gaievski added a commit that referenced this pull request Apr 12, 2024
* Optimizing tests for less model upload calls


(cherry picked from commit 7106d20)

Signed-off-by: Martin Gaievski <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Label will add auto workflow to backport PR to 2.x branch Maintenance Add support for new versions of OpenSearch/Dashboards from upstream skip-changelog v2.14.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants