
[BUG] Multiple calls of model deploy API causes exception from Memory Circuit Breaker #2308

Closed
martin-gaievski opened this issue Apr 10, 2024 · 7 comments
@martin-gaievski (Member)

What is the bug?
When uploading a model with the _upload API, the system returns the following response:

Error response for model upload: Memory Circuit Breaker is open, please check your resources!

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Run ./gradlew integTest for neural-search on Java 21.
  2. Wait for the tests to complete.

To increase the chance of the error, change the JVM max heap to 1 GB here. This is the same setting that the infra/build team uses for the distribution pipeline run.

The exact tests that fail are random, but the error happens on every execution of the test command; it's always 2 to 6 failing tests.

What is the expected behavior?
No CB error

What is your host/environment?

  • JDK: 21 (everything works on lower JDK versions)
  • Version 2.14 (2.x) and main

Do you have any additional context?
We upload the model from the ml-commons repo using the following request payload: https://github.com/opensearch-project/neural-search/blob/main/src/test/resources/processor/UploadModelRequestBody.json

We use the following sequence of calls for model upload:
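
For reference, this is roughly that sequence as a minimal sketch, assuming the OpenSearch low-level REST client (org.opensearch.client) against a local cluster and the _upload/_load endpoints used in these tests; the task-polling step is simplified to a placeholder:

```java
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

public class ModelUploadSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // 1. Upload (register) the model; the body is the UploadModelRequestBody.json referenced above.
            Request upload = new Request("POST", "/_plugins/_ml/models/_upload");
            upload.setJsonEntity(Files.readString(Path.of("src/test/resources/processor/UploadModelRequestBody.json")));
            Response uploadResponse = client.performRequest(upload);

            // 2. The upload response contains a task_id; poll GET /_plugins/_ml/tasks/{task_id}
            //    until the task completes and read the model_id from it (polling omitted in this sketch).
            String modelId = "<model_id from the completed upload task>";

            // 3. Load (deploy) the model onto the ML nodes.
            client.performRequest(new Request("POST", "/_plugins/_ml/models/" + modelId + "/_load"));
        }
    }
}
```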

And the following sequence of calls to delete resources:
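
A corresponding sketch of the cleanup calls, under the same assumptions (the model id placeholder would come from the upload step above):

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.RestClient;

public class ModelCleanupSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            String modelId = "<model_id of the deployed model>"; // placeholder

            // 1. Unload (undeploy) the model from the ML nodes.
            client.performRequest(new Request("POST", "/_plugins/_ml/models/" + modelId + "/_unload"));

            // 2. Delete the model.
            client.performRequest(new Request("DELETE", "/_plugins/_ml/models/" + modelId));
        }
    }
}
```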

It's somewhat related to #1896, but at that time we lowered the chance of test failures by increasing the max heap size to 4 GB. For 2.14 that's not an option, as per the global issue opensearch-project/neural-search#667.

@martin-gaievski added the bug (Something isn't working) and untriaged labels on Apr 10, 2024
@dhrubo-os (Collaborator)

I'm not 100% sure this is an ml-commons bug. It seems that memory usage in the cluster is still very high.
Maybe you can try setting this setting to 100?

@martin-gaievski (Member, Author)

We do have this set to 100 for neural-search: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L116. Let me try different values for the other JVM heap setting, plugins.ml_commons.jvm_heap_memory_threshold.
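
For context, adjusting that threshold looks roughly like this: a minimal sketch of the cluster settings update, assuming the OpenSearch low-level REST client against a local cluster and the setting name mentioned above:

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.RestClient;

public class MlCircuitBreakerSettingSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Raise the ml-commons memory circuit breaker threshold (percentage of JVM heap).
            Request request = new Request("PUT", "/_cluster/settings");
            request.setJsonEntity("{ \"persistent\": { \"plugins.ml_commons.jvm_heap_memory_threshold\": 100 } }");
            client.performRequest(request);
        }
    }
}
```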

@martin-gaievski (Member, Author)

I found that a value of 95% for the plugins.ml_commons.jvm_heap_memory_threshold setting is the most effective, but it doesn't prevent tests from failing: instead of 4-6 failing tests, with some other optimizations it's now 1-2 tests. Overall, with this setting I think we're not solving the problem but delaying its manifestation.

@Zhangxunmt (Collaborator) commented Apr 10, 2024

I believe the problem is related to the fact that after several loads/unloads, the unreleased memory is held by the PyTorch runtime library, which DJL uses as a black box.

The most common use case for PyTorch is hosting a model server where performance is the No. 1 priority, so it's designed to keep consuming a large amount of memory even after the model is unloaded. Our use case is special, and that's why we don't recommend using pre-trained or local models in a production environment.

For this integration test problem, can you reduce the number of loads/unloads in your tests? In other words, is it possible to finish all the necessary tests within only one model lifecycle? Also, can you try using a smaller model in the IT?

@martin-gaievski (Member, Author)

I think we're already using the small model from the ml-commons repo.
I'll be pushing a PR for test refactoring: I'll check our tests and remove unnecessary model uploads, plus merge a few small test methods into larger ones to reuse a single model.
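
The refactoring direction is roughly: deploy one model per test class and reuse it across test methods. A hypothetical JUnit sketch (class and helper names are made up for illustration, not actual neural-search code):

```java
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

// Hypothetical sketch: one model lifecycle shared by all test methods in the class,
// instead of an upload/deploy/undeploy cycle per test method.
public class SharedModelIT {

    private static String modelId;

    @BeforeClass
    public static void deployModelOnce() {
        // Upload + deploy once per class (see the sequence earlier in this thread); keep the model_id.
        modelId = uploadAndDeployModel();
    }

    @AfterClass
    public static void cleanUpModel() {
        // Undeploy + delete once, after all tests in the class have run.
        undeployAndDeleteModel(modelId);
    }

    @Test
    public void testTextEmbeddingProcessor() {
        // Run assertions against the already-deployed model (modelId).
    }

    @Test
    public void testNeuralQuery() {
        // Another test reusing the same model instead of re-uploading it.
    }

    // Hypothetical helpers standing in for the REST calls sketched above.
    private static String uploadAndDeployModel() { return "<model_id>"; }

    private static void undeployAndDeleteModel(String id) { /* no-op in this sketch */ }
}
```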

@martin-gaievski (Member, Author) commented Apr 30, 2024

@Zhangxunmt My team suspects that the memory CB does not calculate used memory properly; in particular, mmapped files are also counted. That causes leak-like behavior: over time, after multiple undeployments, the amount of memory that is counted grows beyond the memory that is actually used.

I've verified this with the following experiment:

  • in the neural-search plugin we set the memory CB threshold to 100% and ran the tests; they failed
  • disabled the memory CB: for that we built the OpenSearch min distribution, using the ml-commons 2.13 branch with this commit
  • ran the same workload; the tests passed

Step 1 confirms the issue. Step 3 shows that even with a 100% threshold the CB doesn't count memory usage correctly.

To reproduce the issue, I set up https://github.com/opensearch-project/opensearch-build/ locally and pointed it to my custom branch of ml-commons.

I suggest that ml-commons either add an option or setting to disable the memory CB completely, or skip the CB check if the threshold is set to >= 100%.
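
To illustrate the second option, the breaker check could short-circuit whenever the threshold is at or above 100. A hypothetical sketch (names are made up; this is not the actual ml-commons MemoryCircuitBreaker code):

```java
// Hypothetical sketch of the proposed behavior: treat a threshold of 100 (or more)
// as "circuit breaker disabled" and skip the heap-usage check entirely.
public class HeapMemoryBreakerSketch {

    private final int thresholdPercent; // e.g. plugins.ml_commons.jvm_heap_memory_threshold

    public HeapMemoryBreakerSketch(int thresholdPercent) {
        this.thresholdPercent = thresholdPercent;
    }

    public boolean isOpen(int heapUsedPercent) {
        if (thresholdPercent >= 100) {
            return false; // breaker effectively disabled, never reject requests
        }
        return heapUsedPercent > thresholdPercent;
    }

    public static void main(String[] args) {
        HeapMemoryBreakerSketch breaker = new HeapMemoryBreakerSketch(100);
        System.out.println(breaker.isOpen(97)); // false: threshold >= 100 disables the check
    }
}
```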

@Zhangxunmt (Collaborator)

The memory CB is disabled when the heap threshold == 100. Resolving this issue.
