
[BUG] Multiple calls of model deploy API causes exception from Memory Circuit Breaker #2308

Closed
martin-gaievski opened this issue Apr 10, 2024 · 7 comments
@martin-gaievski (Member)

What is the bug?
When uploading a model with the _upload API, the system returns the following response:

Error response for model upload: Memory Circuit Breaker is open, please check your resources!

How can one reproduce the bug?
Steps to reproduce the behavior:

  1. Run ./gradlew integTest for neural-search on Java 21.
  2. Wait for the tests to complete.

To increase the chance of the error, change the JVM max heap to 1 GB here. This is the same setting that the infra/build team uses for the distribution pipeline run.

The exact tests that fail are random, but the error happens on every execution of the test command; it's always 2 to 6 failing tests.

What is the expected behavior?
No CB error

What is your host/environment?

  • JDK: 21 (everything works on lower JDK versions)
  • Version 2.14 (2.x) and main

Do you have any additional context?
We upload the model from the ml-commons repo using the following request payload: https://github.com/opensearch-project/neural-search/blob/main/src/test/resources/processor/UploadModelRequestBody.json

We use the following sequence of calls for model upload:
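
For reference, this is roughly that sequence as a minimal sketch, assuming the OpenSearch low-level REST client (org.opensearch.client) against a local cluster and the _upload/_load endpoints used in these tests; the task-polling step is simplified to a placeholder:

```java
import java.nio.file.Files;
import java.nio.file.Path;

import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.Response;
import org.opensearch.client.RestClient;

public class ModelUploadSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // 1. Upload (register) the model; the body is the UploadModelRequestBody.json referenced above.
            Request upload = new Request("POST", "/_plugins/_ml/models/_upload");
            upload.setJsonEntity(Files.readString(Path.of("src/test/resources/processor/UploadModelRequestBody.json")));
            Response uploadResponse = client.performRequest(upload);

            // 2. The upload response contains a task_id; poll GET /_plugins/_ml/tasks/{task_id}
            //    until the task completes and read the model_id from it (polling omitted in this sketch).
            String modelId = "<model_id from the completed upload task>";

            // 3. Load (deploy) the model onto the ML nodes.
            client.performRequest(new Request("POST", "/_plugins/_ml/models/" + modelId + "/_load"));
        }
    }
}
```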

And the following sequence of calls to delete resources:
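
A corresponding sketch of the cleanup calls, under the same assumptions (the model id placeholder would come from the upload step above):

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.RestClient;

public class ModelCleanupSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            String modelId = "<model_id of the deployed model>"; // placeholder

            // 1. Unload (undeploy) the model from the ML nodes.
            client.performRequest(new Request("POST", "/_plugins/_ml/models/" + modelId + "/_unload"));

            // 2. Delete the model.
            client.performRequest(new Request("DELETE", "/_plugins/_ml/models/" + modelId));
        }
    }
}
```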

It's somewhat related to #1896, but at that time we lowered the chance of test failures by increasing the max heap size to 4 GB. For 2.14 that's not an option, as per the global issue opensearch-project/neural-search#667.

@martin-gaievski added the bug (Something isn't working) and untriaged labels on Apr 10, 2024
@dhrubo-os (Collaborator)

I'm not 100% sure this is an ml-commons bug. It seems that memory usage in the cluster is still very high.
Maybe you can try setting this setting to 100?

@martin-gaievski (Member, Author)

We do have this set to 100 for neural-search: https://github.com/opensearch-project/neural-search/blob/main/src/testFixtures/java/org/opensearch/neuralsearch/BaseNeuralSearchIT.java#L116. Let me try different values for the other JVM heap setting, plugins.ml_commons.jvm_heap_memory_threshold.
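
For context, adjusting that threshold looks roughly like this: a minimal sketch of the cluster settings update, assuming the OpenSearch low-level REST client against a local cluster and the setting name mentioned above:

```java
import org.apache.http.HttpHost;
import org.opensearch.client.Request;
import org.opensearch.client.RestClient;

public class MlCircuitBreakerSettingSketch {
    public static void main(String[] args) throws Exception {
        try (RestClient client = RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
            // Raise the ml-commons memory circuit breaker threshold (percentage of JVM heap).
            Request request = new Request("PUT", "/_cluster/settings");
            request.setJsonEntity("{ \"persistent\": { \"plugins.ml_commons.jvm_heap_memory_threshold\": 100 } }");
            client.performRequest(request);
        }
    }
}
```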

@martin-gaievski (Member, Author)

I found that a value of 95% for the plugins.ml_commons.jvm_heap_memory_threshold setting is the most effective, but it doesn't prevent tests from failing: instead of 4-6 failing tests, with some other optimizations it's now 1-2 tests. Overall, with this setting I think we're not solving the problem but delaying its manifestation.

@Zhangxunmt (Collaborator) commented Apr 10, 2024

I believe the problem is related to the fact that after several loads/unloads, the unreleased memory is held by the PyTorch runtime library, which DJL uses as a black box.

The most common use case for PyTorch is hosting a model server where performance is the No. 1 priority, so it's designed to keep consuming a large amount of memory even after the model is unloaded. Our use case is special, and that's why we don't recommend using pre-trained or local models in a production environment.

For this integration test problem, can you reduce the number of loads/unloads in your tests? In other words, is it possible to finish all the necessary tests within only one model lifecycle? Also, can you try using a smaller model in the IT?

@martin-gaievski (Member, Author)

I think we're already using the small model from the ml-commons repo.
I'll be pushing a PR for test refactoring: I'll check our tests and remove unnecessary model uploads, plus merge a few small test methods into larger ones to reuse a single model.
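
The refactoring direction is roughly: deploy one model per test class and reuse it across test methods. A hypothetical JUnit sketch (class and helper names are made up for illustration, not actual neural-search code):

```java
import org.junit.AfterClass;
import org.junit.BeforeClass;
import org.junit.Test;

// Hypothetical sketch: one model lifecycle shared by all test methods in the class,
// instead of an upload/deploy/undeploy cycle per test method.
public class SharedModelIT {

    private static String modelId;

    @BeforeClass
    public static void deployModelOnce() {
        // Upload + deploy once per class (see the sequence earlier in this thread); keep the model_id.
        modelId = uploadAndDeployModel();
    }

    @AfterClass
    public static void cleanUpModel() {
        // Undeploy + delete once, after all tests in the class have run.
        undeployAndDeleteModel(modelId);
    }

    @Test
    public void testTextEmbeddingProcessor() {
        // Run assertions against the already-deployed model (modelId).
    }

    @Test
    public void testNeuralQuery() {
        // Another test reusing the same model instead of re-uploading it.
    }

    // Hypothetical helpers standing in for the REST calls sketched above.
    private static String uploadAndDeployModel() { return "<model_id>"; }

    private static void undeployAndDeleteModel(String id) { /* no-op in this sketch */ }
}
```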

@martin-gaievski (Member, Author) commented Apr 30, 2024

@Zhangxunmt My team suspects that the memory CB does not calculate used memory properly; in particular, mmapped files are also counted. That causes leak-like behavior: over time, after multiple undeployments, the amount of memory that is counted grows beyond the memory that is actually used.

I've verified this with the following experiment:

  • in the neural-search plugin we set the memory CB threshold to 100% and ran the tests; they failed
  • disabled the memory CB: for that we built the OpenSearch min distribution, using the ml-commons 2.13 branch with this commit
  • ran the same workload; the tests passed

Step 1 confirms the issue. Step 3 shows that even with a 100% threshold the CB doesn't count memory usage correctly.

To reproduce the issue, I set up https://github.com/opensearch-project/opensearch-build/ locally and pointed it to my custom branch of ml-commons.

I suggest that ml-commons either add an option or setting to disable the memory CB completely, or skip the CB check if the threshold is set to >= 100%.
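
To illustrate the second option, the breaker check could short-circuit whenever the threshold is at or above 100. A hypothetical sketch (names are made up; this is not the actual ml-commons MemoryCircuitBreaker code):

```java
// Hypothetical sketch of the proposed behavior: treat a threshold of 100 (or more)
// as "circuit breaker disabled" and skip the heap-usage check entirely.
public class HeapMemoryBreakerSketch {

    private final int thresholdPercent; // e.g. plugins.ml_commons.jvm_heap_memory_threshold

    public HeapMemoryBreakerSketch(int thresholdPercent) {
        this.thresholdPercent = thresholdPercent;
    }

    public boolean isOpen(int heapUsedPercent) {
        if (thresholdPercent >= 100) {
            return false; // breaker effectively disabled, never reject requests
        }
        return heapUsedPercent > thresholdPercent;
    }

    public static void main(String[] args) {
        HeapMemoryBreakerSketch breaker = new HeapMemoryBreakerSketch(100);
        System.out.println(breaker.isOpen(97)); // false: threshold >= 100 disables the check
    }
}
```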

@Zhangxunmt (Collaborator)

The memory CB is disabled when the heap threshold == 100. Resolving this issue.
