[Bug]: After a standalone rolling upgrade, memory usage surged to twice its original amount, resulting in an out-of-memory (OOM) error. #34148

Open
1 task done
zhuwenxing opened this issue Jun 25, 2024 · 2 comments
Assignees
Labels
kind/bug Issues or changes related a bug priority/critical-urgent Highest priority. Must be actively worked on as someone's top priority right now. severity/critical Critical, lead to crash, data missing, wrong result, function totally doesn't work. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Milestone

Comments

@zhuwenxing
Contributor

Is there an existing issue for this?

  • I have searched the existing issues

Environment

- Milvus version: v2.4.1 --> master-20240624-fd922d92-amd64
- Deployment mode(standalone or cluster): standalone
- MQ type(rocksmq, pulsar or kafka): kafka
- SDK version(e.g. pymilvus v2.0.0rc2):
- OS(Ubuntu or CentOS): 
- CPU/Memory: 
- GPU: 
- Others:

Current Behavior

[2024-06-24T17:47:38.647Z] + kubectl get pods -o wide
[2024-06-24T17:47:38.648Z] + grep kafka-standalone-3862
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-etcd-0                                     1/1     Running            0                 45m     10.104.23.13    4am-node27   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-etcd-1                                     1/1     Running            0                 45m     10.104.17.75    4am-node23   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-etcd-2                                     1/1     Running            0                 45m     10.104.25.114   4am-node30   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-kafka-0                                    2/2     Running            0                 45m     10.104.23.17    4am-node27   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-kafka-1                                    2/2     Running            0                 45m     10.104.17.81    4am-node23   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-kafka-2                                    2/2     Running            0                 45m     10.104.25.118   4am-node30   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-kafka-exporter-5c584cb654-26n2g            1/1     Running            4 (45m ago)       45m     10.104.9.13     4am-node14   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-kafka-zookeeper-0                          1/1     Running            0                 45m     10.104.23.15    4am-node27   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-kafka-zookeeper-1                          1/1     Running            0                 45m     10.104.25.117   4am-node30   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-kafka-zookeeper-2                          1/1     Running            0                 45m     10.104.17.84    4am-node23   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-milvus-standalone-7d695d7554-r8b57         0/1     CrashLoopBackOff   11 (34s ago)      36m     10.104.13.27    4am-node16   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-minio-0                                    1/1     Running            0                 45m     10.104.23.14    4am-node27   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-minio-1                                    1/1     Running            0                 45m     10.104.17.77    4am-node23   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-minio-2                                    1/1     Running            0                 45m     10.104.25.115   4am-node30   <none>           <none>
[2024-06-24T17:47:38.905Z] kafka-standalone-3862-minio-3                                    1/1     Running            0                 45m     10.104.16.79    4am-node21   <none>           <none>
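
For triage, the only pod in the listing that is not healthy is the milvus-standalone pod (0/1, CrashLoopBackOff, 11 restarts). A minimal sketch for picking such rows out of a saved Jenkins console log might look like the following. The regular expression and the `unhealthy_pods` helper are illustrative names, not part of the test pipeline, and the row layout is assumed to match the `kubectl get pods -o wide` output shown here.

```python
import re

# Matches one data row of `kubectl get pods`, optionally prefixed by a
# Jenkins-style "[timestamp] " marker as in the console log above.
ROW = re.compile(
    r"^(?:\[[^\]]+\]\s+)?"               # optional "[2024-06-24T17:47:38.905Z] " prefix
    r"(?P<name>\S+)\s+"                  # pod name
    r"(?P<ready>\d+)/(?P<total>\d+)\s+"  # READY column, e.g. 0/1
    r"(?P<status>\S+)\s+"                # STATUS column
    r"(?P<restarts>\d+)"                 # leading integer of the RESTARTS column
)


def unhealthy_pods(lines):
    """Return (name, status, restarts) for rows that are not fully ready.

    Note: pods in a terminal state such as Completed would also be
    reported; this sketch does not special-case them.
    """
    result = []
    for line in lines:
        m = ROW.match(line.strip())
        if m is None:
            continue  # echoed command, header, or blank line
        not_ready = int(m["ready"]) != int(m["total"])
        if not_ready or m["status"] != "Running":
            result.append((m["name"], m["status"], int(m["restarts"])))
    return result


sample = [
    "[2024-06-24T17:47:38.648Z] + grep kafka-standalone-3862",
    "[2024-06-24T17:47:38.905Z] kafka-standalone-3862-etcd-0   1/1   Running   0   45m   10.104.23.13   4am-node27   <none>   <none>",
    "[2024-06-24T17:47:38.905Z] kafka-standalone-3862-milvus-standalone-7d695d7554-r8b57   0/1   CrashLoopBackOff   11 (34s ago)   36m   10.104.13.27   4am-node16   <none>   <none>",
]
print(unhealthy_pods(sample))
```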

image

Containers:
  standalone:
    Container ID:  containerd://53b912e3903d275e9f7cea276c4b3bd80d4cd315e6b2462392c49b217e397419
    Image:         harbor.milvus.io/milvus/milvus:master-20240624-fd922d92-amd64
    Image ID:      harbor.milvus.io/milvus/milvus@sha256:972731ad453f7a8427b9bdfcea6f6664ecb7fb91cb1ad51ad119ea01fe1af2a7
    Ports:         19530/TCP, 9091/TCP
    Host Ports:    0/TCP, 0/TCP
    Args:
      /milvus/tools/run.sh
      milvus
      run
      standalone
    State:          Waiting
      Reason:       CrashLoopBackOff
    Last State:     Terminated
      Reason:       OOMKilled
      Exit Code:    137
      Started:      Mon, 24 Jun 2024 17:46:56 +0000
      Finished:     Mon, 24 Jun 2024 17:47:04 +0000
    Ready:          False
    Restart Count:  11
    Limits:
      cpu:     4
      memory:  8Gi
    Requests:
      cpu:      100m
      memory:   128Mi
    Liveness:   http-get http://:9091/healthz delay=0s timeout=10s period=15s #success=1 #failure=3
    Readiness:  http-get http://:9091/healthz delay=0s timeout=3s period=15s #success=1 #failure=2
    Startup:    http-get http://:9091/healthz delay=0s timeout=3s period=10s #success=1 #failure=18
    Environment:
      CACHE_SIZE:        8 (limits.memory)
      MINIO_ACCESS_KEY:  <set to the key 'accesskey' in secret 'kafka-standalone-3862-minio'>  Optional: false
      MINIO_SECRET_KEY:  <set to the key 'secretkey' in secret 'kafka-standalone-3862-minio'>  Optional: false
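
Note on the `Last State` above: exit code 137 follows the usual 128 + N convention for a process terminated by signal N, so 137 corresponds to signal 9 (SIGKILL), which is consistent with the `OOMKilled` reason and the 8Gi memory limit. The Started and Finished timestamps also show the container ran for only about 8 seconds before being killed. A small sketch of that decoding, where `describe_exit_code` is a hypothetical helper name and the signal lookup assumes a Linux host:

```python
import signal


def describe_exit_code(code):
    """Explain a container exit code using the 128 + N signal convention."""
    if code > 128:
        signum = code - 128
        try:
            # Resolve the signal number to its symbolic name on this platform.
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    if code == 0:
        return "exited normally"
    return f"exited with error status {code}"


print(describe_exit_code(137))
```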

Expected Behavior

No response

Steps To Reproduce

No response

Milvus Log

failed job: https://qa-jenkins.milvus.io/blue/organizations/jenkins/rolling_update_for_operator_test_simple/detail/rolling_update_for_operator_test_simple/3862/pipeline

log:
artifacts-kafka-standalone-3862-server-logs.tar.gz

Anything else?

No response

@zhuwenxing added the kind/bug and needs-triage labels on Jun 25, 2024
@zhuwenxing added this to the 2.5.0 milestone on Jun 25, 2024
@zhuwenxing added the priority/critical-urgent and severity/critical labels on Jun 25, 2024
@yanliang567 added the triage/accepted label and removed the needs-triage label on Jun 25, 2024
@yanliang567
Contributor

/assign @weiliu1031
/unassign

@zhuwenxing
Contributor Author

image

This is the memory usage when upgrading from v2.4.1 to 2.4-20240624-63c9e6e0-amd64. It looks fine and does not cause an OOM.
