[BUG] Model is getting stuck in deploying state #2970

Open
gaurav7830 opened this issue Sep 18, 2024 · 5 comments
Assignees
Labels
bug Something isn't working

Comments

@gaurav7830

gaurav7830 commented Sep 18, 2024

What is the bug?
The model gets stuck in the DEPLOYING state while it is being registered on the cluster. We have seen cases where the model is not found on a few of the nodes.

Scenario

  1. The model is stuck in the DEPLOYING state.
  2. Calling the model undeploy API on the cluster returns the following response:
{
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "undeployed"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "undeployed"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "undeployed"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    },
    "NodeId": {
        "stats": {
            "ModelId": "not_found"
        }
    }
}
  3. Calling the GetModel API afterwards still returns the model state as DEPLOYING.
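
To see at a glance how the nodes disagree, the undeploy response above can be tallied per state. This is only a minimal sketch: the node IDs `node-1` through `node-10` are hypothetical placeholders for the redacted `NodeId` keys, `ModelId` stands in for the real model ID as in the report, and `tally_states` is an illustrative helper, not part of any API.

```python
from collections import Counter

# Hypothetical node IDs; the states mirror the response in the report, in order.
states = ["not_found", "not_found", "undeployed", "not_found", "undeployed",
          "not_found", "not_found", "undeployed", "not_found", "not_found"]
response = {
    f"node-{i}": {"stats": {"ModelId": state}}
    for i, state in enumerate(states, start=1)
}

def tally_states(undeploy_response, model_id):
    """Count how many nodes reported each state for the given model."""
    return Counter(
        node["stats"].get(model_id, "missing")
        for node in undeploy_response.values()
    )

summary = tally_states(response, "ModelId")
print(dict(summary))
```

For the response in this report, 7 nodes answer `not_found` and 3 answer `undeployed`, while GetModel keeps reporting DEPLOYING.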

What is the expected behavior?
The model should be undeployed.

gaurav7830 added the bug and untriaged labels Sep 18, 2024
@ylwu-amzn
Collaborator

@Zhangxunmt I know you have some suggestions for enhancing this part. Could you please take a look?

@mingshl
Collaborator

mingshl commented Sep 24, 2024

#2976

@zane-neo
Collaborator

#2976

This PR removes the automatic redeployment of remote models during cluster changes. That does not mean this issue is caused by model auto redeploy. In fact, the root cause of the model getting stuck in the deploying status is still unknown, since it is very difficult to reproduce. The real solution is to support undeploying a model while its status is DEPLOYING, which will be implemented very soon. Users can then undeploy the stuck model and redeploy it to mitigate the pain.
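
Once undeploying a model in the DEPLOYING state is supported, the mitigation described above might look like the sequence of REST calls below. This is a sketch, not a tested procedure: it assumes the ML Commons endpoints `POST /_plugins/_ml/models/<model_id>/_undeploy` and `POST /_plugins/_ml/models/<model_id>/_deploy`, and both the helper name `mitigation_requests` and the model ID `my-model-id` are hypothetical.

```python
def mitigation_requests(model_id):
    """Return the (method, path) pairs for an undeploy-then-redeploy cycle."""
    base = f"/_plugins/_ml/models/{model_id}"
    return [
        ("POST", f"{base}/_undeploy"),  # clear the stuck deployment
        ("POST", f"{base}/_deploy"),    # deploy the model again
        ("GET", base),                  # check the reported model state
    ]

for method, path in mitigation_requests("my-model-id"):
    print(method, path)
```

The helper only builds the requests; sending them with an HTTP client and waiting for the deploy task to finish is left out on purpose.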

@zane-neo
Collaborator

zane-neo commented Oct 8, 2024

The root cause is as follows. When a model is deployed, the manager node sends the deploy request to all eligible nodes in the cluster, but a node can crash at any moment. If a node crashes right after the getEligibleNodes method has run, that node never sends a deploy response to the manager node. The worker node count therefore never counts down to 0, so the model status is not updated and stays at DEPLOYING.
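
The countdown problem can be illustrated with a small simulation. This is a simplified model under stated assumptions, not the actual ml-commons code: `DeployTracker`, `on_deploy_response`, and the node names are hypothetical, and the real manager node logic is considerably more involved.

```python
class DeployTracker:
    """Hypothetical model of the manager node's bookkeeping for one deploy."""

    def __init__(self, eligible_nodes):
        # Snapshot taken once, analogous to the result of getEligibleNodes.
        self.remaining = len(eligible_nodes)
        self.status = "DEPLOYING"

    def on_deploy_response(self, node_id):
        # Each deploy response counts the remaining workers down by one.
        self.remaining -= 1
        if self.remaining == 0:
            self.status = "DEPLOYED"


eligible = ["data-node-1", "data-node-2"]
tracker = DeployTracker(eligible)

crashed = {"data-node-1"}  # crashes right after the snapshot was taken
for node in eligible:
    if node not in crashed:
        tracker.on_deploy_response(node)

print(tracker.status, tracker.remaining)
```

Because the crashed node never responds, the counter stops at 1 and the status is never moved out of DEPLOYING, which matches the behavior described above.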

To reproduce this issue, you need a small cluster with at least 3 nodes: one manager node and the rest data nodes. Start the manager node and one data node first, then create a model and deploy it. Next, start another data node and add a debug breakpoint in the deploy transport action on the manager node (after it has fetched all eligible nodes). When the breakpoint is hit, shut down the first data node and resume execution. You will then see that the model stays in the DEPLOYING status.

@zane-neo
Collaborator

zane-neo commented Oct 8, 2024

@rbhavna Can you update the solution details that will be used to fix this?

Projects
Status: In Progress
Development

No branches or pull requests

6 participants