fix(hetzner): insufficient nodes when boot fails
The Hetzner Cloud API returns "Actions" for anything asynchronous that
happens inside the backend. Creating a new server returns multiple
actions: `create_server`, `start_server`, and `attach_to_network` (if a
network is set).

Our current code waits only for the `create_server` action and, if it
fails, deletes the server so that cluster-autoscaler can immediately
create a new one to provide the required capacity. If one of the
follow-up actions fails, though, we do not handle it. This causes issues
when the server, for whatever reason, does not start properly on the
first try: the customer is left with a shut-down server that they are
paying for, but does not receive the additional capacity for their
Kubernetes cluster.

This commit fixes the bug by awaiting all actions returned by the
create server API call and deleting the server if any of them fails.
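
For readers without the hcloud-go client at hand, the pattern can be
sketched in isolation. This is a minimal sketch, not the autoscaler's
code: `action`, `createResult`, `awaitAll`, and `errBoot` are
hypothetical stand-ins for `*hcloud.Action`, the create-server result,
the wait loop, and a boot failure, and `wait`/`cleanup` stand in for
`waitForServerAction` and `deleteServer`.

```go
package main

import (
	"errors"
	"fmt"
)

// errBoot is a hypothetical failure reported by a follow-up action.
var errBoot = errors.New("boot failed")

// action is a hypothetical stand-in for *hcloud.Action.
type action struct {
	Command string
	Err     error // non-nil if the backend reports the action as failed
}

// createResult is a hypothetical stand-in for the create-server response:
// one primary action plus any follow-up actions.
type createResult struct {
	ServerName  string
	Action      *action
	NextActions []*action
}

// awaitAll waits for the primary action and every follow-up action in order.
// On the first failure it calls cleanup for the server and returns the error.
func awaitAll(res createResult, wait func(*action) error, cleanup func(name string)) error {
	actions := append([]*action{res.Action}, res.NextActions...)
	for _, a := range actions {
		if err := wait(a); err != nil {
			cleanup(res.ServerName)
			return fmt.Errorf("failed to start server %s error: %v", res.ServerName, err)
		}
	}
	return nil
}

func main() {
	// create_server succeeds, but the follow-up start_server fails.
	res := createResult{
		ServerName:  "node-1",
		Action:      &action{Command: "create_server"},
		NextActions: []*action{{Command: "start_server", Err: errBoot}},
	}

	var deleted []string
	wait := func(a *action) error { return a.Err }
	cleanup := func(name string) { deleted = append(deleted, name) }

	err := awaitAll(res, wait, cleanup)
	fmt.Println(err, deleted)
}
```

Waiting only on `res.Action` in this sketch would report success and
leave `node-1` in place, which is the situation described above.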
apricote committed Dec 11, 2023
1 parent fe7e07d commit 19c4942
Showing 1 changed file with 11 additions and 5 deletions.
16 changes: 11 additions & 5 deletions cluster-autoscaler/cloudprovider/hetzner/hetzner_node_group.go
@@ -424,12 +424,18 @@ func createServer(n *hetznerNodeGroup) error {
 		return fmt.Errorf("could not create server type %s in region %s: %v", n.instanceType, n.region, err)
 	}

-	action := serverCreateResult.Action
 	server := serverCreateResult.Server
-	err = waitForServerAction(n.manager, server.Name, action)
-	if err != nil {
-		_ = n.manager.deleteServer(server)
-		return fmt.Errorf("failed to start server %s error: %v", server.Name, err)
+
+	actions := []*hcloud.Action{serverCreateResult.Action}
+	actions = append(actions, serverCreateResult.NextActions...)
+
+	// Delete the server if any action (most importantly create_server & start_server) fails
+	for _, action := range actions {
+		err = waitForServerAction(n.manager, server.Name, action)
+		if err != nil {
+			_ = n.manager.deleteServer(server)
+			return fmt.Errorf("failed to start server %s error: %v", server.Name, err)
+		}
 	}

 	return nil