Leader unresponsive, unavailable #29033
Comments
In addition to the details already shared, the follower's logs have no messages indicating any communication failures to the leader node. Also, while trying to log in to Vault from the follower node, it returned a 500. Below is the error that was returned:
Hi, thanks for the detailed report! As recently discussed in #28846, Vault doesn't have a mechanism to react to a full disk; the Raft heartbeats continue to work in this situation, so a leader election isn't triggered. You're right that in this situation the step-down command gets forwarded to the unresponsive leader, and since it cannot write to the audit device it won't even attempt the operation. For situations like this, forcing a leader election by shutting down the node in some way is a good approach. Another, slightly different option you had was to increase the volume size, which might require figuring out the SSH problem in order to grow the partition at the OS level, or, if you're using an AMI that supports partition resizing during startup, you can just restart the node. If none of that is possible, you could also use the peers.json approach to restart the healthy nodes and make them forget about the unresponsive node. Generally, infrastructure monitoring should be used to alert on such OS-level conditions, giving you time to increase the disk size before the cluster goes offline.
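For illustration, here is a rough sketch of what that peers.json recovery could look like; the node IDs, addresses, and data path are placeholders, not values from this issue. On each surviving node you would stop Vault, write the file into the raft directory listing only the members to keep, and restart; the file is consumed on startup.

```shell
# Hypothetical example: stop Vault on each healthy node first, then write the
# recovery file into the raft data directory (path depends on your storage config).
cat > /opt/vault/data/raft/peers.json <<'EOF'
[
  { "id": "node-2", "address": "10.0.1.12:8201", "non_voter": false },
  { "id": "node-3", "address": "10.0.1.13:8201", "non_voter": false }
]
EOF
# Restart Vault; the recovery file is read and removed during startup.
systemctl restart vault
```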
Thank you for the response. That design decision seems to undermine the reasons to deploy a redundant, highly available cluster, if a single node failure (for any reason) can take the whole cluster down. The peers.json recovery approach is useful, thank you for the info. With the AMI and EC2 instance in this case, expanding the disk isn't possible, and some change (the AMI?) took place such that a Linux command used by logrotate no longer worked. Either way, this is fragile. We are considering moving logging to an external drive to work around this Vault failure mode, but that will take a while to get to production. Monitoring is in place, but we need automated recovery, since we can't assume anyone is watching (until everything stops, of course). We think the root cause is known now (the logrotate script stopped working), but what else can kill the leader and make the whole thing stop working: are there other scenarios? A few questions:
As for infrastructure monitoring, the exhaustion of other resources like CPU and RAM will result in leader elections; it's really just the disk issue that can cause this unusable-leader scenario. That's why it's important to monitor things like disk utilization and volume health (for example, if AWS marks the volume as degraded).
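As a purely illustrative sketch of that kind of monitoring (mount point, threshold, and alert mechanism are all assumptions), a periodic disk-utilization check could look like this:

```shell
#!/bin/sh
# Hypothetical check: warn when the volume holding Vault's data and audit logs
# crosses a utilization threshold, well before writes start failing.
MOUNT=/opt/vault
THRESHOLD=80
USAGE=$(df -P "$MOUNT" | awk 'NR==2 {print $5}' | tr -d '%')
if [ "$USAGE" -ge "$THRESHOLD" ]; then
  echo "WARNING: ${MOUNT} is ${USAGE}% full" | logger -t vault-disk-check
fi
```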
Yes, you can disable audit logs, but do note the limitations section. Also, by enabling multiple audit devices you can make it so that Vault processes a request as long as it has logged the request to any one of the configured audit devices, so a failure of any single one won't make Vault unavailable. As for disabling this "blocked audit device" behavior when there's a single audit device configured, that's not possible: Vault's design considers recording requests and responses in audit devices to be critical from a security perspective.
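For example, a second file audit device can be enabled alongside the first (the paths and mount names here are hypothetical); with two devices, a single full or broken log target no longer blocks requests:

```shell
# Sketch only: enable two file audit devices on different volumes so that Vault
# can keep serving as long as at least one of them accepts writes.
vault audit enable -path=file_local file file_path=/var/log/vault/audit.log
vault audit enable -path=file_efs file file_path=/mnt/audit-efs/audit.log
```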
I raised this discussion internally and there were mixed opinions. On one hand, it seems clear that expanding Vault's responsibility to monitor the environment it runs in, as a replacement for infrastructure monitoring, is not something we'd like to prioritize, since we'd never be able to do enough in that space to fully replace monitoring. Also, assuming all nodes are deployed the same way, any improvement here would only delay the time until the cluster goes offline: one node would fill up after another, and eventually they'd all have full disks in the absence of monitoring and operator intervention. On the other hand, if there's a simple way to consistently detect this condition and yield leadership in order to buy some extra time, we should probably investigate it. I'll try to get that prioritized.
Well, for configuration changes like removing a peer, the protocol describes how it should be done, and it involves having the leader start the operation. As I mentioned earlier, from a Raft perspective the cluster is fine; to all the followers the leader looks completely healthy. As for the step-down command, it must be executed on the active/leader node because that node must voluntarily yield leadership. Something I just noticed, though, is that the
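For reference, this is roughly how the step-down operation has to be invoked; the address below is a placeholder, the point being that the CLI must reach the active node itself for the command to take effect:

```shell
# Sketch: step-down only works when addressed to the current leader, which must
# voluntarily yield leadership (a token with sufficient privileges is assumed).
export VAULT_ADDR="https://active-node.example.internal:8200"
vault operator step-down
```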
Thank you for the detailed response and investigation.
Interesting. Seems odd that disk is excluded; but it is what it is.
Thank you. We are likely to move the audit log mount to EFS to remove exposure to this bricking behavior/defect in a future release, but we could also just drop audit logs entirely to avoid it.
Understood, and thanks. Having the health degrade early enough, even if the problem walked through all the cluster members, seems like it would still let an operator jump into the nodes and fix them one by one, instead of the current "bricking". I hope folks will find some improvement. In general, our position is self-healing in addition to monitoring, because of the human limitations of monitoring alone.
In general, because of the leader/bricking choice, it feels like admin operators need a way to bring a cluster back from death... (short of what I did). It would be great if a class of commands worked on the cluster from ANY working node, FWIW. Thank you again for the engagement and discussion.
Environment:
Vault version: 1.17.4
OS: Linux
Vault Config File:
Startup Log Output:
Expected Behavior:
Ability to change the leader when the leader is unresponsive. It appears the operator step-down command is redirected from a healthy node to the leader node, and is thus unresponsive as well.
Actual Behavior:
Steps to Reproduce:
Unfortunately I do not have the means to reproduce, or logs from the killed node.
Important Factoids:
References: