Leader unresponsive, unavailable #29033

Open
ellisroll-b opened this issue Nov 26, 2024 · 5 comments
Labels
core/ha (specific to high-availability), question

Comments

@ellisroll-b

Environment:

  • Vault Version:
    1.17.4
  • Operating System/Architecture:
    Linux

Vault Config File:

# Paste your Vault config here.
# Be sure to scrub any sensitive values
ui                                  = false
disable_mlock                       = true                            #integrated storage - mlock should NOT be enabled
cluster_addr                        = "http://TEMPLATE_HOST_IP:8201"  #must include protocol, replaced by auto_join
api_addr                            = "http://TEMPLATE_HOST_IP:8200"  #must include protocol, replaced by auto_join
enable_response_header_hostname     = "true"
enable_response_header_raft_node_id = "true"

# logging at info level, to files in the listed directory with names of /var/log/vault/vault-epoch.log, where epoch is a timestamp
# max log file of 5M, and max number of older than current at 2
log_level            = "info"
log_file             = "/var/log/vault/"
log_format           = "json"
log_rotate_bytes     = 5242880
log_rotate_max_files = 2

storage "raft" {
  path      = "/mnt/data/vault"
  node_id   = "TEMPLATE_EC2_INST_ID"
  retry_join {
    auto_join        = "provider=aws region=TEMPLATE_AWS_REGION tag_key=VAJC tag_value=TEMPLATE_ACCOUNT_ID-TEMPLATE_DC_NAME-vault-cluster addr_type=private_v4"
    auto_join_scheme = "http"
  }
}

# HTTP listener
listener "tcp" {
  address = "0.0.0.0:8200"
  tls_disable = 1
}

# HTTPS listener
#listener "tcp" {
#  address       = "0.0.0.0:8200"
#  tls_cert_file = "/opt/vault/tls/tls.crt"
#  tls_key_file  = "/opt/vault/tls/tls.key"
#  #tls_client_ca_file = "/opt/vault/tls/vault-ca.pem"
#}

# AWS KMS auto unseal
seal "awskms" {
  region      = "TEMPLATE_AWS_REGION"
  kms_key_id  = "TEMPLATE_KMS_KEY_ID"
}

Startup Log Output:

# Paste your log output here

Expected Behavior:

The ability to change the leader when the leader is unresponsive. It appears the operator step-down command is redirected from a healthy node to the leader node, and so it is unresponsive as well.
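
A rough sketch of what this looked like from a healthy follower (the address is a placeholder, not our real one):

# On a healthy follower; standby nodes forward these requests to the
# active node, so both hang while the leader is wedged.
export VAULT_ADDR=http://10.0.0.12:8200
vault login -method=userpass username=admin   # forwarded to the leader, times out
vault operator step-down                      # also forwarded to the leader, hangs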

Actual Behavior:

  1. Vault logins started failing from our other services (AppRole).
  2. Even though we have log rotation/truncation, a developer looked at the exported logs (we export to an S3 bucket) of one node (of 5) in the cluster and found an error saying it was out of disk space. That node happened to be the leader.
  3. We attempted to jump into the instance (a Docker container running on an AWS EC2 instance) and could not connect to the EC2 instance at all. AWS reported the instance as healthy and fine.
  4. We jumped into one of the other instances (of 5) and attempted a local login. This was redirected to the unresponsive leader and timed out.
  5. We needed to get back to health, so we killed the leader's EC2 instance. Vault began working again (an election happened), a new EC2 instance spun up, and the replacement node joined the cluster.
  6. However, this operational response lost the EC2 instance, and with it the ability to debug what was happening in the node or the Docker container. My own theory is that networking on the EC2 instance was broken and it had not yet reached the point where AWS reported the instance as unhealthy.
  7. Questions:
  • What does one do if the leader is fully unresponsive? We can't log in, and per the operator step-down documentation we could not issue the command even if we were logged in, since it would be redirected to the unresponsive leader.
  • How does Vault determine that the leader is no longer functional? I apologize if this is documented somewhere, but I have not found it.
  • Are there any other hammers, short of what we did, to resolve such an issue?

Steps to Reproduce:

Unfortunately I do not have the means to reproduce this, nor logs from the killed node.

Important Factoids:

References:

@maheshpoojaryneu commented Nov 26, 2024

In addition to the details already shared, the follower's logs have no messages indicating any communication failures with the leader node. Also, while trying to log in to Vault from the follower node, the request returned a 500. Below is the error that was returned:

Password (will be hidden):
Error authenticating: Error making API request.

URL: PUT http://127.0.0.1:8200/v1/auth/userpass/login/admin
Code: 500. Errors:

* internal error

@bosouza (Contributor) commented Nov 29, 2024

Hi, thanks for the detailed report! As recently discussed in #28846, Vault doesn't have a mechanism to react to a full disk; the Raft heartbeats continue to work in this situation, so a leader election isn't triggered. You're right that in this situation the step-down command would get forwarded to the unresponsive leader, and since that leader cannot write to its audit device it won't even attempt the operation.

For situations like this, forcing a leader election by shutting down the node in some way is a good approach. Another, slightly different, option would have been to increase the volume size, which might require figuring out the SSH problem in order to grow the partition at the OS level; or, if you are using an AMI that supports partition resizing during startup, you can simply restart the node. If none of that is possible, you can also use the peers.json recovery approach to restart the healthy nodes and make them forget about the unresponsive node, as sketched below.
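
Roughly, that recovery looks like the following sketch (node IDs and addresses are placeholders, and the path assumes the /mnt/data/vault raft path from your config). The file lists only the peers you want to keep, is written on each remaining node while Vault is stopped, and is read and then deleted on the next startup:

# On each healthy node, with Vault stopped:
cat > /mnt/data/vault/raft/peers.json <<'EOF'
[
  { "id": "i-0aaaaaaaaaaaaaaaa", "address": "10.0.1.10:8201", "non_voter": false },
  { "id": "i-0bbbbbbbbbbbbbbbb", "address": "10.0.1.11:8201", "non_voter": false },
  { "id": "i-0cccccccccccccccc", "address": "10.0.1.12:8201", "non_voter": false },
  { "id": "i-0dddddddddddddddd", "address": "10.0.1.13:8201", "non_voter": false }
]
EOF
# Then restart Vault on these nodes; the unresponsive node is no longer a peer.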

Generally, infrastructure monitoring should be used to alert on such OS-level conditions, giving you time to increase the disk size before the cluster goes offline; even a check as simple as the one sketched below, wired into your alerting, would give early warning.
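
Illustrative only, nothing Vault-specific: the 80% threshold and the mount points below are arbitrary choices.

#!/bin/sh
# Warn when a Vault-related mount crosses a utilization threshold.
THRESHOLD=80
for MOUNT in /mnt/data/vault /var/log/vault; do
  USED=$(df --output=pcent "$MOUNT" | tail -n 1 | tr -dc '0-9')
  if [ "$USED" -ge "$THRESHOLD" ]; then
    echo "WARNING: $MOUNT is at ${USED}% utilization; grow the volume before Vault's audit or log writes start failing"
  fi
done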

@bosouza added the question and core/ha (specific to high-availability) labels on Nov 29, 2024
@ellisroll-b (Author) commented
Thank you for the response. It seems like this design decision undermines the reasons for deploying a redundant or highly available cluster, if one node failing (for any reason) can take the whole cluster down.

The peers.json recovery approach is useful, thank you for the info. With the AMI and EC2 instance in this case, expanding the disk isn't possible, and some change (to the AMI?) took place after which a Linux command used by logrotate no longer worked. Either way, this is fragile. We are considering moving logging to an external drive to work around this Vault failure mode, but that will take a while to get to production.

Monitoring is in place, but we need automated recovery, since it cannot be assumed that anyone is watching (until the world stops, of course). We think the root cause is known now (the logrotate script stopped working), but what else can kill the leader and make the whole cluster stop working? Are there other scenarios?

A few questions:

  1. Can I turn this behavior off (stopping work because of audit file path errors of any kind)? If so, how? Staying operational is always the first priority in our system; five nodes becoming dysfunctional because of one is not worth any logging model decision.

  2. Design: Why wouldn't this mark the leader node as unhealthy and force the election of a new leader? I believe you stated that Raft still works in this condition, so it feels like the state of a dead leader should be known to the cluster.

  3. Design: For operator actions, why are these forwarded to the dead leader? Why would operator actions not be possible on any node, at any time?

@bosouza (Contributor) commented Dec 4, 2024

We think root cause is known now (logrotate script stopped working), but what else can kill the leader and make the whole thing stop working - are there other scenarios?

As for infrastructure monitoring: exhaustion of other resources like CPU and RAM will result in leader elections; it's really only the disk issue that can cause this useless-leader scenario. That's why it's important to monitor things like disk utilization and volume health (for example, when AWS marks a volume as degraded).

  1. Can I turn this behavior off (stopping work because of audit file path errors of any kind)? If so, how? Staying operational is always the first priority in our system; five nodes becoming dysfunctional because of one is not worth any logging model decision.

Yes, you can disable audit logs, though do note the limitations section. Also, by enabling multiple audit devices you can make it so that Vault processes a request as long as it has logged the request to at least one of the configured audit devices, so a failure of any single device won't make Vault unavailable; a sketch is below. As for disabling this "blocked audit device" behavior when only a single audit device is configured: that's not possible. Vault's design considers recording requests and responses in audit devices to be critical from a security perspective.
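
For example (the paths here are just illustrations), a second file audit device on a separate mount can be enabled alongside the first:

# Two file audit devices on separate mounts; a request succeeds as long as
# at least one of them records it.
vault audit enable file file_path=/var/log/vault/audit.log
vault audit enable -path=file_efs file file_path=/mnt/efs/vault/audit.log
vault audit list -detailed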

  2. Design: Why wouldn't this mark the leader node as unhealthy and force the election of a new leader? I believe you stated that Raft still works in this condition, so it feels like the state of a dead leader should be known to the cluster.

I raised this discussion internally and there were mixed opinions. On the one hand, it seems clear that expanding Vault's responsibility to monitor the environment where it's running, as a replacement for infra monitoring, is not something we'd like to prioritize, since we won't ever be able to do enough in that space to fully replace monitoring. Also, assuming all nodes are deployed the same, any improvement here would only delay the time until the cluster goes offline, since one node after another would fill up and eventually they'd all have full disks in the absence of monitoring and operator intervention. On the other hand, if there's a simple way to consistently detect this condition and yield leadership in order to buy some extra time, we should probably investigate it. I'll try to get that prioritized.

  3. Design: For operator actions, why are these forwarded to the dead leader? Why would operator actions not be possible on any node, at any time?

Well, for configuration changes like removing a peer, the Raft protocol describes how it should be done, and it involves having the leader drive the operation. As I mentioned earlier, from a Raft perspective the cluster is fine; to all the followers, the leader looks completely healthy. As for the step-down command, that must be executed on the active/leader node because that node must voluntarily yield leadership. Something I just noticed, though, is that the step-down endpoint is documented as non-audited, so it wouldn't make sense for it to get stuck on the blocked audit device; I'll check whether that info is correct.
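
For reference, a sketch of the operator commands discussed here (<node_id> is a placeholder); the membership change and the step-down both go through the active node, which is why they stall when the leader is wedged:

vault operator raft list-peers              # node IDs, addresses, and which node is the leader
vault operator raft remove-peer <node_id>   # membership change, driven by the leader
vault operator step-down                    # asks the active node to voluntarily yield leadership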

@ellisroll-b (Author) commented Dec 4, 2024

Thank you for the detailed response and investigation.

...it's just really the disk issue that can cause this useless leader scenario.

Interesting. It seems odd that disk is excluded, but it is what it is.

Yes, you can disable audit logs...

Thank you. We are likely to move the audit log mount to EFS in a future release, to remove the bricking behavior and exposure to this defect, but we could also just drop audit logs entirely to avoid it.

I raised this discussion internally and there were mixed opinions....

Understood, and thanks. Having the node's health flagged as bad early enough, even if the problem walked through all the cluster members, seems like it would still let an operator jump into the nodes and fix them one by one, instead of the current "bricking". I hope folks will find some improvement. In general our position is self-healing in addition to monitoring, because of the human limitations of monitoring alone.

...step-down endpoint is ...

In general, because of the leader/bricking choice, it feels like admin operators need a way to bring a cluster back from death (short of what I did). It would be great if a class of commands worked on the cluster from ANY working node, FWIW.

Thank you again for the engagement and discussion.
