
Vault disk full failure modes #3725

Open
kenbreeman opened this issue Dec 20, 2017 · 10 comments

Comments

@kenbreeman
Contributor

When Vault is configured with a file-based audit backend and the disk fills up, the Vault leader becomes unhealthy but fails to step down even though standby instances are available.

Environment:

  • Vault Version: 0.8.2
  • Operating System/Architecture: CentOS x64

Vault Config File:
N/A

Startup Log Output:
N/A

Expected Behavior:
When a Vault leader is no longer able to serve requests, it should step down and allow a standby to serve requests.

Actual Behavior:
A full disk prevented Vault from writing to the audit log, but the leader did not automatically step down. Issuing a manual vault step-down also failed because the step-down event could not be written to the audit log. Killing the Vault leader process was the only way to recover.

Steps to Reproduce:

  • Create a Vault HA cluster
  • Enable the file audit backend
  • Fill the disk (e.g. cat /dev/urandom > /path/to/audit/log/volume/temp.bin)
  • Observe the Vault leader failing to serve requests (e.g. vault read secret/foo)
  • Observe that the leader cannot step down cleanly (e.g. vault step-down); a consolidated sketch of these steps follows below
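
A consolidated sketch of the reproduction above, for convenience; the paths, the dedicated mount for the audit log, and the use of current CLI syntax are assumptions rather than details from the original report:

# Assumed layout: file audit device writing to /var/log/vault on its own mount.
# (On 0.8.x the equivalent CLI commands were `vault audit-enable` and `vault step-down`.)
vault audit enable file file_path=/var/log/vault/audit.log

# Fill the audit log volume:
dd if=/dev/urandom of=/var/log/vault/temp.bin bs=1M || true
df -h /var/log/vault

# Requests through the leader now fail because the audit write fails:
vault read secret/foo

# A clean step-down also fails, since the step-down request itself must be audited:
vault operator step-down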

Important Factoids:
N/A

References:
N/A

@jefferai jefferai added this to the 0.9.2 milestone Dec 26, 2017
@jefferai jefferai modified the milestones: 0.9.2, 0.9.3 Jan 17, 2018
@jefferai jefferai modified the milestones: 0.9.3, 0.9.4 Jan 28, 2018
@jefferai jefferai modified the milestones: 0.9.4, 0.10 Feb 14, 2018
@jefferai jefferai modified the milestones: 0.10, 0.10.1, 0.11 Apr 10, 2018
@chrishoffman chrishoffman modified the milestones: 0.11, near-term Aug 16, 2018
@tyrannosaurus-becks tyrannosaurus-becks self-assigned this Mar 18, 2019
@tyrannosaurus-becks tyrannosaurus-becks removed their assignment Mar 25, 2019
@catsby catsby added bug, core/audit, version/0.8.x labels Nov 7, 2019
@pbernal pbernal modified the milestones: near-term, triaged May 28, 2020
@daniilyar-incountry

Affects us with Vault version 1.5.5 as well

@vishalnayak
Contributor

Issues that are not reproducible and/or have not had any interaction for a long time are stale issues. Sometimes even valid issues remain stale, lacking traction from either the maintainers or the community. In order to provide faster responses and better engagement with the community, we strive to keep the issue tracker clean and the issue count low. In this regard, our current policy is to close stale issues after 30 days. Closed issues will still be indexed and available for future viewers. If users feel that an issue is still relevant but was wrongly closed, we encourage reopening it.

Please refer to our contributing guidelines for details on issue lifecycle.

@nvx
Contributor

nvx commented Aug 19, 2021

Just ran into this on Vault 1.7.2, definitely still an issue.

@JohnTrevorBurke

I ran into this on 1.7.2 as well. Not sure if the issue has been fixed in later versions.

@aphorise
Contributor

aphorise commented Sep 2, 2022

Hey, aren't these typically OS / platform level concerns? If there are ELK beats or Prometheus exporters specific to the host / OS environment running Vault, then any disk-space exhaustion can be monitored and alerted on separately.

There are also non-file-based audit outputs (socket) and stdout logging, as in the case of containers, where a full disk would be unlikely.

Another preventative measure is rotating logs more frequently; each rotation is also quicker (in CPU time) when performed more often than once per day for an entire day's worth of logs.

I think reporting this consistently from within Vault could be a challenge, particularly in cases where it's not possible to get accurate size reports, or where the growth is not caused by audit logs or even by Vault. If it were possible, one way of doing it could be to combine available, free, and used space (audit plus other Vault data) into some rough measure wherever storage details are available.
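
As a rough sketch of the OS-level monitoring suggested above (the path, threshold, and alerting hook are assumptions, not part of any Vault feature), a minimal periodic check on the audit volume could look like:

#!/bin/sh
# Warn when the filesystem holding the Vault audit log crosses a usage threshold.
AUDIT_PATH=/var/log/vault   # assumed audit log location
THRESHOLD=90                # percent used

USED=$(df -P "$AUDIT_PATH" | awk 'NR==2 { gsub("%", "", $5); print $5 }')
if [ "$USED" -ge "$THRESHOLD" ]; then
  logger -t vault-audit-disk "audit volume is ${USED}% full (threshold ${THRESHOLD}%)"
  # hand off to whatever alerting is already in place (ELK beats, node_exporter textfile collector, etc.)
fi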

@nvx
Contributor

nvx commented Sep 5, 2022

Hey, aren't these typically OS / platform level concerns?

Yes, but the specific behaviour I'm expecting here is that when Vault attempts to write to the audit log, encounters an error, and thus refuses to service the request because audit logging failed, it should then trigger a leader election.

I'm not expecting Vault to be monitoring disk space/etc, just to step down as a leader if it encounters a fatal audit log error.

In my particular instance, regular monitoring and logrotate settings etc. were fine for normal use; the issue only came up when some Vault clients had bugs (rate limit configuration is something that would also help a lot here) that ended up making a very large number of requests in a short amount of time. For example, #12566 is one such issue I found with Vault Agent.

In the above example, if the server had stepped down as leader it would at least have bought a longer window before a total service outage occurred, by which time the bad clients could have been addressed and rate limits implemented.

Another reason this would be good to have is that in some virtualised environments, it's not unheard of for a failing VM host or similar event to cause a VM's disks to flip to read-only (after a few errors Linux tends to give up and remount the filesystem read-only). I've thankfully not run into this on a Vault server before, but I have on a number of other VMs, and from Vault's perspective I'd expect the log filesystem becoming read-only to look more or less the same as running out of disk space. Having to wait for an operator to manually kill a bad Vault VM rather than having it fail over automatically is less than ideal.

Note that in my instance I'm using Consul as the Vault storage backend, so other storage considerations were not an issue. I'm not sure what the current failure mode is when Vault is unable to write to the raft folder, but I imagine users would expect failures of raft storage to be handled similarly (if they aren't already). I'd like to imagine anyone running Vault with raft storage has separate filesystems for logs vs raft data, so that logs filling up can't impact raft (in my setup this is also the case for logs vs the rest of the filesystem), but the other considerations still apply.
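
For what it's worth, the read-only scenario described above can be simulated without waiting for a failing host; a rough sketch, assuming the audit log sits on its own mount point:

# Force the audit log filesystem read-only, as a failing VM host might:
mount -o remount,ro /var/log/vault

# From Vault's perspective this should look much like a full disk
# (the audit write fails with EROFS instead of ENOSPC):
vault read secret/foo

# Restore the mount afterwards:
mount -o remount,rw /var/log/vault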

@kenbreeman
Contributor Author

I think the important piece of this issue is that the Audit Log feature is both critical and blocking. Vault should not serve any requests if it is unable to accurately maintain the audit log. My opinion is this SHOULD include attempted requests because those are still relevant even if Vault was unable to fulfill them.

If Vault can't write to the audit log (disk full, disk failure, config error, etc.) then it should no longer be trusted to continue running as the leader node and should halt (e.g. os.Exit(74) // EX_IOERR). Graceful step-down would be nicer, but that includes writing the step-down event to the audit log... Since we can't rely on being able to write a clear error message to the logs, I think the best option is to exit with a well-defined error code to aid in debugging.

The longer Vault runs unsealed in a bad state without being able to log audit events, the bigger the risk becomes. I think we should fail closed here. This has the added advantage of better uptime for HA clusters as well.
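
To make the exit-code idea concrete (this is a proposal, not current Vault behaviour; the wrapper and its handling of status 74 are purely illustrative), a supervisor could treat EX_IOERR specially:

#!/bin/sh
# Hypothetical: assumes Vault exits with EX_IOERR (74) when no audit device is writable.
vault server -config=/etc/vault.d/vault.hcl
STATUS=$?
if [ "$STATUS" -eq 74 ]; then
  logger -t vault-wrapper "Vault exited with EX_IOERR: audit device unwritable; not restarting until storage is fixed"
else
  logger -t vault-wrapper "Vault exited with status ${STATUS}"
fi
exit "$STATUS"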

@mpalmi mpalmi added enhancement and removed bug labels Dec 13, 2022
@mpalmi
Contributor

mpalmi commented Dec 13, 2022

Greetings! I appreciate all of the valuable input in this thread and will do my best to close the loop here. I have gone ahead and removed the bug label in favor of enhancement, as this appears to be a request for an improvement to the way Vault currently (intentionally) handles a failure scenario.

Before going into much detail about the rationale, I would like to point out the [currently subtle] recommendation to enable multiple audit devices, which should prevent this situation from being an issue.
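
For concreteness, enabling a second, independent audit device might look like the following (the paths and the choice of syslog as the second device are assumptions; the point is that a request only fails when no enabled device can record it):

# Primary file audit device:
vault audit enable file file_path=/var/log/vault/audit.log

# Independent second device, so a full disk on one path is not fatal:
vault audit enable -path=audit_syslog syslog

# Confirm both are enabled:
vault audit list -detailed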

I have a docs PR in progress (yet to be opened), which should improve the clarity around enabling audit devices and support the recommendation of having more than one audit device in production.

The assertion that Vault should fail closed when Vault fails to audit is completely justified and valid. The current behavior is attempting to achieve that, though it may fall short in some areas. There has been some internal discussion about the ways to solve some of the deficiencies with the current approach. Further design/discussion/prioritization needs to take place before any solution can be made into a reality.

Since this issue can immediately be resolved by increasing the redundancy of audit solutions, I am inclined to reclassify this as a UX "enhancement." Please stay tuned for the docs PR and any other future updates!

@robertdebock
Contributor

Still an issue with version 1.12.1+ent.

When configuring an audit device (a file on the local disk) and then making it unavailable, Vault stops working, but the health check still reports okay:

$ vault kv get kv/my-1
Error making API request.

URL: GET http://127.0.0.1:8200/v1/sys/internal/ui/mounts/kv/my-1
Code: 500. Errors:

* local node not active but active cluster node not found

The health-check returns healthy:

$ curl -v "$VAULT_ADDR/v1/sys/health?standbycode=200"
*   Trying 127.0.0.1...
* TCP_NODELAY set
* Connected to 127.0.0.1 (127.0.0.1) port 8200 (#0)
> GET /v1/sys/health?standbycode=200 HTTP/1.1
> Host: 127.0.0.1:8200
> User-Agent: curl/7.61.1
> Accept: */*
> 
< HTTP/1.1 200 OK
< Cache-Control: no-store
< Content-Type: application/json
< Strict-Transport-Security: max-age=31536000; includeSubDomains
< Date: Thu, 05 Jan 2023 09:03:51 GMT
< Content-Length: 378
< 
{"initialized":true,"sealed":false,"standby":true,"performance_standby":false,"replication_performance_mode":"unknown","replication_dr_mode":"unknown","server_time_utc":1672909431,"version":"1.12.1+ent","cluster_name":"vault_one_nodes","cluster_id":"4b9002db-a28b-982f-6413-3648d557cbd6","license":{"state":"autoloaded","expiry_time":"2023-01-28T07:04:53Z","terminated":false}}
* Connection #0 to host 127.0.0.1 left intact

I would like the health-check to report a non-200 code, maybe 502, since that's not used.
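
Until the health endpoint reflects audit failures, one possible workaround (a sketch only; the choice of endpoint and the token handling are assumptions) is to have the probe also exercise a request path that actually goes through the audit pipeline, in addition to /v1/sys/health:

# The health endpoint above kept returning 200, so pair it with an authenticated
# request that must be audited, e.g. a token self-lookup:
curl -sf -o /dev/null \
     -H "X-Vault-Token: $VAULT_TOKEN" \
     "$VAULT_ADDR/v1/auth/token/lookup-self" || echo "audited request path is failing"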

@aphorise
Contributor

  1. A discouraged feature flag like VAULT_AUDIT_ERRORS_IGNORE, and/or an equivalent HCL parameter to the same effect, may be an option here (use at your own risk).

  2. On the HTTP 200 response of /v1/sys/health, an equivalent query string like ?auditerrorcode=503 that takes higher precedence could be provided (more pressing than ?standbycode=200); see the illustration below.

Maybe PRs can be drafted for these enhancements (ignoring audit failures & reporting them on the health check).
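
Purely as an illustration of suggestion 2 (neither auditerrorcode nor VAULT_AUDIT_ERRORS_IGNORE exists today), a probe opting in to surfacing audit failures might look like:

# Hypothetical: would return 503 when the node cannot write to any audit device,
# taking precedence over standbycode=200 for that condition.
curl -s -o /dev/null -w '%{http_code}\n' \
     "$VAULT_ADDR/v1/sys/health?standbycode=200&auditerrorcode=503"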
