[inventory] Add health check information for each sled #9434

Conversation
pub reconciler_status: ConfigReconcilerInventoryStatus,
pub last_reconciliation: Option<ConfigReconcilerInventory>,
pub zone_image_resolver: ZoneImageResolverInventory,
pub smf_services_enabled_not_running: Vec<SvcNotRunning>,
I ended up not specifically calling this smf_services_in_maintenance because of the description in https://www.illumos.org/man/1/svcs:

-x    Displays explanations for service states. Without arguments, the -x option explains the states of services which:
      o  are enabled, but are not running.
      o  are preventing another enabled service from running.
I think we do want to stick to just services in maintenance, but not use -x (see my other comment).
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Sled-agent API types that changed from v10 to v11.
idk what I'm doing. Someone please check that I did this versioning thing right
I haven't inspected it closely but it looks like the right shape and the OpenAPI changes look right so I think it's right.
After the sled agent version change I tested the API on a simulated omicron to make sure everything still works:

All enabled services are healthy

Using v10 the output doesn't contain the smf_services_enabled_not_running field:

$ curl -H "api-version: 10.0.0" http://[::1]:41963/inventory | jq
<...>
"mupdate_override": {
"boot_disk_path": "/fake/path/install/mupdate_override.json",
"boot_override": {
"ok": null
},
"non_boot_status": []
}
}
}

Using v11 does:

$ curl -H "api-version: 11.0.0" http://[::1]:41963/inventory | jq
<...>
"mupdate_override": {
"boot_disk_path": "/fake/path/install/mupdate_override.json",
"boot_override": {
"ok": null
},
"non_boot_status": []
}
},
"smf_services_enabled_not_running": []
}

There is an unhealthy enabled service

$ svcadm enable svc:/site/fake-service:default

Using v10 the output doesn't contain services that aren't running:

$ curl -H "api-version: 10.0.0" http://[::1]:41963/inventory | jq
<...>
"mupdate_override": {
"boot_disk_path": "/fake/path/install/mupdate_override.json",
"boot_override": {
"ok": null
},
"non_boot_status": []
}
}
}
Using v11 does:

$ curl -H "api-version: 11.0.0" http://[::1]:41963/inventory | jq
<...>
"mupdate_override": {
"boot_disk_path": "/fake/path/install/mupdate_override.json",
"boot_override": {
"ok": null
},
"non_boot_status": []
}
},
"smf_services_enabled_not_running": [
{
"fmri": "svc:/site/fake-service:default",
"zone": "global",
"state": "maintenance",
"state_since": "2025-12-03T02:27:20Z",
"reason": "Restarting too quickly.",
"impact": "This service is not running.",
"additional_info": [
"http://illumos.org/msg/SMF-8000-L5",
"/var/svc/log/site-fake-service:default.log"
]
}
]
}
// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Utilities for reporting SMF services' status.
I didn't know about this, but it looks like we have an smf crate:
https://github.com/oxidecomputer/smf
and it supports a Query operation that seems to basically be svcs:
https://docs.rs/smf/latest/smf/struct.Query.html
I think one problem with that is that it doesn't seem to support -Z (query all zones). It either does the current zone or a specific zone. But it looks like it could be easy to add -Z and then you could use that here?
If for whatever reason we don't want to do that: I think better than trying to parse svcs -Zx would be using svcs -Za -H -o fmri,state. svcs -xv seems much harder to parse and less stable. (It looks like neither is guaranteed to be stable but it would be much more surprising for -H -o to change.)
Ha! I was not aware of this smf crate either, thanks. I took a look and I think I'm going to go with the second option though. The from_str method there returns an error if it is unable to parse a line of text. My intention here was to skip over any missing fields or lines that couldn't be parsed, and fill out as much of the struct as possible.
I wonder about the additional information that svcs -x provides though. Would it not be useful for support to have the time when the service went into maintenance, whether it is in a degraded state, or the reason why the service is in such a state? Or is this information available to support elsewhere?
> Ha! I was not aware of this smf crate either, thanks. I took a look and I think I'm going to go with the second option though. The from_str method there returns an error if it is unable to parse a line of text. My intention here was to skip over any missing fields or lines that couldn't be parsed, and fill out as much of the struct as possible.
Hmm, that's interesting. This makes me wonder if what we want is actually a list of errors, which would be an enum with two variants: one is like ServiceInMaintenance and the other is like SvcsFailed. Then you could still use the existing thing. We wouldn't have information in the case that svcs failed, but that's already true if the whole command fails, and I believe this really shouldn't happen in practice. (If it does, we'll be able to see that and fix the bug.) I'm not totally sold on this, but it also doesn't seem great to duplicate this code just for an error case that shouldn't ever happen.
> I wonder about the additional information that svcs -x provides though. Would it not be useful for support to have the time when the service went into maintenance, whether it is in a degraded state, or the reason why the service is in such a state? Or is this information available to support elsewhere?
I assume support is going to need a support bundle to debug a service in maintenance. Ultimately, the information here just feeds a boolean to the user about whether to call support, and the value is really for the cases where there's no service in maintenance. (Right?)
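For what it's worth, the two-variant "list of errors" idea could be sketched roughly like this. The variant names ServiceInMaintenance and SvcsFailed come from the comment above; the enum name and fields are illustrative guesses, not omicron's actual types:

```rust
/// Sketch of the proposed "list of errors" shape. Variant names come from
/// the discussion; everything else is a hypothetical placeholder.
#[derive(Debug)]
enum SledServiceHealth {
    /// A specific service that is enabled but sitting in maintenance.
    ServiceInMaintenance { fmri: String, zone: String },
    /// Running or parsing `svcs` itself failed; service state is unknown.
    SvcsFailed { message: String },
}

fn main() {
    let report = vec![
        SledServiceHealth::ServiceInMaintenance {
            fmri: "svc:/site/fake-service:default".to_string(),
            zone: "global".to_string(),
        },
        SledServiceHealth::SvcsFailed {
            message: "svcs exited with a non-zero status".to_string(),
        },
    ];
    // A SvcsFailed entry doesn't mean a service is unhealthy, only that we
    // couldn't find out; callers can count the two cases separately.
    let in_maintenance = report
        .iter()
        .filter(|e| matches!(e, SledServiceHealth::ServiceInMaintenance { .. }))
        .count();
    assert_eq!(in_maintenance, 1);
}
```

The ambiguity raised below (does a SvcsFailed entry fail the health check or not?) is exactly the question a consumer of this enum would have to answer explicitly.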
> Hmm, that's interesting. This makes me wonder if what we want is actually a list of errors, which would be an enum with two variants: one is like ServiceInMaintenance and the other is like SvcsFailed. Then you could still use the existing thing. We wouldn't have information in the case that svcs failed, but that's already true if the whole command fails, and I believe this really shouldn't happen in practice. (If it does, we'll be able to see that and fix the bug.) I'm not totally sold on this, but it also doesn't seem great to duplicate this code just for an error case that shouldn't ever happen.
Hmmm... Perhaps? I'll play around with this idea. But yeah, duplicating code is not the best. I'm a little concerned that we may be shoehorning what already exists into an enum that may not feel as natural. Let's say the parsing fails midway through a service in maintenance we want to collect: instead of just filling the struct with as much information as we have, we end up with a SvcsFailed. Would we then report the health check as failed or not? There could also be the case where parsing fails for a service that is healthy; we would also get a SvcsFailed there, but in that case it doesn't mean the health check failed.
Thinking about all of this, I'm more inclined to use svcs -Za -H -o fmri,state for less ambiguity. Which brings me to your next point
> I assume support is going to need a support bundle to debug a service in maintenance. Ultimately, the information here just feeds a boolean to the user about whether to call support, and the value is really for the cases where there's no service in maintenance. (Right?)
If this is the case, then svcs -Za -H -o fmri,state should be enough, no?
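The lenient parsing approach discussed above (take `svcs -Za -H -o fmri,state`-style output, skip lines that don't parse, keep only services in maintenance) could look roughly like this. This is a sketch only: the struct and function names are hypothetical, not the PR's actual code, and a real version would also carry the zone column:

```rust
/// Hypothetical type for one row of `svcs -H -o fmri,state` output.
#[derive(Debug, PartialEq)]
struct SvcState {
    fmri: String,
    state: String,
}

/// Parse `svcs -Za -H -o fmri,state`-style output, silently skipping any
/// line that doesn't have both columns, and keep only services whose
/// state is "maintenance".
fn services_in_maintenance(output: &str) -> Vec<SvcState> {
    output
        .lines()
        .filter_map(|line| {
            let mut cols = line.split_whitespace();
            Some(SvcState {
                fmri: cols.next()?.to_string(),
                state: cols.next()?.to_string(),
            })
        })
        .filter(|svc| svc.state == "maintenance")
        .collect()
}

fn main() {
    let sample = "\
svc:/system/filesystem/local:default  online
svc:/site/fake-service:default        maintenance
line-missing-a-state-column
";
    let bad = services_in_maintenance(sample);
    assert_eq!(bad.len(), 1);
    assert_eq!(bad[0].fmri, "svc:/site/fake-service:default");
}
```

Because `-H -o` output is plain whitespace-separated columns, the "skip what doesn't parse" behavior reduces to a couple of `filter_map` lines, with none of the ambiguity of parsing `svcs -xv` prose.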
let smf_services_enabled_not_running =
    Svcs::enabled_not_running(&self.log).await?;
I don't think the simulated sled agent should call out to the real svcs command to see what services are in maintenance. It should probably just fake up a response indicating no services in maintenance. For testing, we might want a way to get it to report some services in maintenance, but we may not really need that since we can create fake inventory collections instead.
Ugh yeah sorry, I was using this to test on my machine without having to spin up an a4x2 environment every time I tried out the endpoint and forgot to remove it.
Yeah, I can see the appeal of testing this in the simulated environment. When I've wanted to do something like this in the past, I added an API to the simulated sled agent that would let you configure its behavior. I could see doing something like that in the limit, but I could also see it not really being worth it -- whatever you think.
I'll give it a go and see if it doesn't take too long to implement. This may be useful for any further health checks.
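One possible shape for that seam, so the simulated sled agent never shells out to the real svcs command, is a trait with a real and a simulated implementation. All names below are illustrative, not omicron's actual simulated sled agent API:

```rust
/// Hypothetical seam between the real and simulated sled agents.
trait SmfHealth {
    fn services_in_maintenance(&self) -> Vec<String>;
}

/// Simulated implementation: reports a configurable, fixed answer, which
/// a test API on the sim agent could later mutate.
struct SimSmfHealth {
    fake_in_maintenance: Vec<String>,
}

impl SmfHealth for SimSmfHealth {
    fn services_in_maintenance(&self) -> Vec<String> {
        self.fake_in_maintenance.clone()
    }
}

fn main() {
    // Default sim behavior: nothing in maintenance.
    let sim = SimSmfHealth { fake_in_maintenance: Vec::new() };
    assert!(sim.services_in_maintenance().is_empty());

    // A test could configure the sim to report a broken service instead.
    let sim = SimSmfHealth {
        fake_in_maintenance: vec!["svc:/site/fake-service:default".to_string()],
    };
    assert_eq!(sim.services_in_maintenance().len(), 1);
}
```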
let smf_services_enabled_not_running =
    Svcs::enabled_not_running(&self.log).await?;
I'm curious for @jgallagher's thoughts on this but I feel like it would be better if this function remained non-async and just had this information available already. Maybe the config-reconciler could collect it? Or some other background task?
There are a few things I'm worried about here:
- a stuck svcs call blocks our ability to collect any inventory at all
- if a svcs call gets stuck, every call to inventory will accumulate another one
If we instead put this into a separate background task, there can be only one at a time, and we can always report the latest state, even if it's old.
Strongly agreed; failing to collect inventory because we're stuck in a hung SMF interaction would be quite painful (since inventory is supposed to report that kind of problem in the first place).
I'm less sure about putting it in the config reconciler vs spawning a new background task to do it. My gut reaction was to put it in a separate task, but since the config reconciler is responsible for starting up zones I could see it wanting to know whether any services in that zone are in trouble.
Yeah, that all makes sense.
> I'm less sure about putting it in the config reconciler vs spawning a new background task to do it. My gut reaction was to put it in a separate task, but since the config reconciler is responsible for starting up zones I could see it wanting to know whether any services in that zone are in trouble.
This comment makes me think having it in the config-reconciler would be more useful?
> This comment makes me think having it in the config-reconciler would be more useful?
(Recapping a live conversation) I thought so too until we were chatting, and realized that the config-reconciler is mostly dormant. It only runs when there's been a change it needs to reconcile (new config, change in hardware, ...) or if the previous attempt to reconcile didn't complete successfully (in which case it continues retrying periodically until it does succeed, at which point it goes dormant again waiting for changes). We want health check info to be fresher than that, so a new background task seems right.
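A rough sketch of that background-task shape, using a plain thread and a shared snapshot for brevity (the real sled-agent would presumably use a tokio task; all names here are hypothetical):

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

/// Latest health snapshot; inventory collection only ever reads this.
#[derive(Clone, Debug, Default)]
struct HealthSnapshot {
    services_in_maintenance: Vec<String>,
}

/// Single periodic task: even if one `svcs` invocation hangs, only this
/// task stalls; inventory keeps serving the last snapshot, just stale.
fn spawn_health_task(snapshot: Arc<RwLock<HealthSnapshot>>) {
    thread::spawn(move || loop {
        // The real task would shell out to `svcs -Za -H -o fmri,state`
        // here; we fake an empty result for the sketch.
        let latest = HealthSnapshot { services_in_maintenance: Vec::new() };
        *snapshot.write().unwrap() = latest;
        thread::sleep(Duration::from_secs(60));
    });
}

fn main() {
    let snapshot = Arc::new(RwLock::new(HealthSnapshot::default()));
    spawn_health_task(Arc::clone(&snapshot));
    thread::sleep(Duration::from_millis(50));
    // Reading never blocks on svcs; it just takes whatever is latest.
    assert!(snapshot.read().unwrap().services_in_maintenance.is_empty());
}
```

This avoids both worries above: there is exactly one in-flight svcs call at a time, and inventory collection never waits on it.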
This is the first PR for #9412.
This PR only implements checking SMF service health for each sled. There will be follow-up PRs that deal with adding the data types to the DB.
NB: This PR is less than 1000 lines, not ~10,000. Most of the changes here are API JSON additions.
Manual testing on a simulated omicron:
- System has two unhealthy enabled services
- Disabling one of the unhealthy services should only show a single entry
- Disabling the other service should not return anything