@karencfv karencfv commented Nov 21, 2025

This is the first PR for #9412

This PR only implements checking SMF service health for each sled. There will be follow-up PRs that deal with adding the data types to the DB.

NB: This PR is less than 1000 lines, not ~10,000. Most of the changes here are API JSON additions.

Manual testing on a simulated omicron

System has two unhealthy enabled services

$ pfexec svcs -Zx
svc:/site/fake-service:default (?)
  Zone: global
 State: maintenance since Mon Dec  1 20:53:40 2025
Reason: Restarting too quickly.
   See: http://illumos.org/msg/SMF-8000-L5
   See: /var/svc/log/site-fake-service:default.log
Impact: This service is not running.

svc:/system/omicron/baseline:default (Omicron brand baseline generation)
  Zone: global
 State: maintenance since Mon Dec  1 21:23:43 2025
Reason: Start method failed repeatedly, last died on Killed (9).
   See: http://illumos.org/msg/SMF-8000-KS
   See: omicron1(7)
   See: /var/svc/log/system-omicron-baseline:default.log
Impact: This service is not running.

$ curl -H "api-version: 10.0.0"  http://[::1]:32824/inventory | jq
<...>
  "smf_services_enabled_not_running": [
    {
      "fmri": "svc:/site/fake-service:default",
      "zone": "global",
      "state": "maintenance",
      "state_since": "2025-12-01T20:53:40Z",
      "reason": "Restarting too quickly.",
      "impact": "This service is not running.",
      "additional_info": [
        "http://illumos.org/msg/SMF-8000-L5",
        "/var/svc/log/site-fake-service:default.log"
      ]
    },
    {
      "fmri": "svc:/system/omicron/baseline:default",
      "zone": "global",
      "state": "maintenance",
      "state_since": "2025-12-01T21:23:43Z",
      "reason": "Start method failed repeatedly, last died on Killed (9).",
      "impact": "This service is not running.",
      "additional_info": [
        "http://illumos.org/msg/SMF-8000-KS",
        "omicron1(7)",
        "/var/svc/log/system-omicron-baseline:default.log"
      ]
    }
  ]
}

Disabling one of the unhealthy services should only show a single entry

$ svcadm disable svc:/site/fake-service:default
$ pfexec svcs -Zx
svc:/system/omicron/baseline:default (Omicron brand baseline generation)
  Zone: global
 State: maintenance since Mon Dec  1 21:23:43 2025
Reason: Start method failed repeatedly, last died on Killed (9).
   See: http://illumos.org/msg/SMF-8000-KS
   See: omicron1(7)
   See: /var/svc/log/system-omicron-baseline:default.log
Impact: This service is not running.
$ curl -H "api-version: 10.0.0"  http://[::1]:32824/inventory | jq
<...>
  "smf_services_enabled_not_running": [
    {
      "fmri": "svc:/system/omicron/baseline:default",
      "zone": "global",
      "state": "maintenance",
      "state_since": "2025-12-01T21:23:43Z",
      "reason": "Start method failed repeatedly, last died on Killed (9).",
      "impact": "This service is not running.",
      "additional_info": [
        "http://illumos.org/msg/SMF-8000-KS",
        "omicron1(7)",
        "/var/svc/log/system-omicron-baseline:default.log"
      ]
    }
  ]
}

Disabling the other service should not return anything

$ svcadm disable svc:/system/omicron/baseline:default
$ pfexec svcs -Zx
$ curl -H "api-version: 10.0.0"  http://[::1]:32824/inventory | jq
<...>
  "smf_services_enabled_not_running": []
}
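
From the JSON entries above, the inventory item type presumably has a shape roughly like the following. This is a sketch inferred from the output, not the PR's actual definition: the real `SvcNotRunning` likely derives serde traits and may use a proper timestamp type for `state_since`.

```rust
// Sketch of a type matching the JSON entries above. Field names are taken
// from the inventory output; everything else here is illustrative.
#[derive(Debug)]
struct SvcNotRunning {
    fmri: String,
    zone: String,
    state: String,
    // The real type may use a chrono/time timestamp rather than a String.
    state_since: String,
    reason: String,
    impact: String,
    additional_info: Vec<String>,
}

fn main() {
    let svc = SvcNotRunning {
        fmri: "svc:/site/fake-service:default".to_string(),
        zone: "global".to_string(),
        state: "maintenance".to_string(),
        state_since: "2025-12-01T20:53:40Z".to_string(),
        reason: "Restarting too quickly.".to_string(),
        impact: "This service is not running.".to_string(),
        additional_info: vec!["http://illumos.org/msg/SMF-8000-L5".to_string()],
    };
    assert_eq!(svc.state, "maintenance");
}
```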

@karencfv karencfv marked this pull request as ready for review December 2, 2025 07:37
@karencfv karencfv requested a review from davepacheco December 2, 2025 07:38
pub reconciler_status: ConfigReconcilerInventoryStatus,
pub last_reconciliation: Option<ConfigReconcilerInventory>,
pub zone_image_resolver: ZoneImageResolverInventory,
pub smf_services_enabled_not_running: Vec<SvcNotRunning>,
Contributor Author

I ended up not specifically calling this smf_services_in_maintenance because of the description in https://www.illumos.org/man/1/svcs

   -x
                      Displays explanations for service states.

                      Without arguments, the -x option explains the states
                      of services which:

                          o      are enabled, but are not running.

                          o      are preventing another enabled service
                                 from running.

Collaborator

I think we do want to stick to just services in maintenance, but not use -x (see my other comment).

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Sled-agent API types that changed from v10 to v11.
Contributor Author

idk what I'm doing. Someone please check that I did this versioning thing right

Collaborator

I haven't inspected it closely but it looks like the right shape and the OpenAPI changes look right so I think it's right.

karencfv commented Dec 3, 2025

After the sled agent version change I tested the API on a simulated omicron to make sure everything still works:

All enabled services are healthy

Using v10 the output doesn't contain the smf_services_enabled_not_running field:

$ curl -H "api-version: 10.0.0"  http://[::1]:41963/inventory | jq
<...>
    "mupdate_override": {
      "boot_disk_path": "/fake/path/install/mupdate_override.json",
      "boot_override": {
        "ok": null
      },
      "non_boot_status": []
    }
  }
}

Using v11 does

$ curl -H "api-version: 11.0.0"  http://[::1]:41963/inventory | jq
<...>
    "mupdate_override": {
      "boot_disk_path": "/fake/path/install/mupdate_override.json",
      "boot_override": {
        "ok": null
      },
      "non_boot_status": []
    }
  },
  "smf_services_enabled_not_running": []
}

There is an unhealthy enabled service

$ svcadm enable svc:/site/fake-service:default

Using v10 the output doesn't contain services that aren't running:

$ curl -H "api-version: 10.0.0"  http://[::1]:41963/inventory | jq
<...>
    "mupdate_override": {
      "boot_disk_path": "/fake/path/install/mupdate_override.json",
      "boot_override": {
        "ok": null
      },
      "non_boot_status": []
    }
  }
}

Using v11 does

$ curl -H "api-version: 11.0.0"  http://[::1]:41963/inventory | jq
<...>
    "mupdate_override": {
      "boot_disk_path": "/fake/path/install/mupdate_override.json",
      "boot_override": {
        "ok": null
      },
      "non_boot_status": []
    }
  },
  "smf_services_enabled_not_running": [
    {
      "fmri": "svc:/site/fake-service:default",
      "zone": "global",
      "state": "maintenance",
      "state_since": "2025-12-03T02:27:20Z",
      "reason": "Restarting too quickly.",
      "impact": "This service is not running.",
      "additional_info": [
        "http://illumos.org/msg/SMF-8000-L5",
        "/var/svc/log/site-fake-service:default.log"
      ]
    }
  ]
}

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Utilities for reporting SMF services' status.
Collaborator


I didn't know about this, but it looks like we have an smf crate:
https://github.com/oxidecomputer/smf

and it supports a Query operation that seems to basically be svcs:
https://docs.rs/smf/latest/smf/struct.Query.html

I think one problem with that is that it doesn't seem to support -Z (query all zones). It either does the current zone or a specific zone. But it looks like it could be easy to add -Z and then you could use that here?


If for whatever reason we don't want to do that: I think better than trying to parse svcs -Zx would be using svcs -Za -H -o fmri,state. svcs -xv seems much harder to parse and less stable. (It looks like neither is guaranteed to be stable but it would be much more surprising for -H -o to change.)
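
The suggested `svcs -Za -H -o fmri,state` approach could be parsed with something like the sketch below. This is illustrative only (the function names are invented, not from the PR); lines that don't have both columns are skipped rather than treated as hard errors, matching the lenient parsing discussed later in this thread.

```rust
// Hypothetical sketch: parse `svcs -Za -H -o fmri,state` output. With -H
// there is no header line, and -o fmri,state yields exactly two columns.
fn parse_svcs(output: &str) -> Vec<(String, String)> {
    output
        .lines()
        .filter_map(|line| {
            let mut cols = line.split_whitespace();
            // Skip any line that doesn't have both columns.
            Some((cols.next()?.to_string(), cols.next()?.to_string()))
        })
        .collect()
}

// Keep only services whose state column is "maintenance".
fn in_maintenance(output: &str) -> Vec<String> {
    parse_svcs(output)
        .into_iter()
        .filter(|(_, state)| state == "maintenance")
        .map(|(fmri, _)| fmri)
        .collect()
}

fn main() {
    let sample = "svc:/system/svc/restarter:default online\n\
                  svc:/site/fake-service:default maintenance\n";
    assert_eq!(in_maintenance(sample), vec!["svc:/site/fake-service:default"]);
}
```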

Contributor Author


Ha! I was not aware of this smf crate either, thanks. I took a look and I think I'm going to go with the second option though. The from_str method there returns an error if it is unable to parse a line of text. My intention here was to skip over any missing fields or lines that couldn't be parsed, and fill out as much of the struct as possible.

I wonder about the additional information that svcs -x provides though. Would it not be useful for support to have the time when the service went into maintenance, whether it is in a degraded state, or the reason why the service is in such a state? Or is this information available to support elsewhere?

Collaborator


> Ha! I was not aware of this smf crate either, thanks. I took a look and I think I'm going to go with the second option though. The from_str method there returns an error if it is unable to parse a line of text. My intention here was to skip over any missing fields or lines that couldn't be parsed, and fill out as much of the struct as possible.

Hmm, that's interesting. This makes me wonder if what we want is actually a list of errors, which would be an enum with two variants: one is like ServiceInMaintenance and the other is like SvcsFailed. Then you could still use the existing thing. We wouldn't have information in the case that svcs failed, but that's already true if the whole command fails, and I believe this really shouldn't happen in practice. (If it does, we'll be able to see that and fix the bug.) I'm not totally sold on this, but it also doesn't seem great to duplicate this code just for an error case that shouldn't ever happen.
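
The two-variant list suggested above could look roughly like this. The variant names come from the review comment; the payloads and the `describe` helper are invented here for illustration.

```rust
// Hypothetical sketch of the two-variant error list discussed above: each
// entry is either a service found in maintenance or a record that svcs
// itself (or its parsing) failed.
#[derive(Debug, PartialEq)]
enum SmfHealthProblem {
    ServiceInMaintenance { fmri: String },
    SvcsFailed { message: String },
}

fn describe(p: &SmfHealthProblem) -> String {
    match p {
        SmfHealthProblem::ServiceInMaintenance { fmri } => {
            format!("service in maintenance: {fmri}")
        }
        SmfHealthProblem::SvcsFailed { message } => {
            format!("could not determine service health: {message}")
        }
    }
}

fn main() {
    let p = SmfHealthProblem::ServiceInMaintenance {
        fmri: "svc:/site/fake-service:default".to_string(),
    };
    assert!(describe(&p).contains("maintenance"));
}
```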

> I wonder about the additional information that svcs -x provides though. Would it not be useful for support to have the time when the service went into maintenance, whether it is in a degraded state, or the reason why the service is in such a state? Or is this information available to support elsewhere?

I assume support is going to need a support bundle to debug a service in maintenance. Ultimately, the information here just feeds a boolean to the user about whether to call support, and the value is really for the cases where there's no service in maintenance. (Right?)

Contributor Author


> Hmm, that's interesting. This makes me wonder if what we want is actually a list of errors, which would be an enum with two variants: one is like ServiceInMaintenance and the other is like SvcsFailed. Then you could still use the existing thing. We wouldn't have information in the case that svcs failed, but that's already true if the whole command fails, and I believe this really shouldn't happen in practice. (If it does, we'll be able to see that and fix the bug.) I'm not totally sold on this, but it also doesn't seem great to duplicate this code just for an error case that shouldn't ever happen.

Hmmm... Perhaps? I'll play around with this idea. But yeah, duplicating code is not the best. I'm a little concerned that we may be shoehorning what already exists into an enum that may not feel as natural. Let's say the parsing fails midway through a service in maintenance that we want to collect: instead of just filling the enum with as much information as we have, we end up with a SvcsFailed. Would we then report the health check as failed or not? There could also be the case where parsing fails for a service that is healthy; we would also get a SvcsFailed there, but in that case it doesn't mean the health check failed.

Thinking about all of this, I'm more inclined to use svcs -Za -H -o fmri,state for less ambiguity. Which brings me to your next point

> I assume support is going to need a support bundle to debug a service in maintenance. Ultimately, the information here just feeds a boolean to the user about whether to call support, and the value is really for the cases where there's no service in maintenance. (Right?)

If this is the case, then svcs -Za -H -o fmri,state should be enough, no?


Comment on lines +807 to +808
let smf_services_enabled_not_running =
Svcs::enabled_not_running(&self.log).await?;
Collaborator


I don't think the simulated sled agent should call out to the real svcs command to see what services are in maintenance. It should probably just fake up a response indicating no services in maintenance. For testing, we might want a way to get it to report some services in maintenance, but we may not really need that since we can create fake inventory collections instead.

Contributor Author


Ugh yeah sorry, I was using this to test on my machine without having to spin up an a4x2 environment every time I tried out the endpoint and forgot to remove it.

Collaborator


Yeah, I can see the appeal of testing this in the simulated environment. When I've wanted to do something like this in the past, I added an API to the simulated sled agent that would let you configure its behavior. I could see doing something like that in the limit, but I could also see it not really being worth it -- whatever you think.

Contributor Author


I'll give it a go and see if it doesn't take too long to implement. This may be useful for any further health checks.

Comment on lines +1136 to +1137
let smf_services_enabled_not_running =
Svcs::enabled_not_running(&self.log).await?;
Collaborator


I'm curious for @jgallagher's thoughts on this but I feel like it would be better if this function remained non-async and just had this information available already. Maybe the config-reconciler could collect it? Or some other background task?

There are a few things I'm worried about here:

  • a stuck svcs call blocks our ability to collect any inventory at all
  • if a svcs call gets stuck, every call to inventory will accumulate another one

If we instead put this into a separate background task, there can be only one at a time, and we can always report the latest state, even if it's old.
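
The background-task pattern described above can be sketched as follows. This is illustrative only: it uses std::thread and a Mutex for simplicity, whereas the real sled-agent would presumably use a tokio task; the function name is invented. The key property is the one named in the comment: one poller at a time, and readers always get the latest cached snapshot even if polling is stuck or stale.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical sketch: a single background poller refreshes a shared
// snapshot on its own schedule. Inventory requests read the cached value
// and can never pile up behind a stuck `svcs` invocation.
fn spawn_health_poller(
    poll: impl Fn() -> Vec<String> + Send + 'static,
    interval: Duration,
) -> Arc<Mutex<Vec<String>>> {
    // Take one snapshot synchronously so readers always see *some* state.
    let latest = Arc::new(Mutex::new(poll()));
    let writer = Arc::clone(&latest);
    thread::spawn(move || loop {
        thread::sleep(interval);
        // e.g. run `svcs` and parse its output; if this hangs, readers
        // simply keep seeing the previous (possibly old) snapshot.
        let snapshot = poll();
        *writer.lock().unwrap() = snapshot;
    });
    latest
}

fn main() {
    let latest = spawn_health_poller(
        || vec!["svc:/site/fake-service:default".to_string()],
        Duration::from_secs(30),
    );
    // An "inventory request" just reads the cached snapshot.
    assert_eq!(latest.lock().unwrap().len(), 1);
}
```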

Contributor


Strongly agreed; failing to collect inventory because we're stuck in a hung SMF interaction would be quite painful (since inventory is supposed to report that kind of problem in the first place).

I'm less sure about putting it in the config reconciler vs spawning a new background task to do it. My gut reaction was to put it in a separate task, but since the config reconciler is responsible for starting up zones I could see it wanting to know whether any services in that zone are in trouble.

Contributor Author


Yeah, that all makes sense.

> I'm less sure about putting it in the config reconciler vs spawning a new background task to do it. My gut reaction was to put it in a separate task, but since the config reconciler is responsible for starting up zones I could see it wanting to know whether any services in that zone are in trouble.

This comment makes me think having it in the config-reconciler would be more useful?

Contributor


> This comment makes me think having it in the config-reconciler would be more useful?

(Recapping a live conversation) I thought so too until we were chatting, and realized that the config-reconciler is mostly dormant. It only runs when there's been a change it needs to reconcile (new config, change in hardware, ...) or if the previous attempt to reconcile didn't complete successfully (in which case it continues retrying periodically until it does succeed, at which point it goes dormant again waiting for changes). We want health check info to be fresher than that, so a new background task seems right.
