@karencfv karencfv commented Nov 21, 2025

This is the first PR for #9412

This PR only implements checking SMF service health for each sled. There will be follow-up PRs that deal with adding the data types to the DB.

NB: This PR is less than 1000 lines, not ~10,000. Most of the changes here are API JSON additions.

Manual testing on a simulated omicron

System has two unhealthy enabled services

$ pfexec svcs -Zx
svc:/site/fake-service:default (?)
  Zone: global
 State: maintenance since Mon Dec  1 20:53:40 2025
Reason: Restarting too quickly.
   See: http://illumos.org/msg/SMF-8000-L5
   See: /var/svc/log/site-fake-service:default.log
Impact: This service is not running.

svc:/system/omicron/baseline:default (Omicron brand baseline generation)
  Zone: global
 State: maintenance since Mon Dec  1 21:23:43 2025
Reason: Start method failed repeatedly, last died on Killed (9).
   See: http://illumos.org/msg/SMF-8000-KS
   See: omicron1(7)
   See: /var/svc/log/system-omicron-baseline:default.log
Impact: This service is not running.

$ curl -H "api-version: 10.0.0"  http://[::1]:32824/inventory | jq
<...>
  "smf_services_enabled_not_running": [
    {
      "fmri": "svc:/site/fake-service:default",
      "zone": "global",
      "state": "maintenance",
      "state_since": "2025-12-01T20:53:40Z",
      "reason": "Restarting too quickly.",
      "impact": "This service is not running.",
      "additional_info": [
        "http://illumos.org/msg/SMF-8000-L5",
        "/var/svc/log/site-fake-service:default.log"
      ]
    },
    {
      "fmri": "svc:/system/omicron/baseline:default",
      "zone": "global",
      "state": "maintenance",
      "state_since": "2025-12-01T21:23:43Z",
      "reason": "Start method failed repeatedly, last died on Killed (9).",
      "impact": "This service is not running.",
      "additional_info": [
        "http://illumos.org/msg/SMF-8000-KS",
        "omicron1(7)",
        "/var/svc/log/system-omicron-baseline:default.log"
      ]
    }
  ]
}

Disabling one of the unhealthy services should only show a single entry

$ svcadm disable svc:/site/fake-service:default
$ pfexec svcs -Zx
svc:/system/omicron/baseline:default (Omicron brand baseline generation)
  Zone: global
 State: maintenance since Mon Dec  1 21:23:43 2025
Reason: Start method failed repeatedly, last died on Killed (9).
   See: http://illumos.org/msg/SMF-8000-KS
   See: omicron1(7)
   See: /var/svc/log/system-omicron-baseline:default.log
Impact: This service is not running.
$ curl -H "api-version: 10.0.0"  http://[::1]:32824/inventory | jq
<...>
  "smf_services_enabled_not_running": [
    {
      "fmri": "svc:/system/omicron/baseline:default",
      "zone": "global",
      "state": "maintenance",
      "state_since": "2025-12-01T21:23:43Z",
      "reason": "Start method failed repeatedly, last died on Killed (9).",
      "impact": "This service is not running.",
      "additional_info": [
        "http://illumos.org/msg/SMF-8000-KS",
        "omicron1(7)",
        "/var/svc/log/system-omicron-baseline:default.log"
      ]
    }
  ]
}

Disabling the other service should not return anything

$ svcadm disable svc:/system/omicron/baseline:default
$ pfexec svcs -Zx
$ curl -H "api-version: 10.0.0"  http://[::1]:32824/inventory | jq
<...>
  "smf_services_enabled_not_running": []
}
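
From the JSON entries above, the inventory item type presumably has a shape roughly like the following. This is a sketch inferred from the output, not the PR's actual definition: the real `SvcNotRunning` likely derives serde traits and may use a proper timestamp type for `state_since`.

```rust
// Sketch of a type matching the JSON entries above. Field names are taken
// from the inventory output; everything else here is illustrative.
#[derive(Debug)]
struct SvcNotRunning {
    fmri: String,
    zone: String,
    state: String,
    // The real type may use a chrono/time timestamp rather than a String.
    state_since: String,
    reason: String,
    impact: String,
    additional_info: Vec<String>,
}

fn main() {
    let svc = SvcNotRunning {
        fmri: "svc:/site/fake-service:default".to_string(),
        zone: "global".to_string(),
        state: "maintenance".to_string(),
        state_since: "2025-12-01T20:53:40Z".to_string(),
        reason: "Restarting too quickly.".to_string(),
        impact: "This service is not running.".to_string(),
        additional_info: vec!["http://illumos.org/msg/SMF-8000-L5".to_string()],
    };
    assert_eq!(svc.state, "maintenance");
}
```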

@karencfv karencfv marked this pull request as ready for review December 2, 2025 07:37
@karencfv karencfv requested a review from davepacheco December 2, 2025 07:38
pub reconciler_status: ConfigReconcilerInventoryStatus,
pub last_reconciliation: Option<ConfigReconcilerInventory>,
pub zone_image_resolver: ZoneImageResolverInventory,
pub smf_services_enabled_not_running: Vec<SvcNotRunning>,
Contributor Author

I ended up not specifically calling this smf_services_in_maintenance because of the description in https://www.illumos.org/man/1/svcs

   -x
                      Displays explanations for service states.

                      Without arguments, the -x option explains the states
                      of services which:

                          o      are enabled, but are not running.

                          o      are preventing another enabled service
                                 from running.

Collaborator

I think we do want to stick to just services in maintenance, but not use -x (see my other comment).

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Sled-agent API types that changed from v10 to v11.
Contributor Author

idk what I'm doing. Someone please check that I did this versioning thing right

Collaborator

I haven't inspected it closely but it looks like the right shape and the OpenAPI changes look right so I think it's right.

karencfv commented Dec 3, 2025

After the sled agent version change I tested the API on a simulated omicron to make sure everything still works:

All enabled services are healthy

Using v10 the output doesn't contain the smf_services_enabled_not_running field:

$ curl -H "api-version: 10.0.0"  http://[::1]:41963/inventory | jq
<...>
    "mupdate_override": {
      "boot_disk_path": "/fake/path/install/mupdate_override.json",
      "boot_override": {
        "ok": null
      },
      "non_boot_status": []
    }
  }
}

Using v11 does

$ curl -H "api-version: 11.0.0"  http://[::1]:41963/inventory | jq
<...>
    "mupdate_override": {
      "boot_disk_path": "/fake/path/install/mupdate_override.json",
      "boot_override": {
        "ok": null
      },
      "non_boot_status": []
    }
  },
  "smf_services_enabled_not_running": []
}

There is an unhealthy enabled service

$ svcadm enable svc:/site/fake-service:default

Using v10 the output doesn't contain services that aren't running:

$ curl -H "api-version: 10.0.0"  http://[::1]:41963/inventory | jq
<...>
    "mupdate_override": {
      "boot_disk_path": "/fake/path/install/mupdate_override.json",
      "boot_override": {
        "ok": null
      },
      "non_boot_status": []
    }
  }
}

Using v11 does

$ curl -H "api-version: 11.0.0"  http://[::1]:41963/inventory | jq
<...>
    "mupdate_override": {
      "boot_disk_path": "/fake/path/install/mupdate_override.json",
      "boot_override": {
        "ok": null
      },
      "non_boot_status": []
    }
  },
  "smf_services_enabled_not_running": [
    {
      "fmri": "svc:/site/fake-service:default",
      "zone": "global",
      "state": "maintenance",
      "state_since": "2025-12-03T02:27:20Z",
      "reason": "Restarting too quickly.",
      "impact": "This service is not running.",
      "additional_info": [
        "http://illumos.org/msg/SMF-8000-L5",
        "/var/svc/log/site-fake-service:default.log"
      ]
    }
  ]
}

// License, v. 2.0. If a copy of the MPL was not distributed with this
// file, You can obtain one at https://mozilla.org/MPL/2.0/.

//! Utilities for reporting SMF services' status.
Collaborator


I didn't know about this, but it looks like we have an smf crate:
https://github.com/oxidecomputer/smf

and it supports a Query operation that seems to basically be svcs:
https://docs.rs/smf/latest/smf/struct.Query.html

I think one problem with that is that it doesn't seem to support -Z (query all zones). It either does the current zone or a specific zone. But it looks like it could be easy to add -Z and then you could use that here?


If for whatever reason we don't want to do that: I think better than trying to parse svcs -Zx would be using svcs -Za -H -o fmri,state. svcs -xv seems much harder to parse and less stable. (It looks like neither is guaranteed to be stable but it would be much more surprising for -H -o to change.)
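
The suggested `svcs -Za -H -o fmri,state` approach could be parsed with something like the sketch below. This is illustrative only (the function names are invented, not from the PR); lines that don't have both columns are skipped rather than treated as hard errors, matching the lenient parsing discussed later in this thread.

```rust
// Hypothetical sketch: parse `svcs -Za -H -o fmri,state` output. With -H
// there is no header line, and -o fmri,state yields exactly two columns.
fn parse_svcs(output: &str) -> Vec<(String, String)> {
    output
        .lines()
        .filter_map(|line| {
            let mut cols = line.split_whitespace();
            // Skip any line that doesn't have both columns.
            Some((cols.next()?.to_string(), cols.next()?.to_string()))
        })
        .collect()
}

// Keep only services whose state column is "maintenance".
fn in_maintenance(output: &str) -> Vec<String> {
    parse_svcs(output)
        .into_iter()
        .filter(|(_, state)| state == "maintenance")
        .map(|(fmri, _)| fmri)
        .collect()
}

fn main() {
    let sample = "svc:/system/svc/restarter:default online\n\
                  svc:/site/fake-service:default maintenance\n";
    assert_eq!(in_maintenance(sample), vec!["svc:/site/fake-service:default"]);
}
```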

Contributor Author


Ha! I was not aware of this smf crate either, thanks. I took a look and I think I'm going to go with the second option though. The from_str method there returns an error if it is unable to parse a line of text. My intention here was to skip over any missing fields or lines that couldn't be parsed, and fill out as much of the struct as possible.

I wonder about the additional information that svcs -x provides though. Would it not be useful for support to have the time when the service went into maintenance, whether it is in a degraded state, or the reason why the service is in such a state? Or is this information available to support elsewhere?

Collaborator


> Ha! I was not aware of this smf crate either, thanks. I took a look and I think I'm going to go with the second option though. The from_str method there returns an error if it is unable to parse a line of text. My intention here was to skip over any missing fields or lines that couldn't be parsed, and fill out as much of the struct as possible.

Hmm, that's interesting. This makes me wonder if what we want is actually a list of errors, which would be an enum with two variants: one is like ServiceInMaintenance and the other is like SvcsFailed. Then you could still use the existing thing. We wouldn't have information in the case that svcs failed, but that's already true if the whole command fails, and I believe this really shouldn't happen in practice. (If it does, we'll be able to see that and fix the bug.) I'm not totally sold on this, but it also doesn't seem great to duplicate this code just for an error case that shouldn't ever happen.
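
The two-variant list suggested above could look roughly like this. The variant names come from the review comment; the payloads and the `describe` helper are invented here for illustration.

```rust
// Hypothetical sketch of the two-variant error list discussed above: each
// entry is either a service found in maintenance or a record that svcs
// itself (or its parsing) failed.
#[derive(Debug, PartialEq)]
enum SmfHealthProblem {
    ServiceInMaintenance { fmri: String },
    SvcsFailed { message: String },
}

fn describe(p: &SmfHealthProblem) -> String {
    match p {
        SmfHealthProblem::ServiceInMaintenance { fmri } => {
            format!("service in maintenance: {fmri}")
        }
        SmfHealthProblem::SvcsFailed { message } => {
            format!("could not determine service health: {message}")
        }
    }
}

fn main() {
    let p = SmfHealthProblem::ServiceInMaintenance {
        fmri: "svc:/site/fake-service:default".to_string(),
    };
    assert!(describe(&p).contains("maintenance"));
}
```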

> I wonder about the additional information that svcs -x provides though. Would it not be useful for support to have the time when the service went into maintenance, whether it is in a degraded state, or the reason why the service is in such a state? Or is this information available to support elsewhere?

I assume support is going to need a support bundle to debug a service in maintenance. Ultimately, the information here just feeds a boolean to the user about whether to call support, and the value is really for the cases where there's no service in maintenance. (Right?)

Contributor Author


> Hmm, that's interesting. This makes me wonder if what we want is actually a list of errors, which would be an enum with two variants: one is like ServiceInMaintenance and the other is like SvcsFailed. Then you could still use the existing thing. We wouldn't have information in the case that svcs failed, but that's already true if the whole command fails, and I believe this really shouldn't happen in practice. (If it does, we'll be able to see that and fix the bug.) I'm not totally sold on this, but it also doesn't seem great to duplicate this code just for an error case that shouldn't ever happen.

Hmmm... Perhaps? I'll play around with this idea. But yeah, duplicating code is not the best. I'm a little concerned that we may be shoehorning what already exists into an enum that may not feel as natural. Let's say the parsing fails midway through a service in maintenance that we want to collect: instead of just filling the enum with as much information as we have, we end up with a SvcsFailed. Would we then report the health check as failed or not? There could also be the case where parsing fails for a service that is healthy; we would also get a SvcsFailed there, but in that case it doesn't mean the health check failed.

Thinking about all of this, I'm more inclined to use svcs -Za -H -o fmri,state for less ambiguity. Which brings me to your next point

> I assume support is going to need a support bundle to debug a service in maintenance. Ultimately, the information here just feeds a boolean to the user about whether to call support, and the value is really for the cases where there's no service in maintenance. (Right?)

If this is the case, then svcs -Za -H -o fmri,state should be enough, no?


Comment on lines +807 to +808
let smf_services_enabled_not_running =
Svcs::enabled_not_running(&self.log).await?;
Collaborator


I don't think the simulated sled agent should call out to the real svcs command to see what services are in maintenance. It should probably just fake up a response indicating no services in maintenance. For testing, we might want a way to get it to report some services in maintenance, but we may not really need that since we can create fake inventory collections instead.

Contributor Author


Ugh yeah sorry, I was using this to test on my machine without having to spin up an a4x2 environment every time I tried out the endpoint and forgot to remove it.

Collaborator


Yeah, I can see the appeal of testing this in the simulated environment. When I've wanted to do something like this in the past, I added an API to the simulated sled agent that would let you configure its behavior. I could see doing something like that in the limit, but I could also see it not really being worth it -- whatever you think.

Contributor Author


I'll give it a go and see if it doesn't take too long to implement. This may be useful for any further health checks.

Comment on lines +1136 to +1137
let smf_services_enabled_not_running =
Svcs::enabled_not_running(&self.log).await?;
Collaborator


I'm curious for @jgallagher's thoughts on this but I feel like it would be better if this function remained non-async and just had this information available already. Maybe the config-reconciler could collect it? Or some other background task?

There are a few things I'm worried about here:

  • a stuck svcs call blocks our ability to collect any inventory at all
  • if a svcs call gets stuck, every call to inventory will accumulate another one

If we instead put this into a separate background task, there can be only one at a time, and we can always report the latest state, even if it's old.
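
The background-task pattern described above can be sketched as follows. This is illustrative only: it uses std::thread and a Mutex for simplicity, whereas the real sled-agent would presumably use a tokio task; the function name is invented. The key property is the one named in the comment: one poller at a time, and readers always get the latest cached snapshot even if polling is stuck or stale.

```rust
use std::sync::{Arc, Mutex};
use std::thread;
use std::time::Duration;

// Hypothetical sketch: a single background poller refreshes a shared
// snapshot on its own schedule. Inventory requests read the cached value
// and can never pile up behind a stuck `svcs` invocation.
fn spawn_health_poller(
    poll: impl Fn() -> Vec<String> + Send + 'static,
    interval: Duration,
) -> Arc<Mutex<Vec<String>>> {
    // Take one snapshot synchronously so readers always see *some* state.
    let latest = Arc::new(Mutex::new(poll()));
    let writer = Arc::clone(&latest);
    thread::spawn(move || loop {
        thread::sleep(interval);
        // e.g. run `svcs` and parse its output; if this hangs, readers
        // simply keep seeing the previous (possibly old) snapshot.
        let snapshot = poll();
        *writer.lock().unwrap() = snapshot;
    });
    latest
}

fn main() {
    let latest = spawn_health_poller(
        || vec!["svc:/site/fake-service:default".to_string()],
        Duration::from_secs(30),
    );
    // An "inventory request" just reads the cached snapshot.
    assert_eq!(latest.lock().unwrap().len(), 1);
}
```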

Contributor


Strongly agreed; failing to collect inventory because we're stuck in a hung SMF interaction would be quite painful (since inventory is supposed to report that kind of problem in the first place).

I'm less sure about putting it in the config reconciler vs spawning a new background task to do it. My gut reaction was to put it in a separate task, but since the config reconciler is responsible for starting up zones I could see it wanting to know whether any services in that zone are in trouble.

Contributor Author


Yeah, that all makes sense.

> I'm less sure about putting it in the config reconciler vs spawning a new background task to do it. My gut reaction was to put it in a separate task, but since the config reconciler is responsible for starting up zones I could see it wanting to know whether any services in that zone are in trouble.

This comment makes me think having it in the config-reconciler would be more useful?

Contributor


> This comment makes me think having it in the config-reconciler would be more useful?

(Recapping a live conversation) I thought so too until we were chatting, and realized that the config-reconciler is mostly dormant. It only runs when there's been a change it needs to reconcile (new config, change in hardware, ...) or if the previous attempt to reconcile didn't complete successfully (in which case it continues retrying periodically until it does succeed, at which point it goes dormant again waiting for changes). We want health check info to be fresher than that, so a new background task seems right.
