@kdelemme kdelemme commented Nov 21, 2025

Fix #238018
Fix #243706

🍒 Summary

This PR refactors the SLO Health internal API (a breaking change) to include more information from the related transforms and to remove the previous "state" part, which was not used.

The new information returned provides more fine-grained details on potential issues:

  • is the transform missing?
  • if not missing, is it healthy or unhealthy?
  • if not missing, what is its current state and does it match the enabled flag on the SLO definition?
"health": {
  "isProblematic": false,
  "rollup": {
    "isProblematic": false,
    "missing": false,
    "status": "healthy",
    "state": "started",
    "stateMatches": true
  },
  "summary": {
    "isProblematic": false,
    "missing": false,
    "status": "healthy",
    "state": "started",
    "stateMatches": true
  }
}
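As a rough illustration of the shape above, the following TypeScript sketch types the response and derives the top-level flag. The field names come from the example; the helper `toSLOHealth` and its rule (the SLO is problematic when either transform is) are assumptions for illustration, not the PR's actual code.

```typescript
// Field names mirror the JSON example above. The aggregation rule in
// toSLOHealth is an assumption made for illustration.
type TransformStatus = 'healthy' | 'unhealthy' | 'unavailable';

interface TransformHealth {
  isProblematic: boolean;
  missing: boolean;
  status: TransformStatus;
  state: string; // e.g. 'started', 'stopped', or 'unavailable'
  stateMatches?: boolean; // omitted when the transform is missing
}

interface SLOHealth {
  isProblematic: boolean;
  rollup: TransformHealth;
  summary: TransformHealth;
}

// Hypothetical helper: the SLO is flagged as soon as one transform is flagged.
function toSLOHealth(rollup: TransformHealth, summary: TransformHealth): SLOHealth {
  return {
    isProblematic: rollup.isProblematic || summary.isProblematic,
    rollup,
    summary,
  };
}

const healthyTransform: TransformHealth = {
  isProblematic: false,
  missing: false,
  status: 'healthy',
  state: 'started',
  stateMatches: true,
};

const missingTransform: TransformHealth = {
  isProblematic: true,
  missing: true,
  status: 'unavailable',
  state: 'unavailable',
};

console.log(JSON.stringify(toSLOHealth(missingTransform, healthyTransform), null, 2));
```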

I've updated the shape of the params & response as well, removing the slo prefix to keep things cleaner.

Both the SLO Definition API and the SLO Health API use this new domain service to compute the health.

The frontend usage has been updated to include the conflicting state reason on the SLO details page.
As a side effect, the health callout is no longer shown when creating a new SLO from the UI. This was a bug from #243562.

Left out of this PR

  • Handle errors when calling the computeHealth service from the SLO Definitions API, plus a timeout/circuit breaker to avoid breaking the SLO management page
  • Refactor the various SLO health callouts: can't we have a single version instead of 2 callout components and 1 custom handling in the management table?

Testing

  • Create 4 SLOs (with or without instances) and keep them enabled
  • Stop the rollup transform of SLO 1
  • Stop the summary transform of SLO 2
  • Delete the rollup transform of SLO 3
  • Delete the rollup transform and stop the summary transform of SLO 4
  • Create a 5th SLO (disabled)
  • Restart the rollup and/or summary transforms of SLO 5

@github-actions github-actions bot added the author:actionable-obs PRs authored by the actionable obs team label Nov 21, 2025
@kdelemme kdelemme force-pushed the chore/decouple-health branch from 177e416 to 59ffc3b Compare November 24, 2025 13:03
@kdelemme kdelemme force-pushed the chore/decouple-health branch from 59ffc3b to 78e1f7a Compare November 24, 2025 13:07
@kdelemme kdelemme requested a review from Copilot November 24, 2025 21:50
Copilot finished reviewing on behalf of kdelemme November 24, 2025 21:52
Contributor

Copilot AI left a comment

Pull request overview

This PR refactors the SLO Health internal API with breaking changes to provide more detailed transform health information. The main changes include removing the deprecated "state" field, renaming API parameters (removing slo prefix), and introducing fine-grained health indicators for both rollup and summary transforms including missing status, health status, transform state, and state matching validation.

Key Changes:

  • Introduced a new computeHealth domain service that centralizes health computation logic
  • Refactored Health API request/response schemas to use cleaner field names (id, instanceId, revision, name instead of sloId, sloInstanceId, etc.)
  • Enhanced health response structure with isProblematic, missing, status, state, and stateMatches fields for both rollup and summary transforms

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| compute_health.ts | New domain service that computes SLO health by checking transform stats against expected states |
| get_slo_health.ts | Refactored to delegate health computation to the new domain service |
| find_slo_definitions.ts | Updated to use the new domain service for health computation when includeHealth is requested |
| health.ts (schema) | Updated health schema to support the new transform health structure with additional fields |
| health.ts (models) | Updated type exports to use TransformHealth instead of HealthStatus and State |
| slo_health_callout.tsx | Updated to use new health structure and display conflicting state information |
| health_callout.tsx | Updated to work with the new isProblematic field and renamed response properties |
| slo_management_table.tsx | Updated to handle the new health structure including undefined health field |
| Test files | Comprehensively updated to test the new health computation logic and response structure |

@elasticmachine
Contributor

Pinging @elastic/obs-ux-management-team (Team:obs-ux-management)

const state = toTransformState(transformStat.state?.toLowerCase());
const status = toTransformStatus(transformStat.health?.status?.toLowerCase());
const stateMatches =
(!item.enabled && ['stopped', 'stopping', 'aborting'].includes(state)) ||
Contributor

Need to check if these ACCs are covered with this logic:

- if the SLO is ENABLED and either or both of its transforms are STOPPED, the SLO "needs attention"
- if the SLO is DISABLED, and either or both of its transforms are NOT STOPPED, the SLO "needs attention"
- if the SLO is DISABLED and its transforms are STOPPED, the SLO does not need attention

Contributor Author

right, that's the stateMatches behaviour:

  • SLO disabled and stopped/stopping/aborting state: OK
  • SLO enabled and started/indexing state: OK
  • Other case: not ok

Contributor

I tested it and indeed you cover all the cases correctly according to the ACCs.

The stateMatches logic could be extracted into a well-named function for readability:

function doesTransformStateMatchSLOEnabledState(enabled: boolean, state: TransformState): boolean {
  if (!enabled) {
    return ['stopped', 'stopping', 'aborting'].includes(state);
  }
  return ['started', 'indexing'].includes(state);
}

@baileycash-elastic
Contributor

Currently trying to resolve response size error

@kdelemme
Contributor Author

summary: aHealthyTransformHealth,
};

export const anUnhealthySLOHealth = {
Contributor

@mgiota mgiota Nov 25, 2025

This can be true as well

anUnhealthySLOHealth = {
  isProblematic: true,
  rollup: anUnhealthyTransformHealth,
  summary: anUnhealthyTransformHealth,
};

Wondering whether you intentionally paired an unhealthy rollup transform with a healthy summary transform, or whether it is a typo.

I also checked where this const is used, and it looks like it is not used anywhere.

Contributor Author

Your example is also valid. It's just a fixture, and for some reason I stopped using it. I can definitely remove it.

@mgiota
Contributor

mgiota commented Nov 25, 2025

@kdelemme The flow in the following video is confusing. The scenario I am testing is when the transforms are healthy and started and the SLO is disabled.

First of all, the health callout on the SLO list page states that the SLO is in an unhealthy state and data may be missing or incomplete. In this case the SLO is disabled. Maybe we should also mention that the SLO is disabled?

On the SLO details page, we can see that the SLO is in a conflicting state. However, when the user goes to the transform page to check the conflicting state, everything looks fine. If the user checks the SLO Health Management page, they can see that the SLO is disabled and needs attention. According to the ACCs of this issue (if the SLO is DISABLED, and either or both of its transforms are NOT STOPPED, the SLO "needs attention"), this is correct, but it can be confusing for the user to see that the SLO is in a conflicting state without an explanation of what a conflicting state means.

Screen.Recording.2025-11-26.at.01.00.28.mov

(!item.enabled && ['stopped', 'stopping', 'aborting'].includes(state)) ||
(item.enabled && ['started', 'indexing'].includes(state));

const isProblematic = status === 'unhealthy' || state === 'failed' || !stateMatches;
Contributor

What about state === 'stopped'? Isn't this considered problematic as well?

Contributor Author

This is taken into account with !stateMatches
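To make that answer concrete, here is a minimal sketch adapted from the snippet quoted in this thread. The function and constant names are hypothetical, and the list of transform states is an assumption based on the values mentioned in the discussion. It shows that a stopped transform is flagged only when the SLO is enabled.

```typescript
// Sketch only: names are hypothetical, logic follows the quoted review snippet.
type TransformState =
  | 'started'
  | 'indexing'
  | 'stopping'
  | 'stopped'
  | 'aborting'
  | 'failed'
  | 'unavailable';
type TransformStatus = 'healthy' | 'unhealthy' | 'unavailable';

const STOPPED_STATES: ReadonlyArray<TransformState> = ['stopped', 'stopping', 'aborting'];
const RUNNING_STATES: ReadonlyArray<TransformState> = ['started', 'indexing'];

// True when the transform state agrees with the SLO's enabled flag.
function stateMatchesEnabledFlag(enabled: boolean, state: TransformState): boolean {
  return enabled ? RUNNING_STATES.includes(state) : STOPPED_STATES.includes(state);
}

// A transform is problematic when it is unhealthy, failed, or out of sync
// with the enabled flag. The last clause is what covers state === 'stopped'.
function isTransformProblematic(
  enabled: boolean,
  status: TransformStatus,
  state: TransformState
): boolean {
  return status === 'unhealthy' || state === 'failed' || !stateMatchesEnabledFlag(enabled, state);
}

// Enabled SLO with a stopped transform: flagged via !stateMatches.
console.log(isTransformProblematic(true, 'healthy', 'stopped'));
// Disabled SLO with a stopped transform: not flagged.
console.log(isTransformProblematic(false, 'healthy', 'stopped'));
```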

isProblematic: true,
missing: true,
status: 'unavailable',
state: 'unavailable',
Contributor

I noticed stateMatches is not set for a missing transform. Maybe we should set it to undefined explicitly for clarity? I am fine keeping it as is, though, since it is optional in the schema anyway.

Contributor Author

I'm thinking of it as a union type:
it is either missing, in which case it is always equal to

  isProblematic: true,
  missing: true,
  status: 'unavailable',
  state: 'unavailable',

or not missing and it can be:

  isProblematic: true or false,
  missing: false,
  status: 'healthy' | 'unhealthy' | 'unavailable',
  state: 'started' | ... | 'stopped' | 'unavailable',
  stateMatches: true or false
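The union described above could be written as a TypeScript discriminated union on `missing`. This is only a sketch: the type names, the exact list of transform states, and the `describeTransformHealth` helper are assumptions for illustration, not the schema shipped in the PR.

```typescript
// Sketch of the two variants described in the comment; names are hypothetical.
type TransformState =
  | 'started'
  | 'indexing'
  | 'stopping'
  | 'stopped'
  | 'aborting'
  | 'failed'
  | 'unavailable';

type MissingTransformHealth = {
  isProblematic: true;
  missing: true;
  status: 'unavailable';
  state: 'unavailable';
};

type PresentTransformHealth = {
  isProblematic: boolean;
  missing: false;
  status: 'healthy' | 'unhealthy' | 'unavailable';
  state: TransformState;
  stateMatches: boolean;
};

type TransformHealth = MissingTransformHealth | PresentTransformHealth;

// Checking `missing` narrows the union, so stateMatches is only read
// on the variant that actually has it.
function describeTransformHealth(health: TransformHealth): string {
  if (health.missing) {
    return 'missing';
  }
  return health.stateMatches
    ? `${health.state} (${health.status})`
    : `${health.state} (conflicting state)`;
}

console.log(
  describeTransformHealth({
    isProblematic: true,
    missing: false,
    status: 'healthy',
    state: 'stopped',
    stateMatches: false,
  })
);
```

With this shape, the compiler rejects a missing transform that carries `stateMatches`, which matches the idea that the field is simply absent in that case.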

@kdelemme
Contributor Author

@mgiota thanks for the review

First of all, the health callout on the SLO list page states that the SLO is in an unhealthy state and data may be missing or incomplete. In this case the SLO is disabled. Maybe we should also mention that the SLO is disabled?

I want to revamp this health callout honestly - actually, I want to revamp both of them, and keep only one. It will simplify the maintenance going forward. All the copy should probably be revisited too.

On the listing page we use:
"The following {count, plural, one {SLO is} other {SLOs are}} in an unhealthy state. Data may be missing or incomplete. You can inspect {count, plural, one {it} other {each one}} here:"
Maybe we can use something broader:
"The following {count, plural, one {SLO} other {SLOs}} might have some operational problems. You can inspect {count, plural, one {it} other {each one}} here:"

For the details page callout:
I can change the conflicting state translation to include the expected state: "{transformId} (conflicting state: should be {expectedState})"
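
As a sketch of how the expected state in that translation might be derived, the snippet below assumes an enabled SLO expects 'started' and a disabled SLO expects 'stopped'. Both function names and the example transform id are hypothetical, and the real code would go through the i18n layer rather than a template string.

```typescript
// Assumption: the expected state follows the SLO's enabled flag.
function expectedTransformState(enabled: boolean): 'started' | 'stopped' {
  return enabled ? 'started' : 'stopped';
}

// Fills the proposed "{transformId} (conflicting state: should be {expectedState})" copy.
function conflictingStateLabel(transformId: string, enabled: boolean): string {
  return `${transformId} (conflicting state: should be ${expectedTransformState(enabled)})`;
}

// 'example-rollup-transform' is a made-up id used only for this example.
console.log(conflictingStateLabel('example-rollup-transform', false));
```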

@kdelemme kdelemme force-pushed the chore/decouple-health branch from 6d1fa39 to d18d854 Compare November 26, 2025 02:37
@kdelemme
Contributor Author

@mgiota
I updated the copy of the health callout and added the expected state when the states conflict: f9fe38c

I think I'll create a follow-up PR to merge both callout components into a single component, copy might change again

@mgiota
Contributor

mgiota commented Nov 26, 2025

@kdelemme Thanks let me do one more round of testing.

I think I'll create a follow-up PR to merge both callout components into a single component, copy might change again

Sounds good! When I introduced the missing state, I had a single component, but with other changes taking place we ended up with two versions. Agreed, keeping only one will simplify maintenance going forward 👍

@mgiota
Contributor

mgiota commented Nov 26, 2025

@kdelemme Clear separation of concerns with the domain service and efficient batching to avoid large payloads, great work!

I tested the workflow one more time with the copy changes and it makes more sense now, especially the expected state.

Screen.Recording.2025-11-26.at.09.23.49.mov

Once this PR is merged, I can work on a follow-up improvement that adds a call to action in the health callout (for example, to stop the transform). Currently the user needs many steps to fix the issue: SLO overview page > Click on the health callout > Go to SLO details page > Inspect in transform page > Start the transform. We can reduce the steps to: SLO overview page > Click on the health callout > Go to SLO details page > Start the transform.

@mgiota
Contributor

mgiota commented Nov 26, 2025

I tried a few more scenarios and noticed that when the user disables the SLO from the health management page, the transforms are stopped behind the scenes, which is good: there is no conflicting state in this case.

Scenario 1
Given user disables the SLO from health management page
And transforms are automatically stopped
When user enables the SLO in the SLO health management page
Then transforms are automatically started
And there is no conflicting state (expected)

Scenario 2
Given user disables the SLO from health management page
And transforms are automatically stopped
When user starts the transform in Transform page
Then user gets the health callout with a message to stop the transforms

I would argue that instead of suggesting that the user stop the transform, we should prompt them to enable the SLO. From my testing, any action initiated from the SLO pages, whether enabling or disabling, keeps the SLO state in sync with the corresponding transform state.

However, when the user performs an action directly from the Transform page, the SLO state can become out of sync. In that scenario, the correct guidance should be to enable the SLO (assuming it was disabled). Otherwise, the user might end up going in circles if they intentionally started the transforms and we keep suggesting that they stop them.

Is this the expected behavior?

@elasticmachine
Contributor

💛 Build succeeded, but was flaky

Failed CI Steps

Test Failures

  • [job] [logs] FTR Configs #52 / saved objects management apis find with kibana index - relationships hasReference and hasReferenceOperator parameters search for a reference

Metrics [docs]

Public APIs missing comments

Total count of every public API that lacks a comment. Target amount is 0. Run node scripts/build_api_docs --plugin [yourplugin] --stats comments for more detailed information.

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/slo-schema | 214 | 209 | -5 |

Async chunks

Total size of all lazy-loaded chunks that will be downloaded as the user navigates the app

| id | before | after | diff |
| --- | --- | --- | --- |
| observability | 1.7MB | 1.7MB | -13.0B |
| slo | 988.4KB | 988.5KB | +163.0B |
| synthetics | 1.0MB | 1.0MB | -16.0B |
| total | | | +134.0B |

Unknown metric groups

API count

| id | before | after | diff |
| --- | --- | --- | --- |
| @kbn/slo-schema | 214 | 210 | -4 |

History

cc @kdelemme

Labels

  • author:actionable-obs (PRs authored by the actionable obs team)
  • backport:skip (This PR does not require backporting)
  • ci:beta-faster-pr-build (Uses an alternative PR build pipeline with speed optimizations)
  • Feature:SLO
  • release_note:skip (Skip the PR/issue when compiling release notes)
  • Team:actionable-obs (Formerly "obs-ux-management", responsible for SLO, o11y alerting, significant events, & synthetics.)
  • Team:obs-ux-management
  • v9.3.0

Development

Successfully merging this pull request may close these issues.

  • [SLO] Decouple definition service response from health service response
  • [SLO] Reflect underlying transform stopped status on SLO pages

4 participants