-
Notifications
You must be signed in to change notification settings - Fork 8.5k
refactor(slo)!: enhance Health API with detailed transform insights #243858
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
177e416 to
59ffc3b
Compare
59ffc3b to
78e1f7a
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR refactors the SLO Health internal API with breaking changes to provide more detailed transform health information. The main changes include removing the deprecated "state" field, renaming API parameters (removing slo prefix), and introducing fine-grained health indicators for both rollup and summary transforms including missing status, health status, transform state, and state matching validation.
Key Changes:
- Introduced a new
computeHealthdomain service that centralizes health computation logic - Refactored Health API request/response schemas to use cleaner field names (
id,instanceId,revision,nameinstead ofsloId,sloInstanceId, etc.) - Enhanced health response structure with
isProblematic,missing,status,state, andstateMatchesfields for both rollup and summary transforms
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
compute_health.ts |
New domain service that computes SLO health by checking transform stats against expected states |
get_slo_health.ts |
Refactored to delegate health computation to the new domain service |
find_slo_definitions.ts |
Updated to use the new domain service for health computation when includeHealth is requested |
health.ts (schema) |
Updated health schema to support the new transform health structure with additional fields |
health.ts (models) |
Updated type exports to use TransformHealth instead of HealthStatus and State |
slo_health_callout.tsx |
Updated to use new health structure and display conflicting state information |
health_callout.tsx |
Updated to work with the new isProblematic field and renamed response properties |
slo_management_table.tsx |
Updated to handle the new health structure including undefined health field |
| Test files | Comprehensively updated to test the new health computation logic and response structure |
You can also share your feedback on Copilot code review for a chance to win a $100 gift card. Take the survey.
...lutions/observability/plugins/slo/public/pages/slo_details/components/slo_health_callout.tsx
Outdated
Show resolved
Hide resolved
x-pack/solutions/observability/plugins/slo/server/domain/services/compute_health.ts
Outdated
Show resolved
Hide resolved
|
Pinging @elastic/obs-ux-management-team (Team:obs-ux-management) |
| const state = toTransformState(transformStat.state?.toLowerCase()); | ||
| const status = toTransformStatus(transformStat.health?.status?.toLowerCase()); | ||
| const stateMatches = | ||
| (!item.enabled && ['stopped', 'stopping', 'aborting'].includes(state)) || |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Need to check if these ACCs are covered with this logic:
- if the SLO is ENABLED and either or both of its transforms are STOPPED, the SLO "needs attention"
- if the SLO is DISABLED, and either or both of its transforms are NOT STOPPED, the SLO "needs attention"
- if the SLO is DISABLED and its transforms are STOPPED, the SLO does not need attention
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
right, that's the stateMatches behaviour:
- SLO disabled and stopped/stopping/aborting state: OK
- SLO enabled and started/indexing state: OK
- Other case: not ok
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tested it and indeed you cover all the cases correctly according to the ACCs.
The stateMatches logic could be extracted into a well-named function for readability:
function doesTransformStateMatchSLOEnabledState(enabled: boolean, state: TransformState): boolean {
if (!enabled) {
return ['stopped', 'stopping', 'aborting'].includes(state);
}
return ['started', 'indexing'].includes(state);
}
...lutions/observability/plugins/slo/public/pages/slo_details/components/slo_health_callout.tsx
Show resolved
Hide resolved
|
Currently trying to resolve response size error |
| summary: aHealthyTransformHealth, | ||
| }; | ||
|
|
||
| export const anUnhealthySLOHealth = { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This can be true as well
anUnhealthySLOHealth = {
isProblematic: true,
rollup: anUnhealthyTransformHealth,
summary: anUnHealthyTransformHealth,
};
Wondering if you added on purpose an unhealthy rollup transform and a healthy summary transform or if it is a typo.
I also checked where this const is being used and looks like it is not being used anywhere.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
your example is also true. it's just a fixture, and for some reason I stopped using it. I can definitely remove it
|
@kdelemme The flow in the following Video is confusing. The scenario I am testing is when transforms are healthy and started and SLO is disabled. First of all the health callout in the SLO list page states that SLO is in an unhealthy state and data maybe missing or incomplete. In this case the SLO is disabled. Maybe we should mention In the SLO details page, we can see SLO is in conflicting state. However when user goes to transform page to check the conflicting state everything looks fine. If user checks SLO Health Management page, they can see SLO is disabled and needs attention. According to the ACCs of this issue Screen.Recording.2025-11-26.at.01.00.28.mov |
| (!item.enabled && ['stopped', 'stopping', 'aborting'].includes(state)) || | ||
| (item.enabled && ['started', 'indexing'].includes(state)); | ||
|
|
||
| const isProblematic = status === 'unhealthy' || state === 'failed' || !stateMatches; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about state === 'stopped'? Isn't this considered problematic as well?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is taken into account with !stateMatches
| isProblematic: true, | ||
| missing: true, | ||
| status: 'unavailable', | ||
| state: 'unavailable', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I noticed stateMatches is not set in case of a missing transform. Maybe we should add undefined for clarity? I am fine keeping it as is though, since it is optional in the schema anyway
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm thinking of it as a union type
it is either missing, so always equal to
isProblematic: true,
missing: true,
status: 'unavailable',
state: 'unavailable',
or not missing and it can be:
isProblematic: true or false,
missing: false,
status: 'healthy' | 'unhealthy' | 'unavailable',
state: 'started' | ... | 'stopped' | 'unavailable',
stateMatches: true or false
|
@mgiota thanks for the review
I want to revamp this health callout honestly - actually, I want to revamp both of them, and keep only one. It will simplify the maintenance going forward. All the copy should probably be revisited too. On the listing page we use: For the details page callout: |
6d1fa39 to
d18d854
Compare
|
@kdelemme Thanks let me do one more round of testing.
Sounds good! When I introduced the missing state, I had one single component, but then we other changes taking place we ended up with two versions. Agree keeping only one, will simplify the maintenance going forward 👍 |
|
@kdelemme Clear separation of concerns with the domain service and efficient batching to avoid large payloads, great work! I tested the workflow one more time with the copy changes and it makes more sense now, especially the expected state. Screen.Recording.2025-11-26.at.09.23.49.movOnce this PR is merged I can work on a follow up improvement where we have a call to action in the health callout (to stop for example the transform). Now user needs to do many steps in order to fix the issue |
|
I tried a few more scenarios and I noticed that when user disables the SLO from health management page, behind the scenes the transforms are stopped, which is good, there is no conflicting state in this case. Scenario 1 Scenario 2 I would argue that instead of suggesting that the user stop the transform, we should prompt them to enable the SLO. From my testing, any action initiated from the SLO pages—whether enabling or disabling —keeps the SLO state in sync with the corresponding transform state. However, when the user performs an action directly from the Transform page, the SLO state can become out of sync. In that scenario, the correct guidance should be to enable the SLO (assuming it was disabled). Otherwise, the user might end up going in circles if they intentionally started the transforms and we keep suggesting that they stop them. Is this the expected behavior? |
💛 Build succeeded, but was flaky
Failed CI StepsTest Failures
Metrics [docs]Public APIs missing comments
Async chunks
History
cc @kdelemme |
Fix #238018
Fix #243706
🍒 Summary
This PR refactors (with breaking change) the SLO Health internal API to include more information from the related transforms, and remove the previous "state" part as it was not used.
The new information returned provides more fine-grain details on potential issues:
enabledflag on the SLO definition?I've updated the shape of the params & response as well, removing the
sloprefix to keep things cleaner.Both the SLO Definition API and the SLO Health API use this new domain service to compute the health.
The usage on the frontend has been updated to include the conflicting state reason when on the SLO details page.
As a side effect, no health callout is shown when creating a new SLO from the UI anymore. This was a bug from #243562
Left out of this PR
computeHealthservice from the SLO Definitions API + timeout/circuit breaker to avoid breaking the SLO management pageTesting