Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Healthcheck: correctly label nodes removed from metrics #4178

Merged
merged 1 commit into from
Dec 18, 2024

Conversation

Michal-Leszczynski
Copy link
Collaborator

@Michal-Leszczynski Michal-Leszczynski commented Dec 18, 2024

Commits a75df89 and 3cdd20d added and started setting new labels (dc/rack) in healthcheck metrics.
Unfortunately, those labels weren't taken into consideration when removing metrics of removed clusters/nodes:

hl := prometheus.Labels{
clusterKey: clusterID.String(),
hostKey: host,
}
r.metrics.status.Delete(hl)
r.metrics.rtt.Delete(hl)

This resulted in not removing those metrics until SM restart.

This commit fixes this issue and brings a small refactor to the way in which healthcheck labels are applied, so that it's more difficult to make such mistake in the future.

Fixes #4017

Commits a75df89 and 3cdd20d added and started setting
new labels (dc/rack) to the healthcheck metrics.
Unfortunately, those labels weren't taken into consideration
when removing labels of removed clusters/nodes.
This resulted in not removing those metrics until SM restart.

This commit fixes this issue and brings a small refactor
to the way in which healthcheck labels are applied,
so that it's more difficult to make such mistake in the future.

Fixes #4017
@Michal-Leszczynski
Copy link
Collaborator Author

The fix was tested manually by decommissioning a node from a fresh testing cluster.

Before the fix:
image
After the fix:
image

Note that the metrics for the removed node will keep on showing -1 until it's removed from the config cache svc, which can take up to 5 minutes, but I guess that's not really a problem. Before the fix, no matter how long I waited, the metrics weren't removed at all. After the fix it's clearly visible that after some time the metrics were removed.

@Michal-Leszczynski
Copy link
Collaborator Author

@karol-kokoszka @VAveryanov8 not sure about automated integration testing.
It would probably require some bash scripts using docker commands - I don't think that we have the infrastructure for that in place and that it would be too much hassle to properly implement it.

@Michal-Leszczynski Michal-Leszczynski marked this pull request as ready for review December 18, 2024 10:17
Copy link
Collaborator

@karol-kokoszka karol-kokoszka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If there is a test for that in integration-tests, it would catch this bug before releasing scylla-manager with the bug.
But I agree it's too much effort to add it and to cover that scenario. Especially that we are not really testing metrics in e2e scenarios.

@Michal-Leszczynski Michal-Leszczynski merged commit a16e511 into master Dec 18, 2024
51 checks passed
@Michal-Leszczynski Michal-Leszczynski deleted the ml/fix-4017 branch December 18, 2024 11:38
Michal-Leszczynski added a commit that referenced this pull request Dec 18, 2024
Commits a75df89 and 3cdd20d added and started setting
new labels (dc/rack) to the healthcheck metrics.
Unfortunately, those labels weren't taken into consideration
when removing labels of removed clusters/nodes.
This resulted in not removing those metrics until SM restart.

This commit fixes this issue and brings a small refactor
to the way in which healthcheck labels are applied,
so that it's more difficult to make such mistake in the future.

Fixes #4017

(cherry picked from commit a16e511)
Michal-Leszczynski added a commit that referenced this pull request Dec 18, 2024
Commits a75df89 and 3cdd20d added and started setting
new labels (dc/rack) to the healthcheck metrics.
Unfortunately, those labels weren't taken into consideration
when removing labels of removed clusters/nodes.
This resulted in not removing those metrics until SM restart.

This commit fixes this issue and brings a small refactor
to the way in which healthcheck labels are applied,
so that it's more difficult to make such mistake in the future.

Fixes #4017

(cherry picked from commit a16e511)
Michal-Leszczynski added a commit that referenced this pull request Dec 18, 2024
Commits a75df89 and 3cdd20d added and started setting
new labels (dc/rack) to the healthcheck metrics.
Unfortunately, those labels weren't taken into consideration
when removing labels of removed clusters/nodes.
This resulted in not removing those metrics until SM restart.

This commit fixes this issue and brings a small refactor
to the way in which healthcheck labels are applied,
so that it's more difficult to make such mistake in the future.

Fixes #4017

(cherry picked from commit a16e511)
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Metrics from changed ip address nodes
3 participants