Repeating Alerts for a task failing on Scylla Manager #2307

Open
noellymedina opened this issue Jun 4, 2024 · 8 comments

@noellymedina

When a backup or repair task on Scylla Manager fails, it automatically retries right away, and the retry usually succeeds.
The problem is that every failed run triggers its own alert. Is it possible to configure the alert so that it is sent only once, when the task has failed for good and no retries remain?

@amnonh

noellymedina added the bug label Jun 4, 2024
mykaul removed the bug label Jun 4, 2024
@mykaul
Contributor

mykaul commented Jun 4, 2024

Does that make sense, though? Do you really not want to be aware of a failed repair task? (Perhaps it should be an option.)

@noellymedina
Author

The point is that every retry spams another alert. I was thinking of an alternative that raises a single alert instead of several for the same task.
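To make the request concrete, here is a minimal sketch of the behavior being asked for: alert only on a failed run that has no retries left, and stay silent on intermediate failures. It is not Scylla Manager or Scylla Monitoring code. The `TaskRun` record, its fields, and the `final_failures` helper are hypothetical names used only for illustration, and the sketch assumes the alerting side can see both the attempt number and the retry limit of each run.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class TaskRun:
    """One attempt of a task (hypothetical record, not a Scylla Manager type)."""
    task_id: str
    attempt: int       # 0 for the first run, 1 and up for retries
    max_retries: int   # assumed: retries the scheduler makes after the first run
    succeeded: bool


def final_failures(runs: Iterable[TaskRun]) -> List[str]:
    """Return the IDs of tasks that failed with no retries remaining."""
    alerts = []
    for run in runs:
        # Intermediate failures are skipped; only the last allowed attempt alerts.
        if not run.succeeded and run.attempt >= run.max_retries:
            alerts.append(run.task_id)
    return alerts


runs = [
    TaskRun("repair/a", 0, 3, False),  # fails once...
    TaskRun("repair/a", 1, 3, True),   # ...then the retry succeeds: no alert
    TaskRun("backup/b", 0, 3, False),
    TaskRun("backup/b", 1, 3, False),
    TaskRun("backup/b", 2, 3, False),
    TaskRun("backup/b", 3, 3, False),  # last attempt fails: one alert
]
print(final_failures(runs))  # ['backup/b']
```

With this rule, the four failed runs of the second task produce one alert instead of four, and the first task produces none because its retry succeeded.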

@noellymedina
Author

As an example:
[image]
Each of these attempts would trigger an alert, while a single alert for the task would be enough to tell us that it requires investigation.

@mykaul
Contributor

mykaul commented Jun 5, 2024

> As an example: [image] Each of these attempts would trigger an alert, while a single alert for the task would be enough to tell us that it requires investigation.

We should fix the root cause. I don't think the alert can know whether a failure is due to the same issue or a different one.

@amnonh
Collaborator

amnonh commented Jun 5, 2024

Let's get the Manager team involved. I think this is part of a bigger issue with what the Manager reports: on one hand it exposes too many metrics, and on the other it's hard to get significant information from them.

@amnonh
Collaborator

amnonh commented Jun 24, 2024

@mykaul, can we assign someone from the manager team to get their input?

@mykaul
Contributor

mykaul commented Jun 24, 2024

@karol-kokoszka - can you look into this?

@karol-kokoszka

Outside of the scope of alerts in this issue... The error that @noellymedina attached comes from the ScyllaDB back-end used by Scylla Manager. A simple INSERT operation failed. Something is wrong with this Scylla instance. Maybe there is not enough memory? It's worth checking the logs from the Scylla instance running on the same machine as the manager.

Can I see the full, untruncated error message? We have a known issue in SM, scylladb/scylla-manager#3884, but I'm not sure if it's connected.
