Metrics to show the backup status other than "Completed" #7239

Closed
Shuanglu opened this issue Dec 20, 2023 · 6 comments
Labels
Metrics Related to prometheus metrics target/1.13.1

Comments

Shuanglu commented Dec 20, 2023

What steps did you take and what happened:

  • We have Velero deployed with helm chart 4.1.3. Recently it started hitting OOM when it attempts to back up the cluster, and sometimes it is evicted due to autoscaler/node pressure. After these incidents, the backup status became phase: Failed with failureReason: get a backup with status "InProgress" during the server starting, mark it as "Failed", rather than Completed, and no backup files were uploaded to our remote storage.
    It looks like Velero native metrics backupFailureTotal and backupPartialFailureTotal are always 0 in this case. Is this expected?

  • Previously this was deployed with helm2; we recently migrated to helm3 and it started to OOM very frequently. Is there any parameter that might help avoid this?

What did you expect to happen:
Metrics to show the backup status other than Completed
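
Until a metric covers this case, the phase can be read from the Backup objects themselves. The following is a minimal sketch, assuming the input is the JSON produced by `kubectl get backups -n velero -o json` and that each item carries `status.phase` and `status.failureReason` as quoted above. The function name `summarize_backup_phases` and the sample data are hypothetical.

```python
import json
from collections import Counter


def summarize_backup_phases(backup_list_json: str):
    """Count backups per phase and collect those that are not Completed."""
    items = json.loads(backup_list_json).get("items", [])
    phases = Counter()
    not_completed = []
    for item in items:
        status = item.get("status", {})
        phase = status.get("phase", "Unknown")
        phases[phase] += 1
        if phase != "Completed":
            not_completed.append({
                "name": item.get("metadata", {}).get("name", ""),
                "phase": phase,
                "failureReason": status.get("failureReason", ""),
            })
    return phases, not_completed


# Hypothetical sample shaped like a Backup list; the names are made up.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": "daily-1"}, "status": {"phase": "Completed"}},
        {"metadata": {"name": "daily-2"}, "status": {
            "phase": "Failed",
            "failureReason": 'get a backup with status "InProgress" during '
                             'the server starting, mark it as "Failed"',
        }},
    ]
})

if __name__ == "__main__":
    phases, bad = summarize_backup_phases(SAMPLE)
    print(dict(phases))
    for entry in bad:
        print(entry["name"], entry["phase"], entry["failureReason"])
```

The counts could be exported by whatever monitoring agent is already in place; that wiring is left out here.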
The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

  • kubectl logs deployment/velero -n velero
  • velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
  • velero backup logs <backupname>
  • velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
  • velero restore logs <restorename>

Anything else you would like to add:

Environment:

  • Velero version (use velero version): 1.10.1
  • Velero features (use velero client config get features):
  • Kubernetes version (use kubectl version): 1.27
  • Kubernetes installer & version:
  • Cloud provider or hardware configuration:
  • OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li Lyndon-Li added the Metrics Related to prometheus metrics label Dec 21, 2023
allenxu404 (Contributor) commented Jan 3, 2024

> It looks like Velero native metrics backupFailureTotal and backupPartialFailureTotal are always 0 in this case. Is this expected?

No, it's not expected. Thanks for pointing it out; we will fix it in the next version.

allenxu404 (Contributor) commented

Hi @Shuanglu, upon further consideration, I think it may not be necessary to update the Velero metrics in this scenario.

In the case you describe, the unfinished backups are marked as Failed when the server restarts. All the Velero metrics are reset along with the server's restart, so updating the metrics at this point would have little value. It's best to leave the current monitoring behaviour as-is for now.
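
To make the behaviour concrete, here is a toy model of it. This is not Velero's code: `ToyServer`, its fields, and the backup names are hypothetical, and the only things taken from this thread are that counters live in process memory and that backups found InProgress at startup are marked Failed.

```python
class ToyServer:
    """Toy model of a server whose failure counter lives in process memory."""

    def __init__(self, backups):
        # `backups` stands in for persisted state that survives a restart.
        self.backups = backups
        # The counter is re-created at zero on every start.
        self.backup_failure_total = 0
        self._mark_interrupted_backups()

    def _mark_interrupted_backups(self):
        # Startup path: anything still InProgress is marked Failed.
        # The counter is deliberately not incremented here, matching the
        # behaviour described in this thread.
        for name, phase in self.backups.items():
            if phase == "InProgress":
                self.backups[name] = "Failed"

    def fail_backup(self, name):
        # Normal failure path while the server is running.
        self.backups[name] = "Failed"
        self.backup_failure_total += 1


if __name__ == "__main__":
    persisted = {"daily-1": "InProgress"}  # server was killed mid-backup
    server = ToyServer(persisted)          # restart
    print(persisted["daily-1"], server.backup_failure_total)
```

In this model the backup ends up Failed while the counter stays at zero, which is the mismatch reported above.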

allenxu404 (Contributor) commented

Closing this, as we plan to leave the current behaviour as-is.

Feel free to reopen it if you have any concerns.

Shuanglu (Author) commented Mar 6, 2024

> Hi @Shuanglu, upon further consideration, I think it may not be necessary to update the Velero metrics in this scenario.
>
> In the case you describe, the unfinished backups are marked as Failed when the server restarts. All the Velero metrics are reset along with the server's restart, so updating the metrics at this point would have little value. It's best to leave the current monitoring behaviour as-is for now.

Thanks. I'll check whether any other metrics can be used to monitor this. It's been a while, and I don't remember the details of the behavior.

CheckmkOps commented

@Shuanglu did you find a better way to catch this failure scenario with metrics? The easiest thing that comes to mind is to check the logs and generate alerts based on error messages, but using metrics would be a better approach here.
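
A minimal sketch of the log-based approach, assuming the failureReason text quoted at the top of this issue also shows up in whatever log or event stream you scan. Where that text actually appears is an assumption, and the sample lines below are hypothetical, not real Velero output.

```python
import re

# Message text taken from the failureReason quoted in this issue.
# Whether it appears verbatim in the server log is an assumption.
INTERRUPTED_PATTERN = re.compile(
    r'get a backup with status "InProgress" during the server starting'
)


def find_interrupted_backup_lines(log_lines):
    """Return the lines that mention a backup interrupted by a restart."""
    return [line for line in log_lines if INTERRUPTED_PATTERN.search(line)]


if __name__ == "__main__":
    # Hypothetical input lines for illustration only.
    sample = [
        "starting server",
        'backup daily-2: get a backup with status "InProgress" during '
        'the server starting, mark it as "Failed"',
        "backup daily-3 completed",
    ]
    for hit in find_interrupted_backup_lines(sample):
        print("ALERT:", hit)
```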

Shuanglu (Author) commented Oct 11, 2024

> @Shuanglu did you find a better way to catch this failure scenario with metrics? The easiest thing that comes to mind is to check the logs and generate alerts based on error messages, but using metrics would be a better approach here.

Nope. We changed the monitor to watch for a successful backup within the expected period; if there is a mismatch, it alerts us.
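
The check described above can be sketched roughly as follows. This is an illustration under assumptions, not the monitor we run: the function name, parameters, and numbers are hypothetical, and the timestamp of the last successful backup is assumed to come from your monitoring system (for example a gauge such as `velero_backup_last_successful_timestamp`, if your Velero version exposes it).

```python
import time


def backup_is_overdue(last_success_ts: float,
                      now_ts: float,
                      expected_interval_s: float,
                      grace_s: float = 0.0) -> bool:
    """True when no successful backup landed within the expected window."""
    return (now_ts - last_success_ts) > (expected_interval_s + grace_s)


if __name__ == "__main__":
    day = 24 * 60 * 60
    now = time.time()
    # Hypothetical values: daily schedule, one hour of grace.
    last_success = now - 2 * day
    if backup_is_overdue(last_success, now, expected_interval_s=day, grace_s=3600):
        print("ALERT: no successful backup within the expected period")
```

Alerting on the absence of success, rather than on a failure counter, also covers the case where the server is killed before it can record a failure.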
