Metrics to show the backup status other than "Completed" #7239

Shuanglu · 2023-12-20T23:46:52Z

What steps did you take and what happened:

We have Velero deployed with Helm chart 4.1.3. Recently it started to hit OOM when it attempts to back up the cluster, and sometimes it is evicted due to autoscaler/node pressure. After these incidents, the backup status became phase: Failed with failureReason: get a backup with status "InProgress" during the server starting, mark it as "Failed", rather than Completed, and no backup files were uploaded to our remote storage.
It looks like the Velero native metrics backupFailureTotal and backupPartialFailureTotal are always 0 in this case. Is this expected?
Previously this was deployed with Helm 2. We recently migrated to Helm 3, and it started to OOM very frequently. Is there any parameter that might help avoid this?

What did you expect to happen:
Metrics to show the backup status other than Completed
The following information will help us better understand what's going on:

If you are using velero v1.7.0+:
Please use velero debug --backup <backupname> --restore <restorename> to generate the support bundle, and attach to this issue, more options please refer to velero debug --help

If you are using earlier versions:
Please provide the output of the following commands (Pasting long output into a GitHub gist or other pastebin is fine.)

kubectl logs deployment/velero -n velero
velero backup describe <backupname> or kubectl get backup/<backupname> -n velero -o yaml
velero backup logs <backupname>
velero restore describe <restorename> or kubectl get restore/<restorename> -n velero -o yaml
velero restore logs <restorename>

Anything else you would like to add:

Environment:

Velero version (use velero version): 1.10.1
Velero features (use velero client config get features):
Kubernetes version (use kubectl version): 1.27
Kubernetes installer & version:
Cloud provider or hardware configuration:
OS (e.g. from /etc/os-release):

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

👍 for "I would like to see this bug fixed as soon as possible"
👎 for "There are more important bugs to focus on right now"

allenxu404 · 2024-01-03T02:41:35Z

> It looks like Velero native metrics backupFailureTotal and backupPartialFailureTotal are always 0 in this case. Is this expected?

No, it's not expected. Thanks for pointing it out; we will fix it in the next version.

allenxu404 · 2024-02-21T10:10:34Z

Hi @Shuanglu, upon further consideration, I think it may not be necessary to update the Velero metrics in this scenario.

In the case you describe, the unfinished backups are marked as Failed when the server restarts. All the Velero metrics are reset when the server restarts, so updating the metrics at this point would have little value. It's best to leave the current monitoring behaviour as-is for now.
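One possible workaround, sketched here and not confirmed by the maintainers in this thread: the phase recorded on each Backup object is stored in the Kubernetes API, so it survives a Velero server restart even though the in-process counters do not. Assuming the input is the JSON printed by `kubectl get backups.velero.io -n velero -o json`, a small helper could count backups per phase. The function name `count_backups_by_phase` and the sample data are hypothetical.

```python
import json
from collections import Counter


def count_backups_by_phase(backup_list_json: str) -> Counter:
    """Count Backup objects by status.phase.

    Assumes the input is a Kubernetes List document with an "items" array,
    as produced by `kubectl get ... -o json`. Backups with no phase set are
    counted under the hypothetical bucket "Unknown".
    """
    items = json.loads(backup_list_json).get("items", [])
    return Counter(
        item.get("status", {}).get("phase", "Unknown") for item in items
    )


# Hypothetical sample data standing in for real kubectl output.
sample = json.dumps({
    "items": [
        {"metadata": {"name": "daily-1"}, "status": {"phase": "Completed"}},
        {"metadata": {"name": "daily-2"}, "status": {"phase": "Failed"}},
        {"metadata": {"name": "daily-3"}, "status": {}},
    ]
})

counts = count_backups_by_phase(sample)
print(dict(counts))
```

A non-zero count for Failed here would surface the restart-induced failures that the counters miss.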

allenxu404 · 2024-03-05T06:58:06Z

Closing this, as we plan to leave the current behaviour as-is.

Feel free to reopen it if you have any doubts.

Shuanglu · 2024-03-06T15:43:32Z

> Hi @Shuanglu, upon further consideration, I think it may not be necessary to update the Velero metrics in this scenario.
>
> In the case you describe, the unfinished backups are marked as Failed when the server restarts. All the Velero metrics are reset when the server restarts, so updating the metrics at this point would have little value. It's best to leave the current monitoring behaviour as-is for now.

Thanks. I'll check whether any other metrics can be used to monitor this. It's been a while, and I don't remember the details of the behavior.

CheckmkOps · 2024-10-09T14:42:57Z

@Shuanglu did you find a better way to catch this failure scenario with metrics? The easiest thing that comes to mind is to check the logs and generate alerts based on error messages, but using metrics would be a better approach in this case.

Shuanglu · 2024-10-11T02:19:04Z

> @Shuanglu did you find a better way to catch this failure scenario with metrics? The easiest thing that comes to mind is to check the logs and generate alerts based on error messages, but using metrics would be a better approach in this case.

Nope. We actually changed the monitor to watch for a successful backup within the period, and if there's a mismatch, it alerts us.
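A rough sketch of that kind of check, for illustration only. It is not the actual monitor described above. It assumes each backup is a dict shaped like a Velero Backup object, with `status.phase` and an RFC 3339 `status.completionTimestamp`. The helper names, the sample data, and the time windows are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def parse_ts(ts: str) -> datetime:
    """Parse a UTC timestamp such as 2024-10-10T02:00:00Z."""
    return datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def has_recent_completed_backup(backups, now: datetime, max_age: timedelta) -> bool:
    """Return True if any backup is Completed and finished within max_age of now."""
    for backup in backups:
        status = backup.get("status", {})
        finished = status.get("completionTimestamp")
        if status.get("phase") == "Completed" and finished:
            if now - parse_ts(finished) <= max_age:
                return True
    return False


# Hypothetical data: the latest run failed, the one before it completed.
backups = [
    {"status": {"phase": "Completed", "completionTimestamp": "2024-10-10T02:00:00Z"}},
    {"status": {"phase": "Failed", "completionTimestamp": "2024-10-11T01:00:00Z"}},
]
now = datetime(2024, 10, 11, 2, 0, 0, tzinfo=timezone.utc)

# Alert when no Completed backup falls inside the expected window.
should_alert = not has_recent_completed_backup(backups, now, timedelta(hours=12))
print("alert" if should_alert else "ok")
```

Checking for the presence of a success, instead of counting failures, also covers the case where the server was killed before it could record anything.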

Lyndon-Li added the "Metrics" label (Related to prometheus metrics) Dec 21, 2023

ywk253100 assigned allenxu404 Jan 2, 2024

pradeepkchaturvedi added the 1.14-candidate label Jan 24, 2024

reasonerjt removed the 1.14-candidate label Feb 6, 2024

reasonerjt added this to the v1.14 milestone Feb 6, 2024

reasonerjt added the target/1.13.1 label Feb 6, 2024

ywk253100 closed this as completed Mar 5, 2024
