
Journalized activity recorder for backup and restore #6606

Open
Lyndon-Li opened this issue Aug 4, 2023 · 11 comments

@Lyndon-Li
Contributor

At present, for a backup or restore, users need to collect information from multiple places, e.g., various CRs and various logs, to tell what exactly has been done. In other words, critical information is not listed centrally, journal style, for Velero backups and restores.
Moreover, the information in the logs is getting increasingly complicated.

One possible solution is to use the Event mechanism:

  • Create an Event recorder for the Backup and Restore CRs
  • Record the critical steps and info as Events while the backup/restore runs
  • In the same cluster, Velero doesn't need to do anything more; users just run kubectl describe
  • To support backup sync, Velero needs to back up the Event objects as part of the backup, just like backup logs
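For illustration only, a minimal sketch of what such a recorder could look like with client-go's record package (the helper name and wiring here are assumptions, not existing Velero code):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newBackupEventRecorder (hypothetical helper) wires an EventBroadcaster to
// the API server, so Events emitted through the returned recorder are
// persisted and show up under `kubectl describe backup <name> -n velero`.
func newBackupEventRecorder(kubeClient kubernetes.Interface) record.EventRecorder {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: kubeClient.CoreV1().Events(""),
	})
	// The scheme must have the Velero API types registered (e.g. via the
	// generated AddToScheme) so the recorder can build object references.
	return broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "velero"})
}
```

Each critical step in the workflow would then be a one-liner such as `recorder.Event(backup, corev1.EventTypeNormal, "VolumeSnapshotStarted", "snapshotting PVC app/data")` (reason and message invented for illustration).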
@Lyndon-Li added the kind/requirement and 1.13-candidate (issue/pr that should be considered to target v1.13 minor release) labels Aug 4, 2023
@shawn-hurley
Contributor

I think this would also be a great change for third-party data movers: it gives them a common way to surface information to the user during backup and restore.

I also love the UX of this personally as a user of k8s; I am so used to getting this information with kubectl describe.

@Lyndon-Li
Contributor Author

To cover 3rd-party data movers, one possible way is to provide this journalized event mechanism as a generic part of the Velero backup/restore workflow, so that the events travel with the Velero backup/restore no matter which module generates them.

@Lyndon-Li
Contributor Author

Lyndon-Li commented Aug 15, 2023

One thing that may hinder the proposal to use the Kubernetes Event mechanism:

  • Kubernetes Event resources have an associated TTL which cannot be disabled
  • The default TTL value is 1 hour
  • The TTL value is nearly impossible to reconfigure (it is an api-server parameter that applies to the entire etcd storage)

As a result, if we store the backup & restore events as Kubernetes Events, they will be cleared after 1 hour; for long-running backups, this is not enough.
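For reference, the TTL in question is controlled by a kube-apiserver flag, so changing it means reconfiguring the control plane and affects Event retention for the whole cluster (on managed clusters the flag is usually not exposed at all):

```
# kube-apiserver flag; the default retains Events for 1 hour, cluster-wide
--event-ttl=1h0m0s
```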

This means:

  • Even in the same cluster, Velero needs to back up these events for each backup in a timely manner
  • Even in the same cluster, the complete set of events for each backup has to be retrieved from the backup tarball, so kubectl get backup -n velero alone is not enough

Then we will need to weigh whether this is really simpler than creating a dedicated event mechanism in Velero.

@shawn-hurley
Contributor

I think there are two concerns here:

  1. A user having a k8s-native way to determine what the backup is currently doing
  2. A complete record of every action that a user could review after the fact

For the first, Events should be used, because they alert the user to what is happening.

For 2, the events are stored in the audit log, IIRC; that would be a place to point users for 2. Or we can create a new log file that simply TEE-records the events but saves them in the backup repository (see the sketch below).

IDK, does that make sense?
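A rough sketch of that TEE idea, purely as an assumption of how it could look (the type and field names are invented, not existing Velero code): wrap the real EventRecorder so every event also lands in a journal file that Velero can upload to the backup repository alongside the logs.

```go
package main

import (
	"fmt"
	"os"
	"time"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// teeRecorder (hypothetical) implements record.EventRecorder: it forwards
// each event to the real recorder (visible via `kubectl describe`) and also
// appends a line to a journal file persisted with the backup.
type teeRecorder struct {
	inner   record.EventRecorder
	journal *os.File
}

func (t *teeRecorder) Event(object runtime.Object, eventtype, reason, message string) {
	t.inner.Event(object, eventtype, reason, message)
	fmt.Fprintf(t.journal, "%s %s %s: %s\n",
		time.Now().UTC().Format(time.RFC3339), eventtype, reason, message)
}

func (t *teeRecorder) Eventf(object runtime.Object, eventtype, reason, messageFmt string, args ...interface{}) {
	t.Event(object, eventtype, reason, fmt.Sprintf(messageFmt, args...))
}

func (t *teeRecorder) AnnotatedEventf(object runtime.Object, annotations map[string]string, eventtype, reason, messageFmt string, args ...interface{}) {
	t.inner.AnnotatedEventf(object, annotations, eventtype, reason, messageFmt, args...)
	fmt.Fprintf(t.journal, "%s %s %s: %s\n",
		time.Now().UTC().Format(time.RFC3339), eventtype, reason, fmt.Sprintf(messageFmt, args...))
}
```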

@Lyndon-Li
Contributor Author

Personally, if the events only last 1 hour, I think it loses a lot of value even for 1: users will not check the events in time while the backup is running, especially for scheduled backups.
Think about how scheduled backups are used: users usually schedule them in a time window when the environment is not heavily used.

@shawn-hurley
Contributor

Hm, sounds like a different use case to me, TBH.

I think that when I create a backup, you can tell me <we have done X, we are doing Y> and keep this info coming (you can see the "got event eight times over the last 5 min" aggregation). This helps you know that things are being worked on.
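For illustration, this is roughly how such events would surface in kubectl describe (the reason and message below are invented; only the table shape and the "x8 over 5m" aggregation are standard kubectl output):

```
$ kubectl describe backup my-backup -n velero
...
Events:
  Type    Reason        Age               From    Message
  ----    ------        ----              ----    -------
  Normal  ItemBackedUp  2m (x8 over 5m)   velero  backed up resources in namespace "app"
```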

It sounds like you are focused on the case where I come in on Monday morning and my backup, which was supposed to run on Sunday at 8 pm or so, has failed. Here I agree that having a journaled log in the backup (like the TEE approach I talked about) would be useful.

Sounds like you just disagree that the first use case is relevant or needed?

@Lyndon-Li
Contributor Author

Lyndon-Li commented Aug 17, 2023

I think it is less valuable if it can only support the first case you mentioned, because:

  1. It is not common practice for a backup user to create a backup and then watch it in real time, especially for scheduled backups, which usually run as background tasks outside working hours.
  2. People will be annoyed if they can see something (events within the latest 1 hour) but cannot see everything.

Let me discuss this within the team and address:

  1. What everyone thinks the value is, given the current situation
  2. Whether we want to do something for it in 1.13

@shawn-hurley
Contributor

I disagree with

> People will be annoyed if they can see something (events within the latest 1 hour) but cannot see everything.

This is how kube Events work. This is well known and works for long-running pods, PVs, PVCs, Jobs, etc.

Please consider making it easier for users to use normal k8s tooling to debug rather than something special. I agree on something special for the second case, as there is no other option. And as stated, adding a call to EventRecorder wherever you add a journal log is minimal complexity.

I also can't entirely agree that the only way someone uses this is for scheduled backups. We have many use cases where users watch the backups, and this would be very helpful.

@shawn-hurley
Contributor

I also think that we should have this conversation in the open; can we add it to the next community meeting?

@Lyndon-Li
Contributor Author

Lyndon-Li commented Aug 18, 2023

Sure, let's try to reach more people and hear more voices.

To summarize my personal opinion: if the solution can meet both 1 and 2, I will fully vote for it; if it only meets 1, I will not be confident in its value.
I also believe that even for Kubernetes itself, this Event implementation is not perfect: it is actually a compromise for etcd's poor performance in handling the load in this scenario.

My understanding may be wrong, so let's see more comments from others.

@weshayutin
Contributor

++ love the idea

@reasonerjt added the Needs Product (Blocked needing input or feedback from Product) label Mar 4, 2024