
Journalized activity recorder for backup and restore #6606

Open
Lyndon-Li opened this issue Aug 4, 2023 · 11 comments

@Lyndon-Li
Contributor

At present, for a backup or restore, users need to collect information from multiple places, e.g., various CRs and various logs, to tell what exactly has been done. In other words, critical information is not listed centrally, journal style, for Velero backups and restores.
Moreover, the information in the logs is getting increasingly complicated.

One possible solution is to use the Event mechanism:

  • Create an Event recorder for the Backup and Restore CRs
  • Record the critical steps and info as Events while the backup/restore runs
  • In the same cluster, Velero doesn't need to do anything more; users just run kubectl describe
  • To support backup sync, Velero needs to back up the Event objects as part of the backup, just like backup logs
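For illustration only, a minimal sketch of what such a recorder could look like with client-go's record package (the helper name and wiring here are assumptions, not existing Velero code):

```go
package main

import (
	corev1 "k8s.io/api/core/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/kubernetes/scheme"
	typedcorev1 "k8s.io/client-go/kubernetes/typed/core/v1"
	"k8s.io/client-go/tools/record"
)

// newBackupEventRecorder (hypothetical helper) wires an EventBroadcaster to
// the API server, so Events emitted through the returned recorder are
// persisted and show up under `kubectl describe backup <name> -n velero`.
func newBackupEventRecorder(kubeClient kubernetes.Interface) record.EventRecorder {
	broadcaster := record.NewBroadcaster()
	broadcaster.StartRecordingToSink(&typedcorev1.EventSinkImpl{
		Interface: kubeClient.CoreV1().Events(""),
	})
	// The scheme must have the Velero API types registered (e.g. via the
	// generated AddToScheme) so the recorder can build object references.
	return broadcaster.NewRecorder(scheme.Scheme, corev1.EventSource{Component: "velero"})
}
```

Each critical step in the workflow would then be a one-liner such as `recorder.Event(backup, corev1.EventTypeNormal, "VolumeSnapshotStarted", "snapshotting PVC app/data")` (reason and message invented for illustration).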
@Lyndon-Li added the kind/requirement and 1.13-candidate (issue/pr that should be considered to target v1.13 minor release) labels Aug 4, 2023
@shawn-hurley
Contributor

I think this would also be a great change for third-party data movers: it gives them a common way to surface information to the user during backup and restore.

I also love the UX of this personally as a user of k8s; I am so used to getting this information with kubectl describe.

@Lyndon-Li
Contributor Author

To cover 3rd-party data movers, one possible way is to provide this journalized event mechanism as a generic part of the Velero backup/restore workflow, so that the events travel with the Velero backup/restore no matter which module generates them.

@Lyndon-Li
Contributor Author

Lyndon-Li commented Aug 15, 2023

One thing that may hinder the proposal to use the Kubernetes Event mechanism:

  • Kubernetes Event resources have an associated TTL which cannot be disabled
  • The default TTL value is 1 hour
  • The TTL value is nearly impossible to reconfigure (it is an api-server parameter that applies to the entire etcd storage)

As a result, if we store the backup & restore events as Kubernetes Events, they will be cleared after 1 hour; for long-running backups, this is not enough.
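For reference, the TTL in question is controlled by a kube-apiserver flag, so changing it means reconfiguring the control plane and affects Event retention for the whole cluster (on managed clusters the flag is usually not exposed at all):

```
# kube-apiserver flag; the default retains Events for 1 hour, cluster-wide
--event-ttl=1h0m0s
```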

This means:

  • Even in the same cluster, Velero needs to back up these events for each backup in a timely manner
  • Even in the same cluster, the complete set of events for each backup has to be retrieved from the backup tarball, so kubectl get backup -n velero alone is not enough

Then we will need to weigh whether this is really simpler than creating a dedicated event mechanism in Velero.

@shawn-hurley
Contributor

I think there are two concerns here:

  1. A user having a k8s-native way to determine what the backup is currently doing
  2. A complete record of every action that a user could review after the fact

For the first, Events should be used, because they alert the user to what is happening.

For 2, the events are stored in the audit log, IIRC; that would be a place to point users for 2. Or we can create a new log file that simply TEE-records the events but saves them in the backup repository (see the sketch below).

IDK, does that make sense?
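A rough sketch of that TEE idea, purely as an assumption of how it could look (the type and field names are invented, not existing Velero code): wrap the real EventRecorder so every event also lands in a journal file that Velero can upload to the backup repository alongside the logs.

```go
package main

import (
	"fmt"
	"os"
	"time"

	"k8s.io/apimachinery/pkg/runtime"
	"k8s.io/client-go/tools/record"
)

// teeRecorder (hypothetical) implements record.EventRecorder: it forwards
// each event to the real recorder (visible via `kubectl describe`) and also
// appends a line to a journal file persisted with the backup.
type teeRecorder struct {
	inner   record.EventRecorder
	journal *os.File
}

func (t *teeRecorder) Event(object runtime.Object, eventtype, reason, message string) {
	t.inner.Event(object, eventtype, reason, message)
	fmt.Fprintf(t.journal, "%s %s %s: %s\n",
		time.Now().UTC().Format(time.RFC3339), eventtype, reason, message)
}

func (t *teeRecorder) Eventf(object runtime.Object, eventtype, reason, messageFmt string, args ...interface{}) {
	t.Event(object, eventtype, reason, fmt.Sprintf(messageFmt, args...))
}

func (t *teeRecorder) AnnotatedEventf(object runtime.Object, annotations map[string]string, eventtype, reason, messageFmt string, args ...interface{}) {
	t.inner.AnnotatedEventf(object, annotations, eventtype, reason, messageFmt, args...)
	fmt.Fprintf(t.journal, "%s %s %s: %s\n",
		time.Now().UTC().Format(time.RFC3339), eventtype, reason, fmt.Sprintf(messageFmt, args...))
}
```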

@Lyndon-Li
Contributor Author

Personally, if the events only last 1 hour, I think it loses a lot of value even for 1: users will not check the events in time while the backup is running, especially for scheduled backups.
Think about how scheduled backups are used: users usually schedule them in a time window when the environment is not heavily used.

@shawn-hurley
Contributor

Hm, sounds like a different use case to me, TBH.

I think that when I create a backup, you can tell me <we have done X, we are doing Y> and keep this info coming (you can see the "got event eight times over the last 5 min" aggregation). This helps you know that things are being worked on.
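For illustration, this is roughly how such events would surface in kubectl describe (the reason and message below are invented; only the table shape and the "x8 over 5m" aggregation are standard kubectl output):

```
$ kubectl describe backup my-backup -n velero
...
Events:
  Type    Reason        Age               From    Message
  ----    ------        ----              ----    -------
  Normal  ItemBackedUp  2m (x8 over 5m)   velero  backed up resources in namespace "app"
```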

It sounds like you are focused on the case where I come in on Monday morning and my backup, which was supposed to run on Sunday at 8 pm or so, has failed. Here I agree that having a journaled log in the backup (like the TEE approach I talked about) would be useful.

Sounds like you just disagree that the first use case is relevant or needed?

@Lyndon-Li
Contributor Author

Lyndon-Li commented Aug 17, 2023

I think it is less valuable if it can only support the first case you mentioned, because:

  1. It is not common practice for a backup user to create a backup and then watch it in real time, especially for scheduled backups, which usually run as background tasks outside working hours.
  2. People will be annoyed if they can see something (events within the latest 1 hour) but cannot see everything.

Let me discuss this within the team and address:

  1. What everyone thinks the value is, given the current situation
  2. Whether we want to do something for it in 1.13

@shawn-hurley
Contributor

I disagree with

> People will be annoyed if they can see something (events within the latest 1 hour) but cannot see everything.

This is how kube Events work. This is well known and works for long-running pods, PVs, PVCs, Jobs, etc.

Please consider making it easier for users to use normal k8s tooling to debug rather than something special. I agree on something special for the second case, as there is no other option. And as stated, adding a call to EventRecorder wherever you add a journal log is minimal complexity.

I also can't entirely agree that the only way someone uses this is for scheduled backups. We have many use cases where users watch the backups, and this would be very helpful.

@shawn-hurley
Contributor

I also think that we should have this conversation in the open; can we add it to the next community meeting?

@Lyndon-Li
Contributor Author

Lyndon-Li commented Aug 18, 2023

Sure, let's try to reach more people and hear more voices.

To summarize my personal opinion: if the solution can meet both 1 and 2, I will fully vote for it; if it only meets 1, I will not be confident in its value.
I also believe that even for Kubernetes itself, this Event implementation is not perfect: it is actually a compromise for etcd's poor performance in handling the load in this scenario.

My understanding may be wrong, so let's see more comments from others.

@weshayutin
Contributor

++ love the idea

@reasonerjt added the Needs Product (Blocked needing input or feedback from Product) label Mar 4, 2024