Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Mark the DataUpload/DataDownload CRs with the node name for easy debugging #6563

Closed
ywk253100 opened this issue Jul 31, 2023 · 13 comments
Closed

Comments

@ywk253100
Copy link
Contributor

Is it possible that mark the DataUpload and DataDownload CRs with the node name so that we can know which node-agent handles it. It is easier for debugging.

Vote on this issue!

This is an invitation to the Velero community to vote on issues, you can see the project's top voted issues listed here.
Use the "reaction smiley face" up to the right of this comment to vote.

  • 👍 for "I would like to see this bug fixed as soon as possible"
  • 👎 for "There are more important bugs to focus on right now"
@Lyndon-Li
Copy link
Contributor

Yes, the information is already in the two CRs. And if you run kubectl get dataupload/datadownload -n velero -w, you will also see the node name is coming right before the data transfer starts.

@ywk253100
Copy link
Contributor Author

Seems that the node info isn't set for the failed dataupload/datadownload:

kubectl -n velero get dataupload -o wide
NAME          STATUS      STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE     NODE
dm-03-7f5gr   Failed      20h                                  default            20h
dm-04-mbxlw   Failed      20h                                  default            20h
dm-05-svkrl   Failed      20h                                  default            20h
dm-06-f6q28   Completed   5h56m     128020480    128020480     default            5h56m   wl-antrea-md-2-khgcv-79b7c9b86fxl9pkq-6fktz
dm-07-7swwz   Completed   5h53m     128020480    128020480     default            5h53m   wl-antrea-md-2-khgcv-79b7c9b86fxl9pkq-6fktz
dm-08-4vdsk   Completed   5h47m     128020480    128020480     default            5h47m   wl-antrea-md-2-khgcv-79b7c9b86fxl9pkq-6fktz
dm-09-mzs76   Completed   5h46m     128020480    128020480     default            5h46m   wl-antrea-md-2-khgcv-79b7c9b86fxl9pkq-6fktz
dm-10-2gbdf   Completed   9m46s     128020480    128020480     default            9m46s   wl-antrea-md-2-khgcv-79b7c9b86fxl9pkq-6fktz

@Lyndon-Li
Copy link
Contributor

A dataupload/datadownload is assigned to one node only after the Prepare stage, if the failure happens before it, no node will be filled to the CR.

Otherwise, if the CR fails during InProgress phase but the node name is not there, this is a bug.
@ywk253100 Could you help to confirm this?

@qiuming-best qiuming-best self-assigned this Aug 2, 2023
@qiuming-best qiuming-best added this to the v1.12 milestone Aug 2, 2023
@Lyndon-Li
Copy link
Contributor

As the discussion, let's add a label to the DUCR/DDCR velero.io/accepted-by : <node-name> to indicate that the CR is accepted by a node. This label will not be changed once the CR is accepted. cc @qiuming-best

@shawn-hurley
Copy link
Contributor

Why is there a label if there is a field?

@Lyndon-Li
Copy link
Contributor

@shawn-hurley
The label is to indicate controller on which node a DUCR/DDCR is accepted and then prepared, this is only for troubleshooting purpose.
After a DUCR/DDCR is prepared, it is finally designated to a node where the controller will do the data transfer. This node may be different from the one which prepared it. This not is set to the field of the CRs. The field is not only for troubleshooting but has logical usages.

@shawn-hurley
Copy link
Contributor

I wonder if an event may be better for this? Would there be a need to query all DataUpload/DataDownload that are prepared by a node?

@Lyndon-Li
Copy link
Contributor

The current implementation is already event driven -- the controller doesn't need to query all the CRs and it doesn't need to react to all the events of a CR either, just watches the events it is interested in.

@shawn-hurley
Copy link
Contributor

Sorry if I was not clear, I am suggesting creating events that can be attached to the DataUpload/DataDownload CR to give this information. Creating Events is what I was referring to, not that the reconciliation needs to watch for informer-based changes. I hope this helps.

qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 4, 2023
@Lyndon-Li
Copy link
Contributor

@shawn-hurley
Thanks for the sharing!

The current requirement is to record the node in which the controller prepared the CR. This is a static information and once set, it is not changed.
Per my understanding, the Events mentioned above is a much more powerful mechanism that it could record multiple pieces information with arbitrary content. On the other hand, this mechanism also involves more code complexities in the controller than a simple label.

Therefore, if the label could meet the requirement, I think we will not need to use the Events mechanisms.

@Lyndon-Li
Copy link
Contributor

Lyndon-Li commented Aug 4, 2023

On the other hand, this Events mechanism reminds me another requirement:

  • At present, for a backup or restore, users need to collect information from multiple places, i.e., from various CRs, from various logs, etc., to tell what has exactly done. In the other words, critical information are not listed centrally in a journal style for Velero backups and restores.

I feel the Events mechanism is suitable to fulfill this requirement:

  • Create Event recorder for Backup and Restore CR
  • Record the critical steps and info as Events along the running of backup/restore
  • In the same cluster, Velero doesn't need to do anything more, users just need to do kubectl describe <backup CR>
  • To support backup sync, Velero needs to backup the Event objects as part of the backup, just like backup logs

I've opened a separate issue #6606 for any further discussion about this

@shawn-hurley
Copy link
Contributor

shawn-hurley commented Aug 4, 2023

The current requirement is to record the node in which the controller prepared the CR. This is a static information and once set, it is not changed.

I am pretty sure this is the exact use case for when pods are scheduled, and that is an event.

What I am worried about is that in the future, you want to remove the label, but folks become used to using it to determine the location rather than the field. I think you can see an example of this with the default snapshot class labels in CSI. I think we even recently had a bug open about it that @sseago responded to.

All that said, I think that adding events so that many things (node agent, controller, third party DM) can give some high level of what the user is seeing is worth looking into for 1.13. I understand if we want to do a simple thing for 1.12 and revert it in 1.13 but just something to consider.

@Lyndon-Li
Copy link
Contributor

@shawn-hurley
I got your points, you are suggesting to deliver wider variation of information, not just this prepare-node info, in a more sustainable way. I agree it is useful, as same as #6606.

Then my proposal is as below:

  • Let's keep using the label for this prepare-node info in 1.12 since it is simple and safe.
  • In 1.13, together with issue Journalized activity recorder for backup and restore #6606, let's see how we can provide the journalized event mechanism as a generic mechanism of Velero, so that various functionalities, i.e., resource backup/restore, data mover, fs backup/restore, snapshot backup/restore, etc. could benefit from this mechanism

qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 7, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 7, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 7, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 7, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 9, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 9, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 9, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 9, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 9, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 9, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 10, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 13, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 13, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 13, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 14, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 14, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 14, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 14, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 14, 2023
qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants