Mark the DataUpload/DataDownload CRs with the node name for easy debugging #6563

ywk253100 · 2023-07-31T09:49:48Z

Is it possible to mark the DataUpload and DataDownload CRs with the node name so that we know which node-agent handles each one? That would make debugging easier.

Lyndon-Li · 2023-07-31T13:37:55Z

Yes, the information is already in the two CRs. If you run kubectl get dataupload/datadownload -n velero -w, you will also see the node name appear right before the data transfer starts.

ywk253100 · 2023-08-01T06:17:27Z

It seems that the node info isn't set for failed DataUploads/DataDownloads:

kubectl -n velero get dataupload -o wide
NAME          STATUS      STARTED   BYTES DONE   TOTAL BYTES   STORAGE LOCATION   AGE     NODE
dm-03-7f5gr   Failed      20h                                  default            20h
dm-04-mbxlw   Failed      20h                                  default            20h
dm-05-svkrl   Failed      20h                                  default            20h
dm-06-f6q28   Completed   5h56m     128020480    128020480     default            5h56m   wl-antrea-md-2-khgcv-79b7c9b86fxl9pkq-6fktz
dm-07-7swwz   Completed   5h53m     128020480    128020480     default            5h53m   wl-antrea-md-2-khgcv-79b7c9b86fxl9pkq-6fktz
dm-08-4vdsk   Completed   5h47m     128020480    128020480     default            5h47m   wl-antrea-md-2-khgcv-79b7c9b86fxl9pkq-6fktz
dm-09-mzs76   Completed   5h46m     128020480    128020480     default            5h46m   wl-antrea-md-2-khgcv-79b7c9b86fxl9pkq-6fktz
dm-10-2gbdf   Completed   9m46s     128020480    128020480     default            9m46s   wl-antrea-md-2-khgcv-79b7c9b86fxl9pkq-6fktz

Lyndon-Li · 2023-08-01T07:56:02Z

A DataUpload/DataDownload is assigned to a node only after the Prepare stage. If the failure happens before that, no node is filled into the CR.

However, if the CR fails during the InProgress phase and the node name is missing, that is a bug.
@ywk253100 Could you help confirm this?

Lyndon-Li · 2023-08-02T10:26:51Z

As discussed, let's add a label velero.io/accepted-by: <node-name> to the DUCR/DDCR to indicate that the CR has been accepted by a node. This label will not change once the CR is accepted. cc @qiuming-best
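
To illustrate the "set once, never changed" behavior proposed above, here is a minimal Go sketch. It is not taken from the Velero code base: the helper name markAccepted is hypothetical, and the labels are modeled as a plain map rather than a real Kubernetes object.

```go
package main

import "fmt"

// acceptedByLabel is the label key proposed in this comment.
const acceptedByLabel = "velero.io/accepted-by"

// markAccepted (hypothetical helper) records nodeName under acceptedByLabel
// unless the label is already present. It returns the label map, which is
// allocated if it was nil, and reports whether the label was written.
func markAccepted(labels map[string]string, nodeName string) (map[string]string, bool) {
	if _, ok := labels[acceptedByLabel]; ok {
		// Already accepted by some node: keep the original value.
		return labels, false
	}
	if labels == nil {
		labels = make(map[string]string)
	}
	labels[acceptedByLabel] = nodeName
	return labels, true
}

func main() {
	// The first node to accept the CR wins.
	labels, changed := markAccepted(nil, "node-a")
	fmt.Println(labels[acceptedByLabel], changed) // node-a true

	// A later attempt from another node leaves the label untouched.
	labels, changed = markAccepted(labels, "node-b")
	fmt.Println(labels[acceptedByLabel], changed) // node-a false
}
```

One caveat for a real implementation: Kubernetes label values are limited to 63 characters, so node names longer than that would need special handling.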

shawn-hurley · 2023-08-02T14:25:14Z

Why is there a label if there is a field?

Lyndon-Li · 2023-08-03T02:20:14Z

@shawn-hurley
The label indicates the node whose controller accepted and then prepared the DUCR/DDCR; it is only for troubleshooting.
After a DUCR/DDCR is prepared, it is finally assigned to a node whose controller will do the data transfer. This node may be different from the one that prepared it. This node is recorded in the field of the CRs. The field is not only for troubleshooting; it also has logical uses.
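
The distinction between the label and the field can be sketched in Go as follows. The type dataMoverCR and the function describeNodes are hypothetical simplifications for illustration, not Velero's actual API types; only the label key comes from this thread.

```go
package main

import "fmt"

const acceptedByLabel = "velero.io/accepted-by"

// dataMoverCR is a hypothetical, simplified stand-in for a DataUpload or
// DataDownload object.
type dataMoverCR struct {
	Name   string
	Labels map[string]string
	Node   string // node doing the data transfer; empty until assigned
}

// describeNodes reports both pieces of node information side by side:
// the node that accepted/prepared the CR (label) and the node that
// transfers the data (field). Either may be missing.
func describeNodes(cr dataMoverCR) string {
	acceptedBy := cr.Labels[acceptedByLabel]
	if acceptedBy == "" {
		acceptedBy = "<none>"
	}
	transfer := cr.Node
	if transfer == "" {
		transfer = "<none>"
	}
	return fmt.Sprintf("%s: accepted-by=%s transfer-node=%s", cr.Name, acceptedBy, transfer)
}

func main() {
	// A CR that failed after being accepted but before a transfer node was assigned.
	failedEarly := dataMoverCR{
		Name:   "dm-03-7f5gr",
		Labels: map[string]string{acceptedByLabel: "node-a"},
	}
	// A CR accepted on one node and transferred on another.
	completed := dataMoverCR{
		Name:   "dm-06-f6q28",
		Labels: map[string]string{acceptedByLabel: "node-a"},
		Node:   "node-b",
	}
	fmt.Println(describeNodes(failedEarly))
	fmt.Println(describeNodes(completed))
}
```

In this sketch, a CR that fails early still shows which node accepted it, which is the troubleshooting gap the label is meant to close.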

shawn-hurley · 2023-08-03T12:00:33Z

I wonder if an event may be better for this? Would there be a need to query all DataUpload/DataDownload that are prepared by a node?

Lyndon-Li · 2023-08-03T13:00:34Z

The current implementation is already event driven -- the controller doesn't need to query all the CRs and it doesn't need to react to all the events of a CR either, just watches the events it is interested in.

shawn-hurley · 2023-08-03T14:17:09Z

Sorry if I was not clear, I am suggesting creating events that can be attached to the DataUpload/DataDownload CR to give this information. Creating Events is what I was referring to, not that the reconciliation needs to watch for informer-based changes. I hope this helps.

Lyndon-Li · 2023-08-04T09:03:25Z

@shawn-hurley
Thanks for sharing!

The current requirement is to record the node on which the controller prepared the CR. This is static information; once set, it does not change.
As I understand it, the Events mechanism mentioned above is much more powerful, since it can record multiple pieces of information with arbitrary content. On the other hand, it also adds more code complexity to the controller than a simple label.

Therefore, if the label meets the requirement, I think we do not need the Events mechanism.

Lyndon-Li · 2023-08-04T09:46:39Z

On the other hand, this Events mechanism reminds me of another requirement:

At present, for a backup or restore, users need to collect information from multiple places, e.g., from various CRs and various logs, to tell exactly what has been done. In other words, critical information is not listed centrally in a journal style for Velero backups and restores.

I feel the Events mechanism is suitable for fulfilling this requirement:

Create an Event recorder for the Backup and Restore CRs
Record the critical steps and info as Events while the backup/restore runs
In the same cluster, Velero doesn't need to do anything more; users just need to run kubectl describe <backup CR>
To support backup sync, Velero needs to back up the Event objects as part of the backup, just like backup logs
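
The journal-style idea in the list above can be sketched with a small in-memory recorder in Go. Everything here is hypothetical (the journal type, its methods, the reason strings); it only illustrates the shape of "record critical steps per CR, then read them back in order". A real controller would more likely use the Kubernetes Event API, for example through an event recorder from client-go, rather than this stand-in.

```go
package main

import (
	"fmt"
	"strings"
)

// event is one journal entry attached to a CR.
type event struct {
	Reason  string
	Message string
}

// journal is a hypothetical in-memory stand-in for an Event recorder.
// Entries are keyed by CR name and kept in the order they were recorded.
type journal struct {
	events map[string][]event
}

func newJournal() *journal {
	return &journal{events: make(map[string][]event)}
}

// Record appends one critical step to the journal of the named CR.
func (j *journal) Record(cr, reason, message string) {
	j.events[cr] = append(j.events[cr], event{Reason: reason, Message: message})
}

// Describe renders the journal of the named CR, one numbered line per entry,
// roughly like the Events section a user reads from a describe command.
func (j *journal) Describe(cr string) string {
	var b strings.Builder
	for i, e := range j.events[cr] {
		fmt.Fprintf(&b, "%d. %s: %s\n", i+1, e.Reason, e.Message)
	}
	return b.String()
}

func main() {
	j := newJournal()
	j.Record("backup-1", "Started", "backup started")
	j.Record("backup-1", "DataUploadAccepted", "data upload accepted by node-a")
	j.Record("backup-1", "Completed", "backup completed")
	fmt.Print(j.Describe("backup-1"))
}
```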

I've opened a separate issue #6606 for any further discussion about this

shawn-hurley · 2023-08-04T15:10:59Z

"The current requirement is to record the node in which the controller prepared the CR. This is a static information and once set, it is not changed."

I am pretty sure this is the exact use case for when pods are scheduled, and that is an event.

What I am worried about is that in the future you may want to remove the label, but by then folks will have become used to using it, rather than the field, to determine the location. I think you can see an example of this with the default snapshot class labels in CSI. I think we even recently had a bug open about it that @sseago responded to.

All that said, I think that adding events, so that many things (node agent, controller, third-party DM) can give a high-level view of what the user is seeing, is worth looking into for 1.13. I understand if we want to do a simple thing for 1.12 and revert it in 1.13; it is just something to consider.

Lyndon-Li · 2023-08-04T16:01:36Z

@shawn-hurley
I get your point: you are suggesting that we deliver a wider variety of information, not just this prepare-node info, in a more sustainable way. I agree it is useful, the same as #6606.

So my proposal is as follows:

Let's keep using the label for this prepare-node info in 1.12, since it is simple and safe.
In 1.13, together with issue Journalized activity recorder for backup and restore #6606, let's see how we can provide the journalized event mechanism as a generic mechanism of Velero, so that various functionalities, e.g., resource backup/restore, data mover, fs backup/restore, snapshot backup/restore, could benefit from it.

ywk253100 added the area/datamover label Jul 31, 2023

qiuming-best self-assigned this Aug 2, 2023

qiuming-best added this to the v1.12 milestone Aug 2, 2023

qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 4, 2023

Fix vmware-tanzu#6562 vmware-tanzu#6563 data mover bug

142c66b

Signed-off-by: Ming Qiu <[email protected]>

qiuming-best mentioned this issue Aug 4, 2023

Fix #6562 #6563 data mover bugs #6603

Closed

3 tasks

qiuming-best added a commit to qiuming-best/velero that referenced this issue Aug 7, 2023

Fix data mover bugs vmware-tanzu#6550 vmware-tanzu#6563 vmware-tanzu#…

ec76cf4

…6600 Signed-off-by: Ming Qiu <[email protected]>

qiuming-best mentioned this issue Aug 7, 2023

Fix data mover controller bugs #6616

Merged

3 tasks

reasonerjt added the target/1.12-rc.1 label Aug 9, 2023

qiuming-best closed this as completed Aug 15, 2023
