
handle lost CBT case #258

Open · wants to merge 1 commit into base: master

Conversation

@dlunev commented Mar 10, 2025

It is legal for the CBT inside QEMU to be lost, e.g. when the node loses power or QEMU crashes. Dirty bitmaps are stored inside QCOW2 as an auto-clear feature and are removed on every start when the image is unclean. This is done because CBTs are kept in memory and are written out only during a normal QEMU shutdown.

This situation should be handled by the backup software. Backup chains should not be broken, and the only sensible way to handle this is to make a full backup and store it inside the incremental chain. Leaving the end-user to resolve this would not be user-friendly.

The condition is rare, so we should simply create a full backup in this case. Handle this accordingly in the code.

Note: this requires a change in extent handling. As the CBT is not available, BLOCK_STATUS request offsets should be moved over to libvirt.CONTEXT_BASE_ALLOCATION namespace processing (see the sketch below).

Signed-off-by: Denis V. Lunev <[email protected]>
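
As an illustration of the note above, here is a minimal sketch of the extent-handling fallback. It assumes the python libnbd bindings (where nbd.CONTEXT_BASE_ALLOCATION names the base:allocation namespace); connect_for_extents and the bitmap name are hypothetical placeholders, not code from this PR:

import nbd

# Assumed bitmap name for illustration; virtnbdbackup has its own naming scheme.
BITMAP_CONTEXT = "qemu:dirty-bitmap:virtnbdbackup.0"

def connect_for_extents(uri, cbt_available):
    handle = nbd.NBD()
    if cbt_available:
        # Normal incremental case: BLOCK_STATUS queries the dirty bitmap.
        handle.add_meta_context(BITMAP_CONTEXT)
    else:
        # CBT lost: request only base:allocation so the extent pass sees
        # allocated blocks and the backup degrades to a full one.
        handle.add_meta_context(nbd.CONTEXT_BASE_ALLOCATION)
    handle.connect_uri(uri)
    return handle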
@abbbi (Owner) commented Mar 10, 2025

hi,

it might make sense to extend the tests for this situation. How can I reproduce it? Thanks :)

@dlunev (Author) commented Mar 10, 2025

simply kill -9 the QEMU process of an image which has a CBT (i.e. a full backup has been made some time ago) and start it again.

@abbbi (Owner) commented Mar 10, 2025

gotcha, the current version at least bails out if the inconsistency is noticed:

Failed to start backup: [checkpoint inconsistent: missing or broken bitmap 'virtnbdbackup.0' for disk 'sda']

@abbbi (Owner) commented Mar 10, 2025

It needs some more consideration. While I'm a fan of this idea, the current situation is:

  1. if the bitmap is missing, an incremental backup currently bails out, and the user has to clean things up:

Failed to start backup: [checkpoint inconsistent: missing or broken bitmap 'virtnbdbackup.0' for disk 'sda']

From there the user must clean up his checkpoints (remove them with the --metadata option using virsh)
and start over clean.

  2. with the applied patch, the backup continues with a retry and more or less creates a full backup, which is fine.
    But this will cause problems during the next full backup, because the utility currently leaves the checkpoints
    behind and attempts to remove all of them during the full backup, which fails the same way:
[2025-03-10 13:18:05] ERROR root checkpoint - delete [main]:  Error during checkpoint removal: [internal error: bitmap 'virtnbdbackup.0' not found in backing chain of 'sda']
[2025-03-10 13:18:05] ERROR root virtnbdbackup - main [main]:  Failed to remove checkpoint.

The fallback to the full backup happens after all the checkpoint handling, and messes things up a little (see the cleanup sketch after the command sequence below).

virtnbdbackup -d fstrim -l full -o mydata_retry
kill -9 qemu-pid
virsh start testvm
virtnbdbackup -d fstrim -l inc -o mydata_retry # full backup happens
virtnbdbackup -d fstrim -l inc -o mydata_retry # retry ..
virtnbdbackup -d fstrim -l full -o new_full # fails
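
Following up on the checkpoint-removal failure above, a minimal sketch of a possible cleanup, assuming it is enough to drop only the checkpoint metadata (mirroring virsh checkpoint-delete --metadata); drop_stale_checkpoints is a hypothetical helper, not existing virtnbdbackup code:

import libvirt

def drop_stale_checkpoints(domain):
    # Delete only the checkpoint metadata: the bitmaps are already gone
    # from the image, so asking libvirt to also remove them would fail
    # with "bitmap ... not found in backing chain".
    for checkpoint in domain.listAllCheckpoints():
        try:
            checkpoint.delete(libvirt.VIR_DOMAIN_CHECKPOINT_DELETE_METADATA_ONLY)
        except libvirt.libvirtError as err:
            print(f"could not drop {checkpoint.getName()}: {err}")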

@dlunev (Author) commented Mar 10, 2025

I will take a look in the evening, but it seems reasonable.

@abbbi (Owner) commented Mar 12, 2025

I wonder if there is a libvirt API endpoint that could be used to detect the libvirt.VIR_ERR_CHECKPOINT_INCONSISTENT error before actually starting the backup operation. That way a function could be introduced that checks the consistency before any backup operation starts, and it would be easier to do the fallback to incremental with base allocation data.

@dlunev (Author) commented Mar 12, 2025

A QEMU API does exist; I will try to check with libvirt. Anyway, good point.

FYI: this code seems to do the trick.

import sys

import libvirt

def validate_checkpoint(domain, checkpoint):
    try:
        xml = checkpoint.getXMLDesc(0)

        flags = (libvirt.VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE |
                 libvirt.VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE_VALIDATE)

        # Redefine the checkpoint from its own XML description; with the
        # VALIDATE flag libvirt checks that the corresponding bitmap
        # actually exists in the image and is usable.
        return domain.checkpointCreateXML(xml, flags) is not None

    except libvirt.libvirtError:
        # Redefinition fails (e.g. VIR_ERR_CHECKPOINT_INCONSISTENT)
        # when the bitmap is missing or broken.
        return False

def main():
    domain_name = "ubuntu"
    conn = None
    try:
        # Connect to the local QEMU hypervisor
        conn = libvirt.open('qemu:///system')
        if conn is None:
            print('Failed to open connection to qemu:///system')
            sys.exit(1)

        domain = conn.lookupByName(domain_name)

        # Get the list of checkpoints for the domain
        checkpoints = domain.listAllCheckpoints()
        if not checkpoints:
            print(f"No checkpoints found for domain '{domain_name}'")
            return

        for checkpoint in checkpoints:
            valid = validate_checkpoint(domain, checkpoint)
            print(f"Checkpoint {checkpoint.getName()} is valid: {valid}")

    except libvirt.libvirtError as e:
        print(f"Libvirt error: {e}")
        sys.exit(1)

    finally:
        if conn is not None:
            conn.close()

if __name__ == '__main__':
    main()

@abbbi (Owner) commented Mar 13, 2025

yes, redefining the checkpoint with the libvirt.VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE_VALIDATE option is something that will catch this error early.

There is already code for redefining the checkpoints (for the transient VM feature, where VMs are migrated between hosts) in:

https://github.com/abbbi/virtnbdbackup/blob/master/libvirtnbdbackup/virt/checkpoint.py#L187

The function could be adjusted, or its code reused, so that during an incremental/differential backup the consistency of all existing checkpoints is checked and the fallback happens accordingly.

From what it looks like, there might also be versions of libvirt that support getting the checkpoint XML with the --size option. If you request checkpoint-dumpxml --size and the bitmap is inconsistent, no size will be returned.
But that's not as good as attempting to redefine and getting a proper error.
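
For completeness, a sketch of that --size probe from Python. It assumes libvirt exposes the VIR_DOMAIN_CHECKPOINT_XML_SIZE flag and adds a size attribute to the <disk> elements when the bitmap is intact; bitmap_size_known is a hypothetical name:

import xml.etree.ElementTree as ET

import libvirt

def bitmap_size_known(checkpoint):
    # Request the checkpoint XML with size information; an inconsistent
    # bitmap yields disks without a size attribute, or the call fails.
    try:
        xml = checkpoint.getXMLDesc(libvirt.VIR_DOMAIN_CHECKPOINT_XML_SIZE)
    except libvirt.libvirtError:
        return False
    root = ET.fromstring(xml)
    disks = root.findall("./disks/disk")
    return bool(disks) and all("size" in disk.attrib for disk in disks)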
