
handle lost CBT case #258

Open · wants to merge 1 commit into base: master

Conversation

@dlunev commented Mar 10, 2025

It is legal for the CBT inside QEMU to be lost, e.g. when the node loses power or QEMU crashes. Dirty bitmaps are stored inside QCOW2 as an auto-clear feature and are removed on every start when the image is unclean. This is done because CBTs are kept in memory and are written out only during a normal QEMU shutdown.

This situation should be handled by the backup software. Backup chains should not be broken, and the only sensible way to handle this is to make a full backup and store it inside the incremental chain. Leaving the end-user to resolve this would not be user-friendly.

The condition is rare, so we should simply create a full backup in this case. Handle this accordingly in the code.

Note: this requires a change in extent handling. As the CBT is not available, BLOCK_STATUS request offsets should be moved over to libvirt.CONTEXT_BASE_ALLOCATION namespace processing (see the sketch below).

Signed-off-by: Denis V. Lunev <[email protected]>
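
As an illustration of the note above, here is a minimal sketch of the extent-handling fallback. It assumes the python libnbd bindings (where nbd.CONTEXT_BASE_ALLOCATION names the base:allocation namespace); connect_for_extents and the bitmap name are hypothetical placeholders, not code from this PR:

import nbd

# Assumed bitmap name for illustration; virtnbdbackup has its own naming scheme.
BITMAP_CONTEXT = "qemu:dirty-bitmap:virtnbdbackup.0"

def connect_for_extents(uri, cbt_available):
    handle = nbd.NBD()
    if cbt_available:
        # Normal incremental case: BLOCK_STATUS queries the dirty bitmap.
        handle.add_meta_context(BITMAP_CONTEXT)
    else:
        # CBT lost: request only base:allocation so the extent pass sees
        # allocated blocks and the backup degrades to a full one.
        handle.add_meta_context(nbd.CONTEXT_BASE_ALLOCATION)
    handle.connect_uri(uri)
    return handle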
@abbbi (Owner) commented Mar 10, 2025

hi,

it might make sense to extend the tests for this situation. How can I reproduce it? Thanks :)

@dlunev (Author) commented Mar 10, 2025

simply kill -9 the QEMU process of an image which has a CBT (i.e. a full backup has been made some time ago) and start it again.

@abbbi (Owner) commented Mar 10, 2025

gotcha, the current version at least bails out if the inconsistency is noticed:

Failed to start backup: [checkpoint inconsistent: missing or broken bitmap 'virtnbdbackup.0' for disk 'sda']

@abbbi (Owner) commented Mar 10, 2025

It needs some more consideration. While I'm a fan of this idea, the current situation is:

  1. if the bitmap is missing, an incremental backup currently bails out, and the user has to clean things up:

Failed to start backup: [checkpoint inconsistent: missing or broken bitmap 'virtnbdbackup.0' for disk 'sda']

From there the user must clean up his checkpoints (remove them with the --metadata option using virsh)
and start over clean.

  2. with the applied patch, the backup continues with a retry and more or less creates a full backup, which is fine.
    But this will cause problems during the next full backup, because the utility currently leaves the checkpoints
    behind and attempts to remove all of them during the full backup, which fails the same way:
[2025-03-10 13:18:05] ERROR root checkpoint - delete [main]:  Error during checkpoint removal: [internal error: bitmap 'virtnbdbackup.0' not found in backing chain of 'sda']
[2025-03-10 13:18:05] ERROR root virtnbdbackup - main [main]:  Failed to remove checkpoint.

The fallback to the full backup happens after all the checkpoint handling, and messes things up a little (see the cleanup sketch after the command sequence below).

virtnbdbackup -d fstrim -l full -o mydata_retry
kill -9 qemu-pid
virsh start testvm
virtnbdbackup -d fstrim -l inc -o mydata_retry # full backup happens
virtnbdbackup -d fstrim -l inc -o mydata_retry # retry ..
virtnbdbackup -d fstrim -l full -o new_full # fails
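
Following up on the checkpoint-removal failure above, a minimal sketch of a possible cleanup, assuming it is enough to drop only the checkpoint metadata (mirroring virsh checkpoint-delete --metadata); drop_stale_checkpoints is a hypothetical helper, not existing virtnbdbackup code:

import libvirt

def drop_stale_checkpoints(domain):
    # Delete only the checkpoint metadata: the bitmaps are already gone
    # from the image, so asking libvirt to also remove them would fail
    # with "bitmap ... not found in backing chain".
    for checkpoint in domain.listAllCheckpoints():
        try:
            checkpoint.delete(libvirt.VIR_DOMAIN_CHECKPOINT_DELETE_METADATA_ONLY)
        except libvirt.libvirtError as err:
            print(f"could not drop {checkpoint.getName()}: {err}")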

@dlunev (Author) commented Mar 10, 2025

I will take a look in the evening, but it seems reasonable.

@abbbi (Owner) commented Mar 12, 2025

I wonder if there is a libvirt API endpoint that could be used to detect the libvirt.VIR_ERR_CHECKPOINT_INCONSISTENT error before actually starting the backup operation. That way a function could be introduced that checks the consistency before any backup operation starts, and it would be easier to do the fallback to incremental with base allocation data.

@dlunev (Author) commented Mar 12, 2025

A QEMU API does exist; I will try to check with libvirt. Anyway, good point.

FYI: this code seems to do the trick.

import sys

import libvirt

def validate_checkpoint(domain, checkpoint):
    try:
        xml = checkpoint.getXMLDesc(0)

        flags = (libvirt.VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE |
                 libvirt.VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE_VALIDATE)

        # Redefine the checkpoint from its own XML description; with the
        # VALIDATE flag libvirt checks that the corresponding bitmap
        # actually exists in the image and is usable.
        return domain.checkpointCreateXML(xml, flags) is not None

    except libvirt.libvirtError:
        # Redefinition fails (e.g. VIR_ERR_CHECKPOINT_INCONSISTENT)
        # when the bitmap is missing or broken.
        return False

def main():
    domain_name = "ubuntu"
    conn = None
    try:
        # Connect to the local QEMU hypervisor
        conn = libvirt.open('qemu:///system')
        if conn is None:
            print('Failed to open connection to qemu:///system')
            sys.exit(1)

        domain = conn.lookupByName(domain_name)

        # Get the list of checkpoints for the domain
        checkpoints = domain.listAllCheckpoints()
        if not checkpoints:
            print(f"No checkpoints found for domain '{domain_name}'")
            return

        for checkpoint in checkpoints:
            valid = validate_checkpoint(domain, checkpoint)
            print(f"Checkpoint {checkpoint.getName()} is valid: {valid}")

    except libvirt.libvirtError as e:
        print(f"Libvirt error: {e}")
        sys.exit(1)

    finally:
        if conn is not None:
            conn.close()

if __name__ == '__main__':
    main()

@abbbi (Owner) commented Mar 13, 2025

yes, redefining the checkpoint with the libvirt.VIR_DOMAIN_CHECKPOINT_CREATE_REDEFINE_VALIDATE option is something that will catch this error early.

There is already code for redefining the checkpoints (for the transient VM feature, where VMs are migrated between hosts) in:

https://github.com/abbbi/virtnbdbackup/blob/master/libvirtnbdbackup/virt/checkpoint.py#L187

The function could be adjusted, or its code reused, so that during an incremental/differential backup the consistency of all existing checkpoints is checked and the fallback happens accordingly.

From what it looks like, there might also be versions of libvirt that support getting the checkpoint XML with the --size option. If you request checkpoint-dumpxml --size and the bitmap is inconsistent, no size will be returned.
But that's not as good as attempting to redefine and getting a proper error.
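
For completeness, a sketch of that --size probe from Python. It assumes libvirt exposes the VIR_DOMAIN_CHECKPOINT_XML_SIZE flag and adds a size attribute to the <disk> elements when the bitmap is intact; bitmap_size_known is a hypothetical name:

import xml.etree.ElementTree as ET

import libvirt

def bitmap_size_known(checkpoint):
    # Request the checkpoint XML with size information; an inconsistent
    # bitmap yields disks without a size attribute, or the call fails.
    try:
        xml = checkpoint.getXMLDesc(libvirt.VIR_DOMAIN_CHECKPOINT_XML_SIZE)
    except libvirt.libvirtError:
        return False
    root = ET.fromstring(xml)
    disks = root.findall("./disks/disk")
    return bool(disks) and all("size" in disk.attrib for disk in disks)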
