
Shutting down domain takes a long time #5426

Open
heinrich-ulbricht opened this issue Oct 27, 2019 · 95 comments
Labels
affects-4.1 This issue affects Qubes OS 4.1. affects-4.2 This issue affects Qubes OS 4.2. C: other diagnosed Technical diagnosis has been performed (see issue comments). P: major Priority: major. Between "default" and "critical" in severity.

Comments

@heinrich-ulbricht

heinrich-ulbricht commented Oct 27, 2019

Qubes OS version
R4.0

Affected component(s) or functionality
Shutting down a domain.

Brief summary
Shutting down one of my domains takes ~40 seconds. Other domains only take ~6 seconds.

To Reproduce
Don't know how to reproduce. Works (not) on my machine™.

Expected behavior
Domain shuts down fast.

Actual behavior
Domain takes ages.

Additional context
My domain has 600 GB of private storage. I recently increased the size by ~60 GB, ignoring the advice to increase in steps of 10 GB - I just added 60 GB at once. Now I don't know whether the slow shutdown started then, started when I switched the template from Fedora 28 to Fedora 30, or is unrelated to both. It might also be normal for such a size. Somebody familiar with the inner workings might know whether any of those things could matter.

Solutions you've tried
Don't know where to start, which logs to consult or which tools to use.

Any help to get started diagnosing this is appreciated.

@heinrich-ulbricht heinrich-ulbricht added P: default Priority: default. Default priority for new issues, to be replaced given sufficient information. T: bug labels Oct 27, 2019
@heinrich-ulbricht heinrich-ulbricht changed the title Stopping domain takes a long time Shutting down domain takes a long time Oct 27, 2019
@marmarek
Member

Take a look at /var/log/xen/console/guest-VMNAME.log (also available in the Qubes Manager right-click menu) for that domain. It should contain the individual service stops during domain shutdown, including timestamps - that will give a hint about what takes so long.
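
For example, to follow the log live while the qube shuts down:

# run in dom0; VMNAME as above
sudo tail -f /var/log/xen/console/guest-VMNAME.log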

@heinrich-ulbricht
Author

heinrich-ulbricht commented Oct 28, 2019

@marmarek Looking at the most recent log file for the domain I can see that the log messages report successful shutdown (so it seems) almost immediately after clicking the "Shutdown" button in the Qubes menu. When I click Shutdown and look at the log I immediately see this:

[  OK  ] Reached target Shutdown.
[  OK  ] Reached target Final Step.
[  OK  ] Started Power-Off.
[  OK  ] Reached target Power-Off.
[   35.852900] systemd-shutdown[1]: Syncing filesystems and block devices.
[   35.915401] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[   35.922197] systemd-journald[210]: Received SIGTERM from PID 1 (systemd-shutdow).
[   35.968467] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[   35.976329] systemd-shutdown[1]: Unmounting file systems.
[   35.980770] [1619]: Remounting '/usr/lib/modules' read-only in with options 'lowerdir=/tmp/modules,upperdir=/sysroot/lib/modules,workdir=/sysroot/lib/.modules_work'.
[   35.985098] [1620]: Unmounting '/usr/lib/modules'.
[   35.995188] [1621]: Remounting '/' read-only in with options 'discard'.
[   36.017008] EXT4-fs (xvda3): re-mounted. Opts: discard
[   36.027040] systemd-shutdown[1]: All filesystems unmounted.
[   36.027075] systemd-shutdown[1]: Deactivating swaps.
[   36.027192] systemd-shutdown[1]: All swaps deactivated.
[   36.027219] systemd-shutdown[1]: Detaching loop devices.
[   36.027512] systemd-shutdown[1]: All loop devices detached.
[   36.027542] systemd-shutdown[1]: Detaching DM devices.
[   36.172314] reboot: System halted

And then it takes ~40 seconds. No further log entries are added, so the log does not really help. Is this ok? Where would I look further?

@brendanhoar

  1. What physical storage are you using, SSD or HDD?
  2. Are you configured with the standard LVM-thin, btrfs or standard file-based containers?
  3. What was the initial version of Qubes installed?
  4. Are you running standard or testing repos?

@heinrich-ulbricht
Author

@brendanhoar

  1. classic HDD
  2. standard LVM-thin in a separate pool
  3. Qubes R4.0
    • did a migration lately, but from 4.0 to 4.0 to increase the size of my main-pool SSD; re-created the large pool on another HDD that SHOULD be the same model as the first one
    • I upgraded the underlying Fedora 28 TemplateVM to Fedora 30 lately
  4. standard Qubes and Fedora repos

Your questions led me to do some more testing:

  • I switched the TemplateVM to Fedora 30 and Fedora 28 Minimal - no change, the long shutdown time persists

Shutdown time currently is ~50 seconds in all cases.

The first time I noticed the long shutdown was after increasing the private storage by a large amount. But it might be that I simply didn't notice the long shutdown time before. I'm certain it used to be a lot faster, though. If I could pinpoint it to a slow HDD or something, that would be fine. I'd just like to see some log where the delay can be seen. Maybe it's a timeout or something.

@heinrich-ulbricht
Author

Just had a look at the oldest Xen log file I have, which is from Oct 06. The shutdown looks different there:

         Starting Power-Off...
[25187.655862] systemd-shutdow: 25 output lines suppressed due to ratelimiting
[25187.759906] systemd-shutdown[1]: Syncing filesystems and block devices.
[25187.839439] systemd-shutdown[1]: Sending SIGTERM to remaining processes...
[25187.842755] systemd-journald[203]: Received SIGTERM from PID 1 (systemd-shutdow).
[25187.874552] systemd-shutdown[1]: Sending SIGKILL to remaining processes...
[25187.877111] systemd-shutdown[1]: Unmounting file systems.
[25187.879350] [2493]: Remounting '/usr/lib/modules' read-only in with options 'lowerdir=/tmp/modules,upperdir=/sysroot/lib/modules,workdir=/sysroot/lib/.modules_work'.
[25187.887041] [2494]: Unmounting '/usr/lib/modules'.
[25187.896060] [2495]: Remounting '/' read-only in with options 'discard'.
[25187.923818] EXT4-fs (xvda3): re-mounted. Opts: discard
[25187.927815] systemd-shutdown[1]: All filesystems unmounted.
[25187.927830] systemd-shutdown[1]: Deactivating swaps.
[25187.927869] systemd-shutdown[1]: All swaps deactivated.
[25187.927880] systemd-shutdown[1]: Detaching loop devices.
[25187.928048] systemd-shutdown[1]: All loop devices detached.
[25187.928060] systemd-shutdown[1]: Detaching DM devices.
[25187.954533] xenbus: xenbus_dev_shutdown: device/vbd/51840: Closed != Connected, skipping
[25187.993300] reboot: System halted

The xenbus line is gone in the current log. But then again, those changes might be due to kernel or Xen updates. And the delay is not covered in the log.

@heinrich-ulbricht
Author

heinrich-ulbricht commented Oct 28, 2019

The guid.domain.log differs from those of other domains. It shows:

Icon size: 128x128
libvchan_is_eof
Icon size: 128x128
domain dead
Failed to connect to gui-agent

These log entries seem to be unique to this domain:

domain dead
Failed to connect to gui-agent

Maybe it's just a reporting issue? Some component not noticing that the shutdown has already happened?

@marmarek
Member

Those particular messages are probably unrelated, but I have another idea: during that long shutdown, observe the output of xl list, specifically the state column. Is it -b---- or something else when that long delay happens?
If it's something else, check which version of qubes-libvchan-xen you have, specifically whether it's 4.0.7 or older - a related change was included there (one that could improve this).
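
For example (run in dom0, with VMNAME being the qube's name):

# refresh every second; the state column is the one to watch
sudo watch -n 1 'xl list VMNAME'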

@heinrich-ulbricht
Author

heinrich-ulbricht commented Oct 28, 2019

@marmarek
During shutdown I can observe the state in xl list change like this over time (domain being the name of my domain):

  • domain: r----- (time: 14.1)
  • domain: -b---- (time: 20.2)
  • domain: r----- (time: 21.0)
  • (null): --ps-- (time: 21.0)
  • (null): --ps-d (time: 21.0)

Note how the domain name at some point is replaced by (null) and the time stops.

It stays like (null): --ps-d (21.0) for about 8 seconds, then the entry disappears while the domain is still shutting down for many more seconds according to the Qubes Domains list.

In my VM I have version 4.0.7 of qubes-libvchan-xen.

[user@domain ~]$ dnf list installed | grep qubes-libvchan-xen
qubes-libvchan-xen.x86_64                   4.0.7-1.fc30                          @qubes-vm-r4.0-current

Does this shed some light?

@brendanhoar

brendanhoar commented Oct 28, 2019

  1. What model of HDD is this? Seagate or Western Digital? Does it report that it supports ... trim or unmap? SMR-based HDDs now sometimes do, and it's possible they take more time to process a blkdiscard than SSDs do (which may be occurring when removing volatile/snapshot volumes during VM shutdown).
  2. By any chance, did you change the qubes-wide default pool from the primary (SSD) pool to the secondary (HDD) pool?
  3. Where is the template for this VM located, SSD or HDD?
  4. When you grep the output of sudo lvs for the name of your running VM and the name of the template it is based on...in which pools do the standard, volatile and snapshot volumes reside?

@heinrich-ulbricht
Author

@brendanhoar

  1. It's a Western Digital WD10EAVS from years ago. So SMR, TRIM etc. shouldn't be things to consider (right?)
  2. Yes, I temporarily changed the qubes-wide default pool from the primary (SSD) pool to the secondary pool to restore the large domain to the HDD pool. Then I changed it back to the SSD pool.
  3. The template is located on the SSD.
  4. Find the grep of the running VM below; pool00 is on the SSD, vg_contoso/lv_contoso on the HDD
[heinrich@dom0 console]$ sudo lvs | grep contoso
  vm-contoso-root-snap                               qubes_dom0 Vwi-aotz--  21.72g pool00    vm-fedora-30                         56.84                                  
  lv_contoso                                         vg_contoso  twi-aotz-- 931.27g                                                             61.82  39.53                           
  vm-contoso-private                                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572293911-back                 98.26                                  
  vm-contoso-private-1572268964-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso                                                   98.26                                  
  vm-contoso-private-1572293911-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572268964-back                 98.26                                  
  vm-contoso-private-snap                            vg_contoso  Vwi-aotz-- 585.94g lv_contoso vm-contoso-private                                 98.26                                  
  vm-contoso-volatile                                vg_contoso  Vwi-aotz--  10.00g lv_contoso 

Do you see anything in there that looks odd?

@brendanhoar

brendanhoar commented Oct 29, 2019

Thanks.

I had a (now seemingly incorrect) hunch the issue was that blkdiscard was issuing discards* against thin-volumes hosted on the HDD...but that WD drive is around 10 years old, so it definitely does not support TRIM, so that's not the issue.

-Brendan

* Qubes issues blkdiscard before lvremove post-shutdown for volatile/snapshot volumes and/or post-VM-delete for deleted VMs. It's good practice, for several reasons, for upper layers to notify lower layers that storage is being de-allocated, and LVM does not currently do so explicitly when lvremove is executed. If all layers have pass-down enabled (by default, Qubes doesn't, but many of us enable it), this leads to TRIM requests being issued to the physical storage layer.
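
In other words, the cleanup after VM shutdown is roughly this sequence (volume name illustrative):

# notify lower layers that the blocks are free, then drop the volume
sudo blkdiscard /dev/qubes_dom0/vm-example-volatile
sudo lvremove -f qubes_dom0/vm-example-volatile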

@heinrich-ulbricht
Author

@brendanhoar By explicitly stating the age of this drive you make me wanna do backups more often...
Thanks for the theory and information. The mystery continues.

@heinrich-ulbricht
Author

heinrich-ulbricht commented Oct 31, 2019

I made another observation today while trying to reduce the number of revisions_to_keep for the pool, to see if this changes anything. (Not sure if this works; I'm still trying to make sense of #3256.)

Calling sudo lvs | grep contoso repeatedly while the large domain is shutting down shows that one value in the Data% column decreases steadily until it reaches zero. The "domain has shut down" message pops up at the exact moment the value reaches zero.

Here is the log of calling sudo lvs | grep contoso three times while the domain is doing its ~50 sec shutdown (Data% being 37.36, then 15.23, then 0.01). By the way, there is a noticeable delay of ~3 seconds until each of those calls returns.

[heinrich@dom0 ~]$ sudo lvs | grep contoso
  lv_contoso                                         vg_contoso  twi-aotz-- 931.27g                                                             61.82  55.72                           
  vm-contoso-private                                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572532134-back                 98.26                                  
  vm-contoso-private-1572522261-back                 vg_contoso  Vwi-aotz-- 585.94g lv_contoso                                                   37.36                                  
  vm-contoso-private-1572532035-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572522261-back                 98.26                                  
  vm-contoso-private-1572532134-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572532035-back                 98.26                                  
[heinrich@dom0 ~]$ sudo lvs | grep contoso
  lv_contoso                                         vg_contoso  twi-aotz-- 931.27g                                                             61.82  62.31                           
  vm-contoso-private                                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572532134-back                 98.26                                  
  vm-contoso-private-1572522261-back                 vg_contoso  Vwi-aotz-- 585.94g lv_contoso                                                   15.23                                  
  vm-contoso-private-1572532035-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572522261-back                 98.26                                  
  vm-contoso-private-1572532134-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572532035-back                 98.26                                  
[heinrich@dom0 ~]$ sudo lvs | grep contoso
  lv_contoso                                         vg_contoso  twi-aotz-- 931.27g                                                             61.82  64.77                           
  vm-contoso-private                                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572532134-back                 98.26                                  
  vm-contoso-private-1572522261-back                 vg_contoso  Vwi-aotz-- 585.94g lv_contoso                                                   0.01                                   
  vm-contoso-private-1572532035-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572522261-back                 98.26                                  
  vm-contoso-private-1572532134-back                 vg_contoso  Vwi-a-tz-- 585.94g lv_contoso vm-contoso-private-1572532035-back                 98.26                                  

@marmarek @brendanhoar How would you interpret this? What exactly is happening there? Some kind of diff being calculated, which maybe just takes time due to the large amount of data on a slow HDD?

@heinrich-ulbricht
Author

heinrich-ulbricht commented Oct 31, 2019

It looks like discarding the oldest snapshot takes its time.

Additionally, I just set the revisions_to_keep to 0 and when shutting down the domain 3 snapshots were discarded (two from the old setting revisions_to_keep=2, one current) - each of them taking ages. I could monitor the progress by calling qvm-volume info contoso:private repeatedly and see the snapshots disappear one after another. Now with 0 revisions it still takes time since one snapshot is still being discarded on shutdown (leaving 0).

Increasing the revisions_to_keep again to a value of 2 gives me a nice fast shutdown since the snapshot does not need to be discarded. Of course only until there are 3 of them again, one of them then being discarded on shutdown.
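
For reference, the commands involved (a sketch assuming the per-volume setting; volume name as above):

# watch the snapshots/revisions while the domain shuts down
qvm-volume info contoso:private
# change how many revisions are kept for this volume
qvm-volume config contoso:private revisions_to_keep 2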

@marmarek Does this sound right? Is this supposed to take such a long time? Is there a way to speed this up?

@marmarek
Member

That makes sense. Issuing DISCARD for the whole 580GB volume can indeed take time. LVM handles it internally too, even if the underlying block device doesn't support it - specifically, it releases blocks from the volume and updates internal metadata about those blocks (reference counters etc.).
I see three options for how to proceed here:

  1. Don't do anything and accept the long shutdown - I think the only things slowed down during this time concern the very same qube (for example, if you want to start it again). Other qubes should be unaffected.
  2. Don't issue DISCARD if the underlying device doesn't support it. Issuing DISCARD here is mostly useful for an underlying SSD, not so much for an HDD. On the other hand, detecting the support may be tricky (there are multiple layers to consider, including LUKS, LVM itself, etc.).
  3. Make it more asynchronous, in the background - don't hold up further operations on the qube while it's discarding the old volume. That volume wouldn't be used in any subsequent operations anyway, so it should be doable, but introducing more concurrency into the process may be risky - for example, we'd need to ensure the operation is completed when shutting down the system or removing the qube.

Do you have a specific use case that conflicts with the assumptions in the first point?

@brendanhoar

brendanhoar commented Oct 31, 2019

[Edit - I will amend the below after catching up on the previous several posts.]

Some interplay of the HDD, /etc/lvm/lvm.conf settings (and cryptsetup options, if used) may mean that the blkdiscard issued before lvremove is zeroing out the container files instead of issuing discards, since the backing store is not trim-capable.

  1. What is the setting of issue_discards= in /etc/lvm/lvm.conf?
  2. What is the setting of thin_pool_discards in /etc/lvm/lvm.conf?
  3. If cryptsetup is being used in an intervening layer between the HDD and the PV/VG/thin volume, is the --allow-discards option being used with the cryptsetup?

Lastly, I wonder if blkdiscard automatically falls back to using zeroout if the volume (real or virtual) does not indicate trim support. That would not be optimal behavior on a thin volume. I may have to go dig out some spinning rust from my closet to do some tests.

Brendan

PS - my one-ish liner on monitoring discards as they happen:
#!/bin/bash
watch -n 1 -d \
  "if [ -d /sys/block/sda ] ; then pat1=sd pat2=loop ; else pat1=xvd ; fi ; sync ; echo --DISCARD TOTALS-- ; cat /sys/block/\$pat1*/stat /sys/block/\$pat2*/stat | awk 'BEGIN {print \"DRIVE IOS QMERGES SECTORS MSWAITS\"} {printf \"%5i %-8s %s %15s %11s\n\", NR, \$12, \$13, \$14, \$15}'"

In dom0, drive 1 will be /dev/sda, etc. It'll also show discards on loop devices.
In domU, it uses xvd* instead.

@heinrich-ulbricht
Author

heinrich-ulbricht commented Oct 31, 2019

@brendanhoar

  1. issue_discards in /etc/lvm/lvm.conf is 0
  2. thin_pool_discards in /etc/lvm/lvm.conf is commented out, thus no explicit value set
  3. Where can I check this? I set up the second pool, including cryptsetup, by following these instructions: https://www.qubes-os.org/doc/secondary-storage/ - there's no mention of discard options there, and I configured nothing else.

(Just as a side note, and this could be irrelevant: before I can start the (long-shutdown) AppVM in the second pool I have to start another VM (from the SSD pool) first, OR I have to call sudo service qubesd restart. Otherwise I get an error that the pool could not be found (I'd have to reboot to get the exact error message). I think I commented somewhere else a long time ago about this but have since ignored it, as the workaround was OK for me. Just wanted to mention it since we are taking deeper looks into the specifics of this pool.)

Thanks to both of you for helping me out here.

@heinrich-ulbricht
Author

heinrich-ulbricht commented Oct 31, 2019

@marmarek
First and foremost I want to know if something is broken. If it is broken, I want to repair it, if possible.
If it is not broken, then I want to know the options and the effort it takes (and for whom) to implement them. If it's too much hassle I can live with it.
I'm currently on a dual-boot(*) machine and unfortunately have to switch between operating systems quite often. So every time I shut down Qubes I have this minute-long wait :-/ It's just annoying.

(*) with a physical SATA switch, so not "really" dual-boot in a way that could be relevant to the issue

@marmarek
Member

First and foremost I want to know if something is broken.

I wouldn't call it broken. Rather sub-optimal performance.

So every time I shut down Qubes I have this minute-long wait :-/

This rules out my third idea, as you'd still need to wait during shutdown.

What is the setting of issue_discards= in /etc/lvm/lvm.conf?

I believe this setting applies to non-thin volumes only.

What is the setting of thin_pool_discards in /etc/lvm/lvm.conf?

This one indeed is relevant for thin volumes; possible values: ignore, nopassdown, passdown. If the lower layer doesn't support discard, nopassdown and passdown are effectively equivalent. The default is passdown.

3. If cryptsetup is being used in an intervening layer between the HDD and the PV/VG/thin volume, is the --allow-discards option being used with the cryptsetup?

See https://www.qubes-os.org/doc/disk-trim/#luks
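
To check what each layer actually advertises (commands illustrative; adjust device names for your layout):

# non-zero DISC-GRAN/DISC-MAX means a layer advertises discard support
lsblk --discard
# LUKS mappings opened with --allow-discards carry the allow_discards flag
sudo dmsetup table | grep allow_discards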

@johnnyboy-3

johnnyboy-3 commented Aug 23, 2020

I have the very same problem, discussed here: https://groups.google.com/forum/m/#!topic/qubes-users/-ZnZM6rS4hs
The problem exists even when setting revisions_to_keep to 0.
The workaround for me is to use btrfs.

Btrfs also works great. I noticed that dom0 really hogs the CPU after shutting down huge AppVMs. For some reason Qubes still creates a <timestamp>-back revision just to delete it again afterwards, even though revisions_to_keep is set to 0. Using btrfs I don't even need to set that config to 0; shutdown takes like 1-2 seconds.

@iamahuman

While shutting down a VM, I observe freezing in other VMs on the same pool as well. During the freeze, a kworker/**-dm-thin kernel thread takes 100% of a CPU. I suppose this is the same issue?

@noskb

noskb commented Dec 3, 2020

I'm using an SSD, but I had the same problem; it only occurred on a SanDisk SDSSDXPS480G and not on a Samsung SSD 860 EVO.
Of course, my domain has a large private storage.

In conclusion, it was solved by changing the discard mode of the LVM thin pool from passdown to nopassdown.

sudo lvchange --discards nopassdown qubes_dom0/pool00 in dom0 terminal

This still returns the freed space to the thin pool as before, but no TRIM command will be issued to the physical device.

from the lvmthin manpage

Discard

   The discard behavior of a thin pool LV determines how discard
   requests are handled.  Enabling discard under a file system may
   adversely affect the file system performance (see the section on
   fstrim for an alternative.)  Possible discard behaviors:

   ignore: Ignore any discards that are received.

   nopassdown: Process any discards in the thin pool itself and allow
   the no longer needed extents to be overwritten by new data.

   passdown: Process discards in the thin pool (as with nopassdown), and
   pass the discards down to the underlying device.  This is the
   default mode.

The default "passdown" may cause performance degradation in some cases.

@marmarek
Member

marmarek commented Dec 3, 2020

This sounds like your SSD drive has a very poor implementation of the TRIM operation. "nopassdown" indeed may improve performance in this case but, depending on the SSD model, may reduce the disk's lifetime. It is also possible that the disk firmware is stupid enough that it won't matter at all beyond performance. Hard to tell without doing some extensive tests or analyzing the disk firmware.

@brendanhoar

At a minimum, Qubes should document possible performance issues under LVM on VM shutdown (during volatile volume/snapshot removals) with both SMR HDDs generally and SSDs with poor firmware. Docs should recommend “nopassdown” in lvm.conf as the workaround, noting the caveats for SSDs above.

I was going to recommend exposing this option via the Qubes global settings, but thinking it through, with all the custom storage layouts out there it might get messy, so that's probably not a great idea.

[I say this all as the instigator (or at least one of them) of adding the blkdiscard invocation to lvm volume removals. I still think the benefits outweigh the other issues.]

B

@andrewdavidwong
Member

At a minimum, Qubes should document possible performance issues under LVM on VM shutdown (during volatile volume/snapshot removals) with both SMR HDDs generally and SSDs with poor firmware. Docs should recommend “nopassdown” in lvm.conf as the workaround, noting the caveats for SSDs above.

In case you (or any others reading this) are not already aware, the documentation is a community effort, and everyone is welcome to contribute. (That's how things like this get updated!) So, if you'd like to get involved with the project, this is a great way to do it. You can read more about how to submit documentation changes here:

https://www.qubes-os.org/doc/doc-guidelines/

@DemiMarie

Not a particularly efficient or ideal approach, but I'm using an extension I wrote myself that 1. splits discard requests and 2. does them in the background so that it doesn't block qubesd from performing other tasks.

Do you clone the volume, then after removal of the original, pass the clone to a thread that does throttled discards (and an lvremove of the clone after they complete)? Or do you just rename the volume, hand the renamed volume to a thread that does the throttled blkdiscard and lvremove, and allow the Qubes storage driver to think it was removed?

It's the latter. To be accurate, the Qubes storage driver already forgets the volume by the time it is renamed (as if it were deleted, which means it uses lvrename just like lvremove), and knowledge of the volume's existence is handed over exclusively to the background thread. As for how it is renamed, qubesd prepends a RECLAIM- prefix to the volumes put into the removal queue, so that it can scan for those volumes the next time it restarts. This way qubesd can resume the "throttled discard" operation if it gets interrupted by e.g. a power failure.

Naming VM volumes anything that does not start with vm- is a bad idea unless the relevant udev rules are changed: they could get scanned by udev.
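
For illustration, a minimal sketch of that rename-then-reclaim scheme (volume names borrowed from the earlier lvs output; the actual extension differs in detail):

# hand the volume off under a new name so qubesd can forget it immediately
sudo lvrename vg_contoso vm-contoso-private-1572522261-back RECLAIM-vm-contoso-private-1572522261-back
# throttled discard in bounded chunks, then remove the emptied volume
sudo blkdiscard --step 1G /dev/vg_contoso/RECLAIM-vm-contoso-private-1572522261-back
sudo lvremove -f vg_contoso/RECLAIM-vm-contoso-private-1572522261-back
# per the caveat above, a non-vm- prefix may need matching udev rule changes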

@iamahuman

Naming VM volumes anything that does not start with vm- is a bad idea unless the relevant udev rules are changed: they could get scanned by udev.

Thanks! Forgot to mention it...

@faerbersteve

Has someone tried using a newer version of libvirt?
The newer version 7.1 of libvirt contains a fix for a shutdown race, which is mentioned in the release notes.
Maybe this solves the issue, or part of it.

@marmarek
Member

marmarek commented Jun 3, 2021 via email

@DemiMarie

Not a particularly efficient or ideal approach, but I'm using an extension I wrote myself that 1. splits discard requests and 2. does them in the background so that it doesn't block qubesd from performing other tasks.

Would you mind submitting a PR for this? This seems like the best approach until a proper kernel patch can be written. An email to dm-devel would also be great.

@iamahuman

iamahuman commented Jul 10, 2021

Would you mind submitting a PR for this? This seems like the best approach until a proper kernel patch can be written. An email to dm-devel would also be great.

The code is written for R4.0, so perhaps I'll be setting up a new R4.1 testing environment in the coming week. Might take a while...

@iamahuman

iamahuman commented Aug 1, 2021

It looks like I won't be able to find enough time anytime soon, so here's the plugin code I have been using, at least:

https://gist.github.com/iamahuman/a784efbbcadaec508235a10b52cc8baa

Sub-tasks for morphing it into something actually merge-able:

  • Transform the extension into an integrable patch
  • Get rid of linux_fs.py, switch to blkdiscard -p
  • Replace the worker pool with a purely asynchronous version
  • s/yield from/await/
  • Figure out appropriate lifecycle management scheme for the worker pool (← help needed)
    • Starting: __init__ or something more appropriate
    • Stopping: the operation itself is durable, so we don't need to handle abrupt shutdown cases; we merely have to clean up pending blkdiscard processes on systemctl stop qubesd.
  • Perhaps just make the queue simply local to ThinPool

@andrewdavidwong andrewdavidwong added diagnosed Technical diagnosis has been performed (see issue comments). and removed pr submitted A pull request has been submitted for this issue. labels Aug 1, 2021
@DemiMarie

  • Get rid of linux_fs.py, switch to blkdiscard -p

Is this actually necessary? Using an ioctl will always be more efficient than spawning a process.

  • s/yield from/await/

I would start with this, as it has no dependencies on any of the other changes.

@iamahuman

  • Get rid of linux_fs.py, switch to blkdiscard -p

Is this actually necessary? Using an ioctl will always be more efficient than spawning a process.

Mainly to reduce complexity. However if we keep it, I suppose we shall keep the worker pool as well.


@DemiMarie

Mainly to reduce complexity. However if we keep it, I suppose we shall keep the worker pool as well.

I will let @marmarek decide if the complexity is worthwhile. On bleeding-edge kernels, we might be able to get rid of the pool with io_uring, but there is no way whatsoever that that is worth the effort.

@q4747

q4747 commented Sep 25, 2021

I can confirm that re-installing QubesOS 4.0.4 with BTRFS worked for me to solve this issue.

Previously I had QubesOS 4.0.4 with the default LVM-thin. This led to rather slow shutdowns and sometimes temporary GUI freezes when shutting down VMs with data of ~300GB and larger. It typically took 30 seconds up to 2 minutes, depending on the size.
Just like @heinrich-ulbricht, I was able to observe the progress of the slow deletion of the -back volume with sudo lvs | grep [vm-name].

I'm using the SSD Samsung 870 EVO with 4TB, which works fine now with BTRFS.

@DemiMarie

I think using thin_trim during boot is definitely a good idea. In the long term this needs to be solved at the kernel level.
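
For reference, a heavily hedged sketch using thin_trim from thin-provisioning-tools (the pool must not be active, e.g. run from rescue media; device paths assume the default qubes_dom0/pool00 layout):

# discard all unprovisioned pool space in one offline pass (illustrative paths)
sudo thin_trim --data-dev /dev/mapper/qubes_dom0-pool00_tdata --metadata-dev /dev/mapper/qubes_dom0-pool00_tmeta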

@MonkeyBars3k

Confirming that a fresh install of 4.1 with BTRFS enabled solves this issue.

The instructions on how to select this option during the install setup are detailed here.

@rustybird

rustybird commented Jul 10, 2023

Even if LVM currently doesn't have a dedicated defrag command, wouldn't it still be possible to defragment a volume in a generic way:

  1. Ensure that there's plenty of free space in the pool
  2. Wait for the problematic VM to cleanly shut down, however long it takes
  3. Delete the fragmented save_on_stop volume's revisions (except the latest one!)
  4. Create a new thin volume of the same size
  5. cp --sparse=always from the old volume's to the new volume's /dev node
  6. Delete the old volume
  7. Rename the new volume to the old name
  8. Restart qubesd.service

This is totally untested and maybe I'm missing something due to lack of familiarity with LVM2 and the lvm_thin driver.
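
For steps 4-8, a hedged translation into commands (names from the earlier lvs output; untested, exactly as caveated above):

# 4. create a fresh thin volume of the same size in the same pool
sudo lvcreate -T vg_contoso/lv_contoso -V 585.94g -n vm-contoso-private-defrag
# 5. sparse-copy so unallocated ranges stay unallocated in the new volume
sudo cp --sparse=always /dev/vg_contoso/vm-contoso-private /dev/vg_contoso/vm-contoso-private-defrag
# 6./7. swap the new volume in under the old name
sudo lvremove -f vg_contoso/vm-contoso-private
sudo lvrename vg_contoso vm-contoso-private-defrag vm-contoso-private
# 8. restart qubesd so it re-reads volume state
sudo systemctl restart qubesd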

@andrewdavidwong andrewdavidwong added the affects-4.1 This issue affects Qubes OS 4.1. label Aug 8, 2023
@andrewdavidwong andrewdavidwong removed this from the Release 4.1 updates milestone Aug 13, 2023
@johnnyboy-3

Hello,

with R4.2 things have changed.
With the default partition scheme this problem no longer exists.
Interestingly, the problem now occurs with BTRFS (tested with both an upgrade and a fresh installation).
I do like the compression feature that BTRFS supports; is there a way to enable it with LVM/ext4?
Or is there a fix coming in the near future that solves this issue with BTRFS?

See also small discussion @ https://forum.qubes-os.org/t/very-slow-shutdown-of-large-vms-in-4-2-even-with-btrfs/23156

@andrewdavidwong
Member

with R4.2 things have changed.
With the default partition scheme this problem no longer exists.
Interestingly, the problem now occurs with BTRFS (tested with both an upgrade and a fresh installation).

I guess this means it should still get the affects-4.2 label, since the problem occurs with BTRFS, even though it's not the default.
