vng silently hangs if boot fails #217

Closed
bjackman opened this issue Jan 10, 2025 · 6 comments · Fixed by #227

Comments

@bjackman

I have a system whose boot fails like this: [ 2.073575] Kernel panic - not syncing: Attempted to kill init! exitcode=0x00007f00.

I can see this if I pass --verbose, but if I just run vng, I get no error messages.

My initial feeling was that the kernel console should be included in the default output regardless of --verbose but I suspect there's a good reason why it isn't.

Another idea I had would be to combine the kernel's panic=-1 (reboot immediately on panic) with QEMU's -no-reboot (exit instead of rebooting), so that a panic makes QEMU exit rather than hang. But then that breaks the use case of people who genuinely want to reboot (is that a valid use case? I guess so?).

Basically, I think the ideal behaviour is that if the kernel panics before we hit virtme-init, we spit out the console and exit with an error.

So... any ideas?

@SPYFF
Contributor

SPYFF commented Jan 29, 2025

Version?

@matttbe
Collaborator

matttbe commented Jan 29, 2025

Hello,

@bjackman : which mode are you using? Interactive or script mode?
In script mode, the VM should stop on a panic rather than hang (thanks to panic=-1 and -no-reboot), see #70.

In interactive mode, I agree it is unclear what can be done when it crashes before the init script. Maybe a timeout for the init script? In case of timeout, a message could be printed to use the verbose mode to see what's wrong?
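
The timeout idea could be sketched roughly as below. This is only an illustration, not how vng works: `wait_for_init`, the readiness marker string and the 30-second default are all made up for the example. The only detail taken from this thread is the hint to re-run with --verbose. A reader thread is used so that a hung guest (which stops producing console output) cannot block the wait forever.

```python
import queue
import sys
import threading
import time
from typing import Iterable

# Hypothetical marker: the real init script may signal readiness differently.
INIT_READY_MARKER = "virtme-init: ready"


def wait_for_init(console: Iterable[str], timeout_s: float = 30.0) -> bool:
    """Return True if the readiness marker shows up before the deadline."""
    lines: "queue.Queue[str]" = queue.Queue()

    def pump() -> None:
        # Reading the console may block, so do it in a background thread.
        for line in console:
            lines.put(line)

    threading.Thread(target=pump, daemon=True).start()

    deadline = time.monotonic() + timeout_s
    while True:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            line = lines.get(timeout=remaining)
        except queue.Empty:
            break  # no console output until the deadline
        if INIT_READY_MARKER in line:
            return True

    print("guest did not reach init in time; re-run with --verbose "
          "to see the kernel console", file=sys.stderr)
    return False
```

With a guest that panics early, the marker never appears and the function returns False once the timeout expires, after printing the hint to stderr.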

@bjackman
Author

bjackman commented Jan 29, 2025

> Version?

❯❯  vng --version
virtme-ng 1.31+43.g67cac60

> @bjackman : which mode are you using? Interactive or script mode?

I'm guessing script mode means when you pass a command after --? I'm not doing that, so I guess it's interactive mode.

Some examples of the commands in my history:

► unshare -r vng --root ~/src/limmat-kernel/mkosi-rootfs/image/ --verbose --user root --debug --append=spectre_bhi=vmexit
► unshare -r vng --root ~/src/limmat-kernel/mkosi-rootfs/image/ --verbose --user root --gdb
► unshare -r vng --root ~/src/limmat-kernel/mkosi-rootfs/image/ --verbose --user root --debug
► unshare -r vng --root ~/src/limmat-kernel/mkosi-rootfs/image/ --verbose --user root --debug --append=spectre_bhi=off
► unshare -r vng --root ~/src/limmat-kernel/mkosi-rootfs/image/ --verbose --user root
► unshare -r vng --root ~/src/limmat-kernel/mkosi-rootfs/image/ --verbose --user root --disable-microvm --append=asi=on --rwdir /mnt/kvmtool=$HOME/src/kvmtool

> Maybe a timeout for the init script? In case of timeout, a message could be printed to use the verbose mode to see what's wrong?

Yeah - although I think before adding that complexity it would be good to confirm my assumption that setting -no-reboot would break a use case that we care about.

I guess the ideal would be if there was a way to start QEMU with the -no-reboot behaviour and then have the init script communicate via the vsock or whatever it is that QGA uses, to tell QEMU "OK, now you're allowed to reboot instead of exit". But that's probably more complicated than just having the timeout + print message.
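
A possible building block for this, assuming a reasonably recent QEMU: the `-action reboot=shutdown` option behaves like -no-reboot, and the QMP `set-action` command can change the reboot action at runtime. Both should be checked against the QMP reference for the QEMU version in use. The sketch below only builds the QMP messages a host-side helper might send once init reports that it is up. How init would notify the host (vsock, virtio-serial, ...) is left out, and `allow_reboot_messages` is a made-up name.

```python
import json
from typing import List


def allow_reboot_messages() -> List[str]:
    """Build the QMP lines that switch the reboot action back to 'reset'."""
    messages = [
        # QMP requires capability negotiation before any other command.
        {"execute": "qmp_capabilities"},
        # Assumption: this QEMU supports set-action; "reset" means a guest
        # reboot really resets the VM instead of making QEMU exit.
        {"execute": "set-action", "arguments": {"reboot": "reset"}},
    ]
    return [json.dumps(message) for message in messages]
```

Each returned string would be written as one line to the QMP socket. Until then, a panic with panic=-1 would still make QEMU exit.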

@matttbe
Collaborator

matttbe commented Jan 29, 2025

> > Version?
>
> ❯❯  vng --version
> virtme-ng 1.31+43.g67cac60
>
> > @bjackman : which mode are you using? Interactive or script mode?
>
> I'm guessing script mode means when you pass a command after --? I'm not doing that, so I guess it's interactive mode.

Yes, sorry: script mode is everything passed after -- or using --exec.

Did you use vng to build the kernel as well? I suspect your original issue comes from there, no? Maybe some missing kconfig?
And not from running your own kernel (vng -r) or one from the cloud (vng -r v6.6.17)?

> > Maybe a timeout for the init script? In case of timeout, a message could be printed to use the verbose mode to see what's wrong?
>
> Yeah - although I think before adding that complexity it would be good to confirm my assumption that setting -no-reboot would break a use case that we care about.

It is hard to assume that nobody explicitly wants to reboot the VM :)
I guess it should not be common. But I guess that's why @arighi only restricted this to "if you use vng in script mode".

> I guess the ideal would be if there was a way to start QEMU with the -no-reboot behaviour and then have the init script communicate via the vsock or whatever it is that QGA uses, to tell QEMU "OK, now you're allowed to reboot instead of exit". But that's probably more complicated than just having the timeout + print message.

Or use panic=-1 and -no-reboot by default in interactive mode as well, except if a new option is set (--allow-reboot and/or --panic <value>)?
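
As a sketch of that proposal: both `--allow-reboot` and `--panic` are hypothetical options that only exist in this discussion, and `build_args` is a made-up helper, not part of vng. The kernel's `panic=` parameter and QEMU's `-no-reboot` are the real pieces.

```python
import argparse
from typing import List, Optional, Tuple


def build_args(argv: Optional[List[str]] = None) -> Tuple[List[str], List[str]]:
    """Return (kernel cmdline parts, extra QEMU args) for the given options."""
    parser = argparse.ArgumentParser(prog="vng-sketch")
    # Both options are hypothetical; they mirror the proposal above.
    parser.add_argument("--allow-reboot", action="store_true",
                        help="let the guest reboot instead of exiting QEMU")
    parser.add_argument("--panic", type=int, default=-1, metavar="VALUE",
                        help="value for the kernel's panic= parameter")
    opts = parser.parse_args(argv)

    # panic=-1 reboots immediately on panic; panic=0 waits forever (hang).
    kernel_cmdline = [f"panic={opts.panic}"]
    # -no-reboot turns a guest reboot into a QEMU exit.
    qemu_args = [] if opts.allow_reboot else ["-no-reboot"]
    return kernel_cmdline, qemu_args
```

With no options this yields panic=-1 plus -no-reboot, so a panic ends the session. Passing `--allow-reboot --panic 0` would bring back the current interactive behaviour of hanging on panic.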

@bjackman
Author

bjackman commented Jan 29, 2025

> Did you use vng to build the kernel as well? I suspect your original issue comes from there, no? Maybe some missing kconfig?

Oh yeah, don't worry, I don't need help with the original problem. I can't remember what it was (I think I had probably disabled paravirt in my guest kernel, so I needed --disable-microvm). I'm just raising this to improve the UX of vng itself: I think being helpful when the guest kernel is broken is close to its fundamental value proposition (maybe I'm biased as someone who breaks kernels a lot!).

> But I guess that's why @arighi only restricted this to "if you use vng in script mode".

Yeah that's a good point - it definitely sounds like rebooting is a use case that matters to someone.

> if a new option is set

Yeah I guess that would be an option. But, if there are people relying on the current behaviour, it would cause pain for them. So I think your timeout idea probably is the best solution here.

@arighi
Owner

arighi commented Jan 30, 2025

The combo panic=-1 and -no-reboot is used only in non-interactive mode (when you pass a command to vng). If a panic happens, vng will report the special error code 255, see commit 838cca0.
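
A caller can check for that exit code in a script. A minimal sketch, assuming only what is stated above (255 signals a guest panic in non-interactive mode); `describe_exit` and `run_in_vm` are made-up helper names, and `vng -- <command>` is the script-mode form mentioned earlier in this thread.

```python
import subprocess
from typing import List

# Per the comment above: vng reports 255 when the guest kernel panics.
PANIC_EXIT_CODE = 255


def describe_exit(returncode: int) -> str:
    """Turn an exit code from a script-mode run into a short message."""
    if returncode == 0:
        return "ok"
    if returncode == PANIC_EXIT_CODE:
        # A command that itself exits with 255 would look the same.
        return "guest kernel panic (re-run with --verbose for the console)"
    return f"command failed with exit code {returncode}"


def run_in_vm(command: List[str]) -> str:
    """Run a command in script mode and describe how it ended."""
    result = subprocess.run(["vng", "--", *command], check=False)
    return describe_exit(result.returncode)
```

For example, `run_in_vm(["uname", "-r"])` would return "ok" on a healthy kernel and the panic message if the guest dies before or while running the command.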

We don't do this in interactive mode, since we want to simulate what a machine would do normally on panic (just hang) unless some special boot options are passed via --append ....

However, it'd be nice to always print kernel oops / panic to stderr when running in interactive mode. I'll do some testing and prepare a patch.

arighi added a commit that referenced this issue Jan 30, 2025
As mentioned in #217 it would be nice to always print the output of
critical kernel errors (oops / panic), instead of suppressing all the
kernel logs completely by default.

Therefore, keep suppressing the boot kernel log, but always dump
panic/oops to stderr by default when running in interactive mode.

Example with this change applied:

arighi@virtme-ng~/s/linux (master)> vng
          _      _
   __   _(_)_ __| |_ _ __ ___   ___       _ __   __ _
   \ \ / / |  __| __|  _   _ \ / _ \_____|  _ \ / _  |
    \ V /| | |  | |_| | | | | |  __/_____| | | | (_| |
     \_/ |_|_|   \__|_| |_| |_|\___|     |_| |_|\__  |
                                                |___/
   kernel version: 6.13.0-virtme x86_64
   (CTRL+d to exit)

arighi@virtme-ng~/s/linux (master)> echo c | sudo tee /proc/sysrq-trigger
[    8.923672] sysrq: Trigger a crash
[    8.923980] Kernel panic - not syncing: sysrq triggered crash
[    8.924183] CPU: 0 UID: 0 PID: 198 Comm: tee Not tainted 6.13.0-virtme #2
[    8.924632] Call Trace:
[    8.924704]  <TASK>
[    8.924783]  panic+0x349/0x3b0
[    8.925055]  sysrq_handle_crash+0x36/0x80
[    8.925181]  __handle_sysrq+0xed/0x270
[    8.925274]  write_sysrq_trigger+0x6a/0x90
[    8.925380]  proc_reg_write+0x56/0xa0
[    8.925489]  vfs_write+0x105/0x590
[    8.925600]  ksys_write+0x74/0xf0
[    8.925682]  do_syscall_64+0xbb/0x1d0
[    8.925767]  entry_SYSCALL_64_after_hwframe+0x77/0x7f
[    8.925891] RIP: 0033:0x7fdfed54ba84

Signed-off-by: Andrea Righi <[email protected]>
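
The behaviour described in this commit message (hide the boot log, still show panics and oopses) could be approximated with a filter like the one below. It is not the code from the actual patch, which may well rely on kernel log levels instead: the regular expression and the names `filter_console` and `dump_to_stderr` are assumptions for illustration, and real oops detection likely needs more patterns.

```python
import re
import sys
from typing import Iterable, Iterator

# Assumed trigger patterns; not taken from the actual patch.
PANIC_RE = re.compile(r"Kernel panic|Oops|BUG:")


def filter_console(lines: Iterable[str]) -> Iterator[str]:
    """Drop the boot log, but yield everything from the first panic/oops on."""
    dumping = False
    for line in lines:
        if not dumping and PANIC_RE.search(line):
            dumping = True
        if dumping:
            yield line


def dump_to_stderr(lines: Iterable[str]) -> None:
    """Forward the filtered console output to stderr."""
    for line in filter_console(lines):
        sys.stderr.write(line)
```

Unlike the example output above, this pattern-based sketch would not print the "sysrq: Trigger a crash" line, because it comes before the first line matching the pattern.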