AMD Radeon 780M #6

Open
forceclosed opened this issue Oct 1, 2024 · 13 comments

Comments

@forceclosed

Hello inga_lovinde,

Issue:
I followed the instructions for this tool, and startup works great. The problem begins after a reboot, shutdown, or sleep of the VM: the VM cannot start again until the host machine is rebooted.

Device:
Beelink SER7 with AMD Radeon 780M Graphics
Windows 11 Pro

Startup and Diagnose Logs:
radeonfix_20240930_215717.log
radeonfix_20240930_215905.log
radeonfix_20240930_220617.log

Any advice is greatly appreciated, Thanks!

@oznakn

oznakn commented Oct 16, 2024

Hello,

I have the very same problem with my system. Have you found any solution for this?

Thanks in advance.

@inga-lovinde
Owner

Sorry for taking so long.
It seems that in the recent AMD GPUs (or in the recent drivers) they introduced many more subdevices which should be disabled on shutdown but aren't (because this tool is not aware that they are related to AMD GPU), so the GPU itself is not shut down gracefully.
I'll release an updated version, with improved AMD device detection logic, soon (as soon as I have access to a PC with VS.NET).

@oznakn

oznakn commented Oct 21, 2024

Hello,

I semi-solved the problem by adding a third device, specifically 0000:c5:00.6, to the PCI passthrough config. Now I can successfully reboot the host machine without any problem (this was not possible before). Rebooting inside the Windows VM still does not work, so the issue is only semi-solved.

@inga-lovinde
Owner

@oznakn that's interesting, what was the device called? I vaguely remember that on my host, (1) GPU device node was duplicated, and maybe (2) there was a separate HDMI audio device node, and I had to passthrough all of them in order to give guest full control over the actual GPU device, instead of splitting it between host and guest.

If this is the same for you, then I'll update the readme, this at least won't require PC with VS.NET from me :)

@oznakn

oznakn commented Oct 21, 2024

Let me get back to you about this. However, if I passthrough all GPU devices, the host goes into a bootloop. I tried many combinations, and the only one that works is passing through 3 devices (main GPU, audio, and one more).

I also want to point out that even with RadeonResetBugFix, rebooting the VM does not work. So it's like:

  • Passthrough 2 devices + No RadeonResetBugFix: No reboot on both host and vm
  • Passthrough 2 devices + With RadeonResetBugFix: No reboot on both host and vm
  • Passthrough 3 devices + No RadeonResetBugFix: No reboot on vm, but can reboot on host
  • Passthrough 3 devices + With RadeonResetBugFix: No reboot on vm, but can reboot on host

@inga-lovinde
Owner

> However, if I passthrough all GPU devices, the host goes into a bootloop.

Not sure if I understand you correctly. Of course you should not pass through all GPU devices that are there (if you have multiple physical GPUs); you should pass through all device nodes that come from the physical AMD GPU you're trying to passthrough.

I can easily imagine that the host doesn't boot if it doesn't have any available GPUs. You'd need to somehow configure your host to boot in headless mode, and I'm not sure that Windows hosts even support this.
So if your host is Windows-based, you'll need at least two physical GPUs (one to be used by the host, and another to be passed through to VM). And you only should pass through device nodes related to the second GPU, not to the first GPU.
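To make "device nodes that come from the same physical device" concrete, here is a minimal Python sketch that groups PCI addresses by their `domain:bus:device` prefix, so that all functions of one slot are listed together. The address list is hypothetical apart from 0000:c5:00.6, which is the only address mentioned in this thread. On an APU, functions sharing a slot with the GPU are not necessarily GPU-related, so the grouping only narrows down what to inspect by hand.

```python
from collections import defaultdict


def group_by_physical_device(addresses):
    """Group PCI addresses (domain:bus:device.function) by domain:bus:device."""
    groups = defaultdict(list)
    for addr in addresses:
        # Everything before the last "." identifies the slot; the rest is the function.
        slot, _, function = addr.rpartition(".")
        groups[slot].append(function)
    return {slot: sorted(functions) for slot, functions in groups.items()}


# Hypothetical listing: only 0000:c5:00.6 is taken from this thread.
example = ["0000:c5:00.0", "0000:c5:00.1", "0000:c5:00.6", "0000:01:00.0"]

for slot, functions in group_by_physical_device(example).items():
    print(slot, "functions:", ", ".join(functions))
```

On a Linux host, the real address list would come from a tool such as `lspci`; what each function actually is still has to be checked there.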

> Passthrough 3 devices + With RadeonResetBugFix: No reboot on vm, but can reboot on host

What do you mean by "no reboot on VM, but can reboot on host"?

@oznakn

oznakn commented Oct 21, 2024

oh, let me provide more information.

I'm using Debian as the host OS. It does not have any graphics output; I just manage Proxmox through a browser. With this setup, I don't need an extra GPU. (By the way, I'm using the same system as the issue author.)

When I pass through only 2 devices, I cannot reboot the host computer, with or without RadeonResetBugFix. But if I pass through 3 devices, I can reboot the host computer. However, it's still problematic to reboot the VM.

So what's happening is that, since the host computer does not use the GPU at all, when we try to reboot the Windows VM the GPU gets stuck due to the reset bug. However, if I reboot the host computer directly, it works (only if I pass through 3 devices).

My thinking is that, as you said, the GPU itself is probably not shut down gracefully. And my gut says that if RadeonResetBugFix could also reset the third device I passed through, it would probably be possible to reboot the Windows VM.

@inga-lovinde
Owner

@oznakn so what do you mean by

> However, if I passthrough all GPU devices, the host goes into a bootloop.

?

This conflicts with your words that the host can be rebooted without any problems if you passthrough three devices. I'm just trying to understand what's going on here.

And also, this is a weird behavior you're seeing with passing through two devices, because IIRC the essence of the bug is that trying to initialize the GPU twice in the same host session (unless the GPU was shut down gracefully) causes the host to lock up. But if you just start the VM once and then try to reboot the host, this shouldn't be a problem (as long as it is a full reboot and not just reinit/reroot, not sure what's the right term for Linux).
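The reasoning above can be written down as a toy state model. This is only a sketch of the bug as recalled in this comment (a second initialization in the same host session hangs unless the GPU was shut down gracefully); it is not real driver or hypervisor behavior, and the class and event names are made up for illustration.

```python
class GpuModel:
    """Toy model of the reset bug as described in this thread (hypothetical)."""

    def __init__(self):
        # True once the GPU was initialized in this host session
        # and has not been shut down gracefully since.
        self.dirty = False
        self.hung = False

    def initialize(self):
        if self.dirty:
            self.hung = True  # second init without graceful shutdown
        self.dirty = True

    def graceful_shutdown(self):
        self.dirty = False


def run(events):
    """Replay a list of event names and report whether the host ends up hung."""
    gpu = GpuModel()
    for event in events:
        if event == "host_full_reboot":
            gpu = GpuModel()  # a full reboot starts a fresh host session
        else:
            getattr(gpu, event)()
    return gpu.hung


print(run(["initialize", "host_full_reboot", "initialize"]))
print(run(["initialize", "initialize"]))
```

Under this model, starting the VM once and then fully rebooting the host never reaches the hung state, which is why the two-device behavior reported above is surprising.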

So with "No reboot on both host and vm", what exactly is the problem? When does it hang up? On shutdown, or on the next startup?

The only reason I can think of for why it would hang on shutdown is that the host, for some reason, tries to use or initialize the GPU after the guest has released it. In that case it would also hang when you shut down the guest without trying to reboot the host, or when you ungracefully "power off" the guest. Does this match your experience?

@xiaomujiayou

> Hello,
>
> I semi-solved the problem by adding a third device, specifically 0000:c5:00.6, to the PCI passthrough config. Now I can successfully reboot the host machine without any problem (this was not possible before). Rebooting inside the Windows VM still does not work, so the issue is only semi-solved.

Following your hint, I tested on an 8845HS: Win10 and Win11 can power on, shut down, and reboot normally. Thanks!

@smarticz

> Sorry for taking so long. It seems that in the recent AMD GPUs (or in the recent drivers) they introduced many more subdevices which should be disabled on shutdown but aren't (because this tool is not aware that they are related to AMD GPU), so the GPU itself is not shut down gracefully. I'll release an updated version, with improved AMD device detection logic, soon (as soon as I have access to a PC with VS.NET).

Yes, we are begging🙏

@smarticz

smarticz commented Dec 25, 2024

"PCI\\VEN_1002&DEV_1640&SUBSYS_16401002", // Radeon High Definition Audio Controller
"PCI\\VEN_1002&DEV_1900&SUBSYS_01241002", // AMD Radeon Phoenix3 GPU
"HDAUDIO\\FUNC_01&VEN_1002&DEV_AA01&SUBSYS_00AA0100", // AMD Audio Controller

Would wildcards like these be OK?

"PCI\\VEN_1002&DEV_*&SUBSYS_*", // Subdevices for AMD Phoenix
"HDAUDIO\\FUNC_01&VEN_1002&DEV_*&SUBSYS_*", // AMD Audio Controller with AMD Phoenix

@inga-lovinde
Owner

inga-lovinde commented Dec 26, 2024

@smarticz, I remember that I considered this idea when originally implementing this project, and decided against it; but I don't remember why exactly. Maybe because I didn't want to process any devices besides those that are absolutely necessary for working around the AMD bug.

One of the problems I can see with the wildcard approach is that "PCI\\VEN_1002&DEV_*&SUBSYS_*" will just catch all AMD devices on the PCI bus, not just GPU-related ones; it will probably mean your entire chipset. Some of these devices might be critically important for correct functioning of the system, and some of them might not actually be disableable.

It might be a good idea to disable all AMD audio though, regardless of its DEV and SUBSYS identifiers.
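The difference between the broad wildcard and an audio-only wildcard can be illustrated with a short Python sketch that uses shell-style matching from the standard `fnmatch` module. This is not how RadeonResetBugFix matches devices; the first three hardware IDs are the ones quoted earlier in this thread, the last one is a made-up ID standing in for some non-GPU device with the same vendor ID, and the audio-only pattern is one possible reading of the idea above.

```python
from fnmatch import fnmatchcase

# Broad patterns as proposed earlier in the thread.
BROAD_PATTERNS = [
    "PCI\\VEN_1002&DEV_*&SUBSYS_*",
    "HDAUDIO\\FUNC_01&VEN_1002&DEV_*&SUBSYS_*",
]

# Assumption: an audio-only pattern that ignores DEV and SUBSYS.
AUDIO_ONLY_PATTERNS = ["HDAUDIO\\FUNC_01&VEN_1002&*"]

DEVICES = {
    "PCI\\VEN_1002&DEV_1640&SUBSYS_16401002": "Radeon High Definition Audio Controller",
    "PCI\\VEN_1002&DEV_1900&SUBSYS_01241002": "AMD Radeon Phoenix3 GPU",
    "HDAUDIO\\FUNC_01&VEN_1002&DEV_AA01&SUBSYS_00AA0100": "AMD Audio Controller",
    # Hypothetical ID, not from the thread: stands in for a non-GPU device.
    "PCI\\VEN_1002&DEV_0000&SUBSYS_00000000": "hypothetical non-GPU device",
}


def matching(patterns):
    """Return the hardware IDs from DEVICES matched by any of the patterns."""
    return [hw for hw in DEVICES if any(fnmatchcase(hw, p) for p in patterns)]


for hw in matching(BROAD_PATTERNS):
    print("broad match:", DEVICES[hw])
for hw in matching(AUDIO_ONLY_PATTERNS):
    print("audio-only match:", DEVICES[hw])
```

In this sketch the broad patterns also select the hypothetical non-GPU device, while the audio-only pattern selects just the HDAUDIO function. Note that it does not cover the PCI audio controller, which would still need its own entry.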


Unfortunately, being unemployed and searching for a job leaves me with much less free time than an employment would. So I don't know when I'll be able to actually work on this.

@smarticz

smarticz commented Jan 2, 2025

It might actually be the case since there’s an issue with the iGPU, which is integrated into the processor, and that processor would be necessary for the machine to function properly. I think this approach would be more suitable for a solo GPU setup. It seems like a good idea to separate the logic from the settings by moving the settings to a file that could be easily edited without recompiling the program. It’s clear that the guys experimented earlier and came to new conclusions, which personally turned out to be useful for me.
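As a rough illustration of moving the settings out of the program, the following Python sketch reads device patterns from JSON text and falls back to built-in defaults when the key is missing. The key name `device_patterns` and the function name are assumptions made for this example; the actual tool is not written in Python and has no such config file as far as this thread shows.

```python
import json

# Defaults compiled into the program; IDs taken from earlier in this thread.
DEFAULT_PATTERNS = [
    "PCI\\VEN_1002&DEV_1640&SUBSYS_16401002",
    "PCI\\VEN_1002&DEV_1900&SUBSYS_01241002",
]


def load_patterns(config_text):
    """Parse device patterns from JSON text (hypothetical format)."""
    config = json.loads(config_text)
    patterns = config.get("device_patterns", DEFAULT_PATTERNS)
    if not isinstance(patterns, list) or not all(isinstance(p, str) for p in patterns):
        raise ValueError("device_patterns must be a list of strings")
    return list(patterns)


# Backslashes must be doubled inside JSON strings, hence the raw literal.
sample = r'{"device_patterns": ["HDAUDIO\\FUNC_01&VEN_1002&DEV_AA01&SUBSYS_00AA0100"]}'
print(load_patterns(sample))
print(load_patterns("{}"))
```

With a file like this, adding a device for new hardware would mean editing one line of text instead of rebuilding the program.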

I tried to implement this, rewrote the code to use wildcards, and it started nicely catching devices without crashing. However, I’m currently stuck on the logic for disabling, as it appears to be some sort of procedure based on your hardware experience, which is why it works. I’m just wondering about callbacks, as you’re creating mechanisms to disable something, but you’re not actually verifying it; for example, you just initialize it again. I assume it’s because optimizing this wasn’t worth the time, as it’s a matter of losing seconds during disable/enable operations, or maybe it’s not possible to check the response, or it’s just tricky know-how because there were attempts, and this is the only way it works.
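The "disable, then verify" idea raised here could look roughly like the Python sketch below: request the disable, then poll a status check until it reports success or a timeout expires. The `disable` and `is_disabled` callables are placeholders for whatever the real tool would call; whether the underlying Windows APIs report a reliable status for these devices is exactly the open question in this comment.

```python
import time


def disable_and_verify(disable, is_disabled, timeout_s=5.0, poll_s=0.1,
                       sleep=time.sleep, clock=time.monotonic):
    """Call disable(), then poll is_disabled() until it is True or time runs out."""
    disable()
    deadline = clock() + timeout_s
    while True:
        if is_disabled():
            return True
        if clock() >= deadline:
            return False
        sleep(poll_s)


# Fake device for demonstration: reports "disabled" on the third status check.
state = {"requested": False, "polls": 0}


def fake_disable():
    state["requested"] = True


def fake_is_disabled():
    state["polls"] += 1
    return state["requested"] and state["polls"] >= 3


print(disable_and_verify(fake_disable, fake_is_disabled, sleep=lambda _: None))
```

A caller could then retry or log a warning when the function returns False, instead of re-initializing unconditionally.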

I hope you manage to find your dream job quickly in the new year, and I sincerely wish you all the best!


5 participants