AMD Radeon Instinct MI50 in a KVM/QEMU fails with "atombios stuck in loop", Fatal error during GPU init #157
Comments
I believe that we saw this internally as well, and it was resolved with a different VBIOS. It was a quirk with MI50 specifically, since other VG20 SKUs were fine. I'm trying to find out if/where we published the solution (setting the GPU as a native PCIe endpoint in the VBIOS) |
Hi @kentrussell, Thanks for looking into this. Any luck finding the documentation or solution for this problem? I'm keen to use ROCm for learning MLOps, at least until AMD drops support for the MI50 😢 🤔 |
I hadn't heard back from the guys yet, so I'll try to ping them again. It seems like a bit of a surprise that we'd identify something and not release a fix for it. But maybe the internal issue didn't document the whole process. I'll hopefully have something within the week |
Thank you @kentrussell, appreciate your efforts! |
So I got in touch with a virtualization expert. He was asking about why vfio-pci was enabled when it's already on passthrough. If you disable vfio-pci and run the KVM without it, does it throw the same error? |
Sorry, I don't understand the question. From what I've read, vfio enables pci-passthrough on the host to allow a VM to use the hardware. On my setup the vfio-pci is enabled only on the host side, not the VM/client side. Just to clarify the message about the vfio-pci kernel driver is from the host.
If I don't set up and enable vfio on the host, then I cannot add the GPU (pci 44:00.0 device) in the VM hardware setup. It complains that the device has not been set up for pci-passthrough. Are you able to clarify with your colleague the meaning of "... why vfio-pci was enabled when it's already on passthrough."? Is there another way to get passthrough? |
Hi @kentrussell Just wondering if you've had any luck from your colleagues or other teams? |
Sorry for the delays on it. I keep poking around and am getting passed around to different contacts, but haven't found anything useful yet. Currently there isn't a new VBIOS to fix it. I have been trying to find out if there's a way to manually force the change through other tools to no avail. I'm still hopeful though. |
Hi @kentrussell, I hope you have been well. Any progress with this issue? Any hacks 🔨 around the problem? ... or could you convince the AMD gods to release all GPU VBIOSes as free and open source? 😃 It would really help the compute ecosystem and help users all over the world. 🚀 |
@nartmada I haven't had any luck trying through all of my unofficial channels and contacts. Think you can make a proper JIRA and assign it to the MI50 program to see if we can get this addressed? I'll ping you with the old JIRA so we can link against it. I just don't know where else we can get a workaround/fix |
Thanks @kentrussell for your efforts! |
Hi @nartmada, @kentrussell, Any luck by creating the JIRA ticket? |
@JustGitting. Apologies for my slow progress. JIRA ticket has been created and I am following up with the MI50 folks. |
Great, thank you @nartmada! 🚀 |
Just pinging you both, hoping to hear some good news 😄 I'm happy to test any procedures, hacks or ideas that have come up so far in solving this problem. 🪛 |
Any progress with the JIRA ticket? |
Sorry, at last check it was still bouncing around. Trying to find who owns it is harder, since some teams work on the latest HW and transition support for specific HW to other teams. Adam might have an update if I've missed it though |
Thank you for chasing this up @kentrussell. Getting support/fixes for old(er) hardware has always been hard (or non-existent) in the computer industry. Getting customers to upgrade all the time is more profitable than supporting products long term. Lack of support for products is a form of "designed obsolescence". /rant Hence, I appreciate your efforts. I've started trying to debug this again. If I find anything I'll share it here. |
Minor update: I've tried QEMU XML options for the VM and/or changed host kernel options, without success. I thought my only option was to try to pass the vbios to the VM, as suggested by https://wiki.archlinux.org/title/PCI_passthrough_via_OVMF, among others.
<hostdev>
...
<rom file='/path/to/your/gpu/bios.bin'/>
...
</hostdev>
I've searched techpowerup.com for the MI50 vbios but nothing turns up. Hence, I've tried to manually dump the GPU vbios, but have also been unsuccessful. On the host running Alpine, I've disabled all vfio and blacklisted the amdgpu so the card should not be initialised. Kernel options: For good measure:
I've rebuilt the initramfs and rebooted. Login as root No amdgpu module is loaded. No drivers are associated with the GPU.
The ROM file is 128KB, indicating it's a UEFI vbios.
Dump vbios.
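The commands from this step did not survive in the post. As a rough sketch of the generic sysfs method (the one the Arch wiki page linked above describes), something like the following could be used. The function names `dump_rom` and `rom_has_signature` are made up for illustration, and the device path in the usage note assumes PCI domain `0000` with the `44:00.0` address mentioned earlier in the thread. The signature check only confirms the file starts with the standard PCI expansion ROM bytes `0x55 0xAA`; it says nothing about whether the dump is complete.

```shell
# Dump a PCI device's option ROM through sysfs (needs root on real hardware).
# $1: sysfs device directory, $2: output file.
dump_rom() {
    rom_dev=$1
    rom_out=$2
    [ -e "$rom_dev/rom" ] || { echo "no rom attribute under $rom_dev" >&2; return 1; }
    echo 1 > "$rom_dev/rom" || return 1   # enable reads of the ROM
    cat "$rom_dev/rom" > "$rom_out"
    rom_status=$?
    echo 0 > "$rom_dev/rom"               # disable reads again
    return "$rom_status"
}

# Succeed only if the file starts with the PCI expansion ROM signature 0x55 0xAA.
rom_has_signature() {
    rom_sig=$(od -An -tx1 -N2 "$1" | tr -d ' \n')
    [ "$rom_sig" = "55aa" ]
}
```

Possible usage as root: `dump_rom /sys/bus/pci/devices/0000:44:00.0 amd_mi50_vbios.bin && rom_has_signature amd_mi50_vbios.bin`. On some cards this read is refused or returns only part of the image, which may be what happened here.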
However, the amd_mi50_vbios.bin is only ... Q. How do I dump the vbios from the AMD GPU? Or has it been disabled? Or can AMD please release the latest vbios for their enterprise GPUs?
|
Hi @kentrussell, @nartmada, Any suggestions for how to dump the vbios on the MI50? |
@JustGitting, sorry for the slow response. I will get the answers to your 2 questions: |
Thank you @nartmada. |
Thankfully dumping the VBIOS is easy on the terminal: |
I've booted to a live Ubuntu which found and initialised the AMD gpu without error. I did the following as suggested:
However, the ...
Q1. What size should the vbios.rom file be?
Q2. How do I check if the ...
Q3. Any other approaches? |
Great news, I've found out how to initialise the GPU in the VM.
1. Dumping Video bios ROM (not needed for VM initialisation, but documenting the method for others)
I've found how to dump the AMD GPU firmware ROM. It turns out that the ROM is not directly accessible via the standard methods detailed previously. I downloaded the GNU/Linux version amdvbflash_linux_4.71.zip from https://www.techpowerup.com/download/ati-atiflash/. I installed the GNU/Linux amdvbflash tool and executed the following commands:
Check what AMD cards are available:
Dump vbios ROM from MI50 using proprietary amdvbflash tool by running:
Check information in ROM.
It's a big bios rom file at 1MB:
2. Fixing "Atombios stuck in loop"
Well, it turns out dumping the ROM file and passing it to the VM is not necessary. The card needs to be reset (re-initialised) by the vendor-reset module (https://github.com/gnif/vendor-reset). This is because the card cannot be reset by the standard kernel methods (in pci_quirks?) before the VM accesses it. I had found someone with the same "atombios stuck in loop" problem in the discussion www.reddit.com/r/VFIO/comments/oxsku7/vfio_amd_vega20_gpu_passthrough_issues/. Eventually I tried vendor-reset out of desperation and it solved the "atombios stuck in loop" problem.
Check kernel has supported features (all should be 'y') for the module:
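The original check was lost from the post, so here is a minimal sketch of one way to do it. The option list is what the vendor-reset README asks for as far as I recall, so treat it as an assumption and compare it with the README in your checkout. `check_kconfig` is a hypothetical helper name; it reads a plain-text kernel config on stdin.

```shell
# Kernel options believed to be required by vendor-reset (assumption: taken
# from memory of its README; verify against the README before relying on it).
REQUIRED_OPTS="CONFIG_FTRACE CONFIG_KPROBES CONFIG_PCI_QUIRKS CONFIG_KALLSYMS CONFIG_KALLSYMS_ALL CONFIG_FUNCTION_TRACER"

# Read a plain-text kernel config on stdin, print every required option that
# is not set to 'y', and return non-zero if any are missing.
check_kconfig() {
    kc_cfg=$(cat)
    kc_missing=0
    for kc_opt in $REQUIRED_OPTS; do
        if ! printf '%s\n' "$kc_cfg" | grep -q "^${kc_opt}=y\$"; then
            echo "missing: $kc_opt"
            kc_missing=1
        fi
    done
    return "$kc_missing"
}
```

Possible usage: `zcat /proc/config.gz | check_kconfig` if the kernel exposes its config there, otherwise feed it the config file your distribution ships under `/boot`.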
Build module:
Add the vendor module name to modules config so it is loaded early in the boot process.
I've also added the module name to /etc/modules-load.d/modules.conf (unnecessarily?)
Update initramfs:
I'd assume the module would be included in the initramfs, but it is not. I guess updating initramfs is unnecessary (?).
Reboot machine. UPDATED: No need to use "workaround". Just need to copy udev rules from vendor-reset to correct udev directory location (gnif/vendor-reset#46 (comment)).
Need real root access for this:
Original reset method setting:
Change reset method:
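For anyone following the manual route rather than the udev rules, a sketch of what these two steps could look like is below. It assumes a kernel that exposes the per-device `reset_method` sysfs attribute (5.15 or later, as far as I know) and that `device_specific` is the method vendor-reset hooks into, as described in the linked vendor-reset issue. `set_reset_method` is a hypothetical helper name, and the path in the usage note assumes PCI domain `0000` with the `44:00.0` address from earlier in the thread.

```shell
# Show the current reset method(s) for a PCI device, then select a new one.
# $1: sysfs device directory, $2: method name (default: device_specific).
set_reset_method() {
    rm_attr="$1/reset_method"
    [ -w "$rm_attr" ] || { echo "cannot write $rm_attr" >&2; return 1; }
    echo "before: $(cat "$rm_attr")"
    echo "${2:-device_specific}" > "$rm_attr" || return 1
    echo "after:  $(cat "$rm_attr")"
}
```

Possible usage as root: `set_reset_method /sys/bus/pci/devices/0000:44:00.0`. The setting does not persist across reboots, which is why the udev rules are the tidier option.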
Because the GPU has a large amount of RAM (32GB), the space the VM reserves for mapping PCI BARs (Base Address Registers, https://wiki.osdev.org/PCI#Base_Address_Registers) needs to be increased (www.reddit.com/r/VFIO/comments/oxsku7/vfio_amd_vega20_gpu_passthrough_issues/). Otherwise you get errors in the VM about insufficient space for PCI BAR allocation.
<domain xmlns:qemu="http://libvirt.org/schemas/domain/qemu/1.0" type="kvm">
...
<qemu:commandline>
<qemu:arg value="-cpu"/>
<qemu:arg value="host,host-phys-bits=on"/>
<qemu:arg value="-fw_cfg"/>
<qemu:arg value="opt/ovmf/X-PciMmio64Mb,string=65536"/>
</qemu:commandline>
</domain>
Start the VM and check that the GPU has initialised:
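The output from this check is missing from the post. One simple approach, sketched here, is to scan the guest's kernel log for the two error strings from this issue's title. `gpu_init_ok` is a hypothetical helper name, and passing this check only means those two errors are absent, not that the card is fully working.

```shell
# Read kernel log text on stdin. Print any line containing one of the two
# errors from this issue and return non-zero if either was found.
gpu_init_ok() {
    if grep -i -E 'atombios stuck in loop|Fatal error during GPU init'; then
        return 1
    fi
    return 0
}
```

Possible usage inside the guest: `dmesg | gpu_init_ok && echo "no known amdgpu init errors"`.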
All looks great. I've been able to install rocm successfully, at least the installation seems to have worked. Yet to use it for actual ML.
3. Suggestions.
This is not targeting AMD staff, who are clearly doing their best; it's a problem with leadership's lack of vision and strategy. After wasting so much time fixing this problem myself, I hope AMD can do better by users and the community.
3.1. FOSS-alise amdvbflash tool.
It's clear GNU/Linux is not a priority, as the amdvbflash tool needed for advanced BIOS tasks has not been updated since 20th March 2020. I also cannot find an official release from AMD. Please push for amdvbflash to be made Free and Open Source software, so the community can learn from and update the tool for GNU/Linux users. We would no longer be dependent on AMD for some support.
3.2. Fix reset problem.
It is crazy that AMD's later GPUs are unusable in VMs unless a 3rd party tool is used (vendor-reset module). It is a major problem for us poor users that AMD doesn't document the problem, fix the problem, or even provide support for the ONE developer (@gnif) who created the workaround. Please fix this problem in a more permanent / sustainable manner. Or provide support to the SINGLE developer who is fixing AMD's problems and who is a single point of failure. We know this is a major risk, as shown by the attempted xz/liblzma ssh backdoor (oss-security - backdoor in upstream xz/liblzma leading to ssh server compromise https://www.openwall.com/lists/oss-security/2024/03/29/4) |
That issue is old and should have been closed. It's not a "work around", the kernel was enhanced to allow exactly what vendor-reset does without as much jank as it had prior. I have updated the issue and posted the proper solution. |
Thanks @gnif! 👍 |
@JustGitting Please advise if we can go ahead and close this ticket. Thanks! |
Hi @ppanchad-amd, This issue has not been fixed as far as I'm aware. There has been no response from @kentrussell, @nartmada or anyone else from AMD regarding the points I made in the previous post. Are there any plans from AMD to fix this issue? Was there a fix? Any updated documentation for a workaround to the problem? |
@JustGitting We have an internal ticket to investigate this issue. I will follow up to get an update. Thanks! |
@ppanchad-amd Any luck with an update? |
Hi @ppanchad-amd, any news regarding this issue? Thanks for chasing this up. |
@JustGitting I have followed up with the internal team and they are looking at how to resolve this issue. I will keep you posted. Thanks! |
Hi everyone, happy new year!
I'm stumped trying to get an AMD MI50 GPU to work in a KVM/QEMU virtual environment.
Problem
The AMD Radeon MI50 is detected by both Ubuntu 22.04 and Alpine 3.19 if the proprietary linux-firmware is installed, running on bare-metal. However, the MI50 has a fatal error in a KVM/QEMU instance.
Hardware
Dell R720 server
Running latest firmware: 2.9.0
CPU 1 and 2: E5-2670 v2
2 x 1100W PSU's
GPU installed in x16 PCI Riser 2.
I tried to disable the integrated graphics option in the R720's bios, but it is greyed out.
Presumably because Dell only officially supported a small list of GPUs on the R720.
Supported GPU cards listed on Pages 36-37 of Technical Guide https://downloads.dell.com/manuals/all-products/esuprt_ser_stor_net/esuprt_poweredge/poweredge-r720_reference-guide_en-us.pdf
Below are the results with Ubuntu and Alpine running on bare-metal.
Ubuntu 22.04 (Live OS)
Kernel: 6.2.0-26-generic
linux-firmware: 20220329
Default settings used.
Alpine 3.19 (installed)
Kernel: 6.6.9
linux-firmware: 20231111
The correct driver is assigned according to lspci -nnv
KVM/QEMU
I set up Alpine 3.19 as the host with PCI passthrough using VFIO per the instructions at https://wiki.alpinelinux.org/wiki/KVM
After rebooting the host, the MI50 has the vfio-pci driver attached.
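A quick way to confirm which driver is bound, without reading through `lspci -nnv` output, is to look at the device's `driver` symlink in sysfs. This is only a sketch: `pci_driver` is a hypothetical helper name, and the path in the usage note assumes PCI domain `0000` with the card at `44:00.0`.

```shell
# Print the name of the kernel driver bound to a PCI device, or "none".
# $1: sysfs device directory.
pci_driver() {
    if [ -L "$1/driver" ]; then
        basename "$(readlink "$1/driver")"
    else
        echo none
    fi
}
```

Possible usage on the host: `pci_driver /sys/bus/pci/devices/0000:44:00.0`, which should print `vfio-pci` once passthrough is set up and `amdgpu` on a bare-metal install that has claimed the card.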
Alpine dmesg:
Using virt-manager on the host, I set up Ubuntu 22.04 as the guest OS with a Q35 chipset and UEFI bios.
I also added the MI50 PCI card during the hardware setup stage so the guest OS can access it.
After installing the OS and rebooting the Ubuntu VM, I get the following error about initialising the card.
Ubuntu has the necessary linux-firmware installed.
I've searched extensively for this problem, but mostly found dead-ends or problems that were not the same.
There are a few people with the problem, but no definitive answers.
Help with RadeonVII error "atombios stuck in a loop" (not a ROCm issue)
ROCm/ROCm#1320
What does this mean? (MI60)
https://community.amd.com/t5/graphics-cards/what-does-this-mean/td-p/599894
I would appreciate any help solving this problem so I can actually use rocm (https://github.com/ROCm/ROCm) as I'm stumped with no leads.
I'm not sure if it's a Dell Server issue or the MI50 doesn't like being in a VM (...like Nvidia that charge extra fees for the privilege)...or I've not setup the KVM/QEMU correctly.