AMD Radeon Instinct MI50 in a KVM/QEMU fails with "atombios stuck in loop", Fatal error during GPU init #157

@JustGitting

Hi everyone, happy new year!

I'm stumped trying to get an AMD MI50 GPU to work in a KVM/QEMU virtual environment.

Problem

On bare metal, both Ubuntu 22.04 and Alpine 3.19 detect the AMD Radeon Instinct MI50, provided the proprietary linux-firmware package is installed. Inside a KVM/QEMU guest, however, the MI50 hits a fatal error during GPU init.

Hardware

Dell R720 server
Running latest firmware: 2.9.0
CPU 1 and 2: E5-2670 v2
2 x 1100W PSUs
GPU installed in x16 PCI Riser 2.

I tried to disable the integrated graphics option in the R720's BIOS, but it is greyed out,
presumably because Dell only officially supported a small list of GPUs on the R720.

Supported GPU cards listed on Pages 36-37 of Technical Guide https://downloads.dell.com/manuals/all-products/esuprt_ser_stor_net/esuprt_poweredge/poweredge-r720_reference-guide_en-us.pdf

Below are the results with Ubuntu and Alpine running on bare metal.

Ubuntu 22.04 (Live OS)
Kernel: 6.2.0-26-generic
linux-firmware: 20220329

Default settings used.

 ACPI: bus type drm_connector registered
 [drm] Initialized mgag200 1.0.0 20110418 for 0000:0c:00.0 on minor 0
 fbcon: mgag200drmfb (fb0) is primary device
 mgag200 0000:0c:00.0: [drm] fb0: mgag200drmfb frame buffer device
 [drm] amdgpu kernel modesetting enabled.
 amdgpu: CRAT table not found
 amdgpu: Virtual CRAT table created for CPU
 amdgpu: Topology: Add CPU node
 [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x02).
 [drm] register mmio base: 0xD4F80000
 [drm] register mmio size: 524288
 [drm] add ip block number 0 <soc15_common>
 [drm] add ip block number 1 <gmc_v9_0>
 [drm] add ip block number 2 <vega20_ih>
 [drm] add ip block number 3 <psp>
 [drm] add ip block number 4 <powerplay>
 [drm] add ip block number 5 <dm>
 [drm] add ip block number 6 <gfx_v9_0>
 [drm] add ip block number 7 <sdma_v4_0>
 [drm] add ip block number 8 <uvd_v7_0>
 [drm] add ip block number 9 <vce_v4_0>
 amdgpu 0000:44:00.0: amdgpu: Fetched VBIOS from ROM BAR
 amdgpu: ATOM BIOS: 113-D1631700-111
 [drm] UVD(0) is enabled in VM mode
 [drm] UVD(1) is enabled in VM mode
 [drm] UVD(0) ENC is enabled in VM mode
 [drm] UVD(1) ENC is enabled in VM mode
 [drm] VCE enabled in VM mode
 amdgpu 0000:44:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
 [drm] GPU posting now...
 amdgpu 0000:44:00.0: amdgpu: MEM ECC is active.
 amdgpu 0000:44:00.0: amdgpu: SRAM ECC is active.
 amdgpu 0000:44:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
 [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
 amdgpu 0000:44:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
 amdgpu 0000:44:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
 amdgpu 0000:44:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
 [drm] Detected VRAM RAM=32752M, BAR=32768M
 [drm] RAM width 4096bits HBM
 [drm] amdgpu: 32752M of VRAM memory ready
 [drm] amdgpu: 257959M of GTT memory ready.
 [drm] GART: num cpu pages 131072, num gpu pages 131072
 [drm] PCIE GART of 512M enabled.
 [drm] PTB located at 0x00000087FEF00000
 amdgpu 0000:44:00.0: amdgpu: PSP runtime database doesn't exist
 amdgpu 0000:44:00.0: amdgpu: PSP runtime database doesn't exist
 amdgpu: hwmgr_sw_init smu backed is vega20_smu
 [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
 [drm] PSP loading UVD firmware
 [drm] Found VCE firmware Version: 57.6 Binary ID: 4
 [drm] PSP loading VCE firmware
 [drm] reserve 0x400000 from 0x87fe000000 for PSP TMR
 amdgpu 0000:44:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
 amdgpu 0000:44:00.0: amdgpu: DTM: optional dtm ta ucode is not available
 amdgpu 0000:44:00.0: amdgpu: RAP: optional rap ta ucode is not available
 amdgpu 0000:44:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
 [drm] Display Core initialized with v3.2.215!
 [drm] kiq ring mec 2 pipe 1 q 0
 [drm] UVD and UVD ENC initialized successfully.
 [drm] VCE initialized successfully.
 kfd kfd: amdgpu: Allocated 3969056 bytes on gart
 amdgpu: sdma_bitmap: ffff
 amdgpu: HMM registered 32752MB device memory
 amdgpu: Virtual CRAT table created for GPU
 amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
 kfd kfd: amdgpu: added device 1002:66a1
 amdgpu 0000:44:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 60
 amdgpu 0000:44:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 13 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 1
 amdgpu 0000:44:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 1
 amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
 amdgpu: Detected AMDGPU 2 Perf Events.
 [drm] Initialized amdgpu 3.49.0 20150101 for 0000:44:00.0 on minor 1
 systemd[1]: Starting Load Kernel Module drm...
 systemd[1]: modprobe@drm.service: Deactivated successfully.
 systemd[1]: Finished Load Kernel Module drm.

Alpine 3.19 (installed)
Kernel: 6.6.9
linux-firmware: 20231111

 ACPI: bus type drm_connector registered
 [drm] amdgpu kernel modesetting enabled.
 amdgpu: Virtual CRAT table created for CPU
 amdgpu: Topology: Add CPU node
 [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x02).
 [drm] register mmio base: 0xD4F80000
 [drm] register mmio size: 524288
 [drm] add ip block number 0 <soc15_common>
 [drm] add ip block number 1 <gmc_v9_0>
 [drm] add ip block number 2 <vega20_ih>
 [drm] add ip block number 3 <psp>
 [drm] add ip block number 4 <powerplay>
 [drm] add ip block number 5 <dm>
 [drm] add ip block number 6 <gfx_v9_0>
 [drm] add ip block number 7 <sdma_v4_0>
 [drm] add ip block number 8 <uvd_v7_0>
 [drm] add ip block number 9 <vce_v4_0>
 amdgpu 0000:44:00.0: amdgpu: Fetched VBIOS from ROM BAR
 amdgpu: ATOM BIOS: 113-D1631700-111
 [drm] UVD(0) is enabled in VM mode
 [drm] UVD(1) is enabled in VM mode
 [drm] UVD(0) ENC is enabled in VM mode
 [drm] UVD(1) ENC is enabled in VM mode
 [drm] VCE enabled in VM mode
 amdgpu 0000:44:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
 [drm] GPU posting now...
 amdgpu 0000:44:00.0: amdgpu: MEM ECC is active.
 amdgpu 0000:44:00.0: amdgpu: SRAM ECC is active.
 amdgpu 0000:44:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7f7f] ras_mask[7f7f]
 [drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
 amdgpu 0000:44:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
 amdgpu 0000:44:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
 amdgpu 0000:44:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
 [drm] Detected VRAM RAM=32752M, BAR=32768M
 [drm] RAM width 4096bits HBM
 [drm] amdgpu: 32752M of VRAM memory ready
 [drm] amdgpu: 257967M of GTT memory ready.
 [drm] GART: num cpu pages 131072, num gpu pages 131072
 [drm] PCIE GART of 512M enabled.
 [drm] PTB located at 0x00000087FEF00000
 amdgpu: hwmgr_sw_init smu backed is vega20_smu
 [drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
 [drm] PSP loading UVD firmware
 [drm] Found VCE firmware Version: 57.6 Binary ID: 4
 [drm] PSP loading VCE firmware
 [drm] reserve 0x400000 from 0x87fe000000 for PSP TMR
 amdgpu 0000:44:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
 amdgpu 0000:44:00.0: amdgpu: DTM: optional dtm ta ucode is not available
 amdgpu 0000:44:00.0: amdgpu: RAP: optional rap ta ucode is not available
 amdgpu 0000:44:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
 [drm] Display Core v3.2.247 initialized on DCE 12.1
 [drm] kiq ring mec 2 pipe 1 q 0
 [drm] UVD and UVD ENC initialized successfully.
 [drm] VCE initialized successfully.
 kfd kfd: amdgpu: Allocated 3969056 bytes on gart
 kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
 amdgpu: Virtual CRAT table created for GPU
 amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
 kfd kfd: amdgpu: added device 1002:66a1
 amdgpu 0000:44:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 60
 amdgpu 0000:44:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 13 on hub 0
 amdgpu 0000:44:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 8
 amdgpu 0000:44:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 8
 amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
 amdgpu: Detected AMDGPU 2 Perf Events.
 [drm] Initialized amdgpu 3.54.0 20150101 for 0000:44:00.0 on minor 0
 [drm] Initialized mgag200 1.0.0 20110418 for 0000:0c:00.0 on minor 1
 fbcon: mgag200drmfb (fb0) is primary device
 mgag200 0000:0c:00.0: [drm] fb0: mgag200drmfb frame buffer device

According to lspci -nnv, the correct driver (amdgpu) is assigned:

44:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro VII/Radeon Instinct MI50 32GB] [1002:66a1] (rev 02)
	IOMMU group: 16
    Kernel driver in use: amdgpu
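
For anyone who wants to repeat this check in a script rather than by eye, here is a minimal sketch in Python. It only parses text that has already been captured from lspci -nnv. The function name driver_in_use and the two regular expressions are my own illustration, not part of any tool, and the header pattern assumes the usual "bus:device.function" prefix that lspci prints.

```python
import re

# A device header line starts with a PCI address such as "44:00.0"
# (optionally prefixed with a domain such as "0000:").
HEADER = re.compile(r"^(?:[0-9a-f]{4}:)?[0-9a-f]{2}:[0-9a-f]{2}\.[0-7]\s", re.IGNORECASE)
DRIVER = re.compile(r"Kernel driver in use:\s*(\S+)")


def driver_in_use(lspci_output, pci_id):
    """Return the bound kernel driver for the first device whose header
    contains [pci_id] (e.g. "1002:66a1"), or None if none is reported."""
    in_target_device = False
    for line in lspci_output.splitlines():
        if HEADER.match(line):
            # New device block: remember whether it is the one we want.
            in_target_device = f"[{pci_id}]" in line
        elif in_target_device:
            match = DRIVER.search(line)
            if match:
                return match.group(1)
    return None
```

Fed the output above with the ID 1002:66a1, this should report amdgpu on bare metal; on the VFIO host it is expected to report vfio-pci instead.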

KVM/QEMU

I set up Alpine 3.19 as the host with PCI passthrough using VFIO, following the instructions at https://wiki.alpinelinux.org/wiki/KVM

After rebooting the host, the MI50 has the vfio-pci driver attached.

Alpine dmesg:

host $ sudo dmesg
	...
 modules=sd-mod,usb-storage,ext4,vfio,vfio-pci,vfio_iommu_type1,vfio_virqfd
	...
 ACPI: bus type drm_connector registered
 vfio_pci: add [1002:66a1[ffffffff:ffffffff]] class 0x000000/00000000
 [drm] amdgpu kernel modesetting enabled.
 amdgpu: Virtual CRAT table created for CPU
 amdgpu: Topology: Add CPU node
 [drm] Initialized mgag200 1.0.0 20110418 for 0000:0c:00.0 on minor 0
 fbcon: mgag200drmfb (fb0) is primary device
 mgag200 0000:0c:00.0: [drm] fb0: mgag200drmfb frame buffer device

host $ lspci -nnv

44:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro VII/Radeon Instinct MI50 32GB] [1002:66a1] (rev 02)
..
        Kernel driver in use: vfio-pci
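
One more thing that can be checked on the host is which devices share the MI50's IOMMU group, since VFIO generally expects every device in a group to be handed over together. Below is a minimal Python sketch that assumes the standard Linux sysfs layout under /sys/kernel/iommu_groups. The function name iommu_group_devices is hypothetical, and group 16 is simply the number lspci reported on bare metal earlier; it may differ on another boot or configuration.

```python
from pathlib import Path


def iommu_group_devices(group, root="/sys/kernel/iommu_groups"):
    """Return the sorted PCI addresses that belong to one IOMMU group.

    Each group is a directory whose "devices" subdirectory holds one entry
    per PCI device. An unknown group yields an empty list.
    """
    devices_dir = Path(root) / str(group) / "devices"
    if not devices_dir.is_dir():
        return []
    return sorted(entry.name for entry in devices_dir.iterdir())
```

Calling iommu_group_devices(16) on the host would list the addresses in that group. If anything other than 0000:44:00.0 shows up, those devices are candidates for needing the vfio-pci driver as well.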

Using virt-manager on the host, I set up Ubuntu 22.04 as the guest OS with a Q35 chipset and UEFI BIOS.
I also added the MI50 PCI card during the hardware setup stage so the guest OS can access it.

After installing the OS and rebooting the Ubuntu VM, I get the following error about initialising the card.

vm $ dmesg 
	...
systemd[1]: Starting Load Kernel Module drm...
 ACPI: bus type drm_connector registered
 systemd[1]: modprobe@drm.service: Deactivated successfully.
 systemd[1]: Finished Load Kernel Module drm.
 [drm] Device Version 0.0
 [drm] Compression level 0 log level 0
 [drm] 12286 io pages at offset 0x1000000
 [drm] 16777216 byte draw area at offset 0x0
 [drm] RAM header offset: 0x3ffe000
 [drm] qxl: 16M of VRAM memory size
 [drm] qxl: 63M of IO pages memory ready (VRAM domain)
 [drm] qxl: 64M of Surface memory size
 [drm] slot 0 (main): base 0xc4000000, size 0x03ffe000
 [drm] slot 1 (surfaces): base 0xc0000000, size 0x04000000
 [drm] Initialized qxl 0.1.0 20120117 for 0000:00:01.0 on minor 0
 fbcon: qxldrmfb (fb0) is primary device
 qxl 0000:00:01.0: [drm] fb0: qxldrmfb frame buffer device
 [drm] amdgpu kernel modesetting enabled.
 amdgpu: CRAT table not found
 amdgpu: Virtual CRAT table created for CPU
 amdgpu: Topology: Add CPU node
 [drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x02).
 [drm] register mmio base: 0xC9200000
 [drm] register mmio size: 524288
 [drm] add ip block number 0 <soc15_common>
 [drm] add ip block number 1 <gmc_v9_0>
 [drm] add ip block number 2 <vega20_ih>
 [drm] add ip block number 3 <psp>
 [drm] add ip block number 4 <powerplay>
 [drm] add ip block number 5 <dm>
 [drm] add ip block number 6 <gfx_v9_0>
 [drm] add ip block number 7 <sdma_v4_0>
 [drm] add ip block number 8 <uvd_v7_0>
 [drm] add ip block number 9 <vce_v4_0>
 amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from ROM BAR
 amdgpu: ATOM BIOS: 113-D1631700-111
 [drm] UVD(0) is enabled in VM mode
 [drm] UVD(1) is enabled in VM mode
 [drm] UVD(0) ENC is enabled in VM mode
 [drm] UVD(1) ENC is enabled in VM mode
 [drm] VCE enabled in VM mode
 amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
 [drm] GPU posting now...
 [drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
 [drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing 4EC8 (len 74, WS 0, PS 8) @ 0x4EE0
 amdgpu 0000:05:00.0: amdgpu: gpu post error!
 amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
 amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
 amdgpu: probe of 0000:05:00.0 failed with error -22
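
To make the failure easier to spot when comparing runs, here is a small Python sketch that pulls the error lines out of saved dmesg text and decodes the probe error number. The pattern list and the function names amdgpu_init_errors and probe_errno_name are my own, chosen to match the messages in the log above; they are not an exhaustive list of what amdgpu can print.

```python
import errno
import re

# Substrings taken from the failing guest log above.
ERROR_PATTERNS = [
    re.compile(r"atombios stuck"),
    re.compile(r"gpu post error"),
    re.compile(r"Fatal error during GPU init"),
    re.compile(r"probe of \S+ failed with error -?\d+"),
]


def amdgpu_init_errors(dmesg_text):
    """Return the lines of dmesg_text that match any known error pattern."""
    return [
        line.strip()
        for line in dmesg_text.splitlines()
        if any(pattern.search(line) for pattern in ERROR_PATTERNS)
    ]


def probe_errno_name(line):
    """Map 'failed with error -N' to its symbolic errno name, if any."""
    match = re.search(r"failed with error -(\d+)", line)
    if not match:
        return None
    return errno.errorcode.get(int(match.group(1)))
```

On Linux, error -22 decodes to EINVAL, so the probe failure itself is only the generic result of the earlier "gpu post error" and says nothing more specific about the cause.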

Ubuntu has the necessary linux-firmware installed.

I've searched extensively for this problem, but mostly found dead ends or problems that were not the same.
A few other people have reported it, but there are no definitive answers:

Help with RadeonVII error "atombios stuck in a loop" (not a ROCm issue)
ROCm/ROCm#1320

What does this mean? (MI60)
https://community.amd.com/t5/graphics-cards/what-does-this-mean/td-p/599894

I would appreciate any help solving this problem so I can actually use ROCm (https://github.com/ROCm/ROCm), as I'm stumped with no leads.

I'm not sure whether it's a Dell server issue, whether the MI50 doesn't like being in a VM (...like Nvidia, who charge extra fees for the privilege), or whether I've not set up KVM/QEMU correctly.
