Description
Hi everyone, happy new year!
I'm stumped trying to get an AMD MI50 GPU to work in a KVM/QEMU virtual environment.
Problem
The AMD Radeon MI50 is detected and initialised by both Ubuntu 22.04 and Alpine 3.19 when running on bare metal with the proprietary linux-firmware installed. Inside a KVM/QEMU guest, however, the amdgpu driver hits a fatal error during GPU init.
Hardware
Dell R720 server
Running latest firmware: 2.9.0
CPU 1 and 2: E5-2670 v2
2 x 1100W PSUs
GPU installed in x16 PCI Riser 2.
I tried to disable the integrated graphics option in the R720's BIOS, but it is greyed out.
This is presumably because Dell only officially supported a small list of GPUs on the R720.
Supported GPU cards are listed on pages 36-37 of the Technical Guide: https://downloads.dell.com/manuals/all-products/esuprt_ser_stor_net/esuprt_poweredge/poweredge-r720_reference-guide_en-us.pdf
Below are the results with Ubuntu and Alpine running on bare metal.
Ubuntu 22.04 (Live OS)
Kernel: 6.2.0-26-generic
linux-firmware: 20220329
Default settings used.
ACPI: bus type drm_connector registered
[drm] Initialized mgag200 1.0.0 20110418 for 0000:0c:00.0 on minor 0
fbcon: mgag200drmfb (fb0) is primary device
mgag200 0000:0c:00.0: [drm] fb0: mgag200drmfb frame buffer device
[drm] amdgpu kernel modesetting enabled.
amdgpu: CRAT table not found
amdgpu: Virtual CRAT table created for CPU
amdgpu: Topology: Add CPU node
[drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x02).
[drm] register mmio base: 0xD4F80000
[drm] register mmio size: 524288
[drm] add ip block number 0 <soc15_common>
[drm] add ip block number 1 <gmc_v9_0>
[drm] add ip block number 2 <vega20_ih>
[drm] add ip block number 3 <psp>
[drm] add ip block number 4 <powerplay>
[drm] add ip block number 5 <dm>
[drm] add ip block number 6 <gfx_v9_0>
[drm] add ip block number 7 <sdma_v4_0>
[drm] add ip block number 8 <uvd_v7_0>
[drm] add ip block number 9 <vce_v4_0>
amdgpu 0000:44:00.0: amdgpu: Fetched VBIOS from ROM BAR
amdgpu: ATOM BIOS: 113-D1631700-111
[drm] UVD(0) is enabled in VM mode
[drm] UVD(1) is enabled in VM mode
[drm] UVD(0) ENC is enabled in VM mode
[drm] UVD(1) ENC is enabled in VM mode
[drm] VCE enabled in VM mode
amdgpu 0000:44:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[drm] GPU posting now...
amdgpu 0000:44:00.0: amdgpu: MEM ECC is active.
amdgpu 0000:44:00.0: amdgpu: SRAM ECC is active.
amdgpu 0000:44:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7fff] ras_mask[7fff]
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
amdgpu 0000:44:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
amdgpu 0000:44:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
amdgpu 0000:44:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
[drm] Detected VRAM RAM=32752M, BAR=32768M
[drm] RAM width 4096bits HBM
[drm] amdgpu: 32752M of VRAM memory ready
[drm] amdgpu: 257959M of GTT memory ready.
[drm] GART: num cpu pages 131072, num gpu pages 131072
[drm] PCIE GART of 512M enabled.
[drm] PTB located at 0x00000087FEF00000
amdgpu 0000:44:00.0: amdgpu: PSP runtime database doesn't exist
amdgpu 0000:44:00.0: amdgpu: PSP runtime database doesn't exist
amdgpu: hwmgr_sw_init smu backed is vega20_smu
[drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
[drm] PSP loading UVD firmware
[drm] Found VCE firmware Version: 57.6 Binary ID: 4
[drm] PSP loading VCE firmware
[drm] reserve 0x400000 from 0x87fe000000 for PSP TMR
amdgpu 0000:44:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
amdgpu 0000:44:00.0: amdgpu: DTM: optional dtm ta ucode is not available
amdgpu 0000:44:00.0: amdgpu: RAP: optional rap ta ucode is not available
amdgpu 0000:44:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[drm] Display Core initialized with v3.2.215!
[drm] kiq ring mec 2 pipe 1 q 0
[drm] UVD and UVD ENC initialized successfully.
[drm] VCE initialized successfully.
kfd kfd: amdgpu: Allocated 3969056 bytes on gart
amdgpu: sdma_bitmap: ffff
amdgpu: HMM registered 32752MB device memory
amdgpu: Virtual CRAT table created for GPU
amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
kfd kfd: amdgpu: added device 1002:66a1
amdgpu 0000:44:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 60
amdgpu 0000:44:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring kiq_2.1.0 uses VM inv eng 13 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 1
amdgpu 0000:44:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 1
amdgpu 0000:44:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 1
amdgpu 0000:44:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 1
amdgpu 0000:44:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 1
amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 1
amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 1
amdgpu 0000:44:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 1
amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 1
amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 1
amdgpu 0000:44:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 1
amdgpu 0000:44:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 1
amdgpu 0000:44:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 1
amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
amdgpu: Detected AMDGPU 2 Perf Events.
[drm] Initialized amdgpu 3.49.0 20150101 for 0000:44:00.0 on minor 1
systemd[1]: Starting Load Kernel Module drm...
systemd[1]: modprobe@drm.service: Deactivated successfully.
systemd[1]: Finished Load Kernel Module drm.
Alpine 3.19 (installed)
Kernel: 6.6.9
linux-firmware: 20231111
ACPI: bus type drm_connector registered
[drm] amdgpu kernel modesetting enabled.
amdgpu: Virtual CRAT table created for CPU
amdgpu: Topology: Add CPU node
[drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x02).
[drm] register mmio base: 0xD4F80000
[drm] register mmio size: 524288
[drm] add ip block number 0 <soc15_common>
[drm] add ip block number 1 <gmc_v9_0>
[drm] add ip block number 2 <vega20_ih>
[drm] add ip block number 3 <psp>
[drm] add ip block number 4 <powerplay>
[drm] add ip block number 5 <dm>
[drm] add ip block number 6 <gfx_v9_0>
[drm] add ip block number 7 <sdma_v4_0>
[drm] add ip block number 8 <uvd_v7_0>
[drm] add ip block number 9 <vce_v4_0>
amdgpu 0000:44:00.0: amdgpu: Fetched VBIOS from ROM BAR
amdgpu: ATOM BIOS: 113-D1631700-111
[drm] UVD(0) is enabled in VM mode
[drm] UVD(1) is enabled in VM mode
[drm] UVD(0) ENC is enabled in VM mode
[drm] UVD(1) ENC is enabled in VM mode
[drm] VCE enabled in VM mode
amdgpu 0000:44:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[drm] GPU posting now...
amdgpu 0000:44:00.0: amdgpu: MEM ECC is active.
amdgpu 0000:44:00.0: amdgpu: SRAM ECC is active.
amdgpu 0000:44:00.0: amdgpu: RAS INFO: ras initialized successfully, hardware ability[7f7f] ras_mask[7f7f]
[drm] vm size is 262144 GB, 4 levels, block size is 9-bit, fragment size is 9-bit
amdgpu 0000:44:00.0: amdgpu: VRAM: 32752M 0x0000008000000000 - 0x00000087FEFFFFFF (32752M used)
amdgpu 0000:44:00.0: amdgpu: GART: 512M 0x0000000000000000 - 0x000000001FFFFFFF
amdgpu 0000:44:00.0: amdgpu: AGP: 267878400M 0x0000008800000000 - 0x0000FFFFFFFFFFFF
[drm] Detected VRAM RAM=32752M, BAR=32768M
[drm] RAM width 4096bits HBM
[drm] amdgpu: 32752M of VRAM memory ready
[drm] amdgpu: 257967M of GTT memory ready.
[drm] GART: num cpu pages 131072, num gpu pages 131072
[drm] PCIE GART of 512M enabled.
[drm] PTB located at 0x00000087FEF00000
amdgpu: hwmgr_sw_init smu backed is vega20_smu
[drm] Found UVD firmware ENC: 1.2 DEC: .43 Family ID: 19
[drm] PSP loading UVD firmware
[drm] Found VCE firmware Version: 57.6 Binary ID: 4
[drm] PSP loading VCE firmware
[drm] reserve 0x400000 from 0x87fe000000 for PSP TMR
amdgpu 0000:44:00.0: amdgpu: HDCP: optional hdcp ta ucode is not available
amdgpu 0000:44:00.0: amdgpu: DTM: optional dtm ta ucode is not available
amdgpu 0000:44:00.0: amdgpu: RAP: optional rap ta ucode is not available
amdgpu 0000:44:00.0: amdgpu: SECUREDISPLAY: securedisplay ta ucode is not available
[drm] Display Core v3.2.247 initialized on DCE 12.1
[drm] kiq ring mec 2 pipe 1 q 0
[drm] UVD and UVD ENC initialized successfully.
[drm] VCE initialized successfully.
kfd kfd: amdgpu: Allocated 3969056 bytes on gart
kfd kfd: amdgpu: Total number of KFD nodes to be created: 1
amdgpu: Virtual CRAT table created for GPU
amdgpu: Topology: Add dGPU node [0x66a1:0x1002]
kfd kfd: amdgpu: added device 1002:66a1
amdgpu 0000:44:00.0: amdgpu: SE 4, SH per SE 1, CU per SH 16, active_cu_number 60
amdgpu 0000:44:00.0: amdgpu: ring gfx uses VM inv eng 0 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring gfx_low uses VM inv eng 1 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring gfx_high uses VM inv eng 4 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.0 uses VM inv eng 5 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.0 uses VM inv eng 6 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.0 uses VM inv eng 7 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.0 uses VM inv eng 8 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.0.1 uses VM inv eng 9 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.1.1 uses VM inv eng 10 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.2.1 uses VM inv eng 11 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring comp_1.3.1 uses VM inv eng 12 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring kiq_0.2.1.0 uses VM inv eng 13 on hub 0
amdgpu 0000:44:00.0: amdgpu: ring sdma0 uses VM inv eng 0 on hub 8
amdgpu 0000:44:00.0: amdgpu: ring page0 uses VM inv eng 1 on hub 8
amdgpu 0000:44:00.0: amdgpu: ring sdma1 uses VM inv eng 4 on hub 8
amdgpu 0000:44:00.0: amdgpu: ring page1 uses VM inv eng 5 on hub 8
amdgpu 0000:44:00.0: amdgpu: ring uvd_0 uses VM inv eng 6 on hub 8
amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_0.0 uses VM inv eng 7 on hub 8
amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_0.1 uses VM inv eng 8 on hub 8
amdgpu 0000:44:00.0: amdgpu: ring uvd_1 uses VM inv eng 9 on hub 8
amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_1.0 uses VM inv eng 10 on hub 8
amdgpu 0000:44:00.0: amdgpu: ring uvd_enc_1.1 uses VM inv eng 11 on hub 8
amdgpu 0000:44:00.0: amdgpu: ring vce0 uses VM inv eng 12 on hub 8
amdgpu 0000:44:00.0: amdgpu: ring vce1 uses VM inv eng 13 on hub 8
amdgpu 0000:44:00.0: amdgpu: ring vce2 uses VM inv eng 14 on hub 8
amdgpu: Detected AMDGPU DF Counters. # of Counters = 8.
amdgpu: Detected AMDGPU 2 Perf Events.
[drm] Initialized amdgpu 3.54.0 20150101 for 0000:44:00.0 on minor 0
[drm] Initialized mgag200 1.0.0 20110418 for 0000:0c:00.0 on minor 1
fbcon: mgag200drmfb (fb0) is primary device
mgag200 0000:0c:00.0: [drm] fb0: mgag200drmfb frame buffer device
According to lspci -nnv, the correct driver is assigned:
44:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro VII/Radeon Instinct MI50 32GB] [1002:66a1] (rev 02)
IOMMU group: 16
Kernel driver in use: amdgpu
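For anyone who wants to script this check rather than read the output by eye, here is a minimal Python sketch that pulls the vendor:device ID, IOMMU group and bound driver out of one `lspci -nnv` stanza. The function name `parse_lspci_device` and the `SAMPLE` string are made up for illustration; the sample text is copied from the output above.

```python
import re


def parse_lspci_device(text):
    """Extract vendor:device ID, IOMMU group and driver from one lspci -nnv stanza."""
    info = {"id": None, "iommu_group": None, "driver": None}

    # On the first line, the bracketed xxxx:xxxx pair is the vendor:device ID.
    # Class codes such as [0380] have no colon, so they are not matched.
    first_line = text.strip().splitlines()[0]
    ids = re.findall(r"\[([0-9a-f]{4}:[0-9a-f]{4})\]", first_line)
    if ids:
        info["id"] = ids[-1]

    match = re.search(r"IOMMU group:\s*(\d+)", text)
    if match:
        info["iommu_group"] = int(match.group(1))

    match = re.search(r"Kernel driver in use:\s*(\S+)", text)
    if match:
        info["driver"] = match.group(1)

    return info


# Sample stanza copied from the bare-metal lspci output in this post.
SAMPLE = """44:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro VII/Radeon Instinct MI50 32GB] [1002:66a1] (rev 02)
IOMMU group: 16
Kernel driver in use: amdgpu
"""

print(parse_lspci_device(SAMPLE))
```

The same function can be pointed at the output captured after enabling VFIO, where the driver field should read vfio-pci instead of amdgpu.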
KVM/QEMU
I set up Alpine 3.19 as the host with PCI passthrough using VFIO, following the instructions at https://wiki.alpinelinux.org/wiki/KVM
After rebooting the host, the MI50 has the vfio-pci driver attached.
Alpine dmesg:
host $ sudo dmesg
...
modules=sd-mod,usb-storage,ext4,vfio,vfio-pci,vfio_iommu_type1,vfio_virqfd
...
ACPI: bus type drm_connector registered
vfio_pci: add [1002:66a1[ffffffff:ffffffff]] class 0x000000/00000000
[drm] amdgpu kernel modesetting enabled.
amdgpu: Virtual CRAT table created for CPU
amdgpu: Topology: Add CPU node
[drm] Initialized mgag200 1.0.0 20110418 for 0000:0c:00.0 on minor 0
fbcon: mgag200drmfb (fb0) is primary device
mgag200 0000:0c:00.0: [drm] fb0: mgag200drmfb frame buffer device
host $ lspci -nnv
44:00.0 Display controller [0380]: Advanced Micro Devices, Inc. [AMD/ATI] Vega 20 [Radeon Pro VII/Radeon Instinct MI50 32GB] [1002:66a1] (rev 02)
..
Kernel driver in use: vfio-pci
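The binding can also be read straight from sysfs, which makes it easy to see what else shares the card's IOMMU group. As far as I understand VFIO, every endpoint in a group has to be bound to vfio-pci (or left unbound) before the group can be given to a guest, so the peer list is worth checking. The sketch below is only an illustration: `pci_binding` is a made-up helper name, and the `sysfs_root` parameter exists purely so the function can be tried against a fake directory tree.

```python
import os


def pci_binding(bdf, sysfs_root="/sys"):
    """Return (driver_name, devices_in_same_iommu_group) for a PCI address.

    bdf is the full address, e.g. "0000:44:00.0".
    driver_name is None if no driver is bound to the device.
    """
    dev = os.path.join(sysfs_root, "bus", "pci", "devices", bdf)

    # 'driver' is a symlink to the bound driver's directory; it is absent
    # when the device has no driver attached.
    driver_link = os.path.join(dev, "driver")
    if os.path.islink(driver_link):
        driver = os.path.basename(os.path.realpath(driver_link))
    else:
        driver = None

    # 'iommu_group/devices' lists every device that shares the group.
    group_dir = os.path.join(dev, "iommu_group", "devices")
    peers = sorted(os.listdir(group_dir)) if os.path.isdir(group_dir) else []

    return driver, peers


if __name__ == "__main__":
    # Address taken from the host lspci output in this post.
    print(pci_binding("0000:44:00.0"))
```

On a host set up as described here, the driver would be expected to come back as vfio-pci. On a machine without that device the call simply returns `(None, [])`.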
Using virt-manager on the host, I set up Ubuntu 22.04 as the guest OS with a Q35 chipset and UEFI firmware.
I also added the MI50 PCI card during the hardware setup stage so the guest OS can access it.
After installing the OS and rebooting the Ubuntu VM, I get the following error when amdgpu tries to initialise the card.
vm $ dmesg
...
systemd[1]: Starting Load Kernel Module drm...
ACPI: bus type drm_connector registered
systemd[1]: modprobe@drm.service: Deactivated successfully.
systemd[1]: Finished Load Kernel Module drm.
[drm] Device Version 0.0
[drm] Compression level 0 log level 0
[drm] 12286 io pages at offset 0x1000000
[drm] 16777216 byte draw area at offset 0x0
[drm] RAM header offset: 0x3ffe000
[drm] qxl: 16M of VRAM memory size
[drm] qxl: 63M of IO pages memory ready (VRAM domain)
[drm] qxl: 64M of Surface memory size
[drm] slot 0 (main): base 0xc4000000, size 0x03ffe000
[drm] slot 1 (surfaces): base 0xc0000000, size 0x04000000
[drm] Initialized qxl 0.1.0 20120117 for 0000:00:01.0 on minor 0
fbcon: qxldrmfb (fb0) is primary device
qxl 0000:00:01.0: [drm] fb0: qxldrmfb frame buffer device
[drm] amdgpu kernel modesetting enabled.
amdgpu: CRAT table not found
amdgpu: Virtual CRAT table created for CPU
amdgpu: Topology: Add CPU node
[drm] initializing kernel modesetting (VEGA20 0x1002:0x66A1 0x1002:0x0834 0x02).
[drm] register mmio base: 0xC9200000
[drm] register mmio size: 524288
[drm] add ip block number 0 <soc15_common>
[drm] add ip block number 1 <gmc_v9_0>
[drm] add ip block number 2 <vega20_ih>
[drm] add ip block number 3 <psp>
[drm] add ip block number 4 <powerplay>
[drm] add ip block number 5 <dm>
[drm] add ip block number 6 <gfx_v9_0>
[drm] add ip block number 7 <sdma_v4_0>
[drm] add ip block number 8 <uvd_v7_0>
[drm] add ip block number 9 <vce_v4_0>
amdgpu 0000:05:00.0: amdgpu: Fetched VBIOS from ROM BAR
amdgpu: ATOM BIOS: 113-D1631700-111
[drm] UVD(0) is enabled in VM mode
[drm] UVD(1) is enabled in VM mode
[drm] UVD(0) ENC is enabled in VM mode
[drm] UVD(1) ENC is enabled in VM mode
[drm] VCE enabled in VM mode
amdgpu 0000:05:00.0: amdgpu: Trusted Memory Zone (TMZ) feature not supported
[drm] GPU posting now...
[drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
[drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing 4EC8 (len 74, WS 0, PS 8) @ 0x4EE0
amdgpu 0000:05:00.0: amdgpu: gpu post error!
amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
amdgpu: probe of 0000:05:00.0 failed with error -22
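To pull only the failure lines out of a long guest dmesg, something like the following Python sketch could be used. It is an illustration only: `amdgpu_errors` and `ERROR_PATTERNS` are made-up names, and the patterns are just the strings visible in the log above, not an exhaustive list of amdgpu error messages.

```python
import re

# Strings taken from the guest log in this post; not a complete list.
ERROR_PATTERNS = [
    re.compile(r"\*ERROR\*"),
    re.compile(r"Fatal error", re.IGNORECASE),
    re.compile(r"gpu post error", re.IGNORECASE),
    re.compile(r"probe of \S+ failed with error -?\d+"),
]


def amdgpu_errors(log_text):
    """Return the amdgpu-related lines of a dmesg dump that look like failures."""
    hits = []
    for line in log_text.splitlines():
        if "amdgpu" not in line:
            continue  # ignore other drivers (qxl, mgag200, ...)
        if any(pattern.search(line) for pattern in ERROR_PATTERNS):
            hits.append(line.strip())
    return hits


# Tail of the guest dmesg, copied from this post.
SAMPLE = """[drm] GPU posting now...
[drm:atom_op_jump [amdgpu]] *ERROR* atombios stuck in loop for more than 20secs aborting
[drm:amdgpu_atom_execute_table_locked [amdgpu]] *ERROR* atombios stuck executing 4EC8 (len 74, WS 0, PS 8) @ 0x4EE0
amdgpu 0000:05:00.0: amdgpu: gpu post error!
amdgpu 0000:05:00.0: amdgpu: Fatal error during GPU init
amdgpu 0000:05:00.0: amdgpu: amdgpu: finishing device.
amdgpu: probe of 0000:05:00.0 failed with error -22
"""

for line in amdgpu_errors(SAMPLE):
    print(line)
```

Run against the bare-metal logs earlier in this post, the same filter should return nothing, which makes the VM-only failure at the "GPU posting now..." step easy to spot.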
The Ubuntu guest has the necessary linux-firmware installed.
I've searched extensively for this problem, but mostly found dead ends or problems that were not the same.
A few other people have reported the same problem, but there are no definitive answers:
Help with RadeonVII error "atombios stuck in a loop" (not a ROCm issue)
ROCm/ROCm#1320
What does this mean? (MI60)
https://community.amd.com/t5/graphics-cards/what-does-this-mean/td-p/599894
I would appreciate any help solving this problem so I can actually use ROCm (https://github.com/ROCm/ROCm), as I'm stumped with no leads.
I'm not sure whether it's a Dell server issue, whether the MI50 doesn't like being in a VM (...like Nvidia, who charge extra fees for the privilege), or whether I've not set up KVM/QEMU correctly.