[6.14-adv-next] Add Grace virtualization support to 6.14-adv, (upstream vEVENTQ + HW QUEUE and OOT vEGM) #167

nirmoy · 2025-07-22T13:24:35Z

This PR backports (cherry-picks) the patches for upstream vEVENTQ + HW QUEUE and for out-of-tree (OOT) vEGM.

Testing sources:

QEMU src: https://github.com/nvmochs/QEMU/tree/6.11_gracevirt_vcmdq_v9
VM image: https://urm.nvidia.com/artifactory/sw-dgx-platform-generic-local/staging/ghvirt/guest/jammy-server-cloudimg-arm64_may022024_public_r550.54.15_cuda12.4.qcow2.xz
CUDA Test: https://dvstransfer.nvidia.com/dvsshare/dvs-binaries-virtual/gpu_drv_r575_00_Release_Linux_aarch64sbsa_CUDA_DVS_Test/

VM start command for EGM testing

VM_IMAGE=/localhome/local-nirmoyd/ubuntu-24.04-server-cloudimg-arm64-grace-6.8.0-1009-nvidia-adv-2025-02-07-08-57-55.qcow2
/usr/local/bin/qemu-system-aarch64 -object iommufd,id=iommufd0 \
    -machine hmat=on -machine virt,accel=kvm,gic-version=3,iommu=nested-smmuv3,ras=on \
    -cpu host -smp cpus=4 -m size=16G,slots=2,maxmem=66G -nographic \
    -object memory-backend-file,id=m0,mem-path=/dev/egm4,size=16G,share=on,prealloc=on \
    -numa node,memdev=m0,cpus=0-3,nodeid=0 \
    -numa node,nodeid=1 -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 \
    -numa node,nodeid=5 -numa node,nodeid=6 -numa node,nodeid=7 -numa node,nodeid=8 \
    -device vfio-pci-nohotplug,host=0009:01:00.0,rombar=0,id=dev0,iommufd=iommufd0 \
    -object acpi-egm-memory,id=egm0,pci-dev=dev0,node=0 \
    -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=1 \
    -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=2 \
    -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=3 \
    -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=4 \
    -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=5 \
    -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=6 \
    -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=7 \
    -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=8 \
    -bios /usr/share/AAVMF/AAVMF_CODE.fd \
    -device nvme,drive=nvme0,serial=deadbeaf1,bus=pcie.0 \
    -drive file=$VM_IMAGE,index=0,media=disk,format=qcow2,if=none,id=nvme0 \
    -device e1000,romfile=/usr/local/share/qemu/efi-e1000.rom,netdev=net0,bus=pcie.0 \
    -netdev user,id=net0,hostfwd=tcp::5558-:22,hostfwd=tcp::5586-:5586
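The eight memory-less NUMA nodes and their `acpi-generic-initiator` objects in the command above follow a regular pattern, so they can be generated rather than typed by hand. The following is a minimal sketch; the helper name `generic_initiator_args` is hypothetical and only the option strings themselves are taken from the command line above.

```python
from typing import List, Tuple


def generic_initiator_args(pci_dev: str, first_node: int,
                           count: int = 8) -> Tuple[List[str], List[str]]:
    """Return (numa_args, initiator_args) for `count` generic-initiator nodes.

    Hypothetical helper for illustration. The two lists are returned
    separately so the caller can keep the same ordering as the command
    above (all -numa options first, the -object options after the device).
    """
    numa_args: List[str] = []
    initiator_args: List[str] = []
    for i in range(count):
        node = first_node + i
        # One memory-less NUMA node per generic initiator.
        numa_args += ["-numa", f"node,nodeid={node}"]
        initiator_args += [
            "-object",
            f"acpi-generic-initiator,id=gi{i},pci-dev={pci_dev},node={node}",
        ]
    return numa_args, initiator_args


if __name__ == "__main__":
    # EGM layout above: nodes 1-8 are bound to dev0.
    numa, initiators = generic_initiator_args("dev0", first_node=1)
    print(" ".join(numa))
    print(" ".join(initiators))
```

For the vEVENTQ layout further down, the same sketch would be called with `first_node=2`, since nodes 0 and 1 carry memory there.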

Test runs for EGM enabled VM

nvidia@ubuntu:~$ lspci -k -d 10de:2348
b0:00.0 3D controller: NVIDIA Corporation Device 2348 (rev a1)
        Subsystem: NVIDIA Corporation Device 18d2
        Kernel driver in use: nvidia
        Kernel modules: nvidiafb, nouveau, nvidia_drm, nvidia
sudo nvidia-smi -q |grep -i egm
        EGM                               : enabled 
nvidia@ubuntu:~$ sudo ./tests/runtime/gflops/gflops
Running GFLOPs test...
&&&& PERF GFLOPs 0
&&&& gflops test PASSED
nvidia@ubuntu:~$ sudo ./tests/driver/egm/egm
Device 0: NVIDIA GH200 144G HBM3e
Driver version: 12090
Runtime version: 12090
Dispatcher pid: 1366
Running test SmokeTest (pid: 1408)
^^^^ PASS: SmokeTest (404.7ms)
Running test SmokeTestIpc (pid: 1411)
(thread 260537756117184 [t0]) At /dvs/p4/build/sw/rel/gpgpu/toolkit/r12.9/cuda/apps/egm/test/smoke.cpp:392:
SmokeTestIpc is NOT supported in a configuration with only one process

^^^^ WAIVE: SmokeTestIpc (298.9ms)
Running test atomictest (pid: 1414)
^^^^ PASS: atomictest (329.2ms)
Total time: 1033ms
2 out of 2 ENABLED tests passed (100%)
    1 ENABLED tests were waived
&&&& egm test PASSED
sudo tests/runtime/uvmConformance/uvmConformance -t texture_simple
Device 0: NVIDIA GH200 144G HBM3e
Driver version: 12090
Runtime version: 12090
Dispatcher pid: 1135
Running test texture_simple (pid: 1177)
^^^^ PASS: texture_simple (350.1ms)
Total time: 350ms
1 out of 1 ENABLED tests passed (100%)
&&&& uvmConformance test PASSED
sudo tests/runtime/uvmConformance/uvmConformance -t ats_malloc_host
Device 0: NVIDIA GH200 144G HBM3e
Driver version: 12090
Runtime version: 12090
Dispatcher pid: 1233
Running test ats_malloc_host (pid: 1275)
^^^^ PASS: ats_malloc_host (351.2ms)
Total time: 351ms
1 out of 1 ENABLED tests passed (100%)
&&&& uvmConformance test PASSED

vEVENTQ validation

VM start command for vEVENTQ testing with `cmdqv=on`

VM_IMAGE=/localhome/local-nirmoyd/ubuntu-24.04-server-cloudimg-arm64-grace-6.8.0-1009-nvidia-adv-2025-02-07-08-57-55.qcow2
qemu-system-aarch64 \
       -object iommufd,id=iommufd0 \
       -machine hmat=on -machine virt,accel=kvm,gic-version=3,iommu=nested-smmuv3,cmdqv=on,ras=on \
       -cpu host -smp cpus=4 -m size=16G,slots=2,maxmem=64G -nographic \
       -object memory-backend-file,size=8G,id=m0,mem-path=/hugepages/,prealloc=on,share=off \
       -object memory-backend-file,size=8G,id=m1,mem-path=/hugepages/,prealloc=on,share=off \
       -numa node,memdev=m0,cpus=0-3,nodeid=0 -numa node,memdev=m1,nodeid=1 \
       -numa node,nodeid=2 -numa node,nodeid=3 -numa node,nodeid=4 -numa node,nodeid=5 \
       -numa node,nodeid=6 -numa node,nodeid=7 -numa node,nodeid=8 -numa node,nodeid=9 \
       -device vfio-pci-nohotplug,host=0009:01:00.0,rombar=0,id=dev0,iommufd=iommufd0 \
       -object acpi-generic-initiator,id=gi0,pci-dev=dev0,node=2 \
       -object acpi-generic-initiator,id=gi1,pci-dev=dev0,node=3 \
       -object acpi-generic-initiator,id=gi2,pci-dev=dev0,node=4 \
       -object acpi-generic-initiator,id=gi3,pci-dev=dev0,node=5 \
       -object acpi-generic-initiator,id=gi4,pci-dev=dev0,node=6 \
       -object acpi-generic-initiator,id=gi5,pci-dev=dev0,node=7 \
       -object acpi-generic-initiator,id=gi6,pci-dev=dev0,node=8 \
       -object acpi-generic-initiator,id=gi7,pci-dev=dev0,node=9 \
       -device vfio-pci-nohotplug,host=0010:01:00.0,rombar=0,id=dev1,iommufd=iommufd0 \
       -bios /usr/share/AAVMF/AAVMF_CODE.fd \
       -device nvme,drive=nvme0,serial=deadbeaf1,bus=pcie.0 \
       -drive file=$VM_IMAGE,index=0,media=disk,format=qcow2,if=none,id=nvme0 \
       -device e1000,romfile=/usr/local/share/qemu/efi-e1000.rom,netdev=net0,bus=pcie.0 \
       -netdev user,id=net0,hostfwd=tcp::5558-:22,hostfwd=tcp::5586-:5586

Test runs for vEVENTQ enabled VM

nvidia@ubuntu:~$ sudo dmesg | grep "Default domain type"
[    0.274182] iommu: Default domain type: Translated
nvidia@ubuntu:~$ sudo journalctl -b|grep vcmdq -i|head -n1
Jul 22 15:32:54 ubuntu kernel: arm-smmu-v3 arm-smmu-v3.0.auto: allocated 524288 entries for vcmdq0
sudo ./tests/runtime/gflops/gflops
Running GFLOPs test...
&&&& PERF GFLOPs 0
&&&& gflops test PASSED
sudo tests/runtime/uvmConformance/uvmConformance -t texture_simple
Device 0: NVIDIA GH200 144G HBM3e
Driver version: 12090
Runtime version: 12090
Dispatcher pid: 1443
Running test texture_simple (pid: 1485)
^^^^ PASS: texture_simple (339.6ms)
Total time: 340ms
1 out of 1 ENABLED tests passed (100%)
&&&& uvmConformance test PASSED
nvidia@ubuntu:~$ sudo tests/runtime/uvmConformance/uvmConformance -t texture_simple
Device 0: NVIDIA GH200 144G HBM3e
Driver version: 12090
Runtime version: 12090
Dispatcher pid: 1537
Running test texture_simple (pid: 1579)
^^^^ PASS: texture_simple (339.6ms)
Total time: 340ms
1 out of 1 ENABLED tests passed (100%)
&&&& uvmConformance test PASSED
sudo tests/runtime/uvmConformance/uvmConformance -t ats_malloc_host
Device 0: NVIDIA GH200 144G HBM3e
Driver version: 12090
Runtime version: 12090
Dispatcher pid: 1038
Running test ats_malloc_host (pid: 1080)
^^^^ PASS: ats_malloc_host (349.2ms)
Total time: 349ms
1 out of 1 ENABLED tests passed (100%)
&&&& uvmConformance test PASSED

The module code does not create a writable copy of the executable memory anymore so there is no need to handle it in module relocation and alternatives patching. This reverts commit 9bfc482. Signed-off-by: "Mike Rapoport (Microsoft)" <[email protected]> Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Link: https://lore.kernel.org/r/[email protected] (cherry picked from commit 1d7e707) Signed-off-by: Nirmoy Das <[email protected]>

Pretty much every caller of is_endbr() actually wants to test something at an address and ends up doing get_kernel_nofault(). Fold the lot into a more convenient helper. Signed-off-by: Peter Zijlstra (Intel) <[email protected]> Reviewed-by: Sami Tolvanen <[email protected]> Acked-by: Alexei Starovoitov <[email protected]> Acked-by: Andrii Nakryiko <[email protected]> Acked-by: "Masami Hiramatsu (Google)" <[email protected]> Link: https://lore.kernel.org/r/[email protected] (cherry picked from commit 72e213a) Signed-off-by: Nirmoy Das <[email protected]>

…mapping_domain" This reverts commit 78480b2. Signed-off-by: Nirmoy Das <[email protected]>

…m ids

ASPEED VGA card has two built-in devices:

    0008:06:00.0 PCI bridge: ASPEED Technology, Inc. AST1150 PCI-to-PCI Bridge (rev 06)
    0008:07:00.0 VGA compatible controller: ASPEED Technology, Inc. ASPEED Graphics Family (rev 52)

Its topology looks like this:

    +-[0008:00]---00.0-[01-09]--+-00.0-[02-09]--+-00.0-[03]----00.0  Sandisk Corp Device 5017
                                |               +-01.0-[04]--
                                |               +-02.0-[05]----00.0  NVIDIA Corporation Device
                                |               +-03.0-[06-07]----00.0-[07]----00.0  ASPEED Technology, Inc. ASPEED Graphics Family
                                |               +-04.0-[08]----00.0  Renesas Technology Corp. uPD720201 USB 3.0 Host Controller
                                |               \-05.0-[09]----00.0  Realtek Semiconductor Co., Ltd. RTL8111/8168/8411 PCI Express Gigabit Ethernet Controller
                                \-00.1  PMC-Sierra Inc. Device 4028

The IORT logic populates two identical IDs into the fwspec->ids array via DMA aliasing in iort_pci_iommu_init() called by pci_for_each_dma_alias().

Though the SMMU driver had been able to handle this situation since commit 563b5cb ("iommu/arm-smmu-v3: Cope with duplicated Stream IDs"), that got broken by the later commit cdf315f ("iommu/arm-smmu-v3: Maintain a SID->device structure"), which ended up allocating separate streams with the same stuffing.

On a kernel prior to v6.15-rc1, there has been an overlooked warning:

    pci 0008:07:00.0: vgaarb: setting as boot VGA device
    pci 0008:07:00.0: vgaarb: bridge control possible
    pci 0008:07:00.0: vgaarb: VGA device added: decodes=io+mem,owns=none,locks=none
    pcieport 0008:06:00.0: Adding to iommu group 14
    ast 0008:07:00.0: stream 67328 already in tree <===== WARNING
    ast 0008:07:00.0: enabling device (0002 -> 0003)
    ast 0008:07:00.0: Using default configuration
    ast 0008:07:00.0: AST 2600 detected
    ast 0008:07:00.0: [drm] Using analog VGA
    ast 0008:07:00.0: [drm] dram MCLK=396 Mhz type=1 bus_width=16
    [drm] Initialized ast 0.1.0 for 0008:07:00.0 on minor 0
    ast 0008:07:00.0: [drm] fb0: astdrmfb frame buffer device

With v6.15-rc, since the commit bcb81ac ("iommu: Get DT/ACPI parsing into the proper probe path"), the error returned with the warning is moved to the SMMU device probe flow:

    arm_smmu_probe_device+0x15c/0x4c0
    __iommu_probe_device+0x150/0x4f8
    probe_iommu_group+0x44/0x80
    bus_for_each_dev+0x7c/0x100
    bus_iommu_probe+0x48/0x1a8
    iommu_device_register+0xb8/0x178
    arm_smmu_device_probe+0x1350/0x1db0

which then fails the entire SMMU driver probe:

    pci 0008:06:00.0: Adding to iommu group 21
    pci 0008:07:00.0: stream 67328 already in tree
    arm-smmu-v3 arm-smmu-v3.9.auto: Failed to register iommu
    arm-smmu-v3 arm-smmu-v3.9.auto: probe with driver arm-smmu-v3 failed with error -22

Since the SMMU driver already expects a potentially duplicated Stream ID in arm_smmu_install_ste_for_dev(), change the arm_smmu_insert_master() routine to ignore a duplicated ID from the fwspec->sids array as well.

Note: this has been failing the iommu_device_probe() since 2021, although it is a recent iommu commit in v6.15-rc1, which moves iommu_device_probe(), that started to fail the SMMU driver probe. Since nobody has cared about DMA Alias support, leave that as it was but fix the fundamental iommu_device_probe() breakage.

Fixes: cdf315f ("iommu/arm-smmu-v3: Maintain a SID->device structure") Cc: [email protected] Suggested-by: Jason Gunthorpe <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Signed-off-by: Nicolin Chen <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit b00d249 linux-next) Signed-off-by: Nirmoy Das <[email protected]>
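The behaviour described in this commit message can be modelled in a few lines. This is a userspace sketch only, not the driver code: the dictionary stands in for the SMMU's SID tree, and both the rollback and the rejection of an ID already owned by a *different* device are assumptions made for illustration.

```python
class StreamConflict(Exception):
    """Raised when a Stream ID is already owned by a different device."""


def insert_master(streams: dict, master: str, sids: list) -> None:
    """Register every Stream ID of `master` in the shared `streams` map.

    Illustrative model: `streams` maps SID -> owning device name.
    """
    added = []
    for sid in sids:
        owner = streams.get(sid)
        if owner == master:
            # Duplicate produced by DMA aliasing on the same device: ignore.
            continue
        if owner is not None:
            # Assumed policy: undo partial work, then report the conflict.
            for s in added:
                del streams[s]
            raise StreamConflict(f"stream {sid} already owned by {owner}")
        streams[sid] = master
        added.append(sid)


if __name__ == "__main__":
    streams = {}
    # The ASPEED VGA device ends up with the same ID twice in its fwspec.
    insert_master(streams, "0008:07:00.0", [67328, 67328])
    print(streams)
```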

There are new attach/detach/replace helpers in device.c taking care of both the attach_handle and the fault specific routines for iopf_enable/disable() and auto response. Clean up these redundant functions in the fault.c file. Link: https://patch.msgid.link/r/3ca94625e9d78270d9a715fa0809414fddd57e58.1738645017.git.nicolinc@nvidia.com Signed-off-by: Nicolin Chen <[email protected]> Reviewed-by: Yi Liu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit dc10ba2) Signed-off-by: Nirmoy Das <[email protected]>

This reverts commit 8aced5e. Signed-off-by: Nirmoy Das <[email protected]>

…u_cookie

The IOMMU translation for MSI message addresses has been a 2-step process, separated in time:

1) iommu_dma_prepare_msi(): A cookie pointer containing the IOVA address is stored in the MSI descriptor when an MSI interrupt is allocated.

2) iommu_dma_compose_msi_msg(): this cookie pointer is used to compute a translated message address.

This has an inherent lifetime problem for the pointer stored in the cookie that must remain valid between the two steps. However, there is no locking at the irq layer that helps protect the lifetime. Today, this works under the assumption that the iommu domain is not changed while MSI interrupts are being programmed. This is true for normal DMA API users within the kernel, as the iommu domain is attached before the driver is probed and cannot be changed while a driver is attached.

Classic VFIO type1 also prevented changing the iommu domain while VFIO was running as it does not support changing the "container" after starting up.

However, iommufd has improved this so that the iommu domain can be changed during VFIO operation. This potentially allows userspace to directly race VFIO_DEVICE_ATTACH_IOMMUFD_PT (which calls iommu_attach_group()) and VFIO_DEVICE_SET_IRQS (which calls into iommu_dma_compose_msi_msg()).

This potentially causes both the cookie pointer and the unlocked call to iommu_get_domain_for_dev() on the MSI translation path to become UAFs.

Fix the MSI cookie UAF by removing the cookie pointer. The translated IOVA address is already known during iommu_dma_prepare_msi() and cannot change. Thus, it can simply be stored as an integer in the MSI descriptor.

The other UAF related to iommu_get_domain_for_dev() will be addressed in patch "iommu: Make iommu_dma_prepare_msi() into a generic operation" by using the IOMMU group mutex.

Link: https://patch.msgid.link/r/a4f2cd76b9dc1833ee6c1cf325cba57def22231c.1740014950.git.nicolinc@nvidia.com Signed-off-by: Nicolin Chen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 1f7df3a) Signed-off-by: Nirmoy Das <[email protected]>

The two-step process to translate the MSI address involves two functions, iommu_dma_prepare_msi() and iommu_dma_compose_msi_msg(). Previously iommu_dma_compose_msi_msg() needed to be in the iommu layer as it had to dereference the opaque cookie pointer. Now, the previous patch changed the cookie pointer into an integer so there is no longer any need for the iommu layer to be involved. Further, the call sites of iommu_dma_compose_msi_msg() all follow the same pattern of setting an MSI message address_hi/lo to non-translated and then immediately calling iommu_dma_compose_msi_msg(). Refactor iommu_dma_compose_msi_msg() into msi_msg_set_addr() that directly accepts the u64 version of the address and simplifies all the callers. Move the new helper to linux/msi.h since it has nothing to do with iommu. Aside from refactoring, this logically prepares for the next patch, which allows multiple implementation options for iommu_dma_prepare_msi(). So, it does not make sense to have the iommu_dma_compose_msi_msg() in dma-iommu.c as it no longer provides the only iommu_dma_prepare_msi() implementation. Link: https://patch.msgid.link/r/eda62a9bafa825e9cdabd7ddc61ad5a21c32af24.1740014950.git.nicolinc@nvidia.com Signed-off-by: Nicolin Chen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 9349887) Signed-off-by: Nirmoy Das <[email protected]>
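To make the refactor concrete, here is a small model of what a helper like msi_msg_set_addr() does once the IOVA is a plain integer in the descriptor. It is a sketch under stated assumptions: the field names `iommu_msi_iova` and `iommu_msi_shift`, and the "page number plus shift" encoding, are illustrative choices rather than something the commit messages above spell out.

```python
from dataclasses import dataclass

MASK32 = 0xFFFFFFFF


@dataclass
class MsiDesc:
    # Assumed encoding: IOVA stored as a page number plus the page shift.
    # A shift of 0 means no IOMMU translation was recorded.
    iommu_msi_iova: int = 0
    iommu_msi_shift: int = 0


@dataclass
class MsiMsg:
    address_hi: int = 0
    address_lo: int = 0


def msi_msg_set_addr(desc: MsiDesc, msg: MsiMsg, msi_addr: int) -> None:
    """Fill address_hi/lo from a 64-bit address, translated if an IOVA is set."""
    if desc.iommu_msi_shift:
        iova = desc.iommu_msi_iova << desc.iommu_msi_shift
        # Keep the doorbell's offset within the page.
        offset = msi_addr & ((1 << desc.iommu_msi_shift) - 1)
        msg.address_hi = (iova >> 32) & MASK32
        msg.address_lo = (iova & MASK32) | offset
    else:
        msg.address_hi = (msi_addr >> 32) & MASK32
        msg.address_lo = msi_addr & MASK32


if __name__ == "__main__":
    msg = MsiMsg()
    msi_msg_set_addr(MsiDesc(iommu_msi_iova=0x8000, iommu_msi_shift=12),
                     msg, 0x123456040)
    print(hex(msg.address_hi), hex(msg.address_lo))
```

Because the descriptor holds only integers, nothing in this path dereferences a pointer whose lifetime depends on the iommu domain, which is the point of the preceding patch.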

SW_MSI supports IOMMU to translate an MSI message before the MSI message is delivered to the interrupt controller. On such systems, an iommu_domain must have a translation for the MSI message for interrupts to work. The IRQ subsystem will call into IOMMU to request that a physical page be set up to receive MSI messages, and the IOMMU then sets an IOVA that maps to that physical page. Ultimately the IOVA is programmed into the device via the msi_msg. Generalize this by allowing iommu_domain owners to provide implementations of this mapping. Add a function pointer in struct iommu_domain to allow a domain owner to provide its own implementation. Have dma-iommu supply its implementation for IOMMU_DOMAIN_DMA types during the iommu_get_dma_cookie() path. For IOMMU_DOMAIN_UNMANAGED types used by VFIO (and iommufd for now), have the same iommu_dma_sw_msi set as well in the iommu_get_msi_cookie() path. Hold the group mutex while in iommu_dma_prepare_msi() to ensure the domain doesn't change or become freed while running. Races with IRQ operations from VFIO and domain changes from iommufd are possible here. Replace the msi_prepare_lock with a lockdep assertion for the group mutex as documentation. For dma-iommu.c, each iommu_domain is unique to a group. Link: https://patch.msgid.link/r/4ca696150d2baee03af27c4ddefdb7b0b0280e7b.1740014950.git.nicolinc@nvidia.com Signed-off-by: Nicolin Chen <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 288683c) Signed-off-by: Nirmoy Das <[email protected]>

@handle

Callers of the two APIs always provide a valid handle, so make @handle a mandatory parameter. Take this chance to incorporate the handle->domain set under the protection of group->mutex in iommu_attach_group_handle(). Link: https://patch.msgid.link/r/[email protected] Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Signed-off-by: Yi Liu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 237603a) Signed-off-by: Nirmoy Das <[email protected]>

iommufd does not use it now, so drop it. Link: https://patch.msgid.link/r/[email protected] Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Yi Liu <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 473ec07) Signed-off-by: Nirmoy Das <[email protected]>

iommu_attach_device_pasid() only stores handle to group->pasid_array when there is a valid handle input. However, it makes the iommu_attach_device_pasid() unable to detect if the pasid has been attached or not previously. To be complete, let the iommu_attach_device_pasid() store the domain to group->pasid_array if no valid handle. The other users of the group->pasid_array should be updated to be consistent. e.g. the iommu_attach_group_handle() and iommu_replace_group_handle(). Link: https://patch.msgid.link/r/[email protected] Suggested-by: Jason Gunthorpe <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Yi Liu <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit e1ea9d3) Signed-off-by: Nirmoy Das <[email protected]>

…h op of iommu drivers The current implementation stores entry to the group->pasid_array before the underlying iommu driver has successfully set the new domain. This can lead to issues where PRIs are received on the new domain before the attach operation is completed. This patch swaps the order of operations to ensure that the domain is set in the underlying iommu driver before updating the group->pasid_array. Link: https://patch.msgid.link/r/[email protected] Suggested-by: Jason Gunthorpe <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Signed-off-by: Yi Liu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 5e9f822) Signed-off-by: Nirmoy Das <[email protected]>
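The reordering described here can be sketched as follows. This is a simplified model, not the kernel implementation: the dictionary stands in for group->pasid_array, `set_dev_pasid` is a caller-supplied callback representing the driver's attach op, and the placeholder used to reserve the slot is an assumption for illustration.

```python
class Busy(Exception):
    """Raised when the PASID already has an entry."""


_RESERVED = object()  # placeholder: slot is taken but not yet visible


def attach_device_pasid(pasid_array: dict, pasid: int, domain, set_dev_pasid):
    """Attach `domain` to `pasid`, publishing it only after the driver succeeds."""
    if pasid in pasid_array:
        raise Busy(f"pasid {pasid} already attached")
    # Reserve the slot so a concurrent attach of the same PASID is rejected.
    pasid_array[pasid] = _RESERVED
    try:
        # Driver attach happens first ...
        set_dev_pasid(domain, pasid)
    except Exception:
        # ... and a failure releases the reservation without ever
        # exposing the new domain.
        del pasid_array[pasid]
        raise
    # Only now is the domain stored where fault handling could find it.
    pasid_array[pasid] = domain


if __name__ == "__main__":
    array = {}
    attach_device_pasid(array, 5, "domA", lambda d, p: None)
    print(array)
```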

The drivers doing their own fwspec parsing have no need to call iommu_fwspec_free() since fwspecs were moved into dev_iommu, as returning an error from .probe_device will tear down the whole lot anyway. Move it into the private interface now that it only serves for of_iommu to clean up in an error case. I have no idea what mtk_v1 was doing in effectively guaranteeing a NULL fwspec would be dereferenced if no "iommus" DT property was found, so add a check for that to at least make the code look sane. Signed-off-by: Robin Murphy <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/36e245489361de2d13db22a510fa5c79e7126278.1740667667.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit 29c6e1c) Signed-off-by: Nirmoy Das <[email protected]>

At the moment, if of_iommu_configure() allocates dev->iommu itself via iommu_fwspec_init(), then suffers a DT parsing failure, it cleans up the fwspec but leaves the empty dev_iommu hanging around. So far this is benign (if a tiny bit wasteful), but we'd like to be able to reason about dev->iommu having a consistent and unambiguous lifecycle. Thus make sure that the of_iommu cleanup undoes precisely whatever it did. Signed-off-by: Robin Murphy <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/d219663a3f23001f23d520a883ac622d70b4e642.1740753261.git.robin.murphy@arm.com Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit 3832862) Signed-off-by: Nirmoy Das <[email protected]>

Currently, IRQ_MSI_IOMMU is selected if DMA_IOMMU is available to provide an implementation for iommu_dma_prepare/compose_msi_msg(). However, it'll make more sense for irqchips that call prepare/compose to select it, and that will trigger all the additional code and data to be compiled into the kernel. If IRQ_MSI_IOMMU is selected with no IOMMU side implementation, then the prepare/compose() will be NOP stubs. If IRQ_MSI_IOMMU is not selected by an irqchip, then the related code on the iommu side is compiled out. Link: https://patch.msgid.link/r/a2620f67002c5cdf974e89ca3bf905f5c0817be6.1740014950.git.nicolinc@nvidia.com Signed-off-by: Nicolin Chen <[email protected]> Reviewed-by: Thomas Gleixner <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 96093fe) Signed-off-by: Nirmoy Das <[email protected]>

nvgrace-egm exposes the APIs register_egm_node and unregister_egm_node to manage EGM (Extended GPU Memory) present on the system. To allow out-of-tree drivers such as nvidia-vgpu-vfio to make use of them, move the declarations to a new nvgrace-egm.h in include. Signed-off-by: Ankit Agrawal <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit bed340f https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit a961663 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

…tion Free the kmalloc'd region when the EGM is unregistered. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit fc592b9 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit f24760c https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

Move region hash initialization alongside the other region initialization statements to avoid situations where the hash table was not properly initialized. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 8021c1d https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit e1264a6 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

…rrors Update error handling within the EGM registration routine to catch and return errors to the caller. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit a57210c https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit a706ff8 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

Detect and handle a failure from the EGM registration service. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit f18eee3 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 8371b68 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

Fix source to resolve checkpatch warnings Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit c7b47b7 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit dfa0e06 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

Fix minor syntax errors from sparse. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit bbb64e6 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit fe78194 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

Return the intended errno upon a copyout fault, remove unnecessary checks following container_of pointer derivation, and use the correct macro and types for overflow checking. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 429910b https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit bda63f3 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

Use the correct macro and types for overflow checking. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit afa8f63 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit d110330 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

Ensure ACPI table reads are successful prior to using the value. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit b2947b0 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 9258355 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

Some environments may provide a "nvidia,egm-retired-pages-data-base" but fail to populate it with a base address, leaving it NULL. Mapping this invalid value results in a synchronous exception when the region is first touched. Detect a NULL value, generate a warning to draw attention to the firmware bug, and return without mapping.

INFO: th500_ras_intr_handler: External Abort reason=1 syndrome=0x92000410 flags=0x1
[ 82.104493] Internal error: synchronous external abort: 0000000096000410 [#1] SMP
[ 82.114898] Modules linked in: nvgrace_gpu_vfio_pci(E) nvgrace_egm(E)
[ 82.257218] CPU: 0 PID: 10 Comm: kworker/0:1 Tainted: G OE 6.8.12+ #5
[ 82.265135] Hardware name: NVIDIA GH200 P5042, BIOS 24103110 20241031
[ 82.271720] Workqueue: events work_for_cpu_fn
[ 82.276180] pstate: 03400009 (nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 82.283298] pc : register_egm_node+0x2cc/0x440 [nvgrace_egm]
[ 82.289087] lr : register_egm_node+0x2c4/0x440 [nvgrace_egm]
[ 82.294872] sp : ffff8000802ebc30
[ 82.298254] x29: ffff8000802ebc60 x28: 00000000000000ff x27: 0000000000000000
[ 82.305550] x26: ffff000087a320c8 x25: ffff0000a5700000 x24: ffff000087a32000
[ 82.312846] x23: ffffa77cd758e368 x22: 0000000000000000 x21: ffffa77cd758c640
[ 82.320141] x20: ffffa77cd758e170 x19: ffff800081e7d000 x18: ffff800080293038
[ 82.327437] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 82.334732] x14: 0000000000000000 x13: 65203a65646f6e5f x12: 0000000000000000
[ 82.342027] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
[ 82.349322] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 82.356618] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 82.363913] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff800081e7d000
[ 82.371210] Call trace:
[ 82.373705] register_egm_node+0x2cc/0x440 [nvgrace_egm]
[ 82.379135] nvgrace_gpu_probe+0x2ac/0x528 [nvgrace_gpu_vfio_pci]
[ 82.385366] local_pci_probe+0x4c/0xe0
[ 82.389198] work_for_cpu_fn+0x28/0x58
[ 82.393026] process_one_work+0x168/0x3f0
[ 82.397123] worker_thread+0x360/0x480
[ 82.400952] kthread+0x11c/0x128
[ 82.404248] ret_from_fork+0x10/0x20
[ 82.407906] Code: d2820001 940002b3 aa0003f3 b4fffac0 (f9400017)
[ 82.414134] ---[ end trace 0000000000000000 ]---

Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 7ba2930 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 349fb1c https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
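
For readers decoding the abort in the trace above: the value printed after "synchronous external abort:" is the ESR_ELx syndrome. The following is a minimal Python sketch, not part of the patch, that splits it using the Arm architecture's field layout (EC in bits [31:26], IL in bit 25, ISS in bits [24:0]; for data aborts, DFSC in ISS[5:0], WnR in ISS[6], FnV in ISS[10]). The helper name `decode_esr` is hypothetical.

```python
def decode_esr(esr: int) -> dict:
    """Split an AArch64 ESR_ELx value into the fields relevant to a data abort."""
    iss = esr & 0x1FFFFFF              # ISS: bits [24:0]
    return {
        "ec": (esr >> 26) & 0x3F,      # exception class: bits [31:26]
        "il": (esr >> 25) & 0x1,       # instruction length bit
        "dfsc": iss & 0x3F,            # data fault status code: ISS[5:0]
        "wnr": (iss >> 6) & 0x1,       # 1 = write, 0 = read
        "fnv": (iss >> 10) & 0x1,      # 1 = FAR not valid
    }


# Syndrome taken from the "Internal error" line of the trace.
fields = decode_esr(0x96000410)
for name, value in fields.items():
    print(f"{name}: {value:#x}")
```

For this value, EC is 0x25 (data abort taken without a change in exception level) and DFSC is 0x10 (synchronous external abort, not on a translation table walk). WnR is 0, so the faulting access was a read, which fits the description of the fault occurring when the mapped region is first touched.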

In an effort to simplify the programming model, use a symmetrical model for the EGM registration APIs. This avoids the caller needing to keep a cookie or even know whether EGM is supported. Update the EGM unregistration API to use the PCI device as its parameter. Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit d8903ec https://github.com/nvmochs/NV-Kernels/tree/vegm_01232025) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 5839fc5 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

…egions GB200 systems could have multiple GPUs associated with an EGM region. For proper EGM functionality the host topology in terms of GPU affinity has to be replicated in the VM. Hence the EGM region structure must track the GPU devices belonging to the same socket. On the device probe, the device pci_dev struct is added to a linked list of the appropriate EGM region. Similarly on device remove, the pci_dev struct for the GPU is removed from the EGM region. Signed-off-by: Ankit Agrawal <[email protected]> Ref: sj24: /home/nvidia/ankita/kernel_patches/0001_vfio_nvgrace-egm_track_GPUs_associated_with_the_EGM_regions.patch (koba: Enhance error handling, Remove egm_node from unregister_egm_node and move destroy_egm_chardev a little forward) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 0222c35 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

To replicate the host EGM topology in the VM in terms of GPU affinity, userspace needs to be aware of which GPUs belong to the same socket as the EGM region. Expose the list of GPUs associated with an EGM region through sysfs. The list can be queried from the location /sys/devices/virtual/egm/egmX/gpu_devices. Signed-off-by: Ankit Agrawal <[email protected]> Ref: sj24: /home/nvidia/ankita/kernel_patches/0002_vfio_nvgrace-egm_list_gpus_through_sysfs.patch (koba: Enhance error handling for sysfs_create_group) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit fec2356 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
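
As an illustration of how userspace might consume the gpu_devices attribute described above, here is a minimal Python sketch. It is not taken from the patch. The function name `read_gpu_devices` is hypothetical, and the file format is an assumption: this PR does not show what the driver emits, so the sketch simply treats the content as whitespace- or newline-separated entries.

```python
from pathlib import Path


def read_gpu_devices(egm_dir: str) -> list:
    """Return the entries listed in <egm_dir>/gpu_devices.

    Assumption: entries are separated by whitespace or newlines. The exact
    format written by nvgrace-egm is not shown in this PR.
    """
    return (Path(egm_dir) / "gpu_devices").read_text().split()
```

On a host with the module loaded, the directory argument would be one of the /sys/devices/virtual/egm/egmX paths named in the commit message; taking the directory as a parameter keeps the helper testable against an ordinary temporary directory.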

To allocate the EGM, userspace needs to know its size. Currently, there is no easy way for userspace to determine that. Make nvgrace-egm expose the size through sysfs so that userspace can query it from /sys/devices/virtual/egm/egmX/egm_size. Signed-off-by: Ankit Agrawal <[email protected]> Ref: sj24: /home/nvidia/ankita/kernel_patches/0003_vfio_nvgrace-egm_expose_the_egm_size_through_sysfs.patch Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit dcdcef2 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
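
A hedged sketch of how the egm_size attribute could feed the QEMU memory backend used in the EGM test command from this PR's description. Both helper names are hypothetical. The sketch assumes the attribute holds a single number (decimal or 0x-prefixed hex) expressed in bytes; neither the format nor the unit is stated in this PR, so check the driver before relying on it.

```python
from pathlib import Path


def read_egm_size(egm_dir: str) -> int:
    """Return the number stored in <egm_dir>/egm_size.

    Assumption: one value, decimal or 0x-prefixed hex, in bytes.
    """
    text = (Path(egm_dir) / "egm_size").read_text().strip()
    base = 16 if text.lower().startswith("0x") else 10
    return int(text, base)


def memory_backend_arg(egm_dev: str, size_bytes: int, obj_id: str = "m0") -> str:
    """Build the value for QEMU's -object option, mirroring the test command."""
    return (
        f"memory-backend-file,id={obj_id},mem-path={egm_dev},"
        f"size={size_bytes},share=on,prealloc=on"
    )
```

For example, `memory_backend_arg("/dev/egm4", read_egm_size("/sys/devices/virtual/egm/egm4"))` would replace the hard-coded `size=16G` in the test command with the size reported by the driver, assuming the sysfs and /dev names correspond.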

… allocations Add missing null pointer checks after vzalloc() calls in the NVIDIA Grace GPU driver's EGM (Extended GPU Memory) handling code. This prevents potential null pointer dereferences in the memory failure handling and bad page fetching functions, providing proper error handling for allocation failures. Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 63127e2 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

Add CONFIG_NVGRACE_EGM with policy 'm' for arm64 architecture. Signed-off-by: Nirmoy Das <[email protected]>
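
One way to confirm that a generated .config actually builds the option as a module is to look the symbol up directly. The following Python sketch is illustrative only; `kconfig_value` is a hypothetical helper, and it relies on nothing beyond the standard .config syntax (`CONFIG_FOO=value` for set options, `# CONFIG_FOO is not set` otherwise).

```python
import re


def kconfig_value(config_text: str, option: str):
    """Return the value assigned to `option` in .config text, or None.

    None covers both an absent option and a '# ... is not set' line.
    """
    pattern = re.compile(rf"^{re.escape(option)}=(.*)$", re.MULTILINE)
    match = pattern.search(config_text)
    return match.group(1) if match else None


# Illustrative fragment, not copied from a real build.
sample = "CONFIG_VFIO=m\n# CONFIG_EXAMPLE is not set\nCONFIG_NVGRACE_EGM=m\n"
print(kconfig_value(sample, "CONFIG_NVGRACE_EGM"))
```

Matching on the full `OPTION=` prefix at the start of a line avoids false hits on longer symbol names that merely begin with the same text.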

On platforms without the MIG HW bug (e.g. Grace-Blackwell) there is no requirement to create the resmem region. Accordingly, this region is not configured on these platforms, which leads to the following print when the device is closed: resource: Trying to free nonexistent resource <0x0000000000000000-0x000000000000ffff> Avoid calling unregister_pfn_address_space for resmem when the region is not being used. Fixes: 2d21b7b ("vfio/nvgrace-gpu: register device memory for poison handling") Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Nirmoy Das <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit bd0187d https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>

nvmochs

Ran pick analyzer on c7d2a4a^..e2d029c; the majority of the patches match upstream exactly. I reviewed the ones that were flagged and found they were only called out due to minor context differences or the addition of "NVIDIA: SAUCE:" tags.

Manually reviewed the patches with backport tags; no issues or concerns.

Lastly, confirmed pick tags and trailers are present and correct.

Acked-by: Matthew R. Ochs <[email protected]>

clsotog · 2025-07-24T19:53:02Z

Question: I see CONFIG_NVGRACE_EGM configured in 2 places:
4ff9dc6 NVIDIA: SAUCE: arm64: configs: enable NVGRACE_EGM as module
fa811f0 NVIDIA: SAUCE: arm64: configs: Build CONFIG_NVGRACE_EGM as LKM

Do we need both places?

nvmochs · 2025-07-24T20:41:17Z

> Question: I see CONFIG_NVGRACE_EGM configured in 2 places:
> 4ff9dc6 NVIDIA: SAUCE: arm64: configs: enable NVGRACE_EGM as module
> fa811f0 NVIDIA: SAUCE: arm64: configs: Build CONFIG_NVGRACE_EGM as LKM
>
> Do we need both places?

4ff9dc6 sets it in the annotations, fa811f0 sets it in the defconfig. The defconfig one is not really needed in the Ubuntu tree, but I have continued to carry it forward since some CSPs were not using the annotations.

clsotog · 2025-07-24T21:11:17Z

So CSPs can be using this exact git tree but they do not use the annotations to build the kernel?
There are more things in the annotations for our kernel like tpm, cpufreq performance, etc. So they use the grace doc to get the other config needed?

nvmochs · 2025-07-24T21:21:02Z

> So CSPs can be using this exact git tree but they do not use the annotations to build the kernel?
> There are more things in the annotations for our kernel like tpm, cpufreq performance, etc. So they use the grace doc to get the other config needed?

We have now advised them to use the annotations, and updated the reference code release notes with the command to generate the .config from them. But of course we cannot force them. =)

If you feel strongly about it we can remove the defconfig commit, I don't have a preference either way.

clsotog · 2025-07-24T21:51:55Z

No, it's OK. Leave it.
I remembered the issue I was helping Nathan with, and it was a config issue. If we now recommend the annotations, that's great!

clsotog

Acked-by: Carol L Soto <[email protected]>

nvmochs · 2025-07-24T23:33:50Z

Merged, closing PR.

rppt and others added 2 commits July 18, 2025 07:38

nirmoy marked this pull request as draft July 22, 2025 13:24

nirmoy changed the title ~~[draft][6.14-adv-next] Backport: Add Extended GPU Memory (EGM) virtualization support~~ [draft][6.14-adv-next] Add Grace virtualization support to 6.14-adv, (upstream vEVENTQ + HW QUEUE and OOT vEGM) Jul 22, 2025

nirmoy force-pushed the 614_tech_preview_virt.1 branch 2 times, most recently from b4c3a62 to e30dedb Compare July 23, 2025 13:05

nirmoy marked this pull request as ready for review July 23, 2025 13:13

nirmoy changed the title ~~[draft][6.14-adv-next] Add Grace virtualization support to 6.14-adv, (upstream vEVENTQ + HW QUEUE and OOT vEGM)~~ [6.14-adv-next] Add Grace virtualization support to 6.14-adv, (upstream vEVENTQ + HW QUEUE and OOT vEGM) Jul 23, 2025

nirmoy force-pushed the 614_tech_preview_virt.1 branch 3 times, most recently from c36eb0a to df3cae8 Compare July 23, 2025 13:28

nvmochs requested review from clsotog and nvmochs July 23, 2025 14:43

nirmoy force-pushed the 614_tech_preview_virt.1 branch 3 times, most recently from e7e4110 to 7eeda3f Compare July 23, 2025 16:07

nirmoy and others added 14 commits July 23, 2025 09:21

Revert "NVIDIA: SAUCE: iommu/arm-smmu-v3: Implement arm_smmu_get_msi_…

c23c959

…mapping_domain" This reverts commit 78480b2. Signed-off-by: Nirmoy Das <[email protected]>

Revert "NVIDIA: SAUCE: iommu/dma: Support MSIs through nested domains"

d2e60f9

This reverts commit 8aced5e. Signed-off-by: Nirmoy Das <[email protected]>

ankita-nv and others added 18 commits July 23, 2025 09:22

NVIDIA: SAUCE: arm64: configs: enable NVGRACE_EGM as module

4ff9dc6

Add CONFIG_NVGRACE_EGM with policy 'm' for arm64 architecture. Signed-off-by: Nirmoy Das <[email protected]>

nirmoy force-pushed the 614_tech_preview_virt.1 branch from 7eeda3f to e2d029c Compare July 23, 2025 16:22

nvmochs approved these changes Jul 23, 2025

clsotog approved these changes Jul 24, 2025

nvmochs closed this Jul 24, 2025

nirmoy mentioned this pull request Aug 5, 2025

Add Grace virtualization support to 6.14 HWE, (upstream vEVENTQ + HW QUEUE and OOT vEGM #179

Closed

nvmochs mentioned this pull request Aug 25, 2025

Revert iommu patch series from 24.04_linux-nvidia-6.14-next #195

Closed

12 participants

VM start command for vEVENTQ testing with `cmdqv=on`