Add Grace virtualization support to 6.14 HWE (upstream vEVENTQ + HW QUEUE and OOT vEGM) #179
Conversation
The module code does not create a writable copy of the executable memory anymore so there is no need to handle it in module relocation and alternatives patching.

This reverts commit 9bfc482.

Signed-off-by: "Mike Rapoport (Microsoft)" <[email protected]>
Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 1d7e707)
Signed-off-by: Nirmoy Das <[email protected]>
Pretty much every caller of is_endbr() actually wants to test something at an address and ends up doing get_kernel_nofault(). Fold the lot into a more convenient helper.

Signed-off-by: Peter Zijlstra (Intel) <[email protected]>
Reviewed-by: Sami Tolvanen <[email protected]>
Acked-by: Alexei Starovoitov <[email protected]>
Acked-by: Andrii Nakryiko <[email protected]>
Acked-by: "Masami Hiramatsu (Google)" <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
(cherry picked from commit 72e213a)
Signed-off-by: Nirmoy Das <[email protected]>
…mapping_domain"

This reverts commit 78480b2.

Signed-off-by: Nirmoy Das <[email protected]>
There are new attach/detach/replace helpers in device.c taking care of both the attach_handle and the fault specific routines for iopf_enable/disable() and auto response. Clean up these redundant functions in the fault.c file.

Link: https://patch.msgid.link/r/3ca94625e9d78270d9a715fa0809414fddd57e58.1738645017.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Yi Liu <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit dc10ba2)
Signed-off-by: Nirmoy Das <[email protected]>
This reverts commit 8aced5e.

Signed-off-by: Nirmoy Das <[email protected]>
…u_cookie
The IOMMU translation for MSI message addresses has been a 2-step process,
separated in time:
1) iommu_dma_prepare_msi(): A cookie pointer containing the IOVA address
is stored in the MSI descriptor when an MSI interrupt is allocated.
2) iommu_dma_compose_msi_msg(): this cookie pointer is used to compute a
translated message address.
This has an inherent lifetime problem for the pointer stored in the cookie
that must remain valid between the two steps. However, there is no locking
at the irq layer that helps protect the lifetime. Today, this works under
the assumption that the iommu domain is not changed while MSI interrupts
being programmed. This is true for normal DMA API users within the kernel,
as the iommu domain is attached before the driver is probed and cannot be
changed while a driver is attached.
Classic VFIO type1 also prevented changing the iommu domain while VFIO was
running as it does not support changing the "container" after starting up.
However, iommufd has improved this so that the iommu domain can be changed
during VFIO operation. This potentially allows userspace to directly race
VFIO_DEVICE_ATTACH_IOMMUFD_PT (which calls iommu_attach_group()) and
VFIO_DEVICE_SET_IRQS (which calls into iommu_dma_compose_msi_msg()).
This potentially causes both the cookie pointer and the unlocked call to
iommu_get_domain_for_dev() on the MSI translation path to become UAFs.
Fix the MSI cookie UAF by removing the cookie pointer. The translated IOVA
address is already known during iommu_dma_prepare_msi() and cannot change.
Thus, it can simply be stored as an integer in the MSI descriptor.
The other UAF related to iommu_get_domain_for_dev() will be addressed in
patch "iommu: Make iommu_dma_prepare_msi() into a generic operation" by
using the IOMMU group mutex.
Link: https://patch.msgid.link/r/a4f2cd76b9dc1833ee6c1cf325cba57def22231c.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 1f7df3a)
Signed-off-by: Nirmoy Das <[email protected]>
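The shape of the fix can be illustrated with a minimal userspace sketch: the MSI descriptor carries the translated IOVA by value, stored at prepare time, so no pointer lifetime spans the two steps. The struct and field names below are illustrative stand-ins, not the exact upstream layout.

```c
#include <stdint.h>

/* Illustrative stand-in for the kernel's MSI descriptor: the cookie
 * pointer is replaced by a plain integer. */
struct msi_desc {
    uint64_t msi_iova;   /* was: an opaque cookie pointer into the domain */
};

/* Step 1: record the translated IOVA by value; since it cannot change
 * after iommu_dma_prepare_msi(), nothing can go stale. */
static void prepare_msi(struct msi_desc *desc, uint64_t iova)
{
    desc->msi_iova = iova;
}
```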
The two-step process to translate the MSI address involves two functions, iommu_dma_prepare_msi() and iommu_dma_compose_msi_msg(). Previously iommu_dma_compose_msi_msg() needed to be in the iommu layer as it had to dereference the opaque cookie pointer. Now, the previous patch changed the cookie pointer into an integer so there is no longer any need for the iommu layer to be involved.

Further, the call sites of iommu_dma_compose_msi_msg() all follow the same pattern of setting an MSI message address_hi/lo to non-translated and then immediately calling iommu_dma_compose_msi_msg().

Refactor iommu_dma_compose_msi_msg() into msi_msg_set_addr() that directly accepts the u64 version of the address and simplifies all the callers. Move the new helper to linux/msi.h since it has nothing to do with iommu.

Aside from refactoring, this logically prepares for the next patch, which allows multiple implementation options for iommu_dma_prepare_msi(). So, it does not make sense to have the iommu_dma_compose_msi_msg() in dma-iommu.c as it no longer provides the only iommu_dma_prepare_msi() implementation.

Link: https://patch.msgid.link/r/eda62a9bafa825e9cdabd7ddc61ad5a21c32af24.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 9349887)
Signed-off-by: Nirmoy Das <[email protected]>
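The core of the new helper is just splitting a u64 into the message's hi/lo halves. Below is a cut-down userspace sketch of that idea; the real msi_msg_set_addr() in linux/msi.h also takes the descriptor into account, so this is for illustration only.

```c
#include <stdint.h>

/* Minimal stand-in for the kernel's msi_msg; field names follow the
 * upstream struct but this is a userspace sketch, not kernel code. */
struct msi_msg {
    uint32_t address_lo;
    uint32_t address_hi;
    uint32_t data;
};

/* Simplified msi_msg_set_addr(): accept the (possibly IOMMU-translated)
 * address as a plain u64 and write the hi/lo halves directly, with no
 * iommu-layer involvement. */
static inline void msi_msg_set_addr(struct msi_msg *msg, uint64_t addr)
{
    msg->address_hi = (uint32_t)(addr >> 32);
    msg->address_lo = (uint32_t)addr;
}
```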
SW_MSI allows the IOMMU to translate an MSI message before the MSI message is delivered to the interrupt controller. On such systems, an iommu_domain must have a translation for the MSI message for interrupts to work. The IRQ subsystem will call into the IOMMU to request that a physical page be set up to receive MSI messages, and the IOMMU then sets an IOVA that maps to that physical page. Ultimately the IOVA is programmed into the device via the msi_msg.

Generalize this by allowing iommu_domain owners to provide implementations of this mapping. Add a function pointer in struct iommu_domain to allow a domain owner to provide its own implementation. Have dma-iommu supply its implementation for IOMMU_DOMAIN_DMA types during the iommu_get_dma_cookie() path. For IOMMU_DOMAIN_UNMANAGED types used by VFIO (and iommufd for now), have the same iommu_dma_sw_msi set as well in the iommu_get_msi_cookie() path.

Hold the group mutex while in iommu_dma_prepare_msi() to ensure the domain doesn't change or become freed while running. Races with IRQ operations from VFIO and domain changes from iommufd are possible here. Replace the msi_prepare_lock with a lockdep assertion for the group mutex as documentation. For dma-iommu.c, each iommu_domain is unique to a group.

Link: https://patch.msgid.link/r/4ca696150d2baee03af27c4ddefdb7b0b0280e7b.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 288683c)
Signed-off-by: Nirmoy Das <[email protected]>
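The op-pointer shape described above can be sketched in userspace: the domain owner installs its own sw_msi implementation and the prepare path simply dispatches through it. All names and signatures here are illustrative stand-ins for the kernel structures, not the real API.

```c
#include <stddef.h>
#include <stdint.h>

typedef uint64_t phys_addr_t;

/* Illustrative descriptor; the sketch records the doorbell address as
 * the "IOVA" instead of performing a real mapping. */
struct msi_desc { uint64_t msi_iova; };

/* The domain owner supplies the sw_msi implementation via a function
 * pointer, instead of hard-wiring the dma-iommu one. */
struct iommu_domain {
    int (*sw_msi)(struct iommu_domain *domain, struct msi_desc *desc,
                  phys_addr_t msi_addr);
};

/* dma-iommu style implementation: map the MSI doorbell page into the
 * domain and record the resulting IOVA (stubbed here). */
static int dma_iommu_sw_msi(struct iommu_domain *domain,
                            struct msi_desc *desc, phys_addr_t msi_addr)
{
    (void)domain;
    desc->msi_iova = msi_addr;   /* stand-in for the real IOVA mapping */
    return 0;
}

/* The generic prepare step dispatches through the domain's op and is a
 * no-op if the domain has none. */
static int iommu_dma_prepare_msi(struct iommu_domain *domain,
                                 struct msi_desc *desc,
                                 phys_addr_t msi_addr)
{
    return domain->sw_msi ? domain->sw_msi(domain, desc, msi_addr) : 0;
}
```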
Callers of the two APIs always provide a valid handle, so make @handle a mandatory parameter. Take this chance to incorporate the handle->domain set under the protection of group->mutex in iommu_attach_group_handle().

Link: https://patch.msgid.link/r/[email protected]
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 237603a)
Signed-off-by: Nirmoy Das <[email protected]>
iommufd does not use it now, so drop it.

Link: https://patch.msgid.link/r/[email protected]
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 473ec07)
Signed-off-by: Nirmoy Das <[email protected]>
iommu_attach_device_pasid() only stores the handle to group->pasid_array when there is a valid handle input. However, this makes iommu_attach_device_pasid() unable to detect whether the pasid has been attached previously. To be complete, let iommu_attach_device_pasid() store the domain to group->pasid_array if there is no valid handle. The other users of group->pasid_array should be updated to be consistent, e.g. iommu_attach_group_handle() and iommu_replace_group_handle().

Link: https://patch.msgid.link/r/[email protected]
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit e1ea9d3)
Signed-off-by: Nirmoy Das <[email protected]>
…h op of iommu drivers

The current implementation stores the entry to group->pasid_array before the underlying iommu driver has successfully set the new domain. This can lead to issues where PRIs are received on the new domain before the attach operation is completed. This patch swaps the order of operations to ensure that the domain is set in the underlying iommu driver before updating group->pasid_array.

Link: https://patch.msgid.link/r/[email protected]
Suggested-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Nicolin Chen <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Signed-off-by: Yi Liu <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 5e9f822)
Signed-off-by: Nirmoy Das <[email protected]>
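The ordering requirement can be shown with a toy model: the driver-level attach must succeed before the entry is published, so code looking up the array never finds a handle for a domain the driver has not fully set. All names below are illustrative stand-ins for the kernel code, not the real API.

```c
#include <stddef.h>

#define MAX_PASID 8

/* Toy group: pasid -> published handle, NULL when nothing attached. */
struct group {
    void *pasid_array[MAX_PASID];
};

/* Simulates whether the underlying iommu driver's attach succeeds. */
static int driver_attach_ok;

static int driver_set_dev_pasid(void)
{
    return driver_attach_ok ? 0 : -1;
}

static int attach_device_pasid(struct group *g, int pasid, void *handle)
{
    int ret = driver_set_dev_pasid();  /* 1) set the domain in the driver */
    if (ret)
        return ret;                    /* failure: nothing was published */
    g->pasid_array[pasid] = handle;    /* 2) only then publish the entry */
    return 0;
}
```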
The drivers doing their own fwspec parsing have no need to call iommu_fwspec_free() since fwspecs were moved into dev_iommu, as returning an error from .probe_device will tear down the whole lot anyway. Move it into the private interface now that it only serves for of_iommu to clean up in an error case.

I have no idea what mtk_v1 was doing in effectively guaranteeing a NULL fwspec would be dereferenced if no "iommus" DT property was found, so add a check for that to at least make the code look sane.

Signed-off-by: Robin Murphy <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/36e245489361de2d13db22a510fa5c79e7126278.1740667667.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <[email protected]>
(cherry picked from commit 29c6e1c)
Signed-off-by: Nirmoy Das <[email protected]>
At the moment, if of_iommu_configure() allocates dev->iommu itself via iommu_fwspec_init(), then suffers a DT parsing failure, it cleans up the fwspec but leaves the empty dev_iommu hanging around. So far this is benign (if a tiny bit wasteful), but we'd like to be able to reason about dev->iommu having a consistent and unambiguous lifecycle. Thus make sure that the of_iommu cleanup undoes precisely whatever it did.

Signed-off-by: Robin Murphy <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Link: https://lore.kernel.org/r/d219663a3f23001f23d520a883ac622d70b4e642.1740753261.git.robin.murphy@arm.com
Signed-off-by: Joerg Roedel <[email protected]>
(cherry picked from commit 3832862)
Signed-off-by: Nirmoy Das <[email protected]>
Currently, IRQ_MSI_IOMMU is selected if DMA_IOMMU is available to provide an implementation for iommu_dma_prepare/compose_msi_msg(). However, it'll make more sense for irqchips that call prepare/compose to select it, and that will trigger all the additional code and data to be compiled into the kernel.

If IRQ_MSI_IOMMU is selected with no IOMMU side implementation, then the prepare/compose() will be NOP stubs. If IRQ_MSI_IOMMU is not selected by an irqchip, then the related code on the iommu side is compiled out.

Link: https://patch.msgid.link/r/a2620f67002c5cdf974e89ca3bf905f5c0817be6.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Thomas Gleixner <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 96093fe)
Signed-off-by: Nirmoy Das <[email protected]>
A "fault_data" was added exclusively for the iommufd_fault_iopf_handler() used by IOPF/PRI use cases, along with the attach_handle. Now, the iommufd version of the sw_msi function will reuse the attach_handle and fault_data for a non-fault case.

Rename "fault_data" to "iommufd_hwpt" so as not to confine it to a "fault" case. Move it into a union to be the iommufd private pointer. A following patch will move the iova_cookie to the union for dma-iommu too after the iommufd_sw_msi implementation is added.

Since we have two unions now, add some simple comments for readability.

Link: https://patch.msgid.link/r/ee5039503f28a16590916e9eef28b917e2d1607a.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 748706d)
Signed-off-by: Nirmoy Das <[email protected]>
iommufd has a model where the iommu_domain can be changed while the VFIO device is attached. In this case, the MSI should continue to work. This corner case has not worked because the dma-iommu implementation of sw_msi is tied to a single domain.

Implement the sw_msi mapping directly and use a global per-fd table to associate assigned IOVA to the MSI pages. This allows the MSI pages to be loaded into a domain before it is attached ensuring that MSI is not disrupted.

Link: https://patch.msgid.link/r/e13d23eeacd67c0a692fc468c85b483f4dd51c57.1740014950.git.nicolinc@nvidia.com
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 40f5175)
Signed-off-by: Nirmoy Das <[email protected]>
There is no need to keep them in the header. The vEVENTQ version of these two functions will turn out to be a different implementation and will not share with this fault version. Thus, move them out of the header.

Link: https://patch.msgid.link/r/7eebe32f3d354799f5e28128c693c3c284740b21.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit dbf00d7)
Signed-off-by: Nirmoy Das <[email protected]>
The infrastructure of a fault object will be shared with a new vEVENTQ object in a following change. Add an iommufd_fault_init helper and an INIT_EVENTQ_FOPS macro for a vEVENTQ allocator to use too.

Reorder the iommufd_ctx_get and refcount_inc, to keep them symmetrical with iommufd_fault_fops_release(). Since the new vEVENTQ doesn't need "response" and its "mutex", keep the xa_init_flags and mutex_init in their original locations.

Link: https://patch.msgid.link/r/a9522c521909baeb1bd843950b2490478f3d06e0.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 927dabc)
Signed-off-by: Nirmoy Das <[email protected]>
The fault object was designed exclusively for hwpt's IO page faults (PRI). But its queue implementation can be reused for other purposes too, such as hardware IRQ and event injections to user space.

Meanwhile, a fault object holds a list of faults. So it's more accurate to call it a "fault queue". Combining the reusing idea above, abstract a new iommufd_eventq as a common structure embedded into struct iommufd_fault, similar to hwpt_paging holding a common hwpt.

Add a common iommufd_eventq_ops and iommufd_eventq_init to prepare for an IOMMUFD_OBJ_VEVENTQ (vIOMMU Event Queue).

Link: https://patch.msgid.link/r/e7336a857954209aabb466e0694aab323da95d90.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 5426a78)
Signed-off-by: Nirmoy Das <[email protected]>
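The embedding pattern described above can be sketched in userspace: a common iommufd_eventq lives inside the fault object (the way hwpt_paging holds a common hwpt), and fault-specific code recovers its containing object with container_of. Struct names follow the commit message; the fields and helper are illustrative.

```c
#include <stddef.h>

/* Kernel-style container_of, re-derived here so the sketch is
 * self-contained. */
#define container_of(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct iommufd_eventq_ops { int unused; };

/* Common queue machinery shared by fault and (future) vEVENTQ objects. */
struct iommufd_eventq {
    const struct iommufd_eventq_ops *ops;
};

/* The fault object embeds the common eventq and keeps fault-only state
 * (response xarray, mutex, ...) alongside it. */
struct iommufd_fault {
    struct iommufd_eventq common;
    int response_pending;   /* illustrative fault-only field */
};

/* Common-path code sees only the eventq; fault code recovers the
 * containing object. */
static struct iommufd_fault *eventq_to_fault(struct iommufd_eventq *eventq)
{
    return container_of(eventq, struct iommufd_fault, common);
}
```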
Rename the file, aligning with the new eventq object.

Link: https://patch.msgid.link/r/d726397e2d08028e25a1cb6eb9febefac35a32ba.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Lu Baolu <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 0507f33)
Signed-off-by: Nirmoy Das <[email protected]>
Introduce a new IOMMUFD_OBJ_VEVENTQ object for vIOMMU Event Queue that provides user space (VMM) another FD to read the vIOMMU Events.

Allow a vIOMMU object to allocate vEVENTQs, with a condition that each vIOMMU can only have one single vEVENTQ per type.

Add iommufd_veventq_alloc() with iommufd_veventq_ops for the new ioctl.

Link: https://patch.msgid.link/r/21acf0751dd5c93846935ee06f93b9c65eff5e04.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit e36ba5a)
Signed-off-by: Nirmoy Das <[email protected]>
This is a reverse search vs. iommufd_viommu_find_dev, as drivers may want to convert a struct device pointer (physical) to its virtual device ID for an event injection to the user space VM. Again, this avoids exposing more core structures to the drivers than the iommufd_viommu alone.

Link: https://patch.msgid.link/r/18b8e8bc1b8104d43b205d21602c036fd0804e56.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit ea94b21)
Signed-off-by: Nirmoy Das <[email protected]>
Similar to iommu_report_device_fault, this allows IOMMU drivers to report vIOMMU events from threaded IRQ handlers to user space hypervisors.

Link: https://patch.msgid.link/r/44be825042c8255e75d0151b338ffd8ba0e4920b.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit e8e1ef9)
Signed-off-by: Nirmoy Das <[email protected]>
When attaching a device to a vIOMMU-based nested domain, vdev_id must be present. Add a piece of code hard-requesting it, preparing for vEVENTQ support in the following patch. Then, update the TEST_F.

A HWPT-based nested domain will return a NULL new_viommu, and thus has no such vDEVICE requirement.

Link: https://patch.msgid.link/r/4051ca8a819e51cb30de6b4fe9e4d94d956afe3d.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 941d071)
Signed-off-by: Nirmoy Das <[email protected]>
The handler will get the vDEVICE object from the given mdev and convert it to its per-vIOMMU virtual ID to mimic a real IOMMU driver.

Link: https://patch.msgid.link/r/1ea874d20e56d65e7cfd6e0e8e01bd3dbd038761.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit b3cc0b7)
Signed-off-by: Nirmoy Das <[email protected]>
Trigger vEVENTs by feeding an idev ID and validating that the returned virt_ids equal the value that was set to the vDEVICE.

Link: https://patch.msgid.link/r/e829532ec0a3927d61161b7674b20e731ecd495b.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 97717a1)
Signed-off-by: Nirmoy Das <[email protected]>
With the introduction of the new objects, update the doc to reflect that.

Link: https://patch.msgid.link/r/09829fbc218872d242323d8834da4bec187ce6f4.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Lu Baolu <[email protected]>
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Bagas Sanjaya <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit 2ec0458)
Signed-off-by: Nirmoy Das <[email protected]>
Use it to store all vSMMU-related data. The vsid (Virtual Stream ID) will be the first use case. Since the vsid reader will be the eventq handler that already holds a streams_mutex, reuse that to fence the vmaster too.

Also add a pair of arm_smmu_attach_prepare/commit_vmaster helpers to set or unset the master->vmaster pointer. Put the helpers inside the existing arm_smmu_attach_prepare/commit(). For identity/blocked ops that don't call arm_smmu_attach_prepare/commit(), add a simpler arm_smmu_master_clear_vmaster helper to unset the vmaster.

Link: https://patch.msgid.link/r/a7f282e1a531279e25f06c651e95d56f6b120886.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Pranjal Shrivastava <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit f0ea207)
Signed-off-by: Nirmoy Das <[email protected]>
…IOMMU

Aside from the IOPF framework, iommufd provides an additional pathway to report hardware events, via the vEVENTQ of vIOMMU infrastructure.

Define an iommu_vevent_arm_smmuv3 uAPI structure, and report stage-1 events in the threaded IRQ handler. Also, add another four event record types that can be forwarded to a VM.

Link: https://patch.msgid.link/r/5cf6719682fdfdabffdb08374cdf31ad2466d75a.1741719725.git.nicolinc@nvidia.com
Reviewed-by: Kevin Tian <[email protected]>
Reviewed-by: Jason Gunthorpe <[email protected]>
Reviewed-by: Pranjal Shrivastava <[email protected]>
Acked-by: Will Deacon <[email protected]>
Signed-off-by: Nicolin Chen <[email protected]>
Signed-off-by: Jason Gunthorpe <[email protected]>
(cherry picked from commit e7d3fa3)
Signed-off-by: Nirmoy Das <[email protected]>
…e in VMA

When the Grace Hopper/Blackwell system is set up with EGM mode in virtualization, the system memory is partitioned into two: a Host OS visible memory and a second EGM region that is not added to the host OS. The EGM region is assigned to the VM as its system memory, with the QEMU VMA mapped through remap_pfn_range.

Currently KVM sets up the stage-2 mapping with device properties for memory that is not added to the kernel. It thus does not allow support for execution fault on such a region. Since the EGM memory is mapped through remap_pfn_range and not added to the kernel, such memory is set up without execution fault support.

This patch updates the KVM behaviour. It is an extension of the proposal [1] to make KVM determine whether a region should have NORMAL memory properties based on the VMA pgprot. The KVM behavior is changed to set a region with support for executable fault if and only if its VMA is mapped cacheable. The EGM memory is NORMAL system memory that is not added to the kernel. It is safe in terms of execution fault and is expected to display all properties of NORMAL memory. The patch enables this use case: check the QEMU VMA pgprot to determine whether it is mapped as Normal cacheable memory and, if so, allow exec fault.

Link: https://lore.kernel.org/lkml/[email protected] [1]
Signed-off-by: Ankit Agrawal <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit e38eceb https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
Acked-by: Matthew R. Ochs <[email protected]>
Acked-by: Carol L. Soto <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(backported from commit b6bd6da https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next)
[Nirmoy: s/device/s2_force_noncacheable, s/mapping_type()/FIELD_GET(PTE_ATTRINDX_MASK, pgprot_val(page_prot))]
Signed-off-by: Nirmoy Das <[email protected]>
The Extended GPU Memory (EGM) feature enables GPU access to the system memory across sockets and nodes. In this mode, the physical memory can be allocated for GPU usage from anywhere in a multi-node system.

The feature is being extended to virtualization. When EGM is enabled in the virtualization stack, the host memory is partitioned into two: one partition for the Host OS usage, and a second EGM region. The EGM region essentially becomes the system memory of the VM. The following figure shows the memory map in the virtualization environment.

|---- Sysmem ----|               |--- GPU mem ---|   VM Memory Map
|                |               |               |
|                |               |               |
|------ EGM -----|--Host Mem----| |--- GPU mem ---|  Host Memory Map

The EGM region is not available to the host memory for its usage as it is not added to the kernel. Its base HPA and the length are communicated through the DSDT entries. A linear mapping between the VM IPA and system HPA is a requirement for EGM support. The EGM region is thus assigned to a VM by mapping the QEMU VMA to a linearly increasing HPA of the EGM region using remap_pfn_range().

Introduce a new nvgrace-egm helper module to nvgrace-gpu to manage the EGM/VM region for the VM. The nvgrace-egm module handles the following:
1. Fetch the EGM memory properties (base HPA, length, proximity domain).
2. Create a char device that can be used as memory-backend-file by QEMU for the VM and implement file operations. The char device is /dev/egmX, where X is the PXM node ID of the EGM being mapped fetched in 1.
3. Zero the EGM memory on first device open().
4. Map the QEMU VMA to the EGM region using remap_pfn_range.
5. Clean up state and destroy the chardev on device unbind.

Signed-off-by: Ankit Agrawal <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit 892ac24 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
Acked-by: Matthew R. Ochs <[email protected]>
Acked-by: Carol L. Soto <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit 3a1b819 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next)
Signed-off-by: Nirmoy Das <[email protected]>
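The linear-mapping requirement in step 4 boils down to simple arithmetic: the PFN handed to remap_pfn_range() for a given VMA page offset is just the EGM base HPA's PFN plus that offset. The sketch below shows only this arithmetic in userspace; the function name and parameters are illustrative, not the module's API.

```c
#include <stdint.h>

#define PAGE_SHIFT 12  /* 4 KiB pages, as on a typical arm64 config */

/* The QEMU VMA is mapped linearly onto the EGM region, so each VMA
 * page offset selects a linearly increasing PFN from the EGM base HPA. */
static uint64_t egm_vma_pfn(uint64_t egm_base_hpa, uint64_t vma_pgoff)
{
    return (egm_base_hpa >> PAGE_SHIFT) + vma_pgoff;
}
```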
It is possible for some system memory pages on the EGM to have uncorrectable ECC errors. A list of pages known to have such errors (referred to as retired pages) is maintained by the Host UEFI. The Host UEFI populates this list in a reserved region and communicates the SPA of this region through an ACPI DSDT property.

The nvgrace-egm module is responsible for storing the list of retired page offsets to be made available to usermode processes. The module:
1. Gets the reserved memory region SPA and maps it to fetch the list of bad pages.
2. Calculates the retired page offsets in the EGM and stores them.
3. Exposes an ioctl to allow querying of the offsets.

The ioctl is called by usermode apps such as QEMU to get the retired page offsets. The usermode apps are expected to take appropriate action to communicate the list to the VM.

Signed-off-by: Ankit Agrawal <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit be54641 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
Acked-by: Matthew R. Ochs <[email protected]>
Acked-by: Carol L. Soto <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit c4cb193 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next)
Signed-off-by: Nirmoy Das <[email protected]>
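Step 2 above amounts to translating each retired-page SPA from the UEFI-provided list into a page offset within the EGM region. A minimal userspace sketch of that translation, with all names and parameters assumed for illustration rather than taken from the module:

```c
#include <stddef.h>
#include <stdint.h>

#define PAGE_SHIFT 12

/* Walk the UEFI-provided list of bad-page SPAs and record, for each one
 * that falls inside this EGM region, its page offset from the EGM base.
 * Entries outside the region are ignored. */
static size_t collect_retired_offsets(const uint64_t *bad_spa, size_t n,
                                      uint64_t egm_base, uint64_t egm_len,
                                      uint64_t *offsets)
{
    size_t count = 0;

    for (size_t i = 0; i < n; i++) {
        if (bad_spa[i] < egm_base || bad_spa[i] >= egm_base + egm_len)
            continue;  /* not in this EGM region */
        offsets[count++] = (bad_spa[i] - egm_base) >> PAGE_SHIFT;
    }
    return count;
}
```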
…errors handling

The Extended GPU Memory (EGM) is mapped through remap_pfn_range() and is not backed by struct pages. Currently, memory_failure() on such a region is unsupported in kernel MM. There is a proposal to handle such memory regions [1]. The implementation exports APIs to register a memory region and a corresponding callback function with the kernel MM. On the occurrence of a memory failure on the registered region, kernel MM calls the callback to communicate the faulting PFN.

This patch registers the EGM memory and the callback function nvgrace_egm_pfn_memory_failure with the kernel MM. On memory failure, nvgrace_egm_pfn_memory_failure is triggered and the nvgrace-egm module adds the faulting PFN to the hashtable tracking retired ECC error pages.

It also implements a fault handler in the VM ops to check whether the access is being made to a page known to have ECC errors, and returns VM_FAULT_HWPOISON in such a case.

Link: https://lore.kernel.org/all/[email protected]/ [1]
Signed-off-by: Ankit Agrawal <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(backported from commit 215f345 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next)
(koba: vmalloc.h exists)
Signed-off-by: Koba Ko <[email protected]>
Acked-by: Matthew R. Ochs <[email protected]>
Acked-by: Carol L. Soto <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit 4eba6e1 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next)
Signed-off-by: Nirmoy Das <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit 5bb23c1 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
Acked-by: Matthew R. Ochs <[email protected]>
Acked-by: Carol L. Soto <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit 7d2ea55 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next)
Signed-off-by: Nirmoy Das <[email protected]>
nvgrace-egm exposes the APIs register_egm_node and unregister_egm_node to manage EGM (Extended GPU Memory) present on the system. To allow out-of-tree drivers such as nvidia-vgpu-vfio to make use of them, move the declarations to a new nvgrace-egm.h in include.

Signed-off-by: Ankit Agrawal <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
Acked-by: Kai-Heng Feng <[email protected]>
Acked-by: Koba Ko <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit bed340f https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next)
Signed-off-by: Koba Ko <[email protected]>
Acked-by: Matthew R. Ochs <[email protected]>
Acked-by: Carol L. Soto <[email protected]>
Signed-off-by: Matthew R. Ochs <[email protected]>
(cherry picked from commit a961663 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next)
Signed-off-by: Nirmoy Das <[email protected]>
…tion Free the kmalloc'd region when the EGM is unregistered. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit fc592b9 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit f24760c https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
Move region hash initialization alongside the other region initialization statements to avoid situations where the hash table was not properly initialized. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 8021c1d https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit e1264a6 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
…rrors Update error handling within the EGM registration routine to catch and return errors to the caller. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit a57210c https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit a706ff8 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
Detect and handle a failure from the EGM registration service. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit f18eee3 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 8371b68 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
Fix source to resolve checkpatch warnings Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit c7b47b7 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit dfa0e06 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
Fix minor syntax errors from sparse. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit bbb64e6 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit fe78194 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
Return the intended errno upon a copyout fault, remove unnecessary checks following container_of pointer derivation, and use the correct macro and types for overflow checking. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 429910b https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit bda63f3 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
Use the correct macro and types for overflow checking. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit afa8f63 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit d110330 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
Ensure ACPI table reads are successful prior to using the value. Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit b2947b0 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 9258355 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
Some environments may provide a "nvidia,egm-retired-pages-data-base" but fail to populate it with a base address, leaving it NULL. Mapping this invalid value results in a synchronous exception when the region is first touched. Detect a NULL value, generate a warning to draw attention to the firmware bug, and return without mapping.

INFO: th500_ras_intr_handler: External Abort reason=1 syndrome=0x92000410 flags=0x1
[ 82.104493] Internal error: synchronous external abort: 0000000096000410 [#1] SMP
[ 82.114898] Modules linked in: nvgrace_gpu_vfio_pci(E) nvgrace_egm(E)
[ 82.257218] CPU: 0 PID: 10 Comm: kworker/0:1 Tainted: G OE 6.8.12+ #5
[ 82.265135] Hardware name: NVIDIA GH200 P5042, BIOS 24103110 20241031
[ 82.271720] Workqueue: events work_for_cpu_fn
[ 82.276180] pstate: 03400009 (nzcv daif +PAN -UAO +TCO +DIT -SSBS BTYPE=--)
[ 82.283298] pc : register_egm_node+0x2cc/0x440 [nvgrace_egm]
[ 82.289087] lr : register_egm_node+0x2c4/0x440 [nvgrace_egm]
[ 82.294872] sp : ffff8000802ebc30
[ 82.298254] x29: ffff8000802ebc60 x28: 00000000000000ff x27: 0000000000000000
[ 82.305550] x26: ffff000087a320c8 x25: ffff0000a5700000 x24: ffff000087a32000
[ 82.312846] x23: ffffa77cd758e368 x22: 0000000000000000 x21: ffffa77cd758c640
[ 82.320141] x20: ffffa77cd758e170 x19: ffff800081e7d000 x18: ffff800080293038
[ 82.327437] x17: 0000000000000000 x16: 0000000000000000 x15: 0000000000000000
[ 82.334732] x14: 0000000000000000 x13: 65203a65646f6e5f x12: 0000000000000000
[ 82.342027] x11: 0000000000000000 x10: 0000000000000000 x9 : 0000000000000000
[ 82.349322] x8 : 0000000000000000 x7 : 0000000000000000 x6 : 0000000000000000
[ 82.356618] x5 : 0000000000000000 x4 : 0000000000000000 x3 : 0000000000000000
[ 82.363913] x2 : 0000000000000000 x1 : 0000000000000000 x0 : ffff800081e7d000
[ 82.371210] Call trace:
[ 82.373705] register_egm_node+0x2cc/0x440 [nvgrace_egm]
[ 82.379135] nvgrace_gpu_probe+0x2ac/0x528 [nvgrace_gpu_vfio_pci]
[ 82.385366] local_pci_probe+0x4c/0xe0
[ 82.389198] work_for_cpu_fn+0x28/0x58
[ 82.393026] process_one_work+0x168/0x3f0
[ 82.397123] worker_thread+0x360/0x480
[ 82.400952] kthread+0x11c/0x128
[ 82.404248] ret_from_fork+0x10/0x20
[ 82.407906] Code: d2820001 940002b3 aa0003f3 b4fffac0 (f9400017)
[ 82.414134] ---[ end trace 0000000000000000 ]---

Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Kai-Heng Feng <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Koba Ko <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 7ba2930 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.8-next) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 349fb1c https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
In an effort to simplify the programming model, use a symmetrical model for the EGM registration APIs. This avoids the caller needing to keep a cookie or even have knowledge of whether EGM is supported. Update the EGM unregistration API to use the PCI device as its parameter. Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit d8903ec https://github.com/nvmochs/NV-Kernels/tree/vegm_01232025) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 5839fc5 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
…egions GB200 systems could have multiple GPUs associated with an EGM region. For proper EGM functionality, the host topology in terms of GPU affinity has to be replicated in the VM. Hence the EGM region structure must track the GPU devices belonging to the same socket. On device probe, the device's pci_dev struct is added to the appropriate EGM region's linked list. Similarly, on device remove, the pci_dev struct for the GPU is removed from the EGM region. Signed-off-by: Ankit Agrawal <[email protected]> Ref: sj24: /home/nvidia/ankita/kernel_patches/0001_vfio_nvgrace-egm_track_GPUs_associated_with_the_EGM_regions.patch (koba: Enhance error handling, Remove egm_node from unregister_egm_node and move destroy_egm_chardev a little forward) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 0222c35 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
To replicate the host EGM topology in the VM in terms of GPU affinity, userspace needs to be aware of which GPUs belong to the same socket as the EGM region. Expose the list of GPUs associated with an EGM region through sysfs. The list can be queried from /sys/devices/virtual/egm/egmX/gpu_devices. Signed-off-by: Ankit Agrawal <[email protected]> Ref: sj24: /home/nvidia/ankita/kernel_patches/0002_vfio_nvgrace-egm_list_gpus_through_sysfs.patch (koba: Enhance error handling for sysfs_create_group) Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit fec2356 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
To allocate the EGM, userspace needs to know its size. Currently, there is no easy way for userspace to determine that. Make nvgrace-egm expose the size through sysfs so that it can be queried from /sys/devices/virtual/egm/egmX/egm_size. Signed-off-by: Ankit Agrawal <[email protected]> Ref: sj24: /home/nvidia/ankita/kernel_patches/0003_vfio_nvgrace-egm_expose_the_egm_size_through_sysfs.patch Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit dcdcef2 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
… allocations Add missing null pointer checks after vzalloc() calls in the NVIDIA Grace GPU driver's EGM (Extended GPU Memory) handling code. This prevents potential null pointer dereferences in the memory failure handling and bad page fetching functions, providing proper error handling for allocation failures. Signed-off-by: Koba Ko <[email protected]> Acked-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit 63127e2 https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
Add CONFIG_NVGRACE_EGM with policy 'm' for arm64 architecture. Signed-off-by: Nirmoy Das <[email protected]>
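In Ubuntu kernel trees, config policy like this is typically expressed in the annotations file; a hedged sketch of what such an entry could look like (the exact file path and note text vary by series and are assumptions here):

```
# debian.master/config/annotations (path illustrative; varies by series)
CONFIG_NVGRACE_EGM    policy<{'arm64': 'm'}>
```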
On platforms without the MIG HW bug (e.g. Grace-Blackwell) there is no requirement to create the resmem region. Accordingly, this region is not configured on these platforms, which leads to the following print when the device is closed: resource: Trying to free nonexistent resource <0x0000000000000000-0x000000000000ffff> Avoid calling unregister_pfn_address_space for resmem when the region is not being used. Fixes: 2d21b7b ("vfio/nvgrace-gpu: register device memory for poison handling") Signed-off-by: Matthew R. Ochs <[email protected]> Acked-by: Carol L. Soto <[email protected]> Acked-by: Nirmoy Das <[email protected]> Signed-off-by: Matthew R. Ochs <[email protected]> (cherry picked from commit bd0187d https://github.com/NVIDIA/NV-Kernels/tree/24.04_linux-nvidia-adv-6.11-next) Signed-off-by: Nirmoy Das <[email protected]>
Commit 222675c ("irqchip: Have CONFIG_IRQ_MSI_IOMMU be selected by irqchips that need it") changed the behavior of CONFIG_IRQ_MSI_IOMMU to a dynamic selection, so it might not always be needed by amd64 builds. Signed-off-by: Matthew R. Ochs <[email protected]>
…nce count imbalance syzbot reported a uaf in software_node_notify_remove. [1] When any of the two sysfs_create_link() in software_node_notify() fails, the swnode->kobj reference count will not increase normally, which will cause swnode to be released incorrectly due to the imbalance of kobj reference count when executing software_node_notify_remove(). Increase the reference count of kobj before creating the link to avoid uaf. [1] BUG: KASAN: slab-use-after-free in software_node_notify_remove+0x1bc/0x1c0 drivers/base/swnode.c:1108 Read of size 1 at addr ffff888033c08908 by task syz-executor105/5844 Freed by task 5844: software_node_notify_remove+0x159/0x1c0 drivers/base/swnode.c:1106 device_platform_notify_remove drivers/base/core.c:2387 [inline] Fixes: 9eb5920 ("iommufd/selftest: Add set_dev_pasid in mock iommu") Reported-by: [email protected] Closes: https://syzkaller.appspot.com/bug?extid=2ff22910687ee0dfd48e Tested-by: [email protected] Signed-off-by: Lizhi Xu <[email protected]> Reviewed-by: Sakari Ailus <[email protected]> Reviewed-by: Andy Shevchenko <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Greg Kroah-Hartman <[email protected]> (cherry picked from commit bc2c464) Signed-off-by: Nirmoy Das <[email protected]>
Canonical pointed out some missing fix commits:
And some commit message nits:
Confirmed all of these are addressed on the updated branch.
Merged, now present in https://git.launchpad.net/~canonical-kernel/ubuntu/+source/linux-nvidia/+git/noble/log/?h=nvidia-6.14-next Closing PR. |
When smc91x.c is built with PREEMPT_RT, the following splat occurs in FVP_RevC:

[ 13.055000] smc91x LNRO0003:00 eth0: link up, 10Mbps, half-duplex, lpa 0x0000
[ 13.062137] BUG: workqueue leaked atomic, lock or RCU: kworker/2:1[106]
[ 13.062137] preempt=0x00000000 lock=0->0 RCU=0->1 workfn=mld_ifc_work
[ 13.062266] C ** replaying previous printk message **
[ 13.062266] CPU: 2 UID: 0 PID: 106 Comm: kworker/2:1 Not tainted 6.18.0-dirty #179 PREEMPT_{RT,(full)}
[ 13.062353] Hardware name: , BIOS
[ 13.062382] Workqueue: mld mld_ifc_work
[ 13.062469] Call trace:
[ 13.062494] show_stack+0x24/0x40 (C)
[ 13.062602] __dump_stack+0x28/0x48
[ 13.062710] dump_stack_lvl+0x7c/0xb0
[ 13.062818] dump_stack+0x18/0x34
[ 13.062926] process_scheduled_works+0x294/0x450
[ 13.063043] worker_thread+0x260/0x3d8
[ 13.063124] kthread+0x1c4/0x228
[ 13.063235] ret_from_fork+0x10/0x20

This happens because smc_special_trylock() disables IRQs even on PREEMPT_RT, but smc_special_unlock() does not restore IRQs on PREEMPT_RT. The reason is that smc_special_unlock() calls spin_unlock_irqrestore(), and rcu_read_unlock_bh() in __dev_queue_xmit() cannot invoke rcu_read_unlock() through __local_bh_enable_ip() when current->softirq_disable_cnt becomes zero. To address this issue, replace smc_special_trylock() with spin_trylock_irqsave(). Fixes: 342a932 ("locking/spinlock: Provide RT variant header: <linux/spinlock_rt.h>") Signed-off-by: Yeoreum Yun <[email protected]> Reviewed-by: Simon Horman <[email protected]> Link: https://patch.msgid.link/[email protected] Signed-off-by: Paolo Abeni <[email protected]>
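The shape of the fix can be sketched as a kernel-style before/after (not standalone-buildable; variable names are illustrative and based on the commit text, not the exact smc91x source):

```
/* before: hand-rolled trylock; on PREEMPT_RT the IRQ disable is not
 * paired with a matching restore on the unlock side */
local_irq_save(flags);
if (!spin_trylock(&lp->lock)) {
        local_irq_restore(flags);
        /* ...busy path... */
}

/* after: spin_trylock_irqsave() keeps the lock acquisition and the
 * IRQ/flags handling together, so lock and unlock stay balanced on
 * both PREEMPT_RT and non-RT kernels */
if (!spin_trylock_irqsave(&lp->lock, flags)) {
        /* ...busy path... */
}
/* ... */
spin_unlock_irqrestore(&lp->lock, flags);
```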
This PR backports/cherry-picks patches for upstream vEVENTQ + HW QUEUE and OOT vEGM
This is a replica of #167 but based on the HWE branch instead.