-
Notifications
You must be signed in to change notification settings - Fork 49
fix: [linux-nvidia-6.6] Resolve IOMMU passthrough conflicts with NVIDIA UVM on ARM64 causing NEMO/PyTorch failures #202
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: linux-nvidia-6.6
Are you sure you want to change the base?
Conversation
Instead of passing a naked __le16 * around to represent a STE wrap it in a "struct arm_smmu_ste" with an array of the correct size. This makes it much clearer which functions will comprise the "STE API". Reviewed-by: Moritz Fischer <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Eric Auger <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Tested-by: Nicolin Chen <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 57b8904) Signed-off-by: Koba Ko <[email protected]>
arm_smmu_s1_cfg (and by extension arm_smmu_domain) owns both a CD table and the CD inserted into that table's non-pasid CD entry. This limits arm_smmu_domain's ability to represent non-pasid domains, where multiple domains need to be inserted into a common CD table. Rather than describing an STE entry (which may have multiple domains installed into it with PASID), a domain should describe a single CD entry instead. This is precisely the role of arm_smmu_ctx_desc. A subsequent commit will also move the CD table outside of arm_smmu_domain. Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Signed-off-by: Michael Shavit <[email protected]> Tested-by: Nicolin Chen <[email protected]> Link: https://lore.kernel.org/r/20230915211705.v8.1.I67ab103c18d882aedc8a08985af1fba70bca084e@changeid Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 987a878) Signed-off-by: Koba Ko <[email protected]>
Remove struct arm_smmu_s1_cfg. This is really just a CD table with a bit of extra information. Move other attributes of the CD table that were held there into the existing CD table structure, struct arm_smmu_ctx_desc_cfg, and replace all usages of arm_smmu_s1_cfg with arm_smmu_ctx_desc_cfg. For clarity, use the name "cd_table" for the variables pointing to arm_smmu_ctx_desc_cfg in the new code instead of cdcfg. A later patch will make this fully consistent. Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Signed-off-by: Michael Shavit <[email protected]> Tested-by: Nicolin Chen <[email protected]> Link: https://lore.kernel.org/r/20230915211705.v8.2.I1ef1ed19d7786c8176a0d05820c869e650c8d68f@changeid Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 1f85888) Signed-off-by: Koba Ko <[email protected]>
This is slighlty cleaner: arm_smmu_ctx_desc_cfg is initialized in a single function instead of having pieces set ahead-of time by its caller. Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Signed-off-by: Michael Shavit <[email protected]> Tested-by: Nicolin Chen <[email protected]> Link: https://lore.kernel.org/r/20230915211705.v8.3.I875254464d044a8ce8b3a2ad6beb655a4a006456@changeid Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit e3aad74) Signed-off-by: Koba Ko <[email protected]>
A domain can be attached to multiple masters with different master->stall_enabled values. The stall bit of a CD entry should follow master->stall_enabled and has an inverse relationship with the STE.S1STALLD bit. The stall_enabled bit does not depend on any property of the domain, so move it out of the arm_smmu_domain struct. Move it to the CD table struct so that it can fully describe how CD entries should be written to it. Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Signed-off-by: Michael Shavit <[email protected]> Tested-by: Nicolin Chen <[email protected]> Link: https://lore.kernel.org/r/20230915211705.v8.4.I5aa89c849228794a64146cfe86df21fb71629384@changeid Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 1228cc5) Signed-off-by: Koba Ko <[email protected]>
Update arm_smmu_write_ctx_desc and downstream functions to operate on a master instead of an smmu domain. We expect arm_smmu_write_ctx_desc() to only be called to write a CD entry into a CD table owned by the master. Under the hood, arm_smmu_write_ctx_desc still fetches the CD table from the domain that is attached to the master, but a subsequent commit will move that table's ownership to the master. Note that this change isn't a nop refactor since SVA will call arm_smmu_write_ctx_desc in a loop for every master the domain is attached to despite the fact that they all share the same CD table. This loop may look weird but becomes necessary when the CD table becomes per-master in a subsequent commit. Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Signed-off-by: Michael Shavit <[email protected]> Tested-by: Nicolin Chen <[email protected]> Link: https://lore.kernel.org/r/20230915211705.v8.5.I219054a6cf538df5bb22f4ada2d9933155d6058c@changeid Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 2450314) Signed-off-by: Koba Ko <[email protected]>
With this change, each master will now own its own CD table instead of sharing one with other masters attached to the same domain. Attaching a stage 1 domain installs CD entries into the master's CD table. SVA writes its CD entries into each master's CD table if the domain is shared across masters. Also add the device to the devices list before writing the CD to the table so that SVA will know that the CD needs to be re-written to this device's CD table as well if it decides to update the CD's ASID concurrently with this function. Tested-by: Nicolin Chen <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Signed-off-by: Michael Shavit <[email protected]> Link: https://lore.kernel.org/r/20230915211705.v8.6.Ice063dcf87d1b777a72e008d9e3406d2bcf6d876@changeid Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 10e4968) Signed-off-by: Koba Ko <[email protected]>
Remove unused master parameter now that the CD table is allocated elsewhere. Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Signed-off-by: Michael Shavit <[email protected]> Tested-by: Nicolin Chen <[email protected]> Link: https://lore.kernel.org/r/20230915211705.v8.7.Iff18df41564b9df82bf40b3ec7af26b87f08ef6e@changeid Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 5e14313) Signed-off-by: Koba Ko <[email protected]>
Update the comment to reflect the fact that the STE is not always installed. arm_smmu_domain_finalise_s1 intentionnaly calls arm_smmu_write_ctx_desc while the STE is not installed. Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Signed-off-by: Michael Shavit <[email protected]> Tested-by: Nicolin Chen <[email protected]> Link: https://lore.kernel.org/r/20230915211705.v8.8.I7a8beb615e2520ad395d96df94b9ab9708ee0d9c@changeid Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 6032f58) Signed-off-by: Koba Ko <[email protected]>
cdcfg is a confusing name, especially given other variables with the cfg suffix in this driver. cd_table more clearly describes what is being operated on. Tested-by: Nicolin Chen <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Signed-off-by: Michael Shavit <[email protected]> Link: https://lore.kernel.org/r/20230915211705.v8.9.I5ee79793b444ddb933e8bc1eb7b77e728d7f8350@changeid Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 475918e) Signed-off-by: Koba Ko <[email protected]>
The only caller is arm_smmu_install_ste_for_dev() which never has a NULL master. Remove the confusing if. Reviewed-by: Moritz Fischer <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Eric Auger <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Tested-by: Nicolin Chen <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 12a48fe) Signed-off-by: Koba Ko <[email protected]>
This allows a driver to set a global static to an IDENTITY domain and the core code will automatically use it whenever an IDENTITY domain is requested. By making it always available it means the IDENTITY can be used in error handling paths to force the iommu driver into a known state. Devices implementing global static identity domains should avoid failing their attach_dev ops. To make global static domains simpler allow drivers to omit their free function and update the iommufd selftest. Convert rockchip to use the new mechanism. Tested-by: Steven Price <[email protected]> Tested-by: Marek Szyprowski <[email protected]> Tested-by: Nicolin Chen <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit df31b29) Signed-off-by: Koba Ko <[email protected]>
This is used when the iommu driver is taking control of the dma_ops, currently only on S390 and power spapr. It is designed to preserve the original ops->detach_dev() semantic that these S390 was built around. Provide an opaque domain type and a 'default_domain' ops value that allows the driver to trivially force any single domain as the default domain. Update iommufd selftest to use this instead of set_platform_dma_ops Reviewed-by: Lu Baolu <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit 1c68cbc) Signed-off-by: Koba Ko <[email protected]>
Introduce a new iommu_domain op to create domains owned by userspace, e.g. through IOMMUFD. These domains have a few different properties compares to kernel owned domains: - They may be PAGING domains, but created with special parameters. For instance aperture size changes/number of levels, different IOPTE formats, or other things necessary to make a vIOMMU work - We have to track all the memory allocations with GFP_KERNEL_ACCOUNT to make the cgroup sandbox stronger - Device-specialty domains, such as NESTED domains can be created by IOMMUFD. The new op clearly says the domain is being created by IOMMUFD, that the domain is intended for userspace use, and it provides a way to pass user flags or a driver specific uAPI structure to customize the created domain to exactly what the vIOMMU userspace driver requires. iommu drivers that cannot support VFIO/IOMMUFD should not support this op. This includes any driver that cannot provide a fully functional PAGING domain. This new op for now is only supposed to be used by IOMMUFD, hence no wrapper for it. IOMMUFD would call the callback directly. As for domain free, IOMMUFD would use iommu_domain_free(). Link: https://lore.kernel.org/r/[email protected] Suggested-by: Jason Gunthorpe <[email protected]> Signed-off-by: Lu Baolu <[email protected]> Co-developed-by: Nicolin Chen <[email protected]> Signed-off-by: Nicolin Chen <[email protected]> Signed-off-by: Yi Liu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 909f4ab) Signed-off-by: Koba Ko <[email protected]>
Make IOMMUFD use iommu_domain_alloc_user() by default for iommu_domain creation. IOMMUFD needs to support iommu_domain allocation with parameters from userspace in nested support, and a driver is expected to implement everything under this op. If the iommu driver doesn't provide domain_alloc_user callback then IOMMUFD falls back to use iommu_domain_alloc() with an UNMANAGED type if possible. Link: https://lore.kernel.org/r/[email protected] Suggested-by: Jason Gunthorpe <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Co-developed-by: Nicolin Chen <[email protected]> Signed-off-by: Nicolin Chen <[email protected]> Signed-off-by: Yi Liu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 7975b72) Signed-off-by: Koba Ko <[email protected]>
Extends iommufd_hw_pagetable_alloc() to accept user flags, the uAPI will provide the flags. Link: https://lore.kernel.org/r/[email protected] Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Yi Liu <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 89d6387) Signed-off-by: Koba Ko <[email protected]>
Extend IOMMU_HWPT_ALLOC to allocate domains to be used as parent (stage-2) in nested translation. Add IOMMU_HWPT_ALLOC_NEST_PARENT to the uAPI. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Yi Liu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 4ff5421) Signed-off-by: Koba Ko <[email protected]>
Add mock_domain_alloc_user() and a new test case for IOMMU_HWPT_ALLOC_NEST_PARENT. Link: https://lore.kernel.org/r/[email protected] Co-developed-by: Nicolin Chen <[email protected]> Signed-off-by: Nicolin Chen <[email protected]> Signed-off-by: Yi Liu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 4086636) Signed-off-by: Koba Ko <[email protected]>
Add the domain_alloc_user() op implementation. It supports allocating domains to be used as parent under nested translation. Unlike other drivers VT-D uses only a single page table format so it only needs to check if the HW can support nesting. Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Yi Liu <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit c97d1b2) Signed-off-by: Koba Ko <[email protected]>
arighi
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a pretty large patch set (152 commits), but if it clearly fixes a problem, then I guess we don't have many chances...
I'm a bit worried about the risk of conflicts that we may have in the future by applying this big patch set, even if this change is pretty much contained in the iommu stuff, so hopefully it won't be too much pain to maintain.
A couple of comments:
- Is there a simple test case that we can use to validate that this patch set fixes the problem (you may have already mentioned it, but I can't find it)?
- for the top commit (
arm64: configs: Add kernel configuration for 6.6.63 with 64k pages) what about using a similar convention that we were using in Canonical to quickly identify our commits, like adding anNVIDIA:prefix?
|
Did you intend to have more information in the PR summary? (It looks like maybe some of the content was lost) Can you provide information on how this was tested?
While the performance tuning for Grace does state that some applications might have some performance benefits when using this setting, the generic Linux OS guide points out that a patch from 6.11 is required for to enable it. Hence this PR, where you are doing a subsystem backport from 6.11 to 6.6 to obtain that patch. I don’t have a conern with the individual patches that have been picked as it appears they all picked cleanly. I did manually review the ones that did not match upstream exactly and they confirmed they were only flagged for context differences (likely due to some out of order pick or other unrelated patches being absent). However, I have the same overall concern as @arighi. Before taking this outright, I think we should strongly consider the long-term implications vs. benefit of supporting this feature. |
fe18048 to
78908fd
Compare
|
I have a simple script to find upstream fixes and here's the output: |
|
Is this 6.6 kernel ARM64 only? Or should AMD64 be covered as well? |
This adds the scalable mode check before allocating the nested parent domain as checking nested capability is not enough. User may turn off scalable mode which also means no nested support even if the hardware supports it. Fixes: c97d1b2 ("iommu/vt-d: Add domain_alloc_user op") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Yi Liu <[email protected]> Reviewed-by: Lu Baolu <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit a2cdecd) Signed-off-by: Koba Ko <[email protected]>
The IOMMU_HWPT_ALLOC_NEST_PARENT flag is used to allocate a HWPT. Though a HWPT holds a domain in the core structure, it is still quite confusing to describe it using "domain" in the uAPI kdoc. Correct it to "HWPT". Fixes: 4ff5421 ("iommufd: Support allocating nested parent domain") Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Nicolin Chen <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit b5f9e63) Signed-off-by: Koba Ko <[email protected]>
A cleared entry is all 0's. Make arm_smmu_clear_cd() do this sequence. If we are clearing an entry and for some reason it is not already allocated in the CD table then something has gone wrong. Remove case (5) from arm_smmu_write_ctx_desc(). Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Moritz Fischer <[email protected]> Reviewed-by: Mostafa Saleh <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit af8f0b8) Signed-off-by: Koba Ko <[email protected]>
Only the attach callers can perform an allocation for the CD table entry, the other callers must not do so, they do not have the correct locking and they cannot sleep. Split up the functions so this is clear. arm_smmu_get_cd_ptr() will return pointer to a CD table entry without doing any kind of allocation. arm_smmu_alloc_cd_ptr() will allocate the table and any required leaf. A following patch will add lockdep assertions to arm_smmu_alloc_cd_ptr() once the restructuring is completed and arm_smmu_alloc_cd_ptr() is never called in the wrong context. Tested-by: Nicolin Chen <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit b2f4c0f) Signed-off-by: Koba Ko <[email protected]>
Avoid arm_smmu_attach_dev() having to undo the changes to the smmu_domain->devices list, acquire the cdptr earlier so we don't need to handle that error. Now there is a clear break in arm_smmu_attach_dev() where all the prep-work has been done non-disruptively and we commit to making the HW change, which cannot fail. This completes transforming arm_smmu_attach_dev() so that it does not disturb the HW if it fails. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Mostafa Saleh <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 13abe4f) Signed-off-by: Koba Ko <[email protected]>
Pull all the calculations for building the CD table entry for a mmu_struct into arm_smmu_make_sva_cd(). Call it in the two places installing the SVA CD table entry. Open code the last caller of arm_smmu_update_ctx_desc_devices() and remove the function. Remove arm_smmu_write_ctx_desc() since all callers are gone. Add the locking assertions to arm_smmu_alloc_cd_ptr() since arm_smmu_update_ctx_desc_devices() was the last problematic caller. Remove quiet_cd since all users are gone, arm_smmu_make_sva_cd() creates the same value. The behavior of quiet_cd changes slightly, the old implementation edited the CD in place to set CTXDESC_CD_0_TCR_EPD0 assuming it was a SVA CD entry. This version generates a full CD entry with a 0 TTB0 and relies on arm_smmu_write_cd_entry() to install it hitlessly. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 7b87c93) Signed-off-by: Koba Ko <[email protected]>
Half the code was living in arm_smmu_domain_finalise_s1(), just move it here and take the values directly from the pgtbl_ops instead of storing copies. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Mostafa Saleh <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 04905c1) Signed-off-by: Koba Ko <[email protected]>
Add tests for some of the more common STE update operations that we expect to see, as well as some artificial STE updates to test the edges of arm_smmu_write_entry. These also serve as a record of which common operation is expected to be hitless, and how many syncs they require. arm_smmu_write_entry implements a generic algorithm that updates an STE/CD to any other abritrary STE/CD configuration. The update requires a sequence of write+sync operations with some invariants that must be held true after each sync. arm_smmu_write_entry lends itself well to unit-testing since the function's interaction with the STE/CD is already abstracted by input callbacks that we can hook to introspect into the sequence of operations. We can use these hooks to guarantee that invariants are held throughout the entire update operation. Link: https://lore.kernel.org/r/[email protected] Tested-by: Nicolin Chen <[email protected]> Signed-off-by: Michael Shavit <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 56e1a4c) Signed-off-by: Koba Ko <[email protected]>
Static checker is complaining about the ASID possibly set uninitialized. This only happens in case of error and this value would be ignored anyway. A simple fix would be just to initialize the local variable to zero, this path will only be reached on the first attach to a domain where the CD is already initialized to zero. This avoids having to bloat the function with an error path. Closes: https://lore.kernel.org/linux-iommu/[email protected]/T/#u Reported-by: Dan Carpenter <[email protected]> Signed-off-by: Mostafa Saleh <[email protected]> Fixes: 04905c1 ("iommu/arm-smmu-v3: Build the whole CD in arm_smmu_make_s1_cd()") Reviewed-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit d3867e7) Signed-off-by: Koba Ko <[email protected]>
This allows the driver the receive the mm and always a device during allocation. Later patches need this to properly setup the notifier when the domain is first allocated. Remove ops->domain_alloc() as SVA was the only remaining purpose. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 678d79b) Signed-off-by: Koba Ko <[email protected]>
Add arm_smmu_set_pasid()/arm_smmu_remove_pasid() which are to be used by callers that already constructed the arm_smmu_cd they wish to program. These functions will encapsulate the shared logic to setup a CD entry that will be shared by SVA and S1 domain cases. Prior fixes had already moved most of this logic up into __arm_smmu_sva_bind(), move it to it's final home. Following patches will relieve some of the remaining SVA restrictions: - The RID domain is a S1 domain and has already setup the STE to point to the CD table - The programmed PASID is the mm_get_enqcmd_pasid() - Nothing changes while SVA is running (sva_enable) SVA invalidation will still iterate over the S1 domain's master list, later patches will resolve that. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 85f2fb6) Signed-off-by: Koba Ko <[email protected]>
The next patch will need to store the same master twice (with different SSIDs), so allocate memory for each list element. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit ad10dce) Signed-off-by: Koba Ko <[email protected]>
The core code allows the domain to be changed on the fly without a forced stop in BLOCKED/IDENTITY. In this flow the driver should just continually maintain the ATS with no change while the STE is updated. ATS relies on a linked list smmu_domain->devices to keep track of which masters have the domain programmed, but this list is also used by arm_smmu_share_asid(), unrelated to ats. Create two new functions to encapsulate this combined logic: arm_smmu_attach_prepare() <caller generates and sets the STE> arm_smmu_attach_commit() The two functions can sequence both enabling ATS and disabling across the STE store. Have every update of the STE use this sequence. Installing a S1/S2 domain always enables the ATS if the PCIe device supports it. The enable flow is now ordered differently to allow it to be hitless: 1) Add the master to the new smmu_domain->devices list 2) Program the STE 3) Enable ATS at PCIe 4) Remove the master from the old smmu_domain This flow ensures that invalidations to either domain will generate an ATC invalidation to the device while the STE is being switched. Thus we don't need to turn off the ATS anymore for correctness. The disable flow is the reverse: 1) Disable ATS at PCIe 2) Program the STE 3) Invalidate the ATC 4) Remove the master from the old smmu_domain Move the nr_ats_masters adjustments to be close to the list manipulations. It is a count of the number of ATS enabled masters currently in the list. This is stricly before and after the STE/CD are revised, and done under the list's spin_lock. This is part of the bigger picture to allow changing the RID domain while a PASID is in use. If a SVA PASID is relying on ATS to function then changing the RID domain cannot just temporarily toggle ATS off without also wrecking the SVA PASID. The new infrastructure here is organized so that the PASID attach/detach flows will make use of it as well in following patches. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 7497f42) Signed-off-by: Koba Ko <[email protected]>
Prepare to allow a S1 domain to be attached to a PASID as well. Keep track of the SSID the domain is using on each master in the arm_smmu_master_domain. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 64efb3d) Signed-off-by: Koba Ko <[email protected]>
We no longer need a master->sva_enable to control what attaches are allowed. Instead we can tell if the attach is legal based on the current configuration of the master. Keep track of the number of valid CD entries for SSID's in the cd_table and if the cd_table has been installed in the STE directly so we know what the configuration is. The attach logic is then made into: - SVA bind, check if the CD is installed - RID attach of S2, block if SSIDs are used - RID attach of IDENTITY/BLOCKING, block if SSIDs are used arm_smmu_set_pasid() is already checking if it is possible to setup a CD entry, at this patch it means the RID path already set a STE pointing at the CD table. Tested-by: Nicolin Chen <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit be7c90d) Signed-off-by: Koba Ko <[email protected]>
Allow creating and managing arm_smmu_mater_domain's with a non-zero SSID through the arm_smmu_attach_*() family of functions. This triggers ATC invalidation for the correct SSID in PASID cases and tracks the per-attachment SSID in the struct arm_smmu_master_domain. Generalize arm_smmu_attach_remove() to be able to remove SSID's as well by ensuring the ATC for the PASID is flushed properly. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 1d5f34f) Signed-off-by: Koba Ko <[email protected]>
Currently the SVA domain is a naked struct iommu_domain, allocate a struct arm_smmu_domain instead. This is necessary to be able to use the struct arm_master_domain mechanism. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit d7b2d2b) Signed-off-by: Koba Ko <[email protected]>
Fill in the smmu_domain->devices list in the new struct arm_smmu_domain that SVA allocates. Keep track of every SSID and master that is using the domain reusing the logic for the RID attach. This is the first step to making the SVA invalidation follow the same design as S1/S2 invalidation. At present nothing will read this list. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 49db2ed) Signed-off-by: Koba Ko <[email protected]>
This removes all the notifier de-duplication logic in the driver and relies on the core code to de-duplicate and allocate only one SVA domain per mm per smmu instance. This naturally gives a 1:1 relationship between SVA domain and mmu notifier. It is a significant simplication of the flow, as we end up with a single struct arm_smmu_domain for each MM and the invalidation can then be shifted to properly use the masters list like S1/S2 do. Remove all of the previous mmu_notifier, bond, shared cd, and cd refcount logic entirely. The logic here is tightly wound together with the unusued BTM support. Since the BTM logic requires holding all the iommu_domains in a global ASID xarray it conflicts with the design to have a single SVA domain per PASID, as multiple SMMU instances will need to have different domains. Following patches resolve this by making the ASID xarray per-instance instead of global. However, converting the BTM code over to this methodology requires many changes. Thus, since ARM_SMMU_FEAT_BTM is never enabled, remove the parts of the BTM support for ASID sharing that interact with SVA as well. A followup series is already working on fully enabling the BTM support, that requires iommufd's VIOMMU feature to bring in the KVM's VMID as well. It will come with an already written patch to bring back the ASID sharing using a per-instance ASID xarray. https://lore.kernel.org/linux-iommu/[email protected]/ https://lore.kernel.org/linux-iommu/[email protected]/ Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Michael Shavit <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit d38c28d) Signed-off-by: Koba Ko <[email protected]>
The HW supports this, use the S1DSS bits to configure the behavior of SSID=0 which is the RID's translation. If SSID's are currently being used in the CD table then just update the S1DSS bits in the STE, remove the master_domain and leave ATS alone. For iommufd the driver design has a small problem that all the unused CD table entries are set with V=0 which will generate an event if VFIO userspace tries to use the CD entry. This patch extends this problem to include the RID as well if PASID is being used. For BLOCKED with used PASIDs the F_STREAM_DISABLED (STRTAB_STE_1_S1DSS_TERMINATE) event is generated on untagged traffic and a substream CD table entry with V=0 (removed pasid) will generate C_BAD_CD. Arguably there is no advantage to using S1DSS over the CD entry 0 with V=0. As we don't yet support PASID in iommufd this is a problem to resolve later, possibly by using EPD0 for unused CD table entries instead of V=0, and not using S1DSS for BLOCKED. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit ce26ea9) Signed-off-by: Koba Ko <[email protected]>
S1DSS brings in quite a few new transition pairs that are interesting. Test to/from S1DSS_BYPASS <-> S1DSS_SSID0, and BYPASS <-> S1DSS_SSID0. Test a contrived non-hitless flow to make sure that the logic works. Tested-by: Nicolin Chen <[email protected]> Signed-off-by: Michael Shavit <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 3b5302c) Signed-off-by: Koba Ko <[email protected]>
If the STE doesn't point to the CD table we can upgrade it by reprogramming the STE with the appropriate S1DSS. We may also need to turn on ATS at the same time. Keep track if the installed STE is pointing at the cd_table and the ATS state to trigger this path. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit 8ee9175) Signed-off-by: Koba Ko <[email protected]>
The SVA cleanup made the SSID logic entirely general so all we need to do is call it with the correct cd table entry for a S1 domain. This is slightly tricky because of the ASID and how the locking works, the simple fix is to just update the ASID once we get the right locks. Tested-by: Nicolin Chen <[email protected]> Tested-by: Shameer Kolothum <[email protected]> Reviewed-by: Nicolin Chen <[email protected]> Reviewed-by: Jerry Snitselaar <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Will Deacon <[email protected]> (cherry picked from commit f3b273b) Signed-off-by: Koba Ko <[email protected]>
Add a comprehensive kernel configuration file for Linux 6.6.63 targeting ARM64 architecture with 64k page size. This configuration enables essential features for NVIDIA ARM64 platforms including: - 64k page size configuration (CONFIG_ARM64_64K_PAGES) - Full ARM64 architecture support with NUMA balancing - BPF subsystem with JIT compilation - Control groups (cgroups) support for resource management - Memory controller with swap support - CPU isolation and RCU subsystem configuration - Process accounting and task statistics - Standard kernel debugging and security features The configuration is built with GCC 12.2.0 and includes the build salt "6.6.y.bsk.z-64k-arm64" to identify this specific kernel build variant. This configuration serves as the baseline for NVIDIA ARM64 kernel builds on the linux-nvidia-6.6 branch. Signed-off-by: Koba Ko <[email protected]>
This function returns NULL on errors, not ERR_PTR. Fixes: 1c68cbc ("iommu: Add IOMMU_DOMAIN_PLATFORM") Reported-by: Dan Carpenter <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit b85b4f3) Signed-off-by: Koba Ko <[email protected]>
In the review for iommu_copy_struct_to_user() helper, Matt pointed out that a NULL pointer should be rejected prior to dereferencing it: https://lore.kernel.org/all/[email protected] And Alok pointed out a typo at the same time: https://lore.kernel.org/all/[email protected] Since both issues were copied from iommu_copy_struct_from_user(), fix them first in the current header. Fixes: e9d36c0 ("iommu: Add iommu_copy_struct_from_user helper") Cc: [email protected] Signed-off-by: Nicolin Chen <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Acked-by: Alok Tiwari <[email protected]> Reviewed-by: Matthew R. Ochs <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit 30a3f2f) Signed-off-by: Koba Ko <[email protected]>
Commit bcb81ac ("iommu: Get DT/ACPI parsing into the proper probe path") changed the sequence of probing the SYSMMU controller devices and calls to arm_iommu_attach_device(), what results in resuming SYSMMU controller earlier, when it is still set to IDENTITY mapping. Such change revealed the bug in IDENTITY handling in the exynos-iommu driver. When SYSMMU controller is set to IDENTITY mapping, data->domain is NULL, so adjust checks in suspend & resume callbacks to handle this case correctly. Fixes: b3d1496 ("iommu/exynos: Implement an IDENTITY domain") Signed-off-by: Marek Szyprowski <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit 99deffc) Signed-off-by: Koba Ko <[email protected]>
It turns out kconfig has problems ensuring the SMMU module and the KUNIT module are consistently y/m to allow linking. It will permit KUNIT to be a module while SMMU is built in. Also, Fedora apparently enables kunit on production kernels. So, put the entire kunit in its own module using the VISIBLE_IF_KUNIT/EXPORT_SYMBOL_IF_KUNIT machinery. This keeps it out of vmlinus on Fedora and makes the kconfig work in the normal way. There is no cost if kunit is disabled. Fixes: 56e1a4c ("iommu/arm-smmu-v3: Add unit tests for arm_smmu_write_entry") Reported-by: Thorsten Leemhuis <[email protected]> Link: https://lore.kernel.org/all/[email protected] Signed-off-by: Jason Gunthorpe <[email protected]> Tested-by: Thorsten Leemhuis <[email protected]> Acked-by: Will Deacon <[email protected]> Link: https://lore.kernel.org/r/[email protected] Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit da55da5) Signed-off-by: Koba Ko <[email protected]>
Previously with tegra-smmu, even with CONFIG_IOMMU_DMA, the default domain could have been left as NULL. The NULL domain is specially recognized by host1x_iommu_attach() as meaning it is not the DMA domain and should be replaced with the special shared domain. This happened prior to the below commit because tegra-smmu was using the NULL domain to mean IDENTITY. Now that the domain is properly labled the test in DRM doesn't see NULL. Check for IDENTITY as well to enable the special domains. This is the same issue and basic fix as seen in commit fae6e66 ("drm/tegra: Do not assume that a NULL domain means no DMA IOMMU"). Fixes: c8cc265 ("iommu/tegra-smmu: Implement an IDENTITY domain") Reported-by: Diogo Ivo <[email protected]> Closes: https://lore.kernel.org/all/[email protected]/ Tested-by: Diogo Ivo <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> Signed-off-by: Thierry Reding <[email protected]> Link: https://patchwork.freedesktop.org/patch/msgid/[email protected] (cherry picked from commit cb83f4b) Signed-off-by: Koba Ko <[email protected]>
The commit 2ad56ef ("powerpc/iommu: Setup a default domain and remove set_platform_dma_ops") refactored the code removing the set_platform_dma_ops(). It missed out the table group release_ownership() call which would have got called otherwise during the guest shutdown via vfio_group_detach_container(). On PPC64, this particular call actually sets up the 32-bit TCE table, and enables the 64-bit DMA bypass etc. Now after guest shutdown, the subsequent host driver (e.g megaraid-sas) probe post unbind from vfio-pci fails like, megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0x7fffffffffffffff, table unavailable megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0xffffffff, table unavailable megaraid_sas 0031:01:00.0: Failed to set DMA mask megaraid_sas 0031:01:00.0: Failed from megasas_init_fw 6539 The patch brings back the call to table_group release_ownership() call when switching back to PLATFORM domain from BLOCKED, while also separates the domain_ops for both. Fixes: 2ad56ef ("powerpc/iommu: Setup a default domain and remove set_platform_dma_ops") Signed-off-by: Shivaprasad G Bhat <[email protected]> Reviewed-by: Jason Gunthorpe <[email protected]> Link: https://lore.kernel.org/r/170628173462.3742.18330000394415935845.stgit@ltcd48-lp2.aus.stglab.ibm.com Signed-off-by: Joerg Roedel <[email protected]> (cherry picked from commit d2d00e1) Signed-off-by: Koba Ko <[email protected]>
The rewind routine should remove the reserved iovas added to the new hwpt. Fixes: 89db316 ("iommufd: Derive iommufd_hwpt_paging from iommufd_hw_pagetable") Cc: [email protected] Link: https://patch.msgid.link/r/[email protected] Signed-off-by: Nicolin Chen <[email protected]> Reviewed-by: Kevin Tian <[email protected]> Signed-off-by: Jason Gunthorpe <[email protected]> (cherry picked from commit 950aeef) Signed-off-by: Koba Ko <[email protected]>
c08b4c7 to
5c5c4ac
Compare
Summary
This PR addresses a critical compatibility issue between IOMMU passthrough mode and NVIDIA UVM driver on ARM64 platforms, which causes NEMO framework and PyTorch CUDA initialization failures.
Problem
During NEMO framework testing on ARM64 platforms, tests consistently fail on machines configured with
iommu.passthrough=1. The failure is specifically triggered bytorch.cuda.set_device(1)calls, resulting in kernel warnings and IOMMU-related errors.Note:
iommu.passthrough=1was previously recommended by NVIDIA in official presentation slides as a performance optimization parameter, making this a high-impact issue for users following NVIDIA's best practices.Root Cause
The issue occurs when:
iommu.passthrough=1The kernel warning indicates a conflict in the ARM SMMU v3 driver when UVM attempts to bind GPU VA space: