Demystifying module (un)loading #53

Open
MacGyverNL opened this issue Mar 25, 2021 · 6 comments
@MacGyverNL

MacGyverNL commented Mar 25, 2021

Hi!

Since that reddit thread got deleted (you know the one) I figured this was probably the best way to continue this.

So, I checked the project history and it seems that the nvidia module (un)loading has been in these scripts from day 1. Do you happen to remember why? I see several issues with how it's set up right now, not least that the card is still bound to the driver at the point of modprobe -r, and I don't understand what problem it's solving in the first place. nodedev-detach should unbind the card from the kernel module and bind it to vfio-pci. From that point on, whether those modules are loaded or not should be irrelevant to that card, and unloading them can only cause issues. But maybe there's a good reason for unloading them? In that case I'd definitely move the unloading to after the nodedev-detach commands, and move the load commands for vfio-pci up (also, why is that module not continuously loaded?).
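A minimal sketch of the ordering proposed above, not the repository's actual hook script. The node-device names in GPU_DEVS are hypothetical placeholders, and the DRY_RUN and UNLOAD_NVIDIA switches are assumptions added for illustration. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch of a "prepare" step: load vfio-pci, detach, then (optionally) unload.
# GPU_DEVS holds hypothetical libvirt node-device names; list the real ones
# with `virsh nodedev-list --cap pci`.
GPU_DEVS="${GPU_DEVS:-pci_0000_01_00_0 pci_0000_01_00_1}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

prepare_gpu() {
    # Load vfio-pci up front so it is present before the device is rebound.
    run modprobe vfio-pci

    # Unbind each GPU function from its host driver and hand it to vfio-pci.
    for dev in $GPU_DEVS; do
        run virsh nodedev-detach "$dev"
    done

    # Only after the detach, and only if unloading is wanted at all.
    if [ "${UNLOAD_NVIDIA:-0}" = "1" ]; then
        run modprobe -r nvidia_drm nvidia_modeset nvidia_uvm nvidia
    fi
}

prepare_gpu
```

Setting DRY_RUN=0 runs the commands for real (as root); leaving UNLOAD_NVIDIA unset skips the module unloading entirely, which is the variant this issue is asking about.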

@joeknock90
Owner

Heyyyy @MacGyverNL!

Short answer: My experience through testing.

Long answer: I have previously had it work without unloading the modules manually, but less consistently. I THINK there was a Red Hat issue I found somewhere about why unloading the modules was sometimes necessary even when using nodedev-detach, but I can't seem to find it now. I could be misremembering.

The last time I tested changing up my scripts was probably... about a year ago, which is something I should do periodically.

I'll do some testing now and tomorrow and see how it goes.

I apologize for the traffic it's caused in the subreddit, but in my defense, I did ask people to yell at me directly on reddit (at least mention me in the post), and I didn't expect a video to be made about it.

If you'd need another mod for VFIO I happily offer my services! Otherwise, thanks for all the help through this... unexpected rise in popularity.

@MacGyverNL
Author

> I have previously had it work without unloading the modules manually, but less consistently. I THINK there was a Red Hat issue I found somewhere about why unloading the modules was sometimes necessary even when using nodedev-detach, but I can't seem to find it now. I could be misremembering.

Did nvidia-persistenced maybe have something to do with it? I don't quite see why unloading the modules, rather than killing nvidia-persistenced, would be the better fix in that case, but that's the only thing that comes to mind that might be related.

> I'll do some testing now and tomorrow and see how it goes.

Cool, please holler at me when you've got findings.

> I apologize for the traffic it's caused in the subreddit, but in my defense, I did ask people to yell at me directly on reddit (at least mention me in the post),

Hey, none of that is your fault and that's not why I'm here. I'm just trying to figure out whether to tell the people with black screen or other issues using this script, as their first troubleshooting step, to stop unloading modules, or whether that's going to just cause more issues still. I don't personally run a single-GPU passthrough nor Nvidia, so can't test it myself.

> and I didn't expect a video to be made about it.

In SOG's defense (yeah, seriously), they're also pretty explicit about all this applying to their hardware in their situation and people should figure stuff out themselves, not copy blindly, but yeah... law of large numbers.

> If you'd need another mod for VFIO I happily offer my services! Otherwise, thanks for all the help through this... unexpected rise in popularity.

I'm not a mod there either, just a guy who voiced something a lot of other regulars all seemed to be feeling. The influx seems to have died down a bit, but regardless, I'd like to figure out whether this module switcheroo is really necessary for anything. The simpler the setup is, the easier it is to help people get it working.

@joeknock90
Owner

Thought you were a mod! Fooled me! You probably should be!

So, I've never used nvidia-persistenced. There's not much point with a single GPU, honestly.

I just tested, and removing the modules DID in fact allow me to pass my GPU through just fine.

I'm going to update my guide and make a post on reddit about it.

Thanks for bringing it up man! I haven't been as on top of this as I probably should have.

@joeknock90
Owner

It LOOKS like I might actually have to reload at least 2 modules during revert: nvidia_uvm and nvidia_drm don't seem to load automatically.
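As a rough sketch, the revert step described above could reload just those two modules when they are missing. The DRY_RUN switch and the SYSFS_ROOT override are assumptions added for illustration and safe testing; they are not part of the existing scripts.

```shell
#!/bin/sh
# Sketch of a "revert" step: reload nvidia_uvm and nvidia_drm if absent.
DRY_RUN="${DRY_RUN:-1}"   # 1 = only print the commands, 0 = execute them

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reload_nvidia_modules() {
    for mod in nvidia_uvm nvidia_drm; do
        # /sys/module/<name> is present while a loadable module is loaded.
        if [ -d "${SYSFS_ROOT:-/sys}/module/$mod" ]; then
            echo "$mod already loaded"
        else
            run modprobe "$mod"
        fi
    done
}

reload_nvidia_modules
```

With DRY_RUN=0 this would run after the GPU has been handed back to the host, so the check also shows whether the modules did come back on their own.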

@MacGyverNL
Author

Would echoing the right PCI address into the bind file in their /sys hierarchy have the same effect?
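For reference, the sysfs write being asked about would look roughly like the sketch below. The PCI address and driver name in the usage note are hypothetical examples, and SYSFS_ROOT is an assumption added so the function can be tried against a scratch directory. Whether this has the same effect as reloading the modules is exactly the open question here.

```shell
#!/bin/sh
# Sketch: bind a PCI device to a driver by writing its address to the
# driver's sysfs "bind" file. Needs root on a real system.
SYSFS_ROOT="${SYSFS_ROOT:-/sys}"

bind_device() {
    # usage: bind_device <driver> <pci-address>
    path="$SYSFS_ROOT/bus/pci/drivers/$1/bind"
    if [ ! -w "$path" ]; then
        echo "cannot write $path (driver not loaded, or not root?)" >&2
        return 1
    fi
    # The kernel rejects the write if the device is still bound elsewhere.
    printf '%s\n' "$2" > "$path"
}
```

A call would be something like `bind_device nvidia 0000:01:00.0`, issued after the device has been released from vfio-pci.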

@joeknock90
Owner

I meant to test this a few days ago but I've since been indisposed with crappy REAL LIFE. I'll try to do so soon.

@joeknock90 joeknock90 self-assigned this Mar 29, 2021
2 participants