
Indiana Jones and the Great Circle - DLSS Frame Gen #9

Closed
SveSop opened this issue Dec 12, 2024 · 102 comments

@SveSop
Owner

SveSop commented Dec 12, 2024

#8

Since that pull did not fix the issue, even tho it may or may not be related to CUDA, I'll keep this open a while to see if there COULD be something there with CUDA context creation or something.

I do not own this game, but maybe someone could come up with a link to a Vulkan demo or something using DLFG? The NVIDIA "Donut" demo does use DLSS, but to my knowledge it does not use Frame Gen.. and currently it does not work for me at all when running with the -vulkan option.
It does not load nvcuda either, so it's probably not comparable, other than maaaaaybe something with winevulkan in general.

@Saancreed
Contributor

There is https://github.com/nvpro-samples/vk_streamline, but it will likely refuse to work because legacy Vulkan Reflex is not supported in Wine/Proton.

(I have a pile of hacks for that though, which I can't currently publish because of header licensing 😩)

Fwiw of all the Vulkan games I'm aware of that come with DLSS Frame Generation (Portal RTX, Portal Prelude RTX, No Man's Sky and Indy, please let me know if you're aware of other ones), only Indy ever tried to actually call into nvcuda. Maybe this is specific to the nvngx_dlssg snippet version that's shipped with this game, or maybe the game does this on its own on purpose, but I'm not sure.

There is, however, a concerning pattern: if the game is using Streamline (so, every one except Portal RTX), then Vulkan DLFG will refuse to work and die. Only Portal RTX, the sole game that was able to avoid Streamline because dxvk-remix has a custom DLFG integration blessed by Nvidia, currently works, and only if you pass WINE_DISABLE_HARDWARE_SCHEDULING=0 as an environment variable to Proton.

@SveSop
Owner Author

SveSop commented Dec 12, 2024

Yeah, I compiled vk_streamline and it does actually fire up. I'm using Bottles for this test with regular Wine, but I should probably try GE-Proton or something with that hardware_scheduling thing..

But it does actually start up. Enabling/disabling Reflex does not indicate much, probably because LatencyFleX or whatsitsname is not working.. dunno. Enabling DLSS does not work, and DLFG (DLSS-G, as it seems to be called in that log) shows this hardware scheduling thing.

Looking at some logs, it does indicate something along the lines of Native VK OFA feature not supported on this device! Falling back to OFA VK-Cuda interop feature. So.. maybe this is somewhat indicative of why it would attempt CUDA usage?

I'll do some spying on what nvcuda calls, if any, are made in Windows 🤔

@SveSop
Owner Author

SveSop commented Dec 12, 2024

@Saancreed
Contributor

Huh, nice find. Good to know there is actual CUDA interop there, that explains stuff… but we shouldn't be failing the check at https://github.com/NVIDIAGameWorks/Streamline/blob/f9fc648591a88d6accf859cd5c36010c25b6ab7b/source/platforms/sl.chi/vulkan.cpp#L2610 🤔

@SveSop
Owner Author

SveSop commented Dec 12, 2024

So.. fiddling a bit with this using GE-Proton-9.20, I at least get a different error for DLFG, "Error 6".. Running this in Bottles I had to add the registry entry for the "RealPath" thing for NGXCore.. I thought the "DLSS script" in Bottles would actually do this, but it did not.

Anyway, a small snippet from the log:

[12.12.2024 21-06-40][streamline][warn]commonentry.cpp:663[getNGXFeatureRequirements] Native VK OFA feature not supported on this device! Falling back to OFA VK-CUDA interop feature.
0024:trace:nvcuda:DllMain (0x78062d650000, 1, (nil))
014c:trace:nvcuda:DllMain (0x78062d650000, 2, (nil))
014c:trace:nvcuda:DllMain (0x78062d650000, 3, (nil))

--

[12.12.2024 21-06-41][streamline][warn]commonentry.cpp:663[getNGXFeatureRequirements] Native VK OFA feature not supported on this device! Falling back to OFA VK-CUDA interop feature.
0150:trace:nvcuda:DllMain (0x78062d650000, 2, (nil))
0150:trace:nvcuda:DllMain (0x78062d650000, 3, (nil))
[12.12.2024 21-06-42][streamline][warn]dlss_gentry.cpp:274[updateEmbeddedJSON] Failed to obtain DLSS-G min spec requirements from NGX, using SL defaults
[12.12.2024 21-06-42][streamline][warn]dlss_gentry.cpp:406[updateEmbeddedJSON] Disabling DLSS-G since it is not supported on current hardware

I do believe that even tho nvcuda.dll is loaded in this case (for the vk_streamline demo thing), the nvngx.dll that NVIDIA provides with the driver actually loads libcuda.so behind the scenes, and that is why nothing of sense gets logged in the nvcuda log. That is probably also why, most of the time, the "empty" nvcuda.dll stub provided with Proton just works.. The game just needs to load the DLL, but internally, when using nvngx.dll, it uses libcuda.so directly.

I was looking into this on some other demos using a libcuda relay I made, loading it with LD_PRELOAD, and at least that particular demo did use libcuda directly. I'm not entirely sure why it would fail to obtain the minimum DLSS-G spec, but it could be that some update needs to be done to nvngx. I seem to remember reading about some PROTON_XXX option to allow updating nvngx? Arf.. can't find it..

EDIT: Doh.. if I had half a brain, it would be a better time for me.. ofc it won't work with DLFG on my stupid old linux-hack-box.. an RTX 2070 ain't good enough. 😞

@Saancreed
Contributor

Yeah, the GPU needs to support VK_NV_optical_flow for this to work.

On my system with a GeForce RTX 4080 Mobile, the vk_streamline sample, when launched with Proton Experimental, never calls into nvcuda. Instead, I see stuff like

info:nvofapi64:DXVK-NVAPI experimental-9.0-20240718-71-g7b2cd347+ NVOFAPI/VK gcc 10.3.0 x86_64 plain (vk_streamline.exe)
info:nvofapi64:OFAPI Client Version: 5.0
info:nvofapi64:<-NvOFAPICreateInstanceVk: Success
info:nvofapi64:<-CreateOpticalFlowVk: Success
info:nvofapi64:<-RegisterResourceVk: Success
info:nvofapi64:<-RegisterResourceVk: Success
info:nvofapi64:<-RegisterResourceVk: Success
info:nvofapi64:<-RegisterResourceVk: Success
info:nvofapi64:<-RegisterResourceVk: Success
info:nvofapi64:<-RegisterResourceVk: Success
info:nvofapi64:<-RegisterResourceVk: Success
info:nvofapi64:<-RegisterResourceVk: Success
info:nvofapi64:<-RegisterResourceVk: Success
info:nvofapi64:<-RegisterResourceVk: Success
info:nvofapi64:<-RegisterResourceVk: Success
info:nvofapi64:OFExecuteVK params: inputFrame: 0x18d23a0 referenceFrame: 0x18d2400 externalHints: 0 disableTemporalHints: 1 hPrivData: 0x24cdf70 numRois: 0 roiData: 0 numWaitSync: 1 pWaitSyncs: 0x2544670
info:nvofapi64:<-ExecuteVk: Success
info:nvofapi64:OFExecuteVK params: inputFrame: 0x18d23a0 referenceFrame: 0x18d2400 externalHints: 0 disableTemporalHints: 1 hPrivData: 0x24cdf70 numRois: 0 roiData: 0 numWaitSync: 1 pWaitSyncs: 0x2546bc0

which then hangs. Oh well. But at least I do pass the sample's native OFA check, so I'm not sure why I would fail it with Indy.

@SveSop
Owner Author

SveSop commented Dec 13, 2024

Does it hang immediately when you enable Frame Gen?
Default startup settings should have everything "off", so it should work just fine with various DLSS qualities except for "Ultra Quality" (which does not work in Windows either).

I installed the latencyflex binaries to my distro, used GE-Proton-20 binaries, created a fresh prefix with whatever was needed, and ran with LFX=1, and I can run the sample with various DLSS qualities. I can't really see any visible difference in that demo thingy, but at least I can hear the coil whine on my 2070 change in pitch when I change qualities, so I suppose something is working 🤣

Ofc no Frame Gen due to the old card... I'm just interested in whether it crashes "no matter what" for you, so we can start blaming nvofapi64 🤣

@SveSop
Owner Author

SveSop commented Dec 13, 2024

And using my Linux libcuda.so relay, CUDA is used directly by the nvngx DLLs:

Nvngx_func13: (0x55558afd2690, 0x55558af94aa0, 0x6ffff9ea8d60)
Nvngx_func11: (0x555589a0ba60, 0x55558afd2790, 0x6ffffc6805c0, 0x1, 0x1000ffb58)
Nvngx_func13: (0x55558aff4f30, 0x55558afdb740, 0x6ffff9ea8dd0)
Nvngx_func11: (0x555589a0ba60, 0x55558aff5030, 0x6ffffc62ade0, 0x1, 0x1000ffb58)
Nvngx_func13: (0x55558b027750, 0x55558b0034e0, 0x6ffff9ea8e50)
Nvngx_func38: (0x55558acdd2b0, 0x555589a0ba60, 0x55558ace0cf8)
Nvngx_func36: (0x1000ffb74, 0x1000ffb78, 0x1000ffb7c, 0x55558a5eddf0)
Nvngx_func21: (0x55558b05b550, 0x7725600, 0x55558a5eddf0, 0x1f, 0x5555892bd260, (nil))
Nvngx_func23: (0x55558b05b550, 0x7725600, 0x55558a5eddf0, 0x78, 0x43, 0x1, 0x8, 0x8, 0x1, (nil))
Nvngx_func22: (0x55558b05b550, 0x7725600, 0x55558a5eddf0, (nil), 0x11c790)
Nvngx_func43: (0x55558b05b550, 0x7725600)
Nvngx_func36: (0x1000ffb74, 0x1000ffb78, 0x1000ffb7c, 0x5555892aaae0)
Nvngx_func21: (0x55558b05b550, 0x7725a00, 0x5555892aaae0, 0x1f, 0x5555892bd260, (nil))
Nvngx_func23: (0x55558b05b550, 0x7725a00, 0x5555892aaae0, 0x3e2, 0x1, 0x1, 0x100, 0x1, 0x1, (nil))
Nvngx_func22: (0x55558b05b550, 0x7725a00, 0x5555892aaae0, (nil), 0x11c790)
Nvngx_func43: (0x55558b05b550, 0x7725a00)
Nvngx_func36: (0x1000ffb74, 0x1000ffb78, 0x1000ffb7c, 0x5555892aaae0)
Nvngx_func21: (0x55558b05b550, 0x7725e00, 0x5555892aaae0, 0x1f, 0x5555892bd260, (nil))
Nvngx_func23: (0x55558b05b550, 0x7725e00, 0x5555892aaae0, 0x2, 0x1, 0x1, 0x100, 0x1, 0x1, (nil))
Nvngx_func22: (0x55558b05b550, 0x7725e00, 0x5555892aaae0, (nil), 0x11c790)
Nvngx_func43: (0x55558b05b550, 0x7725e00)

This "Nvngx_funcX" functions is one of those internal hidden API thingys in nvcuda.

cuGetExportTable: Nvngx_UUID {7f9212d6-261d-dd4d-8af6-38dd1aeb10ae}
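For reference, cuGetExportTable() is the documented-in-headers but otherwise opaque doorway into such hidden APIs: it hands back a table of function pointers keyed by UUID. A minimal sketch of how a relay can grab that table is below; the UUID bytes are transcribed from the log line above (the exact byte order inside CUuuid is an assumption), and the layout of the returned table is undocumented, so the Nvngx_funcX numbering is purely the relay's own labeling.

#include <stdio.h>
#include <cuda.h>

/* UUID as logged above: {7f9212d6-261d-dd4d-8af6-38dd1aeb10ae}.
 * Byte order inside CUuuid may need adjusting; this is a guess. */
static const CUuuid nvngx_uuid = {
    { 0x7f, 0x92, 0x12, 0xd6, 0x26, 0x1d, 0xdd, 0x4d,
      0x8a, 0xf6, 0x38, 0xdd, 0x1a, 0xeb, 0x10, 0xae }
};

int main(void)
{
    const void *table = NULL;

    cuInit(0);
    /* cuGetExportTable() returns a private table of function pointers;
     * a relay can wrap each slot to log calls like Nvngx_func11/12/13. */
    if (cuGetExportTable(&table, &nvngx_uuid) == CUDA_SUCCESS)
        printf("Nvngx export table at %p\n", table);
    return 0;
}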

@shelterx

shelterx commented Dec 15, 2024

I can chime in here; I've seen different results. For me it never worked, regardless of whether DLSS-FG is set to off.
As soon as I set the GPU to AD100, the game crashes, with one exception: if nvcuda is not loaded or is missing, the game will run fine with AD100 set, but FG won't work; the option is there but it doesn't do anything.

For some people, the game seems to crash as soon as they enable FG; however, I'm unable to verify this.

Saancreed got some logs from me...

Btw, maybe Nvidia needs to step in and help here too...

@SveSop
Owner Author

SveSop commented Dec 16, 2024

I don't have the game, so I can't test that... but what I tested was the vk_streamline demo linked above. Spoofing an AD100 card makes the demo run with the option to enable FG. Doing so, however, makes a call that is missing a CUDA context - and hangs on a black screen:

Nvngx_func11: (0x55555c255ee0, 0x55555d41be50, 0xa88340, 0x1, 0x1000ffb58)
Nvngx_func12: ((nil), 0x55555d7dd)

The func12 call is supposed to take two "context" addresses, so something is missing.. But then again, spoofing AD100 won't give me that optical flow Vulkan extension anyway, so that might be it.

There is also a call to NvAPI_GPU_QueryNodeInfo made immediately when enabling FG that could be related to setting something up, but it is not documented, so I do not know which struct fields need to be filled out.
NVAPI_INTERFACE NvAPI_GPU_QueryNodeInfo(NvLogicalGpuHandle hLogicalGpu, void* pGpuNodeInfo);
The first parameter IS a logical GPU, but the pGpuNodeInfo parameter is a struct holding some "node information" that could be needed for something. The only thing I have gleaned from this is that it uses a _V2 kind of struct (from the .version field). Beyond that I have not been able to figure out much, other than that it is fairly huge (byte-wise). It could be an array holding up to e.g. 256 entries, or whatever "max_nodes" is supposed to be.
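One small foothold here: public NVAPI structs conventionally fill .version with MAKE_NVAPI_VERSION(type, ver), which packs sizeof(type) into the low 16 bits and the revision into the bits above. Assuming the undocumented pGpuNodeInfo struct follows the same convention, decoding the version value the caller passes in reveals the expected struct size, as in this sketch:

#include <stdint.h>
#include <stdio.h>

/* NVAPI convention: version = sizeof(struct) | (revision << 16).
 * Whether NvAPI_GPU_QueryNodeInfo's struct follows it is an assumption. */
static void log_nvapi_struct_version(uint32_t version)
{
    uint32_t size = version & 0xffffu; /* total struct size in bytes  */
    uint32_t rev  = version >> 16;     /* 2 would match the _V2 above */
    printf("pGpuNodeInfo: revision %u, %u bytes\n", rev, size);
}

A large decoded size would fit the "array of max_nodes entries" guess.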

Other calls made in this demo that I don't think are strictly needed include:
NvAPI_Vulkan_SetSleepMode
NvAPI_Vulkan_Sleep
I suppose NvAPI_Vulkan_SetLatencyMarker and NvAPI_Vulkan_InitLowLatencyDevice, together with the two above, might be related to Reflex usage.

NvAPI_QueryInterface (0xad298d3f): Unknown function ID shows up a few times for this demo, but I have no info on what that is.
What I tend to do when looking at those "unknowns" is to check which calls come before and after, and in this case (for the vk_streamline demo) it seems to call NvAPI_Initialize, then this one.. then initialize again.. and so on. This could indicate a fallback mechanism: if the unknown call does not return "OK", it tries to initialize and has one more go.. Just speculation ofc 😏
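Purely as an illustration of that speculated fallback, the pattern the trace suggests would look something like this (only the exported nvapi_QueryInterface symbol and the ID from the log are real; the retry logic is guesswork):

#include <windows.h>

typedef void *(__cdecl *PFN_nvapi_QueryInterface)(unsigned int id);

/* Resolve an undocumented NVAPI entry point by ID, retrying once,
 * mirroring the Initialize -> QueryInterface -> Initialize sequence
 * seen in the log. */
static void *query_unknown_id(void)
{
    HMODULE nvapi = LoadLibraryA("nvapi64.dll");
    PFN_nvapi_QueryInterface qi = (PFN_nvapi_QueryInterface)
        GetProcAddress(nvapi, "nvapi_QueryInterface");
    void *fn = NULL;

    for (int attempt = 0; attempt < 2 && qi && !fn; attempt++)
        fn = qi(0xad298d3f); /* the unknown function ID from the log */
    return fn;
}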

None of these are documented in the open API.. And I have not checked whether they are used in the Indiana Jones game, so in that sense this is a huge wall of text not really related to the game itself.. just figuring out what is needed for Frame Gen in the vk_streamline demo.

@Saancreed
Contributor

It doesn't help that we are effectively trying to troubleshoot three different issues here:

  1. Something, either in nvcuda, nvapi, or nvofapi, is missing for DLFG's VK-CUDA interop to be happy.
  2. Indy is trying to use VK-CUDA interop for DLFG, even though it shouldn't have to on my machine.
  3. Streamline is doing something weird, to the point where none of the Vulkan games with DLFG that use it have Frame Generation working in Proton, even those that don't use this CUDA interop.

We should probably make sure that the issue we are trying to resolve is not hidden behind a manifestation of another issue, so to speak. There's a nonzero chance that any attempt to debug the first issue will be harder because of the third one.

Does it hang immediately when you enable Frame Gen?

I currently have r_streamlineDLSSGMode "1" in my settings so it hangs just as the main menu would be shown.

I installed the latencyflex binaries to my distro, used GE-Proton-20 binaries, created a fresh prefix with whatever was needed, and ran with LFX=1, and I can run the sample with various DLSS qualities.

LatencyFleX won't be able to help us here. It supports only the D3D flavor of Reflex, and only partially. The Vulkan one is a slightly different beast that goes NvLowLatencyVk.dll → some private nvapi64.dll functions (the ones you mention later) → Vulkan driver, and we currently can't support it due to missing headers for those private functions… but that will change soon-ish 🤫

CUDA is used directly by the nvngx DLLs

This "Nvngx_funcX" functions is one of those internal hidden API thingys in nvcuda.

cuGetExportTable: Nvngx_UUID {7f9212d6-261d-dd4d-8af6-38dd1aeb10ae}

This could be the private API DLSS Frame Gen uses to support that VK-CUDA interop you found.

The func12 call is supposed to take two "context" addresses, so something is missing.. But then again, spoofing AD100 won't give me that optical flow Vulkan extension anyway, so that might be it.

Yeah, I think that's the case here.

There is also a call to NvAPI_GPU_QueryNodeInfo made immediately when enabling FG that could be related to setting something up, but it is not documented, so I do not know which struct fields need to be filled out.
NVAPI_INTERFACE NvAPI_GPU_QueryNodeInfo(NvLogicalGpuHandle hLogicalGpu, void* pGpuNodeInfo);
The first parameter IS a logical GPU, but the pGpuNodeInfo parameter is a struct holding some "node information" that could be needed for something. The only thing I have gleaned from this is that it uses a _V2 kind of struct (from the .version field). Beyond that I have not been able to figure out much, other than that it is fairly huge (byte-wise). It could be an array holding up to e.g. 256 entries, or whatever "max_nodes" is supposed to be.

🙁

Okay, I can imagine this being a problem.

Other calls made in this demo that I don't think are strictly needed include:
NvAPI_Vulkan_SetSleepMode
NvAPI_Vulkan_Sleep
I suppose NvAPI_Vulkan_SetLatencyMarker and NvAPI_Vulkan_InitLowLatencyDevice, together with the two above, might be related to Reflex usage.

Correct.

NvAPI_QueryInterface (0xad298d3f): Unknown function ID shows up a few times for this demo, but I have no info on what that is. What I tend to do when looking at those "unknowns" is to check which calls come before and after, and in this case (for the vk_streamline demo) it seems to call NvAPI_Initialize, then this one.. then initialize again.. and so on. This could indicate a fallback mechanism: if the unknown call does not return "OK", it tries to initialize and has one more go.. Just speculation ofc 😏

None of these are documented in the open API.. And I have not checked whether they are used in the Indiana Jones game, so in that sense this is a huge wall of text not really related to the game itself.. just figuring out what is needed for Frame Gen in the vk_streamline demo.

Well, I also have no idea what that is, or how critical it would be for DLFG to work. I should probably recheck if Portal RTX tries to call it.

@SveSop
Owner Author

SveSop commented Dec 17, 2024

Since I don't own any of the games, if you could attach a DXVK-NVAPI log from both games, I could see if any of those "unknown" addresses are similar. Preferably with some indication of with/without FG usage.

The reason I mentioned LatencyFleX in regard to this vk_streamline demo was not that I expected it to "work".. it just seemed to be a requirement for this particular demo to even start... Bottles has this as a toggle, but my "manual" wineprefix did not, and the demo did not even start up without it. After adding the latencyflex binaries "distro wide" + the LFX=1 option, I could run the vk_streamline demo with DLSS, but it crashes if I enable FG. (Using GE-Proton-9.20)

If you think it could help in any way, I could give you access to my libcuda.so relay library, and you could perhaps get some more info from that (just regular C code compiled with a makefile)? I don't have Linux on my 4070 gaming rig, or else I could have done more testing there WITH the proper Vulkan extensions 😢
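The relay itself isn't published here, but the basic LD_PRELOAD interposition idea presumably looks something like this minimal sketch (a hypothetical reconstruction: export the cu* symbol, log the call, forward to the real driver via RTLD_NEXT):

#define _GNU_SOURCE
#include <dlfcn.h>
#include <stdio.h>

/* Build: gcc -shared -fPIC -o librelay.so relay.c -ldl
 * Run the target with LD_PRELOAD=/path/to/librelay.so so this cuInit
 * shadows the one in libcuda.so and can log before forwarding.
 * (int stands in for CUresult to avoid needing cuda.h here.) */
int cuInit(unsigned int flags)
{
    static int (*real_cuInit)(unsigned int);

    if (!real_cuInit)
        real_cuInit = (int (*)(unsigned int))dlsym(RTLD_NEXT, "cuInit");
    fprintf(stderr, "cuInit: (0x%x)\n", flags);
    return real_cuInit(flags);
}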

@Saancreed
Contributor

Here is a log from Portal RTX, with both Reflex and Frame Gen enabled and (as far as I can tell) working correctly: steam-2012840.log

Here is Indy, with GPU reported as Ampere: steam-2677660-ampere.log

Here is Indy, with GPU reported as Ada: steam-2677660-ada.log

And here is vk_streamline, which hangs the moment I click on the Frame Generation toggle:
steam-vkstreamline.log

I think at least the DLFG's VK-CUDA interop is bailing out in Indy because it could be loading nvofapi64.dll, looking with GetProcAddress for NvOFAPICreateInstanceCuda (which we don't implement), not finding it, and failing. Maybe we should try borrowing nvofapi64.dll from Nvidia's Windows driver 🙃

If you think it could help in any way, I could give you access to my libcuda.so relay library, and you could perhaps get some more info from that (just regular C code compiled with a makefile)?

I can try, if I find some spare time.

@Saancreed
Contributor

Okay, so I spent the last few days finishing my Vulkan Reflex implementation, but with that out of the way I got the logs from Indy with libcuda relay: steam-2677660-relay.log

However, the behavior changed with this library preloaded. The game never loaded nvofapi64.dll in this case, and never got to make any nvcuda.dll call that would be logged by PROTON_LOG=+nvcuda (other than DllMain). Disabling DLSS Super Resolution does not affect this. Could the relay be interfering with the Nvidia driver / NVNGX trying to set up VK_NVX_binary_import for CUDA interop?

@SveSop
Owner Author

SveSop commented Dec 19, 2024

Could the relay be interfering with the Nvidia driver / NVNGX trying to set up VK_NVX_binary_import for CUDA interop?

Short answer: Yes.

Theory:
I have some similar things happening in Windows when hard-replacing nvcuda64.dll in the DriverStore. There is a _RDATA segment in the original nvcuda that probably contains something that is needed.
I ofc do not know what it contains, but if that segment holds some lookup table, or some information that nvngx.dll & friends access, this would ofc fail miserably.

Same with the libcuda.so.xxx.xxx (original driver-versioned) library. There is a .rodata section like this:

[16] .rodata           PROGBITS         0000000000c2a000  00c2a000
       00000000020bb0e0  0000000000000000   A       0     0     32

compare that with my relay:

[16] .rodata           PROGBITS         0000000000043000  00043000
       000000000000bc72  0000000000000000   A       0     0     32

Quite more "data" in that segment for sure.. so no problem seing that there might be a lot more too it than just relaying the cu** calls.

I do not know whether it is nvngx.dll that is responsible for setting up calls to nvofapi64.dll or not tho, but it is highly likely that some data in that read-only section contains something the relay library does not have...

EDIT: A quick look with IDA kinda shows that this _RDATA segment in the Windows .dll is some 4 jump tables with 10+ entries each, pointing to some internal offsets. I have no clue what this is.. Could be anything 😢 I do feel this is leaning towards nvngx.dll functionality that does not currently work - i.e. NVIDIA's side. Implementing this .rdata segment is beyond me for sure.

@SveSop
Owner Author

SveSop commented Dec 19, 2024

Looking at the logfile, it does somewhat seem to "work", but all of those Nvngx_funcXX calls are more or less speculation - a "working until it crashes" kind of thing.

So, it may ALSO very well be that some of those calls need more parameters. Since I do not actually log the returned CUresult, they may just as well be failing. I suppose that logging could be added to more easily see if something more is up 😄
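A sketch of what that return-code logging could look like in the relay (the real_Nvngx_func11 pointer and its arity are hypothetical, being a slot from the hidden export table discussed above; cuGetErrorName is the real driver call for naming an error):

#include <stdio.h>
#include <cuda.h>

/* Hypothetical slot taken from the hidden export table; the argument
 * count matches what the relay logs for Nvngx_func11 above. */
extern CUresult (*real_Nvngx_func11)(void *, void *, void *, int, void *);

CUresult relay_Nvngx_func11(void *a, void *b, void *c, int d, void *e)
{
    CUresult ret = real_Nvngx_func11(a, b, c, d, e);

    if (ret != CUDA_SUCCESS) {
        const char *name = NULL;
        cuGetErrorName(ret, &name); /* e.g. CUDA_ERROR_NO_BINARY_FOR_GPU */
        fprintf(stderr, "Nvngx_func11: Returned error: %d (%s)\n",
                ret, name ? name : "unknown");
    }
    return ret;
}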

@shelterx

I just want to say I appreciate all your efforts! ❤️
Seems like Nvidia (Liam?) needs to implement the missing stuff maybe.

For me, the game always crashes and never loads nvofapi64 now. It used to load; now it simply loads nvcuda and goes boom. I also tried to replicate getting into the game with FG off, but I don't know how some people manage to do it (from what I read).

@SveSop
Owner Author

SveSop commented Dec 19, 2024

I'll add some return checks then..

This is what happens when I enable FG while spoofing AD100 on my 2070 card:

Nvngx_func11: (0x55557caa3870, 0x55557dc79eb0, 0xa889e0, 0x1, 0x1000ffb58)
Nvngx_func11: Returned error: 209
Nvngx_func12: ((nil), 0x1000fd9d0)
Nvngx_func12: Returned error: 400

CUDA_ERROR_NO_BINARY_FOR_GPU = 209,

Makes perfect sense, since it probably tries to use an AD100 kernel on my TU100 GPU...

CUDA_ERROR_INVALID_HANDLE = 400,

Yeah.. that call needs to take 2 parameters.. but it gets a nullptr and thus fails (probably due to the previous error). I'll start on that then, and we'll see if there is anything useful to be gathered.

@SveSop
Owner Author

SveSop commented Dec 19, 2024

@Saancreed I pushed some error code checking. It will ofc not solve anything, but it could be interesting to see if one of those nvngx calls fails somewhere 👍

I should probably use that more in the code ofc.. for nvcuda too I suppose, but the overhead of calling -> returning might be more than just returning outright and letting the app/game handle any errors, especially for those weirdo calls that get used to an insane degree 😏 (But for the libcuda relay it does not matter, cos it's just for snooping purposes anyway and not something used regularly.)

@Saancreed
Contributor

steam-2677660.log

Haven't seen any errors that caught my eye but maybe I missed something. Fwiw at some point the game just stops logging anything CUDA related and just waits for me to terminate it. The log is ~7.9 MiB whether I let it run for a minute or two.

@Saancreed
Contributor

Ah, with the native nvofapi64.dll (and without the libcuda relay) the game gets a bit further: now, after calls to cuCtxCreate_v2, cuStreamCreate and cuCtxPopCurrent_v2, NVOFAPI is trying to call D3DKMTQueryAdapterInfo with Type = 31, which Proton doesn't implement, so the call fails and then the usual CUDA cleanup (cuStreamDestroy_v2, cuCtxDestroy_v2) continues. This more or less confirms my theory that the lack of NvOFAPICreateInstanceCuda in dxvk-nvapi's implementation is the current roadblock… and that nvcuda.dll is likely doing everything right until that point.

@SveSop
Owner Author

SveSop commented Dec 20, 2024

@Saancreed
Contributor

Almost - that's the old version of nvofapi; the new (5.0) header is here: https://github.com/jp7677/dxvk-nvapi/blob/v0.8.0/inc/nvofapi/nvOpticalFlowCuda.h

@SveSop
Owner Author

SveSop commented Dec 20, 2024

I can't help wondering if the usage of CUDA here is some sort of fallback mechanism.

Guess I'll have to set aside a partition on my 4070 rig to see if I can look into this a bit more. It's somewhat hard to compare workings when it is not the same GPU with the missing VK_NV_optical_flow extension, I guess.

@shelterx

What do you mean by fallback? Natively the game loads nvcuda.dll from system32 and nvcuda64.dll from the DriverStore in Windows; this is with an RTX 4070.

@shelterx

By the way, FG makes the game run worse for me with RT on in Windows for some reason.
It kind of serves no purpose as it is now, at least not for me... unless it needs to be tweaked with other graphics settings.

With that being said, if you manage to fix this it might help other games...

@SveSop
Owner Author

SveSop commented Dec 21, 2024

What do you mean by fallback? Natively the game loads nvcuda.dll from system32 and nvcuda64.dll from the DriverStore in Windows; this is with an RTX 4070.

I just wonder why it seems to prefer NvOFAPICreateInstanceCuda over NvOFAPICreateInstanceVk. It could be that NvOFAPICreateInstanceVk uses CUDA in the background anyway ofc. It just makes me wonder whether SOME Vulkan call fails and the CUDA interop is the fallback? But it could ofc all just be different methods of tying various resources together <-> CUDA anyway.

Let's say the default mechanic is to use NvOFAPICreateInstanceVk, which ties Vulkan resources <-> CUDA, and that works in 99.9% of the cases the game manufacturer sees. Whatever the included dlss_g or whatnot chooses to do in case of failure is something the game devs have not really accounted for, and WHAM, there is no "game mechanical fallback" in place to handle this direct CUDA usage through NvOFAPICreateInstanceCuda - and certainly not for wine/proton, since dxvk-nvapi's nvofapi64 does not have this at all.

Some say the game is good, so it might not be a total waste of $ if I buy it AND set up a Linux partition for this testing... Christmas and all 🤣

@Saancreed
Contributor

Natively the game loads nvcuda.dll from system32 and nvcuda64.dll from the DriverStore in Windows; this is with an RTX 4070.

That's unsurprising, nvngx_dlssg.dll dynamically links to nvcuda so it will always be loaded even if unused. The interesting part is whether nvcuda is called into even on Windows.

It could be that NvOFAPICreateInstanceVk uses CUDA in the background anyway ofc.

It very well might, but only at a level internal to the Linux driver, so we don't have to care about that. It would be nice to know whether the game is using CUDA-based or VK-based optical flow on Windows, because Indy appears to be using the manual hooking method for Streamline, and this has some additional caveats with regard to DLFG in Vulkan:

https://github.com/NVIDIAGameWorks/Streamline/blob/main/docs/ProgrammingGuideDLSS_G.md#31-checking-dlss-gs-configuration-and-special-requirements

https://github.com/NVIDIAGameWorks/Streamline/blob/main/docs/ProgrammingGuideManualHooking.md#521-instance-and-device-additions

If CUDA is used directly instead of Vulkan even on Windows, it could be that the interop is there by design (or due to failure to satisfy native Vulkan OF requirements?) and just nobody cares because it works anyway.

By the way, FG makes the game run worse for me with RT on in Windows for some reason.

Or maybe it doesn't work and that's the reason why 🙃

@SveSop
Owner Author

SveSop commented Dec 21, 2024

NVOFAPI Relay proxy!

All functions loaded successfully from nvofapi64.dll

NvOFAPICreateInstanceCuda (80, 000002DC71FA3290)

I created a quick relay for nvofapi64.dll on Windows without looking into the structs.. just for logging purposes, and the game does use NvOFAPICreateInstanceCuda on Windows. (So yes, I bought the game; I have not yet tested it on Linux.)

So, at least we know that for now... Things could be failing on Windows too tho, but that's starting to look a bit less likely perhaps.
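For the record, a quick logging proxy like that can be as simple as the sketch below (the name of the real DLL being forwarded to is a placeholder; the NvOFAPICreateInstanceCuda signature follows the Optical Flow SDK's nvOpticalFlowCuda.h, and the 80 in the log above is presumably the 5.0 API version):

#include <windows.h>
#include <stdint.h>
#include <stdio.h>

typedef uint32_t (__stdcall *PFN_CreateInstanceCuda)(uint32_t, void *);

/* Proxy export: log the call, then forward to the renamed real DLL.
 * "nvofapi64_real.dll" is hypothetical - wherever the original lives. */
uint32_t __stdcall NvOFAPICreateInstanceCuda(uint32_t apiVer, void *functionList)
{
    static PFN_CreateInstanceCuda real;

    if (!real) {
        HMODULE mod = LoadLibraryA("nvofapi64_real.dll");
        real = (PFN_CreateInstanceCuda)
            GetProcAddress(mod, "NvOFAPICreateInstanceCuda");
    }
    fprintf(stderr, "NvOFAPICreateInstanceCuda (%u, %p)\n", apiVer, functionList);
    return real(apiVer, functionList);
}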

@SveSop
Owner Author

SveSop commented Jan 4, 2025

So, do you mean to say that it crashes no matter what settings you use if you use nvcuda from nvidia-libs? Even if DLSS and FG are NOT enabled? Because that does not make sense at all...

@shelterx

shelterx commented Jan 4, 2025

Yeah, seems like it.
However, sometimes the game asks to reset graphics settings after a crash. It then won't load the checkpoint graphics and just shows another screen at the menu instead. If that simpler screen is loaded, the game will work until you load the latest checkpoint. Once it's loaded, it will crash.

@SveSop
Owner Author

SveSop commented Jan 4, 2025

Changing scenes could indeed use more VRAM, so when actually loading a save, it might just do that.

Snippet from the logfile you posted above:

0154:fixme:bcrypt:BCryptGenerateSymmetricKey ignoring object buffer
0124:fixme:bcrypt:BCryptGenRandom ignoring selected algorithm
0274:trace:nvcuda:DllMain (0x788c61570000, 3, (nil))
0278:trace:nvcuda:DllMain (0x788c61570000, 3, (nil))
wine: Unhandled page fault on write access to 0000000000000000 at address 0000000140494D68 (thread 0138), starting debugger...
      2998:     find library=libpthread.so.0 [0]; searching
      2998:      search path=/home/shelter/.local/share/Steam/ubuntu12_64/video/glibc-hwcaps/x86-64-v3:/home/shelter/.local/share/Steam/ubuntu12_64/video/glibc-hwcaps/x86-64-v2:/h>
      2998:       trying file=/home/shelter/.local/share/Steam/ubuntu12_64/video/glibc-hwcaps/x86-64-v3/libpthread.so.0

Snippet from the logfile where it works for me:

0138:fixme:bcrypt:BCryptGenerateSymmetricKey ignoring object buffer
0124:fixme:bcrypt:BCryptGenRandom ignoring selected algorithm
0250:trace:nvcuda:DllMain (0x7616c00e0000, 3, (nil))
024c:trace:nvcuda:DllMain (0x7616c00e0000, 3, (nil))
warn:  CreateDXGIFactory2: Ignoring flags
info:  NVIDIA GeForce RTX 4070 Ti SUPER:
info:    Driver : NVIDIA 565.77.0
info:    Memory Heap[0]:
info:      Size: 16376 MiB
info:      Flags: 0x1
info:      Memory Type[1]: Property Flags = 0x1
info:      Memory Type[4]: Property Flags = 0x7

Snippet from a logfile where I allocate 8GB VRAM BEFORE launching the game:

0164:fixme:bcrypt:BCryptCreateHash ignoring object buffer
0164:fixme:bcrypt:BCryptGenerateSymmetricKey ignoring object buffer
01d8:fixme:x11drv:skip_iconify HACK: skip_iconify.
0114:fixme:oleacc:find_class_data unhandled window class: L"Button"
0114:fixme:uiautomation:msaa_provider_GetPatternProvider Unimplemented patternId 10002
0114:fixme:uiautomation:base_hwnd_provider_GetPatternProvider 0000000000C15280, 10002, 0000000001A3F8A0: stub
0114:fixme:oleacc:find_class_data unhandled window class: L"Button"
0114:fixme:uiautomation:msaa_provider_GetPatternProvider Unimplemented patternId 10002
0114:fixme:uiautomation:base_hwnd_provider_GetPatternProvider 0000000000BD7060, 10002, 0000000001A3F8A0: stub
0114:fixme:oleacc:find_class_data unhandled window class: L"#32770"
0114:fixme:uiautomation:msaa_provider_GetPatternProvider Unimplemented patternId 10002
0114:fixme:uiautomation:base_hwnd_provider_GetPatternProvider 0000000000C15280, 10002, 0000000001A3F8A0: stub
wine: Unhandled page fault on write access to 0000000000000000 at address 0000000140494D68 (thread 014c), starting debugger...
     13942:     find library=libpthread.so.0 [0]; searching

So.. I am fairly confident you ARE actually running out of VRAM of some kind.
What resolution? What other apps? What does nvtop show?

Starting the game at 1440p resolution with the "Low" gfx setting and RT on "medium" (the lowest), the game uses 10.6GB VRAM for me. Then, enabling DLSS and FG, it uses slightly less - 10.3GB.

However, re-launching the game (since I suppose some graphics settings need that), the game uses 11.2GB! Now, with a browser + the Steam launcher + a couple of CLI windows up, that makes a total of 12.05GB for me.

If that happens for you = crash, since that is > 12GB. So, even with everything at "LOW", it WILL eat up > 11GB of VRAM.. and that really seems to be danger territory.
Bringing it down to 1080p resolution shaves off some 400-500MB, so down to approx. 10.6GB game usage.

I can however imagine some effects or scenes using some +/- amount here, so if you are "lucky" and manage to load this with nothing in the background, hovering around 11GB VRAM used, I would still consider it somewhat "danger territory".

I would say you could first start off without DLSS/FG and see how far down you can go.. Keep an eye on nvtop and look at the memory usage. Remember, you would need to restart the game every time you change some setting, I think.. At least those involving textures and whatnot.

It is not extremely far-fetched that there could be memory leaks in nvcuda, or some issues releasing memory, so I will look a bit into that to see if I can spot anything suspicious 😄

@SveSop
Owner Author

SveSop commented Jan 5, 2025

Hmm.. I fear there is a bit more to this. Playing this for a "while" (20-30 minutes), even tho I turned things down so I was hovering around 12-13GB max TOTAL (game +++), I tend to crash..

0270:trace:nvcuda:wine_cuImportExternalSemaphore (0x1313ef580, 0x81b6df40, 10)
0270:err:nvcuda:wine_cuImportExternalSemaphore Returned error: 2
0270:trace:nvcuda:wine_cuGetErrorName (2, 0x81b6df00)
0270:trace:nvcuda:wine_cuCtxPushCurrent_v2 (0x7308d86e8ee0)
0270:trace:nvcuda:wine_cuImportExternalSemaphore (0x1313ef530, 0x81b6df40, 10)
0270:err:nvcuda:wine_cuImportExternalSemaphore Returned error: 2
0270:trace:nvcuda:wine_cuGetErrorName (2, 0x81b6df00)
0124:err:sync:RtlpWaitForCriticalSection section 0000000144820208 (null) wait timed out in thread 0124, blocked by 0160, retrying (60 sec)

And there it froze...

And what is return code 2 you say?
CUDA_ERROR_OUT_OF_MEMORY = 2,

So it seems it does run out of memory eventually, even tho I was probably around 90% vmem usage - so even tho it's not strictly OOM, it might be some starvation of sorts. Guess I have to do some more logging 🤔

@shelterx

shelterx commented Jan 6, 2025

Yeah, I can't really make much sense out of my memory usage because it seems to crash even if there is memory available.
Now I had the game running at the title screen for like 30 seconds, with a checkpoint "screen" loaded... then it just crashed by itself.
According to the in-game performance info, HEAP 0 usage was 8.7 GB and HEAP 1 was 603 MB.
And nvtop said total VRAM usage was 9.8 GB of 12 GB.

It could be a game compatibility or driver issue of some sort too.

It would be nice if someone with a 12GB Ada card could test this too... just to see if they run into the same issue.

@shelterx

shelterx commented Jan 6, 2025

I think I figured it out... it's __GL_13ebad=0x1
The game allocates VRAM in the wrong heap without it, but the allocation is totally different if you disable it.
2GB in HEAP 0 and 7.6GB in HEAP 1

Update:
On Windows the game uses 10GB of VRAM and 850MB in HEAP 1. I've never seen HEAP 1 go above 600MB on Linux.
So it definitely uses more memory on Windows.
Also on Windows, with nvcuda.dll renamed to nvcuda.dll.bak in Windows/system32, both VRAM and HEAP 1 usage go down slightly, not by much tho'.

@SveSop
Owner Author

SveSop commented Jan 7, 2025

Even if you rename nvcuda.dll in c:\windows\system32, it can still be loaded in Windows.. just so you are aware; Windows uses a slightly different DLL path resolution scheme than LD_LIBRARY_PATH on Linux, so to speak.

All nvidia "system libraries" including nvcuda.dll (which is named nvcuda64.dll and nvcuda32.dll 64/32 bit), aswell as nvapi64.dll and whatnot is located in the "DriverStore" folder in windows, and ARE loaded from there... but some apps tends to use a compatibility mode thing where it sometimes is loaded from c:\windows\system32 for then to be unloaded and re-loaded from the DriverStore folder system. (Not gonna explain that one).
The game (TheGreatCircle) might not really load nvcuda.dll other than PERHAPS do it as a "check if it exist" kind of thing, as it is probably some other runtime library that do the heavy lifting. (Just the same with cudart64_xx.dll "runtime" library).

Anyho.. I am not certain what this __GL_13ebad setting actually does, as it is some internal setting thing, but after some testing it is clear that there is some sort of strange memory leak where 2MB chunks of VRAM are eaten steadily every 3-4 seconds when running with FrameGen. I'm trying to investigate this, but it's not overly easy. The amount of video memory in use by the GAME does not change much once you are in-game, but sysmem keeps increasing, as well as some amount of "unknown" video memory. If you look at nvtop while this is happening and add up the "app/game usage" fields against the ACTUAL video memory used, the amounts do not seem to match, so... strange indeed.

It is possibly some special CUDA thing: if I allocate 4096MB of VRAM using cuMemAlloc or whatnot, it will use > 4096MB of video memory for some reason. This is probably a documented feature, some allocation buffer or some crap. Not sure.

The CUDA functions triggered by nvOFExecute do not seem to do any memory allocation, but there is some context swapping back and forth, so maybe something is not freed as it should be.. I am a bit at a loss here atm. Maybe in some register the "win32 handle" is tied to a specific CUDA context, and what we do is create a "linux fd handle" to do the function and return the pointer to some cuda-kernel-blob thing. Wine (Proton) possibly does not tie win32 <-> fd handles together, and if the game then is supposed to free the context tied to this "win32 handle", I do not know what will happen. Since this context is not being actively freed using a CUDA function, maybe it is trying to free some Vulkan thing with a pointer to the Linux fd, and it goes tits up from there.. I dunno.

@Saancreed Do you have any theories? If you run the game at the highest settings you can, with DLSS/FG enabled, try watching vmem usage in nvtop over a few minutes.. it rises steadily for me. The GAME's video memory seems stable, but the "overall" video memory gets used up. Also, when this gets high enough (close to 100%), there is some stuttering as well. Maybe that BIOS setting I can't remember the name of provides some amount of pageable memory (3-400MB or whatnot), and it will eat and eat out of that until it's completely OOM and freezes up?

@Saancreed
Contributor

I am not certain what this __GL_13ebad setting actually does, as it is some internal setting thing

Okay, so as it was explained to me, the game (actually most if not all idTech and now also MOTOR games) has this bug where it requests memory allocations for performance-critical resources to be done in system memory instead of video memory. This isn't a problem on Windows, because WDDM just magically moves stuff on its own behind your back, but there is no such OS-wide mechanism on Linux, where applications are simply trusted not to do anything nonsensical. This variable enables hacky promotion of unmapped sysmem allocations to video memory so allocations end up in a better place… but I don't know if it's sophisticated enough to actually leave in system memory those allocations that aren't performance-critical, so I wouldn't be surprised if it led to higher VRAM requirements on Linux.

Maybe in some register the "win32 handle" is tied to a specific CUDA context, and what we do is create a "linux fd handle" to do the function and return the pointer to some cuda-kernel-blob thing. Wine (Proton) possibly does not tie win32 <-> fd handles together, and if the game then is supposed to free the context tied to this "win32 handle", I do not know what will happen.

IIRC the Vulkan spec documents a difference where importing a resource on one OS consumes the fd/handle, and another case in which it doesn't. Maybe there's a similar difference in CUDA, and we should be closing the handle/fd in nvcuda after calling the Linux-side function? I'm not sure if I got it right.

@Saancreed Do you have any theories?

Not really, but I'd try force-enabling the VK_EXT_pageable_device_local_memory device extension and the VkPhysicalDevicePageableDeviceLocalMemoryFeaturesEXT::pageableDeviceLocalMemory feature via some kind of Vulkan layer; maybe that will help…

To take best advantage of pageable device-local memory the application must create the Vulkan device with the VkPhysicalDevicePageableDeviceLocalMemoryFeaturesEXT::pageableDeviceLocalMemory feature enabled. When enabled the Vulkan implementation will allow device-local memory allocations to be paged in and out by the operating system, and may not return VK_ERROR_OUT_OF_DEVICE_MEMORY even if device-local memory appears to be full, but will instead page this, or other allocations, out to make room.

Which sounds promising, but then there is…

The Vulkan implementation will also ensure that host-local memory allocations will never be promoted to device-local memory by the operating system, or consume device-local memory.

So considering the allocation promotion hack… 🤷
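For reference, force-enabling that feature at device creation would look roughly like this (a layer would inject it behind the game's back; the extension also depends on VK_EXT_memory_priority, and error handling is omitted):

#include <vulkan/vulkan.h>

/* Chain the pageable-memory feature struct into an existing
 * VkDeviceCreateInfo before vkCreateDevice is called. Both
 * VK_EXT_pageable_device_local_memory and VK_EXT_memory_priority
 * must also appear in ppEnabledExtensionNames. */
VkResult create_device_with_paging(VkPhysicalDevice phys,
                                   VkDeviceCreateInfo *info, VkDevice *dev)
{
    VkPhysicalDevicePageableDeviceLocalMemoryFeaturesEXT pageable = {
        .sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PAGEABLE_DEVICE_LOCAL_MEMORY_FEATURES_EXT,
        .pNext = (void *)info->pNext,
        .pageableDeviceLocalMemory = VK_TRUE,
    };

    info->pNext = &pageable;
    return vkCreateDevice(phys, info, NULL, dev);
}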

@SveSop
Owner Author

SveSop commented Jan 7, 2025

Made some changes here bd24982 (testing branch)

At least it works the same and attempts to close the handles, although it does not close the fd, as I am not completely sure how to do that in a proper manner atm. Theoretically it might be fine to close it after the relays, as I suppose the extSem_out and extMem_out should then be "created".. maybe?

Opening this again, as it still crashes for me after a bit of time.. 15-30 minutes in-game, and it either claims to be out of memory even if I have 1-2GB free VRAM according to nvtop and the game says it uses 12-13GB, or it just stops.

I have done some testing with fsync/esync and without either, but it does not really seem to matter. Maybe it is a game bug that pops up with this kind of CUDA usage, considering the shady __GL_13ebad workaround?

One "round" of nvOFExecute does this:

0244:trace:nvofapi:nvOFExecute (0x7be4611520b0, 0x8136dce0, 0x8136dcb0)
0244:trace:nvcuda:wine_cuCtxGetCurrent (0x8136dd60)
0244:trace:nvcuda:wine_cuMemcpy2DAsync_v2 (0x8136dbd0, 0x7be46109be10)
0244:trace:nvcuda:wine_cuCtxPopCurrent_v2 (0x9ca76e18)
0244:trace:nvcuda:wine_cuCtxPushCurrent_v2 (0x7be4606e8b80)
0244:trace:nvcuda:wine_cuSignalExternalSemaphoresAsync (0x8136d850, 0x8136d890, 1, 0x7be46109be10)
0244:trace:nvcuda:wine_cuCtxPopCurrent_v2 (0x9ca76e18)
0244:trace:nvcuda:wine_cuCtxPushCurrent_v2 (0x7be4606e8b80)
0244:trace:nvcuda:wine_cuImportExternalSemaphore (0x138c94910, 0x8136df40, 10)
0244:trace:nvcuda:wine_cuCtxPopCurrent_v2 (0x8136df10)
0244:trace:nvcuda:wine_cuCtxPushCurrent_v2 (0x7be4606e8b80)
0244:trace:nvcuda:wine_cuImportExternalSemaphore (0x138c94960, 0x8136df40, 10)
0244:trace:nvcuda:wine_cuCtxPopCurrent_v2 (0x8136df10)
0244:trace:nvcuda:wine_cuCtxPushCurrent_v2 (0x7be4606e8b80)
0244:trace:nvcuda:wine_cuWaitExternalSemaphoresAsync (0x8136d850, 0x8136d890, 1, 0x7be46109bdf0)
0244:trace:nvcuda:wine_cuCtxPopCurrent_v2 (0x9ca76e18)
0244:trace:nvcuda:wine_cuCtxPushCurrent_v2 (0x7be4606e8b80)
0244:trace:nvcuda:ContextStorage_Get (0x8136d6b0, (nil), 0xdd5a940)
0244:trace:nvcuda:wine_cuLaunchKernel (0x7be4614f3670, 10, 45, 1, 32, 16, 1, 0, 0x7be46109bdf0, 0x8136dac0, (nil)),
0244:trace:nvcuda:ContextStorage_Get (0x8136d6b0, (nil), 0xdd5a940)
0244:trace:nvcuda:wine_cuLaunchKernel (0x7be4614f3670, 10, 45, 1, 32, 16, 1, 0, 0x7be46109bdf0, 0x8136dac0, (nil)),

That ContextStorage_Get bit could very well be iffy too, as it is one of those hidden APIs, but it does not seem to cause issues elsewhere. Then again, I suppose no other game using CUDA is THIS memory hungry. I mean, a game using PhysX that uses like 4-6GB VRAM would probably take a lot longer to go OOM.

@SveSop SveSop reopened this Jan 7, 2025
@shelterx

shelterx commented Jan 8, 2025

I'm a bit further into the game, and it crashes after a while even without nvcuda. While it might be possible to optimize nvcuda, I think it's basically a game engine/driver workaround issue, like @Saancreed said.

@SveSop
Owner Author

SveSop commented Jan 9, 2025

The "memory leakage" where the game seems to .. well.. somewhat allocate or map memory from host-mem -> videomem, possibly due to this workaround seems present without FG. However, it does seem to be a lot WORSE when using framegen/nvcuda implementation. I am not sure if it is a nvcuda bug, in that the game is "better" at freeing/moving memory without it, or if using FG just increases the underlying problem.

@Saancreed I have not found any worsening or improvement using the HANDLE get_shared_resource_kmt_handle(HANDLE shared_resource) function from Vulkan, so I think I might just as well stick with that, as it does seem a bit better to "get" it than to "open" one (create one). Logical, or same sh**? 🤣

@shelterx

shelterx commented Jan 10, 2025

We can only hope Nvidia comes up with a more stable fix for this game; at least we have the workaround for now...

FWIW, I checked the game profile with Nvidia Profile Inspector on Windows; there seem to be a few "quirks" added for this game, and I can't tell what all of them do.

@SveSop
Owner Author

SveSop commented Jan 13, 2025

So, I have moved stuff around a bit, and now the active development of nvcuda is on the master branch, and I am pulling nvenc and nvcuda in as submodules in the nvidia-libs package. That keeps it a bit more tidy, as I do not have to re-commit twice over..

Anyway, the last two commits to nvcuda did seem to help a bit, and I think I am getting half a grasp on why things are not exactly working out of the box. I have done a lot of testing, and I believe it is not easily solved until NVIDIA does a couple of things differently. One of them, I THINK, would be to have the same CUDA "sysmem fallback policy" you can choose in Windows. There you have the option of letting CUDA allocations fall back to sysmem, so you do not run OOM that easily.
That is one part, but imo this could probably just lead to even more stuttering and eventually crashing anyway (when you run out of sysmem, I suppose).

Another part is the way we "translate" win32 handles -> fd handles. This ended up not being too hard after @Saancreed came up with that idea, and I have done a bit of testing; however, I do believe this is only half the story.

Per the CUDA documentation, those two "types" of handles are treated in completely opposite ways. Yay!

  If ::CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC::type is ::CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_TIMELINE_SEMAPHORE_FD, then ::CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC::handle::fd must be a valid file descriptor referencing a synchronization object. Ownership of the file descriptor is transferred to the CUDA driver when the handle is imported successfully. Performing any operations on the file descriptor after it is imported results in undefined behavior.

So, this I would think means what it says - CUDA claims "ownership" of the FD handle.

  If ::CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC::type is ::CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_TIMELINE_SEMAPHORE_WIN32, then exactly one of ::CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC::handle::win32::handle and ::CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC::handle::win32::name must not be NULL. If ::CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC::handle::win32::handle is not NULL, then it must represent a valid shared NT handle that references a synchronization object. Ownership of this handle is not transferred to CUDA after the import operation, so the application must release the handle using the appropriate system call. If ::CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC::handle::win32::name is not NULL, then it must name a valid synchronization object.

But in the case of a win32 type of handle, the ownership is NOT transferred to CUDA! So, completely opposite. This probably also means that the game/app is supposed to manage its lifetime, and CUDA just uses it for what it needs internally. No problems there, as long as the app does its cleanups and whatnot...
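Side by side, the asymmetry is easy to see: the import call is identical, and only the descriptor type decides who owns the handle afterwards. A minimal sketch following the documentation quoted above:

#include <cuda.h>

/* FD flavor: on success, libcuda takes ownership of the descriptor. */
CUresult import_sem_fd(CUexternalSemaphore *out, int fd)
{
    CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC desc = { 0 };

    desc.type = CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_TIMELINE_SEMAPHORE_FD;
    desc.handle.fd = fd; /* consumed by the driver on success */
    return cuImportExternalSemaphore(out, &desc);
}

/* Win32 flavor: ownership stays with the application, which must
 * still CloseHandle() the handle itself later. */
CUresult import_sem_win32(CUexternalSemaphore *out, void *handle)
{
    CUDA_EXTERNAL_SEMAPHORE_HANDLE_DESC desc = { 0 };

    desc.type = CU_EXTERNAL_SEMAPHORE_HANDLE_TYPE_TIMELINE_SEMAPHORE_WIN32;
    desc.handle.win32.handle = handle; /* NOT consumed by the driver */
    return cuImportExternalSemaphore(out, &desc);
}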

The issue here, I THINK, could be in the way CUDA is supposed to "free" the resource.. and whether or how wineserver figures that out, I do not know.. wineserver/winevulkan may do it correctly but completely out of sync with CUDA, and/or CUDA could internally attempt to free it and not be able to? I do not know...

I would also think that one of the reasons win32 timeline semaphores work with winevulkan could be the Vulkan extension in use as well, but we do not have any such extension for CUDA. There is a note about KMT handles in the winevulkan source too:

case VK_EXTERNAL_MEMORY_HANDLE_TYPE_D3D11_TEXTURE_KMT_BIT:
/* FIXME: the spec says that device memory imported from a KMT handle doesn't keep a reference to the underyling payload.
This means that in cases where on windows an application leaks VkDeviceMemory objects, we leak the full payload. To
fix this, we would need wine_dev_mem objects to store no reference to the payload, that means no host VkDeviceMemory
object (as objects imported from FDs hold a reference to the payload), and no win32 handle to the object. We would then
extend make_vulkan to have the thunks converting wine_dev_mem to native handles open the VkDeviceMemory from the KMT
handle, use it in the host function, then close it again. */

Something that kind of indicates to me that it is not trouble-free there either.

I will keep stabbing in the dark with this, but my knowledge is frightfully limited, so for now I suppose the latest hack is the best I've got 🤣

@Saancreed
Contributor

The issue here, I THINK, could be in the way CUDA is supposed to "free" the resource.. and whether or how wineserver figures that out, I do not know.. wineserver/winevulkan may do it correctly but completely out of sync with CUDA, and/or CUDA could internally attempt to free it and not be able to? I do not know...

Yeah, this was one thing I was also not sure about. With the difference being that importing Win32 handles does not consume them but importing FDs does, I expected to see some issues in a scenario like this (but keep in mind, I don't remember exactly what Winevulkan's shared resources patchset does, so some of this is complete guesswork):

  • Application asks winevulkan for an export handle to a created semaphore
    • winevulkan asks Vulkan driver for an export FD and gets FD (let's call it fd_1)
    • winevulkan asks wineserver / sharedgpures to wrap that FD into a Win32 handle and gets a handle (let's call it handle_1)
    • Winevulkan gives handle_1 to the application
  • Application asks nvcuda to import semaphore from handle_1
    • nvcuda asks wineserver / sharedgpures to unwrap handle_1 and retrieve underlying FD and gets fd_2 which may or may not be exactly the same FD as fd_1 (depending on whether wineserver duplicates the FD or not)
    • nvcuda asks libcuda to import semaphore from fd_2, and libcuda consumes that FD
  • Application expects handle_1 to remain usable because nvcuda is not supposed to consume it, but we just allowed libcuda to consume some FD related to that handle.

What if fd_2 was indeed fd_1, and we are supposed to pass dup(fd_2) to libcuda instead of fd_2 directly, to prevent fd_1 from being consumed? But if that were the case, I'd expect to see crashes related to usage of an invalid FD with the current revision, not increasing memory usage.

😩
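For concreteness, the dup()-before-import idea would look roughly like this sketch (reusing the FD import helper sketched earlier; whether wineserver actually hands back fd_1 itself is the open question):

#include <unistd.h>
#include <cuda.h>

/* Helper from the earlier sketch. */
extern CUresult import_sem_fd(CUexternalSemaphore *out, int fd);

/* Hand libcuda a duplicate so that consuming it cannot invalidate the fd
 * backing the application's Win32 handle. */
CUresult import_without_consuming(CUexternalSemaphore *out, int wineserver_fd)
{
    int dup_fd = dup(wineserver_fd);
    CUresult ret;

    if (dup_fd < 0)
        return CUDA_ERROR_INVALID_HANDLE;
    ret = import_sem_fd(out, dup_fd);
    if (ret != CUDA_SUCCESS)
        close(dup_fd); /* only ours to clean up if the import failed */
    return ret;
}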

@SveSop
Owner Author

SveSop commented Jan 14, 2025

Yeah, it's a pickle...
I duplicated the win32 handle:
if(NtDuplicateObject(NtCurrentProcess(), win32_handle, NtCurrentProcess(), &new_handle, 0, 0, DUPLICATE_SAME_ACCESS)) and used that to retrieve an FD handle, but maybe that was not the correct way to do it.

I was hoping wineserver would be OK with duplicating like this. It did not fix the stuttering or the memory leak, so it ended up being the same anyway. That's one of the reasons I believe it is not just one issue we are dealing with here, although I am sure it is one of them. I have not tested extensively with the CUDA "fallback to sysmem" set to disabled in Windows, so I am going to do some more testing there and see if I end up OOM there as well. I still do not think the problem would be as big there, given that we need this __GL_13ebad workaround thingy just to run the game with Wine.

The CUDA sysmem fallback thing is something ppl have requested for a long time, as it is not only for gaming but also for other CUDA apps that easily run OOM on Linux but work just fine on Windows. Hopefully things could improve a bit if that were to come around 👍

Tracking memory allocations with nvidia-smi or other tools seems impossible, as they will NOT show things like shared memory or managed memory (cuMemAllocManaged), so you could end up OOM with only a few GB of video memory "used". I do suspect this __GL_13ebad quirk is using some hackery here that perhaps hides some allocations. Not to mention the ever-present culprit of fragmentation due to frequent CUDA allocations (imports?) that may be freed at a MUCH slower rate than on Windows.

As I understand this CUDA Frame Generation thing (until DLSS 4.0 comes, I guess): let's say Vulkan renders/generates 2 images, Image_1 and Image_2; CUDA then uses "AI" to generate an Image_2.5 out of these two images, so you gain 30%+ more fps (don't get hung up on the math here). This means that if you stand perfectly still with nothing moving or otherwise happening, it will possibly re-use the same two images over and over. I think I can see this: in scenes where things are slow, or you stand still looking at a wall, the calls to nvOFExecute slow down. However, with movement or more things happening, the generation speeds up (more calls).

How many "saved frames for later use" is supposed to be in vmem? Dunno.. Is it "always" generating a new image? Dunno... Aggressively calling cuDestroyExternalSemaphore to destroy those previous extSem_out semaphores did atleast seem to stagger the excessive memory usage, but i will not claim it is solved, since doing so in a multithreaded manner meant i had to protect this with a mutex 😱 And locking with a mutex possibly every 3 frames (or whatever it may be) rendered is not exactly a fps or latency boost 😏

@Saancreed
Contributor

Saancreed commented Jan 27, 2025

Fun fact: with DLSS v310 snippets borrowed from Cyberpunk 2077's latest update, Frame Gen appears to be working in Indy without anything CUDA-related having to be installed.

[screenshot]

(Something something Optical Flow Accelerators are no longer used, which likely means the same for CUDA interop if not the entire NVOFAPI library. At least from a quick look at Proton logs, nvcuda.dll is loaded but never called into while nvofapi64.dll is just never loaded at all.)

(Using R570 driver currently requires Proton Experimental bleeding-edge, but I expect this to be working even with R565, although I didn't test that particular scenario.)

@SveSop
Owner Author

SveSop commented Jan 28, 2025

https://www.techpowerup.com/331322/nvidias-frame-generation-technology-could-come-to-geforce-rtx-30-series

So.. This is supposedly the new DLSS 4.0 then.. and as it says in the article - NVIDIA has moved away from OpticalFlow usage.

I see that the Great Circle "update 3" notes say that they will add support for Blackwell GPUs (50xx), which will hopefully also update the game's DLSS/sl.xx DLLs to this new version.
I don't mind that at all, as it means we can happily drop the libnvidia-opticalflow.so workaround 👍

Although I admit it is still a "typical NVIDIA" versioning scheme to use v310.xx for the DLSS4 binaries.. or whatever... 🤣

It does work, although I do get quite a few spikes. Gonna do some comparisons with Windows.

PS. In case ppl do not own Cyberpunk - the files can be downloaded as listed here in this reddit post:
https://www.reddit.com/r/nvidia/comments/1i82rp6/dlss_4_dlls_from_cyberpunk_patch_221/

Also note that the streamline files that are needed are listed in that post under "Edit 5".

@Saancreed
Copy link
Contributor

I see that the Great Circle "update 3" notes say that they will add support for Blackwell GPUs (50xx), which will hopefully also update the game's DLSS/sl.xx DLLs to this new version.

At the very least we should get a showcase of upcoming Blackwell-exclusive VK_NV_ray_tracing_linear_swept_spheres since Indy was advertised to add RTX Hair thingy soon… but yeah, would be weird if DLSS snippets weren't updated at the same time. But even if they aren't, we could probably still figure something out with some liberal usage of PROTON_ENABLE_NGX_UPDATER.

Although I admit it is still a "typical NVIDIA" versioning scheme to use v310.xx for the DLSS4 binaries.. or whatever... 🤣

Maybe all this time we were supposed to be adding all the digits together 🙃

@shelterx

shelterx commented Jan 29, 2025

Hm, the game started fine and played fine with those files added to the streamline directory, but as soon as I enabled FG, the game went to a black screen and crashed.

@Saancreed
Contributor

Why is your setup so cursed…

Proton logs please?

@jp7677

jp7677 commented Jan 29, 2025

From jp7677/dxvk-nvapi#245 (comment), from what I understood (but you should know better ;) ), the black screen issue also needs your newer CUDA endpoints proof of concept (jp7677/dxvk-nvapi#245 (comment)).

Edit: ah, forget what I said, those newer endpoints are only relevant for D3D12.

@shelterx

Heh, OK, I'll wait till that gets sorted out. Feels like it's too much of a WIP right now.

@SveSop
Owner Author

SveSop commented Jan 30, 2025

Heh, OK, I'll wait till that gets sorted out. Feels like it's too much of a WIP right now.

I did not immediately get this to work either, so I dunno whether it is as simple as copying all the .dll's from Cyberpunk or not.. but even on Windows I ended up hard-rebooting after doing that! However, I downloaded the DLSS Swapper tool for Windows and used that.. then copied the "streamline" folder over to my Linux Steam version of Great Circle.. and that did work 👍

I'm attaching a .zip of the streamline folder with the working files here:
https://drive.google.com/file/d/1gOfg4-zrMtQi89ZNC0A_w-UULeFLYR7v/view?usp=sharing

Then make sure to delete $HOME/.steam/steam/steamapps/common/Proton - Experimental and do a "verify integrity" on the Proton "tool" using Steam, so you download a fresh version and make sure there are no old crud leftovers..
Ofc also delete the game prefix $HOME/.steam/steam/steamapps/compatdata/2677660

Then replace the "streamline" folder in the game directory with the one you download above.

If that still does not work AT ALL.. I have no ideas other than some special "cachy" tweaks, or perhaps some hardware issue.. I dunno.

You still have to run the game with __GL_13ebad=0x1 DXVK_NVAPI_GPU_ARCH=AD100

@shelterx

shelterx commented Jan 30, 2025

@SveSop Yeah, thanks for the explanation. The main issue was that I wasn't using Proton Experimental bleeding-edge. When I switched to that, FG works fine :) Also thanks @jp7677 for mentioning the DLSS4 FG issue.
Btw, with the 570.86.15 driver __GL_13ebad=0x1 is no longer needed.

@SveSop
Owner Author

SveSop commented Jan 30, 2025

Jeez.. what the bloody f is a "datacenter driver"? So.. not only is the CUDA driver 570.85.10 "too old" - but it's still not released as a standalone driver.. now one needs to get some "datacenter driver" to be up to date?

lulz.. I mean.. sheesh...

@Saancreed
Contributor

I wouldn't worry about it, the actual GeForce R570 is supposed to be released later today.

@shelterx

shelterx commented Jan 30, 2025

I heavily suspect the R570 release will be the same driver anyway (570.86.15), but who knows.. :)
EDIT:
Close enough: 570.86.16 ;)

@SveSop
Owner Author

SveSop commented Feb 2, 2025

Right.. so as this seems to work even better than the nvofapi workaround, I dropped that, and we made some upstream dxvk-nvapi changes just in case.

This means that IF you want to use Frame Generation with this game - until they hopefully update it - you need to copy the DLSS4 binaries, either from a download or from another game. The DLSS Swapper software that "everybody" is using seems to indicate that this is very much done for a lot of games on Windows.

So, I feel confident that this issue can be worked around using this method, and I'm hopeful that it will be completely solved in the next game update, which their patch notes indicate should add support for RTX 50xx cards (i.e. an update to DLSS).

Still the occasional stutter, and possibly over-use of VRAM, but the stuttering seems a bit less with DLSS4, and crashing happened here and there anyway.. all in all a win.

Thanks for helping out with this long issue. Glad it started, because I have actually enjoyed playing that game 👍

@SveSop SveSop closed this as completed Feb 2, 2025