vkQueueSubmit/vkQueueSubmit2 returns VK_TIMEOUT when non-primary GPU is asked to wait on non-signaled timeline semaphore. #174

Closed
RogueLogix opened this issue Dec 7, 2023 · 4 comments

@RogueLogix

When a timeline semaphore will be signaled from the same thread after the submit, this line in vk_queue.cpp (and potentially others, but that's where I found it happening while debugging) can have a PAL timeout returned, causing the function to exit early and return VK_TIMEOUT, which is not a valid return code for vkQueueSubmit/vkQueueSubmit2. The timeout also causes other spec compliance issues, but they all appear to be the result of the underlying timeout.

The issue requires a signal operation in the same submit, and only presents itself when the wait semaphore is not yet signaled. Whether or not the submit includes a command buffer does not appear to affect the result.

Reproduced using a device group of two W5700 GPUs on Ubuntu 22.04 through vkQueueSubmit2, with a timeline semaphore wait executed on device index 1 for a semaphore that will be signaled after the submit. The same operation on device index 0 completes as expected once the semaphore is signaled on the host. When each physical device is used in its own single-device device group, both behave correctly.
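
To picture why a timeout is unavoidable here, below is a toy host-side model, not driver code. It assumes the submit path ends up blocking on the semaphore with a timeout instead of deferring the wait to the queue; that is an assumption, since the thread does not show the driver internals. `ToyTimeline` and its methods are hypothetical names made up for this sketch. If the only thread that can signal is the one stuck inside the wait, the wait can only end by timing out.

```cpp
#include <atomic>
#include <chrono>
#include <cstdint>
#include <thread>

// Hypothetical toy model of a timeline semaphore counter (not a Vulkan or PAL type).
class ToyTimeline {
public:
    // Raise the counter to `target` if it is currently lower (the counter never decreases).
    void signal(std::uint64_t target) {
        std::uint64_t current = value_.load();
        while (current < target && !value_.compare_exchange_weak(current, target)) {
            // compare_exchange_weak reloads `current` on failure; retry.
        }
    }

    // Poll until the counter reaches `target` or `timeout` elapses.
    // Returns true if the value was reached, false on timeout.
    bool waitFor(std::uint64_t target, std::chrono::milliseconds timeout) const {
        const auto deadline = std::chrono::steady_clock::now() + timeout;
        while (value_.load() < target) {
            if (std::chrono::steady_clock::now() >= deadline) {
                return false;  // nobody signaled in time
            }
            std::this_thread::yield();
        }
        return true;
    }

private:
    std::atomic<std::uint64_t> value_{0};
};
```

In this model, calling `waitFor(1, ...)` before `signal(1)` on the same thread always returns false, while the same call after `signal(1)` returns true immediately. A submit that only records the wait for the queue to resolve later would return right away and leave the host free to signal.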

If needed, I can provide code that reproduces the issue on my machine.

@RogueLogix
Author

Found the root cause: Semaphore::PopulateInDeviceGroup doesn't carry the timeline flag over to the other devices' semaphores, so the second device does not stall its queue correctly.
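
As a rough illustration of this kind of bug (a toy model only, not the driver's code), the sketch below clones per-device semaphore state for a device group. Everything in it is hypothetical: `SemaphoreInfo`, `populateBuggy`, and `populateFixed` are invented for the example, and the real `Semaphore::PopulateInDeviceGroup` signature and data layout are not shown in this thread.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-device semaphore state; not the driver's real type.
struct SemaphoreInfo {
    bool          isTimeline   = false;  // timeline vs. binary semaphore
    std::uint64_t initialValue = 0;      // initial counter value for a timeline semaphore
};

// Buggy variant: device 0 keeps the original info, but the copies made for the
// other devices start from defaults and never receive the timeline flag.
std::vector<SemaphoreInfo> populateBuggy(const SemaphoreInfo& primary, std::size_t deviceCount) {
    std::vector<SemaphoreInfo> perDevice(deviceCount);
    if (deviceCount > 0) {
        perDevice[0] = primary;
    }
    for (std::size_t i = 1; i < deviceCount; ++i) {
        perDevice[i].initialValue = primary.initialValue;  // isTimeline is not copied
    }
    return perDevice;
}

// Fixed variant: every device gets a full copy of the primary's info, flag included.
std::vector<SemaphoreInfo> populateFixed(const SemaphoreInfo& primary, std::size_t deviceCount) {
    return std::vector<SemaphoreInfo>(deviceCount, primary);
}
```

With two devices and a timeline semaphore as input, `populateBuggy` leaves device 1 with `isTimeline == false`, which matches the symptom of only the non-primary device misbehaving; `populateFixed` keeps the flag on both.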

@lukelmy

lukelmy commented Dec 14, 2023

Hi RogueLogix, could you please provide the source code? Thanks!

@RogueLogix
Author

Here's a gist with a minimal repro .cpp file, as well as the patch I applied locally to fix it:
https://gist.github.com/RogueLogix/6583dddb1a5aa1e7384d0390d87ad290

@jinjianrong
Member

This is fixed in recent releases.
