
Crash when we get CUDA_ERROR_ILLEGAL_ADDRESS #1921

Closed
iameli opened this issue Jun 16, 2021 · 4 comments · Fixed by #2057

@iameli
Contributor

iameli commented Jun 16, 2021

Related: livepeer/lpms#239

My understanding of the latest CUDA_ERROR_ILLEGAL_ADDRESS errors we're seeing: CUDA itself has gotten into some sort of bad state and any further CUDA calls will fail. From Nvidia docs (thanks @jailuthra):

> `CUDA_ERROR_ILLEGAL_ADDRESS = 700`
>
> While executing a kernel, the device encountered a load or store instruction on an invalid memory address. This leaves the process in an inconsistent state and any further CUDA work will return the same error. To continue using CUDA, the process must be terminated and relaunched.

So in the event we hit CUDA_ERROR_ILLEGAL_ADDRESS, we should just panic() and let the local environment restart the node. No point in continuing to run if we can't transcode anymore.
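For reference, a minimal Go sketch of that idea (not the actual go-livepeer code; the helper name and the string match on the error text are assumptions):

```go
// Minimal sketch of the proposed behavior, not the actual go-livepeer code.
// The helper name and the string match on the error text are assumptions.
package transcoder

import "strings"

// checkFatalCUDAError panics when a transcode error indicates
// CUDA_ERROR_ILLEGAL_ADDRESS. Per the Nvidia docs quoted above, the process
// is in an inconsistent state and any further CUDA work will fail, so we let
// the local environment (systemd, Docker, etc.) relaunch the node.
func checkFatalCUDAError(err error) {
	if err == nil {
		return
	}
	if strings.Contains(err.Error(), "CUDA_ERROR_ILLEGAL_ADDRESS") {
		panic("unrecoverable CUDA error, node must restart: " + err.Error())
	}
}
```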

@jailuthra
Contributor

Saw this again, prioritizing the fix.

@jailuthra jailuthra self-assigned this Oct 12, 2021
@yondonfu
Member

As already mentioned in Discord, the plan is to also classify this error (which currently shows up in go-livepeer as an "Unknown error") as non-retryable, so that a broadcaster (B) does not continuously retry the segment that triggered this error with different orchestrators (Os).

@iameli
Contributor Author

iameli commented Oct 12, 2021

Does that mean classifying every "unknown error" as non-retryable? Are we confident those errors are only generated in cases where segments can't be transcoded? Seems like something that could happen randomly for other reasons from time to time.

@jailuthra
Contributor

Are we confident those errors are only generated in cases where segments can't be transcoded?

Hmm good point. I just checked and we do get an "Unknown error" for things like GPU OOM, where it makes sense to retry the segment on other transcoders.

I'll try to propagate the CUDA error return values as something other than "Unknown error" to better distinguish between them; currently they all reach go-livepeer as "Unknown".
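Something along these lines on the go-livepeer side, once the transcoder returns distinct messages (the error strings and helper names below are illustrative assumptions, not the real API):

```go
// Rough sketch of distinguishing CUDA errors on the broadcaster side once
// the transcoder propagates them with distinct messages. The error strings
// and helper names are illustrative assumptions, not the real API.
package server

import "strings"

// Errors where retrying the same segment with other orchestrators is
// pointless (the transcoding process itself is broken), as opposed to
// transient failures like GPU OOM where a retry elsewhere can succeed.
var nonRetryableCUDAErrors = []string{
	"CUDA_ERROR_ILLEGAL_ADDRESS",
}

// shouldRetrySegment reports whether a broadcaster should resend the
// segment to a different orchestrator after this transcode error.
func shouldRetrySegment(err error) bool {
	if err == nil {
		return false
	}
	for _, msg := range nonRetryableCUDAErrors {
		if strings.Contains(err.Error(), msg) {
			return false
		}
	}
	return true
}
```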

jailuthra added a commit that referenced this issue Oct 15, 2021
We now `panic` if CUDA_ERROR_ILLEGAL_ADDRESS is encountered, as it is an unrecoverable error.
See #1921 for details.
yondonfu pushed a commit that referenced this issue Oct 25, 2021
We now `panic` if CUDA_ERROR_ILLEGAL_ADDRESS is encountered, as it is an unrecoverable error.
See #1921 for details.