Crash when we get CUDA_ERROR_ILLEGAL_ADDRESS #1921
Comments
Saw this again, prioritizing the fix.
As already mentioned in Discord, the plan is to also classify this error (which shows up in go-livepeer as an "Unknown error" right now) as non-retryable, so that a B does not continuously retry the segment triggering this error with different Os.
Does that mean classifying every "unknown error" as non-retryable? Are we confident those errors are only generated in cases where segments can't be transcoded? Seems like something that could happen randomly for other reasons from time to time.
Hmm, good point. I just checked and we do get an "Unknown error" for things like GPU OOM, where it makes sense to retry the segment on other transcoders. I'll try to propagate the CUDA error retvals as something other than "Unknown error" to better distinguish between them; currently they all reach go-livepeer as Unknown.
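To make the retry distinction above concrete, here is a minimal sketch of how distinct error values could drive the retry decision. All names here (`ErrGPUOutOfMemory`, `ErrCUDAIllegalAddress`, `ErrUnknown`, `ShouldRetry`, `wrapSegmentErr`) are hypothetical and are not taken from go-livepeer or LPMS; it only assumes the transcoder can surface the CUDA failure as something other than a generic unknown error.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors for illustration only; the actual
// go-livepeer / LPMS error values may be named and defined differently.
var (
	ErrGPUOutOfMemory     = errors.New("transcoder: GPU out of memory")
	ErrCUDAIllegalAddress = errors.New("transcoder: CUDA_ERROR_ILLEGAL_ADDRESS")
	ErrUnknown            = errors.New("transcoder: unknown error")
)

// ShouldRetry reports whether the segment is worth resending to a
// different transcoder. Only the illegal-address error is treated as
// non-retryable; everything else, including unknown errors and GPU OOM,
// is assumed to be possibly transient.
func ShouldRetry(err error) bool {
	switch {
	case err == nil:
		return false // nothing failed, nothing to retry
	case errors.Is(err, ErrCUDAIllegalAddress):
		return false // this segment triggers the crash; do not resend it
	default:
		return true
	}
}

// wrapSegmentErr adds segment context while keeping the sentinel
// reachable through errors.Is.
func wrapSegmentErr(seqNo int, err error) error {
	return fmt.Errorf("segment %d: %w", seqNo, err)
}
```

With this shape, `ShouldRetry(wrapSegmentErr(7, ErrGPUOutOfMemory))` is true and `ShouldRetry(wrapSegmentErr(7, ErrCUDAIllegalAddress))` is false, so a generic unknown error stays retryable, which addresses the concern raised in the thread.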
We now `panic` if CUDA_ERROR_ILLEGAL_ADDRESS is encountered, as it is an unrecoverable error. See #1921 for details.
Related: livepeer/lpms#239
My understanding of the latest `CUDA_ERROR_ILLEGAL_ADDRESS` errors we're seeing: CUDA itself has gotten into some sort of bad state and any further CUDA calls will fail. From Nvidia docs (thanks @jailuthra):

So in the event we hit `CUDA_ERROR_ILLEGAL_ADDRESS`, we should just `panic()` and let the local environment restart the node. No point in continuing to run if we can't transcode anymore.
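As a rough illustration of the proposal, the sketch below panics when a transcode error matches the illegal-address case and passes every other error back to the caller. The names `errIllegalAddress` and `checkFatal` are hypothetical, and the sketch assumes the CUDA return value has already been mapped to a distinct Go error; it is not the actual go-livepeer change.

```go
package main

import (
	"errors"
	"fmt"
)

// errIllegalAddress is a hypothetical sentinel standing in for whatever
// error the transcoder returns when CUDA reports
// CUDA_ERROR_ILLEGAL_ADDRESS.
var errIllegalAddress = errors.New("CUDA_ERROR_ILLEGAL_ADDRESS")

// checkFatal panics if err indicates CUDA can no longer be used, so that
// a process supervisor can restart the node. Any other error (or nil)
// is returned unchanged for normal handling.
func checkFatal(err error) error {
	if errors.Is(err, errIllegalAddress) {
		panic(fmt.Sprintf("unrecoverable transcoder error: %v", err))
	}
	return err
}
```

A caller would wrap each transcode result, for example `err = checkFatal(err)`, and rely on the local environment (systemd, Docker restart policy, or similar) to bring the node back up after the panic.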