
Orchestrator "stuck" in transcode loop; many many goroutines open #1624

Closed
iameli opened this issue Sep 30, 2020 · 11 comments

@iameli (Contributor) commented Sep 30, 2020

Describe the bug
Orchestrator stopped transcoding, reporting OrchestratorCapped.

Stack trace is here. Looks like lots and lots of goroutines waiting on transcoding to complete.

To Reproduce
¯\_(ツ)_/¯

CLI flags for this one were /usr/bin/livepeer -v=6 -network=offchain -orchestrator=true -transcoder=true -monitor=true -cliAddr=0.0.0.0:7935 -httpAddr=0.0.0.0:443 -serviceAddr=ber-prod-livepeer-orchestrator-0.livepeer.com:443 -maxSessions=200 -nvidia=1,2,3,4,5,6,7 -ethUrl= -ethPassword=/pw.txt -ethOrchAddr= -pricePerUnit=10000 -redeemer=false -redeemerAddr= -orchSecret=

@iameli (Contributor, Author) commented Oct 2, 2020

Happened again with a whole cluster of Os after some heavy load. @darkdarkdragon thinks it could be the same as livepeer/lpms#158

@yondonfu (Member) commented Oct 2, 2020

I've observed this in the past as well. IIRC the transcoding would hang somewhere in LPMS, and as a result the transcode session tracked in go-livepeer would never be cleaned up even though no additional segments were coming in. The end result was that the O would have a session count of X when its session count really should have been 0.

I've also seen another case where the transcoding would hang in LPMS and then after a looooong time a bunch of log messages come out simultaneously.

@iameli (Contributor, Author) commented Oct 2, 2020

I wonder if we could add a failsafe that shuts down the session after a sufficiently long amount of time, or some such.

@yondonfu (Member) commented Oct 8, 2020

A few notes about control flow that might help with debugging and determining what area of the codebase a solution should target:

  • An orchestrator creates a transcode loop in a goroutine
  • The transcode loop receives new segments to transcode via a channel and then invokes transcodeSeg()
  • If the transcode loop is idle for 1 minute, a cleanup is triggered
  • Under the hood, transcodeSeg() invokes the Transcode() method on the Transcoder interface
  • When we are running an orchestrator that does standalone transcoding, the Transcoder interface is implemented by LoadBalancingTranscoder, which routes segments to different Nvidia devices
  • The LoadBalancingTranscoder also uses a transcode loop run within a goroutine to receive segments via a channel, which are then passed to the actual transcoding method
  • Similar to the orchestrator transcode loop, the transcode loop in LoadBalancingTranscoder will also clean up if no segment is received via the channel within 1 minute
  • Under the hood, the LoadBalancingTranscoder uses the Transcode() method on the NvidiaTranscoder, which makes the call to the LPMS Transcode() method

So, there actually is a timeout mechanism for both the orchestrator transcode loop and the load balancer transcode loop. But what might be happening here is that we never reach the next iteration of the transcode loop in either case, so the cleanup is never triggered. This could happen because the transcoding function in LPMS hangs and never returns: in both cases the transcode loop needs to move on to the next run of the loop in order to hit the select case that triggers a cleanup on context timeout.

This might mean that there needs to be some timeout within the LPMS transcode method so that it can return control to the transcode loop in both the orchestrator and the load balancer. Or, if we can identify what is causing the LPMS transcode method to never return in the first place, then we can address that directly.
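
For illustration, here is a minimal sketch of the loop-plus-idle-timeout pattern described above (names like Segment, transcodeSeg and cleanup are stand-ins, not the actual go-livepeer identifiers):

```go
package main

import (
	"context"
	"log"
	"time"
)

// Segment is a stand-in for a segment of video to transcode.
type Segment struct{ Name string }

// transcodeSeg is a stand-in for the real transcode call. If the underlying
// transcoder (e.g. a Cgo call into LPMS/CUDA) blocks forever, this function
// never returns and the loop below never reaches its next select.
func transcodeSeg(seg *Segment) { log.Printf("transcoding %s", seg.Name) }

// cleanup stands in for releasing the session's resources.
func cleanup() { log.Println("cleaning up idle session") }

// transcodeLoop sketches the idle-timeout pattern: each iteration waits for
// either a new segment or a 1-minute idle timeout that triggers cleanup.
func transcodeLoop(segCh <-chan *Segment) {
	for {
		ctx, cancel := context.WithTimeout(context.Background(), time.Minute)
		select {
		case seg := <-segCh:
			cancel()
			transcodeSeg(seg) // a hang here blocks the loop, so the timeout case can never fire
		case <-ctx.Done():
			cancel()
			cleanup()
			return
		}
	}
}

func main() {
	segCh := make(chan *Segment)
	go transcodeLoop(segCh)
	segCh <- &Segment{Name: "seg-0"}
	time.Sleep(100 * time.Millisecond) // give the sketch a moment to log
}
```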

@darkdarkdragon (Contributor) commented:

This might mean that there needs to be some timeout within the LPMS transcode method so that it can return control

It will not help if control is not being returned from inside Nvidia's code, as it was in my case (from the cuvidDestroyDecoder function). To really clean up in such a case we would need to kill the whole process (maybe killing the thread would be enough, but I don't know how to do that in Go).
The first thing we should find out is whether it is our code that is stuck or some driver/Nvidia code. If it's ours, we will fix it; but if it's not ours, first we should check on not-patched drivers.
If it is not our code, probably the easiest way to work around this would be to run one O and one standalone transcoder per GPU - that way we can periodically kill the whole process to clean up resources without much interruption (we could integrate this into the protocol between O and T - if O detects that T has leaked too many GPU sessions it can signal T to exit, and then k8s will spin up a new T. That way O will coordinate restarts so that at any moment only one T will be out).
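
For illustration, a rough sketch of that restart-coordination idea (purely hypothetical - none of these types or methods exist in go-livepeer), where the O asks at most one T at a time to exit:

```go
package main

import (
	"fmt"
	"time"
)

// transcoderHandle is a hypothetical O-side view of one standalone T.
type transcoderHandle struct {
	id             string
	leakedSessions int
}

// signalExit stands in for an O->T "please exit" message; k8s would then
// spin up a replacement T.
func (t *transcoderHandle) signalExit() {
	fmt.Printf("asking transcoder %s to exit (leaked sessions: %d)\n", t.id, t.leakedSessions)
}

// coordinateRestarts periodically picks the worst offender over a leak
// threshold and restarts only that one T, so at any moment at most one T
// is out of rotation.
func coordinateRestarts(ts []*transcoderHandle, leakThreshold int, interval time.Duration) {
	for range time.Tick(interval) {
		var worst *transcoderHandle
		for _, t := range ts {
			if t.leakedSessions >= leakThreshold && (worst == nil || t.leakedSessions > worst.leakedSessions) {
				worst = t
			}
		}
		if worst != nil {
			worst.signalExit()
			worst.leakedSessions = 0 // assume the replacement T starts clean
		}
	}
}

func main() {
	ts := []*transcoderHandle{{id: "t-gpu-0", leakedSessions: 7}, {id: "t-gpu-1", leakedSessions: 2}}
	go coordinateRestarts(ts, 5, time.Second)
	time.Sleep(2 * time.Second)
}
```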

@jailuthra (Contributor) commented Oct 9, 2020

Quick update: I'm able to reproduce the memcheck failure in cuvidDestroyDecoder (livepeer/lpms#158 (comment)) in both lpms and ffmpeg with Nvidia driver version 440.82, on both patched and non-patched drivers.

Will also try updating to the latest Arch Linux Nvidia driver (450.57) and see how it goes (:crossed_fingers: I'm always scared of updating to bleeding-edge drivers).

that way we can periodically kill the whole process to clean up resources without much interruption

Yep, this is the fallback approach in my mind as well; Eli also suggested something similar yesterday. But I'm still trying to see if there's any possible way to recover within Go (by killing a goroutine after some timer expires in a different thread, if that's easy to do).

Thanks everyone for helping out here! This is a fun bug indeed :)

@jailuthra added the type: bug label on Oct 9, 2020
@yondonfu (Member) commented Oct 9, 2020

we could integrate this in protocol between O and T - if O detects that T leaked too much GPU sessions it can signal T to exit - and then k8s will spin up new T

Just want to note that killing the T might be an OK hack to fall back on to temporarily address this problem if necessary, but I don't think O should be able to signal T to exit via the O/T protocol, because the entity running the O and the entity running the T could be different and we wouldn't want the O operator to be able to arbitrarily kill a T that someone else is running whenever it wants.

@darkdarkdragon (Contributor) commented:

@jailuthra
I think libnvcuvid and libcuda are distributed separately from the drivers, so it would be good to check whether there are newer versions of them.

@yondonfu

O and T could be different and we wouldn't want the O operator to be able to arbitrarily kill the T that someone else is running whenever it wants.

That could be solved with permissions - I just think that O has the best view on when a T should be restarted and has the best chance to do it with minimal service interruption.

@iameli (Contributor, Author) commented Oct 9, 2020

first we should check on not-patched drivers.

@darkdarkdragon Confirmed repro on not-patched drivers in LHR.

@iameli (Contributor, Author) commented Oct 11, 2020

If anything, anecdotally, it happens more on the GCP cards. Our LHR region gets stuck more than any other.

@jailuthra (Contributor) commented Oct 19, 2020

Summary

We've seen this issue with varying symptoms across different chipsets, drivers and environments. The commonality across all observed cases is that orchestrator capacity decreases over time, which strongly suggests that some resource (memory, CPU or VRAM) is not being freed after a stream completes; killing the node usually resets it to normal.

Root cause

While some nodes fail with a segfault when overloaded with streams, others remain alive showing high CPU usage as their go-routines keep spinning. Sometimes, even if an O never gets overloaded, it still slowly leaks GPU VRAM or has stuck go-routines wasting CPU cycles, but only on particular chipsets.

In both of these cases the fault seems to lie in a cuCtxSynchronize call deep in the CUDA libraries, which is triggered when FFmpeg tries to free the NVDEC decoder using the cuvidDestroyDecoder API, as seen previously in livepeer/lpms#158 (comment).

Because the go-routines get stuck deep in the call stack from Go -> Cgo -> FFmpeg -> CUDA, there is no good place to recover cleanly: killing a stuck go-routine is not feasible unless it reaches a particular Go select block where it can listen on a channel carrying a kill signal from outside, which is impossible for us because the control flow is stuck deep in the CUDA libs.
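
To make that limitation concrete, here is a minimal sketch (illustrative only, not go-livepeer code) of the best Go can do around a call that never returns: a timeout lets us stop waiting, but the stuck go-routine and whatever resources it holds still leak.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// runWithTimeout wraps a call that may never return (standing in for a Cgo
// call stuck inside the CUDA libraries) and gives up waiting after d.
func runWithTimeout(f func() error, d time.Duration) error {
	done := make(chan error, 1)
	go func() { done <- f() }() // if f never returns, this goroutine leaks

	select {
	case err := <-done:
		return err
	case <-time.After(d):
		// We can stop waiting, but we cannot stop f: the goroutine stays
		// blocked inside the Cgo/CUDA call and its resources are never freed.
		return errors.New("transcode timed out; stuck goroutine leaked")
	}
}

func main() {
	// Simulate a call that hangs forever, standing in for a stuck
	// cuvidDestroyDecoder/cuCtxSynchronize call.
	hang := func() error { select {} }
	fmt.Println(runWithTimeout(hang, 100*time.Millisecond))
}
```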

Short-term fix

A simple short-term solution is to periodically kill the OT node, which frees all the background go-routines spawned by the node as well as any GPU VRAM. This is the approach we're trying out at the k8s level; if it works well we can consider making it part of the node itself until a long-term fix is figured out.
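
If we did build the periodic restart into the node itself, a minimal (purely hypothetical) sketch could look like the following, with the restart interval just an illustrative value:

```go
package main

import (
	"log"
	"os"
	"time"
)

func main() {
	const maxUptime = 12 * time.Hour // hypothetical restart interval

	// Exit after a maximum uptime so the supervisor (k8s, systemd, etc.)
	// restarts the process, which frees leaked go-routines and GPU VRAM.
	go func() {
		time.Sleep(maxUptime)
		log.Println("max uptime reached; exiting so the supervisor restarts us")
		os.Exit(0)
	}()

	// ... the node's normal work would run here ...
	select {} // block forever in this sketch
}
```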

Long-term fix

Ivan found a codebase which mentions a similar deadlock at cuCtxSynchronize and uses pthread locks to prevent it here. That codebase uses a heavily threaded producer-consumer approach to NVDEC decoding and re-uses CUDA contexts among threads.

On the other hand, FFmpeg uses cuCtxPush/PopCurrent to set the current CUDA context whenever making an API call, and doesn't have a threaded model for CUVID - so it's unclear how this fix can help us directly.

But there are multiple posts (1, 2) on the NVIDIA forums mentioning similar deadlocks around their decoder API.

NVDEC does offer its own locking functions bundled within the API, which are "a safer alternative to cuCtxPushCurrent and cuCtxPopCurrent" [ref]. We can explore using those locks for the CUDA calls in the FFmpeg code instead of relying on setting the context, but it is unclear why that would be needed for non-threaded use.

Testing Done

cuda-memcheck reports an issue with the cuCtxSynchronize line while destroying the decoder on all devices I've tested - even when not overloading an Orchestrator node. This is also reproducible with ffmpeg alone when transcoding using NVDEC+NVENC+CudaScalingFilter.

The actual stuck go-routine issue is harder to track, as it only causes GPU VRAM to not be freed even after the stream is over and the rest of the objects are cleaned up. But we've still seen it happen reliably on these devices/drivers:

Device              Driver Version(s)   Tested with go-livepeer node   Tested with FFmpeg
GTX 1070 / 1070 Ti  ?                   YES                            NO
GTX 1060            440.82 / 445.28     YES                            NO
Tesla T4            ?                   YES                            NO

We still haven't seen it on the GTX 1080 (driver version unknown).

TODO

  • Try to overload the GPU using FFmpeg on the above chipsets and see if the issue is still easily reproducible.
  • Find the driver versions for the above chipsets to isolate a driver version that gives the least trouble.

Conclusion

The bug is definitely either an Nvidia chipset/driver issue or undefined use of their decoder API by ffmpeg causing a race condition somehow. So for now we'll try to mitigate it by just killing the OTs periodically, and icebox the long-term fix, revisiting it if this gets particularly pesky later on.
