Orchestrator "stuck" in transcode loop; many many goroutines open #1624
Comments
Happened again with a whole cluster of Os after some heavy load. @darkdarkdragon thinks it could be the same as livepeer/lpms#158
I've observed this in the past as well. IIRC the transcoding would hang in LPMS somewhere, and as a result the transcode session tracked in go-livepeer would never be cleaned up even though no additional segments were coming in. The end result was that the O would have a session count of X when its session count really should have been 0. I've also seen another case where the transcoding would hang in LPMS and then, after a looooong time, a bunch of log messages would come out simultaneously.
I wonder if we could add a failsafe that shuts down the session after a sufficiently long amount of time or some such.
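A minimal sketch of what such an idle-session failsafe could look like, assuming a hypothetical session registry; the types and names below are illustrative, not go-livepeer's actual ones:

```go
package orch

import (
	"log"
	"sync"
	"time"
)

// session is a hypothetical stand-in for however the orchestrator tracks an
// active transcode session; the real go-livepeer types differ.
type session struct {
	id       string
	lastSeen time.Time
}

type sessionRegistry struct {
	mu       sync.Mutex
	sessions map[string]*session
}

// touch records activity (e.g. a newly received segment) for a session.
func (r *sessionRegistry) touch(id string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	if s, ok := r.sessions[id]; ok {
		s.lastSeen = time.Now()
		return
	}
	r.sessions[id] = &session{id: id, lastSeen: time.Now()}
}

// reap periodically removes sessions that have seen no segments for maxIdle,
// so a hung transcode no longer counts against the orchestrator's capacity.
func (r *sessionRegistry) reap(maxIdle, interval time.Duration) {
	for range time.Tick(interval) {
		r.mu.Lock()
		for id, s := range r.sessions {
			if time.Since(s.lastSeen) > maxIdle {
				log.Printf("reaping idle session %s", id)
				delete(r.sessions, id)
			}
		}
		r.mu.Unlock()
	}
}
```

Note this only fixes the bookkeeping, so the O stops reporting itself as capped; as the discussion below points out, it cannot reclaim a goroutine that is stuck inside LPMS.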
A few notes about control flow that might help with debugging and determining what area of the codebase a solution should target:
So, there actually is a timeout mechanism for both the orchestrator transcode loop and the load balancer transcode loop. But what might be happening here is that we never reach the next iteration of the transcode loop in either case, so the cleanup is never triggered. This could be because the transcoding function in LPMS hangs and never returns, which prevents the transcode loop from reaching the select case that triggers a cleanup on context timeout, since in both cases the loop needs to move on to its next run in order to hit that case. This might mean that there needs to be some timeout within the LPMS transcode method so that it can return control to the transcode loop in both the orchestrator and the load balancer. Or, if we can identify what is causing the LPMS transcode method to never return in the first place, we can address that.
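For illustration, a minimal sketch of how the loop could regain control by wrapping the blocking call in its own goroutine, assuming a placeholder transcodeFn rather than the real LPMS entry point:

```go
package orch

import (
	"context"
	"errors"
	"time"
)

var errTranscodeTimeout = errors.New("transcode timed out")

// transcodeWithTimeout runs a blocking transcode function (e.g. the Cgo call
// into LPMS/FFmpeg) in a separate goroutine so the caller's loop regains
// control after `timeout`, even if the call never returns.
func transcodeWithTimeout(ctx context.Context, timeout time.Duration, transcodeFn func() error) error {
	ctx, cancel := context.WithTimeout(ctx, timeout)
	defer cancel()

	done := make(chan error, 1) // buffered so the worker can exit even after we stop waiting
	go func() {
		done <- transcodeFn()
	}()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		// Session bookkeeping can be cleaned up here, but the transcodeFn
		// goroutine is still blocked inside Cgo/CUDA; Go has no way to kill it,
		// so whatever GPU memory it holds stays leaked until the process restarts.
		return errTranscodeTimeout
	}
}
```

This is only a partial mitigation: it keeps the loop and the session count healthy, but as the following comments point out, it cannot recover resources held by code that never returns from inside Nvidia's libraries.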
It will not help if control is not being returned from inside Nvidia's code, like it was in my case (from |
Quick update: able to reproduce the memcheck failure. Will also try to update to the latest Arch Linux nvidia driver.
Yep, this is the fallback approach in my mind as well; Eli also suggested something similar yesterday. But I'm still trying to see if there's any possible way to recover within golang (by killing a goroutine after some timer expires in a different thread, if that's easy to do). Thanks everyone for helping out here! This is a fun bug indeed :)
Just want to note that killing the T might be an OK hack to fall back on to temporarily address this problem if necessary, but I don't think O should be able to signal to T to exit via the O/T protocol, because the entity running the O and the T could be different, and we wouldn't want the O operator to be able to arbitrarily kill a T that someone else is running whenever it wants.
@jailuthra That could be solved with permissions - I just think that O has the best view on whether T should be restarted, and has the best chance to do it with minimal service interruption.
@darkdarkdragon Confirmed repro on unpatched drivers in LHR.
If anything, anecdotally, it happens more on the GCP cards. Our LHR region gets stuck more than any other.
Summary

We've seen this issue with varying symptoms on different chipsets, drivers and environments. The commonality in all the cases seen is that Orchestrator capacity reduces over time, which strongly suggests that some resource like memory/CPU/VRAM is not being freed after a stream completes, and thus killing the node usually resets it to normal.

Root cause

While some nodes fail with a segfault when overloaded with streams, others remain alive showing high CPU usage as their goroutines spin. Sometimes, even if an O never gets overloaded, it still slowly leaks GPU VRAM or has some stuck goroutines wasting CPU cycles, but only on particular chipsets. In both these cases the fault seems to lie in a

Because the goroutines get stuck deep in the call stack from Go -> Cgo -> FFmpeg -> CUDA, there is no good place to cleanly recover from, as killing stuck goroutines is not feasible unless it reaches a particular golang

Short-term fix

A simple short-term solution is to periodically kill the OT node, which frees up all the background goroutines spawned by the node and frees any GPU VRAM. This is the approach we're trying out at the k8s level, and if it works well we can consider making it part of the node itself until a long-term fix is figured out.

Long-term fix

Ivan found a codebase which mentioned a similar deadlock at

On the other hand, FFmpeg uses

But there are multiple posts (1, 2) on the NVIDIA forums mentioning similar deadlocks around their decoder API. NVDec does offer its own locking functions bundled within the API, which are "a safer alternative to cuCtxPushCurrent and cuCtxPopCurrent" [ref]. We can explore using those locks for the CUDA calls in the FFmpeg code instead of relying on setting the context, but it is unclear why that might be needed for non-threaded use.

Testing Done
The actual stuck goroutine issue is harder to track, as it only causes GPU VRAM to not be freed even after the stream is over and the rest of the objects are cleaned up. But we've still seen it happen reliably on these devices/drivers -
We still haven't seen it on the GTX 1080 with (?) driver version.

TODO
Conclusion

The bug is definitely either an Nvidia chipset/driver issue, or undefined use of their decoder API by ffmpeg causing a race condition somehow. So for now we'll try to mitigate it by just killing the OTs periodically, and icebox the long-term fix, revisiting it if this gets particularly pesky later on.
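If the periodic restart does get built into the node itself, here is a minimal sketch of an in-process watchdog, assuming an external supervisor (k8s, systemd, docker --restart) brings the node back up; the function name, uptime limit, and exit code are illustrative, not project defaults:

```go
package orch

import (
	"log"
	"os"
	"time"
)

// startRestartWatchdog exits the process after maxUptime so the supervisor
// restarts it with a clean GPU state, clearing any leaked VRAM and any
// goroutines stuck inside Cgo/CUDA.
func startRestartWatchdog(maxUptime time.Duration) {
	go func() {
		time.Sleep(maxUptime)
		log.Printf("watchdog: max uptime %s reached, exiting for a clean restart", maxUptime)
		os.Exit(1)
	}()
}
```

Calling something like startRestartWatchdog(12 * time.Hour) early in node startup would mimic the k8s-level restart without depending on the cluster config; draining active streams before exiting would be the obvious refinement.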
Describe the bug
Orchestrator stopped transcoding, reporting OrchestratorCapped. Stack trace is here. Looks like lots and lots of goroutines waiting on transcoding to complete.
To Reproduce
¯\_(ツ)_/¯
CLI flags for this one were
/usr/bin/livepeer -v=6 -network=offchain -orchestrator=true -transcoder=true -monitor=true -cliAddr=0.0.0.0:7935 -httpAddr=0.0.0.0:443 -serviceAddr=ber-prod-livepeer-orchestrator-0.livepeer.com:443 -maxSessions=200 -nvidia=1,2,3,4,5,6,7 -ethUrl= -ethPassword=/pw.txt -ethOrchAddr= -pricePerUnit=10000 -redeemer=false -redeemerAddr= -orchSecret=