Segment violation in libnvcuvid.so.1 #171

Open
darkdarkdragon opened this issue Jan 29, 2020 · 4 comments

@darkdarkdragon
Contributor

When go-livepeer is under heavy load and constantly short of video memory, the node often crashes with a segmentation fault.

Stack trace:

#0  0x00007fff8415d7e0 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#1  0x00007fff8415d952 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#2  0x00007fff8415d9ea in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#3  0x00007fff841147c6 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#4  0x00007fff8412974b in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#5  0x00007fff8410d665 in ?? () from /usr/lib/x86_64-linux-gnu/libnvcuvid.so.1
#6  0x00007fff381e70f3 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1
#7  0x00007fff381e26da in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1
#8  0x00007fff381f1499 in ?? () from /usr/lib/x86_64-linux-gnu/libnvidia-encode.so.1
#9  0x0000000000496c75 in nvenc_setup_encoder (avctx=avctx@entry=0x7fffcc43cc40) at libavcodec/nvenc.c:1259
#10 0x0000000000498758 in ff_nvenc_encode_init (avctx=0x7fffcc43cc40) at libavcodec/nvenc.c:1553
#11 0x00000000013cd05c in avcodec_open2 (avctx=avctx@entry=0x7fffcc43cc40, codec=codec@entry=0x2c06740 <ff_h264_nvenc_encoder>, options=0xc008cde1d0) at libavcodec/utils.c:951
#12 0x0000000000fd4200 in open_output (ictx=0x7fffe8001598, octx=0x7fffe80015f8) at lpms_ffmpeg.c:693
#13 transcode (h=h@entry=0x7fffe8001590, inp=inp@entry=0xc00000ff60, params=params@entry=0xc008cde180, results=results@entry=0xc0000c2410, decoded_results=decoded_results@entry=0xc0000c2440) at lpms_ffmpeg.c:1145
#14 0x0000000000fd4d68 in lpms_transcode (inp=0xc00000ff60, params=0xc008cde180, results=0xc0000c2410, nb_outputs=1, decoded_results=0xc0000c2440) at lpms_ffmpeg.c:1308
#15 0x0000000000fd1376 in _cgo_f32e5de116c8_Cfunc_lpms_transcode (v=0xc000079938) at cgo-gcc-prolog:140
#16 0x00000000004fedd0 in runtime.asmcgocall () at /usr/lib/go-1.13/src/runtime/asm_amd64.s:655
#17 0x0000000000000040 in ?? ()
#18 0x0000000001c32e80 in type.* ()
#19 0x00000000004fb401 in runtime.(*mheap).setSpan (h=<optimized out>, base=0, s=0xc000079938) at /usr/lib/go-1.13/src/runtime/mheap.go:1143
#20 runtime.(*mheap).scavengeSplit.func1 (s=0x4d3600 <runtime.mstart>) at /usr/lib/go-1.13/src/runtime/mheap.go:1459
#21 0x000000c0002f1980 in ?? ()
#22 0x00000000004d3600 in ?? () at /usr/lib/go-1.13/src/runtime/proc.go:1080
#23 0x0000000000000000 in ?? ()

nvenc.c:1259 is:

    nv_status = p_nvenc->nvEncInitializeEncoder(ctx->nvencoder, &ctx->init_encode_params);

stack_with_variable.txt

I think it is one of:

  • We're not handling some errors correctly and, as a result, pass invalid data down to the Nvidia driver, which leads to the segmentation fault
  • FFmpeg's code is not handling errors correctly and passes invalid data to the driver
  • It is simply a bug in Nvidia's code
@darkdarkdragon
Contributor Author

@j0sh What do you think about this one? It was hard to reproduce in GCP, but on the rig with 1660s I was hitting it often during my testing.

@j0sh
Collaborator

j0sh commented Jan 29, 2020

Probably the moral of the story is, "don't overwhelm the system" 😄

Combined with #158, it sounds like it may be a good idea to put reasonable limits in somewhere until we can spend the time to narrow this down further.

How many streams / what configuration until you started seeing segfaults? I've also seen hangs on the 1660 a couple times (probably the same problem as #158, but not certain yet). Unfortunately the hangs have been under relatively light load such as 4 input x 4 output renditions x 8 cards (128 encodes total for the system, but 16 encodes per card and 4 decodes).

I have a suspicion that something is weird with the 1660 rig anyway because it's about 2x slower transcoding compared to the 1070s, despite having better all-around hardware specs.
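One way the "reasonable limits" idea could look is a per-GPU cap on concurrent sessions. The following is only a minimal sketch: `SessionLimiter`, `NewSessionLimiter`, `TryAcquire` and `Release` are hypothetical names, not part of go-livepeer or LPMS, and the cap of 16 simply mirrors the per-card encode count mentioned above. It is not a value known to be safe, since hangs were seen at that load.

```go
package main

import "fmt"

// SessionLimiter caps how many transcode sessions may run on one GPU at once.
// Hypothetical helper for illustration; not part of go-livepeer or LPMS.
type SessionLimiter struct {
	slots chan struct{} // buffered channel used as a counting semaphore
}

// NewSessionLimiter returns a limiter that admits at most max sessions.
func NewSessionLimiter(max int) *SessionLimiter {
	return &SessionLimiter{slots: make(chan struct{}, max)}
}

// TryAcquire reserves a slot without blocking.
// It returns false when the GPU is already at its limit.
func (l *SessionLimiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot previously reserved with TryAcquire.
// Releasing with no slot held is ignored rather than blocking.
func (l *SessionLimiter) Release() {
	select {
	case <-l.slots:
	default:
	}
}

func main() {
	// Assumed cap: 16 sessions per card (illustrative only).
	gpu := NewSessionLimiter(16)

	accepted, rejected := 0, 0
	for i := 0; i < 20; i++ {
		if gpu.TryAcquire() {
			accepted++
		} else {
			rejected++
		}
	}
	fmt.Println("accepted:", accepted, "rejected:", rejected)
}
```

A caller would reject or requeue the job when `TryAcquire` returns false, instead of starting a session that might fail inside the driver.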

@darkdarkdragon
Contributor Author

Probably the moral of the story is, "don't overwhelm the system" 😄

Yep, but the problem here is that these issues manifest themselves when there is not enough video memory, and we don't have a way to constrain the system's load by video memory.
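To make the missing piece concrete, constraining load by video memory could be approximated with simple bookkeeping: each session reserves an estimated amount before it starts and returns it when it ends. This is a rough sketch under stated assumptions. `MemoryBudget`, `Reserve`, `Free` and `ErrNoVideoMemory` are hypothetical names, the per-session estimate and card size are made-up numbers, and nothing here queries the driver for actual usage.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrNoVideoMemory is returned when a reservation would exceed the budget.
var ErrNoVideoMemory = errors.New("estimated video memory budget exceeded")

// MemoryBudget tracks estimated video memory reservations for one GPU.
// Hypothetical helper; it does bookkeeping only and never talks to the driver.
type MemoryBudget struct {
	mu       sync.Mutex
	totalMiB uint64
	usedMiB  uint64
}

// NewMemoryBudget creates a budget of totalMiB mebibytes.
func NewMemoryBudget(totalMiB uint64) *MemoryBudget {
	return &MemoryBudget{totalMiB: totalMiB}
}

// Reserve claims mib mebibytes, or fails without changing the budget.
func (b *MemoryBudget) Reserve(mib uint64) error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if mib > b.totalMiB-b.usedMiB {
		return ErrNoVideoMemory
	}
	b.usedMiB += mib
	return nil
}

// Free returns mib mebibytes to the budget, clamping at zero.
func (b *MemoryBudget) Free(mib uint64) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if mib > b.usedMiB {
		mib = b.usedMiB
	}
	b.usedMiB -= mib
}

func main() {
	// Assumed numbers: a 6144 MiB card and 500 MiB per session.
	budget := NewMemoryBudget(6144)
	const perSessionMiB = 500

	admitted, refused := 0, 0
	for i := 0; i < 15; i++ {
		if err := budget.Reserve(perSessionMiB); err != nil {
			refused++
			continue
		}
		admitted++
	}
	fmt.Println("admitted:", admitted, "refused:", refused)
}
```

The hard part in practice is the per-session estimate, which depends on resolution, codec settings and the number of renditions; the sketch sidesteps that by taking the estimate as an input.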

How many streams / what configuration until you started seeing segfaults?

I don't remember (

I have a suspicion that something is weird with the 1660 rig anyway because it's about 2x slower transcoding compared to the 1070s

Strange, for me the speed of the 1660 and the 1070 was the same.

@darkdarkdragon
Contributor Author

We just hit a segmentation violation in AC (in the mainnet orchestrator).

Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi fatal error: unexpected signal during runtime execution
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi [signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x7f499c8e49b7]
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi runtime stack:
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi runtime.throw(0x1c973d6, 0x2a)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi 	/usr/lib/go-1.13/src/runtime/panic.go:774 +0x72
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi runtime.sigpanic()
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi 	/usr/lib/go-1.13/src/runtime/signal_unix.go:378 +0x47c
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi goroutine 623 [syscall]:
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi runtime.cgocall(0xf76d00, 0xc0002bc8a0, 0xc0002bc8b0)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi 	/usr/lib/go-1.13/src/runtime/cgocall.go:128 +0x5b fp=0xc0002bc870 sp=0xc0002bc838 pc=0x49c44b
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi github.com/livepeer/lpms/ffmpeg._Cfunc_lpms_transcode(0xc0007bd000, 0xc000b6a000, 0xc0000b1fb0, 0x3, 0xc0007fcad0, 0xc000000000)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi 	_cgo_gotypes.go:270 +0x4d fp=0xc0002bc8a0 sp=0xc0002bc870 pc=0xc2301d
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi github.com/livepeer/lpms/ffmpeg.(*Transcoder).Transcode.func9(0xc0007bd000, 0xc000b6a000, 0xc0000b1fb0, 0xc000b6a000, 0x3, 0x3, 0xc0007fcad0, 0xc000a2a3b0)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi 	/go/pkg/mod/github.com/livepeer/[email protected]/ffmpeg/ffmpeg.go:290 +0xac fp=0xc0002bc8e0 sp=0xc0002bc8a0 pc=0xc2624c
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi github.com/livepeer/lpms/ffmpeg.(*Transcoder).Transcode(0xc0007bcf80, 0xc0002bce08, 0xc0000d4240, 0x3, 0x3, 0x0, 0x0, 0x0)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi 	/go/pkg/mod/github.com/livepeer/[email protected]/ffmpeg/ffmpeg.go:290 +0xabc fp=0xc0002bcda8 sp=0xc0002bc8e0 pc=0xc2432c
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi github.com/livepeer/go-livepeer/core.(*NvidiaTranscoder).Transcode(0xc000b6d890, 0xc000b2e450, 0x24, 0xc00067cb40, 0x4a, 0xc0003397a0, 0x3, 0x3, 0xc000b220a0, 0xc0002daf78, ...)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi 	/build/core/transcoder.go:74 +0x116 fp=0xc0002bce40 sp=0xc0002bcda8 pc=0xf0de16
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi github.com/livepeer/go-livepeer/core.(*transcoderSession).loop(0xc000b6d8c0)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi 	/build/core/lb.go:183 +0x1d4 fp=0xc0002bcfb8 sp=0xc0002bce40 pc=0xf04734
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi github.com/livepeer/go-livepeer/core.(*LoadBalancingTranscoder).createSession.func2(0xc000b6d8c0, 0xc0003bbf00)
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi 	/build/core/lb.go:107 +0x2b fp=0xc0002bcfd0 sp=0xc0002bcfb8 pc=0xf0ebeb
Jan 31 06:49:55 orchestrator-1080-55b4b4794-gnrtk chi runtime.goexit()

Looks like it is the same problem.
