Reset sx_fbtbc to 0 once we are sure it's haproxy v2 header #322

Open · wants to merge 1 commit into base: next
Conversation

@taodd commented Sep 29, 2024

When receiving a haproxy v2 message, after processing the first fragment, which contains only the proxy header, ::recv may return EAGAIN if the next fragment is not yet ready to read. The function svc_vc_recv then returns, but leaves xd->sx_fbtbc == 0x0d0a0d0a (PP2_SIG_UINT32), which was set while parsing the haproxy v2 header. When the next fragment becomes ready, the epoll event for the same xprt is triggered, and the non-zero xd->sx_fbtbc deceives svc_vc_recv into dereferencing a NULL uv pointer.

Stack trace:
Thread 76 "ganesha.nfsd" received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fc9937ee640 (LWP 2931)]
0x00007fc9f3e30fa6 in svc_vc_recv (xprt=0x7fc9dc00e3b0) at /git/nfs-ganesha/src/libntirpc/src/svc_vc.c:1070
1070		flags = uv->u.uio_flags;
(gdb) bt
#0 0x00007fc9f3e30fa6 in svc_vc_recv (xprt=0x7fc9dc00e3b0) at /git/nfs-ganesha/src/libntirpc/src/svc_vc.c:1070
#1 0x00007fc9f3e2baab in svc_rqst_xprt_task_recv (wpe=0x7fc9dc00e680) at /git/nfs-ganesha/src/libntirpc/src/svc_rqst.c:1210
#2 0x00007fc9f3e2c740 in svc_rqst_epoll_loop (wpe=0x3003cb8) at /git/nfs-ganesha/src/libntirpc/src/svc_rqst.c:1585
#3 0x00007fc9f3e39be8 in work_pool_thread (arg=0x7fc9e007e030) at /git/nfs-ganesha/src/libntirpc/src/work_pool.c:187
#4 0x00007fc9f3aa0c52 in start_thread () from /lib64/libc.so.6
#5 0x00007fc9f3b24f74 in clone () from /lib64/libc.so.6
(gdb) p uv
$1 = (struct xdr_ioq_uv *) 0x0
(gdb) p xd->sx_fbtbc
$2 = 218762506

@taodd (Author) commented Oct 1, 2024

Another report of the same issue:
nfs-ganesha/nfs-ganesha#1158

It's pretty easy to reproduce:

  1. Deploy single haproxy + single nfs-ganesha (v5.5)
  2. Add "default-server send-proxy-v2" option to haproxy config
  3. Repeatedly mount and umount on the NFS client in a loop; the crash is highly likely to reproduce within 1 minute.
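For step 2, a minimal backend stanza might look like the following. This is a hypothetical sketch: the backend name, server name, and address are placeholders; only the `default-server send-proxy-v2` directive comes from the repro steps.

```
# Hypothetical haproxy backend for the repro; names and the address
# are placeholders. send-proxy-v2 makes haproxy prepend the PROXY
# protocol v2 binary header to each connection it forwards.
backend nfs_backend
    mode tcp
    default-server send-proxy-v2
    server ganesha1 192.0.2.10:2049
```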

@ffilz (Member) left a comment

I'd like to understand the flow path of the failure better.

Also, it seems like it might be better to reset sx_fbtbc down below in the switch after the call to handle_haproxy_header().

I'm also curious whether, down there, we should do a goto again for the LOCAL command as well? Not quite sure how that's supposed to be handled...

@taodd (Author) commented Oct 9, 2024

> I'd like to understand the flow path of the failure better.

The flow path is like below:

  1. svc_vc_recv starts to process a new haproxy v2 message, and xd->sx_fbtbc is set to 0x0d0a0d0a (PP2_SIG_UINT32).
  2. Once the header is processed, the flow jumps to the again block at https://github.com/nfs-ganesha/ntirpc/blob/next/src/svc_vc.c#L1033, which tries to receive the remaining body of the message.
  3. If the remaining data is not ready, ::recv returns EAGAIN and svc_vc_recv simply returns at https://github.com/nfs-ganesha/ntirpc/blob/next/src/svc_vc.c#L993, leaving xd->sx_fbtbc at the non-zero value 0x0d0a0d0a.
  4. When the remaining body data becomes ready, the same xprt handler, still carrying xd->sx_fbtbc == 0x0d0a0d0a, is passed to svc_vc_recv again; the if condition at https://github.com/nfs-ganesha/ntirpc/blob/next/src/svc_vc.c#L967 fails, so control goes directly to https://github.com/nfs-ganesha/ntirpc/blob/next/src/svc_vc.c#L1069, which crashes due to an empty ioq.

I think svc_vc_recv only expects a non-zero xd->sx_fbtbc to represent the remaining bytes of a message still to be received. That is clearly not the case when it holds the proxy v2 protocol signature, so xd->sx_fbtbc == 0x0d0a0d0a should only be a temporary value and should be reset once it's no longer needed.

> Also, it seems like it might be better to reset sx_fbtbc down below in the switch after the call to handle_haproxy_header().

I did try resetting it after ::recv returned EAGAIN, right before https://github.com/nfs-ganesha/ntirpc/blob/next/src/svc_vc.c#L993, but there is a race condition: before sx_fbtbc can be reset in this thread, a new thread, triggered by the epoll event on the same socket fd and trying to receive the body of the message, calls svc_vc_recv with the same xprt and crashes at the same place.
So I think we should reset it before the haproxy v2 header has been completely received? However, I can test resetting it after the call to handle_haproxy_header.

@taodd (Author) commented Oct 13, 2024

@ffilz I revised the patch by resetting sx_fbtbc after the call to handle_haproxy_header. I can't reproduce the crash with it; it works perfectly! Please take a look.

@ffilz (Member) left a comment

Glad this works. That looks great and easily understood (and the code is tight enough here to not need a comment explaining why sx_fbtbc is being reset).
