RINA kernel module crash #1352
This is the offending line in the first oops.
This crash is an accidental side effect of the Go program, which is intended to search for a specific problem that we've observed. A fraction of the time (which suggests a timing race), when a flow is closed, the opposite end of the flow does not close and terminate a waiting read() until the waiting program, running in gdb, is paused and resumed. At that point the read() correctly terminates with an eof. This indicates that the read() syscall is in an interruptible wait state: when it is interrupted (by the gdb signaling) and later resumed, the re-executed syscall correctly detects that the close has occurred and terminates normally. Other times, the read() immediately returns eof when notification of the other-end close arrives, as desired.
This is causing a lot of trouble in a program that opens and closes flows in goroutines. The goroutines hang in a read() after the other end closes, and after a while the hung goroutines holding onto file descriptors cause the process to run out of available file descriptors and crash.
To try to identify the race condition that triggers the hanging read(), this Go program (at a high level) creates a flow between a client and a server, and then, with random timing, reads from one end and closes the other to see whether the read() completes with eof or hangs. As an unintended side effect, running it crashes Linux. The fail-to-close bug will be filed separately, but pursuing it is hampered by the crashing problem.
This could be related to the earlier issue #1133, which was closed as not repeatable. If it is, then the timing issue in that case was exercised when the commands were being executed manually. Opening the flow at the server ( <xx ) end would be immediately followed by an eof and close() of xx, and then a close on the flow. So the last CDAP message(s) from the server to start and to close the flow might be delivered to the IPCP while the process running the read() is still blocked in the rina_flow_alloc() syscall, and could both be processed before the read() is executed, or after, depending on timing. Just a thought: that might explain the unrepeatability (depending on the timing of execution on the hardware, number of CPUs, etc.).
The first version of the program was a bit of a mess. This program is somewhat simplified but still exposes the problem.
Steve suggested that I add a more complete view of the whole kernel log. The warning messages
I searched my embedded development Bag-O-Tricks and remembered ftrace. I've modified the Go program so that it turns on kernel function tracing while it runs, and let it run until it triggered a crash. This is the whole kernel function trace for the system while the program is running. I have not tried to analyze it yet.
Two more trace files, in a different format. These are limited to the CPU where the process runs.
I have more information. I enabled a few debug options in the kernel and got this.
Thanks for all this work debugging the issue, and sorry for not having been able to contribute so far! Hopefully next week I can allocate some time.
Yes, #1358 seems to fix this bug. The IRATI kernel modules look a lot more stable for me right now. I believe this can be closed.
This is on me, since apparently my patch was not 100% correct. Since it improves stability, I believe I'm not far from the correct patch, but I just got this crash, which is obviously related to the same problem.
The attached Go program fairly reliably reproduces a crash within the IRATI kernel modules. Two stack traces appear in quick succession.
main.go.txt