JNIFuse crash when DetachCurrentThread #15015
Comments
First, because code exists in |
With this new fix, the core crash still happens. We searched online and found this ticket, which mentions:

"The reason TinyB hit the bug is that it would attach the thread without checking if it was already attached. As a result, for calls that were executing on a thread that started within Java, TinyB would attempt to detach the thread even though it was not a native thread. For me, this was happening when using notifications. When the C++ code gets a global reference to the callback in Java code, it is re-attaching the thread that originated in Java. This is allowed by JNI and is a no-op. The problem is that it could not detach the thread correctly because the JVM had already cleaned up the thread on shutdown, leading to the null pointer from Thread::current()."

This means the thread may be a Java thread that the JVM has already cleaned up. If we try to detach an already-cleaned thread, the detach hits a null pointer and errors out. This is a bug in JDK 8 and is fixed in JDK 11.

We used a JDK 11 alluxio-dev Docker image to check whether the problem still exists. The same core crash did not happen again, so using JDK 11 solves this issue.
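The attach-without-checking problem described in the quoted ticket can be illustrated with a small RAII guard that attaches only when the current thread is not yet attached, and detaches only what it attached itself. This is a minimal sketch, not the actual JNIFuse or TinyB code: `FakeVm` and `ScopedAttach` are hypothetical names, and `FakeVm` stands in for `JavaVM` so the example compiles without `<jni.h>`. With real JNI the corresponding calls would be `GetEnv` (which returns `JNI_EDETACHED` for an unattached thread), `AttachCurrentThread`, and `DetachCurrentThread`. Note that this thread did not find a valid workaround for JDK 8, so the sketch only shows the ownership check, not a confirmed fix.

```cpp
#include <set>
#include <thread>

// Hypothetical stand-in for JavaVM (assumption, not a real JNI type).
// It only records which threads are attached and counts detach calls.
struct FakeVm {
  std::set<std::thread::id> attached;
  int detach_calls = 0;

  // Real JNI: vm->GetEnv(&env, version) != JNI_EDETACHED
  bool IsAttached() const {
    return attached.count(std::this_thread::get_id()) > 0;
  }
  // Real JNI: vm->AttachCurrentThread(&env, nullptr)
  void Attach() { attached.insert(std::this_thread::get_id()); }
  // Real JNI: vm->DetachCurrentThread()
  void Detach() {
    attached.erase(std::this_thread::get_id());
    ++detach_calls;
  }
};

// Attaches the current thread only if it is not attached yet, and
// detaches in the destructor only if this guard performed the attach.
// A thread that started in Java is therefore never detached here.
template <typename Vm>
class ScopedAttach {
 public:
  explicit ScopedAttach(Vm& vm) : vm_(vm), owns_(!vm.IsAttached()) {
    if (owns_) vm_.Attach();
  }
  ~ScopedAttach() {
    if (owns_) vm_.Detach();
  }
  ScopedAttach(const ScopedAttach&) = delete;
  ScopedAttach& operator=(const ScopedAttach&) = delete;

  // True when this guard attached the thread and will detach it.
  bool owns() const { return owns_; }

 private:
  Vm& vm_;
  bool owns_;
};
```

On a native thread the guard attaches and later detaches; on a thread that is already attached (for example, one that originated in Java) it does neither, which is the check the quoted ticket says was missing.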
The workaround mentioned by intel-iot-devkit/tinyb#135
We searched online and found that other JNI-related software has the same issue, but most of those reports were never resolved, including
We couldn't find a valid workaround for JDK 8.
Special thanks to @Nizifan for reporting the issue, helping with debugging, and validating the different changes.
### What changes are proposed in this pull request?

Users can build a dev image that runs Alluxio with Java 11 by using a build-arg to specify the Java version.

### Why are the changes needed?

Java 11 fixes a JVM bug present in Java 8 (see #15015). If users want to use Java 11, they have to modify the Dockerfile and then build the image. With this change, they can pass the Java version (for example, 11) when building the image, without modifying code.

pr-link: #15227
change-id: cid-2279c81042a01602aa8ed81e59081ca73bc0c4eb
Alluxio setting:
Workload:
torch.utils.data.DataLoader
.file batch size = 8, number of workers (training threads per node) = 8.
Error
In each training run, a random number of training nodes failed at the beginning or in the middle of the read process, while the remaining training nodes succeeded. The Fuse pod crashed because libjnifuse raised SIGSEGV.
Core crash SIGSEGV,
hs_err_pid151.log
Reported by @Nizifan