
JNIFuse crash when DetachCurrentThread #15015

Closed

LuQQiu opened this issue Feb 16, 2022 · 4 comments
Assignees: LuQQiu
Labels: area-fuse, priority-high, type-bug


LuQQiu commented Feb 16, 2022

Alluxio setting:

  • Version 2.7.2
  • 40 worker nodes with 40 standalone Fuse pods serving read requests
  • Kubernetes

Workload:

  • 2TB total data (30K files, ~50MB each on average)
  • Each training node reads a subset of the files; each node gets the file names of 1/40 of the dataset.
  • Each node reads those files from Alluxio Fuse using torch.utils.data.DataLoader:
    # Excerpt from the benchmark script; ImageList, args, benchmarkmode and
    # extra_time are defined elsewhere in the full script.
    import datetime
    import os
    import time
    from torch.utils.data import DataLoader

    train_set = ImageList('./headerPartial.txt', rootdir)
    train_data = DataLoader(train_set, batch_size=int(args.batch_size), shuffle=False, num_workers=args.w, drop_last=True)

    costs = []
    batch_index = 0
    for epoch in range(args.e):
        e_st = time.time()
        g_time = time.time()
        for batch_index, (batch_imgs, batch_labels) in enumerate(train_data):  # iterate over the training set in each epoch
            if batch_index >= args.b:
                print("exceed wanted batch num, break")
                break
            cost = time.time() - e_st  # time spent loading this batch
            if args.t != 0:
                time.sleep(args.t * 0.001)  # optional per-batch sleep, args.t in milliseconds
            costs.append(cost)
            print('[%s] pid: %s, batch %s, cost %.4f, cur sum %s' % (datetime.datetime.now(), os.getpid(), batch_index, cost, len(batch_imgs)))
            e_st = time.time()
        print("[%s] pid: %s, cost %.4f, qps %.4f" % (datetime.datetime.now(), os.getpid(), time.time() - g_time, batch_index * args.batch_size / (time.time() - g_time)))
        if args.mode == benchmarkmode.DOWNLOADRAW:
            print("Mode:[DownloadingRaw] -- downloading time: %.4f" % extra_time)

File batch size = 8; number of DataLoader workers (training workers per node) = 8.

Error

In each training run, a random subset of the training nodes failed at the beginning or in the middle of the read process; the remaining training nodes succeeded. The corresponding Fuse pods crashed because libjnifuse threw SIGSEGV.

Core crash with SIGSEGV:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f747f90b106, pid=15, tid=0x00007f7480ef1700
#
# JRE version: OpenJDK Runtime Environment (8.0_312-b07) (build 1.8.0_312-b07)
# Java VM: OpenJDK 64-Bit Server VM (25.312-b07 mixed mode linux-amd64 )
# Problematic frame:
# V [libjvm.so+0x8f4106] Monitor::ILock(Thread*)+0xf6
#
# Core dump written. Default location: /opt/alluxio-2.7.2/core or core.15
#
# An error report file with more information is saved as:
# /opt/alluxio-2.7.2/hs_err_pid15.log
#
# If you would like to submit a bug report, please visit:
Stack: [0x00007faa003af000,0x00007faa004af000],  sp=0x00007faa004ade10,  free space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x8f4106]  Monitor::ILock(Thread*)+0xf6
V  [libjvm.so+0x8f4ba6]  Monitor::lock_without_safepoint_check()+0x26
V  [libjvm.so+0xb6f551]  VM_Exit::wait_if_vm_exited()+0x31
V  [libjvm.so+0x6fefe5]  jni_DetachCurrentThread+0x85
C  [libjnifuse4472936298026469718.so+0x8290]  JavaVM_::DetachCurrentThread()+0x20
C  [libjnifuse4472936298026469718.so+0x76f7]

hs_err_pid151.log

Reported by @Nizifan

LuQQiu added the type-bug label on Feb 16, 2022

LuQQiu commented Feb 16, 2022

First, because the crash occurs inside JavaVM_::DetachCurrentThread(), we re-examined the libjnifuse code to check whether it follows JNI best practices for obtaining the Java env and for attaching and detaching threads; a minimal sketch of that pattern is shown below.
The fixes are in #15000
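
For context, here is a minimal sketch of that attach pattern (an illustration under assumptions, not the actual libjnifuse code): a native FUSE callback obtains a JNIEnv for the current thread, attaches the thread only if it is not already attached, and records whether this code performed the attach. The helper name getEnv and the cached JavaVM* are assumptions for the example.

    // Minimal sketch, assuming a JavaVM* cached earlier (e.g. in JNI_OnLoad).
    #include <jni.h>

    JNIEnv* getEnv(JavaVM* jvm, bool* attachedHere) {
      JNIEnv* env = nullptr;
      *attachedHere = false;
      jint status = jvm->GetEnv(reinterpret_cast<void**>(&env), JNI_VERSION_1_8);
      if (status == JNI_EDETACHED) {
        // This native (FUSE) thread is not yet known to the JVM; attach it.
        if (jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), nullptr) != JNI_OK) {
          return nullptr;
        }
        *attachedHere = true;
      } else if (status != JNI_OK) {
        return nullptr;
      }
      return env;
    }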


LuQQiu commented Feb 16, 2022

With this new fix, the core crash still happens. Searching online, we found
intel-iot-devkit/tinyb#135 (comment),
which contains the same core dump error message.

This ticket mentions:

The reason TinyB hit the bug is that it would attach the thread without checking if it was already attached. As a result, for calls that were executing on a thread that started within Java, TinyB would attempt to detach the thread even though it was not a native thread.

For me, this was happening when using notifications. When the C++ code gets a global reference to the callback in Java code, it is re-attaching the thread that originated in Java. This is allowed by JNI and is a no-op. The problem is that it could not detach the thread correctly because the JVM had already cleaned up the thread on shutdown, leading to the null pointer from Thread::current().

This means the crashing thread may be a Java thread that the JVM has already cleaned up on shutdown. Trying to detach such a thread dereferences a null Thread::current() and crashes. This is a bug in JDK 8 and is fixed in JDK 11.
Related information can be found in
https://bugs.openjdk.java.net/browse/JDK-8199012?focusedCommentId=14161254&page=com.atl[…]ssian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel
http://hg.openjdk.java.net/jdk/jdk/rev/2085742233ed

We used a JDK 11 alluxio-dev docker image to see whether the problem persists. The same core crash did not happen again.

Using JDK 11 solves this issue.


LuQQiu commented Feb 16, 2022

The workaround mentioned in intel-iot-devkit/tinyb#135
(PR intel-iot-devkit/tinyb#153) does not work in our experiments.
Its core logic is already included in PR #15000, which was validated and still hits the same core crash; a sketch of that logic follows.
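
For reference, a minimal sketch of that core logic (illustrative only; callJavaCallback and the getEnv helper from the earlier sketch are assumptions, not Alluxio's actual code): a detach is issued only for threads that the native side attached, so threads that originated in Java are never detached.

    void callJavaCallback(JavaVM* jvm, jobject callback, jmethodID method) {
      bool attachedHere = false;
      JNIEnv* env = getEnv(jvm, &attachedHere);  // helper from the earlier sketch
      if (env == nullptr) {
        return;  // could not obtain an env (e.g. JVM unavailable); skip the call
      }
      env->CallVoidMethod(callback, method);
      // Only detach threads this library attached; even so, on JDK 8 the detach
      // can race with VM shutdown, which is the crash discussed in this issue.
      if (attachedHere) {
        jvm->DetachCurrentThread();
      }
    }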

Searching online, we found that other JNI-related software hits the same issue and most of it remains unresolved, including libhdfs.

We could not find a valid workaround on JDK 8.
If you encounter the same issue, we recommend upgrading to JDK 11.


LuQQiu commented Feb 16, 2022

Special thanks to @Nizifan for reporting the issue, helping with debugging, and validating different changes.

LuQQiu added the priority-high and area-fuse labels on Feb 16, 2022
LuQQiu self-assigned this on Feb 16, 2022
LuQQiu closed this as completed on Mar 17, 2022
alluxio-bot pushed a commit that referenced this issue Mar 30, 2022
### What changes are proposed in this pull request?
Users can build a dev image running Alluxio with Java 11 by using a
build-arg to specify the Java version.

### Why are the changes needed?
Java 11 fixes a JVM bug present in Java 8 (see
#15015). Previously, users who wanted
Java 11 had to modify the Dockerfile and then build the image.
With this change they can pass the Java version (for example, 11) to
build the image without modifying code.

pr-link: #15227
change-id: cid-2279c81042a01602aa8ed81e59081ca73bc0c4eb