
JNIFuse crash when DetachCurrentThread #15015

Closed

LuQQiu opened this issue Feb 16, 2022 · 4 comments
Assignees: LuQQiu
Labels: area-fuse, priority-high, type-bug


LuQQiu commented Feb 16, 2022

Alluxio setting:

  • Version 2.7.2
  • 40 worker nodes with 40 standalone Fuse pods serving read requests
  • Kubernetes

Workload:

  • 2TB total data (30K files, ~50MB each on average)
  • Each training node reads a subset of the files; each node gets the file names of 1/40 of the dataset.
  • Each node reads those files from Alluxio Fuse using torch.utils.data.DataLoader:
    # Excerpt from the benchmark script; ImageList, args, benchmarkmode and
    # extra_time are defined elsewhere in the full script.
    import datetime
    import os
    import time
    from torch.utils.data import DataLoader

    train_set = ImageList('./headerPartial.txt', rootdir)
    train_data = DataLoader(train_set, batch_size=int(args.batch_size), shuffle=False, num_workers=args.w, drop_last=True)

    costs = []
    batch_index = 0
    for epoch in range(args.e):
        e_st = time.time()
        g_time = time.time()
        for batch_index, (batch_imgs, batch_labels) in enumerate(train_data):  # iterate over the training set in each epoch
            if batch_index >= args.b:
                print("exceed wanted batch num, break")
                break
            cost = time.time() - e_st  # time spent loading this batch
            if args.t != 0:
                time.sleep(args.t * 0.001)  # optional per-batch sleep, args.t in milliseconds
            costs.append(cost)
            print('[%s] pid: %s, batch %s, cost %.4f, cur sum %s' % (datetime.datetime.now(), os.getpid(), batch_index, cost, len(batch_imgs)))
            e_st = time.time()
        print("[%s] pid: %s, cost %.4f, qps %.4f" % (datetime.datetime.now(), os.getpid(), time.time() - g_time, batch_index * args.batch_size / (time.time() - g_time)))
        if args.mode == benchmarkmode.DOWNLOADRAW:
            print("Mode:[DownloadingRaw] -- downloading time: %.4f" % extra_time)

File batch size = 8; number of DataLoader workers (training workers per node) = 8.

Error

In each training run, a random subset of the training nodes failed at the beginning or in the middle of the read process; the remaining training nodes succeeded. The corresponding Fuse pods crashed because libjnifuse threw SIGSEGV.

Core crash with SIGSEGV:

#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007f747f90b106, pid=15, tid=0x00007f7480ef1700
#
# JRE version: OpenJDK Runtime Environment (8.0_312-b07) (build 1.8.0_312-b07)
# Java VM: OpenJDK 64-Bit Server VM (25.312-b07 mixed mode linux-amd64 )
# Problematic frame:
# V [libjvm.so+0x8f4106] Monitor::ILock(Thread*)+0xf6
#
# Core dump written. Default location: /opt/alluxio-2.7.2/core or core.15
#
# An error report file with more information is saved as:
# /opt/alluxio-2.7.2/hs_err_pid15.log
#
# If you would like to submit a bug report, please visit:
Stack: [0x00007faa003af000,0x00007faa004af000],  sp=0x00007faa004ade10,  free space=1019k
Native frames: (J=compiled Java code, j=interpreted, Vv=VM code, C=native code)
V  [libjvm.so+0x8f4106]  Monitor::ILock(Thread*)+0xf6
V  [libjvm.so+0x8f4ba6]  Monitor::lock_without_safepoint_check()+0x26
V  [libjvm.so+0xb6f551]  VM_Exit::wait_if_vm_exited()+0x31
V  [libjvm.so+0x6fefe5]  jni_DetachCurrentThread+0x85
C  [libjnifuse4472936298026469718.so+0x8290]  JavaVM_::DetachCurrentThread()+0x20
C  [libjnifuse4472936298026469718.so+0x76f7]

hs_err_pid151.log

Reported by @Nizifan

LuQQiu added the type-bug label on Feb 16, 2022

LuQQiu commented Feb 16, 2022

First, because the crash occurs inside JavaVM_::DetachCurrentThread(), we re-examined the libjnifuse code to check whether it follows JNI best practices for obtaining the Java env and for attaching and detaching threads; a minimal sketch of that pattern is shown below.
The fixes are in #15000
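
For context, here is a minimal sketch of that attach pattern (an illustration under assumptions, not the actual libjnifuse code): a native FUSE callback obtains a JNIEnv for the current thread, attaches the thread only if it is not already attached, and records whether this code performed the attach. The helper name getEnv and the cached JavaVM* are assumptions for the example.

    // Minimal sketch, assuming a JavaVM* cached earlier (e.g. in JNI_OnLoad).
    #include <jni.h>

    JNIEnv* getEnv(JavaVM* jvm, bool* attachedHere) {
      JNIEnv* env = nullptr;
      *attachedHere = false;
      jint status = jvm->GetEnv(reinterpret_cast<void**>(&env), JNI_VERSION_1_8);
      if (status == JNI_EDETACHED) {
        // This native (FUSE) thread is not yet known to the JVM; attach it.
        if (jvm->AttachCurrentThread(reinterpret_cast<void**>(&env), nullptr) != JNI_OK) {
          return nullptr;
        }
        *attachedHere = true;
      } else if (status != JNI_OK) {
        return nullptr;
      }
      return env;
    }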


LuQQiu commented Feb 16, 2022

With this new fix, the core crash still happens. Searching online, we found
intel-iot-devkit/tinyb#135 (comment),
which contains the same core dump error message.

This ticket mentions:

The reason TinyB hit the bug is that it would attach the thread without checking if it was already attached. As a result, for calls that were executing on a thread that started within Java, TinyB would attempt to detach the thread even though it was not a native thread.

For me, this was happening when using notifications. When the C++ code gets a global reference to the callback in Java code, it is re-attaching the thread that originated in Java. This is allowed by JNI and is a no-op. The problem is that it could not detach the thread correctly because the JVM had already cleaned up the thread on shutdown, leading to the null pointer from Thread::current().

This means the crashing thread may be a Java thread that the JVM has already cleaned up on shutdown. Trying to detach such a thread dereferences a null Thread::current() and crashes. This is a bug in JDK 8 and is fixed in JDK 11.
Related information can be found in
https://bugs.openjdk.java.net/browse/JDK-8199012?focusedCommentId=14161254&page=com.atl[…]ssian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel
http://hg.openjdk.java.net/jdk/jdk/rev/2085742233ed

We used a JDK 11 alluxio-dev docker image to see whether the problem persists. The same core crash did not happen again.

Using JDK 11 solves this issue.


LuQQiu commented Feb 16, 2022

The workaround mentioned in intel-iot-devkit/tinyb#135
(PR intel-iot-devkit/tinyb#153) does not work in our experiments.
Its core logic is already included in PR #15000, which was validated and still hits the same core crash; a sketch of that logic follows.
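
For reference, a minimal sketch of that core logic (illustrative only; callJavaCallback and the getEnv helper from the earlier sketch are assumptions, not Alluxio's actual code): a detach is issued only for threads that the native side attached, so threads that originated in Java are never detached.

    void callJavaCallback(JavaVM* jvm, jobject callback, jmethodID method) {
      bool attachedHere = false;
      JNIEnv* env = getEnv(jvm, &attachedHere);  // helper from the earlier sketch
      if (env == nullptr) {
        return;  // could not obtain an env (e.g. JVM unavailable); skip the call
      }
      env->CallVoidMethod(callback, method);
      // Only detach threads this library attached; even so, on JDK 8 the detach
      // can race with VM shutdown, which is the crash discussed in this issue.
      if (attachedHere) {
        jvm->DetachCurrentThread();
      }
    }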

Searching online, we found that other JNI-related software hits the same issue and most of it remains unresolved, including libhdfs.

We could not find a valid workaround on JDK 8.
If you encounter the same issue, we recommend upgrading to JDK 11.


LuQQiu commented Feb 16, 2022

Special thanks to @Nizifan for reporting the issue, helping with debugging, and validating different changes.

LuQQiu added the priority-high and area-fuse labels on Feb 16, 2022
LuQQiu self-assigned this on Feb 16, 2022
LuQQiu closed this as completed on Mar 17, 2022
alluxio-bot pushed a commit that referenced this issue Mar 30, 2022
### What changes are proposed in this pull request?
Users can build a dev image running Alluxio with Java 11 by using a
build-arg to specify the Java version.

### Why are the changes needed?
Java 11 fixes a JVM bug present in Java 8 (see
#15015). Previously, users who wanted
Java 11 had to modify the Dockerfile and then build the image.
With this change they can pass the Java version (for example, 11) to
build the image without modifying code.

pr-link: #15227
change-id: cid-2279c81042a01602aa8ed81e59081ca73bc0c4eb