Segmentation violation during checkpoint restore (cpu mode) #49

Open
alexfrolov opened this issue May 23, 2024 · 6 comments

@alexfrolov

Hi!

I want to try Cricket for C/R in cpu mode (no in-kernel checkpointing). However, when I run the restore, it fails with a segfault.

(gdb) r
Starting program: /home/alexndrfrolov/cricket/cpu/cricket-rpc-server 
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
welcome to cricket!
+03:00:00.000003 INFO:  restoring previous state was enabled by setting CRICKET_RESTORE
+03:00:00.000146 DEBUG: restoring rpc_id from ckp/rpc_id
+03:00:00.000189 DEBUG: using prog=99, vers=1   in cpu-server.c:220
+03:00:00.000200 INFO:  using TCP...
+03:00:00.000766 INFO:  listening on port 49338
+03:00:00.001007 DEBUG: sched_none_init
[New Thread 0x7fffb47ff000 (LWP 2666702)]
+03:00:00.673881 DEBUG: restoring api records from ckp/api_records
+03:00:00.673948 DEBUG: function: 50 

Thread 1 "cricket-rpc-ser" received signal SIGSEGV, Segmentation fault.
0x00007fffb8381b1d in ?? () from /lib/x86_64-linux-gnu/libc.so.6
(gdb) bt
#0  0x00007fffb8381b1d in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#1  0x00007fffb824dd31 in ?? () from /lib/x86_64-linux-gnu/libc.so.6
#2  0x000055555557c922 in loggf (level=3 '\003', formatstr=0x5555555973d8 "rpc_register_function(fatCubinHandle: %p, hostFun: %p, deviceFun: %s, deviceName: %s, thread_limit: %d)") at log.c:98
#3  0x000055555557a38f in rpc_register_function_1_svc (fatCubinHandle=94419555140752, hostFun=94419554144212, deviceFun=0x56340424fd90 <error: Cannot access memory at address 0x56340424fd90>, 
    deviceName=0x5634044f2e90 <error: Cannot access memory at address 0x5634044f2e90>, thread_limit=-1, result=0x555555975c50, rqstp=0x7fffffffdfc0) at cpu-server-driver.c:111
#4  0x0000555555564913 in _rpc_register_function_1 (argp=0x555555972740, result=0x555555975c50, rqstp=0x7fffffffdfc0) at cpu_rpc_prot_svc_mod.c:46
#5  0x0000555555583534 in cr_call_record (record=0x555555974830) at cr.c:714
#6  0x0000555555583889 in cr_restore_resources (path=0x5555555963fb "ckp", record=0x555555974830, rm_memory=0x5555555a5d60 <rm_memory>, rm_streams=0x5555555a5ac0 <rm_streams>, rm_events=0x5555555a5c40 <rm_events>, 
    rm_arrays=0x5555555a61a0 <rm_arrays>, rm_cusolver=0x5555555a5be0 <rm_cusolver>, rm_cublas=0x5555555a5e80 <rm_cublas>) at cr.c:772
#7  0x0000555555583d55 in cr_restore (path=0x5555555963fb "ckp", rm_memory=0x5555555a5d60 <rm_memory>, rm_streams=0x5555555a5ac0 <rm_streams>, rm_events=0x5555555a5c40 <rm_events>, rm_arrays=0x5555555a61a0 <rm_arrays>, 
    rm_cusolver=0x5555555a5be0 <rm_cusolver>, rm_cublas=0x5555555a5e80 <rm_cublas>) at cr.c:870
#8  0x00005555555710c1 in server_runtime_restore (path=0x5555555963fb "ckp") at cpu-server-runtime.c:141
#9  0x0000555555570e3b in server_runtime_init (restore=1) at cpu-server-runtime.c:87
#10 0x000055555556ed54 in cricket_main (prog_num=99, vers_num=1) at cpu-server.c:284
#11 0x0000555555592752 in main (argc=1, argv=0x7fffffffe3d8) at server-exe.c:11

After a little debugging, I found that the problem comes from rpc_register_function_1_svc being called during the restore process (see the gdb trace). The comments in the code say that this function does not support checkpoint/restore. However, I have not found a way to avoid it, because it is called from __cudaRegisterFunction on the client side.

Does this mean that C/R does not currently work in Cricket's cpu mode? Thank you!

@n-eiling
Member

Cricket currently only supports C/R when you only use the runtime API. It looks like your checkpoint contains a call to a driver API function for which there is currently no C/R support.
Are you able to share the code? How have you launched the application and how have you created the checkpoint?

@alexfrolov
Author

Hi!

By runtime API, do you mean the ./gpu part of Cricket? Yes, I was able to generate a checkpoint for one of your samples (probably test_apps/matmul.cu).

Do you have any plans to add C/R support for the "cpu" mode?

Best,
Alex

@n-eiling
Member

n-eiling commented Jun 5, 2024

Hey,

I mean the CUDA Runtime API (see https://docs.nvidia.com/cuda/cuda-runtime-api/index.html).
Functions from the CUDA Driver API (see https://docs.nvidia.com/cuda/cuda-driver-api/index.html) are not supported for C/R.
You are getting the segfault in a Driver API call, because this function is supported for remote execution but not for checkpointing. It tries to restore something that was not saved to the checkpoint file.
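
In the backtrace above, gdb cannot access the deviceFun and deviceName pointers, which suggests the replayed record still holds addresses from the original process. One general way to make such a call restorable is to write the string contents into the checkpoint instead of the pointer. The following is a minimal sketch under that assumption: func_record_t, record_save and record_restore are hypothetical names for illustration and are not part of Cricket's code or checkpoint format.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical record of a register-function call (not Cricket's layout). */
typedef struct {
    uint64_t host_fun;    /* opaque handle, kept as a plain integer */
    char    *device_name; /* heap-allocated, NUL-terminated string */
} func_record_t;

/* Serialize as [host_fun][name_len][name bytes incl. NUL].
 * Returns a malloc'ed buffer (caller frees) or NULL on allocation failure. */
unsigned char *record_save(const func_record_t *rec, size_t *out_len)
{
    assert(rec != NULL && rec->device_name != NULL && out_len != NULL);

    uint64_t name_len = (uint64_t)strlen(rec->device_name) + 1;
    size_t total = sizeof(rec->host_fun) + sizeof(name_len) + (size_t)name_len;
    unsigned char *buf = malloc(total);
    if (buf == NULL) {
        return NULL;
    }

    unsigned char *p = buf;
    memcpy(p, &rec->host_fun, sizeof(rec->host_fun));
    p += sizeof(rec->host_fun);
    memcpy(p, &name_len, sizeof(name_len));
    p += sizeof(name_len);
    /* Copy the characters themselves, never the address of the string. */
    memcpy(p, rec->device_name, (size_t)name_len);

    *out_len = total;
    return buf;
}

/* Rebuild a record from a buffer written by record_save.
 * Returns 0 on success, -1 on malformed input or allocation failure. */
int record_restore(const unsigned char *buf, size_t len, func_record_t *rec)
{
    uint64_t name_len;
    const size_t header = sizeof(rec->host_fun) + sizeof(name_len);

    if (buf == NULL || rec == NULL || len < header) {
        return -1;
    }
    memcpy(&rec->host_fun, buf, sizeof(rec->host_fun));
    memcpy(&name_len, buf + sizeof(rec->host_fun), sizeof(name_len));

    /* Reject lengths that run past the buffer or lack a terminator. */
    if (name_len == 0 || name_len > len - header) {
        return -1;
    }
    if (buf[header + (size_t)name_len - 1] != '\0') {
        return -1;
    }

    /* The restored string lives in memory owned by the new process. */
    rec->device_name = malloc((size_t)name_len);
    if (rec->device_name == NULL) {
        return -1;
    }
    memcpy(rec->device_name, buf + header, (size_t)name_len);
    return 0;
}
```

A caller would write the buffer returned by record_save to the checkpoint file and pass the bytes read back to record_restore. The restored device_name is then safe to print with %s, because it points into memory allocated by the restoring process.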

@alexfrolov
Author

Hi!

AFAIU, the call to __cudaRegisterFunction is inserted by NVCC when it generates the binary. Is it possible to avoid it by passing some options to nvcc?

@n-eiling
Member

n-eiling commented Jun 7, 2024

Have you linked to the CUDA libraries dynamically, i.e., using -cudart shared as an nvcc option? If I remember correctly, your error might happen if you link statically.

@ya0guang

I also encounter this bug when compiling my simple CUDA application with the shared cudart library. It seems that __cudaRegisterFunction is called via the runtime library:

$ nm matrixMult.bin | grep cuda
0000000000001be4 t _Z16cudaLaunchKernelIcE9cudaErrorPKT_4dim3S4_PPvmP11CUstream_st
0000000000005048 b _ZL20__cudaFatCubinHandle
0000000000005070 b _ZL20__cudaFatCubinHandle
0000000000005050 b _ZL22__cudaPrelinkedFatbins
0000000000001b86 t _ZL24__sti____cudaRegisterAllv
000000000000191d t _ZL26__cudaUnregisterBinaryUtilv
0000000000001b20 t _ZL31__nv_cudaEntityRegisterCallbackPPv
0000000000005080 b _ZZL31__nv_cudaEntityRegisterCallbackPPvE5__ref
                 U [email protected]
                 U [email protected]
                 U [email protected]
                 U [email protected]
                 U [email protected]
                 U [email protected]
0000000000001329 t __cudaUnregisterBinaryUtil
                 U [email protected]
                 U [email protected]
                 U [email protected]
                 U [email protected]
                 U [email protected]
