Segfault when running evolution.eigensystem inside cProfile #7

Open
jakelishman opened this issue Oct 1, 2018 · 2 comments
Labels: bug

Comments

@jakelishman
Member

Some routine called in evolution.eigensystem triggers a segfault when run inside the profiler. The behaviour seems to depend on numba: removing all the numba.njit decorators makes all the code run fine, with no Python exceptions and no other indication of a fault. The segfault occurs with both numba 0.39 and 0.40, using Python 3.6.5, numpy 1.15 and scipy 1.1.0.

The behaviour appears on both macOS (High Sierra) and Linux (the cx1 cluster at Imperial).

Example to reproduce: rabi.py.txt
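
For reference, this is a minimal sketch of the failing pattern. The kernel here is a hypothetical stand-in for whatever evolution.eigensystem calls internally, not the actual code from the attached file:

```python
import cProfile

import numba
import numpy as np


@numba.njit
def eigensystem_kernel(h):
    # Hypothetical stand-in: any njit-compiled routine on the hot path
    # seems to do it; the real one lives in evolution.eigensystem.
    return np.linalg.eigh(h)


h = np.diag(np.arange(4.0))
eigensystem_kernel(h)                    # fine outside the profiler
cProfile.run("eigensystem_kernel(h)")    # segfaults here
```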

When I attach a debugger to python and run the example file (or any similar reproducer), I get the message

```
stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
```

(the address always seems to be the same), and the offending disassembled instruction is

```
0x1000540f1: mov rax, qword ptr [rdi + 0x10]
```

This makes me think that the rdi register is either being zeroed or never written before use: it looks less like an out-of-bounds array access and more like a null-pointer dereference (here with a 16-byte offset). The same behaviour occurs whether I use lldb on my Mac or gdb on the cluster. On the cluster the stack size is unlimited (per ulimit -s), so I don't think stack exhaustion is the problem (and in any case, all the allocations should be on the heap).

I don't have debugging symbols for Python/numpy/numba installed (and they're not available through conda), so I haven't yet tracked the issue down far enough to determine whose fault it is.

I also tried setting the environment variables OMP_NUM_THREADS, NUMBA_NUM_THREADS and MKL_NUM_THREADS all to 1 to rule out threading issues. This has no apparent effect, and in any case only one thread appears to be active at the critical point when running under the debugger.
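
For completeness, the pinning looked like the sketch below; exporting the same variables in the shell before launching python is equivalent:

```python
import os

# Pin every threading layer to a single thread.  Setting these before
# numpy/numba are imported is the safe order, since the threading
# layers may read them at load time.
for var in ("OMP_NUM_THREADS", "NUMBA_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # noqa: E402
import numba        # noqa: E402
```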

Possibly related to issue numba/numba#3229?

At any rate, this doesn't seem to be a blocking problem, because it only manifests when running under the profiler.

@jakelishman added the bug label on Oct 1, 2018
@jakelishman
Member Author

> The behaviour appears on both macOS (High Sierra) and Linux (the cx1 cluster at Imperial).

This isn't to say that the behaviour doesn't appear on Windows; I just don't have a Windows installation to test it on.

@jakelishman
Member Author

Using python rabi.py (the reproduction example file) as the test, git bisect says that commit a46d7e6 is the first bad one. That commit gave plenty of opportunity for an out-of-bounds memory access to sneak into my code (though I can't spot one), so there's still a reasonable chance this is my fault.
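
For what it's worth, the bisection can be automated with git bisect run. A sketch of a wrapper script (hypothetical, not part of the repository) that treats any crash of the reproducer as a bad commit:

```python
#!/usr/bin/env python
# Run the reproducer in a subprocess and translate its outcome into the
# exit codes `git bisect run` expects: 0 for good, 1 for bad.
import subprocess
import sys

result = subprocess.run([sys.executable, "rabi.py"])
# On POSIX a segfault shows up as a negative return code (-11 = -SIGSEGV),
# so any nonzero code marks the commit as bad.
sys.exit(0 if result.returncode == 0 else 1)
```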
