Segfault when running evolution.eigensystem inside cProfile #7

Open
jakelishman opened this issue Oct 1, 2018 · 2 comments
Labels: bug

Comments

@jakelishman
Member

Some routine called in evolution.eigensystem triggers a segfault when run inside the profiler. The behaviour seems to depend on numba: removing all the numba.njit decorators makes all the code run fine, with no Python exceptions and no other indication of a fault. The segfault occurs with both numba 0.39 and 0.40, using Python 3.6.5, numpy 1.15 and scipy 1.1.0.

The behaviour appears on both macOS (High Sierra) and Linux (the cx1 cluster at Imperial).

Example to reproduce: rabi.py.txt
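
For reference, this is a minimal sketch of the failing pattern. The kernel here is a hypothetical stand-in for whatever evolution.eigensystem calls internally, not the actual code from the attached file:

```python
import cProfile

import numba
import numpy as np


@numba.njit
def eigensystem_kernel(h):
    # Hypothetical stand-in: any njit-compiled routine on the hot path
    # seems to do it; the real one lives in evolution.eigensystem.
    return np.linalg.eigh(h)


h = np.diag(np.arange(4.0))
eigensystem_kernel(h)                    # fine outside the profiler
cProfile.run("eigensystem_kernel(h)")    # segfaults here
```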

When I attach a debugger to python and run the example file (or any similar reproducer), I get the message

```
stop reason = EXC_BAD_ACCESS (code=1, address=0x10)
```

(the address always seems to be the same), and the offending disassembled instruction is

```
0x1000540f1: mov rax, qword ptr [rdi + 0x10]
```

This makes me think that the rdi register is either being zeroed or never written before use: it looks less like an out-of-bounds array access and more like a null-pointer dereference (here with a 16-byte offset). The same behaviour occurs whether I use lldb on my Mac or gdb on the cluster. On the cluster the stack size is unlimited (per ulimit -s), so I don't think stack exhaustion is the problem (and in any case, all the allocations should be on the heap).

I don't have debugging symbols for Python/numpy/numba installed (and they're not available through conda), so I haven't yet tracked the issue down far enough to determine whose fault it is.

I also tried setting the environment variables OMP_NUM_THREADS, NUMBA_NUM_THREADS and MKL_NUM_THREADS all to 1 to rule out threading issues. This has no apparent effect, and in any case only one thread appears to be active at the critical point when running under the debugger.
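
For completeness, the pinning looked like the sketch below; exporting the same variables in the shell before launching python is equivalent:

```python
import os

# Pin every threading layer to a single thread.  Setting these before
# numpy/numba are imported is the safe order, since the threading
# layers may read them at load time.
for var in ("OMP_NUM_THREADS", "NUMBA_NUM_THREADS", "MKL_NUM_THREADS"):
    os.environ[var] = "1"

import numpy as np  # noqa: E402
import numba        # noqa: E402
```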

Possibly related to issue numba/numba#3229?

At any rate, this doesn't seem to be a blocking problem, because it only manifests when running under the profiler.

@jakelishman added the bug label on Oct 1, 2018
@jakelishman
Member Author

> The behaviour appears on both macOS (High Sierra) and Linux (the cx1 cluster at Imperial).

This isn't to say that the behaviour doesn't appear on Windows; I just don't have a Windows installation to test it on.

@jakelishman
Member Author

Using python rabi.py (the reproduction example file) as the test, git bisect says that commit a46d7e6 is the first bad one. That commit gave plenty of opportunity for an out-of-bounds memory access to sneak into my code (though I can't spot one), so there's still a reasonable chance this is my fault.
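
For what it's worth, the bisection can be automated with git bisect run. A sketch of a wrapper script (hypothetical, not part of the repository) that treats any crash of the reproducer as a bad commit:

```python
#!/usr/bin/env python
# Run the reproducer in a subprocess and translate its outcome into the
# exit codes `git bisect run` expects: 0 for good, 1 for bad.
import subprocess
import sys

result = subprocess.run([sys.executable, "rabi.py"])
# On POSIX a segfault shows up as a negative return code (-11 = -SIGSEGV),
# so any nonzero code marks the commit as bad.
sys.exit(0 if result.returncode == 0 else 1)
```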
