Non-deterministic segfault with Intel CPU #12
Comments
Sorry for the delay. Until then: Where does the parallelism come into play here? I assume that you are really talking about multiple host threads, right? And... is there a Java program that shows the same problem? (If not, I'll try out the Scala version ASAP.)
Nothing Scala-specific here really, just that my code was already in Scala. I am talking about multiple host threads all mapping and writing memory at the same time.
The main question was: which parts of this code, exactly, are executed in parallel? I'm not sure whether this is the case here. Otherwise, I'll try to have a closer look at this soon, but the mix of guessing, reading Scala docs, and deriving what it is likely doing in the background seems... non-deterministic ;-) I'd really like to pin this down to a case that I can analyze quickly and reliably.
This is the code in question:

import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.DoubleBuffer;

import org.jocl.Sizeof;
import org.jocl.cl_command_queue;
import org.jocl.cl_context;
import org.jocl.cl_device_id;
import org.jocl.cl_mem;

import static org.jocl.CL.*;

public class OpenCLSession {

    final cl_context context;
    final cl_command_queue queue;
    final cl_device_id device;

    OpenCLSession(cl_context context, cl_command_queue queue, cl_device_id device) {
        this.context = context;
        this.queue = queue;
        this.device = device;
    }

    // Copies the contents of the iterator into a freshly allocated,
    // host-accessible buffer: allocate, map, write, unmap.
    cl_mem stream(scala.collection.Iterator<Double> it) {
        int groupSize = 1024 * 1024 * 256;
        cl_mem on_host = null;
        try {
            on_host = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, groupSize, null, null);
            ByteBuffer rawBuffer = clEnqueueMapBuffer(queue, on_host, true, CL_MAP_WRITE,
                    0, groupSize, 0, null, null, null);
            DoubleBuffer buffer = rawBuffer.order(ByteOrder.nativeOrder()).asDoubleBuffer();
            int copied = 0;
            while (copied < groupSize / Sizeof.cl_double && it.hasNext()) {
                buffer.put(copied, it.next());
                copied += 1;
            }
            clEnqueueUnmapMemObject(queue, on_host, rawBuffer, 0, null, null);
            // Retain before the finally block releases, so that on the successful
            // path the caller is left holding exactly one reference.
            clRetainMemObject(on_host);
            return on_host;
        } finally {
            if (on_host != null)
                clReleaseMemObject(on_host);
        }
    }

    @Override
    protected void finalize() {
        clReleaseCommandQueue(queue);
        clReleaseContext(context);
    }
}
Since the buffers are all freshly allocated by clCreateBuffer …
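For illustration, here is a minimal sketch of the kind of parallel usage described above: several host threads share one OpenCLSession and each concurrently maps and writes its own freshly allocated buffer. This driver is hypothetical (the original Scala driver is not shown in this thread); it assumes Scala 2.13's scala.jdk.javaapi.CollectionConverters to adapt a Java iterator to the scala.collection.Iterator expected by stream(...), and the createSession() helper, thread count, and random test data are made up for the example.

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.stream.DoubleStream;

import org.jocl.CL;
import org.jocl.cl_command_queue;
import org.jocl.cl_context;
import org.jocl.cl_context_properties;
import org.jocl.cl_device_id;
import org.jocl.cl_mem;
import org.jocl.cl_platform_id;

import static org.jocl.CL.*;

import scala.jdk.javaapi.CollectionConverters;

public class ParallelRepro {

    public static void main(String[] args) throws Exception {
        OpenCLSession session = createSession();

        // The report suggests: the more threads, the more likely the crash.
        int threads = args.length > 0 ? Integer.parseInt(args[0]) : 4;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        List<Future<cl_mem>> results = new ArrayList<>();

        for (int t = 0; t < threads; t++) {
            results.add(pool.submit(() -> {
                // Each host thread adapts its own Java iterator to a Scala one
                // and maps/writes its own freshly allocated buffer concurrently.
                scala.collection.Iterator<Double> doubles = CollectionConverters.asScala(
                        DoubleStream.generate(Math::random).limit(1_000_000)
                                    .boxed().iterator());
                return session.stream(doubles);
            }));
        }

        for (Future<cl_mem> f : results) {
            // Drop the reference that stream(...) retained for the caller.
            clReleaseMemObject(f.get());
        }
        pool.shutdown();
    }

    // Minimal session setup for the first CPU device of the first platform.
    // The original report does not show this part; error handling is omitted.
    static OpenCLSession createSession() {
        CL.setExceptionsEnabled(true);
        cl_platform_id[] platforms = new cl_platform_id[1];
        clGetPlatformIDs(1, platforms, null);
        cl_device_id[] devices = new cl_device_id[1];
        clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_CPU, 1, devices, null);
        cl_context_properties props = new cl_context_properties();
        props.addProperty(CL_CONTEXT_PLATFORM, platforms[0]);
        cl_context context = clCreateContext(props, 1, devices, null, null, null);
        cl_command_queue queue = clCreateCommandQueue(context, devices[0], 0, null);
        return new OpenCLSession(context, queue, devices[0]);
    }
}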
So I have tried out the original program that you posted, only adjusted to use my CPU, which is one from AMD. I started the program several times and did not experience any crashes. If I understood this correctly, then the answer to my question above (namely, where the actual parallelism comes into play) is the one you already gave: multiple host threads all mapping and writing memory at the same time.
Did I understand this correctly: when you remove this, then you do not see any crashes? During the crash, it should write the infamous hs_err_pid log file. (If possible, I'd try to write a plain OpenCL (C++) program that uses 2 host threads, to narrow down the search space, and check whether this problem might be caused by the Intel OpenCL implementation.)
I have tested this on several OpenCL implementations, and the Intel CPU one is the only one that crashes. I have only observed it under parallel execution (the more threads, the more likely it crashes). When it crashes, it produces no hs_err file. The core dump file (not sure if that helps) is huge.
Turns out the compressed core dump is not so huge; here it is (with 3 threads): http://bulsa.faui2k11.de/core.xz
Sorry, I'm not familiar with Linux and core dump analysis. But when you say that it only happens on Intel CPUs, then it's not unlikely that the reason is actually a bug/limitation/constraint of the Intel OpenCL implementation. (I don't say that it is, only that it's not unlikely.) In cases like this, I usually try to write a native OpenCL program that "does the same thing" (as far as reasonably possible), and consider the result as "ground truth": when it also crashes without the JOCL layer, then the reason is somewhere else. Again: this could really be caused by JOCL or the "unfortunate interference" between JVM and OpenCL that you mentioned, but finding a definite answer here may be tricky. I'll try to allocate some time for creating a native implementation, with multiple threads, each mapping buffers, based on the given example. It will likely work for me, and I can't test it on an Intel CPU, but maybe I can provide a minimal test case for you to try out on Intel. However, I can't give an exact time frame for this.
I've tested this some more and indeed, it looks like a HotSpot/Intel problem: the same code (modified to use DoubleStream instead of a Scala Iterator) never segfaults when using the IBM Java implementation. With the OpenJDK one it segfaults most of the time, but more interestingly, the segfaults are still there on successful runs (as can be seen with …). So my conclusion is that the HotSpot JVM uses segfaults internally at some point, and the Intel SDK probably masks/deregisters the handler. The only remaining question now is: where to report this? 😞
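As a rough sketch of what that modification might look like (the actual code in the uploaded jar is not shown in this thread), the scala.collection.Iterator parameter of stream(...) can be replaced by a primitive iterator obtained from a java.util.stream.DoubleStream, taking Scala out of the picture entirely. The method is otherwise unchanged and still relies on the surrounding class's context and queue fields.

// Hypothetical DoubleStream-based variant of OpenCLSession.stream(...),
// as a drop-in replacement inside the class shown earlier.
// Additional imports needed: java.util.PrimitiveIterator, java.util.stream.DoubleStream.
cl_mem stream(DoubleStream values) {
    int groupSize = 1024 * 1024 * 256;
    PrimitiveIterator.OfDouble it = values.iterator();
    cl_mem on_host = null;
    try {
        on_host = clCreateBuffer(context, CL_MEM_ALLOC_HOST_PTR, groupSize, null, null);
        ByteBuffer rawBuffer = clEnqueueMapBuffer(queue, on_host, true, CL_MAP_WRITE,
                0, groupSize, 0, null, null, null);
        DoubleBuffer buffer = rawBuffer.order(ByteOrder.nativeOrder()).asDoubleBuffer();
        int copied = 0;
        while (copied < groupSize / Sizeof.cl_double && it.hasNext()) {
            buffer.put(copied, it.nextDouble()); // primitive access, no boxing
            copied += 1;
        }
        clEnqueueUnmapMemObject(queue, on_host, rawBuffer, 0, null, null);
        clRetainMemObject(on_host);
        return on_host;
    } finally {
        if (on_host != null)
            clReleaseMemObject(on_host);
    }
}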
Thanks for the further investigation! I'd have to first do some research about the options for debugging a segfault in such a complex setup (OpenCL accessed by JOCL accessed by a JVM...), particularly on Linux. But you already mentioned the core dump. And maybe it's possible to derive a hint about where to report this. Most likely, the OpenCL/JVM implementors will skeptically look at such a report and say "Are you sure it's not caused by the JVM/OpenCL implementation?" (respectively). Having a stack trace might help to pin down the culprit. (And I'm still crossing my fingers that it's not actually JOCL in the end...) (BTW: Apologies that I can't be as supportive as I'd like to be right now. The task of increasing test coverage is already on my todo list, together with e.g. issue 7, and this should probably include more extensive testing on other VMs and OSes - I'm currently very focused on Windows/Oracle...)
I posted this issue to the Intel forums: |
Thanks for keeping this up. I think that only the Intel folks can really give an answer here, if they are willing to investigate it (the setup and configuration is very special, and it's hard to reproduce the issue). The first response at least sounds encouraging that they will have a look at this.
It shouldn't be that hard to reproduce with the jar I uploaded, although my mentioning that I basically don't care anymore seems to have caused a loss of interest on Intel's side.
Maybe a "bump" there could also be helpful, but it's not unlikely that they consider the problematic case as "too narrow" (or "too specific"). As for the JOCL side, I'm not sure what I could do now. (I've recently been busy with other stuff, and the JOCL work was only a few updates to JOCLBlast, but the open issues here are sill nagging me). In any case, I'll leave this one open as well, at least until it's clear whether Intel will still respond or not. |
I have some Scala code that puts the contents of an iterator into OpenCL device memory. If I do this in parallel on the Intel CPU OpenCL implementation, it segfaults most of the time. I have reduced it down to the following code:
I suspect it is some unfortunate interference between the JVM and the Intel OpenCL implementation. I would be glad for some expert judgment on this.