OpenCL race condition with TailStrategy::ShiftInwards? #5430
-
When generating tiled OpenCL code, Halide generates code which I believe has a race condition (multiple concurrent threads may write to the same output element). I do not see how this is legal according to the OpenCL specification and believe atomic stores should be used for correctness, or am I mistaken? The program does seem to consistently produce the correct output on my machine. Small example: output(x) = input(x) + 2 * input(x+1) + input(x+2);
output.gpu_tile(x, xi, 32); // defaults to TailStrategy::ShiftInwards Intermediate statements extract:
OpenCL kernel extract: __kernel void kernel_output_s0_x_x___block_id_x(
__address_space__input const float *restrict _input,
__address_space__output float *restrict _output,
// [...]
__local int16* __shared)
{
int _output_s0_x_x___block_id_x = get_group_id(0);
int ___thread_id_x = get_local_id(0);
bool _0 = _output_s0_x_x___block_id_x < _t16;
if (_0)
{
int _1 = _output_s0_x_x___block_id_x * 32;
int _13 = _1 + ___thread_id_x;
// [...]
_output[_13] = _12;
} // if _0
else
{
// [...]
int _25 = ___thread_id_x + _output_extent_0;
int _26 = _25 + -32;
_output[_26] = _24;
} // if _0 else
} // kernel kernel_output_s0_x_x___block_id_x |
Beta Was this translation helpful? Give feedback.
Replies: 1 comment
-
Yes, Halide can generate race conditions of this specific type: Two threads race to store the same value to the same memory location, and then there's a full memory barrier before any thread tries to read that location. It's the most benign possible race condition I can think of, but it's still technically UB in many contexts. If it's a problem I would say just use GuardWithIf instead. We have seen it cause non-determinism in the past when the tail case gets compiled as a separate piece of code, floating point optimizations shake out differently, and then the race is between different values, both presumably "correct" ones according to -ffast-math wild west rules. |
Beta Was this translation helpful? Give feedback.
Yes, Halide can generate race conditions of this specific type: Two threads race to store the same value to the same memory location, and then there's a full memory barrier before any thread tries to read that location.
It's the most benign possible race condition I can think of, but it's still technically UB in many contexts. If it's a problem I would say just use GuardWithIf instead. We have seen it cause non-determinism in the past when the tail case gets compiled as a separate piece of code, floating point optimizations shake out differently, and then the race is between different values, both presumably "correct" ones according to -ffast-math wild west rules.