[BUG]: Racy non-trivial runs test #351

gevtushenko · 2023-08-17T08:34:33Z

Is this a duplicate?

I confirmed there appear to be no duplicate issues for this bug and that I agree to the Code of Conduct

Type of Bug

Runtime Error

Component

CUB

Describe the bug

The cub::DeviceRunLengthEncode::NonTrivialRuns test failed in ctk 12.2, llvm15, c++14 container with the following error message:

Pointer DeviceRunLengthEncode::NonTrivialRuns cub::CUB on 1000000 items, 174932 segments (avg run length 5.717), {6uchar2 key, i offset, i length}, max_segment 16, entropy_reduction 1
Synchronizing...
Synchronizing...
INCORRECT: [173]: 0 != 968	 Offsets FAIL
INCORRECT: [172]: 0 != 2	 Lengths FAIL
Count PASS

A consecutive run was successful. It's unclear whether there's a race in the algorithm or whether a particular random workload leads to the issue. The only change in 2.2 that's related to non-trivial runs on V100 is #294.

How to Reproduce

Run ./bin/cub.cpp20.test.device_run_length_encode.cdp_0 repeatedly and hope that the issue takes place.
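Since the failure is intermittent, a small driver that re-runs the binary until it fails can save some manual effort. The following is a minimal sketch, not part of the CUB test infrastructure: the helper name run_until_failure is hypothetical, and it assumes the test signals failure either through a nonzero exit code or by printing a line containing "INCORRECT", as in the log above.

```python
import subprocess
import sys


def run_until_failure(cmd, max_runs):
    """Run cmd up to max_runs times.

    Returns the 1-based index of the first failing run, or None if every
    run passed. Hypothetical helper; the failure criteria are assumptions.
    """
    for attempt in range(1, max_runs + 1):
        result = subprocess.run(cmd, capture_output=True, text=True)
        # Assumption: a failing test exits nonzero or prints "INCORRECT".
        if result.returncode != 0 or "INCORRECT" in result.stdout:
            # Keep the failing run's output so the bad case can be inspected.
            sys.stderr.write(result.stdout)
            return attempt
    return None
```

For example, run_until_failure(["./bin/cub.cpp20.test.device_run_length_encode.cdp_0"], 5000) would stop at the first failing run and echo that run's output to stderr.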

Expected behavior

The test always passes.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response


elstehle · 2023-08-30T16:07:56Z

To share a few findings so far:

I have to run the whole RLE test suite ~1300 times on average to observe the failure
So far, it was always the same test case failing
So far, I was only able to reproduce with the clang host compiler
Running with compute-sanitizer --tool racecheck reports some hazards that I'm currently investigating
The failure also occurs before the latest fix for non-trivial runs
Races are also reported for commits before the latest fix for non-trivial runs
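To put the ~1300 figure in perspective, here is a rough estimate of how many clean runs are needed before a candidate fix can be trusted. This is only a sketch under stated assumptions: runs are treated as independent, the per-run failure probability is taken as 1/1300 (the reciprocal of the observed average), and the helper name runs_for_confidence is hypothetical.

```python
import math


def runs_for_confidence(mean_runs_to_failure, confidence):
    """Number of runs n such that at least one failure is observed with
    the given probability, i.e. 1 - (1 - p)**n >= confidence.

    Assumption: independent runs with failure probability
    p = 1 / mean_runs_to_failure (geometric model).
    """
    p = 1.0 / mean_runs_to_failure
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p))
```

Under this model, runs_for_confidence(1300, 0.95) comes out at roughly three times the average, so a few hundred passing runs after a fix say very little about whether the race is gone.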

gevtushenko added the bug (Something isn't working right.) label Aug 17, 2023

gevtushenko assigned elstehle Aug 17, 2023

gevtushenko mentioned this issue Aug 17, 2023

Tune reduce by key on A100 #346

Merged

2 tasks

jrhemstad changed the title ~~[BUG]: Racy non-trivail runs test~~ [BUG]: Racy non-trivial runs test Aug 30, 2023

elstehle mentioned this issue Sep 4, 2023

Fixes a race in DeviceRunLengthEncode::NonTrivialRuns #399

Merged

2 tasks

elstehle closed this as completed Sep 6, 2023
