[BUG]: Racy non-trivial runs test #351

Closed
1 task done
gevtushenko opened this issue Aug 17, 2023 · 1 comment

@gevtushenko
Collaborator

Is this a duplicate?

Type of Bug

Runtime Error

Component

CUB

Describe the bug

The cub::DeviceRunLengthEncode::NonTrivialRuns test failed in the CTK 12.2, LLVM 15, C++14 container with the following error message:

Pointer DeviceRunLengthEncode::NonTrivialRuns cub::CUB on 1000000 items, 174932 segments (avg run length 5.717), {6uchar2 key, i offset, i length}, max_segment 16, entropy_reduction 1
Synchronizing...
Synchronizing...
INCORRECT: [173]: 0 != 968	 Offsets FAIL
INCORRECT: [172]: 0 != 2	 Lengths FAIL
Count PASS

A consecutive run was successful. It's unclear whether there's a race in the algorithm or whether a particular random workload leads to the issue. The only change in 2.2 that's related to non-trivial runs on V100 is #294.
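For context on what the "Offsets" and "Lengths" checks compare against: NonTrivialRuns reports the starting offset and length of every run of equal consecutive items that is longer than one item. Below is a minimal host-side sketch of such a reference in C++. It is not the code from the CUB test harness; the names `NonTrivialRunsResult` and `reference_non_trivial_runs` are hypothetical and only serve the illustration.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical container for the two output arrays checked by the test log
// ("Offsets" and "Lengths"). The run count is offsets.size().
struct NonTrivialRunsResult {
  std::vector<int> offsets;
  std::vector<int> lengths;
};

// Sequential reference: scan maximal runs of equal consecutive items and
// keep only those with length > 1 (the "non-trivial" ones).
template <typename T>
NonTrivialRunsResult reference_non_trivial_runs(const std::vector<T>& in) {
  NonTrivialRunsResult out;
  std::size_t i = 0;
  while (i < in.size()) {
    // Find the end of the run that starts at position i.
    std::size_t j = i + 1;
    while (j < in.size() && in[j] == in[i]) {
      ++j;
    }
    if (j - i > 1) {
      out.offsets.push_back(static_cast<int>(i));
      out.lengths.push_back(static_cast<int>(j - i));
    }
    i = j;  // Continue with the first item of the next run.
  }
  return out;
}
```

For the input {0, 2, 2, 9, 5, 5, 5, 8} this yields offsets {1, 4} and lengths {2, 3}. In the failing log above, the device result holds 0 at one offset index and one length index where the reference expects 968 and 2.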

How to Reproduce

Run ./bin/cub.cpp20.test.device_run_length_encode.cdp_0 and hope that the issue occurs.

Expected behavior

The test always passes.

Reproduction link

No response

Operating System

No response

nvidia-smi output

No response

NVCC version

No response

@gevtushenko gevtushenko added the bug Something isn't working right. label Aug 17, 2023
@jrhemstad jrhemstad changed the title from "[BUG]: Racy non-trivail runs test" to "[BUG]: Racy non-trivial runs test" Aug 30, 2023
@elstehle
Collaborator

To share a few findings so far:

  • I have to run the entire RLE test suite ~1300 times on average to observe the failure
  • So far, it has always been the same test case that fails
  • So far, I have only been able to reproduce it with the clang host compiler
  • Running with compute-sanitizer --tool racecheck reports some hazards that I'm currently investigating
  • The failure also occurs before the latest fix for non-trivial runs
  • Races are also reported for commits before the latest fix for non-trivial runs
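Since the failure shows up only about once in ~1300 runs, reproducing it means rerunning the test binary in a loop until it fails. Here is a minimal sketch of such a loop in C++. The helper name `first_failing_run` is hypothetical, and the sketch assumes the test binary returns a nonzero exit status when a check fails, which this thread does not confirm.

```cpp
#include <cstdlib>
#include <string>

// Runs `command` through the system shell up to `max_runs` times.
// Returns the zero-based index of the first run that reports a nonzero
// status, or -1 if every run succeeded.
// Assumption: the command signals a test failure via its exit status.
int first_failing_run(const std::string& command, int max_runs) {
  for (int run = 0; run < max_runs; ++run) {
    if (std::system(command.c_str()) != 0) {
      return run;
    }
  }
  return -1;
}
```

A call such as `first_failing_run("./bin/cub.cpp20.test.device_run_length_encode.cdp_0", 5000)` would stop at the first failing iteration. The same helper can wrap the `compute-sanitizer --tool racecheck` invocation, though runs under the sanitizer are likely to be much slower.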
