Make thrust::transform use cub::DeviceTransform #2389

Merged on Nov 6, 2024 (3 commits)

Contributor

@bernhardmgruber bernhardmgruber commented Sep 6, 2024

This PR makes thrust::transform use cub::DeviceTransform for the CUDA backend.

It also:

  • Introduces address stability detection and opt-in in libcu++
  • Marks lambdas in the Thrust BabelStream benchmark as address-oblivious
  • Adds an optimization to the cub::DeviceTransform prefetch algorithm for small problem sizes

Benchmark on H100
# base

## [0] NVIDIA H100 NVL

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   U32   |    2^16    |  10.367 us |       3.40% |   9.172 us |       3.05% | -1.195 us | -11.53% |   FAIL   |
|   U32   |    2^20    |  15.435 us |       4.07% |  16.118 us |       2.96% |  0.683 us |   4.43% |   FAIL   |
|   U32   |    2^24    |  96.316 us |       0.49% |  98.619 us |       0.30% |  2.303 us |   2.39% |   FAIL   |
|   U32   |    2^28    |   1.378 ms |       0.12% |   1.392 ms |       0.04% | 13.307 us |   0.97% |   FAIL   |
|   U64   |    2^16    |  12.335 us |      14.19% |   9.878 us |       3.00% | -2.457 us | -19.92% |   FAIL   |
|   U64   |    2^20    |  16.956 us |       5.22% |  17.244 us |       2.56% |  0.288 us |   1.70% |   PASS   |
|   U64   |    2^24    | 116.436 us |       0.32% | 112.155 us |       0.43% | -4.281 us |  -3.68% |   FAIL   |
|   U64   |    2^28    |   1.777 ms |       2.54% |   1.815 ms |       2.98% | 37.387 us |   2.10% |   PASS   |

# mul

## [0] NVIDIA H100 NVL

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |    2^25    |  86.886 us |       2.14% |  34.495 us |       2.99% | -52.392 us | -60.30% |   FAIL   |
|   I16   |    2^25    |  94.903 us |       0.72% |  50.397 us |       1.35% | -44.506 us | -46.90% |   FAIL   |
|   F32   |    2^25    | 112.067 us |       0.55% |  87.791 us |       0.80% | -24.275 us | -21.66% |   FAIL   |
|   F64   |    2^25    | 172.552 us |       0.41% | 162.959 us |       0.54% |  -9.592 us |  -5.56% |   FAIL   |
|  I128   |    2^25    | 313.923 us |       0.56% | 316.735 us |       0.46% |   2.812 us |   0.90% |   FAIL   |

# add

## [0] NVIDIA H100 NVL

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |    2^25    |  93.893 us |       1.08% |  45.065 us |       1.88% | -48.828 us | -52.00% |   FAIL   |
|   I16   |    2^25    | 105.996 us |       0.28% |  70.617 us |       0.80% | -35.378 us | -33.38% |   FAIL   |
|   F32   |    2^25    | 142.256 us |       0.37% | 124.355 us |       0.51% | -17.901 us | -12.58% |   FAIL   |
|   F64   |    2^25    | 234.508 us |       0.31% | 233.388 us |       0.32% |  -1.121 us |  -0.48% |   FAIL   |
|  I128   |    2^25    | 455.391 us |       0.59% | 453.136 us |       0.58% |  -2.254 us |  -0.50% |   PASS   |

# triad

## [0] NVIDIA H100 NVL

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |    2^25    |  95.942 us |       0.98% |  44.766 us |       1.78% | -51.176 us | -53.34% |   FAIL   |
|   I16   |    2^25    | 106.412 us |       0.39% |  70.679 us |       1.00% | -35.733 us | -33.58% |   FAIL   |
|   F32   |    2^25    | 142.654 us |       0.40% | 124.903 us |       0.53% | -17.751 us | -12.44% |   FAIL   |
|   F64   |    2^25    | 234.775 us |       0.31% | 233.577 us |       0.30% |  -1.198 us |  -0.51% |   FAIL   |
|  I128   |    2^25    | 455.983 us |       0.57% | 453.968 us |       0.52% |  -2.015 us |  -0.44% |   PASS   |

# nstream

## [0] NVIDIA H100 NVL

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |    2^25    | 102.062 us |       1.36% |  55.360 us |       2.87% | -46.702 us | -45.76% |   FAIL   |
|   I16   |    2^25    | 119.080 us |       0.29% |  89.179 us |       0.57% | -29.901 us | -25.11% |   FAIL   |
|   F32   |    2^25    | 172.130 us |       0.29% | 160.601 us |       0.41% | -11.529 us |  -6.70% |   FAIL   |
|   F64   |    2^25    | 305.508 us |       0.19% | 304.931 us |       0.25% |  -0.578 us |  -0.19% |   PASS   |
|  I128   |    2^25    | 595.518 us |       0.30% | 594.329 us |       0.28% |  -1.189 us |  -0.20% |   PASS   |

# nstream_stable

## [0] NVIDIA H100 NVL

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^25    | 102.602 us |       1.36% |  99.178 us |       1.61% | -3.424 us |  -3.34% |   FAIL   |
|   I16   |    2^25    | 119.544 us |       0.40% | 114.715 us |       0.37% | -4.829 us |  -4.04% |   FAIL   |
|   F32   |    2^25    | 172.225 us |       0.32% | 167.925 us |       0.30% | -4.300 us |  -2.50% |   FAIL   |
|   F64   |    2^25    | 305.132 us |       0.17% | 305.411 us |       0.16% |  0.280 us |   0.09% |   PASS   |
|  I128   |    2^25    | 595.308 us |       0.29% | 595.591 us |       0.33% |  0.283 us |   0.05% |   PASS   |

Benchmark on H200
# base

## [0] NVIDIA H200

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
|   U32   |    2^16    |  12.166 us |       4.54% |  11.492 us |       6.80% |  -0.674 us |  -5.54% |   FAIL   |
|   U32   |    2^20    |  16.644 us |       2.42% |  17.449 us |       2.59% |   0.805 us |   4.83% |   FAIL   |
|   U32   |    2^24    |  88.994 us |       0.57% |  90.992 us |       0.59% |   1.998 us |   2.25% |   FAIL   |
|   U32   |    2^28    |   1.242 ms |       3.31% |   1.254 ms |       0.94% |  11.856 us |   0.95% |   FAIL   |
|   U64   |    2^16    |  12.668 us |       5.41% |  11.867 us |       4.50% |  -0.801 us |  -6.32% |   FAIL   |
|   U64   |    2^20    |  17.953 us |       3.47% |  18.353 us |       2.51% |   0.400 us |   2.23% |   PASS   |
|   U64   |    2^24    | 105.728 us |       0.47% | 102.073 us |       0.63% |  -3.655 us |  -3.46% |   FAIL   |
|   U64   |    2^28    |   1.509 ms |       0.13% |   1.445 ms |       0.45% | -64.508 us |  -4.27% |   FAIL   |

# mul

## [0] NVIDIA H200

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |    2^25    |  79.274 us |       1.54% |  32.089 us |       2.77% | -47.185 us | -59.52% |   FAIL   |
|   I16   |    2^25    |  87.026 us |       0.78% |  46.222 us |       2.37% | -40.803 us | -46.89% |   FAIL   |
|   F32   |    2^25    | 101.086 us |       0.89% |  76.642 us |       1.22% | -24.444 us | -24.18% |   FAIL   |
|   F64   |    2^25    | 149.348 us |       0.71% | 139.039 us |       0.69% | -10.309 us |  -6.90% |   FAIL   |
|  I128   |    2^25    | 264.865 us |       0.32% | 269.614 us |       0.37% |   4.749 us |   1.79% |   FAIL   |

# add

## [0] NVIDIA H200

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |    2^25    |  86.340 us |       0.63% |  41.636 us |       1.54% | -44.704 us | -51.78% |   FAIL   |
|   I16   |    2^25    |  97.559 us |       0.50% |  62.541 us |       1.21% | -35.017 us | -35.89% |   FAIL   |
|   F32   |    2^25    | 124.967 us |       0.44% | 106.734 us |       0.45% | -18.233 us | -14.59% |   FAIL   |
|   F64   |    2^25    | 198.089 us |       0.28% | 197.965 us |       6.21% |  -0.124 us |  -0.06% |   PASS   |
|  I128   |    2^25    | 382.959 us |       2.94% | 379.519 us |       0.24% |  -3.439 us |  -0.90% |   FAIL   |

# triad

## [0] NVIDIA H200

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |    2^25    |  88.292 us |       0.76% |  41.519 us |       2.03% | -46.773 us | -52.98% |   FAIL   |
|   I16   |    2^25    |  96.488 us |       0.53% |  62.454 us |       1.28% | -34.034 us | -35.27% |   FAIL   |
|   F32   |    2^25    | 125.204 us |       0.44% | 106.697 us |       0.48% | -18.507 us | -14.78% |   FAIL   |
|   F64   |    2^25    | 198.454 us |       0.25% | 196.966 us |       0.34% |  -1.489 us |  -0.75% |   FAIL   |
|  I128   |    2^25    | 381.363 us |       0.37% | 377.733 us |       0.36% |  -3.631 us |  -0.95% |   FAIL   |

# nstream

## [0] NVIDIA H200

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
|   I8    |    2^25    |  92.025 us |       0.47% |  48.296 us |       1.49% | -43.729 us | -47.52% |   FAIL   |
|   I16   |    2^25    | 106.494 us |       0.33% |  75.618 us |       1.02% | -30.876 us | -28.99% |   FAIL   |
|   F32   |    2^25    | 146.526 us |       0.56% | 134.765 us |       0.61% | -11.761 us |  -8.03% |   FAIL   |
|   F64   |    2^25    | 254.879 us |       0.36% | 254.521 us |       0.37% |  -0.358 us |  -0.14% |   PASS   |
|  I128   |    2^25    | 499.487 us |       0.36% | 494.593 us |       0.28% |  -4.894 us |  -0.98% |   FAIL   |

# nstream_stable

## [0] NVIDIA H200

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |      Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|-----------|---------|----------|
|   I8    |    2^25    |  92.382 us |       0.52% |  88.861 us |       0.56% | -3.521 us |  -3.81% |   FAIL   |
|   I16   |    2^25    | 106.623 us |       0.45% | 100.466 us |       0.46% | -6.157 us |  -5.77% |   FAIL   |
|   F32   |    2^25    | 146.150 us |       0.56% | 142.722 us |       0.58% | -3.428 us |  -2.35% |   FAIL   |
|   F64   |    2^25    | 254.607 us |       0.47% | 254.666 us |       0.36% |  0.059 us |   0.02% |   PASS   |
|  I128   |    2^25    | 499.203 us |       0.16% | 496.988 us |       0.17% | -2.215 us |  -0.44% |   FAIL   |

Resolve before:

Comment on lines 127 to 129
thrust::transform(
  c.begin(), c.end(), b.begin(), cuda::std::allow_copied_arguments([=] __device__ __host__(const T& ci) {
    return ci * scalar;
  }));
Contributor Author

@bernhardmgruber bernhardmgruber Sep 6, 2024

This is what users are expected to write. They need to wrap their callables in cuda::allow_copied_arguments(...) or specialize cuda::allows_copied_arguments.
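The two opt-in routes mentioned here (wrapping a callable, or specializing a trait for a named functor) can be sketched in plain host-side C++. This is only an illustration of the pattern: the namespace `sketch`, the wrapper type `copied_arguments_wrapper`, and the functor `times_two` are hypothetical names, and the actual libcu++ implementation may look different.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

namespace sketch { // hypothetical namespace, not the libcu++ API

// Trait: may the algorithm hand copies of the elements to F?
// Conservative default: no.
template <typename F>
struct allows_copied_arguments : std::false_type {};

// Wrapper that forwards every call to the stored callable.
template <typename F>
struct copied_arguments_wrapper {
  F f;

  template <typename... Args>
  auto operator()(Args&&... args) const -> decltype(f(std::forward<Args>(args)...)) {
    return f(std::forward<Args>(args)...);
  }
};

// Any wrapped callable is marked as tolerating copied arguments.
template <typename F>
struct allows_copied_arguments<copied_arguments_wrapper<F>> : std::true_type {};

// Route 1: wrap a callable (for example a lambda) at the call site.
template <typename F>
copied_arguments_wrapper<typename std::decay<F>::type> allow_copied_arguments(F&& f) {
  return {std::forward<F>(f)};
}

// Route 2: specialize the trait for a named functor type.
struct times_two {
  int operator()(const int& x) const { return 2 * x; }
};

template <>
struct allows_copied_arguments<times_two> : std::true_type {};

} // namespace sketch
```

An algorithm could then branch on `sketch::allows_copied_arguments<F>::value` at compile time to choose between a path that copies elements and one that passes references to the original storage.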

Contributor

@mfbalin mfbalin Sep 9, 2024

Can you elaborate on the conditions under which cuda::std::allow_copied_arguments will be required? Is it for peak performance?

It is exciting that the functions I am using in my own code are going to speed up just like that by simply updating CCCL.

Contributor Author

The sad answer for now is that you need to wrap your callable in allow_copied_arguments every time you call a Thrust API that may use thrust::transform underneath, provided your callable does not rely on the addresses of its arguments. We have some ideas on how to detect this automatically for a set of known types and functors, e.g. thrust::plus<int>, which would enable at least some use cases without any wrapping.
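One way such automatic detection could work is a partial specialization of the trait for well-known functors over simple types. The sketch below uses `std::plus` as a stand-in for `thrust::plus`; the namespace `sketch_detect` and the choice of `std::is_arithmetic` as the criterion are assumptions for illustration, not the design adopted in CCCL.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>

namespace sketch_detect { // hypothetical namespace, for illustration only

// Conservative default: unknown callables may depend on argument addresses.
template <typename F>
struct allows_copied_arguments : std::false_type {};

// A stateless, well-known functor over an arithmetic type only looks at
// the values of its arguments, so it can be enabled automatically.
template <typename T>
struct allows_copied_arguments<std::plus<T>> : std::is_arithmetic<T> {};

} // namespace sketch_detect
```

With this specialization, `std::plus<int>` is detected without any user annotation, while an arbitrary lambda still falls back to the conservative default.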

Contributor

Is it due to the use of shared memory? If you take the address of an argument, it won't be the same as the address of the original data?

Contributor Author

Exactly. cub::DeviceTransform on Hopper+ will use an asynchronous bulk copy from global to shared memory and then run the transformation function on the data in shared memory. This generally leads to higher saturation of the memory bandwidth. But if your callable has a const T& value parameter, that reference will now point to shared memory instead of global memory, so &value may not be what you expect.
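The effect can be reproduced on the host with a small analogy. In the sketch below, a scratch buffer plays the role of shared memory; `transform_direct` and `transform_staged` are hypothetical helper names, and nothing here reflects the real CUB kernel.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Direct path: the callable receives a reference into the input buffer.
template <typename F>
std::vector<int> transform_direct(const std::vector<int>& in, F f) {
  std::vector<int> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    out[i] = f(in[i]);
  }
  return out;
}

// Staged path: each tile is first copied into a scratch buffer (a stand-in
// for shared memory), and the callable receives a reference into that copy.
template <typename F>
std::vector<int> transform_staged(const std::vector<int>& in, F f, std::size_t tile = 4) {
  std::vector<int> out(in.size());
  std::vector<int> scratch(tile);
  for (std::size_t base = 0; base < in.size(); base += tile) {
    const std::size_t n = std::min(tile, in.size() - base);
    std::copy(in.begin() + base, in.begin() + base + n, scratch.begin());
    for (std::size_t i = 0; i < n; ++i) {
      out[base + i] = f(scratch[i]);
    }
  }
  return out;
}
```

A callable that only reads the value, such as `[](const int& v) { return 2 * v; }`, produces the same output on both paths. A callable that compares `&v` against `in.data()` does not, because on the staged path `&v` points into the scratch buffer.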

Contributor

@mfbalin mfbalin Sep 10, 2024

When the user uses const T value as the callable argument, the address cannot be taken anymore. So can the fast-path optimization be enabled when the callable argument is passed by value?

Contributor Author

If const T value happens to be const std::reference_wrapper<U> value, or const thrust::device_reference<U> value, or any other proxy reference type, and your buffer contains Us, you can still recover the address of the Us in global memory even though your function syntactically takes arguments by copy.
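A minimal host-side sketch of this point, using std::reference_wrapper and a hypothetical helper named `address_behind`: the parameter is taken by value, yet the function still reaches the address of the element in the original buffer.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// The parameter is a by-value proxy, but it still refers to the original
// object, so the address of the underlying int can be recovered.
inline const int* address_behind(const std::reference_wrapper<const int> value) {
  return &value.get();
}
```

Calling `address_behind(buf[1])` on a `std::vector<int> buf` implicitly constructs the wrapper from the element and returns `&buf[1]`, so a by-value signature alone does not show that a callable ignores addresses.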

Contributor

@mfbalin mfbalin Sep 11, 2024

So you want to change it in such a way that even for those types, the fast path can be used to dereference the reference_wrapper object? If not, why is the fast path not enabled by default when const std::reference_wrapper<U> is the callable argument so that std::reference_wrapper<U> is copied via the fast path and the dereferencing can be done inside the users' kernel?

Contributor

I want to understand if there is any disadvantage if the user has a kernel that takes const T value and the user always wraps it with cuda::std::allow_copied_arguments vs not wrapping with it.

Contributor Author

> So you want to change it in such a way that even for those types, the fast path can be used to dereference the reference_wrapper object?

No, I cannot enable the fast path for by-value arguments like std::reference_wrapper<U> or thrust::device_reference<U>.

> I want to understand if there is any disadvantage if the user has a kernel that takes const T value and the user always wraps it with cuda::std::allow_copied_arguments vs not wrapping with it.

If T is cheap to copy, or the compiler is able to inline your function object, then there is no downside to using const T value. Wrapping your callable in cuda::std::allow_copied_arguments also has no downside.

@bernhardmgruber bernhardmgruber force-pushed the transform_thrust branch 2 times, most recently from 43ce459 to c662a2f Compare September 9, 2024 15:12
@bernhardmgruber
Contributor Author

We just discussed this PR in the code review hour and concluded that we are fine with putting the address stability traits into libcu++ and the cuda:: namespace. The functionality is sufficient and we are fine with the naming. The tests can remain in Thrust, since we will add more tests in conjunction with Thrust callables and Thrust types in the future. This mostly relates to automatic detection of whether transformation function arguments can be copied or not.

Contributor

🟨 CI finished in 7h 41m: Pass: 99%/417 | Total: 7d 07h | Avg: 25m 11s | Max: 2h 03m | Hits: 82%/39381
  • 🟨 thrust: Pass: 99%/118 | Total: 2d 11h | Avg: 30m 22s | Max: 1h 12m | Hits: 78%/17872

    🔍 cpu: amd64 🔍
      🔍 amd64              Pass:  99%/110 | Total:  2d 07h | Avg: 30m 16s | Max:  1h 12m | Hits:  78%/17872 
      🟩 arm64              Pass: 100%/8   | Total:  4h 14m | Avg: 31m 51s | Max: 40m 45s
    🔍 ctk: 11.1 🔍
      🔍 11.1               Pass:  93%/15  | Total:  7h 16m | Avg: 29m 05s | Max:  1h 06m
      🟩 11.8               Pass: 100%/3   | Total:  2h 01m | Avg: 40m 26s | Max: 47m 25s
      🟩 12.5               Pass: 100%/100 | Total:  2d 02h | Avg: 30m 16s | Max:  1h 12m | Hits:  78%/17872 
    🔍 cudacxx: nvcc11.1 🔍
      🟩 ClangCUDA17        Pass: 100%/2   | Total: 57m 22s | Avg: 28m 41s | Max: 29m 18s
      🔍 nvcc11.1           Pass:  93%/15  | Total:  7h 16m | Avg: 29m 05s | Max:  1h 06m
      🟩 nvcc11.8           Pass: 100%/3   | Total:  2h 01m | Avg: 40m 26s | Max: 47m 25s
      🟩 nvcc12.5           Pass: 100%/98  | Total:  2d 01h | Avg: 30m 18s | Max:  1h 12m | Hits:  78%/17872 
    🔍 cudacxx_family: nvcc 🔍
      🟩 ClangCUDA          Pass: 100%/2   | Total: 57m 22s | Avg: 28m 41s | Max: 29m 18s
      🔍 nvcc               Pass:  99%/116 | Total:  2d 10h | Avg: 30m 24s | Max:  1h 12m | Hits:  78%/17872 
    🚨 cxx: MSVC14.16 🚨
      🟩 Clang9             Pass: 100%/6   | Total:  2h 49m | Avg: 28m 10s | Max: 33m 07s
      🟩 Clang10            Pass: 100%/3   | Total:  1h 41m | Avg: 33m 47s | Max: 39m 02s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 03m | Avg: 30m 54s | Max: 32m 47s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 08m | Avg: 32m 05s | Max: 36m 31s
      🟩 Clang13            Pass: 100%/4   | Total:  2h 03m | Avg: 30m 58s | Max: 36m 25s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 06m | Avg: 31m 43s | Max: 35m 23s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 04m | Avg: 31m 09s | Max: 33m 01s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 05m | Avg: 31m 28s | Max: 36m 11s
      🟩 Clang17            Pass: 100%/18  | Total:  6h 42m | Avg: 22m 20s | Max: 32m 47s
      🟩 GCC6               Pass: 100%/2   | Total: 49m 01s | Avg: 24m 30s | Max: 26m 40s
      🟩 GCC7               Pass: 100%/6   | Total:  2h 55m | Avg: 29m 11s | Max: 34m 21s
      🟩 GCC8               Pass: 100%/6   | Total:  2h 56m | Avg: 29m 21s | Max: 32m 55s
      🟩 GCC9               Pass: 100%/6   | Total:  3h 02m | Avg: 30m 23s | Max: 38m 38s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 17m | Avg: 34m 24s | Max: 40m 29s
      🟩 GCC11              Pass: 100%/7   | Total:  4h 15m | Avg: 36m 26s | Max: 47m 25s
      🟩 GCC12              Pass: 100%/4   | Total:  2h 20m | Avg: 35m 12s | Max: 39m 41s
      🟩 GCC13              Pass: 100%/20  | Total:  7h 46m | Avg: 23m 18s | Max: 40m 45s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  1h 57m | Avg: 39m 03s | Max: 43m 24s
      🔥 MSVC14.16          Pass:   0%/1   | Total:  1h 06m | Avg:  1h 06m | Max:  1h 06m
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 08m | Avg:  1h 04m | Max:  1h 06m | Hits:  66%/4468  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  4h 24m | Avg: 44m 03s | Max:  1h 12m | Hits:  82%/13404 
    🔍 cxx_family: MSVC 🔍
      🟩 Clang              Pass: 100%/51  | Total: 23h 45m | Avg: 27m 57s | Max: 39m 02s
      🟩 GCC                Pass: 100%/55  | Total:  1d 02h | Avg: 28m 45s | Max: 47m 25s
      🟩 Intel              Pass: 100%/3   | Total:  1h 57m | Avg: 39m 03s | Max: 43m 24s
      🔍 MSVC               Pass:  88%/9   | Total:  7h 39m | Avg: 51m 05s | Max:  1h 12m | Hits:  78%/17872 
    🔍 jobs: Build 🔍
      🔍 Build              Pass:  98%/99  | Total:  2d 07h | Avg: 33m 26s | Max:  1h 12m | Hits:  66%/11170 
      🟩 TestCPU            Pass: 100%/11  | Total:  2h 08m | Avg: 11m 40s | Max: 23m 09s | Hits:  99%/6702  
      🟩 TestGPU            Pass: 100%/8   | Total:  2h 26m | Avg: 18m 16s | Max: 35m 45s
    🔍 std: 14 🔍
      🟩 11                 Pass: 100%/30  | Total: 12h 36m | Avg: 25m 12s | Max: 35m 45s
      🔍 14                 Pass:  97%/34  | Total: 18h 11m | Avg: 32m 06s | Max:  1h 06m | Hits:  77%/6702  
      🟩 17                 Pass: 100%/33  | Total: 17h 58m | Avg: 32m 41s | Max:  1h 06m | Hits:  77%/6702  
      🟩 20                 Pass: 100%/21  | Total: 10h 58m | Avg: 31m 21s | Max:  1h 12m | Hits:  82%/4468  
    🟨 gpu
      🟨 v100               Pass:  99%/118 | Total:  2d 11h | Avg: 30m 22s | Max:  1h 12m | Hits:  78%/17872 
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  2h 01m | Avg: 40m 26s | Max: 47m 25s
      🟩 90a                Pass: 100%/4   | Total:  1h 18m | Avg: 19m 34s | Max: 24m 44s
    
  • 🟩 cub: Pass: 100%/132 | Total: 4d 00h | Avg: 44m 01s | Max: 2h 03m | Hits: 62%/4362

    🟩 cpu
      🟩 amd64              Pass: 100%/124 | Total:  3d 17h | Avg: 43m 22s | Max:  2h 03m | Hits:  62%/4362  
      🟩 arm64              Pass: 100%/8   | Total:  7h 11m | Avg: 53m 54s | Max: 55m 47s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total: 11h 24m | Avg: 45m 39s | Max: 57m 01s | Hits:  56%/727   
      🟩 11.8               Pass: 100%/3   | Total:  3h 19m | Avg:  1h 06m | Max:  1h 08m
      🟩 12.5               Pass: 100%/114 | Total:  3d 10h | Avg: 43m 12s | Max:  2h 03m | Hits:  63%/3635  
    🟩 cudacxx
      🟩 ClangCUDA17        Pass: 100%/2   | Total: 46m 54s | Avg: 23m 27s | Max: 24m 53s
      🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 24m | Avg: 45m 39s | Max: 57m 01s | Hits:  56%/727   
      🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 19m | Avg:  1h 06m | Max:  1h 08m
      🟩 nvcc12.5           Pass: 100%/112 | Total:  3d 09h | Avg: 43m 33s | Max:  2h 03m | Hits:  63%/3635  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 46m 54s | Avg: 23m 27s | Max: 24m 53s
      🟩 nvcc               Pass: 100%/130 | Total:  4d 00h | Avg: 44m 20s | Max:  2h 03m | Hits:  62%/4362  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  4h 55m | Avg: 49m 10s | Max: 58m 56s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 30m | Avg: 50m 10s | Max: 51m 06s
      🟩 Clang11            Pass: 100%/4   | Total:  3h 31m | Avg: 52m 50s | Max: 54m 43s
      🟩 Clang12            Pass: 100%/4   | Total:  3h 31m | Avg: 52m 47s | Max: 55m 44s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 29m | Avg: 52m 22s | Max: 56m 10s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 33m | Avg: 53m 25s | Max: 55m 46s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 27m | Avg: 51m 53s | Max: 55m 29s
      🟩 Clang16            Pass: 100%/4   | Total:  3h 40m | Avg: 55m 13s | Max: 58m 31s
      🟩 Clang17            Pass: 100%/26  | Total: 12h 43m | Avg: 29m 22s | Max: 53m 00s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 31m | Avg: 45m 47s | Max: 48m 30s
      🟩 GCC7               Pass: 100%/6   | Total:  4h 52m | Avg: 48m 44s | Max: 56m 47s
      🟩 GCC8               Pass: 100%/6   | Total:  4h 46m | Avg: 47m 43s | Max: 50m 43s
      🟩 GCC9               Pass: 100%/6   | Total:  4h 49m | Avg: 48m 14s | Max: 54m 59s
      🟩 GCC10              Pass: 100%/4   | Total:  3h 37m | Avg: 54m 28s | Max: 57m 47s
      🟩 GCC11              Pass: 100%/7   | Total:  6h 53m | Avg: 59m 05s | Max:  1h 08m
      🟩 GCC12              Pass: 100%/4   | Total:  3h 33m | Avg: 53m 28s | Max: 57m 11s
      🟩 GCC13              Pass: 100%/29  | Total: 16h 05m | Avg: 33m 16s | Max:  2h 03m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 57m | Avg: 59m 02s | Max: 59m 50s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 57m 01s | Avg: 57m 01s | Max: 57m 01s | Hits:  56%/727   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 14m | Avg:  1h 07m | Max:  1h 08m | Hits:  63%/1454  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  3h 08m | Avg:  1h 02m | Max:  1h 06m | Hits:  63%/2181  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/59  | Total:  1d 17h | Avg: 42m 05s | Max: 58m 56s
      🟩 GCC                Pass: 100%/64  | Total:  1d 22h | Avg: 43m 17s | Max:  2h 03m
      🟩 Intel              Pass: 100%/3   | Total:  2h 57m | Avg: 59m 02s | Max: 59m 50s
      🟩 MSVC               Pass: 100%/6   | Total:  6h 19m | Avg:  1h 03m | Max:  1h 08m | Hits:  62%/4362  
    🟩 gpu
      🟩 v100               Pass: 100%/132 | Total:  4d 00h | Avg: 44m 01s | Max:  2h 03m | Hits:  62%/4362  
    🟩 jobs
      🟩 Build              Pass: 100%/99  | Total:  3d 12h | Avg: 51m 13s | Max:  1h 08m | Hits:  62%/4362  
      🟩 DeviceLaunch       Pass: 100%/8   | Total:  2h 41m | Avg: 20m 07s | Max: 28m 59s
      🟩 GraphCapture       Pass: 100%/8   | Total:  2h 01m | Avg: 15m 10s | Max: 16m 50s
      🟩 HostLaunch         Pass: 100%/8   | Total:  2h 18m | Avg: 17m 19s | Max: 20m 19s
      🟩 SmallGMem          Pass: 100%/1   | Total:  2h 03m | Avg:  2h 03m | Max:  2h 03m
      🟩 TestGPU            Pass: 100%/8   | Total:  3h 15m | Avg: 24m 25s | Max: 27m 32s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 19m | Avg:  1h 06m | Max:  1h 08m
      🟩 90a                Pass: 100%/4   | Total:  1h 35m | Avg: 23m 58s | Max: 27m 37s
    🟩 std
      🟩 11                 Pass: 100%/34  | Total: 23h 54m | Avg: 42m 11s | Max:  1h 04m
      🟩 14                 Pass: 100%/37  | Total:  1d 04h | Avg: 45m 24s | Max:  1h 08m | Hits:  60%/2181  
      🟩 17                 Pass: 100%/37  | Total:  1d 04h | Avg: 46m 54s | Max:  2h 03m | Hits:  63%/1454  
      🟩 20                 Pass: 100%/24  | Total: 16h 00m | Avg: 40m 01s | Max:  1h 01m | Hits:  63%/727   
    
  • 🟩 libcudacxx: Pass: 100%/112 | Total: 15h 32m | Avg: 8m 19s | Max: 34m 07s | Hits: 91%/16953

    🟩 cpu
      🟩 amd64              Pass: 100%/104 | Total: 14h 51m | Avg:  8m 34s | Max: 34m 07s | Hits:  91%/16953 
      🟩 arm64              Pass: 100%/8   | Total: 40m 25s | Avg:  5m 03s | Max: 15m 28s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 27m | Avg:  5m 49s | Max: 18m 19s | Hits:  99%/2633  
      🟩 11.8               Pass: 100%/3   | Total: 59m 35s | Avg: 19m 51s | Max: 20m 52s
      🟩 12.5               Pass: 100%/94  | Total: 13h 05m | Avg:  8m 21s | Max: 34m 07s | Hits:  89%/14320 
    🟩 cudacxx
      🟩 ClangCUDA17        Pass: 100%/2   | Total: 39m 10s | Avg: 19m 35s | Max: 20m 06s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 27m | Avg:  5m 49s | Max: 18m 19s | Hits:  99%/2633  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 59m 35s | Avg: 19m 51s | Max: 20m 52s
      🟩 nvcc12.5           Pass: 100%/92  | Total: 12h 26m | Avg:  8m 06s | Max: 34m 07s | Hits:  89%/14320 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 39m 10s | Avg: 19m 35s | Max: 20m 06s
      🟩 nvcc               Pass: 100%/110 | Total: 14h 53m | Avg:  8m 07s | Max: 34m 07s | Hits:  91%/16953 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 24m 20s | Avg:  4m 03s | Max:  5m 33s
      🟩 Clang10            Pass: 100%/3   | Total: 14m 08s | Avg:  4m 42s | Max:  5m 08s
      🟩 Clang11            Pass: 100%/4   | Total: 15m 55s | Avg:  3m 58s | Max:  4m 26s
      🟩 Clang12            Pass: 100%/4   | Total: 32m 15s | Avg:  8m 03s | Max: 21m 10s
      🟩 Clang13            Pass: 100%/4   | Total: 14m 59s | Avg:  3m 44s | Max:  3m 59s
      🟩 Clang14            Pass: 100%/4   | Total: 15m 36s | Avg:  3m 54s | Max:  4m 19s
      🟩 Clang15            Pass: 100%/4   | Total: 31m 01s | Avg:  7m 45s | Max: 18m 59s
      🟩 Clang16            Pass: 100%/4   | Total: 15m 59s | Avg:  3m 59s | Max:  4m 34s
      🟩 Clang17            Pass: 100%/14  | Total:  2h 24m | Avg: 10m 21s | Max: 25m 43s
      🟩 GCC6               Pass: 100%/2   | Total:  5m 29s | Avg:  2m 44s | Max:  3m 05s
      🟩 GCC7               Pass: 100%/6   | Total: 49m 59s | Avg:  8m 19s | Max: 19m 55s
      🟩 GCC8               Pass: 100%/6   | Total: 33m 44s | Avg:  5m 37s | Max: 19m 14s
      🟩 GCC9               Pass: 100%/6   | Total: 33m 44s | Avg:  5m 37s | Max: 17m 12s
      🟩 GCC10              Pass: 100%/4   | Total: 14m 07s | Avg:  3m 31s | Max:  3m 52s
      🟩 GCC11              Pass: 100%/7   | Total:  1h 13m | Avg: 10m 30s | Max: 20m 52s
      🟩 GCC12              Pass: 100%/4   | Total: 13m 52s | Avg:  3m 28s | Max:  3m 49s
      🟩 GCC13              Pass: 100%/21  | Total:  4h 16m | Avg: 12m 12s | Max: 34m 07s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 42m 34s | Avg: 14m 11s | Max: 19m 33s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 17m 35s | Avg: 17m 35s | Max: 17m 35s | Hits:  99%/2633  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 38m 12s | Avg: 19m 06s | Max: 24m 47s | Hits:  74%/5628  
      🟩 MSVC14.39          Pass: 100%/3   | Total: 43m 51s | Avg: 14m 37s | Max: 15m 30s | Hits:  99%/8692  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/47  | Total:  5h 09m | Avg:  6m 34s | Max: 25m 43s
      🟩 GCC                Pass: 100%/56  | Total:  8h 00m | Avg:  8m 35s | Max: 34m 07s
      🟩 Intel              Pass: 100%/3   | Total: 42m 34s | Avg: 14m 11s | Max: 19m 33s
      🟩 MSVC               Pass: 100%/6   | Total:  1h 39m | Avg: 16m 36s | Max: 24m 47s | Hits:  91%/16953 
    🟩 gpu
      🟩 v100               Pass: 100%/112 | Total: 15h 32m | Avg:  8m 19s | Max: 34m 07s | Hits:  91%/16953 
    🟩 jobs
      🟩 Build              Pass: 100%/99  | Total: 11h 13m | Avg:  6m 47s | Max: 24m 47s | Hits:  91%/16953 
      🟩 NVRTC              Pass: 100%/4   | Total:  1h 43m | Avg: 25m 59s | Max: 34m 07s
      🟩 Test               Pass: 100%/8   | Total:  2h 33m | Avg: 19m 09s | Max: 25m 43s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  2m 01s | Avg:  2m 01s | Max:  2m 01s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 59m 35s | Avg: 19m 51s | Max: 20m 52s
      🟩 90a                Pass: 100%/4   | Total: 16m 33s | Avg:  4m 08s | Max:  5m 10s
    🟩 std
      🟩 11                 Pass: 100%/29  | Total:  3h 45m | Avg:  7m 45s | Max: 20m 52s
      🟩 14                 Pass: 100%/32  | Total:  3h 47m | Avg:  7m 06s | Max: 24m 47s | Hits:  82%/8101  
      🟩 17                 Pass: 100%/31  | Total:  4h 53m | Avg:  9m 27s | Max: 34m 07s | Hits:  99%/5788  
      🟩 20                 Pass: 100%/19  | Total:  3h 04m | Avg:  9m 41s | Max: 34m 05s | Hits:  99%/3064  
    
  • 🟩 cudax: Pass: 100%/54 | Total: 2h 42m | Avg: 3m 00s | Max: 9m 24s | Hits: 90%/194

    🟩 cpu
      🟩 amd64              Pass: 100%/50  | Total:  2h 32m | Avg:  3m 02s | Max:  9m 24s | Hits:  90%/194   
      🟩 arm64              Pass: 100%/4   | Total: 10m 13s | Avg:  2m 33s | Max:  3m 00s
    🟩 ctk
      🟩 12.0               Pass: 100%/23  | Total:  1h 07m | Avg:  2m 55s | Max:  7m 37s | Hits:  89%/97    
      🟩 12.5               Pass: 100%/31  | Total:  1h 35m | Avg:  3m 04s | Max:  9m 24s | Hits:  90%/97    
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/23  | Total:  1h 07m | Avg:  2m 55s | Max:  7m 37s | Hits:  89%/97    
      🟩 nvcc12.5           Pass: 100%/31  | Total:  1h 35m | Avg:  3m 04s | Max:  9m 24s | Hits:  90%/97    
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/54  | Total:  2h 42m | Avg:  3m 00s | Max:  9m 24s | Hits:  90%/194   
    🟩 cxx
      🟩 Clang9             Pass: 100%/2   | Total:  5m 13s | Avg:  2m 36s | Max:  2m 42s
      🟩 Clang10            Pass: 100%/2   | Total:  4m 58s | Avg:  2m 29s | Max:  2m 33s
      🟩 Clang11            Pass: 100%/4   | Total: 10m 03s | Avg:  2m 30s | Max:  2m 38s
      🟩 Clang12            Pass: 100%/4   | Total: 10m 37s | Avg:  2m 39s | Max:  2m 58s
      🟩 Clang13            Pass: 100%/4   | Total: 10m 28s | Avg:  2m 37s | Max:  2m 57s
      🟩 Clang14            Pass: 100%/6   | Total: 18m 32s | Avg:  3m 05s | Max:  4m 18s
      🟩 Clang15            Pass: 100%/2   | Total:  6m 04s | Avg:  3m 02s | Max:  3m 26s
      🟩 Clang16            Pass: 100%/6   | Total: 19m 50s | Avg:  3m 18s | Max:  5m 09s
      🟩 GCC9               Pass: 100%/2   | Total:  4m 31s | Avg:  2m 15s | Max:  2m 21s
      🟩 GCC10              Pass: 100%/4   | Total:  9m 22s | Avg:  2m 20s | Max:  2m 24s
      🟩 GCC11              Pass: 100%/4   | Total:  9m 36s | Avg:  2m 24s | Max:  2m 27s
      🟩 GCC12              Pass: 100%/12  | Total: 36m 10s | Avg:  3m 00s | Max:  3m 53s
      🟩 MSVC14.36          Pass: 100%/1   | Total:  7m 37s | Avg:  7m 37s | Max:  7m 37s | Hits:  89%/97    
      🟩 MSVC14.39          Pass: 100%/1   | Total:  9m 24s | Avg:  9m 24s | Max:  9m 24s | Hits:  90%/97    
    🟩 cxx_family
      🟩 Clang              Pass: 100%/30  | Total:  1h 25m | Avg:  2m 51s | Max:  5m 09s
      🟩 GCC                Pass: 100%/22  | Total: 59m 39s | Avg:  2m 42s | Max:  3m 53s
      🟩 MSVC               Pass: 100%/2   | Total: 17m 01s | Avg:  8m 30s | Max:  9m 24s | Hits:  90%/194   
    🟩 gpu
      🟩 v100               Pass: 100%/54  | Total:  2h 42m | Avg:  3m 00s | Max:  9m 24s | Hits:  90%/194   
    🟩 jobs
      🟩 Build              Pass: 100%/46  | Total:  2h 09m | Avg:  2m 49s | Max:  9m 24s | Hits:  90%/194   
      🟩 Test               Pass: 100%/8   | Total: 32m 28s | Avg:  4m 03s | Max:  5m 09s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  2m 06s | Avg:  2m 06s | Max:  2m 06s
      🟩 90a                Pass: 100%/1   | Total:  2m 43s | Avg:  2m 43s | Max:  2m 43s
    🟩 std
      🟩 17                 Pass: 100%/30  | Total:  1h 22m | Avg:  2m 44s | Max:  5m 09s
      🟩 20                 Pass: 100%/24  | Total:  1h 20m | Avg:  3m 20s | Max:  9m 24s | Hits:  90%/194   
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 14m 17s | Avg: 14m 17s | Max: 14m 17s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 14m 17s | Avg: 14m 17s | Max: 14m 17s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 14m 17s | Avg: 14m 17s | Max: 14m 17s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 14m 17s | Avg: 14m 17s | Max: 14m 17s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 14m 17s | Avg: 14m 17s | Max: 14m 17s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 14m 17s | Avg: 14m 17s | Max: 14m 17s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 14m 17s | Avg: 14m 17s | Max: 14m 17s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 14m 17s | Avg: 14m 17s | Max: 14m 17s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 14m 17s | Avg: 14m 17s | Max: 14m 17s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
CUDA Experimental
pycuda
CUDA C Core Library

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- pycuda
+/- CUDA C Core Library

🏃‍ Runner counts (total jobs: 417)

# Runner
304 linux-amd64-cpu16
62 linux-amd64-gpu-v100-latest-1
28 linux-arm64-cpu16
23 windows-amd64-cpu16

@bernhardmgruber bernhardmgruber force-pushed the transform_thrust branch 2 times, most recently from 18cd7ea to 7839465 Compare September 10, 2024 10:07
@bernhardmgruber
Contributor Author

The thrust::transform benchmark looks good (see PR description), except for this regression:

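|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
|   F64   |    2^25    | 580.479 us |       0.11% | 587.621 us |       0.14% |   7.142 us |   1.23% |   FAIL   |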

However, a 1.23% slowdown seems tolerable given the other improvements.

Contributor

🟩 CI finished in 11h 59m: Pass: 100%/433 | Total: 8d 02h | Avg: 26m 54s | Max: 1h 12m | Hits: 76%/41615
  • 🟩 cub: Pass: 100%/136 | Total: 4d 03h | Avg: 43m 46s | Max: 1h 11m | Hits: 65%/4362

    🟩 cpu
      🟩 amd64              Pass: 100%/128 | Total:  3d 19h | Avg: 43m 06s | Max:  1h 11m | Hits:  65%/4362  
      🟩 arm64              Pass: 100%/8   | Total:  7h 15m | Avg: 54m 24s | Max: 57m 31s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total: 11h 10m | Avg: 44m 40s | Max: 54m 18s | Hits:  65%/727   
      🟩 11.8               Pass: 100%/3   | Total:  3h 28m | Avg:  1h 09m | Max:  1h 11m
      🟩 12.6               Pass: 100%/118 | Total:  3d 12h | Avg: 43m 00s | Max:  1h 07m | Hits:  65%/3635  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total:  1h 46m | Avg: 53m 15s | Max: 53m 16s
      🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 10m | Avg: 44m 40s | Max: 54m 18s | Hits:  65%/727   
      🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 28m | Avg:  1h 09m | Max:  1h 11m
      🟩 nvcc12.6           Pass: 100%/116 | Total:  3d 10h | Avg: 42m 49s | Max:  1h 07m | Hits:  65%/3635  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total:  1h 46m | Avg: 53m 15s | Max: 53m 16s
      🟩 nvcc               Pass: 100%/134 | Total:  4d 01h | Avg: 43m 37s | Max:  1h 11m | Hits:  65%/4362  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  4h 43m | Avg: 47m 16s | Max: 52m 46s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 34m | Avg: 51m 29s | Max: 52m 43s
      🟩 Clang11            Pass: 100%/4   | Total:  3h 28m | Avg: 52m 09s | Max: 54m 17s
      🟩 Clang12            Pass: 100%/4   | Total:  3h 35m | Avg: 53m 53s | Max: 56m 38s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 24m | Avg: 51m 00s | Max: 54m 09s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 27m | Avg: 51m 56s | Max: 55m 49s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 20m | Avg: 50m 08s | Max: 53m 11s
      🟩 Clang16            Pass: 100%/4   | Total:  3h 29m | Avg: 52m 17s | Max: 54m 09s
      🟩 Clang17            Pass: 100%/4   | Total:  3h 29m | Avg: 52m 19s | Max: 55m 55s
      🟩 Clang18            Pass: 100%/26  | Total: 14h 11m | Avg: 32m 45s | Max: 56m 48s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 25m | Avg: 42m 32s | Max: 43m 21s
      🟩 GCC7               Pass: 100%/6   | Total:  4h 45m | Avg: 47m 39s | Max: 55m 15s
      🟩 GCC8               Pass: 100%/6   | Total:  4h 51m | Avg: 48m 37s | Max: 54m 41s
      🟩 GCC9               Pass: 100%/6   | Total:  4h 49m | Avg: 48m 11s | Max: 53m 01s
      🟩 GCC10              Pass: 100%/4   | Total:  3h 35m | Avg: 53m 52s | Max: 58m 26s
      🟩 GCC11              Pass: 100%/7   | Total:  6h 59m | Avg: 59m 55s | Max:  1h 11m
      🟩 GCC12              Pass: 100%/4   | Total:  3h 42m | Avg: 55m 35s | Max: 57m 56s
      🟩 GCC13              Pass: 100%/29  | Total: 14h 27m | Avg: 29m 54s | Max: 58m 59s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 57m | Avg: 59m 11s | Max:  1h 01m
      🟩 MSVC14.16          Pass: 100%/1   | Total: 54m 18s | Avg: 54m 18s | Max: 54m 18s | Hits:  65%/727   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 57m | Avg: 58m 37s | Max:  1h 00m | Hits:  65%/1454  
      🟩 MSVC14.39          Pass: 100%/3   | Total:  3h 02m | Avg:  1h 00m | Max:  1h 07m | Hits:  65%/2181  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/63  | Total:  1d 21h | Avg: 43m 34s | Max: 56m 48s
      🟩 GCC                Pass: 100%/64  | Total:  1d 20h | Avg: 41m 49s | Max:  1h 11m
      🟩 Intel              Pass: 100%/3   | Total:  2h 57m | Avg: 59m 11s | Max:  1h 01m
      🟩 MSVC               Pass: 100%/6   | Total:  5h 53m | Avg: 58m 57s | Max:  1h 07m | Hits:  65%/4362  
    🟩 gpu
      🟩 v100               Pass: 100%/136 | Total:  4d 03h | Avg: 43m 46s | Max:  1h 11m | Hits:  65%/4362  
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  3d 16h | Avg: 51m 30s | Max:  1h 11m | Hits:  65%/4362  
      🟩 DeviceLaunch       Pass: 100%/8   | Total:  2h 33m | Avg: 19m 09s | Max: 20m 39s
      🟩 GraphCapture       Pass: 100%/8   | Total:  1h 59m | Avg: 14m 58s | Max: 16m 39s
      🟩 HostLaunch         Pass: 100%/8   | Total:  2h 25m | Avg: 18m 11s | Max: 21m 39s
      🟩 SmallGMem          Pass: 100%/1   | Total: 34m 16s | Avg: 34m 16s | Max: 34m 16s
      🟩 TestGPU            Pass: 100%/8   | Total:  3h 15m | Avg: 24m 22s | Max: 30m 41s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 28m | Avg:  1h 09m | Max:  1h 11m
      🟩 90a                Pass: 100%/4   | Total:  1h 33m | Avg: 23m 24s | Max: 24m 40s
    🟩 std
      🟩 11                 Pass: 100%/35  | Total:  1d 01h | Avg: 43m 02s | Max:  1h 07m
      🟩 14                 Pass: 100%/38  | Total:  1d 04h | Avg: 44m 59s | Max:  1h 08m | Hits:  65%/2181  
      🟩 17                 Pass: 100%/38  | Total:  1d 03h | Avg: 44m 09s | Max:  1h 11m | Hits:  65%/1454  
      🟩 20                 Pass: 100%/25  | Total: 17h 38m | Avg: 42m 21s | Max:  1h 07m | Hits:  65%/727   
    
  • 🟩 thrust: Pass: 100%/122 | Total: 2d 13h | Avg: 30m 16s | Max: 1h 12m | Hits: 77%/20106

    🟩 cpu
      🟩 amd64              Pass: 100%/114 | Total:  2d 09h | Avg: 30m 14s | Max:  1h 12m | Hits:  77%/20106 
      🟩 arm64              Pass: 100%/8   | Total:  4h 05m | Avg: 30m 44s | Max: 36m 08s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  7h 11m | Avg: 28m 47s | Max: 51m 34s | Hits:  66%/2234  
      🟩 11.8               Pass: 100%/3   | Total:  2h 02m | Avg: 40m 58s | Max: 44m 36s
      🟩 12.6               Pass: 100%/104 | Total:  2d 04h | Avg: 30m 11s | Max:  1h 12m | Hits:  78%/17872 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 58m 02s | Avg: 29m 01s | Max: 30m 15s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  7h 11m | Avg: 28m 47s | Max: 51m 34s | Hits:  66%/2234  
      🟩 nvcc11.8           Pass: 100%/3   | Total:  2h 02m | Avg: 40m 58s | Max: 44m 36s
      🟩 nvcc12.6           Pass: 100%/102 | Total:  2d 03h | Avg: 30m 12s | Max:  1h 12m | Hits:  78%/17872 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 58m 02s | Avg: 29m 01s | Max: 30m 15s
      🟩 nvcc               Pass: 100%/120 | Total:  2d 12h | Avg: 30m 18s | Max:  1h 12m | Hits:  77%/20106 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  2h 59m | Avg: 29m 54s | Max: 38m 02s
      🟩 Clang10            Pass: 100%/3   | Total:  1h 43m | Avg: 34m 23s | Max: 36m 53s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 04m | Avg: 31m 01s | Max: 33m 21s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 12m | Avg: 33m 06s | Max: 36m 07s
      🟩 Clang13            Pass: 100%/4   | Total:  2h 01m | Avg: 30m 24s | Max: 33m 21s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 04m | Avg: 31m 01s | Max: 36m 25s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 08m | Avg: 32m 10s | Max: 34m 37s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 06m | Avg: 31m 33s | Max: 36m 30s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 16m | Avg: 34m 05s | Max: 37m 34s
      🟩 Clang18            Pass: 100%/18  | Total:  6h 40m | Avg: 22m 13s | Max: 34m 09s
      🟩 GCC6               Pass: 100%/2   | Total: 52m 11s | Avg: 26m 05s | Max: 28m 08s
      🟩 GCC7               Pass: 100%/6   | Total:  2h 58m | Avg: 29m 44s | Max: 34m 56s
      🟩 GCC8               Pass: 100%/6   | Total:  2h 54m | Avg: 29m 02s | Max: 33m 15s
      🟩 GCC9               Pass: 100%/6   | Total:  3h 03m | Avg: 30m 37s | Max: 37m 37s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 11m | Avg: 32m 52s | Max: 35m 44s
      🟩 GCC11              Pass: 100%/7   | Total:  4h 13m | Avg: 36m 16s | Max: 44m 36s
      🟩 GCC12              Pass: 100%/4   | Total:  2h 21m | Avg: 35m 15s | Max: 41m 30s
      🟩 GCC13              Pass: 100%/20  | Total:  7h 34m | Avg: 22m 44s | Max: 44m 27s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 04m | Avg: 41m 34s | Max: 47m 53s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 51m 34s | Avg: 51m 34s | Max: 51m 34s | Hits:  66%/2234  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 58m | Avg: 59m 03s | Max:  1h 00m | Hits:  66%/4468  
      🟩 MSVC14.39          Pass: 100%/6   | Total:  4h 13m | Avg: 42m 16s | Max:  1h 12m | Hits:  82%/13404 
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total:  1d 02h | Avg: 28m 39s | Max: 38m 02s
      🟩 GCC                Pass: 100%/55  | Total:  1d 02h | Avg: 28m 32s | Max: 44m 36s
      🟩 Intel              Pass: 100%/3   | Total:  2h 04m | Avg: 41m 34s | Max: 47m 53s
      🟩 MSVC               Pass: 100%/9   | Total:  7h 03m | Avg: 47m 01s | Max:  1h 12m | Hits:  77%/20106 
    🟩 gpu
      🟩 v100               Pass: 100%/122 | Total:  2d 13h | Avg: 30m 16s | Max:  1h 12m | Hits:  77%/20106 
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  2d 09h | Avg: 33m 21s | Max:  1h 12m | Hits:  66%/13404 
      🟩 TestCPU            Pass: 100%/11  | Total:  2h 04m | Avg: 11m 20s | Max: 21m 38s | Hits:  99%/6702  
      🟩 TestGPU            Pass: 100%/8   | Total:  2h 12m | Avg: 16m 35s | Max: 20m 58s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  2h 02m | Avg: 40m 58s | Max: 44m 36s
      🟩 90a                Pass: 100%/4   | Total:  1h 21m | Avg: 20m 24s | Max: 23m 24s
    🟩 std
      🟩 11                 Pass: 100%/31  | Total: 12h 41m | Avg: 24m 33s | Max: 36m 27s
      🟩 14                 Pass: 100%/35  | Total: 18h 43m | Avg: 32m 06s | Max:  1h 00m | Hits:  74%/8936  
      🟩 17                 Pass: 100%/34  | Total: 18h 45m | Avg: 33m 05s | Max: 58m 34s | Hits:  77%/6702  
      🟩 20                 Pass: 100%/22  | Total: 11h 23m | Avg: 31m 04s | Max:  1h 12m | Hits:  82%/4468  
    
  • 🟩 libcudacxx: Pass: 100%/116 | Total: 1d 06h | Avg: 15m 41s | Max: 1h 04m | Hits: 77%/16953

    🟩 cpu
      🟩 amd64              Pass: 100%/108 | Total:  1d 04h | Avg: 15m 57s | Max:  1h 04m | Hits:  77%/16953 
      🟩 arm64              Pass: 100%/8   | Total:  1h 37m | Avg: 12m 07s | Max: 18m 32s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  3h 28m | Avg: 13m 55s | Max: 27m 18s | Hits:  46%/2633  
      🟩 11.8               Pass: 100%/3   | Total: 56m 09s | Avg: 18m 43s | Max: 20m 04s
      🟩 12.6               Pass: 100%/98  | Total:  1d 01h | Avg: 15m 51s | Max:  1h 04m | Hits:  83%/14320 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/2   | Total: 37m 25s | Avg: 18m 42s | Max: 18m 53s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  3h 28m | Avg: 13m 55s | Max: 27m 18s | Hits:  46%/2633  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 56m 09s | Avg: 18m 43s | Max: 20m 04s
      🟩 nvcc12.6           Pass: 100%/96  | Total:  1d 01h | Avg: 15m 48s | Max:  1h 04m | Hits:  83%/14320 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/2   | Total: 37m 25s | Avg: 18m 42s | Max: 18m 53s
      🟩 nvcc               Pass: 100%/114 | Total:  1d 05h | Avg: 15m 38s | Max:  1h 04m | Hits:  77%/16953 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  1h 38m | Avg: 16m 21s | Max: 20m 50s
      🟩 Clang10            Pass: 100%/3   | Total: 57m 18s | Avg: 19m 06s | Max: 19m 45s
      🟩 Clang11            Pass: 100%/4   | Total:  1h 16m | Avg: 19m 06s | Max: 21m 08s
      🟩 Clang12            Pass: 100%/4   | Total: 46m 29s | Avg: 11m 37s | Max: 21m 35s
      🟩 Clang13            Pass: 100%/4   | Total:  1h 17m | Avg: 19m 15s | Max: 21m 14s
      🟩 Clang14            Pass: 100%/4   | Total: 55m 57s | Avg: 13m 59s | Max: 20m 39s
      🟩 Clang15            Pass: 100%/4   | Total: 47m 56s | Avg: 11m 59s | Max: 20m 45s
      🟩 Clang16            Pass: 100%/4   | Total: 39m 30s | Avg:  9m 52s | Max: 22m 05s
      🟩 Clang17            Pass: 100%/4   | Total: 49m 08s | Avg: 12m 17s | Max: 22m 41s
      🟩 Clang18            Pass: 100%/14  | Total:  2h 53m | Avg: 12m 25s | Max: 22m 03s
      🟩 GCC6               Pass: 100%/2   | Total: 31m 43s | Avg: 15m 51s | Max: 19m 44s
      🟩 GCC7               Pass: 100%/6   | Total:  1h 26m | Avg: 14m 21s | Max: 21m 50s
      🟩 GCC8               Pass: 100%/6   | Total:  1h 03m | Avg: 10m 32s | Max: 18m 42s
      🟩 GCC9               Pass: 100%/6   | Total:  1h 44m | Avg: 17m 22s | Max: 21m 19s
      🟩 GCC10              Pass: 100%/4   | Total:  1h 04m | Avg: 16m 13s | Max: 20m 07s
      🟩 GCC11              Pass: 100%/7   | Total:  2h 14m | Avg: 19m 10s | Max: 20m 37s
      🟩 GCC12              Pass: 100%/4   | Total:  1h 11m | Avg: 17m 45s | Max: 23m 20s
      🟩 GCC13              Pass: 100%/21  | Total:  6h 33m | Avg: 18m 45s | Max:  1h 04m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 35m 06s | Avg: 11m 42s | Max: 18m 14s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 27m 18s | Avg: 27m 18s | Max: 27m 18s | Hits:  46%/2633  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 35m 32s | Avg: 17m 46s | Max: 24m 27s | Hits:  73%/5628  
      🟩 MSVC14.39          Pass: 100%/3   | Total: 50m 46s | Avg: 16m 55s | Max: 21m 44s | Hits:  90%/8692  
    🟩 cxx_family
      🟩 Clang              Pass: 100%/51  | Total: 12h 01m | Avg: 14m 09s | Max: 22m 41s
      🟩 GCC                Pass: 100%/56  | Total: 15h 49m | Avg: 16m 57s | Max:  1h 04m
      🟩 Intel              Pass: 100%/3   | Total: 35m 06s | Avg: 11m 42s | Max: 18m 14s
      🟩 MSVC               Pass: 100%/6   | Total:  1h 53m | Avg: 18m 56s | Max: 27m 18s | Hits:  77%/16953 
    🟩 gpu
      🟩 v100               Pass: 100%/116 | Total:  1d 06h | Avg: 15m 41s | Max:  1h 04m | Hits:  77%/16953 
    🟩 jobs
      🟩 Build              Pass: 100%/103 | Total:  1d 00h | Avg: 14m 29s | Max: 27m 18s | Hits:  77%/16953 
      🟩 NVRTC              Pass: 100%/4   | Total:  1h 08m | Avg: 17m 01s | Max: 18m 12s
      🟩 Test               Pass: 100%/8   | Total:  4h 17m | Avg: 32m 10s | Max:  1h 04m
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  2m 01s | Avg:  2m 01s | Max:  2m 01s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 56m 09s | Avg: 18m 43s | Max: 20m 04s
      🟩 90a                Pass: 100%/4   | Total: 26m 36s | Avg:  6m 39s | Max:  7m 32s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  7h 02m | Avg: 14m 04s | Max: 24m 37s
      🟩 14                 Pass: 100%/33  | Total:  8h 03m | Avg: 14m 38s | Max: 44m 23s | Hits:  59%/8101  
      🟩 17                 Pass: 100%/32  | Total:  8h 37m | Avg: 16m 09s | Max: 53m 16s | Hits:  99%/5788  
      🟩 20                 Pass: 100%/20  | Total:  6h 34m | Avg: 19m 44s | Max:  1h 04m | Hits:  84%/3064  
    
  • 🟩 cudax: Pass: 100%/58 | Total: 2h 53m | Avg: 2m 59s | Max: 7m 08s | Hits: 89%/194

    🟩 cpu
      🟩 amd64              Pass: 100%/54  | Total:  2h 44m | Avg:  3m 02s | Max:  7m 08s | Hits:  89%/194   
      🟩 arm64              Pass: 100%/4   | Total:  9m 16s | Avg:  2m 19s | Max:  2m 23s
    🟩 ctk
      🟩 12.0               Pass: 100%/23  | Total:  1h 09m | Avg:  3m 01s | Max:  7m 08s | Hits:  89%/97    
      🟩 12.6               Pass: 100%/35  | Total:  1h 44m | Avg:  2m 58s | Max:  6m 57s | Hits:  89%/97    
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/23  | Total:  1h 09m | Avg:  3m 01s | Max:  7m 08s | Hits:  89%/97    
      🟩 nvcc12.6           Pass: 100%/35  | Total:  1h 44m | Avg:  2m 58s | Max:  6m 57s | Hits:  89%/97    
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/58  | Total:  2h 53m | Avg:  2m 59s | Max:  7m 08s | Hits:  89%/194   
    🟩 cxx
      🟩 Clang9             Pass: 100%/2   | Total:  5m 10s | Avg:  2m 35s | Max:  2m 37s
      🟩 Clang10            Pass: 100%/2   | Total:  5m 16s | Avg:  2m 38s | Max:  2m 46s
      🟩 Clang11            Pass: 100%/4   | Total: 10m 03s | Avg:  2m 30s | Max:  2m 40s
      🟩 Clang12            Pass: 100%/4   | Total: 10m 39s | Avg:  2m 39s | Max:  3m 15s
      🟩 Clang13            Pass: 100%/4   | Total: 10m 55s | Avg:  2m 43s | Max:  2m 49s
      🟩 Clang14            Pass: 100%/6   | Total: 19m 01s | Avg:  3m 10s | Max:  5m 02s
      🟩 Clang15            Pass: 100%/2   | Total:  5m 38s | Avg:  2m 49s | Max:  3m 03s
      🟩 Clang16            Pass: 100%/4   | Total: 10m 04s | Avg:  2m 31s | Max:  2m 45s
      🟩 Clang17            Pass: 100%/2   | Total:  5m 31s | Avg:  2m 45s | Max:  2m 46s
      🟩 Clang18            Pass: 100%/4   | Total: 14m 39s | Avg:  3m 39s | Max:  4m 59s
      🟩 GCC9               Pass: 100%/2   | Total:  5m 29s | Avg:  2m 44s | Max:  3m 09s
      🟩 GCC10              Pass: 100%/4   | Total: 10m 02s | Avg:  2m 30s | Max:  2m 42s
      🟩 GCC11              Pass: 100%/4   | Total: 10m 45s | Avg:  2m 41s | Max:  3m 02s
      🟩 GCC12              Pass: 100%/9   | Total: 29m 32s | Avg:  3m 16s | Max:  4m 49s
      🟩 GCC13              Pass: 100%/3   | Total:  6m 52s | Avg:  2m 17s | Max:  2m 19s
      🟩 MSVC14.36          Pass: 100%/1   | Total:  7m 08s | Avg:  7m 08s | Max:  7m 08s | Hits:  89%/97    
      🟩 MSVC14.39          Pass: 100%/1   | Total:  6m 57s | Avg:  6m 57s | Max:  6m 57s | Hits:  89%/97    
    🟩 cxx_family
      🟩 Clang              Pass: 100%/34  | Total:  1h 36m | Avg:  2m 51s | Max:  5m 02s
      🟩 GCC                Pass: 100%/22  | Total:  1h 02m | Avg:  2m 50s | Max:  4m 49s
      🟩 MSVC               Pass: 100%/2   | Total: 14m 05s | Avg:  7m 02s | Max:  7m 08s | Hits:  89%/194   
    🟩 gpu
      🟩 v100               Pass: 100%/58  | Total:  2h 53m | Avg:  2m 59s | Max:  7m 08s | Hits:  89%/194   
    🟩 jobs
      🟩 Build              Pass: 100%/50  | Total:  2h 19m | Avg:  2m 47s | Max:  7m 08s | Hits:  89%/194   
      🟩 Test               Pass: 100%/8   | Total: 34m 21s | Avg:  4m 17s | Max:  5m 02s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  2m 42s | Avg:  2m 42s | Max:  2m 42s
      🟩 90a                Pass: 100%/1   | Total:  2m 16s | Avg:  2m 16s | Max:  2m 16s
    🟩 std
      🟩 17                 Pass: 100%/32  | Total:  1h 28m | Avg:  2m 46s | Max:  4m 27s
      🟩 20                 Pass: 100%/26  | Total:  1h 24m | Avg:  3m 15s | Max:  7m 08s | Hits:  89%/194   
    
  • 🟩 pycuda: Pass: 100%/1 | Total: 13m 17s | Avg: 13m 17s | Max: 13m 17s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 13m 17s | Avg: 13m 17s | Max: 13m 17s
    🟩 ctk
      🟩 12.5               Pass: 100%/1   | Total: 13m 17s | Avg: 13m 17s | Max: 13m 17s
    🟩 cudacxx
      🟩 nvcc12.5           Pass: 100%/1   | Total: 13m 17s | Avg: 13m 17s | Max: 13m 17s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 13m 17s | Avg: 13m 17s | Max: 13m 17s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 13m 17s | Avg: 13m 17s | Max: 13m 17s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 13m 17s | Avg: 13m 17s | Max: 13m 17s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 13m 17s | Avg: 13m 17s | Max: 13m 17s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 13m 17s | Avg: 13m 17s | Max: 13m 17s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
CUDA Experimental
pycuda
CUDA C Core Library

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- pycuda
+/- CUDA C Core Library

🏃‍ Runner counts (total jobs: 433)

# Runner
320 linux-amd64-cpu16
62 linux-amd64-gpu-v100-latest-1
28 linux-arm64-cpu16
23 windows-amd64-cpu16

@bernhardmgruber
Contributor Author

It seems we are blocked on: #2402

@bernhardmgruber bernhardmgruber changed the title Make thrust::transform use cub::DeviceTransform Make thrust::transform use cub::DeviceTransform Sep 11, 2024
Contributor

🟩 CI finished in 4h 14m: Pass: 100%/394 | Total: 8d 19h | Avg: 32m 10s | Max: 1h 37m | Hits: 55%/25800
  • 🟩 libcudacxx: Pass: 100%/118 | Total: 1d 12h | Avg: 18m 43s | Max: 41m 11s | Hits: 30%/9472

    🟩 cpu
      🟩 amd64              Pass: 100%/110 | Total:  1d 10h | Avg: 18m 48s | Max: 41m 11s | Hits:  30%/9472  
      🟩 arm64              Pass: 100%/8   | Total:  2h 19m | Avg: 17m 28s | Max: 27m 16s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  4h 54m | Avg: 19m 36s | Max: 39m 07s | Hits:  30%/2175  
      🟩 11.8               Pass: 100%/3   | Total:  1h 17m | Avg: 25m 46s | Max: 29m 29s
      🟩 12.5               Pass: 100%/4   | Total:  2h 19m | Avg: 34m 59s | Max: 40m 09s
      🟩 12.6               Pass: 100%/96  | Total:  1d 04h | Avg: 17m 40s | Max: 41m 11s | Hits:  30%/7297  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/12  | Total:  2h 32m | Avg: 12m 40s | Max: 20m 36s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  4h 54m | Avg: 19m 36s | Max: 39m 07s | Hits:  30%/2175  
      🟩 nvcc11.8           Pass: 100%/3   | Total:  1h 17m | Avg: 25m 46s | Max: 29m 29s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  2h 19m | Avg: 34m 59s | Max: 40m 09s
      🟩 nvcc12.6           Pass: 100%/84  | Total:  1d 01h | Avg: 18m 23s | Max: 41m 11s | Hits:  30%/7297  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/12  | Total:  2h 32m | Avg: 12m 40s | Max: 20m 36s
      🟩 nvcc               Pass: 100%/106 | Total:  1d 10h | Avg: 19m 24s | Max: 41m 11s | Hits:  30%/9472  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  1h 56m | Avg: 19m 24s | Max: 26m 20s
      🟩 Clang10            Pass: 100%/3   | Total: 50m 43s | Avg: 16m 54s | Max: 24m 40s
      🟩 Clang11            Pass: 100%/4   | Total: 35m 51s | Avg:  8m 57s | Max: 21m 18s
      🟩 Clang12            Pass: 100%/4   | Total:  1h 01m | Avg: 15m 15s | Max: 30m 29s
      🟩 Clang13            Pass: 100%/4   | Total:  1h 14m | Avg: 18m 41s | Max: 29m 32s
      🟩 Clang14            Pass: 100%/4   | Total:  1h 38m | Avg: 24m 43s | Max: 30m 07s
      🟩 Clang15            Pass: 100%/4   | Total:  1h 21m | Avg: 20m 26s | Max: 31m 18s
      🟩 Clang16            Pass: 100%/4   | Total: 33m 50s | Avg:  8m 27s | Max: 19m 11s
      🟩 Clang17            Pass: 100%/4   | Total:  1h 18m | Avg: 19m 39s | Max: 27m 26s
      🟩 Clang18            Pass: 100%/18  | Total:  4h 09m | Avg: 13m 50s | Max: 24m 32s
      🟩 GCC6               Pass: 100%/2   | Total: 39m 26s | Avg: 19m 43s | Max: 22m 31s
      🟩 GCC7               Pass: 100%/6   | Total:  1h 52m | Avg: 18m 44s | Max: 27m 00s
      🟩 GCC8               Pass: 100%/6   | Total:  1h 34m | Avg: 15m 46s | Max: 28m 20s
      🟩 GCC9               Pass: 100%/6   | Total:  1h 33m | Avg: 15m 36s | Max: 24m 00s
      🟩 GCC10              Pass: 100%/4   | Total:  1h 21m | Avg: 20m 24s | Max: 27m 25s
      🟩 GCC11              Pass: 100%/7   | Total:  2h 52m | Avg: 24m 37s | Max: 30m 13s
      🟩 GCC12              Pass: 100%/4   | Total:  1h 41m | Avg: 25m 22s | Max: 29m 36s
      🟩 GCC13              Pass: 100%/17  | Total:  4h 15m | Avg: 15m 03s | Max: 32m 21s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  1h 24m | Avg: 28m 13s | Max: 35m 53s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 39m 07s | Avg: 39m 07s | Max: 39m 07s | Hits:  30%/2175  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 11m | Avg: 35m 38s | Max: 40m 15s | Hits:  31%/4711  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 41m 11s | Avg: 41m 11s | Max: 41m 11s | Hits:  29%/2586  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  2h 19m | Avg: 34m 59s | Max: 40m 09s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total: 14h 41m | Avg: 16m 01s | Max: 31m 18s
      🟩 GCC                Pass: 100%/52  | Total: 15h 51m | Avg: 18m 17s | Max: 32m 21s
      🟩 Intel              Pass: 100%/3   | Total:  1h 24m | Avg: 28m 13s | Max: 35m 53s
      🟩 MSVC               Pass: 100%/4   | Total:  2h 31m | Avg: 37m 53s | Max: 41m 11s | Hits:  30%/9472  
      🟩 NVHPC              Pass: 100%/4   | Total:  2h 19m | Avg: 34m 59s | Max: 40m 09s
    🟩 gpu
      🟩 v100               Pass: 100%/118 | Total:  1d 12h | Avg: 18m 43s | Max: 41m 11s | Hits:  30%/9472  
    🟩 jobs
      🟩 Build              Pass: 100%/110 | Total:  1d 10h | Avg: 18m 47s | Max: 41m 11s | Hits:  30%/9472  
      🟩 NVRTC              Pass: 100%/4   | Total:  1h 33m | Avg: 23m 28s | Max: 28m 10s
      🟩 Test               Pass: 100%/3   | Total: 46m 12s | Avg: 15m 24s | Max: 16m 47s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  2m 11s | Avg:  2m 11s | Max:  2m 11s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  1h 17m | Avg: 25m 46s | Max: 29m 29s
      🟩 90                 Pass: 100%/4   | Total: 42m 24s | Avg: 10m 36s | Max: 12m 17s
      🟩 90a                Pass: 100%/8   | Total: 59m 30s | Avg:  7m 26s | Max: 12m 20s
    🟩 std
      🟩 11                 Pass: 100%/32  | Total:  8h 15m | Avg: 15m 28s | Max: 26m 54s
      🟩 14                 Pass: 100%/32  | Total: 10h 21m | Avg: 19m 24s | Max: 39m 07s | Hits:  31%/4452  
      🟩 17                 Pass: 100%/30  | Total: 10h 00m | Avg: 20m 00s | Max: 40m 15s | Hits:  30%/2434  
      🟩 20                 Pass: 100%/23  | Total:  8h 09m | Avg: 21m 17s | Max: 41m 11s | Hits:  29%/2586  
    
  • 🟩 cub: Pass: 100%/110 | Total: 4d 05h | Avg: 55m 37s | Max: 1h 37m | Hits: 60%/2924

    🟩 cpu
      🟩 amd64              Pass: 100%/102 | Total:  3d 22h | Avg: 55m 45s | Max:  1h 37m | Hits:  60%/2924  
      🟩 arm64              Pass: 100%/8   | Total:  7h 12m | Avg: 54m 01s | Max: 54m 58s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total: 12h 02m | Avg: 48m 11s | Max: 51m 08s | Hits:  63%/731   
      🟩 11.8               Pass: 100%/3   | Total:  3h 49m | Avg:  1h 16m | Max:  1h 20m
      🟩 12.5               Pass: 100%/4   | Total:  4h 08m | Avg:  1h 02m | Max:  1h 07m
      🟩 12.6               Pass: 100%/88  | Total:  3d 09h | Avg: 55m 53s | Max:  1h 37m | Hits:  59%/2193  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  3h 56m | Avg: 59m 02s | Max:  1h 01m
      🟩 nvcc11.1           Pass: 100%/15  | Total: 12h 02m | Avg: 48m 11s | Max: 51m 08s | Hits:  63%/731   
      🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 49m | Avg:  1h 16m | Max:  1h 20m
      🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 08m | Avg:  1h 02m | Max:  1h 07m
      🟩 nvcc12.6           Pass: 100%/84  | Total:  3d 06h | Avg: 55m 44s | Max:  1h 37m | Hits:  59%/2193  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  3h 56m | Avg: 59m 02s | Max:  1h 01m
      🟩 nvcc               Pass: 100%/106 | Total:  4d 02h | Avg: 55m 29s | Max:  1h 37m | Hits:  60%/2924  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  5h 13m | Avg: 52m 16s | Max: 57m 48s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 47m | Avg: 55m 55s | Max: 59m 47s
      🟩 Clang11            Pass: 100%/4   | Total:  3h 45m | Avg: 56m 20s | Max: 57m 49s
      🟩 Clang12            Pass: 100%/4   | Total:  3h 46m | Avg: 56m 41s | Max: 59m 02s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 35m | Avg: 53m 49s | Max: 57m 12s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 38m | Avg: 54m 38s | Max: 56m 55s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 41m | Avg: 55m 16s | Max: 59m 31s
      🟩 Clang16            Pass: 100%/4   | Total:  3h 31m | Avg: 52m 53s | Max: 54m 43s
      🟩 Clang17            Pass: 100%/4   | Total:  3h 45m | Avg: 56m 22s | Max:  1h 02m
      🟩 Clang18            Pass: 100%/11  | Total:  9h 45m | Avg: 53m 13s | Max:  1h 01m
      🟩 GCC6               Pass: 100%/2   | Total:  1h 33m | Avg: 46m 49s | Max: 49m 54s
      🟩 GCC7               Pass: 100%/6   | Total:  5h 11m | Avg: 51m 59s | Max:  1h 02m
      🟩 GCC8               Pass: 100%/6   | Total:  5h 00m | Avg: 50m 01s | Max: 52m 53s
      🟩 GCC9               Pass: 100%/6   | Total:  5h 23m | Avg: 53m 57s | Max:  1h 00m
      🟩 GCC10              Pass: 100%/4   | Total:  3h 46m | Avg: 56m 44s | Max:  1h 00m
      🟩 GCC11              Pass: 100%/7   | Total:  7h 31m | Avg:  1h 04m | Max:  1h 20m
      🟩 GCC12              Pass: 100%/4   | Total:  3h 43m | Avg: 55m 52s | Max: 59m 02s
      🟩 GCC13              Pass: 100%/16  | Total: 15h 11m | Avg: 56m 57s | Max:  1h 37m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  3h 00m | Avg:  1h 00m | Max:  1h 04m
      🟩 MSVC14.16          Pass: 100%/1   | Total: 51m 08s | Avg: 51m 08s | Max: 51m 08s | Hits:  63%/731   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 59m | Avg: 59m 40s | Max:  1h 02m | Hits:  57%/1462  
      🟩 MSVC14.39          Pass: 100%/1   | Total:  1h 05m | Avg:  1h 05m | Max:  1h 05m | Hits:  63%/731   
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 08m | Avg:  1h 02m | Max:  1h 07m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 19h | Avg: 54m 23s | Max:  1h 02m
      🟩 GCC                Pass: 100%/51  | Total:  1d 23h | Avg: 55m 44s | Max:  1h 37m
      🟩 Intel              Pass: 100%/3   | Total:  3h 00m | Avg:  1h 00m | Max:  1h 04m
      🟩 MSVC               Pass: 100%/4   | Total:  3h 55m | Avg: 58m 57s | Max:  1h 05m | Hits:  60%/2924  
      🟩 NVHPC              Pass: 100%/4   | Total:  4h 08m | Avg:  1h 02m | Max:  1h 07m
    🟩 gpu
      🟩 v100               Pass: 100%/110 | Total:  4d 05h | Avg: 55m 37s | Max:  1h 37m | Hits:  60%/2924  
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  3d 20h | Avg: 54m 27s | Max:  1h 20m | Hits:  60%/2924  
      🟩 DeviceLaunch       Pass: 100%/1   | Total:  1h 37m | Avg:  1h 37m | Max:  1h 37m
      🟩 GraphCapture       Pass: 100%/1   | Total:  1h 37m | Avg:  1h 37m | Max:  1h 37m
      🟩 HostLaunch         Pass: 100%/3   | Total:  2h 44m | Avg: 54m 59s | Max:  1h 24m
      🟩 TestGPU            Pass: 100%/3   | Total:  3h 24m | Avg:  1h 08m | Max:  1h 30m
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 49m | Avg:  1h 16m | Max:  1h 20m
      🟩 90a                Pass: 100%/4   | Total:  1h 36m | Avg: 24m 10s | Max: 25m 53s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  1d 03h | Avg: 55m 00s | Max:  1h 17m
      🟩 14                 Pass: 100%/29  | Total:  1d 02h | Avg: 54m 32s | Max:  1h 20m | Hits:  58%/1462  
      🟩 17                 Pass: 100%/27  | Total:  1d 00h | Avg: 53m 50s | Max:  1h 12m | Hits:  60%/731   
      🟩 20                 Pass: 100%/24  | Total: 23h 53m | Avg: 59m 43s | Max:  1h 37m | Hits:  63%/731   
    
  • 🟩 thrust: Pass: 100%/109 | Total: 2d 19h | Avg: 36m 59s | Max: 1h 26m | Hits: 72%/13180

    🟩 cpu
      🟩 amd64              Pass: 100%/101 | Total:  2d 14h | Avg: 37m 21s | Max:  1h 26m | Hits:  72%/13180 
      🟩 arm64              Pass: 100%/8   | Total:  4h 18m | Avg: 32m 17s | Max: 39m 00s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  8h 51m | Avg: 35m 25s | Max:  1h 06m | Hits:  67%/2636  
      🟩 11.8               Pass: 100%/3   | Total:  2h 11m | Avg: 43m 42s | Max: 48m 05s
      🟩 12.5               Pass: 100%/4   | Total:  5h 17m | Avg:  1h 19m | Max:  1h 26m
      🟩 12.6               Pass: 100%/87  | Total:  2d 02h | Avg: 35m 04s | Max:  1h 18m | Hits:  73%/10544 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  1h 47m | Avg: 26m 58s | Max: 29m 40s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  8h 51m | Avg: 35m 25s | Max:  1h 06m | Hits:  67%/2636  
      🟩 nvcc11.8           Pass: 100%/3   | Total:  2h 11m | Avg: 43m 42s | Max: 48m 05s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  5h 17m | Avg:  1h 19m | Max:  1h 26m
      🟩 nvcc12.6           Pass: 100%/83  | Total:  2d 01h | Avg: 35m 27s | Max:  1h 18m | Hits:  73%/10544 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  1h 47m | Avg: 26m 58s | Max: 29m 40s
      🟩 nvcc               Pass: 100%/105 | Total:  2d 17h | Avg: 37m 22s | Max:  1h 26m | Hits:  72%/13180 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  3h 34m | Avg: 35m 45s | Max: 41m 50s
      🟩 Clang10            Pass: 100%/3   | Total:  1h 58m | Avg: 39m 23s | Max: 44m 31s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 24m | Avg: 36m 05s | Max: 40m 06s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 32m | Avg: 38m 09s | Max: 46m 15s
      🟩 Clang13            Pass: 100%/4   | Total:  2h 15m | Avg: 33m 59s | Max: 37m 35s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 29m | Avg: 37m 16s | Max: 40m 58s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 26m | Avg: 36m 41s | Max: 40m 27s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 24m | Avg: 36m 00s | Max: 41m 35s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 22m | Avg: 35m 36s | Max: 42m 05s
      🟩 Clang18            Pass: 100%/11  | Total:  5h 07m | Avg: 27m 59s | Max: 42m 13s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 01m | Avg: 30m 32s | Max: 34m 36s
      🟩 GCC7               Pass: 100%/6   | Total:  3h 28m | Avg: 34m 45s | Max: 42m 05s
      🟩 GCC8               Pass: 100%/6   | Total:  3h 26m | Avg: 34m 27s | Max: 41m 11s
      🟩 GCC9               Pass: 100%/6   | Total:  3h 31m | Avg: 35m 10s | Max: 39m 30s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 28m | Avg: 37m 12s | Max: 43m 13s
      🟩 GCC11              Pass: 100%/7   | Total:  4h 45m | Avg: 40m 49s | Max: 48m 05s
      🟩 GCC12              Pass: 100%/4   | Total:  2h 44m | Avg: 41m 10s | Max: 47m 09s
      🟩 GCC13              Pass: 100%/14  | Total:  5h 41m | Avg: 24m 23s | Max: 40m 55s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 12m | Avg: 44m 11s | Max: 47m 56s
      🟩 MSVC14.16          Pass: 100%/1   | Total:  1h 06m | Avg:  1h 06m | Max:  1h 06m | Hits:  67%/2636  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 09m | Avg:  1h 04m | Max:  1h 06m | Hits:  63%/5272  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 41m | Avg: 50m 35s | Max:  1h 18m | Hits:  83%/5272  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  5h 17m | Avg:  1h 19m | Max:  1h 26m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 03h | Avg: 34m 29s | Max: 46m 15s
      🟩 GCC                Pass: 100%/49  | Total:  1d 03h | Avg: 33m 13s | Max: 48m 05s
      🟩 Intel              Pass: 100%/3   | Total:  2h 12m | Avg: 44m 11s | Max: 47m 56s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 57m | Avg: 59m 27s | Max:  1h 18m | Hits:  72%/13180 
      🟩 NVHPC              Pass: 100%/4   | Total:  5h 17m | Avg:  1h 19m | Max:  1h 26m
    🟩 gpu
      🟩 v100               Pass: 100%/109 | Total:  2d 19h | Avg: 36m 59s | Max:  1h 26m | Hits:  72%/13180 
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  2d 17h | Avg: 38m 30s | Max:  1h 26m | Hits:  65%/10544 
      🟩 TestCPU            Pass: 100%/4   | Total: 47m 00s | Avg: 11m 45s | Max: 22m 50s | Hits:  99%/2636  
      🟩 TestGPU            Pass: 100%/3   | Total: 56m 02s | Avg: 18m 40s | Max: 27m 46s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  2h 11m | Avg: 43m 42s | Max: 48m 05s
      🟩 90a                Pass: 100%/4   | Total:  1h 26m | Avg: 21m 37s | Max: 25m 34s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total: 14h 35m | Avg: 29m 10s | Max:  1h 05m
      🟩 14                 Pass: 100%/29  | Total: 19h 40m | Avg: 40m 42s | Max:  1h 22m | Hits:  67%/5272  
      🟩 17                 Pass: 100%/27  | Total: 18h 35m | Avg: 41m 19s | Max:  1h 23m | Hits:  58%/2636  
      🟩 20                 Pass: 100%/23  | Total: 14h 20m | Avg: 37m 23s | Max:  1h 26m | Hits:  83%/5272  
    
  • 🟩 cudax: Pass: 100%/54 | Total: 4h 48m | Avg: 5m 20s | Max: 19m 27s | Hits: 87%/224

    🟩 cpu
      🟩 amd64              Pass: 100%/50  | Total:  4h 35m | Avg:  5m 30s | Max: 19m 27s | Hits:  87%/224   
      🟩 arm64              Pass: 100%/4   | Total: 13m 34s | Avg:  3m 23s | Max:  3m 30s
    🟩 ctk
      🟩 12.0               Pass: 100%/19  | Total:  1h 43m | Avg:  5m 25s | Max: 19m 04s | Hits:  87%/112   
      🟩 12.5               Pass: 100%/2   | Total: 11m 50s | Avg:  5m 55s | Max:  6m 09s
      🟩 12.6               Pass: 100%/33  | Total:  2h 53m | Avg:  5m 15s | Max: 19m 27s | Hits:  87%/112   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/19  | Total:  1h 43m | Avg:  5m 25s | Max: 19m 04s | Hits:  87%/112   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 11m 50s | Avg:  5m 55s | Max:  6m 09s
      🟩 nvcc12.6           Pass: 100%/33  | Total:  2h 53m | Avg:  5m 15s | Max: 19m 27s | Hits:  87%/112   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/54  | Total:  4h 48m | Avg:  5m 20s | Max: 19m 27s | Hits:  87%/224   
    🟩 cxx
      🟩 Clang9             Pass: 100%/2   | Total:  8m 30s | Avg:  4m 15s | Max:  4m 24s
      🟩 Clang10            Pass: 100%/2   | Total:  8m 04s | Avg:  4m 02s | Max:  4m 18s
      🟩 Clang11            Pass: 100%/4   | Total: 14m 23s | Avg:  3m 35s | Max:  3m 45s
      🟩 Clang12            Pass: 100%/4   | Total: 14m 56s | Avg:  3m 44s | Max:  3m 53s
      🟩 Clang13            Pass: 100%/4   | Total: 14m 52s | Avg:  3m 43s | Max:  4m 01s
      🟩 Clang14            Pass: 100%/4   | Total: 28m 33s | Avg:  7m 08s | Max: 17m 11s
      🟩 Clang15            Pass: 100%/2   | Total:  7m 31s | Avg:  3m 45s | Max:  3m 48s
      🟩 Clang16            Pass: 100%/4   | Total: 14m 35s | Avg:  3m 38s | Max:  4m 06s
      🟩 Clang17            Pass: 100%/2   | Total:  8m 06s | Avg:  4m 03s | Max:  4m 16s
      🟩 Clang18            Pass: 100%/2   | Total: 21m 54s | Avg: 10m 57s | Max: 17m 58s
      🟩 GCC9               Pass: 100%/2   | Total:  7m 30s | Avg:  3m 45s | Max:  3m 48s
      🟩 GCC10              Pass: 100%/4   | Total: 15m 08s | Avg:  3m 47s | Max:  3m 56s
      🟩 GCC11              Pass: 100%/4   | Total: 14m 59s | Avg:  3m 44s | Max:  3m 56s
      🟩 GCC12              Pass: 100%/7   | Total:  1h 10m | Avg: 10m 00s | Max: 19m 27s
      🟩 GCC13              Pass: 100%/3   | Total: 10m 09s | Avg:  3m 23s | Max:  3m 30s
      🟩 MSVC14.36          Pass: 100%/1   | Total:  8m 45s | Avg:  8m 45s | Max:  8m 45s | Hits:  87%/112   
      🟩 MSVC14.39          Pass: 100%/1   | Total:  9m 01s | Avg:  9m 01s | Max:  9m 01s | Hits:  87%/112   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 11m 50s | Avg:  5m 55s | Max:  6m 09s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/30  | Total:  2h 21m | Avg:  4m 42s | Max: 17m 58s
      🟩 GCC                Pass: 100%/20  | Total:  1h 57m | Avg:  5m 53s | Max: 19m 27s
      🟩 MSVC               Pass: 100%/2   | Total: 17m 46s | Avg:  8m 53s | Max:  9m 01s | Hits:  87%/224   
      🟩 NVHPC              Pass: 100%/2   | Total: 11m 50s | Avg:  5m 55s | Max:  6m 09s
    🟩 gpu
      🟩 v100               Pass: 100%/54  | Total:  4h 48m | Avg:  5m 20s | Max: 19m 27s | Hits:  87%/224   
    🟩 jobs
      🟩 Build              Pass: 100%/49  | Total:  3h 17m | Avg:  4m 02s | Max:  9m 01s | Hits:  87%/224   
      🟩 Test               Pass: 100%/5   | Total:  1h 30m | Avg: 18m 11s | Max: 19m 27s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  3m 01s | Avg:  3m 01s | Max:  3m 01s
      🟩 90a                Pass: 100%/1   | Total:  3m 15s | Avg:  3m 15s | Max:  3m 15s
    🟩 std
      🟩 17                 Pass: 100%/29  | Total:  2h 21m | Avg:  4m 52s | Max: 19m 27s
      🟩 20                 Pass: 100%/25  | Total:  2h 27m | Avg:  5m 54s | Max: 17m 58s | Hits:  87%/224   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 11m 33s | Avg: 5m 46s | Max: 9m 17s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 11m 33s | Avg:  5m 46s | Max:  9m 17s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 11m 33s | Avg:  5m 46s | Max:  9m 17s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 11m 33s | Avg:  5m 46s | Max:  9m 17s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 11m 33s | Avg:  5m 46s | Max:  9m 17s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 11m 33s | Avg:  5m 46s | Max:  9m 17s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 11m 33s | Avg:  5m 46s | Max:  9m 17s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 11m 33s | Avg:  5m 46s | Max:  9m 17s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 16s | Avg:  2m 16s | Max:  2m 16s
      🟩 Test               Pass: 100%/1   | Total:  9m 17s | Avg:  9m 17s | Max:  9m 17s
    
  • 🟩 python: Pass: 100%/1 | Total: 16m 38s | Avg: 16m 38s | Max: 16m 38s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 16m 38s | Avg: 16m 38s | Max: 16m 38s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 16m 38s | Avg: 16m 38s | Max: 16m 38s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 16m 38s | Avg: 16m 38s | Max: 16m 38s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 16m 38s | Avg: 16m 38s | Max: 16m 38s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 16m 38s | Avg: 16m 38s | Max: 16m 38s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 16m 38s | Avg: 16m 38s | Max: 16m 38s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 16m 38s | Avg: 16m 38s | Max: 16m 38s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 16m 38s | Avg: 16m 38s | Max: 16m 38s
    

👃 Inspect Changes

Modifications in project?

|     | Project                 |
|-----|-------------------------|
|     | CCCL Infrastructure     |
| +/- | libcu++                 |
| +/- | CUB                     |
| +/- | Thrust                  |
|     | CUDA Experimental       |
|     | python                  |
|     | CCCL C Parallel Library |
|     | Catch2Helper            |

Modifications in project or dependencies?

|     | Project                 |
|-----|-------------------------|
|     | CCCL Infrastructure     |
| +/- | libcu++                 |
| +/- | CUB                     |
| +/- | Thrust                  |
| +/- | CUDA Experimental       |
| +/- | python                  |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper            |

🏃‍ Runner counts (total jobs: 394)

|   # | Runner                        |
|-----|-------------------------------|
| 326 | linux-amd64-cpu16             |
|  28 | linux-arm64-cpu16             |
|  25 | linux-amd64-gpu-v100-latest-1 |
|  15 | windows-amd64-cpu16           |

@bernhardmgruber
Contributor Author

Added a benchmark on H200 (see the PR description). Looking good, except for our Fibonacci transformation:

|  T{ct}  |  Elements  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |       Diff |   %Diff |  Status  |
|---------|------------|------------|-------------|------------|-------------|------------|---------|----------|
|   U32   |    2^16    |   9.858 us |       2.82% |   9.025 us |       3.08% |  -0.833 us |  -8.45% |   FAIL   |
|   U32   |    2^20    |  14.385 us |       2.76% |  16.259 us |       1.84% |   1.875 us |  13.03% |   FAIL   |
|   U32   |    2^24    |  86.839 us |       0.49% |  88.620 us |       0.37% |   1.781 us |   2.05% |   FAIL   |
|   U32   |    2^28    |   1.238 ms |       0.08% |   1.251 ms |       0.09% |  12.555 us |   1.01% |   FAIL   |

Contributor

github-actions bot commented Nov 4, 2024

🟩 CI finished in 1h 35m: Pass: 100%/394 | Total: 8d 00h | Avg: 29m 15s | Max: 1h 14m | Hits: 70%/25848
  • 🟩 libcudacxx: Pass: 100%/118 | Total: 1d 02h | Avg: 13m 26s | Max: 39m 16s | Hits: 68%/9484

    🟩 cpu
      🟩 amd64              Pass: 100%/110 | Total:  1d 01h | Avg: 13m 39s | Max: 39m 16s | Hits:  68%/9484  
      🟩 arm64              Pass: 100%/8   | Total:  1h 24m | Avg: 10m 30s | Max: 23m 35s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  3h 09m | Avg: 12m 36s | Max: 32m 14s | Hits:  35%/2177  
      🟩 11.8               Pass: 100%/3   | Total:  1h 16m | Avg: 25m 31s | Max: 30m 49s
      🟩 12.5               Pass: 100%/4   | Total:  2h 09m | Avg: 32m 17s | Max: 39m 16s
      🟩 12.6               Pass: 100%/96  | Total: 19h 52m | Avg: 12m 25s | Max: 33m 11s | Hits:  78%/7307  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/12  | Total:  2h 31m | Avg: 12m 36s | Max: 20m 02s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  3h 09m | Avg: 12m 36s | Max: 32m 14s | Hits:  35%/2177  
      🟩 nvcc11.8           Pass: 100%/3   | Total:  1h 16m | Avg: 25m 31s | Max: 30m 49s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  2h 09m | Avg: 32m 17s | Max: 39m 16s
      🟩 nvcc12.6           Pass: 100%/84  | Total: 17h 20m | Avg: 12m 23s | Max: 33m 11s | Hits:  78%/7307  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/12  | Total:  2h 31m | Avg: 12m 36s | Max: 20m 02s
      🟩 nvcc               Pass: 100%/106 | Total: 23h 55m | Avg: 13m 32s | Max: 39m 16s | Hits:  68%/9484  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  2h 03m | Avg: 20m 32s | Max: 29m 21s
      🟩 Clang10            Pass: 100%/3   | Total: 43m 47s | Avg: 14m 35s | Max: 23m 10s
      🟩 Clang11            Pass: 100%/4   | Total: 33m 35s | Avg:  8m 23s | Max: 20m 07s
      🟩 Clang12            Pass: 100%/4   | Total: 58m 50s | Avg: 14m 42s | Max: 25m 23s
      🟩 Clang13            Pass: 100%/4   | Total: 43m 26s | Avg: 10m 51s | Max: 29m 58s
      🟩 Clang14            Pass: 100%/4   | Total: 19m 14s | Avg:  4m 48s | Max:  5m 09s
      🟩 Clang15            Pass: 100%/4   | Total: 51m 30s | Avg: 12m 52s | Max: 23m 52s
      🟩 Clang16            Pass: 100%/4   | Total: 31m 21s | Avg:  7m 50s | Max: 17m 33s
      🟩 Clang17            Pass: 100%/4   | Total:  1h 01m | Avg: 15m 25s | Max: 26m 47s
      🟩 Clang18            Pass: 100%/18  | Total:  3h 52m | Avg: 12m 55s | Max: 27m 27s
      🟩 GCC6               Pass: 100%/2   | Total: 26m 12s | Avg: 13m 06s | Max: 22m 57s
      🟩 GCC7               Pass: 100%/6   | Total: 38m 30s | Avg:  6m 25s | Max: 21m 31s
      🟩 GCC8               Pass: 100%/6   | Total:  1h 20m | Avg: 13m 24s | Max: 26m 54s
      🟩 GCC9               Pass: 100%/6   | Total: 54m 39s | Avg:  9m 06s | Max: 22m 44s
      🟩 GCC10              Pass: 100%/4   | Total: 52m 06s | Avg: 13m 01s | Max: 26m 25s
      🟩 GCC11              Pass: 100%/7   | Total:  1h 59m | Avg: 17m 06s | Max: 30m 49s
      🟩 GCC12              Pass: 100%/4   | Total: 34m 33s | Avg:  8m 38s | Max: 20m 28s
      🟩 GCC13              Pass: 100%/17  | Total:  3h 30m | Avg: 12m 23s | Max: 27m 44s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 53m 31s | Avg: 17m 50s | Max: 26m 40s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 32m 14s | Avg: 32m 14s | Max: 32m 14s | Hits:  35%/2177  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 43m 54s | Avg: 21m 57s | Max: 33m 11s | Hits:  67%/4717  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 11m 52s | Avg: 11m 52s | Max: 11m 52s | Hits:  98%/2590  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  2h 09m | Avg: 32m 17s | Max: 39m 16s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total: 11h 39m | Avg: 12m 42s | Max: 29m 58s
      🟩 GCC                Pass: 100%/52  | Total: 10h 16m | Avg: 11m 51s | Max: 30m 49s
      🟩 Intel              Pass: 100%/3   | Total: 53m 31s | Avg: 17m 50s | Max: 26m 40s
      🟩 MSVC               Pass: 100%/4   | Total:  1h 28m | Avg: 22m 00s | Max: 33m 11s | Hits:  68%/9484  
      🟩 NVHPC              Pass: 100%/4   | Total:  2h 09m | Avg: 32m 17s | Max: 39m 16s
    🟩 gpu
      🟩 v100               Pass: 100%/118 | Total:  1d 02h | Avg: 13m 26s | Max: 39m 16s | Hits:  68%/9484  
    🟩 jobs
      🟩 Build              Pass: 100%/110 | Total: 23h 50m | Avg: 13m 00s | Max: 39m 16s | Hits:  68%/9484  
      🟩 NVRTC              Pass: 100%/4   | Total:  1h 26m | Avg: 21m 39s | Max: 27m 43s
      🟩 Test               Pass: 100%/3   | Total:  1h 07m | Avg: 22m 34s | Max: 27m 27s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  2m 19s | Avg:  2m 19s | Max:  2m 19s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  1h 16m | Avg: 25m 31s | Max: 30m 49s
      🟩 90                 Pass: 100%/4   | Total: 42m 27s | Avg: 10m 36s | Max: 13m 19s
      🟩 90a                Pass: 100%/8   | Total: 59m 12s | Avg:  7m 24s | Max: 13m 23s
    🟩 std
      🟩 11                 Pass: 100%/32  | Total:  6h 09m | Avg: 11m 32s | Max: 29m 01s
      🟩 14                 Pass: 100%/32  | Total:  7h 43m | Avg: 14m 28s | Max: 36m 09s | Hits:  34%/4457  
      🟩 17                 Pass: 100%/30  | Total:  7h 43m | Avg: 15m 27s | Max: 39m 16s | Hits:  99%/2437  
      🟩 20                 Pass: 100%/23  | Total:  4h 48m | Avg: 12m 32s | Max: 27m 44s | Hits:  98%/2590  
    
  • 🟩 cub: Pass: 100%/110 | Total: 3d 19h | Avg: 50m 10s | Max: 1h 14m | Hits: 65%/2948

    🟩 cpu
      🟩 amd64              Pass: 100%/102 | Total:  3d 12h | Avg: 49m 46s | Max:  1h 14m | Hits:  65%/2948  
      🟩 arm64              Pass: 100%/8   | Total:  7h 21m | Avg: 55m 13s | Max:  1h 02m
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total: 11h 46m | Avg: 47m 05s | Max: 56m 33s | Hits:  65%/737   
      🟩 11.8               Pass: 100%/3   | Total:  3h 29m | Avg:  1h 09m | Max:  1h 14m
      🟩 12.5               Pass: 100%/4   | Total:  4h 13m | Avg:  1h 03m | Max:  1h 05m
      🟩 12.6               Pass: 100%/88  | Total:  3d 00h | Avg: 49m 25s | Max:  1h 02m | Hits:  65%/2211  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  3h 47m | Avg: 56m 53s | Max: 57m 49s
      🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 46m | Avg: 47m 05s | Max: 56m 33s | Hits:  65%/737   
      🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 29m | Avg:  1h 09m | Max:  1h 14m
      🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 13m | Avg:  1h 03m | Max:  1h 05m
      🟩 nvcc12.6           Pass: 100%/84  | Total:  2d 20h | Avg: 49m 04s | Max:  1h 02m | Hits:  65%/2211  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  3h 47m | Avg: 56m 53s | Max: 57m 49s
      🟩 nvcc               Pass: 100%/106 | Total:  3d 16h | Avg: 49m 55s | Max:  1h 14m | Hits:  65%/2948  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  4h 51m | Avg: 48m 36s | Max: 54m 22s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 38m | Avg: 52m 49s | Max: 57m 15s
      🟩 Clang11            Pass: 100%/4   | Total:  3h 24m | Avg: 51m 11s | Max: 53m 45s
      🟩 Clang12            Pass: 100%/4   | Total:  3h 36m | Avg: 54m 08s | Max: 55m 28s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 36m | Avg: 54m 07s | Max: 57m 06s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 32m | Avg: 53m 02s | Max: 54m 45s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 38m | Avg: 54m 40s | Max: 57m 52s
      🟩 Clang16            Pass: 100%/4   | Total:  3h 37m | Avg: 54m 21s | Max: 58m 54s
      🟩 Clang17            Pass: 100%/4   | Total:  3h 24m | Avg: 51m 14s | Max: 52m 46s
      🟩 Clang18            Pass: 100%/11  | Total:  8h 59m | Avg: 49m 02s | Max: 57m 49s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 30m | Avg: 45m 19s | Max: 45m 59s
      🟩 GCC7               Pass: 100%/6   | Total:  4h 55m | Avg: 49m 13s | Max: 56m 19s
      🟩 GCC8               Pass: 100%/6   | Total:  4h 58m | Avg: 49m 45s | Max: 56m 03s
      🟩 GCC9               Pass: 100%/6   | Total:  5h 05m | Avg: 50m 50s | Max: 57m 06s
      🟩 GCC10              Pass: 100%/4   | Total:  3h 37m | Avg: 54m 22s | Max: 58m 52s
      🟩 GCC11              Pass: 100%/7   | Total:  7h 04m | Avg:  1h 00m | Max:  1h 14m
      🟩 GCC12              Pass: 100%/4   | Total:  3h 26m | Avg: 51m 35s | Max: 52m 16s
      🟩 GCC13              Pass: 100%/16  | Total:  9h 03m | Avg: 33m 57s | Max:  1h 02m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 53m | Avg: 57m 41s | Max:  1h 01m
      🟩 MSVC14.16          Pass: 100%/1   | Total: 56m 33s | Avg: 56m 33s | Max: 56m 33s | Hits:  65%/737   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 51m | Avg: 55m 33s | Max: 55m 42s | Hits:  65%/1474  
      🟩 MSVC14.39          Pass: 100%/1   | Total:  1h 02m | Avg:  1h 02m | Max:  1h 02m | Hits:  65%/737   
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 13m | Avg:  1h 03m | Max:  1h 05m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 17h | Avg: 51m 40s | Max: 58m 54s
      🟩 GCC                Pass: 100%/51  | Total:  1d 15h | Avg: 46m 41s | Max:  1h 14m
      🟩 Intel              Pass: 100%/3   | Total:  2h 53m | Avg: 57m 41s | Max:  1h 01m
      🟩 MSVC               Pass: 100%/4   | Total:  3h 50m | Avg: 57m 39s | Max:  1h 02m | Hits:  65%/2948  
      🟩 NVHPC              Pass: 100%/4   | Total:  4h 13m | Avg:  1h 03m | Max:  1h 05m
    🟩 gpu
      🟩 v100               Pass: 100%/110 | Total:  3d 19h | Avg: 50m 10s | Max:  1h 14m | Hits:  65%/2948  
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  3d 17h | Avg: 52m 32s | Max:  1h 14m | Hits:  65%/2948  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 20m 13s | Avg: 20m 13s | Max: 20m 13s
      🟩 GraphCapture       Pass: 100%/1   | Total: 17m 11s | Avg: 17m 11s | Max: 17m 11s
      🟩 HostLaunch         Pass: 100%/3   | Total: 53m 17s | Avg: 17m 45s | Max: 19m 44s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 08m | Avg: 22m 53s | Max: 24m 04s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 29m | Avg:  1h 09m | Max:  1h 14m
      🟩 90a                Pass: 100%/4   | Total:  1h 31m | Avg: 22m 49s | Max: 23m 46s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  1d 01h | Avg: 50m 02s | Max:  1h 07m
      🟩 14                 Pass: 100%/29  | Total:  1d 01h | Avg: 52m 25s | Max:  1h 07m | Hits:  65%/1474  
      🟩 17                 Pass: 100%/27  | Total: 23h 32m | Avg: 52m 18s | Max:  1h 14m | Hits:  65%/737   
      🟩 20                 Pass: 100%/24  | Total: 18h 05m | Avg: 45m 13s | Max:  1h 02m | Hits:  65%/737   
    
  • 🟩 thrust: Pass: 100%/109 | Total: 2d 20h | Avg: 37m 49s | Max: 1h 08m | Hits: 73%/13180

    🟩 cpu
      🟩 amd64              Pass: 100%/101 | Total:  2d 16h | Avg: 38m 01s | Max:  1h 08m | Hits:  73%/13180 
      🟩 arm64              Pass: 100%/8   | Total:  4h 42m | Avg: 35m 20s | Max: 41m 45s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  9h 14m | Avg: 36m 57s | Max: 55m 42s | Hits:  68%/2636  
      🟩 11.8               Pass: 100%/3   | Total:  2h 08m | Avg: 42m 49s | Max: 47m 42s
      🟩 12.5               Pass: 100%/4   | Total:  3h 56m | Avg: 59m 06s | Max:  1h 08m
      🟩 12.6               Pass: 100%/87  | Total:  2d 05h | Avg: 36m 49s | Max:  1h 08m | Hits:  74%/10544 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  1h 58m | Avg: 29m 41s | Max: 32m 44s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  9h 14m | Avg: 36m 57s | Max: 55m 42s | Hits:  68%/2636  
      🟩 nvcc11.8           Pass: 100%/3   | Total:  2h 08m | Avg: 42m 49s | Max: 47m 42s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  3h 56m | Avg: 59m 06s | Max:  1h 08m
      🟩 nvcc12.6           Pass: 100%/83  | Total:  2d 03h | Avg: 37m 10s | Max:  1h 08m | Hits:  74%/10544 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  1h 58m | Avg: 29m 41s | Max: 32m 44s
      🟩 nvcc               Pass: 100%/105 | Total:  2d 18h | Avg: 38m 08s | Max:  1h 08m | Hits:  73%/13180 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  3h 44m | Avg: 37m 25s | Max: 44m 58s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 01m | Avg: 40m 28s | Max: 45m 42s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 34m | Avg: 38m 38s | Max: 45m 55s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 41m | Avg: 40m 18s | Max: 43m 02s
      🟩 Clang13            Pass: 100%/4   | Total:  2h 33m | Avg: 38m 27s | Max: 43m 21s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 36m | Avg: 39m 13s | Max: 43m 01s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 37m | Avg: 39m 21s | Max: 41m 33s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 31m | Avg: 37m 55s | Max: 43m 37s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 25m | Avg: 36m 28s | Max: 41m 43s
      🟩 Clang18            Pass: 100%/11  | Total:  5h 17m | Avg: 28m 52s | Max: 44m 23s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 07m | Avg: 33m 35s | Max: 38m 56s
      🟩 GCC7               Pass: 100%/6   | Total:  3h 37m | Avg: 36m 10s | Max: 44m 09s
      🟩 GCC8               Pass: 100%/6   | Total:  3h 33m | Avg: 35m 31s | Max: 39m 35s
      🟩 GCC9               Pass: 100%/6   | Total:  3h 57m | Avg: 39m 35s | Max: 45m 45s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 37m | Avg: 39m 20s | Max: 43m 08s
      🟩 GCC11              Pass: 100%/7   | Total:  4h 47m | Avg: 41m 08s | Max: 47m 42s
      🟩 GCC12              Pass: 100%/4   | Total:  2h 45m | Avg: 41m 21s | Max: 45m 41s
      🟩 GCC13              Pass: 100%/14  | Total:  6h 29m | Avg: 27m 49s | Max: 41m 45s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 18m | Avg: 46m 16s | Max: 50m 32s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 55m 42s | Avg: 55m 42s | Max: 55m 42s | Hits:  68%/2636  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 01m | Avg:  1h 00m | Max:  1h 01m | Hits:  66%/5272  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 30m | Avg: 45m 03s | Max:  1h 08m | Hits:  82%/5272  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  3h 56m | Avg: 59m 06s | Max:  1h 08m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 05h | Avg: 36m 21s | Max: 45m 55s
      🟩 GCC                Pass: 100%/49  | Total:  1d 04h | Avg: 35m 24s | Max: 47m 42s
      🟩 Intel              Pass: 100%/3   | Total:  2h 18m | Avg: 46m 16s | Max: 50m 32s
      🟩 MSVC               Pass: 100%/5   | Total:  4h 27m | Avg: 53m 32s | Max:  1h 08m | Hits:  73%/13180 
      🟩 NVHPC              Pass: 100%/4   | Total:  3h 56m | Avg: 59m 06s | Max:  1h 08m
    🟩 gpu
      🟩 v100               Pass: 100%/109 | Total:  2d 20h | Avg: 37m 49s | Max:  1h 08m | Hits:  73%/13180 
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  2d 18h | Avg: 39m 22s | Max:  1h 08m | Hits:  66%/10544 
      🟩 TestCPU            Pass: 100%/4   | Total: 46m 41s | Avg: 11m 40s | Max: 21m 26s | Hits:  99%/2636  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 00m | Avg: 20m 16s | Max: 33m 30s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  2h 08m | Avg: 42m 49s | Max: 47m 42s
      🟩 90a                Pass: 100%/4   | Total:  1h 40m | Avg: 25m 01s | Max: 29m 10s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total: 15h 45m | Avg: 31m 30s | Max: 50m 32s
      🟩 14                 Pass: 100%/29  | Total: 19h 55m | Avg: 41m 12s | Max:  1h 00m | Hits:  66%/5272  
      🟩 17                 Pass: 100%/27  | Total: 19h 01m | Avg: 42m 15s | Max:  1h 01m | Hits:  68%/2636  
      🟩 20                 Pass: 100%/23  | Total: 14h 01m | Avg: 36m 36s | Max:  1h 08m | Hits:  82%/5272  
    
  • 🟩 cudax: Pass: 100%/54 | Total: 4h 31m | Avg: 5m 01s | Max: 19m 11s | Hits: 88%/236

    🟩 cpu
      🟩 amd64              Pass: 100%/50  | Total:  4h 17m | Avg:  5m 08s | Max: 19m 11s | Hits:  88%/236   
      🟩 arm64              Pass: 100%/4   | Total: 14m 47s | Avg:  3m 41s | Max:  4m 21s
    🟩 ctk
      🟩 12.0               Pass: 100%/19  | Total:  1h 32m | Avg:  4m 50s | Max: 15m 02s | Hits:  88%/118   
      🟩 12.5               Pass: 100%/2   | Total: 11m 19s | Avg:  5m 39s | Max:  5m 57s
      🟩 12.6               Pass: 100%/33  | Total:  2h 48m | Avg:  5m 06s | Max: 19m 11s | Hits:  88%/118   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/19  | Total:  1h 32m | Avg:  4m 50s | Max: 15m 02s | Hits:  88%/118   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 11m 19s | Avg:  5m 39s | Max:  5m 57s
      🟩 nvcc12.6           Pass: 100%/33  | Total:  2h 48m | Avg:  5m 06s | Max: 19m 11s | Hits:  88%/118   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/54  | Total:  4h 31m | Avg:  5m 01s | Max: 19m 11s | Hits:  88%/236   
    🟩 cxx
      🟩 Clang9             Pass: 100%/2   | Total:  7m 50s | Avg:  3m 55s | Max:  4m 11s
      🟩 Clang10            Pass: 100%/2   | Total:  8m 18s | Avg:  4m 09s | Max:  4m 47s
      🟩 Clang11            Pass: 100%/4   | Total: 14m 20s | Avg:  3m 35s | Max:  3m 51s
      🟩 Clang12            Pass: 100%/4   | Total: 14m 29s | Avg:  3m 37s | Max:  3m 50s
      🟩 Clang13            Pass: 100%/4   | Total: 14m 52s | Avg:  3m 43s | Max:  3m 58s
      🟩 Clang14            Pass: 100%/4   | Total: 26m 12s | Avg:  6m 33s | Max: 15m 02s
      🟩 Clang15            Pass: 100%/2   | Total:  7m 53s | Avg:  3m 56s | Max:  4m 17s
      🟩 Clang16            Pass: 100%/4   | Total: 15m 39s | Avg:  3m 54s | Max:  4m 21s
      🟩 Clang17            Pass: 100%/2   | Total:  7m 33s | Avg:  3m 46s | Max:  3m 50s
      🟩 Clang18            Pass: 100%/2   | Total: 22m 47s | Avg: 11m 23s | Max: 19m 11s
      🟩 GCC9               Pass: 100%/2   | Total:  6m 42s | Avg:  3m 21s | Max:  3m 23s
      🟩 GCC10              Pass: 100%/4   | Total: 13m 45s | Avg:  3m 26s | Max:  3m 45s
      🟩 GCC11              Pass: 100%/4   | Total: 14m 01s | Avg:  3m 30s | Max:  3m 37s
      🟩 GCC12              Pass: 100%/7   | Total:  1h 01m | Avg:  8m 48s | Max: 16m 28s
      🟩 GCC13              Pass: 100%/3   | Total: 10m 06s | Avg:  3m 22s | Max:  3m 35s
      🟩 MSVC14.36          Pass: 100%/1   | Total:  7m 04s | Avg:  7m 04s | Max:  7m 04s | Hits:  88%/118   
      🟩 MSVC14.39          Pass: 100%/1   | Total:  7m 15s | Avg:  7m 15s | Max:  7m 15s | Hits:  88%/118   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 11m 19s | Avg:  5m 39s | Max:  5m 57s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/30  | Total:  2h 19m | Avg:  4m 39s | Max: 19m 11s
      🟩 GCC                Pass: 100%/20  | Total:  1h 46m | Avg:  5m 18s | Max: 16m 28s
      🟩 MSVC               Pass: 100%/2   | Total: 14m 19s | Avg:  7m 09s | Max:  7m 15s | Hits:  88%/236   
      🟩 NVHPC              Pass: 100%/2   | Total: 11m 19s | Avg:  5m 39s | Max:  5m 57s
    🟩 gpu
      🟩 v100               Pass: 100%/54  | Total:  4h 31m | Avg:  5m 01s | Max: 19m 11s | Hits:  88%/236   
    🟩 jobs
      🟩 Build              Pass: 100%/49  | Total:  3h 10m | Avg:  3m 53s | Max:  7m 15s | Hits:  88%/236   
      🟩 Test               Pass: 100%/5   | Total:  1h 21m | Avg: 16m 17s | Max: 19m 11s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  2m 48s | Avg:  2m 48s | Max:  2m 48s
      🟩 90a                Pass: 100%/1   | Total:  3m 08s | Avg:  3m 08s | Max:  3m 08s
    🟩 std
      🟩 17                 Pass: 100%/29  | Total:  2h 11m | Avg:  4m 31s | Max: 15m 49s
      🟩 20                 Pass: 100%/25  | Total:  2h 20m | Avg:  5m 37s | Max: 19m 11s | Hits:  88%/236   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 38s | Avg: 5m 19s | Max: 8m 31s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 31s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 31s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 31s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 31s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 31s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 31s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 38s | Avg:  5m 19s | Max:  8m 31s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 07s | Avg:  2m 07s | Max:  2m 07s
      🟩 Test               Pass: 100%/1   | Total:  8m 31s | Avg:  8m 31s | Max:  8m 31s
    
  • 🟩 python: Pass: 100%/1 | Total: 14m 44s | Avg: 14m 44s | Max: 14m 44s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 14m 44s | Avg: 14m 44s | Max: 14m 44s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 14m 44s | Avg: 14m 44s | Max: 14m 44s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 14m 44s | Avg: 14m 44s | Max: 14m 44s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 14m 44s | Avg: 14m 44s | Max: 14m 44s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 14m 44s | Avg: 14m 44s | Max: 14m 44s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 14m 44s | Avg: 14m 44s | Max: 14m 44s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 14m 44s | Avg: 14m 44s | Max: 14m 44s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 14m 44s | Avg: 14m 44s | Max: 14m 44s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 394)

# Runner
326 linux-amd64-cpu16
28 linux-arm64-cpu16
25 linux-amd64-gpu-v100-latest-1
15 windows-amd64-cpu16

libcudacxx/include/cuda/__functional/address_stability.h
static constexpr bool allows_copied_arguments = true;
};

//! Creates a new function object from an existing one, allowing its arguments to be copies of whatever source they come
Collaborator

Should this use doxygen keywords?

Suggested change
//! Creates a new function object from an existing one, allowing its arguments to be copies of whatever source they come
//! @brief Creates a new function object from an existing one, allowing its arguments to be copies of whatever source they come

Contributor Author

I don't know, should it? I intended a simple block of documentation for this function, so I figured I would not need any Doxygen commands.

Contributor

github-actions bot commented Nov 5, 2024

🟩 CI finished in 2h 26m: Pass: 100%/394 | Total: 8d 05h | Avg: 30m 01s | Max: 1h 26m | Hits: 59%/25850
  • 🟩 libcudacxx: Pass: 100%/118 | Total: 1d 04h | Avg: 14m 20s | Max: 41m 21s | Hits: 49%/9484

    🟩 cpu
      🟩 amd64              Pass: 100%/110 | Total:  1d 03h | Avg: 15m 03s | Max: 41m 21s | Hits:  49%/9484  
      🟩 arm64              Pass: 100%/8   | Total: 36m 01s | Avg:  4m 30s | Max: 11m 34s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  3h 06m | Avg: 12m 27s | Max: 30m 43s | Hits:  35%/2177  
      🟩 11.8               Pass: 100%/3   | Total: 47m 23s | Avg: 15m 47s | Max: 22m 21s
      🟩 12.5               Pass: 100%/4   | Total:  1h 30m | Avg: 22m 38s | Max: 37m 34s
      🟩 12.6               Pass: 100%/96  | Total: 22h 46m | Avg: 14m 14s | Max: 41m 21s | Hits:  53%/7307  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/12  | Total:  2h 28m | Avg: 12m 24s | Max: 21m 47s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  3h 06m | Avg: 12m 27s | Max: 30m 43s | Hits:  35%/2177  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 47m 23s | Avg: 15m 47s | Max: 22m 21s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  1h 30m | Avg: 22m 38s | Max: 37m 34s
      🟩 nvcc12.6           Pass: 100%/84  | Total: 20h 17m | Avg: 14m 29s | Max: 41m 21s | Hits:  53%/7307  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/12  | Total:  2h 28m | Avg: 12m 24s | Max: 21m 47s
      🟩 nvcc               Pass: 100%/106 | Total:  1d 01h | Avg: 14m 33s | Max: 41m 21s | Hits:  49%/9484  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  1h 01m | Avg: 10m 18s | Max: 24m 22s
      🟩 Clang10            Pass: 100%/3   | Total: 47m 18s | Avg: 15m 46s | Max: 27m 48s
      🟩 Clang11            Pass: 100%/4   | Total:  1h 19m | Avg: 19m 59s | Max: 25m 51s
      🟩 Clang12            Pass: 100%/4   | Total: 39m 00s | Avg:  9m 45s | Max: 26m 14s
      🟩 Clang13            Pass: 100%/4   | Total:  1h 13m | Avg: 18m 26s | Max: 25m 27s
      🟩 Clang14            Pass: 100%/4   | Total:  1h 17m | Avg: 19m 17s | Max: 25m 55s
      🟩 Clang15            Pass: 100%/4   | Total: 31m 41s | Avg:  7m 55s | Max: 17m 53s
      🟩 Clang16            Pass: 100%/4   | Total:  1h 16m | Avg: 19m 05s | Max: 29m 05s
      🟩 Clang17            Pass: 100%/4   | Total: 42m 10s | Avg: 10m 32s | Max: 17m 43s
      🟩 Clang18            Pass: 100%/18  | Total:  3h 05m | Avg: 10m 17s | Max: 21m 47s
      🟩 GCC6               Pass: 100%/2   | Total: 27m 41s | Avg: 13m 50s | Max: 24m 46s
      🟩 GCC7               Pass: 100%/6   | Total:  1h 04m | Avg: 10m 40s | Max: 22m 28s
      🟩 GCC8               Pass: 100%/6   | Total:  1h 32m | Avg: 15m 25s | Max: 24m 12s
      🟩 GCC9               Pass: 100%/6   | Total:  1h 16m | Avg: 12m 44s | Max: 27m 33s
      🟩 GCC10              Pass: 100%/4   | Total:  1h 08m | Avg: 17m 02s | Max: 26m 49s
      🟩 GCC11              Pass: 100%/7   | Total:  1h 40m | Avg: 14m 21s | Max: 24m 10s
      🟩 GCC12              Pass: 100%/4   | Total:  1h 19m | Avg: 19m 53s | Max: 31m 13s
      🟩 GCC13              Pass: 100%/17  | Total:  3h 17m | Avg: 11m 35s | Max: 29m 26s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  1h 02m | Avg: 20m 40s | Max: 30m 45s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 30m 43s | Avg: 30m 43s | Max: 30m 43s | Hits:  35%/2177  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 46m 37s | Avg: 23m 18s | Max: 35m 50s | Hits:  64%/4717  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 41m 21s | Avg: 41m 21s | Max: 41m 21s | Hits:  34%/2590  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  1h 30m | Avg: 22m 38s | Max: 37m 34s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total: 11h 54m | Avg: 12m 59s | Max: 29m 05s
      🟩 GCC                Pass: 100%/52  | Total: 11h 45m | Avg: 13m 34s | Max: 31m 13s
      🟩 Intel              Pass: 100%/3   | Total:  1h 02m | Avg: 20m 40s | Max: 30m 45s
      🟩 MSVC               Pass: 100%/4   | Total:  1h 58m | Avg: 29m 40s | Max: 41m 21s | Hits:  49%/9484  
      🟩 NVHPC              Pass: 100%/4   | Total:  1h 30m | Avg: 22m 38s | Max: 37m 34s
    🟩 gpu
      🟩 v100               Pass: 100%/118 | Total:  1d 04h | Avg: 14m 20s | Max: 41m 21s | Hits:  49%/9484  
    🟩 jobs
      🟩 Build              Pass: 100%/110 | Total:  1d 01h | Avg: 13m 51s | Max: 41m 21s | Hits:  49%/9484  
      🟩 NVRTC              Pass: 100%/4   | Total:  1h 46m | Avg: 26m 44s | Max: 29m 26s
      🟩 Test               Pass: 100%/3   | Total: 57m 59s | Avg: 19m 19s | Max: 23m 32s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  1m 53s | Avg:  1m 53s | Max:  1m 53s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 47m 23s | Avg: 15m 47s | Max: 22m 21s
      🟩 90                 Pass: 100%/4   | Total: 40m 10s | Avg: 10m 02s | Max: 12m 45s
      🟩 90a                Pass: 100%/8   | Total: 57m 10s | Avg:  7m 08s | Max: 12m 09s
    🟩 std
      🟩 11                 Pass: 100%/32  | Total:  6h 09m | Avg: 11m 32s | Max: 28m 29s
      🟩 14                 Pass: 100%/32  | Total:  7h 18m | Avg: 13m 41s | Max: 30m 43s | Hits:  68%/4457  
      🟩 17                 Pass: 100%/30  | Total:  8h 20m | Avg: 16m 41s | Max: 37m 34s | Hits:  32%/2437  
      🟩 20                 Pass: 100%/23  | Total:  6h 21m | Avg: 16m 35s | Max: 41m 21s | Hits:  34%/2590  
    
  • 🟩 cub: Pass: 100%/110 | Total: 3d 19h | Avg: 49m 59s | Max: 1h 13m | Hits: 66%/2948

    🟩 cpu
      🟩 amd64              Pass: 100%/102 | Total:  3d 12h | Avg: 49m 38s | Max:  1h 13m | Hits:  66%/2948  
      🟩 arm64              Pass: 100%/8   | Total:  7h 15m | Avg: 54m 28s | Max: 56m 47s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total: 11h 24m | Avg: 45m 36s | Max: 50m 08s | Hits:  66%/737   
      🟩 11.8               Pass: 100%/3   | Total:  3h 16m | Avg:  1h 05m | Max:  1h 05m
      🟩 12.5               Pass: 100%/4   | Total:  4h 09m | Avg:  1h 02m | Max:  1h 05m
      🟩 12.6               Pass: 100%/88  | Total:  3d 00h | Avg: 49m 38s | Max:  1h 13m | Hits:  66%/2211  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  3h 44m | Avg: 56m 05s | Max: 58m 29s
      🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 24m | Avg: 45m 36s | Max: 50m 08s | Hits:  66%/737   
      🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 16m | Avg:  1h 05m | Max:  1h 05m
      🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 09m | Avg:  1h 02m | Max:  1h 05m
      🟩 nvcc12.6           Pass: 100%/84  | Total:  2d 21h | Avg: 49m 20s | Max:  1h 13m | Hits:  66%/2211  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  3h 44m | Avg: 56m 05s | Max: 58m 29s
      🟩 nvcc               Pass: 100%/106 | Total:  3d 15h | Avg: 49m 45s | Max:  1h 13m | Hits:  66%/2948  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  5h 02m | Avg: 50m 20s | Max: 55m 50s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 40m | Avg: 53m 32s | Max: 55m 16s
      🟩 Clang11            Pass: 100%/4   | Total:  3h 21m | Avg: 50m 25s | Max: 51m 40s
      🟩 Clang12            Pass: 100%/4   | Total:  3h 21m | Avg: 50m 29s | Max: 52m 06s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 27m | Avg: 51m 48s | Max: 53m 38s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 28m | Avg: 52m 07s | Max: 54m 33s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 32m | Avg: 53m 10s | Max: 57m 45s
      🟩 Clang16            Pass: 100%/4   | Total:  3h 38m | Avg: 54m 30s | Max: 57m 28s
      🟩 Clang17            Pass: 100%/4   | Total:  3h 28m | Avg: 52m 12s | Max: 57m 26s
      🟩 Clang18            Pass: 100%/11  | Total:  8h 51m | Avg: 48m 20s | Max: 58m 29s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 28m | Avg: 44m 06s | Max: 45m 26s
      🟩 GCC7               Pass: 100%/6   | Total:  4h 52m | Avg: 48m 47s | Max: 56m 37s
      🟩 GCC8               Pass: 100%/6   | Total:  4h 52m | Avg: 48m 43s | Max: 53m 43s
      🟩 GCC9               Pass: 100%/6   | Total:  4h 54m | Avg: 49m 07s | Max: 56m 37s
      🟩 GCC10              Pass: 100%/4   | Total:  3h 37m | Avg: 54m 28s | Max: 57m 12s
      🟩 GCC11              Pass: 100%/7   | Total:  7h 02m | Avg:  1h 00m | Max:  1h 05m
      🟩 GCC12              Pass: 100%/4   | Total:  3h 25m | Avg: 51m 17s | Max: 52m 37s
      🟩 GCC13              Pass: 100%/16  | Total:  9h 54m | Avg: 37m 10s | Max:  1h 13m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 47m | Avg: 55m 48s | Max: 56m 55s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 50m 08s | Avg: 50m 08s | Max: 50m 08s | Hits:  66%/737   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 50m | Avg: 55m 23s | Max: 55m 27s | Hits:  66%/1474  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 59m 06s | Avg: 59m 06s | Max: 59m 06s | Hits:  66%/737   
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 09m | Avg:  1h 02m | Max:  1h 05m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 16h | Avg: 51m 06s | Max: 58m 29s
      🟩 GCC                Pass: 100%/51  | Total:  1d 16h | Avg: 47m 13s | Max:  1h 13m
      🟩 Intel              Pass: 100%/3   | Total:  2h 47m | Avg: 55m 48s | Max: 56m 55s
      🟩 MSVC               Pass: 100%/4   | Total:  3h 40m | Avg: 55m 00s | Max: 59m 06s | Hits:  66%/2948  
      🟩 NVHPC              Pass: 100%/4   | Total:  4h 09m | Avg:  1h 02m | Max:  1h 05m
    🟩 gpu
      🟩 v100               Pass: 100%/110 | Total:  3d 19h | Avg: 49m 59s | Max:  1h 13m | Hits:  66%/2948  
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  3d 16h | Avg: 51m 48s | Max:  1h 05m | Hits:  66%/2948  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 19m 34s | Avg: 19m 34s | Max: 19m 34s
      🟩 GraphCapture       Pass: 100%/1   | Total: 13m 58s | Avg: 13m 58s | Max: 13m 58s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 51m | Avg: 37m 06s | Max:  1h 13m
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 08m | Avg: 22m 49s | Max: 25m 07s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 16m | Avg:  1h 05m | Max:  1h 05m
      🟩 90a                Pass: 100%/4   | Total:  1h 31m | Avg: 22m 56s | Max: 24m 32s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  1d 01h | Avg: 51m 16s | Max:  1h 13m
      🟩 14                 Pass: 100%/29  | Total:  1d 00h | Avg: 51m 29s | Max:  1h 05m | Hits:  66%/1474  
      🟩 17                 Pass: 100%/27  | Total: 23h 14m | Avg: 51m 39s | Max:  1h 05m | Hits:  66%/737   
      🟩 20                 Pass: 100%/24  | Total: 17h 51m | Avg: 44m 38s | Max:  1h 01m | Hits:  66%/737   
    
  • 🟩 thrust: Pass: 100%/109 | Total: 3d 00h | Avg: 39m 44s | Max: 1h 26m | Hits: 64%/13180

    🟩 cpu
      🟩 amd64              Pass: 100%/101 | Total:  2d 19h | Avg: 39m 54s | Max:  1h 26m | Hits:  64%/13180 
      🟩 arm64              Pass: 100%/8   | Total:  5h 00m | Avg: 37m 37s | Max: 43m 52s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  9h 21m | Avg: 37m 24s | Max:  1h 06m | Hits:  56%/2636  
      🟩 11.8               Pass: 100%/3   | Total:  2h 15m | Avg: 45m 05s | Max: 50m 03s
      🟩 12.5               Pass: 100%/4   | Total:  5h 13m | Avg:  1h 18m | Max:  1h 26m
      🟩 12.6               Pass: 100%/87  | Total:  2d 07h | Avg: 38m 10s | Max:  1h 24m | Hits:  66%/10544 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  2h 12m | Avg: 33m 01s | Max: 36m 23s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  9h 21m | Avg: 37m 24s | Max:  1h 06m | Hits:  56%/2636  
      🟩 nvcc11.8           Pass: 100%/3   | Total:  2h 15m | Avg: 45m 05s | Max: 50m 03s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  5h 13m | Avg:  1h 18m | Max:  1h 26m
      🟩 nvcc12.6           Pass: 100%/83  | Total:  2d 05h | Avg: 38m 25s | Max:  1h 24m | Hits:  66%/10544 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  2h 12m | Avg: 33m 01s | Max: 36m 23s
      🟩 nvcc               Pass: 100%/105 | Total:  2d 21h | Avg: 39m 59s | Max:  1h 26m | Hits:  64%/13180 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  3h 36m | Avg: 36m 07s | Max: 42m 48s
      🟩 Clang10            Pass: 100%/3   | Total:  1h 59m | Avg: 39m 41s | Max: 43m 26s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 38m | Avg: 39m 44s | Max: 43m 38s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 30m | Avg: 37m 41s | Max: 40m 09s
      🟩 Clang13            Pass: 100%/4   | Total:  2h 34m | Avg: 38m 33s | Max: 43m 03s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 45m | Avg: 41m 26s | Max: 45m 52s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 40m | Avg: 40m 04s | Max: 45m 15s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 44m | Avg: 41m 11s | Max: 47m 56s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 45m | Avg: 41m 28s | Max: 45m 35s
      🟩 Clang18            Pass: 100%/11  | Total:  5h 42m | Avg: 31m 10s | Max: 45m 15s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 09m | Avg: 34m 39s | Max: 39m 36s
      🟩 GCC7               Pass: 100%/6   | Total:  3h 36m | Avg: 36m 01s | Max: 40m 09s
      🟩 GCC8               Pass: 100%/6   | Total:  3h 42m | Avg: 37m 05s | Max: 45m 42s
      🟩 GCC9               Pass: 100%/6   | Total:  3h 48m | Avg: 38m 06s | Max: 41m 57s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 44m | Avg: 41m 13s | Max: 47m 22s
      🟩 GCC11              Pass: 100%/7   | Total:  4h 55m | Avg: 42m 14s | Max: 50m 03s
      🟩 GCC12              Pass: 100%/4   | Total:  2h 55m | Avg: 43m 47s | Max: 51m 30s
      🟩 GCC13              Pass: 100%/14  | Total:  6h 19m | Avg: 27m 05s | Max: 43m 52s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 39m | Avg: 53m 02s | Max: 59m 07s
      🟩 MSVC14.16          Pass: 100%/1   | Total:  1h 06m | Avg:  1h 06m | Max:  1h 06m | Hits:  56%/2636  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 16m | Avg:  1h 08m | Max:  1h 08m | Hits:  56%/5272  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 45m | Avg: 52m 34s | Max:  1h 24m | Hits:  76%/5272  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  5h 13m | Avg:  1h 18m | Max:  1h 26m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 05h | Avg: 37m 29s | Max: 47m 56s
      🟩 GCC                Pass: 100%/49  | Total:  1d 05h | Avg: 35m 44s | Max: 51m 30s
      🟩 Intel              Pass: 100%/3   | Total:  2h 39m | Avg: 53m 02s | Max: 59m 07s
      🟩 MSVC               Pass: 100%/5   | Total:  5h 08m | Avg:  1h 01m | Max:  1h 24m | Hits:  64%/13180 
      🟩 NVHPC              Pass: 100%/4   | Total:  5h 13m | Avg:  1h 18m | Max:  1h 26m
    🟩 gpu
      🟩 v100               Pass: 100%/109 | Total:  3d 00h | Avg: 39m 44s | Max:  1h 26m | Hits:  64%/13180 
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  2d 22h | Avg: 41m 37s | Max:  1h 26m | Hits:  55%/10544 
      🟩 TestCPU            Pass: 100%/4   | Total: 46m 55s | Avg: 11m 43s | Max: 21m 03s | Hits:  99%/2636  
      🟩 TestGPU            Pass: 100%/3   | Total: 39m 07s | Avg: 13m 02s | Max: 15m 11s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  2h 15m | Avg: 45m 05s | Max: 50m 03s
      🟩 90a                Pass: 100%/4   | Total:  1h 46m | Avg: 26m 34s | Max: 32m 01s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total: 16h 18m | Avg: 32m 36s | Max:  1h 05m
      🟩 14                 Pass: 100%/29  | Total: 20h 34m | Avg: 42m 33s | Max:  1h 20m | Hits:  57%/5272  
      🟩 17                 Pass: 100%/27  | Total: 20h 02m | Avg: 44m 31s | Max:  1h 20m | Hits:  53%/2636  
      🟩 20                 Pass: 100%/23  | Total: 15h 17m | Avg: 39m 52s | Max:  1h 26m | Hits:  76%/5272  
    
  • 🟩 cudax: Pass: 100%/54 | Total: 4h 31m | Avg: 5m 01s | Max: 18m 25s | Hits: 87%/238

    🟩 cpu
      🟩 amd64              Pass: 100%/50  | Total:  4h 17m | Avg:  5m 08s | Max: 18m 25s | Hits:  87%/238   
      🟩 arm64              Pass: 100%/4   | Total: 13m 54s | Avg:  3m 28s | Max:  3m 42s
    🟩 ctk
      🟩 12.0               Pass: 100%/19  | Total:  1h 33m | Avg:  4m 53s | Max: 16m 55s | Hits:  87%/119   
      🟩 12.5               Pass: 100%/2   | Total: 10m 39s | Avg:  5m 19s | Max:  5m 22s
      🟩 12.6               Pass: 100%/33  | Total:  2h 47m | Avg:  5m 04s | Max: 18m 25s | Hits:  87%/119   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/19  | Total:  1h 33m | Avg:  4m 53s | Max: 16m 55s | Hits:  87%/119   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 10m 39s | Avg:  5m 19s | Max:  5m 22s
      🟩 nvcc12.6           Pass: 100%/33  | Total:  2h 47m | Avg:  5m 04s | Max: 18m 25s | Hits:  87%/119   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/54  | Total:  4h 31m | Avg:  5m 01s | Max: 18m 25s | Hits:  87%/238   
    🟩 cxx
      🟩 Clang9             Pass: 100%/2   | Total:  8m 04s | Avg:  4m 02s | Max:  4m 19s
      🟩 Clang10            Pass: 100%/2   | Total:  7m 35s | Avg:  3m 47s | Max:  4m 03s
      🟩 Clang11            Pass: 100%/4   | Total: 14m 17s | Avg:  3m 34s | Max:  3m 54s
      🟩 Clang12            Pass: 100%/4   | Total: 13m 46s | Avg:  3m 26s | Max:  3m 46s
      🟩 Clang13            Pass: 100%/4   | Total: 13m 47s | Avg:  3m 26s | Max:  3m 40s
      🟩 Clang14            Pass: 100%/4   | Total: 26m 22s | Avg:  6m 35s | Max: 15m 18s
      🟩 Clang15            Pass: 100%/2   | Total:  7m 17s | Avg:  3m 38s | Max:  3m 40s
      🟩 Clang16            Pass: 100%/4   | Total: 14m 52s | Avg:  3m 43s | Max:  3m 59s
      🟩 Clang17            Pass: 100%/2   | Total:  7m 19s | Avg:  3m 39s | Max:  3m 41s
      🟩 Clang18            Pass: 100%/2   | Total: 21m 29s | Avg: 10m 44s | Max: 17m 38s
      🟩 GCC9               Pass: 100%/2   | Total:  6m 46s | Avg:  3m 23s | Max:  3m 33s
      🟩 GCC10              Pass: 100%/4   | Total: 13m 46s | Avg:  3m 26s | Max:  3m 28s
      🟩 GCC11              Pass: 100%/4   | Total: 13m 55s | Avg:  3m 28s | Max:  3m 48s
      🟩 GCC12              Pass: 100%/7   | Total:  1h 06m | Avg:  9m 31s | Max: 18m 25s
      🟩 GCC13              Pass: 100%/3   | Total:  9m 52s | Avg:  3m 17s | Max:  3m 21s
      🟩 MSVC14.36          Pass: 100%/1   | Total:  6m 58s | Avg:  6m 58s | Max:  6m 58s | Hits:  87%/119   
      🟩 MSVC14.39          Pass: 100%/1   | Total:  7m 56s | Avg:  7m 56s | Max:  7m 56s | Hits:  87%/119   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 10m 39s | Avg:  5m 19s | Max:  5m 22s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/30  | Total:  2h 14m | Avg:  4m 29s | Max: 17m 38s
      🟩 GCC                Pass: 100%/20  | Total:  1h 50m | Avg:  5m 32s | Max: 18m 25s
      🟩 MSVC               Pass: 100%/2   | Total: 14m 54s | Avg:  7m 27s | Max:  7m 56s | Hits:  87%/238   
      🟩 NVHPC              Pass: 100%/2   | Total: 10m 39s | Avg:  5m 19s | Max:  5m 22s
    🟩 gpu
      🟩 v100               Pass: 100%/54  | Total:  4h 31m | Avg:  5m 01s | Max: 18m 25s | Hits:  87%/238   
    🟩 jobs
      🟩 Build              Pass: 100%/49  | Total:  3h 05m | Avg:  3m 47s | Max:  7m 56s | Hits:  87%/238   
      🟩 Test               Pass: 100%/5   | Total:  1h 25m | Avg: 17m 03s | Max: 18m 25s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  2m 58s | Avg:  2m 58s | Max:  2m 58s
      🟩 90a                Pass: 100%/1   | Total:  3m 18s | Avg:  3m 18s | Max:  3m 18s
    🟩 std
      🟩 17                 Pass: 100%/29  | Total:  2h 11m | Avg:  4m 32s | Max: 17m 02s
      🟩 20                 Pass: 100%/25  | Total:  2h 19m | Avg:  5m 35s | Max: 18m 25s | Hits:  87%/238   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 08s | Avg: 5m 04s | Max: 8m 01s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 08s | Avg:  5m 04s | Max:  8m 01s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 08s | Avg:  5m 04s | Max:  8m 01s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 08s | Avg:  5m 04s | Max:  8m 01s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 08s | Avg:  5m 04s | Max:  8m 01s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 08s | Avg:  5m 04s | Max:  8m 01s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 08s | Avg:  5m 04s | Max:  8m 01s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 08s | Avg:  5m 04s | Max:  8m 01s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 07s | Avg:  2m 07s | Max:  2m 07s
      🟩 Test               Pass: 100%/1   | Total:  8m 01s | Avg:  8m 01s | Max:  8m 01s
    
  • 🟩 python: Pass: 100%/1 | Total: 25m 59s | Avg: 25m 59s | Max: 25m 59s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 25m 59s | Avg: 25m 59s | Max: 25m 59s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 25m 59s | Avg: 25m 59s | Max: 25m 59s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 25m 59s | Avg: 25m 59s | Max: 25m 59s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 25m 59s | Avg: 25m 59s | Max: 25m 59s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 25m 59s | Avg: 25m 59s | Max: 25m 59s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 25m 59s | Avg: 25m 59s | Max: 25m 59s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 25m 59s | Avg: 25m 59s | Max: 25m 59s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 25m 59s | Avg: 25m 59s | Max: 25m 59s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 394)

# Runner
326 linux-amd64-cpu16
28 linux-arm64-cpu16
25 linux-amd64-gpu-v100-latest-1
15 windows-amd64-cpu16

@@ -18,6 +18,10 @@ Function wrapper
- Creates a forwarding call wrapper that proclaims return type
- libcu++ 1.9.0 / CCCL 2.0.0 / CUDA 11.8

* - ``cuda::proclaim_copyable_arguments``
- Creates a forwarding call wrapper that proclaims that arguments can be freely copied before invocation the wrapped callable
Collaborator

Suggested change
- Creates a forwarding call wrapper that proclaims that arguments can be freely copied before invocation the wrapped callable
- Creates a forwarding call wrapper that proclaims that arguments can be freely copied before invocation of the wrapped callable
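The idea behind such a wrapper can be sketched in plain C++. This is only an illustration, not the libcu++ implementation: the namespace `sketch`, the wrapper type, `mark_copyable_arguments` and the detection trait are hypothetical names. Only the marker member `allows_copied_arguments` is taken from the snippet under review.

```cpp
#include <type_traits>
#include <utility>

namespace sketch { // hypothetical names, not the libcu++ API

// Forwards every call to the wrapped callable and carries a compile-time
// marker saying that arguments may be copies of the original elements.
template <typename F>
struct copyable_arguments_wrapper {
  F f;
  static constexpr bool allows_copied_arguments = true;

  template <typename... Args>
  constexpr auto operator()(Args&&... args) const
      -> decltype(f(std::forward<Args>(args)...)) {
    return f(std::forward<Args>(args)...);
  }
};

// Wraps a callable, opting it in to receiving copied arguments.
template <typename F>
constexpr copyable_arguments_wrapper<std::decay_t<F>> mark_copyable_arguments(F&& f) {
  return {std::forward<F>(f)};
}

// Detection trait: false unless the callable type carries the marker.
template <typename F, typename = void>
struct allows_copied_arguments : std::false_type {};

template <typename F>
struct allows_copied_arguments<F, std::enable_if_t<F::allows_copied_arguments>>
    : std::true_type {};

} // namespace sketch
```

An algorithm could branch on `sketch::allows_copied_arguments<Fn>::value` to pick a code path that loads elements into registers or shared memory before invoking the callable, and fall back to passing references to the original elements otherwise.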

@bernhardmgruber
Contributor Author

Added benchmark on H200 (PR description). Looking good except our fibonacci transformation:

For the U32 2^20 fibonacci benchmark, we are launching 2048 blocks before and 683 blocks after. Achieved occupancy decreased by 36% (although theoretical occupancy did not). H200 has 144 SMs, which results in 113.7 (before) and 37.9 (after) warps per SM. The latter is smaller than the maximum number of resident warps per SM (64).
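The warps-per-SM figures can be reproduced with a few lines of host code. This is a minimal sketch: the helper name is hypothetical, and the block size of 256 threads (8 warps) is an assumption inferred from the "2 blocks / 16 warps" ratio used in the experiments that follow, not something stated for this benchmark.

```cpp
// Hypothetical helper: average number of warps each SM receives when
// `blocks` thread blocks are spread evenly over `sm_count` SMs.
constexpr double warps_per_sm(int blocks, int threads_per_block, int sm_count) {
  constexpr int warp_size = 32;
  return static_cast<double>(blocks) * (threads_per_block / warp_size) / sm_count;
}

// Assumed block size: 256 threads, i.e. 8 warps per block.
constexpr double before = warps_per_sm(2048, 256, 144); // about 113.8
constexpr double after  = warps_per_sm(683, 256, 144);  // about 37.9
```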

Playing around a bit, I now assume the slowdown could be caused by the work division, especially items per thread and distribution of blocks to SMs.

I made a few attempts at tuning how we compute items per thread. Diff of proposed implementation with new items per thread logic on H200 with 132 SMs:

* at least  2 blocks /  16 warps per SM
|   U32   |    2^16    |  11.732 us |       7.70% |  11.660 us |       7.61% | -0.073 us |  -0.62% |   PASS   |
|   U32   |    2^20    |  18.633 us |       2.29% |  18.408 us |       2.25% | -0.224 us |  -1.20% |   PASS   |
|   U32   |    2^24    |  90.927 us |       0.47% |  90.719 us |       0.45% | -0.208 us |  -0.23% |   PASS   |
|   U32   |    2^28    |   1.258 ms |       7.80% |   1.258 ms |       7.85% | -0.314 us |  -0.02% |   PASS   |
* at least  4 blocks /  32 warps per SM
|   U32   |    2^16    |  11.732 us |       7.70% |  12.027 us |       7.77% |  0.295 us |   2.51% |   PASS   |
|   U32   |    2^20    |  18.633 us |       2.29% |  18.579 us |       2.20% | -0.054 us |  -0.29% |   PASS   |
|   U32   |    2^24    |  90.927 us |       0.47% |  90.959 us |       0.44% |  0.032 us |   0.03% |   PASS   |
|   U32   |    2^28    |   1.258 ms |       7.80% |   1.258 ms |       7.88% | -0.215 us |  -0.02% |   PASS   |
* at least  8 blocks /  64 warps per SM
|   U32   |    2^16    |  11.732 us |       7.70% |  11.627 us |       6.16% | -0.105 us |  -0.90% |   PASS   |
|   U32   |    2^20    |  18.633 us |       2.29% |  17.322 us |       2.23% | -1.311 us |  -7.03% |   FAIL   |
|   U32   |    2^24    |  90.927 us |       0.47% |  90.750 us |       0.41% | -0.177 us |  -0.20% |   PASS   |
|   U32   |    2^28    |   1.258 ms |       7.80% |   1.258 ms |       8.89% |  0.600 us |   0.05% |   PASS   |
* at least 16 blocks / 128 warps per SM:
|   U32   |    2^16    |  11.732 us |       7.70% |  11.620 us |       5.30% | -0.112 us |  -0.96% |   PASS   |
|   U32   |    2^20    |  18.633 us |       2.29% |  17.461 us |       3.22% | -1.172 us |  -6.29% |   FAIL   |
|   U32   |    2^24    |  90.927 us |       0.47% |  91.106 us |       0.46% |  0.179 us |   0.20% |   PASS   |
|   U32   |    2^28    |   1.258 ms |       7.80% |   1.256 ms |       4.94% | -2.113 us |  -0.17% |   PASS   |
Almost no impact in the U64 runs

* Setting minimum elements per thread to 2 via tuning policy:
|   U32   |    2^16    |  11.732 us |       7.70% |  11.769 us |       6.23% |  0.037 us |   0.31% |   PASS   |
|   U32   |    2^20    |  18.633 us |       2.29% |  18.305 us |       2.29% | -0.328 us |  -1.76% |   PASS   |
|   U32   |    2^24    |  90.927 us |       0.47% |  90.953 us |       0.45% |  0.025 us |   0.03% |   PASS   |
|   U32   |    2^28    |   1.258 ms |       7.80% |   1.253 ms |       0.13% | -4.840 us |  -0.38% |   FAIL   |
|   U64   |    2^16    |  11.795 us |       4.14% |  12.554 us |       3.76% |  0.759 us |   6.43% |   FAIL   |
|   U64   |    2^20    |  18.442 us |       2.36% |  18.380 us |       2.22% | -0.062 us |  -0.34% |   PASS   |
|   U64   |    2^24    | 102.268 us |       0.55% | 102.100 us |       0.48% | -0.168 us |  -0.16% |   PASS   |
|   U64   |    2^28    |   1.445 ms |       0.05% |   1.450 ms |       6.77% |  5.086 us |   0.35% |   FAIL   |

* Hardcoding elements per thread to 2:
|   U32   |    2^16    |  12.215 us |       9.17% |  11.009 us |      10.98% |  -1.206 us |  -9.87% |   FAIL   |
|   U32   |    2^20    |  16.459 us |       3.12% |  14.886 us |       2.83% |  -1.574 us |  -9.56% |   FAIL   |
|   U32   |    2^24    |  88.903 us |       0.56% |  92.387 us |       0.58% |   3.484 us |   3.92% |   FAIL   |
|   U32   |    2^28    |   1.240 ms |       0.05% |   1.333 ms |       4.95% |  92.328 us |   7.44% |   FAIL   |
|   U64   |    2^16    |  12.660 us |       4.74% |  10.177 us |       3.65% |  -2.483 us | -19.62% |   FAIL   |
|   U64   |    2^20    |  17.940 us |       3.40% |  15.717 us |       2.39% |  -2.223 us | -12.39% |   FAIL   |
|   U64   |    2^24    | 105.743 us |       0.46% | 101.769 us |       0.44% |  -3.974 us |  -3.76% |   FAIL   |
|   U64   |    2^28    |   1.512 ms |       3.87% |   1.481 ms |       0.04% | -31.696 us |  -2.10% |   FAIL   |

The last configuration is also what the previous implementation (based on `cub::DeviceFor`) did, which works well for small problem sizes but is worse for larger ones.

@bernhardmgruber
Contributor Author

Alright, the fix that best fits the existing work division model is:

     const int items_per_thread_evenly_spread =
-      static_cast<int>(::cuda::std::min(Offset{items_per_thread}, num_items / (config->sm_count * block_dim)));
+      static_cast<int>(::cuda::std::min(Offset{items_per_thread}, num_items / (config->sm_count * block_dim * config->max_occupancy)));

which maximizes occupancy for small problem sizes. That is, if the items per thread computed to sustain peak bytes in flight would produce too few blocks to fill, or evenly fill, all SMs, we instead process less data per block and generate more blocks to max out the device.
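As a rough illustration of the change above, here is a host-only sketch of the two expressions from the diff. The `launch_config` struct, the function names, and the sample values (132 SMs, occupancy of 8 blocks per SM, 256 threads per block, 16 items per thread) are assumptions made for the example and are not taken from the CUB sources; any clamping applied after this step is not shown in the diff and is left out here.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the launch configuration referenced in the diff.
struct launch_config
{
  int sm_count;      // number of SMs on the device
  int max_occupancy; // maximum resident blocks per SM for this kernel
};

// Expression before the change: spread the items over one block per SM.
inline int items_per_thread_old(std::int64_t num_items, int items_per_thread, int block_dim, launch_config cfg)
{
  assert(block_dim > 0 && cfg.sm_count > 0);
  const std::int64_t evenly_spread = num_items / (static_cast<std::int64_t>(cfg.sm_count) * block_dim);
  return static_cast<int>(std::min<std::int64_t>(items_per_thread, evenly_spread));
}

// Expression after the change: spread the items over all resident block slots.
inline int items_per_thread_new(std::int64_t num_items, int items_per_thread, int block_dim, launch_config cfg)
{
  assert(block_dim > 0 && cfg.sm_count > 0 && cfg.max_occupancy > 0);
  const std::int64_t evenly_spread =
    num_items / (static_cast<std::int64_t>(cfg.sm_count) * block_dim * cfg.max_occupancy);
  return static_cast<int>(std::min<std::int64_t>(items_per_thread, evenly_spread));
}

// Number of blocks needed when every thread handles items_per_thread items.
inline std::int64_t num_blocks(std::int64_t num_items, int items_per_thread, int block_dim)
{
  assert(items_per_thread > 0 && block_dim > 0);
  const std::int64_t tile_size = static_cast<std::int64_t>(items_per_thread) * block_dim;
  return (num_items + tile_size - 1) / tile_size;
}
```

With these assumed values and 2^20 items, the old expression keeps 16 items per thread (256 blocks for 132 × 8 = 1056 resident block slots), while the new one drops to 3 items per thread (1366 blocks). For 2^28 items both expressions return the full 16 items per thread, so large problems are unaffected. For very small inputs the division yields 0, so a lower bound is needed somewhere after this step.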

ncu also indicates something like that in the PM Sampling:

Before:
image

After:
image

We can see that before we launched a lot more blocks, but the SMs were maxed out.

@miscco
Collaborator

miscco commented Nov 5, 2024

Should this use cuda::ceil_div for that calculation rather than /?

@bernhardmgruber
Contributor Author

Should this use cuda::ceil_div for that calculation rather than /?

I was actually leaning more towards round(). The division computes the number of items per thread that produces enough blocks to max out the device. If that number is e.g. 3.4 items, running with 4 (ceil_div) would produce too few blocks, while running with 3 (integer division) would oversubscribe the device.
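To make the three rounding options concrete, here is a small host-only sketch. The helper names are hypothetical stand-ins written for this example (the PR itself discusses `/`, `cuda::ceil_div`, and `round()`), and they assume non-negative numerators and positive denominators.

```cpp
#include <cassert>
#include <cstdint>

// Truncating integer division: rounds down, so more (smaller) blocks are launched.
inline std::int64_t floor_div(std::int64_t n, std::int64_t d)
{
  assert(n >= 0 && d > 0);
  return n / d;
}

// Ceiling division: rounds up, so fewer (larger) blocks are launched.
inline std::int64_t ceil_div_sketch(std::int64_t n, std::int64_t d)
{
  assert(n >= 0 && d > 0);
  return (n + d - 1) / d;
}

// Round-to-nearest division: picks whichever neighbor is closer.
inline std::int64_t round_div(std::int64_t n, std::int64_t d)
{
  assert(n >= 0 && d > 0);
  return (n + d / 2) / d;
}
```

For a ratio of 3.4 (for instance 34 / 10), `floor_div` and `round_div` give 3 while `ceil_div_sketch` gives 4. For a ratio of 3.6 (36 / 10), `floor_div` still gives 3 while the other two give 4.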

I am going to benchmark the new version soon, so I can also include ceil_div. It's a heuristic after all, so I take what works best :)

@bernhardmgruber
Contributor Author

I am going to benchmark the new version soon, so I can also include ceil_div. It's a heuristic after all, so I take what works best :)

Well, it's rather inconclusive. Normal division is about 1% faster on H100, and about 1% slower on H200. I will leave the code as is then.

@bernhardmgruber
Contributor Author

I think I am happy with the result. I only violate the "no regressions of more than 2% compared to previous implementation on 2^24+ problem sizes" rule for the fibonacci benchmark that @gevtushenko gave me a while back. We need 2.5%:

## [0] NVIDIA H100 NVL
|   U32   |    2^24    |  96.316 us |       0.49% |  98.619 us |       0.30% |  2.303 us |   2.39% |   FAIL   |
## [0] NVIDIA H200
|   U32   |    2^24    |  88.994 us |       0.57% |  90.992 us |       0.59% |   1.998 us |   2.25% |   FAIL   |
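The regression rule quoted above can be checked by hand from the two rows: the relative change is the difference between the compared and reference times divided by the reference time. A minimal sketch of that arithmetic follows; the function names are made up for this example and this is not how the benchmark tooling computes its Status column.

```cpp
#include <cassert>

// Relative change of the compared time against the reference time, in percent.
// Positive values mean the compared run is slower.
inline double percent_diff(double ref_time, double cmp_time)
{
  assert(ref_time > 0.0);
  return (cmp_time - ref_time) / ref_time * 100.0;
}

// True if the slowdown stays within the allowed regression budget (in percent).
inline bool within_budget(double ref_time, double cmp_time, double max_regression_percent)
{
  return percent_diff(ref_time, cmp_time) <= max_regression_percent;
}
```

Using the times from the rows above (in microseconds), both the H100 NVL pair (96.316, 98.619) and the H200 pair (88.994, 90.992) exceed a 2% budget but fit within 2.5%.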

Contributor

github-actions bot commented Nov 5, 2024

🟩 CI finished in 2h 22m: Pass: 100%/394 | Total: 8d 06h | Avg: 30m 14s | Max: 1h 26m | Hits: 61%/25850
  • 🟩 libcudacxx: Pass: 100%/118 | Total: 1d 02h | Avg: 13m 29s | Max: 49m 23s | Hits: 72%/9484

    🟩 cpu
      🟩 amd64              Pass: 100%/110 | Total:  1d 01h | Avg: 13m 49s | Max: 49m 23s | Hits:  72%/9484  
      🟩 arm64              Pass: 100%/8   | Total:  1h 11m | Avg:  8m 55s | Max: 22m 28s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  2h 54m | Avg: 11m 39s | Max: 24m 40s | Hits:  98%/2177  
      🟩 11.8               Pass: 100%/3   | Total:  1h 12m | Avg: 24m 08s | Max: 30m 42s
      🟩 12.5               Pass: 100%/4   | Total:  1h 41m | Avg: 25m 29s | Max: 39m 37s
      🟩 12.6               Pass: 100%/96  | Total: 20h 42m | Avg: 12m 56s | Max: 49m 23s | Hits:  65%/7307  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/12  | Total:  2h 30m | Avg: 12m 31s | Max: 22m 37s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  2h 54m | Avg: 11m 39s | Max: 24m 40s | Hits:  98%/2177  
      🟩 nvcc11.8           Pass: 100%/3   | Total:  1h 12m | Avg: 24m 08s | Max: 30m 42s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  1h 41m | Avg: 25m 29s | Max: 39m 37s
      🟩 nvcc12.6           Pass: 100%/84  | Total: 18h 12m | Avg: 13m 00s | Max: 49m 23s | Hits:  65%/7307  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/12  | Total:  2h 30m | Avg: 12m 31s | Max: 22m 37s
      🟩 nvcc               Pass: 100%/106 | Total:  1d 00h | Avg: 13m 36s | Max: 49m 23s | Hits:  72%/9484  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 49m 33s | Avg:  8m 15s | Max: 13m 59s
      🟩 Clang10            Pass: 100%/3   | Total: 48m 49s | Avg: 16m 16s | Max: 32m 14s
      🟩 Clang11            Pass: 100%/4   | Total: 58m 45s | Avg: 14m 41s | Max: 27m 47s
      🟩 Clang12            Pass: 100%/4   | Total: 29m 22s | Avg:  7m 20s | Max: 16m 32s
      🟩 Clang13            Pass: 100%/4   | Total:  1h 00m | Avg: 15m 04s | Max: 29m 14s
      🟩 Clang14            Pass: 100%/4   | Total:  1h 06m | Avg: 16m 40s | Max: 26m 32s
      🟩 Clang15            Pass: 100%/4   | Total:  1h 18m | Avg: 19m 37s | Max: 26m 51s
      🟩 Clang16            Pass: 100%/4   | Total: 41m 45s | Avg: 10m 26s | Max: 15m 27s
      🟩 Clang17            Pass: 100%/4   | Total: 44m 15s | Avg: 11m 03s | Max: 16m 18s
      🟩 Clang18            Pass: 100%/18  | Total:  4h 08m | Avg: 13m 49s | Max: 49m 23s
      🟩 GCC6               Pass: 100%/2   | Total: 39m 09s | Avg: 19m 34s | Max: 20m 13s
      🟩 GCC7               Pass: 100%/6   | Total: 40m 25s | Avg:  6m 44s | Max: 24m 44s
      🟩 GCC8               Pass: 100%/6   | Total: 48m 41s | Avg:  8m 06s | Max: 24m 40s
      🟩 GCC9               Pass: 100%/6   | Total:  1h 07m | Avg: 11m 10s | Max: 23m 37s
      🟩 GCC10              Pass: 100%/4   | Total: 54m 59s | Avg: 13m 44s | Max: 27m 21s
      🟩 GCC11              Pass: 100%/7   | Total:  2h 18m | Avg: 19m 51s | Max: 30m 42s
      🟩 GCC12              Pass: 100%/4   | Total: 55m 00s | Avg: 13m 45s | Max: 30m 48s
      🟩 GCC13              Pass: 100%/17  | Total:  3h 15m | Avg: 11m 30s | Max: 29m 26s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 29m 31s | Avg:  9m 50s | Max: 18m 23s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 17m 40s | Avg: 17m 40s | Max: 17m 40s | Hits:  98%/2177  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 47m 03s | Avg: 23m 31s | Max: 35m 21s | Hits:  63%/4717  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 29m 07s | Avg: 29m 07s | Max: 29m 07s | Hits:  67%/2590  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  1h 41m | Avg: 25m 29s | Max: 39m 37s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total: 12h 06m | Avg: 13m 12s | Max: 49m 23s
      🟩 GCC                Pass: 100%/52  | Total: 10h 39m | Avg: 12m 18s | Max: 30m 48s
      🟩 Intel              Pass: 100%/3   | Total: 29m 31s | Avg:  9m 50s | Max: 18m 23s
      🟩 MSVC               Pass: 100%/4   | Total:  1h 33m | Avg: 23m 27s | Max: 35m 21s | Hits:  72%/9484  
      🟩 NVHPC              Pass: 100%/4   | Total:  1h 41m | Avg: 25m 29s | Max: 39m 37s
    🟩 gpu
      🟩 v100               Pass: 100%/118 | Total:  1d 02h | Avg: 13m 29s | Max: 49m 23s | Hits:  72%/9484  
    🟩 jobs
      🟩 Build              Pass: 100%/110 | Total: 23h 27m | Avg: 12m 47s | Max: 39m 37s | Hits:  72%/9484  
      🟩 NVRTC              Pass: 100%/4   | Total:  1h 26m | Avg: 21m 35s | Max: 23m 08s
      🟩 Test               Pass: 100%/3   | Total:  1h 35m | Avg: 31m 57s | Max: 49m 23s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  2m 02s | Avg:  2m 02s | Max:  2m 02s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  1h 12m | Avg: 24m 08s | Max: 30m 42s
      🟩 90                 Pass: 100%/4   | Total: 40m 38s | Avg: 10m 09s | Max: 12m 10s
      🟩 90a                Pass: 100%/8   | Total: 56m 38s | Avg:  7m 04s | Max: 11m 39s
    🟩 std
      🟩 11                 Pass: 100%/32  | Total:  5h 21m | Avg: 10m 02s | Max: 29m 26s
      🟩 14                 Pass: 100%/32  | Total:  5h 57m | Avg: 11m 11s | Max: 31m 45s | Hits:  98%/4457  
      🟩 17                 Pass: 100%/30  | Total:  7h 55m | Avg: 15m 51s | Max: 35m 21s | Hits:  30%/2437  
      🟩 20                 Pass: 100%/23  | Total:  7h 15m | Avg: 18m 56s | Max: 49m 23s | Hits:  67%/2590  
    
  • 🟩 cub: Pass: 100%/110 | Total: 3d 21h | Avg: 51m 02s | Max: 1h 14m | Hits: 64%/2948

    🟩 cpu
      🟩 amd64              Pass: 100%/102 | Total:  3d 14h | Avg: 50m 41s | Max:  1h 14m | Hits:  64%/2948  
      🟩 arm64              Pass: 100%/8   | Total:  7h 24m | Avg: 55m 33s | Max:  1h 01m
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total: 11h 42m | Avg: 46m 48s | Max: 50m 21s | Hits:  64%/737   
      🟩 11.8               Pass: 100%/3   | Total:  3h 30m | Avg:  1h 10m | Max:  1h 14m
      🟩 12.5               Pass: 100%/4   | Total:  4h 06m | Avg:  1h 01m | Max:  1h 08m
      🟩 12.6               Pass: 100%/88  | Total:  3d 02h | Avg: 50m 37s | Max:  1h 03m | Hits:  64%/2211  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  3h 58m | Avg: 59m 44s | Max:  1h 01m
      🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 42m | Avg: 46m 48s | Max: 50m 21s | Hits:  64%/737   
      🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 30m | Avg:  1h 10m | Max:  1h 14m
      🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 06m | Avg:  1h 01m | Max:  1h 08m
      🟩 nvcc12.6           Pass: 100%/84  | Total:  2d 22h | Avg: 50m 11s | Max:  1h 03m | Hits:  64%/2211  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  3h 58m | Avg: 59m 44s | Max:  1h 01m
      🟩 nvcc               Pass: 100%/106 | Total:  3d 17h | Avg: 50m 42s | Max:  1h 14m | Hits:  64%/2948  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  4h 47m | Avg: 47m 53s | Max: 53m 10s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 43m | Avg: 54m 38s | Max: 56m 55s
      🟩 Clang11            Pass: 100%/4   | Total:  3h 32m | Avg: 53m 03s | Max: 57m 33s
      🟩 Clang12            Pass: 100%/4   | Total:  3h 34m | Avg: 53m 39s | Max: 55m 55s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 25m | Avg: 51m 22s | Max: 53m 41s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 32m | Avg: 53m 13s | Max: 55m 08s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 42m | Avg: 55m 38s | Max: 57m 47s
      🟩 Clang16            Pass: 100%/4   | Total:  3h 29m | Avg: 52m 23s | Max: 54m 42s
      🟩 Clang17            Pass: 100%/4   | Total:  3h 31m | Avg: 52m 52s | Max: 57m 08s
      🟩 Clang18            Pass: 100%/11  | Total:  9h 43m | Avg: 53m 05s | Max:  1h 01m
      🟩 GCC6               Pass: 100%/2   | Total:  1h 37m | Avg: 48m 57s | Max: 49m 15s
      🟩 GCC7               Pass: 100%/6   | Total:  4h 51m | Avg: 48m 37s | Max: 56m 06s
      🟩 GCC8               Pass: 100%/6   | Total:  4h 59m | Avg: 49m 59s | Max: 53m 25s
      🟩 GCC9               Pass: 100%/6   | Total:  5h 06m | Avg: 51m 05s | Max: 53m 46s
      🟩 GCC10              Pass: 100%/4   | Total:  3h 46m | Avg: 56m 40s | Max: 58m 10s
      🟩 GCC11              Pass: 100%/7   | Total:  7h 08m | Avg:  1h 01m | Max:  1h 14m
      🟩 GCC12              Pass: 100%/4   | Total:  3h 31m | Avg: 52m 47s | Max: 55m 33s
      🟩 GCC13              Pass: 100%/16  | Total:  9h 34m | Avg: 35m 54s | Max: 57m 17s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 59m | Avg: 59m 46s | Max:  1h 02m
      🟩 MSVC14.16          Pass: 100%/1   | Total: 49m 56s | Avg: 49m 56s | Max: 49m 56s | Hits:  64%/737   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  1h 53m | Avg: 56m 48s | Max: 58m 11s | Hits:  64%/1474  
      🟩 MSVC14.39          Pass: 100%/1   | Total:  1h 03m | Avg:  1h 03m | Max:  1h 03m | Hits:  64%/737   
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 06m | Avg:  1h 01m | Max:  1h 08m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 18h | Avg: 52m 35s | Max:  1h 01m
      🟩 GCC                Pass: 100%/51  | Total:  1d 16h | Avg: 47m 47s | Max:  1h 14m
      🟩 Intel              Pass: 100%/3   | Total:  2h 59m | Avg: 59m 46s | Max:  1h 02m
      🟩 MSVC               Pass: 100%/4   | Total:  3h 46m | Avg: 56m 44s | Max:  1h 03m | Hits:  64%/2948  
      🟩 NVHPC              Pass: 100%/4   | Total:  4h 06m | Avg:  1h 01m | Max:  1h 08m
    🟩 gpu
      🟩 v100               Pass: 100%/110 | Total:  3d 21h | Avg: 51m 02s | Max:  1h 14m | Hits:  64%/2948  
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  3d 17h | Avg: 52m 56s | Max:  1h 14m | Hits:  64%/2948  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 22m 58s | Avg: 22m 58s | Max: 22m 58s
      🟩 GraphCapture       Pass: 100%/1   | Total: 21m 35s | Avg: 21m 35s | Max: 21m 35s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 33m | Avg: 31m 15s | Max: 36m 13s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 16m | Avg: 25m 37s | Max: 30m 09s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 30m | Avg:  1h 10m | Max:  1h 14m
      🟩 90a                Pass: 100%/4   | Total:  1h 35m | Avg: 23m 57s | Max: 24m 43s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  1d 01h | Avg: 50m 51s | Max:  1h 08m
      🟩 14                 Pass: 100%/29  | Total:  1d 01h | Avg: 52m 40s | Max:  1h 07m | Hits:  64%/1474  
      🟩 17                 Pass: 100%/27  | Total: 23h 55m | Avg: 53m 09s | Max:  1h 14m | Hits:  64%/737   
      🟩 20                 Pass: 100%/24  | Total: 18h 46m | Avg: 46m 55s | Max:  1h 08m | Hits:  64%/737   
    
  • 🟩 thrust: Pass: 100%/109 | Total: 3d 01h | Avg: 40m 24s | Max: 1h 26m | Hits: 51%/13180

    🟩 cpu
      🟩 amd64              Pass: 100%/101 | Total:  2d 20h | Avg: 40m 36s | Max:  1h 26m | Hits:  51%/13180 
      🟩 arm64              Pass: 100%/8   | Total:  5h 03m | Avg: 37m 56s | Max: 43m 58s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  9h 25m | Avg: 37m 40s | Max:  1h 15m | Hits:  39%/2636  
      🟩 11.8               Pass: 100%/3   | Total:  2h 18m | Avg: 46m 08s | Max: 52m 53s
      🟩 12.5               Pass: 100%/4   | Total:  5h 02m | Avg:  1h 15m | Max:  1h 26m
      🟩 12.6               Pass: 100%/87  | Total:  2d 08h | Avg: 39m 04s | Max:  1h 19m | Hits:  54%/10544 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  2h 09m | Avg: 32m 24s | Max: 37m 05s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  9h 25m | Avg: 37m 40s | Max:  1h 15m | Hits:  39%/2636  
      🟩 nvcc11.8           Pass: 100%/3   | Total:  2h 18m | Avg: 46m 08s | Max: 52m 53s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  5h 02m | Avg:  1h 15m | Max:  1h 26m
      🟩 nvcc12.6           Pass: 100%/83  | Total:  2d 06h | Avg: 39m 23s | Max:  1h 19m | Hits:  54%/10544 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  2h 09m | Avg: 32m 24s | Max: 37m 05s
      🟩 nvcc               Pass: 100%/105 | Total:  2d 23h | Avg: 40m 43s | Max:  1h 26m | Hits:  51%/13180 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  3h 39m | Avg: 36m 30s | Max: 40m 20s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 02m | Avg: 40m 57s | Max: 43m 52s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 38m | Avg: 39m 35s | Max: 43m 05s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 36m | Avg: 39m 02s | Max: 43m 53s
      🟩 Clang13            Pass: 100%/4   | Total:  2h 39m | Avg: 39m 55s | Max: 44m 34s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 40m | Avg: 40m 07s | Max: 45m 51s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 49m | Avg: 42m 16s | Max: 46m 51s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 48m | Avg: 42m 11s | Max: 48m 31s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 48m | Avg: 42m 07s | Max: 47m 34s
      🟩 Clang18            Pass: 100%/11  | Total:  5h 50m | Avg: 31m 52s | Max: 39m 53s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 07m | Avg: 33m 38s | Max: 38m 53s
      🟩 GCC7               Pass: 100%/6   | Total:  3h 43m | Avg: 37m 14s | Max: 44m 36s
      🟩 GCC8               Pass: 100%/6   | Total:  3h 45m | Avg: 37m 31s | Max: 42m 59s
      🟩 GCC9               Pass: 100%/6   | Total:  3h 59m | Avg: 39m 53s | Max: 49m 15s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 45m | Avg: 41m 17s | Max: 47m 46s
      🟩 GCC11              Pass: 100%/7   | Total:  5h 10m | Avg: 44m 19s | Max: 52m 53s
      🟩 GCC12              Pass: 100%/4   | Total:  2h 48m | Avg: 42m 05s | Max: 45m 30s
      🟩 GCC13              Pass: 100%/14  | Total:  6h 39m | Avg: 28m 31s | Max: 43m 58s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 34m | Avg: 51m 22s | Max: 59m 24s
      🟩 MSVC14.16          Pass: 100%/1   | Total:  1h 15m | Avg:  1h 15m | Max:  1h 15m | Hits:  39%/2636  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 19m | Avg:  1h 09m | Max:  1h 13m | Hits:  40%/5272  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 41m | Avg: 50m 48s | Max:  1h 19m | Hits:  69%/5272  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  5h 02m | Avg:  1h 15m | Max:  1h 26m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 06h | Avg: 38m 11s | Max: 48m 31s
      🟩 GCC                Pass: 100%/49  | Total:  1d 05h | Avg: 36m 41s | Max: 52m 53s
      🟩 Intel              Pass: 100%/3   | Total:  2h 34m | Avg: 51m 22s | Max: 59m 24s
      🟩 MSVC               Pass: 100%/5   | Total:  5h 16m | Avg:  1h 03m | Max:  1h 19m | Hits:  51%/13180 
      🟩 NVHPC              Pass: 100%/4   | Total:  5h 02m | Avg:  1h 15m | Max:  1h 26m
    🟩 gpu
      🟩 v100               Pass: 100%/109 | Total:  3d 01h | Avg: 40m 24s | Max:  1h 26m | Hits:  51%/13180 
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  2d 23h | Avg: 42m 02s | Max:  1h 26m | Hits:  39%/10544 
      🟩 TestCPU            Pass: 100%/4   | Total: 46m 26s | Avg: 11m 36s | Max: 22m 00s | Hits:  99%/2636  
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 09m | Avg: 23m 13s | Max: 29m 07s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  2h 18m | Avg: 46m 08s | Max: 52m 53s
      🟩 90a                Pass: 100%/4   | Total:  1h 48m | Avg: 27m 13s | Max: 31m 54s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total: 16h 20m | Avg: 32m 40s | Max:  1h 04m
      🟩 14                 Pass: 100%/29  | Total: 21h 03m | Avg: 43m 33s | Max:  1h 15m | Hits:  40%/5272  
      🟩 17                 Pass: 100%/27  | Total: 20h 31m | Avg: 45m 36s | Max:  1h 26m | Hits:  39%/2636  
      🟩 20                 Pass: 100%/23  | Total: 15h 30m | Avg: 40m 27s | Max:  1h 19m | Hits:  69%/5272  
    
  • 🟩 cudax: Pass: 100%/54 | Total: 4h 41m | Avg: 5m 13s | Max: 19m 29s | Hits: 87%/238

    🟩 cpu
      🟩 amd64              Pass: 100%/50  | Total:  4h 27m | Avg:  5m 20s | Max: 19m 29s | Hits:  87%/238   
      🟩 arm64              Pass: 100%/4   | Total: 14m 15s | Avg:  3m 33s | Max:  3m 55s
    🟩 ctk
      🟩 12.0               Pass: 100%/19  | Total:  1h 42m | Avg:  5m 22s | Max: 19m 29s | Hits:  87%/119   
      🟩 12.5               Pass: 100%/2   | Total: 10m 55s | Avg:  5m 27s | Max:  5m 29s
      🟩 12.6               Pass: 100%/33  | Total:  2h 48m | Avg:  5m 06s | Max: 18m 46s | Hits:  87%/119   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/19  | Total:  1h 42m | Avg:  5m 22s | Max: 19m 29s | Hits:  87%/119   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 10m 55s | Avg:  5m 27s | Max:  5m 29s
      🟩 nvcc12.6           Pass: 100%/33  | Total:  2h 48m | Avg:  5m 06s | Max: 18m 46s | Hits:  87%/119   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/54  | Total:  4h 41m | Avg:  5m 13s | Max: 19m 29s | Hits:  87%/238   
    🟩 cxx
      🟩 Clang9             Pass: 100%/2   | Total:  7m 50s | Avg:  3m 55s | Max:  3m 58s
      🟩 Clang10            Pass: 100%/2   | Total:  7m 47s | Avg:  3m 53s | Max:  4m 13s
      🟩 Clang11            Pass: 100%/4   | Total: 14m 25s | Avg:  3m 36s | Max:  3m 55s
      🟩 Clang12            Pass: 100%/4   | Total: 13m 57s | Avg:  3m 29s | Max:  3m 33s
      🟩 Clang13            Pass: 100%/4   | Total: 13m 49s | Avg:  3m 27s | Max:  3m 38s
      🟩 Clang14            Pass: 100%/4   | Total: 28m 47s | Avg:  7m 11s | Max: 17m 45s
      🟩 Clang15            Pass: 100%/2   | Total:  7m 50s | Avg:  3m 55s | Max:  3m 59s
      🟩 Clang16            Pass: 100%/4   | Total: 14m 53s | Avg:  3m 43s | Max:  3m 55s
      🟩 Clang17            Pass: 100%/2   | Total:  8m 01s | Avg:  4m 00s | Max:  4m 07s
      🟩 Clang18            Pass: 100%/2   | Total: 20m 35s | Avg: 10m 17s | Max: 16m 30s
      🟩 GCC9               Pass: 100%/2   | Total:  7m 31s | Avg:  3m 45s | Max:  3m 55s
      🟩 GCC10              Pass: 100%/4   | Total: 14m 20s | Avg:  3m 35s | Max:  3m 47s
      🟩 GCC11              Pass: 100%/4   | Total: 14m 55s | Avg:  3m 43s | Max:  3m 59s
      🟩 GCC12              Pass: 100%/7   | Total:  1h 09m | Avg:  9m 57s | Max: 19m 29s
      🟩 GCC13              Pass: 100%/3   | Total: 10m 05s | Avg:  3m 21s | Max:  3m 29s
      🟩 MSVC14.36          Pass: 100%/1   | Total:  8m 50s | Avg:  8m 50s | Max:  8m 50s | Hits:  87%/119   
      🟩 MSVC14.39          Pass: 100%/1   | Total:  7m 29s | Avg:  7m 29s | Max:  7m 29s | Hits:  87%/119   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 10m 55s | Avg:  5m 27s | Max:  5m 29s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/30  | Total:  2h 17m | Avg:  4m 35s | Max: 17m 45s
      🟩 GCC                Pass: 100%/20  | Total:  1h 56m | Avg:  5m 49s | Max: 19m 29s
      🟩 MSVC               Pass: 100%/2   | Total: 16m 19s | Avg:  8m 09s | Max:  8m 50s | Hits:  87%/238   
      🟩 NVHPC              Pass: 100%/2   | Total: 10m 55s | Avg:  5m 27s | Max:  5m 29s
    🟩 gpu
      🟩 v100               Pass: 100%/54  | Total:  4h 41m | Avg:  5m 13s | Max: 19m 29s | Hits:  87%/238   
    🟩 jobs
      🟩 Build              Pass: 100%/49  | Total:  3h 12m | Avg:  3m 55s | Max:  8m 50s | Hits:  87%/238   
      🟩 Test               Pass: 100%/5   | Total:  1h 29m | Avg: 17m 53s | Max: 19m 29s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  3m 02s | Avg:  3m 02s | Max:  3m 02s
      🟩 90a                Pass: 100%/1   | Total:  3m 14s | Avg:  3m 14s | Max:  3m 14s
    🟩 std
      🟩 17                 Pass: 100%/29  | Total:  2h 19m | Avg:  4m 48s | Max: 19m 29s
      🟩 20                 Pass: 100%/25  | Total:  2h 22m | Avg:  5m 40s | Max: 17m 45s | Hits:  87%/238   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 35s | Avg: 4m 47s | Max: 7m 31s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 35s | Avg:  4m 47s | Max:  7m 31s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 35s | Avg:  4m 47s | Max:  7m 31s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 35s | Avg:  4m 47s | Max:  7m 31s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 35s | Avg:  4m 47s | Max:  7m 31s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 35s | Avg:  4m 47s | Max:  7m 31s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 35s | Avg:  4m 47s | Max:  7m 31s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 35s | Avg:  4m 47s | Max:  7m 31s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  2m 04s | Avg:  2m 04s | Max:  2m 04s
      🟩 Test               Pass: 100%/1   | Total:  7m 31s | Avg:  7m 31s | Max:  7m 31s
    
  • 🟩 python: Pass: 100%/1 | Total: 15m 11s | Avg: 15m 11s | Max: 15m 11s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 15m 11s | Avg: 15m 11s | Max: 15m 11s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 15m 11s | Avg: 15m 11s | Max: 15m 11s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 15m 11s | Avg: 15m 11s | Max: 15m 11s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 15m 11s | Avg: 15m 11s | Max: 15m 11s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 15m 11s | Avg: 15m 11s | Max: 15m 11s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 15m 11s | Avg: 15m 11s | Max: 15m 11s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 15m 11s | Avg: 15m 11s | Max: 15m 11s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 15m 11s | Avg: 15m 11s | Max: 15m 11s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 394)

# Runner
326 linux-amd64-cpu16
28 linux-arm64-cpu16
25 linux-amd64-gpu-v100-latest-1
15 windows-amd64-cpu16

* Introduces address stability detection and opt-in in libcu++
* Mark lambdas in Thrust BabelStream benchmark address oblivious

Fixes: NVIDIA#2263
@bernhardmgruber
Contributor Author

bernhardmgruber commented Nov 6, 2024

I only violate the "no regressions of more than 2% compared to previous implementation on 2^24+ problem sizes" rule for the fibonacci benchmark that @gevtushenko gave me a while back. We need 2.5%:

Talked offline with @gevtushenko who is fine with the results.

@bernhardmgruber bernhardmgruber marked this pull request as ready for review November 6, 2024 08:33
Contributor

github-actions bot commented Nov 6, 2024

🟩 CI finished in 1h 55m: Pass: 100%/394 | Total: 8d 00h | Avg: 29m 17s | Max: 1h 23m | Hits: 61%/25866
  • 🟩 libcudacxx: Pass: 100%/118 | Total: 18h 16m | Avg: 9m 17s | Max: 37m 47s | Hits: 82%/9500

    🟩 cpu
      🟩 amd64              Pass: 100%/110 | Total: 16h 56m | Avg:  9m 14s | Max: 37m 47s | Hits:  82%/9500  
      🟩 arm64              Pass: 100%/8   | Total:  1h 20m | Avg: 10m 02s | Max: 18m 29s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 34m | Avg:  6m 16s | Max: 24m 04s | Hits:  98%/2181  
      🟩 11.8               Pass: 100%/3   | Total: 59m 31s | Avg: 19m 50s | Max: 31m 33s
      🟩 12.5               Pass: 100%/4   | Total: 35m 35s | Avg:  8m 53s | Max:  9m 18s
      🟩 12.6               Pass: 100%/96  | Total: 15h 07m | Avg:  9m 27s | Max: 37m 47s | Hits:  78%/7319  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/12  | Total:  2h 27m | Avg: 12m 17s | Max: 21m 54s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 34m | Avg:  6m 16s | Max: 24m 04s | Hits:  98%/2181  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 59m 31s | Avg: 19m 50s | Max: 31m 33s
      🟩 nvcc12.5           Pass: 100%/4   | Total: 35m 35s | Avg:  8m 53s | Max:  9m 18s
      🟩 nvcc12.6           Pass: 100%/84  | Total: 12h 39m | Avg:  9m 02s | Max: 37m 47s | Hits:  78%/7319  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/12  | Total:  2h 27m | Avg: 12m 17s | Max: 21m 54s
      🟩 nvcc               Pass: 100%/106 | Total: 15h 49m | Avg:  8m 57s | Max: 37m 47s | Hits:  82%/9500  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  1h 13m | Avg: 12m 11s | Max: 24m 04s
      🟩 Clang10            Pass: 100%/3   | Total: 15m 44s | Avg:  5m 14s | Max:  5m 52s
      🟩 Clang11            Pass: 100%/4   | Total: 48m 34s | Avg: 12m 08s | Max: 18m 22s
      🟩 Clang12            Pass: 100%/4   | Total: 17m 53s | Avg:  4m 28s | Max:  4m 49s
      🟩 Clang13            Pass: 100%/4   | Total: 38m 26s | Avg:  9m 36s | Max: 25m 27s
      🟩 Clang14            Pass: 100%/4   | Total: 18m 11s | Avg:  4m 32s | Max:  5m 02s
      🟩 Clang15            Pass: 100%/4   | Total: 18m 13s | Avg:  4m 33s | Max:  4m 57s
      🟩 Clang16            Pass: 100%/4   | Total: 26m 15s | Avg:  6m 33s | Max: 12m 54s
      🟩 Clang17            Pass: 100%/4   | Total: 24m 26s | Avg:  6m 06s | Max: 11m 50s
      🟩 Clang18            Pass: 100%/18  | Total:  3h 30m | Avg: 11m 42s | Max: 21m 54s
      🟩 GCC6               Pass: 100%/2   | Total:  6m 39s | Avg:  3m 19s | Max:  3m 29s
      🟩 GCC7               Pass: 100%/6   | Total: 20m 11s | Avg:  3m 21s | Max:  4m 12s
      🟩 GCC8               Pass: 100%/6   | Total: 44m 15s | Avg:  7m 22s | Max: 17m 44s
      🟩 GCC9               Pass: 100%/6   | Total: 39m 49s | Avg:  6m 38s | Max: 22m 15s
      🟩 GCC10              Pass: 100%/4   | Total: 41m 30s | Avg: 10m 22s | Max: 30m 34s
      🟩 GCC11              Pass: 100%/7   | Total:  1h 15m | Avg: 10m 45s | Max: 31m 33s
      🟩 GCC12              Pass: 100%/4   | Total: 37m 54s | Avg:  9m 28s | Max: 26m 05s
      🟩 GCC13              Pass: 100%/17  | Total:  3h 32m | Avg: 12m 28s | Max: 37m 47s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 16m 57s | Avg:  5m 39s | Max:  6m 17s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 18m 32s | Avg: 18m 32s | Max: 18m 32s | Hits:  98%/2181  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 41m 43s | Avg: 20m 51s | Max: 29m 24s | Hits:  67%/4725  
      🟩 MSVC14.39          Pass: 100%/1   | Total: 14m 38s | Avg: 14m 38s | Max: 14m 38s | Hits:  97%/2594  
      🟩 NVHPC24.7          Pass: 100%/4   | Total: 35m 35s | Avg:  8m 53s | Max:  9m 18s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/55  | Total:  8h 11m | Avg:  8m 56s | Max: 25m 27s
      🟩 GCC                Pass: 100%/52  | Total:  7h 57m | Avg:  9m 11s | Max: 37m 47s
      🟩 Intel              Pass: 100%/3   | Total: 16m 57s | Avg:  5m 39s | Max:  6m 17s
      🟩 MSVC               Pass: 100%/4   | Total:  1h 14m | Avg: 18m 43s | Max: 29m 24s | Hits:  82%/9500  
      🟩 NVHPC              Pass: 100%/4   | Total: 35m 35s | Avg:  8m 53s | Max:  9m 18s
    🟩 gpu
      🟩 v100               Pass: 100%/118 | Total: 18h 16m | Avg:  9m 17s | Max: 37m 47s | Hits:  82%/9500  
    🟩 jobs
      🟩 Build              Pass: 100%/110 | Total: 15h 28m | Avg:  8m 26s | Max: 31m 33s | Hits:  82%/9500  
      🟩 NVRTC              Pass: 100%/4   | Total:  1h 55m | Avg: 28m 57s | Max: 37m 47s
      🟩 Test               Pass: 100%/3   | Total: 49m 55s | Avg: 16m 38s | Max: 17m 55s
      🟩 VerifyCodegen      Pass: 100%/1   | Total:  1m 48s | Avg:  1m 48s | Max:  1m 48s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 59m 31s | Avg: 19m 50s | Max: 31m 33s
      🟩 90                 Pass: 100%/4   | Total: 42m 16s | Avg: 10m 34s | Max: 12m 52s
      🟩 90a                Pass: 100%/8   | Total: 55m 52s | Avg:  6m 59s | Max: 11m 43s
    🟩 std
      🟩 11                 Pass: 100%/32  | Total:  3h 52m | Avg:  7m 16s | Max: 21m 32s
      🟩 14                 Pass: 100%/32  | Total:  4h 57m | Avg:  9m 17s | Max: 35m 01s | Hits:  66%/4465  
      🟩 17                 Pass: 100%/30  | Total:  5h 09m | Avg: 10m 18s | Max: 31m 33s | Hits:  97%/2441  
      🟩 20                 Pass: 100%/23  | Total:  4h 15m | Avg: 11m 06s | Max: 37m 47s | Hits:  97%/2594  
    
  • 🟩 cub: Pass: 100%/110 | Total: 3d 22h | Avg: 51m 37s | Max: 1h 14m | Hits: 45%/2948

    🟩 cpu
      🟩 amd64              Pass: 100%/102 | Total:  3d 15h | Avg: 51m 23s | Max:  1h 14m | Hits:  45%/2948  
      🟩 arm64              Pass: 100%/8   | Total:  7h 17m | Avg: 54m 39s | Max: 57m 55s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total: 11h 56m | Avg: 47m 47s | Max: 52m 56s | Hits:  41%/737   
      🟩 11.8               Pass: 100%/3   | Total:  3h 36m | Avg:  1h 12m | Max:  1h 14m
      🟩 12.5               Pass: 100%/4   | Total:  4h 24m | Avg:  1h 06m | Max:  1h 10m
      🟩 12.6               Pass: 100%/88  | Total:  3d 02h | Avg: 50m 55s | Max:  1h 02m | Hits:  46%/2211  
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  3h 51m | Avg: 57m 47s | Max:  1h 01m
      🟩 nvcc11.1           Pass: 100%/15  | Total: 11h 56m | Avg: 47m 47s | Max: 52m 56s | Hits:  41%/737   
      🟩 nvcc11.8           Pass: 100%/3   | Total:  3h 36m | Avg:  1h 12m | Max:  1h 14m
      🟩 nvcc12.5           Pass: 100%/4   | Total:  4h 24m | Avg:  1h 06m | Max:  1h 10m
      🟩 nvcc12.6           Pass: 100%/84  | Total:  2d 22h | Avg: 50m 35s | Max:  1h 02m | Hits:  46%/2211  
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  3h 51m | Avg: 57m 47s | Max:  1h 01m
      🟩 nvcc               Pass: 100%/106 | Total:  3d 18h | Avg: 51m 23s | Max:  1h 14m | Hits:  45%/2948  
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  5h 01m | Avg: 50m 12s | Max:  1h 00m
      🟩 Clang10            Pass: 100%/3   | Total:  2h 51m | Avg: 57m 09s | Max:  1h 01m
      🟩 Clang11            Pass: 100%/4   | Total:  3h 30m | Avg: 52m 44s | Max: 53m 34s
      🟩 Clang12            Pass: 100%/4   | Total:  3h 31m | Avg: 52m 59s | Max: 56m 49s
      🟩 Clang13            Pass: 100%/4   | Total:  3h 36m | Avg: 54m 10s | Max: 57m 13s
      🟩 Clang14            Pass: 100%/4   | Total:  3h 40m | Avg: 55m 11s | Max: 57m 12s
      🟩 Clang15            Pass: 100%/4   | Total:  3h 37m | Avg: 54m 23s | Max: 58m 49s
      🟩 Clang16            Pass: 100%/4   | Total:  3h 34m | Avg: 53m 44s | Max: 57m 59s
      🟩 Clang17            Pass: 100%/4   | Total:  3h 43m | Avg: 55m 46s | Max: 58m 18s
      🟩 Clang18            Pass: 100%/11  | Total:  9h 09m | Avg: 49m 55s | Max:  1h 01m
      🟩 GCC6               Pass: 100%/2   | Total:  1h 35m | Avg: 47m 50s | Max: 48m 38s
      🟩 GCC7               Pass: 100%/6   | Total:  5h 14m | Avg: 52m 29s | Max: 58m 11s
      🟩 GCC8               Pass: 100%/6   | Total:  5h 06m | Avg: 51m 04s | Max: 58m 13s
      🟩 GCC9               Pass: 100%/6   | Total:  5h 13m | Avg: 52m 16s | Max: 58m 09s
      🟩 GCC10              Pass: 100%/4   | Total:  3h 37m | Avg: 54m 16s | Max: 58m 22s
      🟩 GCC11              Pass: 100%/7   | Total:  7h 26m | Avg:  1h 03m | Max:  1h 14m
      🟩 GCC12              Pass: 100%/4   | Total:  3h 43m | Avg: 55m 49s | Max: 58m 39s
      🟩 GCC13              Pass: 100%/16  | Total:  9h 06m | Avg: 34m 10s | Max:  1h 00m
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 58m | Avg: 59m 36s | Max:  1h 02m
      🟩 MSVC14.16          Pass: 100%/1   | Total: 52m 56s | Avg: 52m 56s | Max: 52m 56s | Hits:  41%/737   
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 00m | Avg:  1h 00m | Max:  1h 00m | Hits:  41%/1474  
      🟩 MSVC14.39          Pass: 100%/1   | Total:  1h 00m | Avg:  1h 00m | Max:  1h 00m | Hits:  58%/737   
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  4h 24m | Avg:  1h 06m | Max:  1h 10m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 18h | Avg: 52m 52s | Max:  1h 01m
      🟩 GCC                Pass: 100%/51  | Total:  1d 17h | Avg: 48m 18s | Max:  1h 14m
      🟩 Intel              Pass: 100%/3   | Total:  2h 58m | Avg: 59m 36s | Max:  1h 02m
      🟩 MSVC               Pass: 100%/4   | Total:  3h 53m | Avg: 58m 28s | Max:  1h 00m | Hits:  45%/2948  
      🟩 NVHPC              Pass: 100%/4   | Total:  4h 24m | Avg:  1h 06m | Max:  1h 10m
    🟩 gpu
      🟩 v100               Pass: 100%/110 | Total:  3d 22h | Avg: 51m 37s | Max:  1h 14m | Hits:  45%/2948  
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  3d 19h | Avg: 54m 02s | Max:  1h 14m | Hits:  45%/2948  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 18m 27s | Avg: 18m 27s | Max: 18m 27s
      🟩 GraphCapture       Pass: 100%/1   | Total: 17m 15s | Avg: 17m 15s | Max: 17m 15s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 01m | Avg: 20m 31s | Max: 21m 03s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 09m | Avg: 23m 19s | Max: 28m 34s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  3h 36m | Avg:  1h 12m | Max:  1h 14m
      🟩 90a                Pass: 100%/4   | Total:  1h 33m | Avg: 23m 28s | Max: 24m 19s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  1d 01h | Avg: 50m 34s | Max:  1h 08m
      🟩 14                 Pass: 100%/29  | Total:  1d 02h | Avg: 55m 14s | Max:  1h 14m | Hits:  41%/1474  
      🟩 17                 Pass: 100%/27  | Total:  1d 00h | Avg: 54m 07s | Max:  1h 14m | Hits:  41%/737   
      🟩 20                 Pass: 100%/24  | Total: 18h 18m | Avg: 45m 45s | Max:  1h 07m | Hits:  58%/737   
    
  • 🟩 thrust: Pass: 100%/109 | Total: 3d 02h | Avg: 40m 49s | Max: 1h 23m | Hits: 48%/13180

    🟩 cpu
      🟩 amd64              Pass: 100%/101 | Total:  2d 21h | Avg: 41m 04s | Max:  1h 23m | Hits:  48%/13180 
      🟩 arm64              Pass: 100%/8   | Total:  5h 02m | Avg: 37m 45s | Max: 45m 17s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  9h 37m | Avg: 38m 30s | Max:  1h 15m | Hits:  36%/2636  
      🟩 11.8               Pass: 100%/3   | Total:  2h 24m | Avg: 48m 07s | Max: 53m 49s
      🟩 12.5               Pass: 100%/4   | Total:  5h 13m | Avg:  1h 18m | Max:  1h 22m
      🟩 12.6               Pass: 100%/87  | Total:  2d 08h | Avg: 39m 15s | Max:  1h 23m | Hits:  51%/10544 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total:  2h 15m | Avg: 33m 54s | Max: 41m 08s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  9h 37m | Avg: 38m 30s | Max:  1h 15m | Hits:  36%/2636  
      🟩 nvcc11.8           Pass: 100%/3   | Total:  2h 24m | Avg: 48m 07s | Max: 53m 49s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  5h 13m | Avg:  1h 18m | Max:  1h 22m
      🟩 nvcc12.6           Pass: 100%/83  | Total:  2d 06h | Avg: 39m 30s | Max:  1h 23m | Hits:  51%/10544 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total:  2h 15m | Avg: 33m 54s | Max: 41m 08s
      🟩 nvcc               Pass: 100%/105 | Total:  2d 23h | Avg: 41m 05s | Max:  1h 23m | Hits:  48%/13180 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total:  3h 37m | Avg: 36m 19s | Max: 44m 50s
      🟩 Clang10            Pass: 100%/3   | Total:  2h 07m | Avg: 42m 35s | Max: 47m 32s
      🟩 Clang11            Pass: 100%/4   | Total:  2h 45m | Avg: 41m 26s | Max: 44m 54s
      🟩 Clang12            Pass: 100%/4   | Total:  2h 39m | Avg: 39m 55s | Max: 45m 17s
      🟩 Clang13            Pass: 100%/4   | Total:  2h 45m | Avg: 41m 27s | Max: 47m 06s
      🟩 Clang14            Pass: 100%/4   | Total:  2h 44m | Avg: 41m 09s | Max: 45m 46s
      🟩 Clang15            Pass: 100%/4   | Total:  2h 37m | Avg: 39m 17s | Max: 41m 17s
      🟩 Clang16            Pass: 100%/4   | Total:  2h 46m | Avg: 41m 42s | Max: 47m 37s
      🟩 Clang17            Pass: 100%/4   | Total:  2h 41m | Avg: 40m 16s | Max: 43m 40s
      🟩 Clang18            Pass: 100%/11  | Total:  5h 57m | Avg: 32m 28s | Max: 44m 42s
      🟩 GCC6               Pass: 100%/2   | Total:  1h 09m | Avg: 34m 37s | Max: 36m 32s
      🟩 GCC7               Pass: 100%/6   | Total:  3h 51m | Avg: 38m 36s | Max: 46m 49s
      🟩 GCC8               Pass: 100%/6   | Total:  3h 46m | Avg: 37m 46s | Max: 43m 31s
      🟩 GCC9               Pass: 100%/6   | Total:  3h 59m | Avg: 39m 58s | Max: 45m 53s
      🟩 GCC10              Pass: 100%/4   | Total:  2h 50m | Avg: 42m 40s | Max: 47m 32s
      🟩 GCC11              Pass: 100%/7   | Total:  5h 08m | Avg: 44m 07s | Max: 53m 49s
      🟩 GCC12              Pass: 100%/4   | Total:  2h 52m | Avg: 43m 11s | Max: 50m 51s
      🟩 GCC13              Pass: 100%/14  | Total:  6h 28m | Avg: 27m 44s | Max: 45m 17s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total:  2h 40m | Avg: 53m 36s | Max:  1h 03m
      🟩 MSVC14.16          Pass: 100%/1   | Total:  1h 15m | Avg:  1h 15m | Max:  1h 15m | Hits:  36%/2636  
      🟩 MSVC14.29          Pass: 100%/2   | Total:  2h 20m | Avg:  1h 10m | Max:  1h 12m | Hits:  36%/5272  
      🟩 MSVC14.39          Pass: 100%/2   | Total:  1h 47m | Avg: 53m 49s | Max:  1h 23m | Hits:  67%/5272  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  5h 13m | Avg:  1h 18m | Max:  1h 22m
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  1d 06h | Avg: 38m 24s | Max: 47m 37s
      🟩 GCC                Pass: 100%/49  | Total:  1d 06h | Avg: 36m 54s | Max: 53m 49s
      🟩 Intel              Pass: 100%/3   | Total:  2h 40m | Avg: 53m 36s | Max:  1h 03m
      🟩 MSVC               Pass: 100%/5   | Total:  5h 23m | Avg:  1h 04m | Max:  1h 23m | Hits:  48%/13180 
      🟩 NVHPC              Pass: 100%/4   | Total:  5h 13m | Avg:  1h 18m | Max:  1h 22m
    🟩 gpu
      🟩 v100               Pass: 100%/109 | Total:  3d 02h | Avg: 40m 49s | Max:  1h 23m | Hits:  48%/13180 
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total:  3d 00h | Avg: 42m 37s | Max:  1h 23m | Hits:  36%/10544 
      🟩 TestCPU            Pass: 100%/4   | Total: 47m 52s | Avg: 11m 58s | Max: 24m 08s | Hits:  99%/2636  
      🟩 TestGPU            Pass: 100%/3   | Total: 54m 08s | Avg: 18m 02s | Max: 25m 55s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total:  2h 24m | Avg: 48m 07s | Max: 53m 49s
      🟩 90a                Pass: 100%/4   | Total:  1h 52m | Avg: 28m 02s | Max: 32m 28s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total: 16h 39m | Avg: 33m 18s | Max:  1h 11m
      🟩 14                 Pass: 100%/29  | Total: 21h 30m | Avg: 44m 29s | Max:  1h 21m | Hits:  36%/5272  
      🟩 17                 Pass: 100%/27  | Total: 20h 42m | Avg: 46m 00s | Max:  1h 22m | Hits:  36%/2636  
      🟩 20                 Pass: 100%/23  | Total: 15h 18m | Avg: 39m 55s | Max:  1h 23m | Hits:  67%/5272  
    
  • 🟩 cudax: Pass: 100%/54 | Total: 4h 50m | Avg: 5m 23s | Max: 19m 26s | Hits: 78%/238

    🟩 cpu
      🟩 amd64              Pass: 100%/50  | Total:  4h 36m | Avg:  5m 31s | Max: 19m 26s | Hits:  78%/238   
      🟩 arm64              Pass: 100%/4   | Total: 14m 35s | Avg:  3m 38s | Max:  4m 01s
    🟩 ctk
      🟩 12.0               Pass: 100%/19  | Total:  1h 41m | Avg:  5m 19s | Max: 18m 50s | Hits:  78%/119   
      🟩 12.5               Pass: 100%/2   | Total: 13m 03s | Avg:  6m 31s | Max:  6m 45s
      🟩 12.6               Pass: 100%/33  | Total:  2h 56m | Avg:  5m 21s | Max: 19m 26s | Hits:  78%/119   
    🟩 cudacxx
      🟩 nvcc12.0           Pass: 100%/19  | Total:  1h 41m | Avg:  5m 19s | Max: 18m 50s | Hits:  78%/119   
      🟩 nvcc12.5           Pass: 100%/2   | Total: 13m 03s | Avg:  6m 31s | Max:  6m 45s
      🟩 nvcc12.6           Pass: 100%/33  | Total:  2h 56m | Avg:  5m 21s | Max: 19m 26s | Hits:  78%/119   
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/54  | Total:  4h 50m | Avg:  5m 23s | Max: 19m 26s | Hits:  78%/238   
    🟩 cxx
      🟩 Clang9             Pass: 100%/2   | Total:  9m 03s | Avg:  4m 31s | Max:  4m 34s
      🟩 Clang10            Pass: 100%/2   | Total:  8m 20s | Avg:  4m 10s | Max:  4m 27s
      🟩 Clang11            Pass: 100%/4   | Total: 14m 45s | Avg:  3m 41s | Max:  3m 51s
      🟩 Clang12            Pass: 100%/4   | Total: 15m 12s | Avg:  3m 48s | Max:  4m 00s
      🟩 Clang13            Pass: 100%/4   | Total: 14m 01s | Avg:  3m 30s | Max:  3m 45s
      🟩 Clang14            Pass: 100%/4   | Total: 28m 48s | Avg:  7m 12s | Max: 17m 48s
      🟩 Clang15            Pass: 100%/2   | Total:  7m 18s | Avg:  3m 39s | Max:  3m 41s
      🟩 Clang16            Pass: 100%/4   | Total: 14m 36s | Avg:  3m 39s | Max:  3m 59s
      🟩 Clang17            Pass: 100%/2   | Total:  7m 44s | Avg:  3m 52s | Max:  4m 07s
      🟩 Clang18            Pass: 100%/2   | Total: 23m 13s | Avg: 11m 36s | Max: 19m 13s
      🟩 GCC9               Pass: 100%/2   | Total:  7m 31s | Avg:  3m 45s | Max:  3m 54s
      🟩 GCC10              Pass: 100%/4   | Total: 14m 01s | Avg:  3m 30s | Max:  3m 52s
      🟩 GCC11              Pass: 100%/4   | Total: 14m 54s | Avg:  3m 43s | Max:  4m 16s
      🟩 GCC12              Pass: 100%/7   | Total:  1h 11m | Avg: 10m 14s | Max: 19m 26s
      🟩 GCC13              Pass: 100%/3   | Total: 10m 54s | Avg:  3m 38s | Max:  4m 01s
      🟩 MSVC14.36          Pass: 100%/1   | Total:  7m 52s | Avg:  7m 52s | Max:  7m 52s | Hits:  78%/119   
      🟩 MSVC14.39          Pass: 100%/1   | Total:  7m 59s | Avg:  7m 59s | Max:  7m 59s | Hits:  78%/119   
      🟩 NVHPC24.7          Pass: 100%/2   | Total: 13m 03s | Avg:  6m 31s | Max:  6m 45s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/30  | Total:  2h 23m | Avg:  4m 46s | Max: 19m 13s
      🟩 GCC                Pass: 100%/20  | Total:  1h 58m | Avg:  5m 56s | Max: 19m 26s
      🟩 MSVC               Pass: 100%/2   | Total: 15m 51s | Avg:  7m 55s | Max:  7m 59s | Hits:  78%/238   
      🟩 NVHPC              Pass: 100%/2   | Total: 13m 03s | Avg:  6m 31s | Max:  6m 45s
    🟩 gpu
      🟩 v100               Pass: 100%/54  | Total:  4h 50m | Avg:  5m 23s | Max: 19m 26s | Hits:  78%/238   
    🟩 jobs
      🟩 Build              Pass: 100%/49  | Total:  3h 16m | Avg:  4m 00s | Max:  7m 59s | Hits:  78%/238   
      🟩 Test               Pass: 100%/5   | Total:  1h 34m | Avg: 18m 55s | Max: 19m 26s
    🟩 sm
      🟩 90                 Pass: 100%/1   | Total:  2m 45s | Avg:  2m 45s | Max:  2m 45s
      🟩 90a                Pass: 100%/1   | Total:  3m 06s | Avg:  3m 06s | Max:  3m 06s
    🟩 std
      🟩 17                 Pass: 100%/29  | Total:  2h 21m | Avg:  4m 51s | Max: 19m 21s
      🟩 20                 Pass: 100%/25  | Total:  2h 29m | Avg:  5m 59s | Max: 19m 26s | Hits:  78%/238   
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 10m 51s | Avg: 5m 25s | Max: 8m 56s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total: 10m 51s | Avg:  5m 25s | Max:  8m 56s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total: 10m 51s | Avg:  5m 25s | Max:  8m 56s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total: 10m 51s | Avg:  5m 25s | Max:  8m 56s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total: 10m 51s | Avg:  5m 25s | Max:  8m 56s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total: 10m 51s | Avg:  5m 25s | Max:  8m 56s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total: 10m 51s | Avg:  5m 25s | Max:  8m 56s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total: 10m 51s | Avg:  5m 25s | Max:  8m 56s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  1m 55s | Avg:  1m 55s | Max:  1m 55s
      🟩 Test               Pass: 100%/1   | Total:  8m 56s | Avg:  8m 56s | Max:  8m 56s
    
  • 🟩 python: Pass: 100%/1 | Total: 15m 05s | Avg: 15m 05s | Max: 15m 05s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 15m 05s | Avg: 15m 05s | Max: 15m 05s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 15m 05s | Avg: 15m 05s | Max: 15m 05s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 15m 05s | Avg: 15m 05s | Max: 15m 05s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 15m 05s | Avg: 15m 05s | Max: 15m 05s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 15m 05s | Avg: 15m 05s | Max: 15m 05s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 15m 05s | Avg: 15m 05s | Max: 15m 05s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 15m 05s | Avg: 15m 05s | Max: 15m 05s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 15m 05s | Avg: 15m 05s | Max: 15m 05s
    

👃 Inspect Changes

Modifications in project?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
CUDA Experimental
python
CCCL C Parallel Library
Catch2Helper

Modifications in project or dependencies?

Project
CCCL Infrastructure
+/- libcu++
+/- CUB
+/- Thrust
+/- CUDA Experimental
+/- python
+/- CCCL C Parallel Library
+/- Catch2Helper

🏃‍ Runner counts (total jobs: 394)

# Runner
326 linux-amd64-cpu16
28 linux-arm64-cpu16
25 linux-amd64-gpu-v100-latest-1
15 windows-amd64-cpu16

@bernhardmgruber bernhardmgruber merged commit c97f2e3 into NVIDIA:main Nov 6, 2024
408 checks passed
@bernhardmgruber bernhardmgruber deleted the transform_thrust branch November 6, 2024 12:47
pciolkosz pushed a commit to pciolkosz/cccl that referenced this pull request Nov 6, 2024
* Add transform benchmark requiring a stable address
* Make thrust::transform use cub::DeviceTransform
* Introduces address stability detection and opt-in in libcu++
* Mark lambdas in Thrust BabelStream benchmark address oblivious
* Optimize prefetch cub::DeviceTransform for small problems

Fixes: NVIDIA#2263
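
The "address stability detection and opt-in" item can be illustrated with a small host-only sketch. The idea, as far as the commit message suggests, is that a transform may only hand *copies* of elements to a callable if that callable does not depend on the addresses of its arguments. Everything named below (`allows_copied_arguments`, `copy_tolerant`, `mark_copy_tolerant`, `transform_sketch`) is hypothetical and chosen for illustration; it is not the libcu++ or Thrust API, and the "fast path" is only a stand-in for whatever staging the real implementation performs.

```cpp
#include <cstddef>
#include <functional>
#include <type_traits>
#include <utility>
#include <vector>

namespace sketch {

// Hypothetical trait: by default, assume a callable may inspect the
// addresses of its arguments, so passing copies is not allowed.
template <class F>
struct allows_copied_arguments : std::false_type {};

// Detection: some well-known function objects can be marked safe up front.
template <class T>
struct allows_copied_arguments<std::plus<T>> : std::true_type {};

// Opt-in: a thin wrapper that forwards calls and carries the promise
// "this callable only looks at argument values, never at their addresses".
template <class F>
struct copy_tolerant {
  F f;

  template <class... Args>
  decltype(auto) operator()(Args&&... args) const {
    return f(std::forward<Args>(args)...);
  }
};

template <class F>
struct allows_copied_arguments<copy_tolerant<F>> : std::true_type {};

template <class F>
copy_tolerant<std::decay_t<F>> mark_copy_tolerant(F&& f) {
  return {std::forward<F>(f)};
}

// A unary transform that picks its strategy at compile time.
template <class T, class F>
std::vector<T> transform_sketch(const std::vector<T>& in, F f) {
  std::vector<T> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i) {
    if constexpr (allows_copied_arguments<F>::value) {
      T staged = in[i];    // stand-in for a fast path that works on copies
      out[i] = f(staged);
    } else {
      out[i] = f(in[i]);   // conservative path: reference to the original element
    }
  }
  return out;
}

} // namespace sketch
```

With this sketch, an unmarked lambda such as `[base](const int& x) { return static_cast<int>(&x - base); }` keeps working because it always sees the original element, while `sketch::mark_copy_tolerant([](const int& x) { return x * 2; })` opts into the copying path.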
fbusato pushed a commit to fbusato/cccl that referenced this pull request Nov 9, 2024
pciolkosz added a commit that referenced this pull request Nov 11, 2024
* copy pasted sample

* First draft

* Kernel functor and some other things

* Clean up and break up long main function

* Needs launch fix

* Switch to copy_bytes and cleanups

* Missing include

* Add exception print and waive value

* Adjust copy count

* Add license and switch benchmark streams

* Remove a function left as a mistake

* Update copyright date

Co-authored-by: Eric Niebler <[email protected]>

* Setup cudax examples. (#2697)

* Move the sample to new location and fix warning

* build fixes and 0 return code on waive

* Some new MSVC errors

* explicit cast

* Rename enable/disable peer access and separate the sample loop

* Add `cuda::minimum` and `cuda::maximum` (#2681)

* Add cuda::minimum and cuda::maximum

* Various fixes to cub::DeviceTransform (#2709)

* Workaround non-copyable iterators
* Use a named constant for SMEM
* Cast to raw reference 2
* Fix passing non-copy-assignable iterators to transform_kernel via kernel_arg

* Make `thrust::transform` use `cub::DeviceTransform` (#2389)

* Add transform benchmark requiring a stable address
* Make thrust::transform use cub::DeviceTransform
* Introduces address stability detection and opt-in in libcu++
* Mark lambdas in Thrust BabelStream benchmark address oblivious
* Optimize prefetch cub::DeviceTransform for small problems

Fixes: #2263

* Ensure that we only use the inline variable trait when it is actually available (#2712)

* Ensure that we only use the inline variable trait when it is actually available

* Use the right define for internal traits

* [CUDAX] Rename memory resource and memory pool from async to device (#2710)

* Rename the type

* Update tests

* Rename async memory pool

* Rename the tests

* Change name in the docs

* Generalise the memory_pool_properties name

* Fix docs

---------

Co-authored-by: Michael Schellenberger Costa <[email protected]>

* Update memory resource name

---------

Co-authored-by: Eric Niebler <[email protected]>
Co-authored-by: Allison Piper <[email protected]>
Co-authored-by: Jacob Faibussowitsch <[email protected]>
Co-authored-by: Bernhard Manfred Gruber <[email protected]>
Co-authored-by: Michael Schellenberger Costa <[email protected]>
fbusato pushed a commit to fbusato/cccl that referenced this pull request Nov 12, 2024