Add memcpy_async transform kernel for A100 #2394

Draft: wants to merge 1 commit into `main`

@bernhardmgruber bernhardmgruber commented Sep 9, 2024

This PR adds a transform kernel to cub::DeviceTransform that uses cg::memcpy_async and targets Ampere GPUs.

Fixes: #2361

Benchmark on A100 80GB PCIe
# mul

## [0] NVIDIA A100 80GB PCIe

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      I32      |      2^16      |   6.802 us |       8.39% |   6.540 us |       7.95% |   -0.262 us |  -3.85% |   PASS   |
|   I8    |      I32      |      2^20      |  10.249 us |       5.14% |   8.238 us |       6.46% |   -2.010 us | -19.62% |   FAIL   |
|   I8    |      I32      |      2^24      |  62.543 us |       0.86% |  30.496 us |       1.98% |  -32.047 us | -51.24% |   FAIL   |
|   I8    |      I32      |      2^28      | 879.983 us |       0.07% | 362.311 us |       0.65% | -517.672 us | -58.83% |   FAIL   |
|   I8    |      I64      |      2^16      |   7.207 us |       8.42% |   6.895 us |       9.12% |   -0.312 us |  -4.33% |   PASS   |
|   I8    |      I64      |      2^20      |  10.330 us |       4.92% |   8.145 us |       7.29% |   -2.185 us | -21.15% |   FAIL   |
|   I8    |      I64      |      2^24      |  63.021 us |       0.91% |  29.563 us |       1.89% |  -33.458 us | -53.09% |   FAIL   |
|   I8    |      I64      |      2^28      | 886.439 us |       0.07% | 359.991 us |       0.76% | -526.449 us | -59.39% |   FAIL   |
|   I16   |      I32      |      2^16      |   7.127 us |       9.01% |   6.779 us |       8.85% |   -0.349 us |  -4.89% |   PASS   |
|   I16   |      I32      |      2^20      |  10.693 us |       5.07% |   9.685 us |       5.61% |   -1.009 us |  -9.43% |   FAIL   |
|   I16   |      I32      |      2^24      |  65.555 us |       0.86% |  51.138 us |       2.71% |  -14.416 us | -21.99% |   FAIL   |
|   I16   |      I32      |      2^28      | 924.303 us |       0.07% | 678.104 us |       0.27% | -246.199 us | -26.64% |   FAIL   |
|   I16   |      I64      |      2^16      |   7.409 us |       6.86% |   6.929 us |       8.84% |   -0.480 us |  -6.48% |   PASS   |
|   I16   |      I64      |      2^20      |  10.955 us |       5.32% |   9.303 us |       6.29% |   -1.652 us | -15.08% |   FAIL   |
|   I16   |      I64      |      2^24      |  65.881 us |       0.78% |  50.810 us |       2.55% |  -15.070 us | -22.88% |   FAIL   |
|   I16   |      I64      |      2^28      | 929.253 us |       0.06% | 676.679 us |       0.25% | -252.574 us | -27.18% |   FAIL   |
|   F32   |      I32      |      2^16      |   7.495 us |       6.45% |   7.001 us |       8.78% |   -0.494 us |  -6.60% |   FAIL   |
|   F32   |      I32      |      2^20      |  12.384 us |       4.09% |  11.786 us |       4.98% |   -0.599 us |  -4.83% |   FAIL   |
|   F32   |      I32      |      2^24      |  94.864 us |       1.63% |  91.200 us |       1.47% |   -3.663 us |  -3.86% |   FAIL   |
|   F32   |      I32      |      2^28      |   1.386 ms |       0.15% |   1.319 ms |       0.13% |  -67.725 us |  -4.89% |   FAIL   |
|   F32   |      I64      |      2^16      |   7.571 us |       6.68% |   7.105 us |       7.97% |   -0.466 us |  -6.15% |   PASS   |
|   F32   |      I64      |      2^20      |  13.100 us |       4.79% |  12.275 us |       4.44% |   -0.825 us |  -6.30% |   FAIL   |
|   F32   |      I64      |      2^24      |  94.921 us |       1.70% |  91.614 us |       1.68% |   -3.307 us |  -3.48% |   FAIL   |
|   F32   |      I64      |      2^28      |   1.387 ms |       0.16% |   1.321 ms |       0.13% |  -65.915 us |  -4.75% |   FAIL   |
|   F64   |      I32      |      2^16      |   7.799 us |       7.62% |   7.457 us |       7.03% |   -0.342 us |  -4.38% |   PASS   |
|   F64   |      I32      |      2^20      |  17.545 us |       3.13% |  17.270 us |       3.40% |   -0.275 us |  -1.57% |   PASS   |
|   F64   |      I32      |      2^24      | 171.405 us |       0.83% | 172.638 us |       0.84% |    1.234 us |   0.72% |   PASS   |
|   F64   |      I32      |      2^28      |   2.596 ms |       0.06% |   2.618 ms |       0.07% |   21.484 us |   0.83% |   FAIL   |
|   F64   |      I64      |      2^16      |   7.848 us |       6.82% |   7.470 us |       6.77% |   -0.378 us |  -4.81% |   PASS   |
|   F64   |      I64      |      2^20      |  17.527 us |       3.33% |  17.306 us |       3.40% |   -0.222 us |  -1.27% |   PASS   |
|   F64   |      I64      |      2^24      | 173.554 us |       1.43% | 174.633 us |       1.40% |    1.079 us |   0.62% |   PASS   |
|   F64   |      I64      |      2^28      |   2.596 ms |       0.06% |   2.623 ms |       0.07% |   27.480 us |   1.06% |   FAIL   |
|  I128   |      I32      |      2^16      |   8.598 us |       6.24% |   8.175 us |       7.19% |   -0.423 us |  -4.92% |   PASS   |
|  I128   |      I32      |      2^20      |  29.438 us |       4.56% |  28.753 us |       5.42% |   -0.685 us |  -2.33% |   PASS   |
|  I128   |      I32      |      2^24      | 335.566 us |       1.18% | 326.713 us |       1.46% |   -8.853 us |  -2.64% |   FAIL   |
|  I128   |      I32      |      2^28      |   5.213 ms |       0.08% |   5.077 ms |       0.10% | -136.224 us |  -2.61% |   FAIL   |
|  I128   |      I64      |      2^16      |   8.539 us |       6.24% |   8.149 us |       7.30% |   -0.390 us |  -4.57% |   PASS   |
|  I128   |      I64      |      2^20      |  29.386 us |       4.86% |  28.697 us |       5.45% |   -0.689 us |  -2.34% |   PASS   |
|  I128   |      I64      |      2^24      | 343.512 us |       0.36% | 335.308 us |       0.42% |   -8.204 us |  -2.39% |   FAIL   |
|  I128   |      I64      |      2^28      |   5.234 ms |       0.34% |   5.087 ms |       0.49% | -147.484 us |  -2.82% |   FAIL   |
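
The Diff, %Diff, and Status columns can be reproduced from the four measured columns. The sketch below assumes that a row passes when the absolute relative difference does not exceed the smaller of the two noise values. That rule is inferred from the rows above and is not stated in this thread, and the helper name `compare_row` is hypothetical. Under this reading, FAIL marks a difference larger than the measurement noise (here mostly speedups), not a correctness failure.

```python
def compare_row(ref_time, ref_noise_pct, cmp_time, cmp_noise_pct):
    """Return (diff, pct_diff, status) for one benchmark row.

    Both times must use the same unit; noise values are percentages.
    """
    diff = cmp_time - ref_time
    pct_diff = 100.0 * diff / ref_time
    # Assumed rule: significant when |%Diff| exceeds the smaller noise value.
    min_noise = min(ref_noise_pct, cmp_noise_pct)
    status = "PASS" if abs(pct_diff) <= min_noise else "FAIL"
    return diff, pct_diff, status


# Rows copied from the mul table (times in microseconds).
rows = {
    "I8/I32/2^16": (6.802, 8.39, 6.540, 7.95),
    "I8/I32/2^28": (879.983, 0.07, 362.311, 0.65),
    "F32/I32/2^16": (7.495, 6.45, 7.001, 8.78),
    "F64/I32/2^24": (171.405, 0.83, 172.638, 0.84),
}
for name, row in rows.items():
    diff, pct, status = compare_row(*row)
    print(f"{name}: {diff:+.3f} us, {pct:+.2f}%, {status}")
```

Because the table prints rounded times, a recomputed %Diff can differ from the printed one in the last digit.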

# add

## [0] NVIDIA A100 80GB PCIe

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      I32      |      2^16      |   7.025 us |       8.26% |   6.832 us |       8.53% |   -0.193 us |  -2.75% |   PASS   |
|   I8    |      I32      |      2^20      |  10.956 us |       5.25% |   9.520 us |       5.58% |   -1.436 us | -13.11% |   FAIL   |
|   I8    |      I32      |      2^24      |  68.523 us |       0.83% |  44.984 us |       1.30% |  -23.538 us | -34.35% |   FAIL   |
|   I8    |      I32      |      2^28      | 956.059 us |       0.07% | 572.666 us |       0.42% | -383.393 us | -40.10% |   FAIL   |
|   I8    |      I64      |      2^16      |   7.482 us |       6.89% |   7.168 us |       8.58% |   -0.314 us |  -4.20% |   PASS   |
|   I8    |      I64      |      2^20      |  10.945 us |       5.25% |   9.544 us |       5.09% |   -1.402 us | -12.81% |   FAIL   |
|   I8    |      I64      |      2^24      |  68.454 us |       0.87% |  44.965 us |       1.25% |  -23.489 us | -34.31% |   FAIL   |
|   I8    |      I64      |      2^28      | 959.278 us |       0.07% | 573.693 us |       0.41% | -385.584 us | -40.20% |   FAIL   |
|   I16   |      I32      |      2^16      |   7.420 us |       7.68% |   7.232 us |       7.90% |   -0.187 us |  -2.53% |   PASS   |
|   I16   |      I32      |      2^20      |  11.950 us |       5.01% |  11.472 us |       4.58% |   -0.478 us |  -4.00% |   PASS   |
|   I16   |      I32      |      2^24      |  79.335 us |       0.70% |  77.056 us |       2.11% |   -2.279 us |  -2.87% |   FAIL   |
|   I16   |      I32      |      2^28      |   1.112 ms |       0.36% |   1.061 ms |       0.30% |  -51.083 us |  -4.59% |   FAIL   |
|   I16   |      I64      |      2^16      |   7.506 us |       6.98% |   7.308 us |       7.53% |   -0.199 us |  -2.65% |   PASS   |
|   I16   |      I64      |      2^20      |  12.329 us |       4.87% |  11.701 us |       5.93% |   -0.628 us |  -5.09% |   FAIL   |
|   I16   |      I64      |      2^24      |  79.190 us |       0.68% |  76.779 us |       2.20% |   -2.412 us |  -3.05% |   FAIL   |
|   I16   |      I64      |      2^28      |   1.111 ms |       0.34% |   1.061 ms |       0.28% |  -49.930 us |  -4.50% |   FAIL   |
|   F32   |      I32      |      2^16      |   7.763 us |       7.60% |   7.538 us |       6.95% |   -0.225 us |  -2.90% |   PASS   |
|   F32   |      I32      |      2^20      |  15.889 us |       4.06% |  15.818 us |       4.09% |   -0.071 us |  -0.44% |   PASS   |
|   F32   |      I32      |      2^24      | 137.955 us |       1.17% | 140.439 us |       1.61% |    2.484 us |   1.80% |   FAIL   |
|   F32   |      I32      |      2^28      |   2.021 ms |       0.10% |   2.074 ms |       0.16% |   53.565 us |   2.65% |   FAIL   |
|   F32   |      I64      |      2^16      |   7.842 us |       7.36% |   7.468 us |       6.44% |   -0.374 us |  -4.77% |   PASS   |
|   F32   |      I64      |      2^20      |  16.042 us |       3.71% |  16.032 us |       3.77% |   -0.010 us |  -0.06% |   PASS   |
|   F32   |      I64      |      2^24      | 138.841 us |       1.54% | 141.710 us |       1.86% |    2.869 us |   2.07% |   FAIL   |
|   F32   |      I64      |      2^28      |   2.020 ms |       0.10% |   2.077 ms |       0.17% |   56.991 us |   2.82% |   FAIL   |
|   F64   |      I32      |      2^16      |   8.325 us |       6.82% |   8.080 us |       7.75% |   -0.245 us |  -2.94% |   PASS   |
|   F64   |      I32      |      2^20      |  25.148 us |       5.37% |  25.565 us |       5.38% |    0.417 us |   1.66% |   PASS   |
|   F64   |      I32      |      2^24      | 251.377 us |       0.92% | 253.412 us |       0.92% |    2.035 us |   0.81% |   PASS   |
|   F64   |      I32      |      2^28      |   3.803 ms |       0.04% |   3.848 ms |       0.04% |   44.711 us |   1.18% |   FAIL   |
|   F64   |      I64      |      2^16      |   8.254 us |       6.67% |   8.031 us |       7.20% |   -0.223 us |  -2.70% |   PASS   |
|   F64   |      I64      |      2^20      |  25.333 us |       5.29% |  25.626 us |       5.11% |    0.293 us |   1.15% |   PASS   |
|   F64   |      I64      |      2^24      | 255.488 us |       0.35% | 257.570 us |       0.34% |    2.082 us |   0.81% |   FAIL   |
|   F64   |      I64      |      2^28      |   3.807 ms |       0.17% |   3.855 ms |       0.19% |   48.325 us |   1.27% |   FAIL   |
|  I128   |      I32      |      2^16      |   9.438 us |       5.59% |   9.236 us |       6.33% |   -0.202 us |  -2.14% |   PASS   |
|  I128   |      I32      |      2^20      |  42.510 us |       2.20% |  41.849 us |       2.57% |   -0.660 us |  -1.55% |   PASS   |
|  I128   |      I32      |      2^24      | 489.389 us |       1.20% | 488.309 us |       1.27% |   -1.080 us |  -0.22% |   PASS   |
|  I128   |      I32      |      2^28      |   7.614 ms |       0.07% |   7.611 ms |       0.11% |   -2.617 us |  -0.03% |   PASS   |
|  I128   |      I64      |      2^16      |   9.421 us |       5.33% |   9.216 us |       6.19% |   -0.205 us |  -2.17% |   PASS   |
|  I128   |      I64      |      2^20      |  42.499 us |       2.40% |  41.828 us |       2.43% |   -0.671 us |  -1.58% |   PASS   |
|  I128   |      I64      |      2^24      | 497.489 us |       0.21% | 496.386 us |       0.25% |   -1.104 us |  -0.22% |   FAIL   |
|  I128   |      I64      |      2^28      |   7.620 ms |       0.43% |   7.617 ms |       0.45% |   -2.309 us |  -0.03% |   PASS   |
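
To read the add timings as memory throughput, a row can be converted into effective bandwidth. The sketch below assumes that add reads two input arrays and writes one output array of `T` (the usual BabelStream definition), which this thread does not state. `effective_bandwidth_gbs` is a hypothetical helper name.

```python
def effective_bandwidth_gbs(num_elements, bytes_per_element, num_arrays, seconds):
    """Bytes moved per second, in GB/s (1 GB = 1e9 bytes)."""
    total_bytes = num_elements * bytes_per_element * num_arrays
    return total_bytes / seconds / 1e9


# I8 with I32 offsets and 2^28 elements, taken from the add table.
# Assumption: 2 input arrays + 1 output array are touched once each.
n = 2**28
ref = effective_bandwidth_gbs(n, 1, 3, 956.059e-6)
cmp_ = effective_bandwidth_gbs(n, 1, 3, 572.666e-6)
print(f"ref: {ref:.0f} GB/s, cmp: {cmp_:.0f} GB/s, speedup: {cmp_ / ref:.2f}x")
```

Running the same calculation on the F64 rows gives a noticeably higher figure for the reference kernel, which would be consistent with the smaller differences reported for wider types.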

# triad

## [0] NVIDIA A100 80GB PCIe

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      I32      |      2^16      |   7.427 us |       7.16% |   7.157 us |       7.64% |   -0.271 us |  -3.64% |   PASS   |
|   I8    |      I32      |      2^20      |  11.497 us |       4.50% |  10.137 us |       5.68% |   -1.360 us | -11.83% |   FAIL   |
|   I8    |      I32      |      2^24      |  69.199 us |       0.87% |  45.871 us |       1.44% |  -23.328 us | -33.71% |   FAIL   |
|   I8    |      I32      |      2^28      | 956.037 us |       0.45% | 580.667 us |       0.48% | -375.370 us | -39.26% |   FAIL   |
|   I8    |      I64      |      2^16      |   7.372 us |       7.52% |   7.130 us |       7.55% |   -0.242 us |  -3.28% |   PASS   |
|   I8    |      I64      |      2^20      |  10.990 us |       5.24% |   9.446 us |       5.26% |   -1.545 us | -14.05% |   FAIL   |
|   I8    |      I64      |      2^24      |  68.336 us |       0.85% |  45.054 us |       1.14% |  -23.282 us | -34.07% |   FAIL   |
|   I8    |      I64      |      2^28      | 957.756 us |       0.07% | 573.758 us |       0.42% | -383.999 us | -40.09% |   FAIL   |
|   I16   |      I32      |      2^16      |   7.397 us |       7.86% |   7.110 us |       8.39% |   -0.287 us |  -3.88% |   PASS   |
|   I16   |      I32      |      2^20      |  11.858 us |       4.74% |  11.360 us |       4.53% |   -0.498 us |  -4.20% |   PASS   |
|   I16   |      I32      |      2^24      |  79.295 us |       0.67% |  76.680 us |       2.17% |   -2.615 us |  -3.30% |   FAIL   |
|   I16   |      I32      |      2^28      |   1.111 ms |       0.35% |   1.062 ms |       0.27% |  -49.659 us |  -4.47% |   FAIL   |
|   I16   |      I64      |      2^16      |   7.496 us |       6.67% |   7.264 us |       7.26% |   -0.232 us |  -3.10% |   PASS   |
|   I16   |      I64      |      2^20      |  12.373 us |       4.52% |  11.911 us |       5.08% |   -0.461 us |  -3.73% |   PASS   |
|   I16   |      I64      |      2^24      |  79.202 us |       0.65% |  76.525 us |       2.30% |   -2.676 us |  -3.38% |   FAIL   |
|   I16   |      I64      |      2^28      |   1.110 ms |       0.34% |   1.061 ms |       0.27% |  -48.636 us |  -4.38% |   FAIL   |
|   F32   |      I32      |      2^16      |   7.666 us |       7.03% |   7.397 us |       6.69% |   -0.269 us |  -3.51% |   PASS   |
|   F32   |      I32      |      2^20      |  15.684 us |       4.15% |  15.811 us |       4.07% |    0.127 us |   0.81% |   PASS   |
|   F32   |      I32      |      2^24      | 137.836 us |       1.09% | 140.813 us |       1.52% |    2.978 us |   2.16% |   FAIL   |
|   F32   |      I32      |      2^28      |   2.017 ms |       0.10% |   2.073 ms |       0.17% |   56.326 us |   2.79% |   FAIL   |
|   F32   |      I64      |      2^16      |   7.664 us |       7.01% |   7.356 us |       6.97% |   -0.308 us |  -4.02% |   PASS   |
|   F32   |      I64      |      2^20      |  15.855 us |       3.71% |  15.953 us |       3.58% |    0.098 us |   0.62% |   PASS   |
|   F32   |      I64      |      2^24      | 138.245 us |       1.56% | 141.501 us |       1.93% |    3.256 us |   2.36% |   FAIL   |
|   F32   |      I64      |      2^28      |   2.016 ms |       0.10% |   2.075 ms |       0.17% |   59.070 us |   2.93% |   FAIL   |
|   F64   |      I32      |      2^16      |   8.176 us |       7.03% |   7.930 us |       7.54% |   -0.246 us |  -3.01% |   PASS   |
|   F64   |      I32      |      2^20      |  25.152 us |       5.32% |  25.558 us |       4.99% |    0.406 us |   1.62% |   PASS   |
|   F64   |      I32      |      2^24      | 251.230 us |       0.82% | 253.691 us |       0.87% |    2.460 us |   0.98% |   FAIL   |
|   F64   |      I32      |      2^28      |   3.806 ms |       0.04% |   3.851 ms |       0.04% |   45.347 us |   1.19% |   FAIL   |
|   F64   |      I64      |      2^16      |   8.160 us |       7.13% |   7.906 us |       7.71% |   -0.254 us |  -3.12% |   PASS   |
|   F64   |      I64      |      2^20      |  25.140 us |       5.17% |  25.556 us |       5.42% |    0.416 us |   1.65% |   PASS   |
|   F64   |      I64      |      2^24      | 255.311 us |       0.33% | 257.648 us |       0.33% |    2.337 us |   0.92% |   FAIL   |
|   F64   |      I64      |      2^28      |   3.804 ms |       0.14% |   3.856 ms |       0.20% |   51.458 us |   1.35% |   FAIL   |
|  I128   |      I32      |      2^16      |   9.377 us |       5.88% |   9.233 us |       6.44% |   -0.143 us |  -1.53% |   PASS   |
|  I128   |      I32      |      2^20      |  42.543 us |       2.49% |  41.836 us |       2.37% |   -0.708 us |  -1.66% |   PASS   |
|  I128   |      I32      |      2^24      | 489.137 us |       1.19% | 488.434 us |       1.29% |   -0.703 us |  -0.14% |   PASS   |
|  I128   |      I32      |      2^28      |   7.616 ms |       0.07% |   7.610 ms |       0.11% |   -5.771 us |  -0.08% |   FAIL   |
|  I128   |      I64      |      2^16      |   9.412 us |       5.44% |   9.203 us |       6.47% |   -0.209 us |  -2.22% |   PASS   |
|  I128   |      I64      |      2^20      |  42.537 us |       2.45% |  41.828 us |       2.41% |   -0.709 us |  -1.67% |   PASS   |
|  I128   |      I64      |      2^24      | 497.430 us |       0.19% | 496.110 us |       0.25% |   -1.320 us |  -0.27% |   FAIL   |
|  I128   |      I64      |      2^28      |   7.622 ms |       0.48% |   7.613 ms |       0.29% |   -9.246 us |  -0.12% |   PASS   |

# nstream

## [0] NVIDIA A100 80GB PCIe

|  T{ct}  |  OffsetT{ct}  |  Elements{io}  |  OverwriteInput  |   Ref Time |   Ref Noise |   Cmp Time |   Cmp Noise |        Diff |   %Diff |  Status  |
|---------|---------------|----------------|------------------|------------|-------------|------------|-------------|-------------|---------|----------|
|   I8    |      I32      |      2^16      |        1         |   7.028 us |       8.28% |   7.047 us |       8.43% |    0.019 us |   0.27% |   PASS   |
|   I8    |      I32      |      2^20      |        1         |  11.456 us |       4.15% |  10.758 us |       5.05% |   -0.698 us |  -6.09% |   FAIL   |
|   I8    |      I32      |      2^24      |        1         |  73.978 us |       0.68% |  59.129 us |       1.05% |  -14.849 us | -20.07% |   FAIL   |
|   I8    |      I32      |      2^28      |        1         |   1.037 ms |       0.07% | 799.949 us |       0.25% | -236.739 us | -22.84% |   FAIL   |
|   I8    |      I64      |      2^16      |        1         |   7.608 us |       7.03% |   7.449 us |       7.33% |   -0.160 us |  -2.10% |   PASS   |
|   I8    |      I64      |      2^20      |        1         |  11.547 us |       4.17% |  10.800 us |       5.30% |   -0.748 us |  -6.47% |   FAIL   |
|   I8    |      I64      |      2^24      |        1         |  74.411 us |       0.76% |  60.230 us |       1.09% |  -14.181 us | -19.06% |   FAIL   |
|   I8    |      I64      |      2^28      |        1         |   1.041 ms |       0.07% | 819.130 us |       0.10% | -222.072 us | -21.33% |   FAIL   |
|   I16   |      I32      |      2^16      |        1         |   7.646 us |       7.29% |   7.515 us |       7.08% |   -0.131 us |  -1.72% |   PASS   |
|   I16   |      I32      |      2^20      |        1         |  13.146 us |       4.21% |  12.872 us |       4.25% |   -0.275 us |  -2.09% |   PASS   |
|   I16   |      I32      |      2^24      |        1         |  97.820 us |       1.94% |  97.658 us |       1.89% |   -0.162 us |  -0.17% |   PASS   |
|   I16   |      I32      |      2^28      |        1         |   1.386 ms |       0.24% |   1.375 ms |       0.24% |  -11.569 us |  -0.83% |   FAIL   |
|   I16   |      I64      |      2^16      |        1         |   7.757 us |       7.52% |   7.552 us |       6.78% |   -0.205 us |  -2.65% |   PASS   |
|   I16   |      I64      |      2^20      |        1         |  13.218 us |       4.59% |  12.830 us |       4.49% |   -0.388 us |  -2.93% |   PASS   |
|   I16   |      I64      |      2^24      |        1         |  97.591 us |       1.83% |  97.691 us |       2.09% |    0.100 us |   0.10% |   PASS   |
|   I16   |      I64      |      2^28      |        1         |   1.383 ms |       0.25% |   1.376 ms |       0.23% |   -6.658 us |  -0.48% |   FAIL   |
|   F32   |      I32      |      2^16      |        1         |   7.926 us |       7.78% |   7.763 us |       7.52% |   -0.163 us |  -2.06% |   PASS   |
|   F32   |      I32      |      2^20      |        1         |  17.806 us |       3.25% |  18.004 us |       3.16% |    0.198 us |   1.11% |   PASS   |
|   F32   |      I32      |      2^24      |        1         | 176.647 us |       0.82% | 181.281 us |       1.09% |    4.635 us |   2.62% |   FAIL   |
|   F32   |      I32      |      2^28      |        1         |   2.631 ms |       0.08% |   2.719 ms |       0.12% |   87.018 us |   3.31% |   FAIL   |
|   F32   |      I64      |      2^16      |        1         |   7.967 us |       7.85% |   7.785 us |       7.46% |   -0.182 us |  -2.28% |   PASS   |
|   F32   |      I64      |      2^20      |        1         |  17.821 us |       3.16% |  18.016 us |       3.38% |    0.195 us |   1.09% |   PASS   |
|   F32   |      I64      |      2^24      |        1         | 176.654 us |       0.81% | 181.368 us |       1.19% |    4.715 us |   2.67% |   FAIL   |
|   F32   |      I64      |      2^28      |        1         |   2.630 ms |       0.07% |   2.717 ms |       0.11% |   86.612 us |   3.29% |   FAIL   |
|   F64   |      I32      |      2^16      |        1         |   8.698 us |       6.32% |   8.637 us |       6.17% |   -0.061 us |  -0.70% |   PASS   |
|   F64   |      I32      |      2^20      |        1         |  30.434 us |       3.65% |  31.367 us |       3.61% |    0.932 us |   3.06% |   PASS   |
|   F64   |      I32      |      2^24      |        1         | 326.779 us |       0.27% | 328.449 us |       0.26% |    1.670 us |   0.51% |   FAIL   |
|   F64   |      I32      |      2^28      |        1         |   5.024 ms |       0.06% |   5.039 ms |       0.06% |   14.863 us |   0.30% |   FAIL   |
|   F64   |      I64      |      2^16      |        1         |   8.734 us |       6.41% |   8.718 us |       6.58% |   -0.016 us |  -0.19% |   PASS   |
|   F64   |      I64      |      2^20      |        1         |  30.473 us |       3.66% |  31.345 us |       3.23% |    0.872 us |   2.86% |   PASS   |
|   F64   |      I64      |      2^24      |        1         | 326.923 us |       0.27% | 328.153 us |       0.26% |    1.230 us |   0.38% |   FAIL   |
|   F64   |      I64      |      2^28      |        1         |   5.024 ms |       0.06% |   5.035 ms |       0.06% |   10.581 us |   0.21% |   FAIL   |
|  I128   |      I32      |      2^16      |        1         |  10.349 us |       5.39% |  10.186 us |       5.89% |   -0.163 us |  -1.58% |   PASS   |
|  I128   |      I32      |      2^20      |        1         |  52.132 us |       1.70% |  52.088 us |       1.83% |   -0.044 us |  -0.08% |   PASS   |
|  I128   |      I32      |      2^24      |        1         | 635.185 us |       0.14% | 634.649 us |       0.14% |   -0.536 us |  -0.08% |   PASS   |
|  I128   |      I32      |      2^28      |        1         |   9.945 ms |       0.02% |   9.932 ms |       0.02% |  -13.547 us |  -0.14% |   FAIL   |
|  I128   |      I64      |      2^16      |        1         |  10.349 us |       5.27% |  10.166 us |       5.95% |   -0.182 us |  -1.76% |   PASS   |
|  I128   |      I64      |      2^20      |        1         |  52.173 us |       1.89% |  51.938 us |       1.72% |   -0.235 us |  -0.45% |   PASS   |
|  I128   |      I64      |      2^24      |        1         | 635.157 us |       0.14% | 634.610 us |       0.13% |   -0.547 us |  -0.09% |   PASS   |
|  I128   |      I64      |      2^28      |        1         |   9.944 ms |       0.02% |   9.932 ms |       0.02% |  -12.607 us |  -0.13% |   FAIL   |

@bernhardmgruber added the `cub` label (for all items related to CUB) on Sep 9, 2024
@bernhardmgruber force-pushed the `transform_A100` branch 2 times, most recently from 87cb4fc to 1d601b1, on September 9, 2024 16:42
bernhardmgruber commented Sep 11, 2024

github-actions bot commented Nov 4, 2024

🟨 CI finished in 1h 32m: Pass: 81%/222 | Total: 23h 21m | Avg: 6m 18s | Max: 26m 03s | Hits: 99%/14639
  • 🟨 cub: Pass: 63%/110 | Total: 11h 12m | Avg: 6m 06s | Max: 26m 03s | Hits: 97%/1474

    🔍 jobs: Build 🔍
      🔍 Build              Pass:  60%/102 | Total:  8h 12m | Avg:  4m 49s | Max: 13m 55s | Hits:  97%/1474  
      🟩 DeviceLaunch       Pass: 100%/1   | Total: 24m 10s | Avg: 24m 10s | Max: 24m 10s
      🟩 GraphCapture       Pass: 100%/1   | Total: 17m 13s | Avg: 17m 13s | Max: 17m 13s
      🟩 HostLaunch         Pass: 100%/3   | Total:  1h 05m | Avg: 21m 43s | Max: 25m 05s
      🟩 TestGPU            Pass: 100%/3   | Total:  1h 12m | Avg: 24m 15s | Max: 26m 03s
    🟨 cxx
      🟨 Clang9             Pass:  33%/6   | Total: 25m 59s | Avg:  4m 19s | Max:  6m 22s
      🟨 Clang10            Pass:  66%/3   | Total: 14m 41s | Avg:  4m 53s | Max:  6m 07s
      🟨 Clang11            Pass:  75%/4   | Total: 18m 50s | Avg:  4m 42s | Max:  5m 35s
      🟨 Clang12            Pass:  75%/4   | Total: 19m 04s | Avg:  4m 46s | Max:  5m 37s
      🟨 Clang13            Pass:  75%/4   | Total: 19m 05s | Avg:  4m 46s | Max:  5m 35s
      🟨 Clang14            Pass:  75%/4   | Total: 20m 27s | Avg:  5m 06s | Max:  6m 03s
      🟨 Clang15            Pass:  75%/4   | Total: 19m 24s | Avg:  4m 51s | Max:  5m 41s
      🟨 Clang16            Pass:  75%/4   | Total: 20m 03s | Avg:  5m 00s | Max:  6m 04s
      🟨 Clang17            Pass:  75%/4   | Total: 19m 19s | Avg:  4m 49s | Max:  5m 49s
      🟨 Clang18            Pass:  81%/11  | Total:  1h 29m | Avg:  8m 10s | Max: 25m 05s
      🟥 GCC6               Pass:   0%/2   | Total:  4m 43s | Avg:  2m 21s | Max:  2m 23s
      🟨 GCC7               Pass:  33%/6   | Total: 18m 54s | Avg:  3m 09s | Max:  5m 10s
      🟨 GCC8               Pass:  33%/6   | Total: 19m 41s | Avg:  3m 16s | Max:  5m 14s
      🟨 GCC9               Pass:  33%/6   | Total: 19m 46s | Avg:  3m 17s | Max:  5m 18s
      🟨 GCC10              Pass:  75%/4   | Total: 19m 54s | Avg:  4m 58s | Max:  6m 06s
      🟨 GCC11              Pass:  42%/7   | Total: 27m 38s | Avg:  3m 56s | Max:  6m 05s
      🟨 GCC12              Pass:  75%/4   | Total: 19m 50s | Avg:  4m 57s | Max:  5m 50s
      🟨 GCC13              Pass:  87%/16  | Total:  2h 56m | Avg: 11m 02s | Max: 26m 03s
      🟨 Intel2023.2.0      Pass:  66%/3   | Total: 16m 01s | Avg:  5m 20s | Max:  6m 35s
      🟥 MSVC14.16          Pass:   0%/1   | Total: 13m 55s | Avg: 13m 55s | Max: 13m 55s
      🟨 MSVC14.29          Pass:  50%/2   | Total: 22m 47s | Avg: 11m 23s | Max: 12m 12s | Hits:  97%/737   
      🟩 MSVC14.39          Pass: 100%/1   | Total: 12m 02s | Avg: 12m 02s | Max: 12m 02s | Hits:  97%/737   
      🟨 NVHPC24.7          Pass:  75%/4   | Total: 33m 26s | Avg:  8m 21s | Max: 10m 15s
    🟨 std
      🟨 11                 Pass:  80%/30  | Total:  3h 00m | Avg:  6m 00s | Max: 21m 46s
      🟥 14                 Pass:   0%/29  | Total:  1h 41m | Avg:  3m 30s | Max: 13m 55s
      🟨 17                 Pass:  81%/27  | Total:  2h 23m | Avg:  5m 18s | Max: 12m 12s | Hits:  97%/737   
      🟩 20                 Pass: 100%/24  | Total:  4h 06m | Avg: 10m 16s | Max: 26m 03s | Hits:  97%/737   
    🟨 gpu
      🟨 v100               Pass:  63%/110 | Total: 11h 12m | Avg:  6m 06s | Max: 26m 03s | Hits:  97%/1474  
    🟨 cpu
      🟨 amd64              Pass:  62%/102 | Total: 10h 33m | Avg:  6m 12s | Max: 26m 03s | Hits:  97%/1474  
      🟨 arm64              Pass:  75%/8   | Total: 38m 04s | Avg:  4m 45s | Max:  6m 09s
    🟨 ctk
      🟥 11.1               Pass:   0%/15  | Total: 49m 17s | Avg:  3m 17s | Max: 13m 55s
      🟥 11.8               Pass:   0%/3   | Total:  7m 42s | Avg:  2m 34s | Max:  2m 43s
      🟨 12.5               Pass:  75%/4   | Total: 33m 26s | Avg:  8m 21s | Max: 10m 15s
      🟨 12.6               Pass:  76%/88  | Total:  9h 41m | Avg:  6m 36s | Max: 26m 03s | Hits:  97%/1474  
    🟨 cudacxx
      🟨 ClangCUDA18        Pass:  75%/4   | Total: 16m 05s | Avg:  4m 01s | Max:  4m 21s
      🟥 nvcc11.1           Pass:   0%/15  | Total: 49m 17s | Avg:  3m 17s | Max: 13m 55s
      🟥 nvcc11.8           Pass:   0%/3   | Total:  7m 42s | Avg:  2m 34s | Max:  2m 43s
      🟨 nvcc12.5           Pass:  75%/4   | Total: 33m 26s | Avg:  8m 21s | Max: 10m 15s
      🟨 nvcc12.6           Pass:  76%/84  | Total:  9h 25m | Avg:  6m 43s | Max: 26m 03s | Hits:  97%/1474  
    🟨 cudacxx_family
      🟨 ClangCUDA          Pass:  75%/4   | Total: 16m 05s | Avg:  4m 01s | Max:  4m 21s
      🟨 nvcc               Pass:  63%/106 | Total: 10h 55m | Avg:  6m 11s | Max: 26m 03s | Hits:  97%/1474  
    🟨 cxx_family
      🟨 Clang              Pass:  70%/48  | Total:  4h 26m | Avg:  5m 33s | Max: 25m 05s
      🟨 GCC                Pass:  56%/51  | Total:  5h 07m | Avg:  6m 01s | Max: 26m 03s
      🟨 Intel              Pass:  66%/3   | Total: 16m 01s | Avg:  5m 20s | Max:  6m 35s
      🟨 MSVC               Pass:  50%/4   | Total: 48m 44s | Avg: 12m 11s | Max: 13m 55s | Hits:  97%/1474  
      🟨 NVHPC              Pass:  75%/4   | Total: 33m 26s | Avg:  8m 21s | Max: 10m 15s
    🟨 sm
      🟥 60;70;80;90        Pass:   0%/3   | Total:  7m 42s | Avg:  2m 34s | Max:  2m 43s
      🟨 90a                Pass:  75%/4   | Total: 15m 12s | Avg:  3m 48s | Max:  4m 30s
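
One way to locate the failing cub configurations in the summary above is to parse each row for its pass rate and list the rows at 0%. The sketch below is a hypothetical helper (`failing_rows` is not part of the CI tooling) and assumes the `<icon> <name> Pass: <pct>%/<jobs>` row layout shown in this comment.

```python
import re

# Matches rows such as "🟥 GCC6   Pass:   0%/2   | Total: ...".
ROW = re.compile(r"^\s*\S+\s+(?P<name>\S+)\s+Pass:\s*(?P<pct>\d+)%/(?P<jobs>\d+)")


def failing_rows(summary, threshold=0):
    """Return (name, pct, jobs) for rows whose pass rate is <= threshold."""
    found = []
    for line in summary.splitlines():
        m = ROW.match(line)
        if m and int(m["pct"]) <= threshold:
            found.append((m["name"], int(m["pct"]), int(m["jobs"])))
    return found


# A few rows copied from the cub section of the CI comment.
sample = """\
      🟥 GCC6               Pass:   0%/2   | Total:  4m 43s | Avg:  2m 21s | Max:  2m 23s
      🟨 GCC7               Pass:  33%/6   | Total: 18m 54s | Avg:  3m 09s | Max:  5m 10s
      🟥 14                 Pass:   0%/29  | Total:  1h 41m | Avg:  3m 30s | Max: 13m 55s
      🟩 20                 Pass: 100%/24  | Total:  4h 06m | Avg: 10m 16s | Max: 26m 03s
      🟥 nvcc11.1           Pass:   0%/15  | Total: 49m 17s | Avg:  3m 17s | Max: 13m 55s
"""
for name, pct, jobs in failing_rows(sample):
    print(f"{name}: {pct}% of {jobs} jobs passed")
```

In the full cub section, the rows at 0% are GCC6, MSVC14.16, std 14, CTK 11.1 and 11.8 (nvcc11.1 and nvcc11.8), and the `60;70;80;90` sm list, which suggests the failures are concentrated in C++14 builds and CUDA 11.x toolkits.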
    
  • 🟩 thrust: Pass: 100%/109 | Total: 11h 43m | Avg: 6m 27s | Max: 20m 53s | Hits: 99%/13165

    🟩 cpu
      🟩 amd64              Pass: 100%/101 | Total: 11h 05m | Avg:  6m 35s | Max: 20m 53s | Hits:  99%/13165 
      🟩 arm64              Pass: 100%/8   | Total: 38m 28s | Avg:  4m 48s | Max:  5m 11s
    🟩 ctk
      🟩 11.1               Pass: 100%/15  | Total:  1h 19m | Avg:  5m 16s | Max: 17m 10s | Hits:  99%/2633  
      🟩 11.8               Pass: 100%/3   | Total: 15m 47s | Avg:  5m 15s | Max:  5m 21s
      🟩 12.5               Pass: 100%/4   | Total:  1h 04m | Avg: 16m 03s | Max: 16m 24s
      🟩 12.6               Pass: 100%/87  | Total:  9h 04m | Avg:  6m 15s | Max: 20m 53s | Hits:  99%/10532 
    🟩 cudacxx
      🟩 ClangCUDA18        Pass: 100%/4   | Total: 19m 28s | Avg:  4m 52s | Max:  5m 30s
      🟩 nvcc11.1           Pass: 100%/15  | Total:  1h 19m | Avg:  5m 16s | Max: 17m 10s | Hits:  99%/2633  
      🟩 nvcc11.8           Pass: 100%/3   | Total: 15m 47s | Avg:  5m 15s | Max:  5m 21s
      🟩 nvcc12.5           Pass: 100%/4   | Total:  1h 04m | Avg: 16m 03s | Max: 16m 24s
      🟩 nvcc12.6           Pass: 100%/83  | Total:  8h 45m | Avg:  6m 19s | Max: 20m 53s | Hits:  99%/10532 
    🟩 cudacxx_family
      🟩 ClangCUDA          Pass: 100%/4   | Total: 19m 28s | Avg:  4m 52s | Max:  5m 30s
      🟩 nvcc               Pass: 100%/105 | Total: 11h 24m | Avg:  6m 31s | Max: 20m 53s | Hits:  99%/13165 
    🟩 cxx
      🟩 Clang9             Pass: 100%/6   | Total: 33m 57s | Avg:  5m 39s | Max:  7m 11s
      🟩 Clang10            Pass: 100%/3   | Total: 19m 51s | Avg:  6m 37s | Max:  7m 12s
      🟩 Clang11            Pass: 100%/4   | Total: 20m 44s | Avg:  5m 11s | Max:  5m 32s
      🟩 Clang12            Pass: 100%/4   | Total: 21m 08s | Avg:  5m 17s | Max:  5m 41s
      🟩 Clang13            Pass: 100%/4   | Total: 20m 25s | Avg:  5m 06s | Max:  5m 15s
      🟩 Clang14            Pass: 100%/4   | Total: 20m 31s | Avg:  5m 07s | Max:  5m 43s
      🟩 Clang15            Pass: 100%/4   | Total: 21m 48s | Avg:  5m 27s | Max:  5m 47s
      🟩 Clang16            Pass: 100%/4   | Total: 21m 10s | Avg:  5m 17s | Max:  5m 19s
      🟩 Clang17            Pass: 100%/4   | Total: 21m 19s | Avg:  5m 19s | Max:  5m 49s
      🟩 Clang18            Pass: 100%/11  | Total:  1h 05m | Avg:  5m 59s | Max: 15m 10s
      🟩 GCC6               Pass: 100%/2   | Total:  8m 29s | Avg:  4m 14s | Max:  4m 18s
      🟩 GCC7               Pass: 100%/6   | Total: 26m 54s | Avg:  4m 29s | Max:  5m 01s
      🟩 GCC8               Pass: 100%/6   | Total: 28m 14s | Avg:  4m 42s | Max:  5m 17s
      🟩 GCC9               Pass: 100%/6   | Total: 29m 20s | Avg:  4m 53s | Max:  5m 52s
      🟩 GCC10              Pass: 100%/4   | Total: 21m 04s | Avg:  5m 16s | Max:  5m 33s
      🟩 GCC11              Pass: 100%/7   | Total: 37m 22s | Avg:  5m 20s | Max:  5m 44s
      🟩 GCC12              Pass: 100%/4   | Total: 22m 32s | Avg:  5m 38s | Max:  6m 11s
      🟩 GCC13              Pass: 100%/14  | Total:  1h 33m | Avg:  6m 42s | Max: 16m 07s
      🟩 Intel2023.2.0      Pass: 100%/3   | Total: 20m 13s | Avg:  6m 44s | Max:  7m 10s
      🟩 MSVC14.16          Pass: 100%/1   | Total: 17m 10s | Avg: 17m 10s | Max: 17m 10s | Hits:  99%/2633  
      🟩 MSVC14.29          Pass: 100%/2   | Total: 30m 37s | Avg: 15m 18s | Max: 15m 39s | Hits:  99%/5266  
      🟩 MSVC14.39          Pass: 100%/2   | Total: 37m 00s | Avg: 18m 30s | Max: 20m 53s | Hits:  99%/5266  
      🟩 NVHPC24.7          Pass: 100%/4   | Total:  1h 04m | Avg: 16m 03s | Max: 16m 24s
    🟩 cxx_family
      🟩 Clang              Pass: 100%/48  | Total:  4h 26m | Avg:  5m 33s | Max: 15m 10s
      🟩 GCC                Pass: 100%/49  | Total:  4h 27m | Avg:  5m 27s | Max: 16m 07s
      🟩 Intel              Pass: 100%/3   | Total: 20m 13s | Avg:  6m 44s | Max:  7m 10s
      🟩 MSVC               Pass: 100%/5   | Total:  1h 24m | Avg: 16m 57s | Max: 20m 53s | Hits:  99%/13165 
      🟩 NVHPC              Pass: 100%/4   | Total:  1h 04m | Avg: 16m 03s | Max: 16m 24s
    🟩 gpu
      🟩 v100               Pass: 100%/109 | Total: 11h 43m | Avg:  6m 27s | Max: 20m 53s | Hits:  99%/13165 
    🟩 jobs
      🟩 Build              Pass: 100%/102 | Total: 10h 14m | Avg:  6m 01s | Max: 17m 10s | Hits:  99%/10532 
      🟩 TestCPU            Pass: 100%/4   | Total: 42m 42s | Avg: 10m 40s | Max: 20m 53s | Hits:  99%/2633  
      🟩 TestGPU            Pass: 100%/3   | Total: 46m 29s | Avg: 15m 29s | Max: 16m 07s
    🟩 sm
      🟩 60;70;80;90        Pass: 100%/3   | Total: 15m 47s | Avg:  5m 15s | Max:  5m 21s
      🟩 90a                Pass: 100%/4   | Total: 17m 24s | Avg:  4m 21s | Max:  4m 45s
    🟩 std
      🟩 11                 Pass: 100%/30  | Total:  2h 50m | Avg:  5m 40s | Max: 16m 24s
      🟩 14                 Pass: 100%/29  | Total:  3h 00m | Avg:  6m 13s | Max: 17m 10s | Hits:  99%/5266  
      🟩 17                 Pass: 100%/27  | Total:  2h 50m | Avg:  6m 18s | Max: 16m 22s | Hits:  99%/2633  
      🟩 20                 Pass: 100%/23  | Total:  3h 02m | Avg:  7m 56s | Max: 20m 53s | Hits:  99%/5266  
    
  • 🟩 cccl_c_parallel: Pass: 100%/2 | Total: 9m 13s | Avg: 4m 36s | Max: 7m 18s

    🟩 cpu
      🟩 amd64              Pass: 100%/2   | Total:  9m 13s | Avg:  4m 36s | Max:  7m 18s
    🟩 ctk
      🟩 12.6               Pass: 100%/2   | Total:  9m 13s | Avg:  4m 36s | Max:  7m 18s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/2   | Total:  9m 13s | Avg:  4m 36s | Max:  7m 18s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/2   | Total:  9m 13s | Avg:  4m 36s | Max:  7m 18s
    🟩 cxx
      🟩 GCC13              Pass: 100%/2   | Total:  9m 13s | Avg:  4m 36s | Max:  7m 18s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/2   | Total:  9m 13s | Avg:  4m 36s | Max:  7m 18s
    🟩 gpu
      🟩 v100               Pass: 100%/2   | Total:  9m 13s | Avg:  4m 36s | Max:  7m 18s
    🟩 jobs
      🟩 Build              Pass: 100%/1   | Total:  1m 55s | Avg:  1m 55s | Max:  1m 55s
      🟩 Test               Pass: 100%/1   | Total:  7m 18s | Avg:  7m 18s | Max:  7m 18s
    
  • 🟩 python: Pass: 100%/1 | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s

    🟩 cpu
      🟩 amd64              Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 ctk
      🟩 12.6               Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 cudacxx
      🟩 nvcc12.6           Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 cudacxx_family
      🟩 nvcc               Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 cxx
      🟩 GCC13              Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 cxx_family
      🟩 GCC                Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 gpu
      🟩 v100               Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    🟩 jobs
      🟩 Test               Pass: 100%/1   | Total: 16m 23s | Avg: 16m 23s | Max: 16m 23s
    

👃 Inspect Changes

Modifications in project?

|     | Project                 |
|-----|-------------------------|
|     | CCCL Infrastructure     |
|     | libcu++                 |
| +/- | CUB                     |
|     | Thrust                  |
|     | CUDA Experimental       |
|     | python                  |
|     | CCCL C Parallel Library |
|     | Catch2Helper            |

Modifications in project or dependencies?

|     | Project                 |
|-----|-------------------------|
|     | CCCL Infrastructure     |
|     | libcu++                 |
| +/- | CUB                     |
| +/- | Thrust                  |
|     | CUDA Experimental       |
| +/- | python                  |
| +/- | CCCL C Parallel Library |
| +/- | Catch2Helper            |

🏃‍ Runner counts (total jobs: 222)

|  #  | Runner                        |
|-----|-------------------------------|
| 184 | linux-amd64-cpu16             |
|  16 | linux-arm64-cpu16             |
|  13 | linux-amd64-gpu-v100-latest-1 |
|   9 | windows-amd64-cpu16           |

Labels
cub For all items related to CUB
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

Readd, optimize and profile memcpy_async-based transform kernel for A100