University of Pennsylvania, CIS 565: GPU Programming and Architecture, Project 2
- Jacky Lu
Performance Comparison Between GPU Scan Implementations (Naive, Work-Efficient, and Thrust) and CPU Scan Implementations
[ 12 3 19 17 37 34 6 17 18 22 23 8 49 ... 17 0 ]
==== cpu scan, power-of-two ====
elapsed time: 102.834ms (std::chrono Measured)
[ 0 12 15 34 51 88 122 128 145 163 185 208 216 ... 821676053 821676070 ]
==== cpu scan, non-power-of-two ====
elapsed time: 38.8949ms (std::chrono Measured)
[ 0 12 15 34 51 88 122 128 145 163 185 208 216 ... 821675987 821676034 ]
==== naive scan, power-of-two ====
elapsed time: 19.5621ms (CUDA Measured)
[ 0 12 15 34 51 88 122 128 145 163 185 208 216 ... 821676053 821676070 ]
==== naive scan, non-power-of-two ====
elapsed time: 19.9352ms (CUDA Measured)
[ 0 12 15 34 51 88 122 128 145 163 185 208 216 ... 821675987 821676034 ]
==== work-efficient scan, power-of-two ====
elapsed time: 14.7942ms (CUDA Measured)
[ 0 12 15 34 51 88 122 128 145 163 185 208 216 ... 821676053 821676070 ]
==== work-efficient scan, non-power-of-two ====
elapsed time: 14.8075ms (CUDA Measured)
[ 0 12 15 34 51 88 122 128 145 163 185 208 216 ... 821675987 821676034 ]
==== thrust scan, power-of-two ====
elapsed time: 0.8088ms (CUDA Measured)
[ 0 12 15 34 51 88 122 128 145 163 185 208 216 ... 821676053 821676070 ]
==== thrust scan, non-power-of-two ====
elapsed time: 0.806624ms (CUDA Measured)
[ 0 12 15 34 51 88 122 128 145 163 185 208 216 ... 821675987 821676034 ]
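Every scan result above begins with 0, so the tests compute an exclusive prefix sum. The sketch below is not the project's code. It is a minimal host-side C++ illustration, under stated assumptions, of what is being computed and of why the non-power-of-two case needs care in a work-efficient scan. The names `exclusiveScan` and `workEfficientScan` are hypothetical. The second function runs the up-sweep and down-sweep passes serially on the CPU and assumes zero-padding to the next power of two, which is one common way to handle arbitrary lengths and may differ from what this project does on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential exclusive prefix sum: out[i] = in[0] + ... + in[i-1], out[0] = 0.
std::vector<int> exclusiveScan(const std::vector<int>& in) {
    std::vector<int> out(in.size());
    int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i) {
        out[i] = running;   // sum of everything strictly before index i
        running += in[i];
    }
    return out;
}

// Up-sweep / down-sweep exclusive scan, executed serially here.
// Assumption: the input is zero-padded to the next power of two so the
// balanced tree is complete; only the first in.size() results are returned.
std::vector<int> workEfficientScan(const std::vector<int>& in) {
    std::size_t n = 1;
    while (n < in.size()) n <<= 1;          // next power of two >= in.size()
    assert((n & (n - 1)) == 0);             // padded length is a power of two

    std::vector<int> data(n, 0);
    for (std::size_t i = 0; i < in.size(); ++i) data[i] = in[i];

    // Up-sweep: build partial sums in place, doubling the stride each level.
    for (std::size_t stride = 1; stride < n; stride <<= 1) {
        for (std::size_t k = 0; k < n; k += 2 * stride) {
            data[k + 2 * stride - 1] += data[k + stride - 1];
        }
    }

    // Down-sweep: clear the root, then push prefix sums back down the tree.
    data[n - 1] = 0;
    for (std::size_t stride = n / 2; stride >= 1; stride >>= 1) {
        for (std::size_t k = 0; k < n; k += 2 * stride) {
            int left = data[k + stride - 1];
            data[k + stride - 1] = data[k + 2 * stride - 1];
            data[k + 2 * stride - 1] += left;
        }
    }

    data.resize(in.size());                 // drop the padding
    return data;
}
```

Applied to the first 13 input values printed above (12 3 19 17 37 34 6 17 18 22 23 8 49), both functions return 0 12 15 34 51 88 122 128 145 163 185 208 216, which matches the leading values of each scan result in the log. On a GPU, the inner `k` loop at each level would typically be the part distributed across threads.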
[ 0 2 1 2 0 0 3 3 1 2 2 2 2 ... 2 0 ]
==== cpu compact without scan, power-of-two ====
elapsed time: 60.1642ms (std::chrono Measured)
[ 2 1 2 3 3 1 2 2 2 2 3 1 3 ... 3 2 ]
==== cpu compact without scan, non-power-of-two ====
elapsed time: 60.4809ms (std::chrono Measured)
[ 2 1 2 3 3 1 2 2 2 2 3 1 3 ... 1 3 ]
==== cpu compact with scan ====
elapsed time: 205.018ms (std::chrono Measured)
[ 2 1 2 3 3 1 2 2 2 2 3 1 3 ... 3 2 ]
==== work-efficient compact, power-of-two ====
elapsed time: 17.7603ms (CUDA Measured)
[ 2 1 2 3 3 1 2 2 2 2 3 1 3 ... 3 2 ]
==== work-efficient compact, non-power-of-two ====
elapsed time: 17.4516ms (CUDA Measured)
[ 2 1 2 3 3 1 2 2 2 2 3 1 3 ... 1 3 ]
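The compaction results above keep the non-zero elements of the input in their original order. As a rough illustration of the "compact with scan" variant, the following host-side C++ sketch splits the work into the usual map, scan, and scatter steps. The function name `compactWithScan` is hypothetical and the code is an assumption about the general technique, not a copy of this project's CPU or GPU implementation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stream compaction via scan: remove zeros, preserve order.
std::vector<int> compactWithScan(const std::vector<int>& in) {
    const std::size_t n = in.size();

    // Step 1 (map): flag each element that should be kept.
    std::vector<int> flags(n);
    for (std::size_t i = 0; i < n; ++i) {
        flags[i] = (in[i] != 0) ? 1 : 0;
    }

    // Step 2 (scan): an exclusive prefix sum of the flags gives each kept
    // element its destination index; the running total is the output size.
    std::vector<int> indices(n);
    int count = 0;
    for (std::size_t i = 0; i < n; ++i) {
        indices[i] = count;
        count += flags[i];
    }

    // Step 3 (scatter): write every flagged element to its destination.
    std::vector<int> out(static_cast<std::size_t>(count));
    for (std::size_t i = 0; i < n; ++i) {
        if (flags[i]) {
            assert(static_cast<std::size_t>(indices[i]) < out.size());
            out[static_cast<std::size_t>(indices[i])] = in[i];
        }
    }
    return out;
}
```

For the first 13 input values shown in the log (0 2 1 2 0 0 3 3 1 2 2 2 2), this returns 2 1 2 3 3 1 2 2 2 2, consistent with the start of each compaction result. The three passes over the data are independent per element in steps 1 and 3, which is what makes this formulation suitable for a GPU once step 2 is replaced by a parallel scan; on a CPU the extra passes and temporary arrays add work compared with a single filtering loop.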