vulkan: im2col and matmul optimizations for stable diffusion #10942

jeffbolznv · 2024-12-22T04:57:14Z

I was looking at performance using the command line from leejet/stable-diffusion.cpp#439, just with a larger resolution (640x640). These changes improve perf (with NV_coopmat2) from 3.68it/s to 5.07it/s on RTX 4070.

before:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               9828 runs -   148.23 us/run -    10244 kB/run -   65.91 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               2460 runs -   548.56 us/run -    40964 kB/run -   71.23 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      156 runs -  8284.48 us/run -   655364 kB/run -   75.59 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      656 runs -  2168.58 us/run -   102445 kB/run -   45.07 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      246 runs -  5501.29 us/run -   409645 kB/run -   71.10 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   316.99 us/run -    23536 kB/run -   70.81 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                670 runs -  1521.94 us/run -   100208 kB/run -   62.80 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       60 runs - 22757.88 us/run -  1678448 kB/run -   70.47 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      429 runs -  3180.95 us/run -   235365 kB/run -   70.59 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      102 runs - 13987.68 us/run -  1002085 kB/run -   68.40 GB/s

  stable_diffusion 3.68it/s

change matmul tile size:
  stable_diffusion 4.65it/s

change matmul tile size + im2col opts:
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              26208 runs -    42.28 us/run -    10244 kB/run -  231.08 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4100 runs -   282.72 us/run -    40964 kB/run -  138.20 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      156 runs -  8237.07 us/run -   655364 kB/run -   76.02 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                     2296 runs -   495.94 us/run -   102445 kB/run -  197.06 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      492 runs -  2333.77 us/run -   409645 kB/run -  167.60 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               4278 runs -   248.65 us/run -    23536 kB/run -   90.27 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                670 runs -  2075.88 us/run -   100208 kB/run -   46.04 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       60 runs - 22675.12 us/run -  1678448 kB/run -   70.73 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      429 runs -  2332.59 us/run -   235365 kB/run -   96.26 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      102 runs - 13691.37 us/run -  1002085 kB/run -   69.88 GB/s

  stable_diffusion 5.07it/s

daniandtheweb · 2024-12-22T12:47:54Z

I've tested those changes on my RX 5700 XT and it seems to have a mix of huge improvements and regressions (some specific calculations have a really unreliable behaviour).

master:

WARNING: radv is not a conformant Vulkan implementation, testing use only.
ggml_vulkan: Found 1 Vulkan devices:
ggml_vulkan: 0 = AMD Radeon RX 5700 XT (RADV NAVI10) (radv) | uma: 0 | fp16: 1 | warp size: 64 | matrix cores: none
Testing 2 devices

Backend 1/2: Vulkan0
ggml_vulkan: Compiling shaders..........................Done!
  Device description: AMD Radeon RX 5700 XT (RADV NAVI10)
  Device memory: 8176 MB (8176 MB free)

  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               3276 runs -   528.58 us/run -    10244 kB/run -   18.48 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                820 runs -  3604.00 us/run -    40964 kB/run -   10.84 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 72597.98 us/run -   655364 kB/run -    8.63 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      328 runs -  7463.24 us/run -   102445 kB/run -   13.09 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 907185.68 us/run -   409645 kB/run -    0.43 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               2852 runs -   596.99 us/run -    23536 kB/run -   37.60 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  9167.82 us/run -   100208 kB/run -   10.43 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 161781.30 us/run -  1678448 kB/run -    9.91 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      143 runs - 11178.17 us/run -   235365 kB/run -   20.09 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 597637.38 us/run -  1002085 kB/run -    1.60 GB/s
  Backend Vulkan0: OK

PR run 1:

  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    95.40 us/run -    10244 kB/run -  102.41 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1640 runs -  1103.94 us/run -    40964 kB/run -   35.39 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 201101.04 us/run -   655364 kB/run -    3.11 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      984 runs -  1131.72 us/run -   102445 kB/run -   86.35 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 362910.74 us/run -   409645 kB/run -    1.08 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               5704 runs -   197.11 us/run -    23536 kB/run -  113.88 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  3319.92 us/run -   100208 kB/run -   28.79 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 3167804.15 us/run -  1678448 kB/run -    0.51 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1): radv/amdgpu: The CS has been cancelled because the context is lost. This context is innocent.
terminate called after throwing an instance of 'vk::DeviceLostError'
  what():  vk::Queue::submit: ErrorDeviceLost
zsh: IOT instruction (core dumped)  ./test-backend-ops perf -o IM2COL

PR run 2:


  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):              13104 runs -    95.36 us/run -    10244 kB/run -  102.46 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               1640 runs -  1103.63 us/run -    40964 kB/run -   35.40 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[3,3,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       52 runs - 1892911.98 us/run -   655364 kB/run -    0.33 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      984 runs -  1130.94 us/run -   102445 kB/run -   86.41 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[3,3,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       82 runs - 36353.83 us/run -   409645 kB/run -   10.76 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):               5704 runs -   196.57 us/run -    23536 kB/run -  114.19 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                335 runs -  3318.93 us/run -   100208 kB/run -   28.80 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[256,256,256,1],ne_kernel=[5,5,256,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       20 runs - 291055.75 us/run -  1678448 kB/run -    5.51 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[32,32,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                      429 runs -  2499.61 us/run -   235365 kB/run -   89.83 GB/s
  IM2COL(type_input=f32,type_kernel=f16,dst_type=f32,ne_input=[64,64,2560,1],ne_kernel=[5,5,2560,1],s0=1,s1=1,p0=1,p1=1,d0=1,d1=1,is_2D=1):                       34 runs - 69889.82 us/run -  1002085 kB/run -   13.69 GB/s
  Backend Vulkan0: OK

jeffbolznv added 3 commits December 21, 2024 16:11

tests: Add im2col perf tests

2074498

vulkan: optimize im2col, more elements per thread

2625283

vulkan: increase small tile size for NV_coopmat2

e52a0f2

jeffbolznv requested a review from 0cc4m December 22, 2024 04:57

github-actions bot added testing Everything test related Vulkan Issues specific to the Vulkan backend ggml changes relating to the ggml tensor library for machine learning labels Dec 22, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

vulkan: im2col and matmul optimizations for stable diffusion #10942

vulkan: im2col and matmul optimizations for stable diffusion #10942

jeffbolznv commented Dec 22, 2024

daniandtheweb commented Dec 22, 2024 •

edited

Loading

vulkan: im2col and matmul optimizations for stable diffusion #10942

Are you sure you want to change the base?

vulkan: im2col and matmul optimizations for stable diffusion #10942

Conversation

jeffbolznv commented Dec 22, 2024

daniandtheweb commented Dec 22, 2024 • edited Loading

daniandtheweb commented Dec 22, 2024 •

edited

Loading