
[Graph][Quantization] Multi-stage software pipelining and update parallel k rule #364

Merged: 17 commits into hidet-org:main from new-quant-matmul on Jan 14, 2024

Conversation

Aalanli
Collaborator

@Aalanli Aalanli commented Oct 10, 2023

Update the quantization implementation to support multi-stage software pipelining and the vectorized upcasting trick.
Update the operator resolution rules to support parallel-k search.
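For context, parallel-k (split-k) matmul partitions the reduction dimension K into independent slices whose partial products are computed concurrently and then summed. The following is an illustrative numpy sketch of the idea, not hidet's actual kernel:

```python
import numpy as np

def matmul_parallel_k(a, b, k_parts=4):
    """Illustrative parallel-k (split-k) matmul: the K reduction
    dimension is split into `k_parts` independent partial matmuls
    (which a GPU can run concurrently), followed by a final sum."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and k % k_parts == 0
    kc = k // k_parts
    # Each part computes a partial product over its own K-slice.
    partials = np.stack([
        a[:, i * kc:(i + 1) * kc] @ b[i * kc:(i + 1) * kc, :]
        for i in range(k_parts)
    ])
    # Final reduction over the parts (a second kernel, or atomics).
    return partials.sum(axis=0)
```

Splitting K raises parallelism when M and N are small (e.g. batch-1 decoding shapes), at the cost of the extra reduction step.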

@yaoyaoding
Member

Hi @Aalanli, can you post performance numbers comparing the kernel with and without the casting trick, and the improvement from parallel-k?

@Aalanli
Collaborator Author

Aalanli commented Oct 11, 2023

Yes, I can get the performance numbers after the previous PR is merged.
I am also thinking of implementing atomics for parallel-k, since the performance of parallel-k for the int8 kernel is better than expected. Is it OK for me to merge your red atomic instruction code into main?

@Aalanli
Collaborator Author

Aalanli commented Oct 13, 2023

It seems that the packed conversion offers only minor performance improvements, while a higher number of pipeline stages improves performance for some shapes but regresses it on others. I think this is due to reduced occupancy from higher shared memory usage: the second implementation uses more shared memory than the first (indeed, according to ncu, the first implementation is register-limited while the second is shared-memory-limited).

It may be helpful to benchmark this on an A100, where higher pipelining stage counts are more likely to show improvements.
@yaoyaoding

# k-parts=1
#        bench_ref  bench_packed_quant
# 1024    0.062464            0.060416
# 2048    0.062464            0.061664
# 4096    0.062464            0.065536
# 8192    0.201728            0.190464
# 16384   0.780288            0.726208
# k-parts=4
#        bench_ref  bench_packed_quant
# 1024    0.059392            0.060208
# 2048    0.062464            0.061440
# 4096    0.061440            0.061440
# 8192    0.176128            0.192512
# 16384   0.632832            0.688416

# k-parts=4
#                   bench_ref  bench_packed_quant
# (1, 4096, 11008)   0.063488            0.065536
# (1, 11008, 4096)   0.067584            0.064512
# (1, 4096, 4096)    0.065536            0.064512
# (1, 4096, 32000)   0.155648            0.157696
#                     bench_ref  bench_packed_quant
# (128, 4096, 11008)   0.137216            0.142336
# (128, 11008, 4096)   0.129024            0.130048
# (128, 4096, 4096)    0.064512            0.063488
# (128, 4096, 32000)   0.337920            0.352256
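The packed conversion benchmarked above refers to the vectorized upcasting trick. The underlying bit manipulation (as used in similar quantized kernels; this is a hedged numpy illustration, not hidet's CUDA code) converts unsigned 8-bit values to fp16 without a convert instruction:

```python
import numpy as np

def upcast_u8_to_f16(v):
    """Bit-trick upcast of uint8 -> float16, avoiding a cvt instruction.

    ORing an 8-bit value into the low mantissa bits of the fp16
    constant 1024.0 (bit pattern 0x6400) yields the fp16 value
    1024 + v exactly, because fp16 has a 10-bit mantissa and spacing 1
    in [1024, 2048); subtracting 1024 then recovers v. On GPU the same
    trick is applied to packed registers (e.g. via lop3/prmt), so
    several values are converted per instruction.
    """
    v = np.asarray(v, dtype=np.uint16)
    bits = np.uint16(0x6400) | v          # 0x6400 is fp16 1024.0
    return bits.view(np.float16) - np.float16(1024.0)
```

The subtraction is exact for all v in [0, 255], so the trick is lossless.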

@yaoyaoding
Member

yaoyaoding commented Oct 16, 2023

I am also thinking of implementing atomics for parallel-k, since the performance of parallel-k for the int8 kernel is better than expected. Is it OK for me to merge your red atomic instruction code into main?

Sure, go ahead.

@yaoyaoding
Member

I think this is due to reduced occupancy from higher shared memory usage: the second implementation uses more shared memory than the first (indeed, according to ncu, the first implementation is register-limited while the second is shared-memory-limited).

In the future, we might put num_stages in our search space (like Triton does).
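Putting num_stages in the tuning search space could look roughly like the sketch below, with configs pruned when the multi-stage shared-memory footprint exceeds the per-block limit (the occupancy issue noted above). The tile names, sizes, and the 48 KiB limit here are illustrative assumptions, not hidet's actual schedule template:

```python
import itertools

SMEM_LIMIT = 48 * 1024          # assumed bytes of shared memory per block
ELEM_SIZE = 2                   # fp16 operands staged in shared memory

def smem_bytes(bm, bn, bk, num_stages):
    # Each pipeline stage buffers one A tile (bm x bk) and one B tile
    # (bk x bn), so smem grows linearly with the stage count.
    return num_stages * (bm * bk + bk * bn) * ELEM_SIZE

def candidate_schedules():
    space = itertools.product(
        [64, 128],          # block_m
        [64, 128],          # block_n
        [32, 64],           # block_k
        [2, 3, 4],          # num_stages, now part of the search space
    )
    # Prune configs that would not fit: deeper pipelines hide latency
    # but spend shared memory, which reduces occupancy.
    return [(bm, bn, bk, s) for bm, bn, bk, s in space
            if smem_bytes(bm, bn, bk, s) <= SMEM_LIMIT]
```

Letting the tuner pick num_stages per shape would resolve the some-shapes-regress behavior seen in the benchmarks above.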

@yaoyaoding yaoyaoding changed the title [Graph][Quantization] [Graph][Quantization] Multi-stage software pipelining and update parallel k rule Oct 16, 2023
@Aalanli
Collaborator Author

Aalanli commented Oct 16, 2023

I put num_stages in the search space. I forgot to mention that this PR also adds support for parallel-k search for the quantized kernel.

@yaoyaoding
Member

It looks good to me. Feel free to merge by yourself when you think it is ready.

@Aalanli
Collaborator Author

Aalanli commented Oct 19, 2023

Some results for quantized matmul implemented with atomic reduction for parallel-k.

As expected, for pk=1 there are minimal performance benefits:
[plots: quant_red_pk1, quant_red_pk1_mem_bound]

But for higher pk, there are non-negligible performance improvements:
[plots: quant_red_pk3, quant_red_pk3_mem_bound]

@yaoyaoding
Member

@Aalanli , is this PR ready to be merged?

@yaoyaoding
Member

yaoyaoding commented Jan 11, 2024

$hidet-ci launch

@yaoyaoding
Member

$hidet-ci launch

@yaoyaoding
Member

Hi @Aalanli, could you rebase this PR against the main branch? Thanks!

@Aalanli Aalanli merged commit 53922fc into hidet-org:main Jan 14, 2024
2 checks passed
@Aalanli Aalanli deleted the new-quant-matmul branch January 14, 2024 03:17