
[Graph][Quantization] Multi-stage software pipelining and update parallel k rule #364

Merged: 17 commits into hidet-org:main from new-quant-matmul on Jan 14, 2024

Conversation

Aalanli
Collaborator

@Aalanli Aalanli commented Oct 10, 2023

Update the quantization implementation to support multi-stage software pipelining and the vectorized upcasting trick.
Update the operator resolution rules to support parallel-k search.
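For context, parallel-k (split-k) matmul partitions the reduction dimension K into independent slices whose partial products are computed concurrently and then summed. The following is an illustrative numpy sketch of the idea, not hidet's actual kernel:

```python
import numpy as np

def matmul_parallel_k(a, b, k_parts=4):
    """Illustrative parallel-k (split-k) matmul: the K reduction
    dimension is split into `k_parts` independent partial matmuls
    (which a GPU can run concurrently), followed by a final sum."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and k % k_parts == 0
    kc = k // k_parts
    # Each part computes a partial product over its own K-slice.
    partials = np.stack([
        a[:, i * kc:(i + 1) * kc] @ b[i * kc:(i + 1) * kc, :]
        for i in range(k_parts)
    ])
    # Final reduction over the parts (a second kernel, or atomics).
    return partials.sum(axis=0)
```

Splitting K raises parallelism when M and N are small (e.g. batch-1 decoding shapes), at the cost of the extra reduction step.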

@yaoyaoding
Member

Hi @Aalanli, can you post performance numbers comparing the kernel with and without the casting trick, and the improvement from parallel-k?

@Aalanli
Collaborator Author

Aalanli commented Oct 11, 2023

Yes, I can get the performance numbers after the previous PR is merged.
I am also thinking of implementing atomics for parallel-k, since the performance of parallel-k for the int8 kernel is better than expected. Is it OK for me to merge your red atomic instruction code into main?

@Aalanli
Collaborator Author

Aalanli commented Oct 13, 2023

It seems that the packed conversion offers only minor performance improvements, while a higher number of pipeline stages improves performance for some shapes but regresses it on others. I think this is due to reduced occupancy from higher shared memory usage: the second implementation uses more shared memory than the first (indeed, according to ncu, the first implementation is register-limited while the second is shared-memory-limited).

It may be helpful to benchmark this on an A100, where higher pipelining stage counts are more likely to show improvements.
@yaoyaoding

# k-parts=1
#        bench_ref  bench_packed_quant
# 1024    0.062464            0.060416
# 2048    0.062464            0.061664
# 4096    0.062464            0.065536
# 8192    0.201728            0.190464
# 16384   0.780288            0.726208
# k-parts=4
#        bench_ref  bench_packed_quant
# 1024    0.059392            0.060208
# 2048    0.062464            0.061440
# 4096    0.061440            0.061440
# 8192    0.176128            0.192512
# 16384   0.632832            0.688416

# k-parts=4
#                   bench_ref  bench_packed_quant
# (1, 4096, 11008)   0.063488            0.065536
# (1, 11008, 4096)   0.067584            0.064512
# (1, 4096, 4096)    0.065536            0.064512
# (1, 4096, 32000)   0.155648            0.157696
#                     bench_ref  bench_packed_quant
# (128, 4096, 11008)   0.137216            0.142336
# (128, 11008, 4096)   0.129024            0.130048
# (128, 4096, 4096)    0.064512            0.063488
# (128, 4096, 32000)   0.337920            0.352256
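The packed conversion benchmarked above refers to the vectorized upcasting trick. The underlying bit manipulation (as used in similar quantized kernels; this is a hedged numpy illustration, not hidet's CUDA code) converts unsigned 8-bit values to fp16 without a convert instruction:

```python
import numpy as np

def upcast_u8_to_f16(v):
    """Bit-trick upcast of uint8 -> float16, avoiding a cvt instruction.

    ORing an 8-bit value into the low mantissa bits of the fp16
    constant 1024.0 (bit pattern 0x6400) yields the fp16 value
    1024 + v exactly, because fp16 has a 10-bit mantissa and spacing 1
    in [1024, 2048); subtracting 1024 then recovers v. On GPU the same
    trick is applied to packed registers (e.g. via lop3/prmt), so
    several values are converted per instruction.
    """
    v = np.asarray(v, dtype=np.uint16)
    bits = np.uint16(0x6400) | v          # 0x6400 is fp16 1024.0
    return bits.view(np.float16) - np.float16(1024.0)
```

The subtraction is exact for all v in [0, 255], so the trick is lossless.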

@yaoyaoding
Member

yaoyaoding commented Oct 16, 2023

I am also thinking of implementing atomics for parallel-k, since the performance of parallel-k for the int8 kernel is better than expected. Is it OK for me to merge your red atomic instruction code into main?

Sure, go ahead.

@yaoyaoding
Member

I think this is due to reduced occupancy from higher shared memory usage: the second implementation uses more shared memory than the first (indeed, according to ncu, the first implementation is register-limited while the second is shared-memory-limited).

In the future, we might put num_stages in our search space (like Triton does).
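Putting num_stages in the tuning search space could look roughly like the sketch below, with configs pruned when the multi-stage shared-memory footprint exceeds the per-block limit (the occupancy issue noted above). The tile names, sizes, and the 48 KiB limit here are illustrative assumptions, not hidet's actual schedule template:

```python
import itertools

SMEM_LIMIT = 48 * 1024          # assumed bytes of shared memory per block
ELEM_SIZE = 2                   # fp16 operands staged in shared memory

def smem_bytes(bm, bn, bk, num_stages):
    # Each pipeline stage buffers one A tile (bm x bk) and one B tile
    # (bk x bn), so smem grows linearly with the stage count.
    return num_stages * (bm * bk + bk * bn) * ELEM_SIZE

def candidate_schedules():
    space = itertools.product(
        [64, 128],          # block_m
        [64, 128],          # block_n
        [32, 64],           # block_k
        [2, 3, 4],          # num_stages, now part of the search space
    )
    # Prune configs that would not fit: deeper pipelines hide latency
    # but spend shared memory, which reduces occupancy.
    return [(bm, bn, bk, s) for bm, bn, bk, s in space
            if smem_bytes(bm, bn, bk, s) <= SMEM_LIMIT]
```

Letting the tuner pick num_stages per shape would resolve the some-shapes-regress behavior seen in the benchmarks above.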

@yaoyaoding yaoyaoding changed the title [Graph][Quantization] [Graph][Quantization] Multi-stage software pipelining and update parallel k rule Oct 16, 2023
@Aalanli
Collaborator Author

Aalanli commented Oct 16, 2023

I put num_stages in the search space. I forgot to mention that this PR also adds support for parallel-k search for the quantized kernel.

@yaoyaoding
Member

It looks good to me. Feel free to merge by yourself when you think it is ready.

@Aalanli
Collaborator Author

Aalanli commented Oct 19, 2023

Some results for quantized matmul implemented with atomic reduction for parallel-k.

As expected, for pk=1 there are minimal performance benefits:
[plots: quant_red_pk1, quant_red_pk1_mem_bound]

But for higher pk, there are non-negligible performance improvements:
[plots: quant_red_pk3, quant_red_pk3_mem_bound]

@yaoyaoding
Member

@Aalanli , is this PR ready to be merged?

@yaoyaoding
Member

yaoyaoding commented Jan 11, 2024

$hidet-ci launch

@yaoyaoding
Member

$hidet-ci launch

@yaoyaoding
Member

Hi @Aalanli, could you rebase this PR against the main branch? Thanks!

@Aalanli Aalanli merged commit 53922fc into hidet-org:main Jan 14, 2024
2 checks passed
@Aalanli Aalanli deleted the new-quant-matmul branch January 14, 2024 03:17