
[js/webgpu] Optimize MatMul with M = 1 #22577

Merged: 5 commits merged into microsoft:main on Nov 1, 2024
Conversation

qjia7 (Contributor) commented Oct 24, 2024

Description

BUG #22031

In the demucs model, there are many MatMul ops with shapes like the following:
input[0]: [3448,1,512] | float32, input[1]: [512,1536] | float32, output[0]: [3448,1,1536] | float32

For this kind of shape, the batch size is large but M = 1. The current algorithm partitions tiles over [M, N], which is inefficient for such shapes. This PR reshapes the inputs to improve MatMul performance.
Before: [3448, 1, 512] x [512, 1536] = [3448, 1, 1536]
After: [1, 3448, 512] x [512, 1536] = [1, 3448, 1536], then the output is reshaped back to [3448, 1, 1536]

With this change, the overall MatMul time in the demucs model improves from 4418.17 ms to 1778.45 ms on my iGPUs.
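
To make the idea concrete, here is a minimal TypeScript sketch of the reshape trick. The helper `planMatMulWithM1` and its types are illustrative assumptions, not the actual onnxruntime-web implementation:

```ts
// Illustrative sketch only (hypothetical helper, not the onnxruntime-web code):
// when A has shape [batch, 1, K] and B is a 2-D [K, N] tensor, the same buffer
// can be viewed as [1, batch, K], so the tiled MatMul partitions work over
// M = batch instead of M = 1.
type Shape = readonly number[];

interface MatMulPlan {
  aShape: Shape;      // shape the kernel should use for input A
  bShape: Shape;      // shape the kernel should use for input B
  outputShape: Shape; // logical output shape reported back to the graph
}

function planMatMulWithM1(aShape: Shape, bShape: Shape): MatMulPlan {
  const [batch, m, k] = aShape;
  const n = bShape[bShape.length - 1];
  if (m === 1 && bShape.length === 2 && bShape[0] === k) {
    // [batch, 1, K] and [1, batch, K] share the same memory layout, so this
    // "reshape" is free: only the dims passed to the kernel change.
    return { aShape: [1, batch, k], bShape, outputShape: [batch, 1, n] };
  }
  return { aShape, bShape, outputShape: [batch, m, n] };
}

// Example from the description:
// planMatMulWithM1([3448, 1, 512], [512, 1536]) lets the kernel compute
// [1, 3448, 512] x [512, 1536] = [1, 3448, 1536], reported as [3448, 1, 1536].
```

Because the two views share the same buffer layout, the change is metadata-only; only the dimensions handed to the tiled kernel differ.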

qjia7 (Contributor, Author) commented Oct 24, 2024

@guschmue @fs-eire Please take a look, thanks.

guschmue added the ep:WebGPU ort-web webgpu provider label on Oct 24, 2024
qjia7 added a commit to qjia7/onnxruntime that referenced this pull request on Oct 29, 2024:

BUG microsoft#22031

Optimize the following two situations:
1. Increase workgroupSize if only one workgroup is dispatched.
2. Avoid the transpose if it is not necessary.

With this commit and PR microsoft#22577, the overall time of the demucs model improves from 154.60 ms to 106.36 ms on my dGPUs.
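
As a side note, a minimal sketch of the first point in that commit message, assuming a hypothetical `chooseWorkgroupSize` helper (not the actual kernel code):

```ts
// Hypothetical illustration of "increase workgroupSize if only one workgroup
// is dispatched": with a single workgroup, a larger workgroup size lets more
// invocations share the work instead of leaving most of the GPU idle.
function chooseWorkgroupSize(
  dispatch: [number, number, number],
  defaultSize = 64,
  enlargedSize = 256,
): number {
  const [x, y, z] = dispatch;
  return x * y * z === 1 ? enlargedSize : defaultSize;
}
```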
fs-eire previously approved these changes on Oct 29, 2024
fs-eire (Contributor) commented Oct 29, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline

fs-eire (Contributor) commented Oct 29, 2024

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-linux-gpu-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline

fs-eire (Contributor) commented Oct 29, 2024

/azp run iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,CoreML CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).

2 similar comments

guschmue pushed a commit that referenced this pull request on Oct 30, 2024:

BUG #22031

Optimize the following two situations:
1. Increase workgroupSize if only one workgroup is dispatched.
2. Avoid the transpose if it is not necessary.

With this commit and PR #22577, the overall time of the demucs model improves from 154.60 ms to 106.36 ms on my dGPUs.
guschmue previously approved these changes on Oct 30, 2024

guschmue (Contributor) commented:

CI not happy:
[image]
fs-eire dismissed stale reviews from guschmue and themself via e19f2a1 on October 30, 2024 01:32
qjia7 (Contributor, Author) commented Oct 30, 2024

> CI not happy: [image]

Done. Thanks @fs-eire for fixing it.
My previous perf data was based on the first commit, which hadn't yet reshaped B, so I didn't notice this earlier.

fs-eire (Contributor) commented Oct 30, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline

fs-eire (Contributor) commented Oct 30, 2024

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

fs-eire (Contributor) commented Oct 30, 2024

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,CoreML CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment

Azure Pipelines successfully started running 1 pipeline(s).

guschmue merged commit 8fbbf2f into microsoft:main on Nov 1, 2024
60 checks passed
qjia7 deleted the opt_matmul_m_1 branch on November 7, 2024 02:23
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this pull request on Nov 19, 2024
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this pull request on Nov 19, 2024
Labels
ep:WebGPU ort-web webgpu provider
Projects
None yet
3 participants