
[js/webgpu] Optimize MatMul with M = 1 #22577

Merged: 5 commits merged into microsoft:main on Nov 1, 2024
Conversation

qjia7 (Contributor) commented Oct 24, 2024

Description

BUG #22031

In the demucs model, there are many MatMul ops with shapes like the following:
input[0]: [3448,1,512] | float32, input[1]: [512,1536] | float32, output[0]: [3448,1,1536] | float32

For this kind of shape, the batch size is large but M = 1. The current algorithm partitions tiles over [M, N], which is inefficient for such shapes. This PR reshapes the inputs to improve MatMul performance.
Before: [3448, 1, 512] x [512, 1536] = [3448, 1, 1536]
After: [1, 3448, 512] x [512, 1536] = [1, 3448, 1536], then the output is reshaped back to [3448, 1, 1536]

With this change, the overall MatMul time in the demucs model improves from 4418.17 ms to 1778.45 ms on my iGPUs.
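
To make the idea concrete, here is a minimal TypeScript sketch of the reshape trick. The helper `planMatMulWithM1` and its types are illustrative assumptions, not the actual onnxruntime-web implementation:

```ts
// Illustrative sketch only (hypothetical helper, not the onnxruntime-web code):
// when A has shape [batch, 1, K] and B is a 2-D [K, N] tensor, the same buffer
// can be viewed as [1, batch, K], so the tiled MatMul partitions work over
// M = batch instead of M = 1.
type Shape = readonly number[];

interface MatMulPlan {
  aShape: Shape;      // shape the kernel should use for input A
  bShape: Shape;      // shape the kernel should use for input B
  outputShape: Shape; // logical output shape reported back to the graph
}

function planMatMulWithM1(aShape: Shape, bShape: Shape): MatMulPlan {
  const [batch, m, k] = aShape;
  const n = bShape[bShape.length - 1];
  if (m === 1 && bShape.length === 2 && bShape[0] === k) {
    // [batch, 1, K] and [1, batch, K] share the same memory layout, so this
    // "reshape" is free: only the dims passed to the kernel change.
    return { aShape: [1, batch, k], bShape, outputShape: [batch, 1, n] };
  }
  return { aShape, bShape, outputShape: [batch, m, n] };
}

// Example from the description:
// planMatMulWithM1([3448, 1, 512], [512, 1536]) lets the kernel compute
// [1, 3448, 512] x [512, 1536] = [1, 3448, 1536], reported as [3448, 1, 1536].
```

Because the two views share the same buffer layout, the change is metadata-only; only the dimensions handed to the tiled kernel differ.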

qjia7 (Contributor, Author) commented Oct 24, 2024

@guschmue @fs-eire Please take a look, thanks.

guschmue added the ep:WebGPU ort-web webgpu provider label on Oct 24, 2024
qjia7 added a commit to qjia7/onnxruntime that referenced this pull request on Oct 29, 2024:

BUG microsoft#22031

Optimize the following two situations:
1. Increase workgroupSize if only one workgroup is dispatched.
2. Avoid the transpose if it is not necessary.

With this commit and PR microsoft#22577, the overall time of the demucs model improves from 154.60 ms to 106.36 ms on my dGPUs.
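
As a side note, a minimal sketch of the first point in that commit message, assuming a hypothetical `chooseWorkgroupSize` helper (not the actual kernel code):

```ts
// Hypothetical illustration of "increase workgroupSize if only one workgroup
// is dispatched": with a single workgroup, a larger workgroup size lets more
// invocations share the work instead of leaving most of the GPU idle.
function chooseWorkgroupSize(
  dispatch: [number, number, number],
  defaultSize = 64,
  enlargedSize = 256,
): number {
  const [x, y, z] = dispatch;
  return x * y * z === 1 ? enlargedSize : defaultSize;
}
```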
fs-eire previously approved these changes on Oct 29, 2024
fs-eire (Contributor) commented Oct 29, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline

fs-eire (Contributor) commented Oct 29, 2024

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-linux-gpu-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline,Android CI Pipeline

fs-eire (Contributor) commented Oct 29, 2024

/azp run iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,CoreML CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).

2 similar comments

guschmue pushed a commit that referenced this pull request on Oct 30, 2024:

BUG #22031

Optimize the following two situations:
1. Increase workgroupSize if only one workgroup is dispatched.
2. Avoid the transpose if it is not necessary.

With this commit and PR #22577, the overall time of the demucs model improves from 154.60 ms to 106.36 ms on my dGPUs.
guschmue previously approved these changes on Oct 30, 2024

guschmue (Contributor) commented:

CI not happy:
[image]
fs-eire dismissed stale reviews from guschmue and themself via e19f2a1 on October 30, 2024 01:32
qjia7 (Contributor, Author) commented Oct 30, 2024

> CI not happy: [image]

Done. Thanks @fs-eire for fixing it.
My previous perf data was based on the first commit, which hadn't yet reshaped B, so I didn't notice this earlier.

fs-eire (Contributor) commented Oct 30, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline

fs-eire (Contributor) commented Oct 30, 2024

/azp run Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline,Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

fs-eire (Contributor) commented Oct 30, 2024

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline,CoreML CI Pipeline,Linux DNNL CI Pipeline,Linux MIGraphX CI Pipeline,Linux ROCm CI Pipeline

Azure Pipelines successfully started running 1 pipeline(s).

1 similar comment

Azure Pipelines successfully started running 1 pipeline(s).

guschmue merged commit 8fbbf2f into microsoft:main on Nov 1, 2024
60 checks passed
qjia7 deleted the opt_matmul_m_1 branch on November 7, 2024 02:23
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this pull request on Nov 19, 2024
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this pull request on Nov 19, 2024
Labels
ep:WebGPU ort-web webgpu provider
Projects
None yet
3 participants