Question regarding large WebAssembly (WASM) file size when using WebGPU backend #24161

grazder · 2025-03-25T06:33:52Z

Description:
I am working with ONNX Runtime using the WebGPU backend, which requires importing and running a .wasm file. While optimizing the build, I observed the following:

Default WASM size: The generated WASM file is ~20 MB.

Optimized build: Using the command below, I reduced the size to ~8 MB:

./build.sh --config MinSizeRel --build_wasm --skip_tests --disable_wasm_exception_catching --use_jsep --enable_wasm_simd --enable_wasm_threads --include_ops_by_config $ORT_MODELS_CONFIG --enable_reduced_operator_type_support --allow_running_as_root --parallel --target onnxruntime_webassembly

Minimal build: Adding --minimal_build extended reduces it further to ~3 MB, but this causes some operations to fall back to the CPU (observed when using .ort models, whereas .onnx models run fully on GPU).

Key Questions:

- Why is a WASM file required for WebGPU execution? Since WebGPU operations are GPU-bound, the expectation is that the WASM layer would be minimal. However, the current WASM files remain large even after aggressive optimization.
- What contributes to the WASM file size? For a forward pass that primarily uses WebGPU, why does the WASM file still require significant code (e.g., 8–20 MB)?
- Comparison with WebGL: I noticed that the WebGL backend does not require a .wasm file. Could you explain the key architectural differences between WebGPU and WebGL in ONNX Runtime that necessitate a WASM dependency for WebGPU?
- Relationship to PR "[WIP] migrate WebGPU EP to WebAssembly to replace JSEP" #23697: The migration of WebGPU EP to WebAssembly suggests a shift from JSEP to WASM. Could you clarify how this impacts the WASM size and its necessity for WebGPU workflows?

Additional Context:

- The larger WASM size impacts web deployment, especially for bandwidth-sensitive applications.
- The .ort format appears to reintroduce CPU dependencies in minimal builds, whereas .onnx avoids this. Is this expected behavior? In my experience, some operations start running on WASM (CPU) kernels instead of WebGPU. I will share examples of this behavior later.

Request:
A detailed explanation of:

- The role of WASM in WebGPU execution (why it cannot be purely JS-driven, unlike WebGL).
- The primary contributors to the WASM file size in WebGPU-forward scenarios.
- Whether further size reductions are feasible without sacrificing GPU execution.

Thank you for your insights!

grazder · 2025-03-25T06:34:52Z

grazder
Mar 25, 2025
Author

@fs-eire Can you help here please? 🙏

fs-eire · 2025-03-25T18:40:07Z

fs-eire
Mar 25, 2025
Collaborator

Thank you for using onnxruntime-web. The questions are really good and this is going to be a long answer.

The role of WASM in WebGPU execution (why it cannot be purely JS-driven, unlike WebGL).

TL;DR answer: we tried a pure-JS approach but gave it up because
(1) the maintenance cost was high, and
(2) we found that WebAssembly is a good solution.

To answer this, we need to look back at the history of the project.

In 2018, the onnxruntime team started an experimental project, ONNX.js, to enable inference of ONNX models in browsers. It is a pure JavaScript implementation with a standalone code base and no dependency on the onnxruntime repo. It contains a slow CPU backend and a faster WebGL backend, and it can run very large models (at that time ~100MB was considered very large) like ResNet50 with good performance.

Over the following few years, we expanded operator coverage, optimized performance, integrated more and more complicated logic from onnxruntime, and tried to support more models. However, we could not keep up with the pace of onnxruntime, especially after BERT came out. ONNX.js and onnxruntime are separate repos, so features and operator implementations had to be rewritten, which caused a lot of extra work. Much of the feature work and many of the bug fixes done in onnxruntime had to be redone in ONNX.js, in JavaScript.

We started thinking about integrating ONNX.js into onnxruntime, based on the following observations:
- The cost of maintaining the ONNX.js repo was becoming a burden to the team.
- WebAssembly and Emscripten were becoming mature, with better performance than JavaScript.
- A lot of features and functions could be used directly if they already worked in onnxruntime.

In 2021, we released the first version of onnxruntime-web. It uses the same code as onnxruntime for the framework, including the allocator, graph resolving, optimizer, sequential executor and all operators of the CPU EP. Since then, we have kept our work in onnxruntime, and we archived ONNX.js by the end of 2021.

For backward compatibility, we kept the ONNX.js code for WebGL in onnxruntime-web.

The primary contributors to the WASM file size in WebGPU-forward scenarios.

I haven't done a detailed, data-driven analysis. However, I can list a few contributors to the binary size:
- Trade-off between Release and MinSizeRel builds in Emscripten
  - MinSizeRel produces a much smaller artifact, but also has a noticeable perf drop.
- ASYNCIFY
  - This feature allows unwinding/rewinding the stack to support calling async functions from a sync context, which is necessary to use WebGPU/WebNN.
  - There are 2 different implementations: the JS-based ASYNCIFY and JSPI. Because JSPI is disabled by default in browsers, we have to use the JS-based ASYNCIFY.
- Minimal build
  - A minimal build removes a lot of runtime features, so the artifact is much smaller. However, the limitation is that it only works with ORT format models.
- Full CPU operator coverage
  - Because we don't know which EP users will use, we keep all operator implementations for the CPU EP in the build.

Whether further size reductions are feasible without sacrificing GPU execution.

This is not on my list, for 3 reasons:
- The gzipped wasm file is not big (I think it was ~6MB the last time I checked), and almost every server should support gzip by now.
- Compared to the model size, the wasm file size is considered trivial, because users need to download the model anyway.
- If there is a use case that really requires a small wasm file, this can be achieved by using a minimal build.

One of the goals for the default wasm in onnxruntime-web is to support as many scenarios as possible. As you can see, offering a smaller wasm file in the release would sacrifice at least one of the aspects listed above.
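
To check the gzip point for a local build, the over-the-wire size can be estimated by compressing the artifact yourself. The following is a minimal sketch, not part of onnxruntime: the function names are made up for illustration, and the compression level is an assumption, since real servers may use a different gzip level or Brotli.

```python
import gzip
from pathlib import Path


def wasm_transfer_sizes(wasm_path, compresslevel=6):
    """Return (raw_bytes, gzipped_bytes) for the file at wasm_path.

    compresslevel=6 is an assumed, commonly used server-side level;
    adjust it to match your own server configuration.
    """
    raw = Path(wasm_path).read_bytes()
    compressed = gzip.compress(raw, compresslevel=compresslevel)
    return len(raw), len(compressed)


def describe(wasm_path):
    """Format both sizes in MiB for a quick comparison between builds."""
    raw, gz = wasm_transfer_sizes(wasm_path)
    return f"{wasm_path}: {raw / 2**20:.2f} MiB raw, {gz / 2**20:.2f} MiB gzipped"
```

Calling `describe(...)` on each locally built .wasm file (Release, MinSizeRel, minimal build) gives a rough idea of what users would actually download.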

2 replies

grazder Mar 26, 2025
Author

Thank you for a great answer! I want to clarify a few things to understand this more precisely.

we don't know what EP user will use, we kept all operators implementation for CPU EP in the build

Is it possible to exclude CPU ops at build time if I know I will only use WebGPU ops? I am building the package with --use_jsep --enable_wasm_simd --enable_wasm_threads, so am I right in thinking that all CPU operations are already built into the package? Would the --use_webgpu argument help here? Or might it become possible in the future to specify EPs for tree-shaking, as with the ORT op-type config?

However the limitation is it can only work with ORT format models

Is it possible to do a minimal_build with .onnx support, or is that impossible by minimal_build design?

MinSizeRel has a much smaller artifact size, but also has noticeable perf drop.

I didn't really notice any changes after switching from Release to MinSizeRel. Can you share some examples/issues so I can test this for my model? Or is it only applicable to wasm / wasm-simd performance?

fs-eire Mar 26, 2025
Collaborator

Is it possible to exclude CPU ops at build time if I know I will only use WebGPU ops?

Yes and No.

No, because there are no general build flags to disable the CPU operators in a build.

Yes, because the minimal build is designed for this kind of scenario. You need to build locally, specifying your model and your target environment; the build script will analyze which operators are used and include only those operators in the build.
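
The operator list is what `--include_ops_by_config` consumes. As a rough illustration, here is a sketch that writes such a list by hand. It assumes the basic three-field line format `domain;opset;op1,op2,...` described in the ONNX Runtime reduced operator config documentation; the function name and the sample operators are hypothetical, and in practice the config is normally generated by ONNX Runtime's own tooling from the model rather than written manually.

```python
from collections import defaultdict


def format_ops_config(required_ops):
    """Render (domain, opset, op_type) tuples as reduced-operator config text.

    Assumes one line per (domain, opset) pair in the form
    "domain;opset;op1,op2,...". Duplicate operators are collapsed.
    """
    grouped = defaultdict(set)
    for domain, opset, op_type in required_ops:
        grouped[(domain, int(opset))].add(op_type)

    lines = ["# domain;opset;op1,op2,..."]
    for (domain, opset), ops in sorted(grouped.items()):
        lines.append(f"{domain};{opset};{','.join(sorted(ops))}")
    return "\n".join(lines) + "\n"


# Hypothetical operator set for illustration only.
example_ops = [
    ("ai.onnx", 17, "Conv"),
    ("ai.onnx", 17, "Add"),
    ("ai.onnx", 17, "Conv"),
]
print(format_ops_config(example_ops))
# prints:
# # domain;opset;op1,op2,...
# ai.onnx;17;Add,Conv
```

Any operator missing from this file is left out of the binary, so a model that needs it will fail to load against that build.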

Is it possible to do minimal_build with .onnx support? or it's impossible by minimal_build design?

It is impossible. The minimal build excludes a lot of the implementation that is used in graph resolving, which is why it only supports .ort models (which store a resolved graph). This is by design: an .onnx model is supposed to be usable in different environments, while an .ort model is generated specifically for a particular environment and cannot be used in other environments.
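
For reference, producing the .ort file is a separate offline step. The sketch below wraps the converter module that ships with the `onnxruntime` Python package (`onnxruntime.tools.convert_onnx_models_to_ort`); it assumes that package is installed, and the helper names are hypothetical. Only the positional model path is passed, since the right optional flags depend on your ONNX Runtime version.

```python
import subprocess
import sys


def build_convert_command(model_path, python=sys.executable):
    """Return the argv list that invokes the ONNX -> ORT converter module."""
    return [
        python,
        "-m",
        "onnxruntime.tools.convert_onnx_models_to_ort",
        str(model_path),
    ]


def convert_to_ort(model_path):
    """Run the converter; raises CalledProcessError if conversion fails.

    Requires the onnxruntime Python package in the current environment.
    """
    subprocess.run(build_convert_command(model_path), check=True)
```

Because the resulting .ort file is tied to the environment it was generated for, it is safest to regenerate it with the same ONNX Runtime version used for the minimal build.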

switching from Release to MinSizeRel

I basically gave up on MinSizeRel, especially after ASYNCIFY was introduced. We prioritize runtime speed over binary size for our default package. But if you want to do a local build, you can definitely try MinSizeRel.
