
[WebNN EP] Enable IO Bindings with MLTensor #21301

Merged · 21 commits · Sep 28, 2024

Conversation

@egalli (Contributor) commented Jul 9, 2024

Description

Enables using MLTensor to pass data between models.

Motivation and Context

Using MLTensor instead of ArrayBuffers reduces the number of copies between the CPU and devices, as well as between the renderer and GPU processes in Chromium.
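
For illustration, a minimal sketch of how the IO binding could look from onnxruntime-web, assuming the location string lands as 'ml-tensor' (it is called 'ml-buffer' earlier in this review) and treating `myMLTensor`, the model path, and the input/output names as placeholders:

```js
// Sketch only: option values and the fromMLTensor signature follow this review
// discussion and may differ from what finally shipped.
const session = await ort.InferenceSession.create('model.onnx', {
  executionProviders: ['webnn'],
  // Keep outputs on the device as MLTensors instead of copying back to an ArrayBuffer.
  preferredOutputLocation: 'ml-tensor',
});

// Wrap an existing MLTensor as an input, avoiding a CPU round trip.
const input = ort.Tensor.fromMLTensor(myMLTensor, {
  dataType: 'float32',
  dims: [1, 3, 224, 224],
});

const results = await session.run({ input });
// The output tensor stays device-resident and can feed a second session
// created on the same MLContext.
```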

@Honry (Contributor) left a comment

Super! Great work @egalli

Some first-pass comments. I can't wait to try it; I will provide more feedback later.

@guschmue added the ep:WebNN (WebNN execution provider) label on Jul 11, 2024
@Honry (Contributor) commented Jul 18, 2024

@egalli, I tested a simple merged model (with an If control flow op) and only set preferredOutputLocation = 'ml-buffer'; it throws the error shown in the attached screenshot.

Since WebNN doesn't support If, the graph is partitioned into 2 subgraphs and ORT needs to copy outputs across devices, i.e. CopyOutputsAcrossDevices is called, which triggers a new MLBuffer upload. At that point we haven't called ensureMLBuffer() to create the MLBuffer, which is why it throws the error above.

With a pre-allocated output MLBuffer, it works.
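
For reference, the pre-allocated-output workaround looks roughly like this (a sketch; `preallocatedMLBuffer`, the dims, and the output name are placeholders, and the factory name follows the fromMLTensor API discussed later in this thread):

```js
// Hypothetical sketch: bind a pre-created device tensor as the session output
// so the partitioned (CPU-fallback) path never has to create one on the fly.
const output = ort.Tensor.fromMLTensor(preallocatedMLBuffer, {
  dataType: 'float32',
  dims: [1, 1000],
});

// Pass the pre-allocated tensor through the fetches argument of run().
const results = await session.run({ input }, { output });
```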

@egalli (Contributor, Author) commented Jul 18, 2024

@Honry, I have changed from getMLBuffer to ensureBuffer when retrieving outputs. This change fixes the issue on partitioned graphs.

@Honry (Contributor) commented Jul 19, 2024

> @Honry, I have changed from getMLBuffer to ensureBuffer when retrieving outputs. This change fixes the issue on partitioned graphs.

It works, thanks!

@Honry (Contributor) left a comment

@egalli, some final comments. :)

@egalli force-pushed the create_mlbuffer branch 2 times, most recently from 0c853c0 to e203298, on July 22, 2024 at 23:16
@Honry (Contributor) left a comment

Thank you @egalli, LGTM % two nits.

@egalli (Contributor, Author) commented Jul 29, 2024

The MLBuffer specification has changed: createBuffer is now async.
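
Concretely, the call site changes from synchronous creation to awaiting a promise, roughly as below (the descriptor contents are left as a placeholder since they were still in flux in the spec at the time):

```js
// Before: const mlBuffer = mlContext.createBuffer(descriptor);
// After the spec change, creation is asynchronous:
const mlBuffer = await mlContext.createBuffer(descriptor);
```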

@fs-eire (Contributor) commented Aug 5, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@fs-eire (Contributor) commented Aug 5, 2024

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@fs-eire (Contributor) commented Aug 5, 2024

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).


Azure Pipelines successfully started running 10 pipeline(s).

@huningxin

FYI, @qwu16 and @Honry are running performance measurements against a set of models on ORT Web builds before and after applying this PR. Once the data is ready, we can review it and make a decision.

@fs-eire (Contributor) commented Aug 7, 2024

/azp run Windows ARM64 QNN CI Pipeline,Windows x64 QNN CI Pipeline,Windows CPU CI Pipeline,Windows GPU CI Pipeline,Windows GPU TensorRT CI Pipeline,ONNX Runtime Web CI Pipeline,Linux CPU CI Pipeline,Linux CPU Minimal Build E2E CI Pipeline,Linux GPU CI Pipeline,Linux GPU TensorRT CI Pipeline

@fs-eire (Contributor) commented Aug 7, 2024

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@fdwr (Contributor) commented Sep 18, 2024

/azp run Linux OpenVINO CI Pipeline,Linux QNN CI Pipeline,MacOS CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-linux-ci-pipeline,orttraining-linux-gpu-ci-pipeline,orttraining-ortmodule-distributed,onnxruntime-binary-size-checks-ci-pipeline,Big Models,Linux Android Emulator QNN CI Pipeline

@fdwr (Contributor) commented Sep 18, 2024

/azp run Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 3 pipeline(s).


Azure Pipelines successfully started running 8 pipeline(s).


Azure Pipelines successfully started running 9 pipeline(s).

@fs-eire (Contributor) commented Sep 20, 2024

I am generally OK with this change. One question: does MLTensor only represent a WebNN tensor on GPU/NPU, or can it also represent a WebNN tensor on CPU?

@egalli (Contributor, Author) commented Sep 20, 2024

An MLTensor can represent data on any of the device types (CPU, GPU, or NPU). Moreover, MLContext.compute() is being removed from the WebNN specification, so there won't be a way to use WebNN without MLTensor.
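
To illustrate what replaces compute(): execution is driven by dispatch() over MLTensors bound to the same context, roughly as below (a sketch; the readback method name follows the MLTensor explainer/spec direction at the time and may have changed since, and the tensor names are placeholders):

```js
// Bind named inputs/outputs (all MLTensors created from this context) and dispatch.
mlContext.dispatch(mlGraph, { input: inputMLTensor }, { output: outputMLTensor });

// Results stay on the device; reading them back into script is an explicit async step.
const outputData = await mlContext.readTensor(outputMLTensor);
```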

@fs-eire (Contributor) commented Sep 20, 2024

> An MLTensor can represent data on any of the device types (CPU, GPU, or NPU). Moreover, MLContext.compute() is being removed from the WebNN specification, so there won't be a way to use WebNN without MLTensor.

I understand that MLTensor will be the type that the WebNN interface uses. My question is about how MLTensor manages the location of its data: I don't see how a user can set the location of a model's inputs and outputs when using MLTensor (it's not yet in the spec). If an MLTensor is created with an explicitly set location, then at least for CPU usage we should allow users to use a normal ort.Tensor as model input/output instead of having to use ort.Tensor.fromMLTensor().

@egalli (Contributor, Author) commented Sep 21, 2024

There is a PR explainer for MLTensor. As for location, MLTensors are created from an MLContext. MLContexts have a device associated with them (CPU, GPU, or NPU), so an MLTensor inherits the context's location/device. Also, an MLTensor is only valid when used with the context that created it (i.e. MLTensors are non-transferable between devices/MLContexts). Hopefully that helps.
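
In other words, the device is chosen when the context is created, and every tensor created from that context inherits it, roughly as in this sketch (deviceType as a context option was itself under discussion in the WG, and the descriptor field names follow the explainer rather than a final API):

```js
// The context pins the device; tensors created from it are only valid with this context.
const gpuContext = await navigator.ml.createContext({ deviceType: 'gpu' });
const tensor = await gpuContext.createTensor({
  dataType: 'float32',
  shape: [1, 3, 224, 224],
});
// Using `tensor` with a graph built on a different MLContext is an error:
// MLTensors are non-transferable between contexts/devices.
```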

@bbernhar

@fs-eire Thanks for helping us integrate MLTensor into ORT web. To expand on what @egalli said, MLTensor handles the memory allocation for its data on behalf of the web developer. Unless you use createTensor(GPUDevice) to explicitly target GPU memory, the developer can't directly specify the memory location. Instead, createTensor(MLTensorUsageFlags) allows you to influence whether the data should prioritize certain resources (like CPU or GPU), though it doesn't grant full control over the specific device.

We are still working on fully specifying how tensor data will move between devices, so at this stage, it's unclear if the location will move predictably or remain fixed on a given deviceType. Stay tuned for updates as we refine this feature.

Let me know if you'd like more adjustments or details.

@fs-eire (Contributor) commented Sep 24, 2024

@egalli @bbernhar Thank you for the explanation. I think it is totally fine to consider "MLTensor" as a virtual (logical) location in the context of ort.Tensor.location. There are 2 questions:

  • Do we expect to allow ort.Tensor(CPU) to be used as input/output for ort.InferenceSession.run? (This may require implementing implicit conversions from buffer to MLTensor internally.)
  • Do we want to align with the current resource management of ort.Tensor? (This may require implementing Tensor.download and Tensor.dispose for graph outputs.)

@huningxin

@fs-eire

> Do we expect to allow ort.Tensor(CPU) to be used as input/output for ort.InferenceSession.run? (This may require implementing implicit conversions from buffer to MLTensor internally.)

I suppose this would be a common inference scenario and we should support it. Yulong, what's your guidance?

I understand @egalli also has a plan to improve the performance of ort.Tensor(CPU) input/output by reusing MLTensor. Enrico, have you created an issue tracking this optimization?

> Do we want to align with the current resource management of ort.Tensor? (This may require implementing Tensor.download and Tensor.dispose for graph outputs.)

I suppose we should support Tensor.download and Tensor.dispose. Could this be implemented in a follow-up CL?

@egalli (Contributor, Author) commented Sep 24, 2024

@huningxin I have not created an issue, but I have a preliminary change set ready

> Do we expect to allow ort.Tensor(CPU) to be used as input/output for ort.InferenceSession.run? (This may require implementing implicit conversions from buffer to MLTensor internally.)

The code currently uses the Allocator and DataTransfer classes in the C++ code. Is this a problem?

> Do we want to align with the current resource management of ort.Tensor? (This may require implementing Tensor.download and Tensor.dispose for graph outputs.)

Is this different from the currently implemented solution?

By the way, keep in mind that unlike ort.Tensor(CPU), MLTensor(CPU) is not easily accessible to JS/Wasm. In Chromium, MLTensor(CPU) is allocated by the TFLite backend outside of the renderer process, so IPC/Mojo calls are required to read from and write to it.

@fs-eire (Contributor) commented Sep 24, 2024

The basic idea of resource management in onnxruntime-web is:

CPU:

The user should never worry about resource management for CPU tensors. ort.Tensor always uses a TypedArray (non-string types) or string[] (string type) as the underlying data, and no resources need to be manually released.

Since no resource management is needed, there is also no lifecycle consideration for CPU tensors.

GPU (and other non-CPU locations):

If an instance of a non-CPU ort.Tensor is created by the user (via Tensor.fromGpuBuffer or Tensor.fromMLTensor), the user needs to manage the underlying resource. Specifically:

  • the user needs to make sure the underlying resource stays valid while the ort.Tensor instance is in use.
  • the user is responsible for managing the lifecycle of the underlying resource.

If an instance of a non-CPU ort.Tensor is created by onnxruntime-web as an output (this only happens when sessionOptions.preferredOutputLocation is correctly set), the underlying resource is allocated by onnxruntime-web. In this case, the ort.Tensor instance should contain valid download and dispose properties.

  • the user should release the underlying resource after use by calling ort.Tensor.dispose().
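
Put together, the user-facing contract for a runtime-allocated MLTensor output would look roughly like the sketch below (the output name, `getData()` as the download path, and dispose() mirroring the existing GPU-buffer behavior are assumptions, not confirmed API for the WebNN path):

```js
// preferredOutputLocation was set, so onnxruntime-web allocates the output MLTensor.
const results = await session.run(feeds);
const output = results.output;

// Download: copy the device-resident data back to a TypedArray when needed.
const cpuData = await output.getData();

// Dispose: release the underlying MLTensor once the user is done with it.
output.dispose();
```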

@fs-eire (Contributor) commented Sep 24, 2024

> The code currently uses the Allocator and DataTransfer classes in the C++ code. Is this a problem?

No problem.

> Is this different from the currently implemented solution?

It should be OK.

> By the way, keep in mind that unlike ort.Tensor(CPU), MLTensor(CPU) is not easily accessible to JS/Wasm. In Chromium, MLTensor(CPU) is allocated by the TFLite backend outside of the renderer process, so IPC/Mojo calls are required to read from and write to it.

I think if the user requires the "location" to be CPU, that means using an ort.Tensor on CPU; if the user wants to use an MLTensor(CPU), they should set the location to "MLTensor".

@egalli (Contributor, Author) commented Sep 24, 2024

> I think if the user requires the "location" to be CPU, that means using an ort.Tensor on CPU; if the user wants to use an MLTensor(CPU), they should set the location to "MLTensor".

Considering that the WebNN WG wants to remove deviceType from the specification, that sounds reasonable to me.

@bbernhar

@fs-eire

I don't expect we need to allow mapping ort.Tensor(CPU) to MLTensor(CPU), since that requires tensor data to stay behind an MLContext rather than a TypedArray. It does make sense to me to align non-CPU ort.Tensor resource management with MLTensor, even when it happens to be a CPU device, by supporting "download" and "dispose" (+1 to doing that in a subsequent PR).

Does this answer your question/concern?

@fs-eire (Contributor) commented Sep 24, 2024

> @fs-eire
>
> I don't expect we need to allow mapping ort.Tensor(CPU) to MLTensor(CPU), since that requires tensor data to stay behind an MLContext rather than a TypedArray. It does make sense to me to align non-CPU ort.Tensor resource management with MLTensor, even when it happens to be a CPU device, by supporting "download" and "dispose" (+1 to doing that in a subsequent PR).
>
> Does this answer your question/concern?

Is using ort.Tensor(CPU) as input allowed? Or do you want to force every usage to go through ort.Tensor created via ort.Tensor.fromMLTensor with the WebNN EP?

I would prefer that using ort.Tensor(CPU) is allowed and is the default behavior for input/output, because it is much easier and more user-friendly. Of course, if I want to use MLTensor I can use it via the corresponding interface.

@egalli (Contributor, Author) commented Sep 25, 2024

My understanding is that ort.Tensor(CPU) as both input and output to ort.InferenceSession.run is already supported. It just uses the less efficient path of JS -(copy)-> wasm -> webnn::Allocator::Alloc -> webnn::DataTransfer::CopyTensor -(copy)-> MLTensor (2 copies). While this could be simplified to 1 copy, it would require knowing whether the first node in the graph is going to run on the WebNN EP or the CPU EP (fallback). Therefore, I would rather tackle this in another PR.

Is there anything else blocking this PR?

@fs-eire (Contributor) commented Sep 25, 2024

> My understanding is that ort.Tensor(CPU) as both input and output to ort.InferenceSession.run is already supported. It just uses the less efficient path of JS -(copy)-> wasm -> webnn::Allocator::Alloc -> webnn::DataTransfer::CopyTensor -(copy)-> MLTensor (2 copies). While this could be simplified to 1 copy, it would require knowing whether the first node in the graph is going to run on the WebNN EP or the CPU EP (fallback). Therefore, I would rather tackle this in another PR.
>
> Is there anything else blocking this PR?

No. I think it's all good.

@fs-eire (Contributor) commented Sep 25, 2024

/azp run Windows GPU CUDA CI Pipeline,Windows GPU DML CI Pipeline,Windows GPU Doc Gen CI Pipeline,orttraining-amd-gpu-ci-pipeline,orttraining-ortmodule-distributed,Android CI Pipeline,iOS CI Pipeline,ONNX Runtime React Native CI Pipeline


Azure Pipelines successfully started running 6 pipeline(s).

@bbernhar commented Sep 25, 2024

@fs-eire

> Is using ort.Tensor(CPU) as input allowed? Or do you want to force every usage to go through ort.Tensor created via ort.Tensor.fromMLTensor with the WebNN EP?

Yes, it can be allowed, so long as ort.Tensor(CPU) doesn't assert that the final device location of the MLTensor must also be the CPU (copies are OK).

@bbernhar

@fs-eire @fdwr ready to merge this PR? @egalli does not have write access.

@guschmue merged commit 52a8c1c into microsoft:main on Sep 28, 2024 (79 checks passed)
@fdwr (Contributor) commented Sep 28, 2024

> Considering that the WebNN WG wants to remove deviceType from the specification, that sounds reasonable to me.

@egalli That idea was proposed, but there's no consensus on it. More likely it will end up being relaxed into more of a hint than a requirement, given that CoreML's MLComputeUnits do not support requesting only NPU or only GPU (in CoreML, you always implicitly get CPU as a fallback too). There remain ambiguous cases where a power preference alone is inadequate for device selection, such as machines where the GPU is actually slower than the NPU, or machines with both integrated and discrete GPUs, where you use the deviceType to request a GPU and the power preference to select between them.

@egalli deleted the create_mlbuffer branch on October 22, 2024 at 01:28
ishwar-raut1 pushed a commit to ishwar-raut1/onnxruntime that referenced this pull request on Nov 19, 2024