[doc] Rename MLBuffer => MLTensor for WebNN EP (#22039)
Honry authored Sep 11, 2024
1 parent 8ae6332 commit 2403930
Showing 1 changed file with 36 additions and 36 deletions: docs/tutorials/web/ep-webnn.md

The WebNN API and the WebNN EP are under active development; consider installing the latest nightly build of ONNX Runtime Web (onnxruntime-web@dev) to benefit from the latest features and improvements.

## Keep tensor data on WebNN MLTensor (IO binding)

By default, a model's inputs and outputs are tensors that hold data in CPU memory. When you run a session with the WebNN EP using the 'gpu' or 'npu' device type, the data is copied to GPU or NPU memory, and the results are copied back to CPU memory. Memory copies between different devices, as well as between different sessions, add significant overhead to the inference time. WebNN provides a new opaque, device-specific storage type, MLTensor, to address this issue.
If you get your input data from an MLTensor, or you want to keep the output data on an MLTensor for further processing, you can use IO binding to keep the data on the MLTensor. This is especially helpful when running transformer-based models, which usually run a single model multiple times with the previous output as the next input.

For model input, if your input data is a WebNN storage MLTensor, you can [create an MLTensor tensor and use it as the input tensor](#create-input-tensor-from-a-mltensor).

For model output, there are 2 ways to use the IO binding feature:
- [Use pre-allocated MLTensor tensors](#use-pre-allocated-mltensor-tensors)
- [Specify the output data location](#specify-the-output-data-location)

Please also check the following topic:
- [MLTensor tensor life cycle management](#mltensor-tensor-life-cycle-management)

**Note:** MLTensor IO binding requires a shared MLContext. This means the MLContext should be created ahead of time, passed to the WebNN EP as a session option, and reused across all sessions.
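
For example, here is a minimal sketch of creating one MLContext up front and sharing it between the WebNN EP and your MLTensors. The `context` execution provider option name and the model path are assumptions; check the WebNN EP options of your ONNX Runtime Web version for the exact option name.

```js
// Create the MLContext once and reuse it everywhere (sessions and MLTensors).
const mlContext = await navigator.ml.createContext({ deviceType: 'gpu' });

// Assumed option name: pass the pre-created MLContext to the WebNN EP.
const mySession = await ort.InferenceSession.create('./model.onnx', {  // './model.onnx' is a placeholder model URL
  executionProviders: [{
    name: 'webnn',
    deviceType: 'gpu',
    context: mlContext,
  }],
});

// The same mlContext must be used for all later createTensor()/writeTensor() calls.
```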

### Create input tensor from a MLTensor

If your input data is a WebNN storage MLTensor, you can create an MLTensor tensor and use it as the input tensor:

```js
const mlContext = await navigator.ml.createContext({deviceType, ...});
const inputMLTensor = await mlContext.createTensor({
  dataType: 'float32',
  dimensions: [1, 3, 224, 224],
  usage: MLTensorUsage.WRITE_TO,
});

mlContext.writeTensor(inputMLTensor, inputArrayBuffer);
const inputTensor = ort.Tensor.fromMLTensor(inputMLTensor, {
  dataType: 'float32',
  dims: [1, 3, 224, 224]
});

```

Use this tensor as a model input (feed) so that the input data is kept on the MLTensor.
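
For example, a minimal sketch of feeding it to `run()`; the input name 'input' and the session variable `mySession` are hypothetical:

```js
// The MLTensor-backed tensor is bound directly as a model input; no extra CPU copy is made.
const feeds = { input: inputTensor };   // 'input' is a hypothetical model input name
const results = await mySession.run(feeds);
```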

### Use pre-allocated MLTensor tensors

If you know the output shape in advance, you can create an MLTensor tensor and use it as the output tensor:

```js

// Create a pre-allocated MLTensor and the corresponding ORT tensor. Assuming that the output shape is [10, 1000].
const mlContext = await navigator.ml.createContext({deviceType, ...});
const myPreAllocatedMLTensor = await mlContext.createTensor({
  dataType: 'float32',
  dimensions: [10, 1000],
  usage: MLTensorUsage.READ_FROM,
});

const myPreAllocatedOutputTensor = ort.Tensor.fromMLTensor(myPreAllocatedMLTensor, {
  dataType: 'float32',
  dims: [10, 1000]
});

// ...

const results = await mySession.run(feeds, fetches);

```

By specifying the output tensor in the fetches, ONNX Runtime Web will use the pre-allocated MLTensor as the output tensor. If there is a shape mismatch, the `run()` call will fail.
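
For example, a minimal sketch of wiring the pre-allocated tensor into `run()`; the I/O names 'input' and 'output' are hypothetical and depend on your model:

```js
// A minimal sketch, assuming hypothetical model I/O names 'input' and 'output'.
const feeds = { input: inputTensor };                    // any input tensor, e.g. the MLTensor-backed one above
const fetches = { output: myPreAllocatedOutputTensor };  // pre-allocated MLTensor-backed output tensor

// ONNX Runtime Web writes the result directly into myPreAllocatedMLTensor.
const results = await mySession.run(feeds, fetches);
```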

### Specify the output data location

If you don't want to use pre-allocated MLTensor tensors for outputs, you can also specify the output data location in the session options:

```js
const mySessionOptions1 = {
  ...,
  // keep all output data on MLTensor
  preferredOutputLocation: 'ml-tensor'
};

const mySessionOptions2 = {
  ...,
  // alternatively, you can specify the output location for each output tensor
  preferredOutputLocation: {
    'output_0': 'cpu',      // keep output_0 on CPU. This is the default behavior.
    'output_1': 'ml-tensor' // keep output_1 on MLTensor
  }
};
```
See [API reference: preferredOutputLocation](https://onnxruntime.ai/docs/api/js/
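
As a usage note, here is a minimal sketch of consuming an output kept on an MLTensor; `mySessionOptions2` and the output name 'output_1' follow the snippet above, and the model path is a placeholder:

```js
// Create the session with the output location preference and run it.
const mySession = await ort.InferenceSession.create('./model.onnx', mySessionOptions2);  // './model.onnx' is a placeholder
const results = await mySession.run(feeds);

// 'output_1' stays on an MLTensor; download it to CPU only when you need the values.
const output1Data = await results['output_1'].getData();  // returns a typed array

// Dispose of the output tensor (and its underlying MLTensor) when it is no longer needed.
results['output_1'].dispose();
```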

## Notes

### MLTensor tensor life cycle management

It is important to understand how the underlying MLTensor is managed so that you can avoid memory leaks and improve tensor usage efficiency.

An MLTensor tensor is created either by user code or by ONNX Runtime Web as a model's output (both cases are sketched after the list below).
- When it is created by user code, it is always created with an existing MLTensor using `Tensor.fromMLTensor()`. In this case, the tensor does not "own" the MLTensor.

  - It is the user's responsibility to make sure the underlying MLTensor is valid during the inference, and to call `mlTensor.destroy()` to dispose of the MLTensor when it is no longer needed.
  - Avoid calling `tensor.getData()` and `tensor.dispose()`. Use the MLTensor tensor directly.
  - Using an MLTensor tensor with a destroyed MLTensor will cause the session run to fail.
- When it is created by ONNX Runtime Web as a model's output (not a pre-allocated MLTensor tensor), the tensor "owns" the MLTensor.

  - You don't need to worry about the MLTensor being destroyed before the tensor is used.
  - Call `tensor.getData()` to download the data from the MLTensor to the CPU and get it as a typed array.
  - Call `tensor.dispose()` explicitly to destroy the underlying MLTensor when it is no longer needed.
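
A minimal sketch of the two ownership cases described above; the I/O names and `mySession` are hypothetical:

```js
// Case 1: tensor created by user code. Your code owns the MLTensor.
const inputMLTensor = await mlContext.createTensor({
  dataType: 'float32',
  dimensions: [1, 3, 224, 224],
  usage: MLTensorUsage.WRITE_TO,
});
const inputTensor = ort.Tensor.fromMLTensor(inputMLTensor, {
  dataType: 'float32',
  dims: [1, 3, 224, 224]
});

const results = await mySession.run({ input: inputTensor });  // 'input'/'output' are hypothetical names

// The ORT tensor does not own inputMLTensor: destroy it yourself when it is no longer needed.
inputMLTensor.destroy();

// Case 2: output created by ONNX Runtime Web (no pre-allocated tensor). The tensor owns its MLTensor.
const outputData = await results.output.getData();  // downloads from the MLTensor to CPU as a typed array
results.output.dispose();                            // destroys the underlying MLTensor
```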
