Hello,
I am experimenting with onnxruntime-web, specifically the webgpu ExecutionProvider, because I would like to run some models on my website for some neat real-time visualizations.
To learn a bit about the framework, I implemented a simple PyTorch model which reproduces this shader.
Testing the model in inference, computing one frame takes around 50 ms with CUDA and around 45 ms on CPU. (At bigger sizes CUDA starts to take over; I guess it is bottlenecked by copying data from CPU to GPU.) With torch.compile, CUDA inference drops to ~25 ms.
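For reference, the torch.compile number comes from simply wrapping the module before timing it, roughly like this (a sketch; model is an instance of the module shown below, default compile options assumed):

compiled = torch.compile(model)
# compilation happens lazily on the first call
image = compiled(0.0)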
Using the PyTorch ONNX API, I then export the model as .onnx. The code to generate the model is as follows:
PyTorch model:
import torch
import torch.nn as nn
class Beauty(nn.Module):
    def __init__(self, size, device='cpu'):
        """
        size : (h,w) screen size
        """
        super().__init__()
        self.h, self.w = size
        x, y = torch.meshgrid(torch.arange(0, self.h, device=device), torch.arange(0, self.w, device=device))
        self.coords = torch.stack([x, y], dim=-1)  # (h,w,2)
        self.size_shape = torch.tensor([self.h, self.w], device=device)[None, None, :]  # (1,1,2)
        self.d = torch.tensor([0.263, 0.416, 0.557, 1.], device=device)[None, None, :]  # (1,1,4)
        uv0 = (self.coords*2 - self.size_shape)/self.size_shape  # (h,w,2)
        self.l_uv0 = torch.linalg.norm(uv0, dim=-1, keepdim=True)  # (h,w,1)
        uv1 = torch.zeros((4, self.h, self.w, 2), device=device)
        uv1[0] = (uv0*1.5) % 1. - 0.5
        for i in range(1, 4):
            uv1[i] = (uv1[i-1]*1.5) % 1. - 0.5
        # uv1 shape : (4,h,w,2)
        self.base_d = torch.linalg.norm(uv1, dim=-1, keepdim=True) * (torch.exp(-self.l_uv0)[None])  # (4,h,w,1)
        self.i_steps = torch.tensor([0, 1, 2, 3], device=device)[:, None, None, None]
        self.device = device

    def palette(self, t: torch.Tensor):
        """
        t should broadcast against self.d of shape (1,1,4)
        Returns a tensor with a trailing dimension of 4 and values in [0,1]
        """
        return 0.5 + 0.5*torch.cos(2*torch.pi*(t + self.d))

    def forward(self, time):
        """
        Returns the image corresponding to the current time
        time : float with the current time
        """
        color = self.palette(self.l_uv0[None] + self.i_steps*.4 + time*.4)  # (4,h,w,4)
        d = torch.sin(self.base_d*8. + time)/8.  # (4,h,w,1)
        d = torch.abs(d)  # (4,h,w,1)
        d = torch.pow(0.01/d, 1.2)  # (4,h,w,1)
        image = (color*d).sum(dim=0)  # (h,w,4)
        image = torch.clip(image, 0, 1)  # (h,w,4)
        return image  # (h,w,4)
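And to export it, the call looked roughly like this (a sketch; the opset version and output name are assumptions, the file name matches what the JS side loads):

model = Beauty((512, 512))  # size is a placeholder
# 'time' is passed as a 1-element float32 tensor, matching the input name
# used on the JS side.
dummy_time = torch.tensor([0.], dtype=torch.float32)
torch.onnx.export(
    model,
    (dummy_time,),
    'shader.onnx',
    input_names=['time'],
    output_names=['image'],  # assumed output name
    opset_version=17,        # assumed opset
)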
Finally, to test it, I repurposed the WebGPU onnxruntime-web example (with bundler); here is the code for completeness:
const ort = require('onnxruntime-web/webgpu');

async function runSesh(session) {
    const time = new ort.Tensor('float32', new Float32Array([0.]), [1]);
    // prepare feeds. use model input names as keys.
    const feeds = { 'time': time };
    const start = performance.now();
    // feed inputs and run
    const results = await session.run(feeds);
    console.log('Inference took ' + (performance.now() - start) + 'ms');
}

async function main() {
    try {
        // create a new session and load the specific model.
        const session = await ort.InferenceSession.create('./shader.onnx', { executionProviders: ['webgpu'] });
        console.log('Session webgpu success! : ', session);
        // prepare inputs. a tensor needs its corresponding TypedArray as data.
        for (let i = 0; i < 10; i++) {
            await runSesh(session);
        }
    } catch (e) {
        document.write(`failed to inference ONNX model: ${e}.`);
    }
}

main();
Now, with this example, looking at the logged timings, I find that with WebGPU a frame takes approximately 350 ms to compute, i.e. a 6x slowdown (a 14x slowdown w.r.t. the torch.compile version).
Is this expected? Should I just abandon hope of reaching performance close to the PyTorch CUDA performance in inference? Does it depend on the architecture of the model, maybe?
Thanks for the help, and let me know if you want more details on some part of the program!
P.S.: I have also tried the 'wasm' executionProvider, and I got speeds of about 700 ms, which is also much worse than the CPU equivalent in PyTorch (a 17x slowdown).
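The only change for that run was the provider list, along the lines of:

const session = await ort.InferenceSession.create('./shader.onnx', { executionProviders: ['wasm'] });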