Hello,
I am experimenting with onnxruntime-web, specifically the webgpu ExecutionProvider, because I would like to run some models on my website for some neat real-time visualizations.
To learn a bit about the framework, I implemented a simple PyTorch model which reproduces this shader.
Testing the model in inference, computing one frame takes around 50 ms with CUDA and around 45 ms on CPU. (At bigger sizes CUDA starts to take over; I guess it is bottlenecked by copying data from CPU to GPU.) With torch.compile, CUDA inference drops to ~25 ms.
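For reference, the torch.compile number comes from simply wrapping the module before timing it, roughly like this (a sketch; model is an instance of the module shown below, default compile options assumed):

compiled = torch.compile(model)
# compilation happens lazily on the first call
image = compiled(0.0)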
Using the PyTorch ONNX API, I then export the model as .onnx. The code to generate the model is as follows:
PyTorch model:
import torch
import torch.nn as nn
class Beauty(nn.Module):
    def __init__(self, size, device='cpu'):
        """
        size : (h,w) screen size
        """
        super().__init__()
        self.h, self.w = size
        x, y = torch.meshgrid(torch.arange(0, self.h, device=device), torch.arange(0, self.w, device=device))
        self.coords = torch.stack([x, y], dim=-1)  # (h,w,2)
        self.size_shape = torch.tensor([self.h, self.w], device=device)[None, None, :]  # (1,1,2)
        self.d = torch.tensor([0.263, 0.416, 0.557, 1.], device=device)[None, None, :]  # (1,1,4)
        uv0 = (self.coords*2 - self.size_shape)/self.size_shape  # (h,w,2)
        self.l_uv0 = torch.linalg.norm(uv0, dim=-1, keepdim=True)  # (h,w,1)
        uv1 = torch.zeros((4, self.h, self.w, 2), device=device)
        uv1[0] = (uv0*1.5) % 1. - 0.5
        for i in range(1, 4):
            uv1[i] = (uv1[i-1]*1.5) % 1. - 0.5
        # uv1 shape : (4,h,w,2)
        self.base_d = torch.linalg.norm(uv1, dim=-1, keepdim=True) * (torch.exp(-self.l_uv0)[None])  # (4,h,w,1)
        self.i_steps = torch.tensor([0, 1, 2, 3], device=device)[:, None, None, None]
        self.device = device

    def palette(self, t: torch.Tensor):
        """
        t should broadcast against self.d of shape (1,1,4)
        Returns a tensor with a trailing dimension of 4 and values in [0,1]
        """
        return 0.5 + 0.5*torch.cos(2*torch.pi*(t + self.d))

    def forward(self, time):
        """
        Returns the image corresponding to the current time
        time : float with the current time
        """
        color = self.palette(self.l_uv0[None] + self.i_steps*.4 + time*.4)  # (4,h,w,4)
        d = torch.sin(self.base_d*8. + time)/8.  # (4,h,w,1)
        d = torch.abs(d)  # (4,h,w,1)
        d = torch.pow(0.01/d, 1.2)  # (4,h,w,1)
        image = (color*d).sum(dim=0)  # (h,w,4)
        image = torch.clip(image, 0, 1)  # (h,w,4)
        return image  # (h,w,4)
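And to export it, the call looked roughly like this (a sketch; the opset version and output name are assumptions, the file name matches what the JS side loads):

model = Beauty((512, 512))  # size is a placeholder
# 'time' is passed as a 1-element float32 tensor, matching the input name
# used on the JS side.
dummy_time = torch.tensor([0.], dtype=torch.float32)
torch.onnx.export(
    model,
    (dummy_time,),
    'shader.onnx',
    input_names=['time'],
    output_names=['image'],  # assumed output name
    opset_version=17,        # assumed opset
)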
Finally, to test it, I repurposed the WebGPU onnxruntime-web example (with bundler); here is the code for completeness:
const ort = require('onnxruntime-web/webgpu');

async function runSesh(session) {
    const time = new ort.Tensor('float32', new Float32Array([0.]), [1]);
    // prepare feeds. use model input names as keys.
    const feeds = { 'time': time };
    const start = performance.now();
    // feed inputs and run
    const results = await session.run(feeds);
    console.log('Inference took ' + (performance.now() - start) + 'ms');
}

async function main() {
    try {
        // create a new session and load the specific model.
        const session = await ort.InferenceSession.create('./shader.onnx', { executionProviders: ['webgpu'] });
        console.log('Session webgpu success! : ', session);
        // prepare inputs. a tensor needs its corresponding TypedArray as data.
        for (let i = 0; i < 10; i++) {
            await runSesh(session);
        }
    } catch (e) {
        document.write(`failed to inference ONNX model: ${e}.`);
    }
}

main();
Now, with this example, looking at the logged timings, I find that with WebGPU a frame takes approximately 350 ms to compute, i.e. a 6x slowdown (a 14x slowdown w.r.t. the torch.compile version).
Is this expected? Should I just abandon hope of reaching performance close to the PyTorch CUDA performance in inference? Does it depend on the architecture of the model, maybe?
Thanks for the help, and let me know if you want more details on some part of the program!
P.S.: I have also tried the 'wasm' executionProvider, and I got speeds of about 700 ms, which is also much worse than the CPU equivalent in PyTorch (a 17x slowdown).
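The only change for that run was the provider list, along the lines of:

const session = await ort.InferenceSession.create('./shader.onnx', { executionProviders: ['wasm'] });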