Headless rendering much slower on AMD GPUs? (30x slower) #1208

Open
drywolf opened this issue Jun 4, 2024 · 19 comments

@drywolf

drywolf commented Jun 4, 2024

Describe the bug
I am using VSG to perform some headless rendering (i.e. no Swapchain and no vsg::Window)
The code that I am using is very similar to vsgheadless.cpp from the vsgExamples.

  • when I am running the code on Windows (10) with a NV RTX 2080 TI I am getting ca. 13.000 FPS
  • when I am running the same executable on Windows (11) with an AMD RX 5700 XT I am getting just ca. 390 FPS
    • ~30 times slower when compared to the RTX 2080
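
To put the two frame rates in perspective, it can help to convert them into per-frame times. The sketch below is only arithmetic on the figures quoted above (about 13 thousand FPS and 390 FPS); the helper names are illustrative and not part of VSG.

```cpp
#include <cassert>

// Convert a frame rate (frames per second) into the time spent per frame, in microseconds.
double frameTimeMicroseconds(double fps)
{
    assert(fps > 0.0);
    return 1.0e6 / fps;
}

// How many times slower `slowFps` is compared to `fastFps`.
double slowdownFactor(double fastFps, double slowFps)
{
    assert(fastFps > 0.0 && slowFps > 0.0);
    return fastFps / slowFps;
}
```

With the numbers from this report, `frameTimeMicroseconds(13000.0)` is roughly 77 µs and `frameTimeMicroseconds(390.0)` is roughly 2564 µs, so the AMD run spends about 2.5 ms per frame; `slowdownFactor(13000.0, 390.0)` is about 33.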

Also when looking at the Windows Task-Manager GPU performance metrics, there is an interesting difference between the two GPUs:

  • the AMD "Copy" queue is showing heavy load while running the code (see ~64% Copy in the screenshot)
  • the NV "Copy" queue is pretty much idle while running the same code (see ~0% Copy in the screenshot)

AMD RX 5700 XT

NV RTX 2080 TI


I already tried to do some profiling on the AMD to find out what is happening, but all of the AMD profiling tools are failing to function.
This is the minimal code to reproduce what I showed above:
https://gist.github.com/drywolf/690c775bb181c946b30ed67ebcdee3de

PS: the minimal code does not render anything, it only contains a single RenderPass that would implicitly clear the color & depth-stencil images, but that is all the code is doing. So it is quite surprising to see the low FPS / high Copy load on the AMD card, for such a trivial minimal workload.

@martinweber

I am seeing an even higher load of 74% on the Copy queue. This is with an AMD RX 6700 XT, 16 GB on a dual monitor setup (4K and 1440p resolution respectively). Framerate is at about 235 fps, so even worse.

Screenshot 2024-06-05 163842

@robertosfield
Collaborator

I think it's important to differentiate between rendering on the GPU and copying of data to and from the GPU over the PCI express bus.

I presume this thread is actually about copying rather than rendering, so the title of this thread is most likely misleading. Is this so, or am I just reading things wrong?

There isn't any information above about the amount of data being transferred and what mechanism is being used.

With unexpected differences in performance between hardware/drivers, sometimes Vulkan errors have occurred that one hardware/driver combination copes with just fine but that slow others down. Running the application with the Vulkan validation layer on would be a useful test to make sure there are no issues that need fixing.

As a general comment, when writing in English it's best to stick with English language conventions on numbers, so a . is a decimal point, not a delimiter between thousands. A German convention of 13.000 in an English language text will be read as 13, not 13 thousand. Having to second guess what folks might mean by what they write just takes away from the bandwidth required to understand the actual problem in hand.

@drywolf
Author

drywolf commented Jun 5, 2024

I apologize if I was unable to communicate the issue at hand clearly enough.

> I think it's important to differentiate between rendering on the GPU and copying of data to and from the GPU over the PCI express bus.
>
> I presume this thread is actually about copying rather than rendering, so the title of this thread is most likely misleading. Is this so, or am I just reading things wrong?
>
> There isn't any information above about the amount of data being transferred and what mechanism is being used.

That is the curious thing here. We were seeing worse performance on AMD GPUs than we would have expected (by taking a rough guess based on the hardware specs)
So we started to reduce our VSG code down to something more minimal to isolate where the AMD Driver/GPU might be doing something wasteful.

We now ended up with the minimal example code that I mentioned above, and it is still showing the same low performance & unexpected COPY in Task-Manager that we were seeing in our production app:
https://gist.github.com/drywolf/690c775bb181c946b30ed67ebcdee3de

  • this minimal repro code is originally based on the vsgheadless vsgExample
  • this code does NOT copy any GPU data in any way !
  • this code also does not render any 3D models anymore
  • it is just running a single vsg::RenderPass that is just clearing the Framebuffer images, and that is all that it is doing

Translating this code to OpenGL for example, would be similar to a render-loop that is just doing glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT) and nothing else !!

So the framerate should be very high, and there should not be any GPU memory-copies happening ... because this is doing headless rendering, there should also be no VK Swapchain involved in any way.

That is why the COPY workload in the Task-Manager & the low FPS on AMD are so surprising & unexpected.
There is no Vulkan/VSG code that I could see responsible for this copy-overhead.

@robertosfield
So I wanted to ask if maybe you have stumbled across something similar related to AMD GPUs while working on VSG / other Vulkan projects?

Thanks

@robertosfield
Collaborator

I had a quick look at the example and nothing jumps out as a possible cause of slower rendering. I'm really busy with other VSG work right now so I'm not able to go test out the example as is; perhaps others can test it to get a feel for how things perform on different hardware/OS/driver combinations.

Do any of the standard VSG examples exhibit the same performance issue?

As a general comment, I've been developing mostly on Linux when writing the VSG, using either an AMD 5700G integrated GPU or GeForce 1650 and 2080 cards. I've also got an Intel laptop and desktop and use the integrated GPU on these. Mostly I'm seeing really consistent performance across the board.

The integrated GPUs show lower cost of copying data from GPU associated memory into CPU associated memory than on the dedicated GPUs.

The NVidia cards list more queue options, but that's down to their drivers. This can provide extra options for lowering the cost of copying, but generally I've found the AMD side to have a lower copy cost. That is on an integrated GPU, though, so it's comparing apples to oranges. As I don't have a dedicated AMD card, I can't say how one would perform.

Vulkan and VSG support GPU timing stats, with the vsg::Profiler supporting both GPU and CPU stats collection, so perhaps this is something to try out when profiling how the application is running. The vsg::Profiler can output its results to console/file after the collection phase, so I've used it a few times to figure out the cost of different parts of the work.
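
Where GPU profiling tools are unavailable, a coarse CPU-side measurement can still confirm the frame rates being compared. The sketch below is not vsg::Profiler and collects no GPU timestamps; it is a standard-library-only accumulator (the `FrameStats` name and its methods are hypothetical) that is fed wall-clock frame durations measured around the render loop.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Accumulates per-frame wall-clock durations and reports an average frame rate.
class FrameStats
{
public:
    using Seconds = std::chrono::duration<double>;

    // Record the duration of one frame.
    void addFrame(Seconds frameTime)
    {
        assert(frameTime.count() >= 0.0);
        total_ += frameTime;
        ++frames_;
    }

    std::size_t frameCount() const { return frames_; }

    // Average frames per second over all recorded frames; 0 if nothing was recorded.
    double averageFps() const
    {
        if (frames_ == 0 || total_.count() <= 0.0) return 0.0;
        return static_cast<double>(frames_) / total_.count();
    }

private:
    Seconds total_{0.0};
    std::size_t frames_ = 0;
};
```

In a render loop one would take `std::chrono::steady_clock::now()` before and after each frame and pass the difference to `addFrame`. This only measures how long the CPU spent per loop iteration, so it complements GPU timing rather than replacing it.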

I would also recommend trying the same tests across different OS's and hardware/driver combinations.

@Mikalai
Contributor

Mikalai commented Jun 5, 2024

@drywolf Similar behaviour for me
image

@Mikalai
Contributor

Mikalai commented Jun 5, 2024

@drywolf Same hardware but on fedora 40 performs much better
image

@drywolf
Author

drywolf commented Jun 5, 2024

Thanks @Mikalai for testing ❤️
Fedora behaving so differently might indicate an issue in the AMD Windows driver.
I will contact AMD and let them know about this.

@drywolf
Author

drywolf commented Jun 6, 2024

@Mikalai the last time I worked with AMD GPUs on Linux there were two different kinds of drivers, the open-source driver and the proprietary "ROCm" driver.
Which one of these are you using on your Fedora Linux? Also the exact driver-version would be of help when I report this to AMD.
Thanks 🙏

@Mikalai
Contributor

Mikalai commented Jun 6, 2024

@drywolf
vulkaninfo --summary reports this

Devices:
========
GPU0:
	apiVersion         = 1.3.274
	driverVersion      = 24.0.8
	vendorID           = 0x1002
	deviceID           = 0x73a5
	deviceType         = PHYSICAL_DEVICE_TYPE_DISCRETE_GPU
	deviceName         = AMD Radeon RX 6950 XT (RADV NAVI21)
	driverID           = DRIVER_ID_MESA_RADV
	driverName         = radv
	driverInfo         = Mesa 24.0.8
	conformanceVersion = 1.3.0.0
	deviceUUID         = 00000000-0a00-0000-0000-000000000000
	driverUUID         = 414d442d-4d45-5341-2d44-525600000000
GPU1:
	apiVersion         = 1.3.274
	driverVersion      = 0.0.1
	vendorID           = 0x10005
	deviceID           = 0x0000
	deviceType         = PHYSICAL_DEVICE_TYPE_CPU
	deviceName         = llvmpipe (LLVM 18.1.1, 256 bits)
	driverID           = DRIVER_ID_MESA_LLVMPIPE
	driverName         = llvmpipe
	driverInfo         = Mesa 24.0.8 (LLVM 18.1.1)
	conformanceVersion = 1.3.1.1
	deviceUUID         = 6d657361-3234-2e30-2e38-000000000000
	driverUUID         = 6c6c766d-7069-7065-5555-494400000000

@drywolf
Author

drywolf commented Jun 6, 2024

@Mikalai
To me this looks like the Mesa RADV Vulkan Driver ... i.e. this is not the driver developed by AMD, but a driver that is developed by the Linux community AFAIK.

driverID = DRIVER_ID_MESA_RADV

The official (proprietary) AMD driver would be showing something like:

driverID = DRIVER_ID_AMD_PROPRIETARY
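
When collecting results from several testers, the driver in use can be read straight from the `vulkaninfo --summary` text. The following is a minimal sketch using only the standard library; it assumes the `key = value` line layout shown in the output above, and the function names `extractDriverIds` and `isMesaRadv` are hypothetical helpers, not part of any Vulkan tool.

```cpp
#include <cassert>
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Extract the value of every "driverID = ..." line from `vulkaninfo --summary` text.
std::vector<std::string> extractDriverIds(const std::string& summary)
{
    std::vector<std::string> ids;
    std::istringstream lines(summary);
    std::string line;
    while (std::getline(lines, line))
    {
        const std::size_t key = line.find("driverID");
        const std::size_t eq = line.find('=');
        if (key == std::string::npos || eq == std::string::npos || eq < key) continue;

        // Trim whitespace (and a trailing carriage return) around the value after '='.
        const std::size_t first = line.find_first_not_of(" \t\r", eq + 1);
        if (first == std::string::npos) continue;
        const std::size_t last = line.find_last_not_of(" \t\r");
        ids.push_back(line.substr(first, last - first + 1));
    }
    return ids;
}

// True when the driver ID names Mesa's RADV driver, as in the output above.
bool isMesaRadv(const std::string& driverId)
{
    return driverId == "DRIVER_ID_MESA_RADV";
}
```

Applied to the summary posted above, this would yield `DRIVER_ID_MESA_RADV` for GPU0 and `DRIVER_ID_MESA_LLVMPIPE` for GPU1. Lines such as `driverUUID` or `driverInfo` are skipped because they do not contain the exact key `driverID`.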

@drywolf
Author

drywolf commented Jun 6, 2024

@robertosfield

> Do any of the standard VSG examples exhibit the same performance issue?

I am in the process of setting up a more complete repro-case, and now I also recreated the issue with a windowed vsg example code.
At first the COPY workload was not as noticeable, because by default VSG would limit the framerate to 60 FPS.
But then I uncapped the framerate via windowTraits->swapchainPreferences.presentMode = VK_PRESENT_MODE_IMMEDIATE_KHR;

With that I am getting:

on AMD RX 5700 XT

  • ~350 FPS
  • a Task-Manager 3D workload of ~20%
  • a Task-Manager COPY workload of ~65%

on NV RTX 2080 TI

  • ~2900 FPS
  • a Task-Manager 3D workload of ~2%
  • a Task-Manager COPY workload of 0%

The VSG code is basically the vsghelloworld.cpp example, but without rendering any 3D scene.
(so just a window with a clear color, and nothing else)

@drywolf
Author

drywolf commented Jun 6, 2024

I now created a self-contained Github repo that contains the same code for headless/offscreen VSG rendering that I already posted above.

https://github.com/drywolf/vsg_amd_perf
(this is using vcpkg to fetch VSG, so there should be little to no extra effort needed to build this)

Additionally I also added another minimal VSG app that is rendering to a vsg::Window / Swapchain.

  • the code is almost identical to vsghelloworld.cpp
  • except that I removed the loading of 3D files, so this app also is just rendering a blank/color-cleared window
  • I also set the swapchain presentMode to VK_PRESENT_MODE_IMMEDIATE_KHR to uncap the framerate
  • this app is showing exactly the same issues as discussed above
  • but additionally in the Task-Manager it is also showing ~50% 3D workload 🙈

PS:
The Windowed-App is now also working with some of the AMD profiling tools.
I only had a first quick chance to do some profiling, but at first glance these tools are not showing me any obvious Copy workloads / bottlenecks.
But the framerate in this App is similarly low as seen above in the Offscreen/Headless rendering tests.

@robertosfield
Collaborator

Another thing you could look at is whether the windowing system is doing compositing, in which case the application is rendering to a buffer that is then used by the compositor as input. Fullscreen without window decoration should bypass the compositor, but it is down to the OS/drivers to implement this properly.

I'll have to defer to Windows devs to give guidance on how to control the Windows desktop composition and driver settings, as I'm only an occasional Windows user with no expertise on the platform.

@drywolf
Author

drywolf commented Jun 6, 2024

> Another thing you could look at is whether the windowing system is doing compositing

I disabled all Windows 11 advanced compositing options (following this guide) and ran the app in fullscreen mode, by setting windowTraits->fullscreen = true
This made no FPS difference on AMD ... the 3D & Copy workloads remained pretty much unchanged (3D ~20% ... Copy ~64%, framerate still low at around 350-360 FPS)

@robertosfield
Collaborator

Another variable you could experiment with is different formats for the colour and depth buffers, perhaps the defaults chosen by the VSG are tripping up the driver into a slower path on this particular hardware/driver combination.

@drywolf
Author

drywolf commented Jun 6, 2024

Yeah that's a good idea 👍
I already did this yesterday with VK_FORMAT_R8_UNORM ... it was giving me the same results as for the default VK_FORMAT_R8G8B8A8_UNORM
But I will try some other formats as well, just in case there might be some insight to be gained.

PS: the offscreen_perf app in the example repo now is only using a color-attachment, and no depth-attachment anymore.
So the issue is still present without any depth-stencil rendering. (for the offscreen/headless case)

@drywolf
Author

drywolf commented Jun 6, 2024

I now tried a couple more VkFormats, and none of them showed any significant difference in performance.

@Slaw6820

https://github.com/vsg-dev/VulkanSceneGraph/blob/8a229b30637eea6fcfd9ace3d0745415dd563d7a/include/vsg/vk/CommandPool.h#L35 

Here the VK_COMMAND_POOL_RESET_RELEASE_RESOURCES_BIT flag is set, so resources are freed and reallocated in every frame. With the flag set to 0 instead, I got the same performance as on NV.

My guess is that Nvidia optimizes it and skips the flag so there is no resource reallocation on NV.
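
For context, this flag is the last argument of `vkResetCommandPool`: passing `VK_COMMAND_POOL_RESET_RELEASE_RESOURCES_BIT` asks for the pool's resources to be recycled back to the system, while passing 0 lets the pool keep its memory for reuse. How a driver implements either path is up to the driver. The following is a toy model in standard C++ only, not Vulkan or VSG code (the `ToyPool` type and `allocationsForFrames` are hypothetical); it merely illustrates why releasing on every reset can turn one allocation into one allocation per frame.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>

// Toy stand-in for a command pool: a single backing block that grows on demand.
// Contents are not preserved when growing; the model only counts allocations.
class ToyPool
{
public:
    // Reserve `bytes` from the pool, allocating a new backing block if needed.
    void record(std::size_t bytes)
    {
        assert(bytes > 0);
        if (used_ + bytes > capacity_)
        {
            capacity_ = used_ + bytes;
            block_ = std::make_unique<unsigned char[]>(capacity_);
            ++systemAllocations_;
        }
        used_ += bytes;
    }

    // Reset the pool. With `releaseResources` the backing block is handed back,
    // loosely mirroring VK_COMMAND_POOL_RESET_RELEASE_RESOURCES_BIT; without it
    // the block is kept and reused by the next frame.
    void reset(bool releaseResources)
    {
        used_ = 0;
        if (releaseResources)
        {
            block_.reset();
            capacity_ = 0;
        }
    }

    int systemAllocations() const { return systemAllocations_; }

private:
    std::unique_ptr<unsigned char[]> block_;
    std::size_t capacity_ = 0;
    std::size_t used_ = 0;
    int systemAllocations_ = 0;
};

// Run `frames` iterations of record-then-reset and report how many
// backing allocations were needed in total.
int allocationsForFrames(int frames, bool releaseOnReset)
{
    ToyPool pool;
    for (int i = 0; i < frames; ++i)
    {
        pool.record(4096); // pretend each frame records 4 KiB of commands
        pool.reset(releaseOnReset);
    }
    return pool.systemAllocations();
}
```

In this model `allocationsForFrames(1000, true)` performs 1000 allocations while `allocationsForFrames(1000, false)` performs one. Whether the AMD Windows driver actually behaves like this is the guess from the comment above, not something the model can confirm.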

@robertosfield
Collaborator

robertosfield commented Jul 18, 2024 via email
