[DML] Tiled memory manager #20303
base: main
Conversation
This looks very impressive; I'm curious why nobody from the project has responded.
Hi @axodox, sorry for the delay, this PR slipped under the radar since nobody was assigned to it and it wasn't triaged (we recommend putting the …).

We had a similar PR (without the memory eviction) a while ago that we had to pause because of unintended performance regressions on a bunch of hardware (especially older hardware), and some crashes with the legacy WinML API that is still being used in the wild, which assumes that the DML EP uses a committed resource. Unfortunately this WinML API doesn't have great testing within onnxruntime, so it's hard to know whether anything breaks without manually testing it.

With just a first glance at your PR, I can already see one thing that would break: it doesn't have a

static bool GetTilingEnabled(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options))))
    {
        return options.TiledResourcesTier >= D3D12_TILED_RESOURCES_TIER_1;
    }
    return false;
}
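For illustration, one way such a check could gate the allocator choice; the CreateBufferAllocator factory, the IBufferAllocator interface, and the allocator constructors below are assumptions for the sketch, not the actual ORT code:

#include <d3d12.h>
#include <memory>

// Hypothetical wiring: keep the existing committed-resource allocator as the
// default/fallback on hardware without tiled resource support.
std::unique_ptr<IBufferAllocator> CreateBufferAllocator(ID3D12Device* device)
{
    if (GetTilingEnabled(device))
    {
        return std::make_unique<TiledBufferAllocator>(device);
    }
    return std::make_unique<BucketizedBufferAllocator>(device);
}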
@PatriceVignola On the software compatibility side, if you have suggestions on which models / applications I should manually test the change with, I can do a run on those and fix the issues I uncover. On the hardware compatibility side, I will need to rely on others, as I do not have access to a wide array of hardware to test on.

My suggestion for compatibility is that I update the code a bit so it can switch the memory manager at runtime. Then I can make it use the existing committed memory manager as the default, and make the tiled memory manager an experimental option of the executor. This would avoid both the WinML compatibility issues and the possible performance degradation problems, in exchange for a somewhat larger codebase. It would also allow us to collect more data on the change and fix possible problems. With this option to select the memory manager, comparing performance and compatibility would also be more straightforward.

Please let me know if the above could work for you.
@oysteinkrog To answer your question I will need to refamiliarize myself with the DML code a bit - I did this whole thing in a few days around half a year ago and have not worked on the codebase since then (or before).

What I remember is that during the memory manager implementation I noticed that the fences (memory barriers) are used very conservatively in the codebase. I have seen that the GPU is not utilized to the degree I expect during inference, and I suspected that this could be caused by the overly conservative fences, either because of synchronization overhead or because some steps are too small, leaving much of the GPU idle since parallelization is prevented by the fences. This is just a theory though; I have not measured it yet.

I also thought about running the same session in parallel, but I thought that investing time into making a single session run faster - that is, more parallel internally - is a better way to spend effort on the issue, since that also helps the single-session case, and if we can fully utilize the GPU, you can simply run the model multiple times back to back and get the same result in the same time. However, I decided to wait for a review of my PRs before starting a new one. This one already builds on the previous memory eviction one, which allows switching quickly between sessions by evicting them from VRAM when not in use.

My idea so far has 3 steps:
@axodox The other PR (#16634) already makes use of both the legacy allocator and a new placed/reserved resource allocator. The only thing missing from your proposal is the opt-in option, but we already handle the WinML changes by using the old allocator. When I initially made the PR, it was passing all tests, but it was crashing on some software vendor scenarios/models that we do not easily have access to for debugging, so we were not able to keep going forward. As I mentioned, there were also a bunch of performance regressions, and due to the lack of resources internally we had to keep the change on hold.

At the very least I would like to compare the 2 PRs and see if there's a way to get the best parts of both into a complete change. Your approach to tiled resource allocation is slightly different from mine (I was going the route of inheriting the existing BFC allocator in ORT, which adds to the complexity). The first step would be to resolve the merge conflicts, move the placed resource fallback from the other branch into your branch, and run both branches in our perf lab as well as through conformance testing. It's something I can start working on in between tasks on my end and see how it goes.

As for the resource barriers, you're right that they're relatively conservative within ORT. But the best use cases for DML rely on fusing all operations into a single fused node that gets processed entirely within the DirectML binary, where barriers allow for far better parallelization. Improving the barriers in ORT would help the naive eager mode cases, but it wouldn't help the performant use cases for DML.

EDIT: One more thing, I think it would be better if we kept the memory eviction API in a separate PR.
@PatriceVignola I have looked into updating the branch to latest main, but I have found that I cannot build matmul_nbits.cc (in onnxruntime_providers) with the latest Visual Studio due to a number of errors (see attached). Reverting this and this recent commit resolves the issue. (The issue only affects the mentioned subproject; the DML provider project still builds, though that is not super useful without onnxruntime_providers.)

EDIT: In the meantime, I have created this work branch, it has the following:
Please note that this new branch does NOT have the reverts mentioned above; this is so I can later merge it into the current PR branch without side effects. I did a quick smoke test with my app: I can enable / disable the new allocator properly, and the optimization still works properly with SDXL. It looks like the new evict-less version of the change is already enough to remove the 10-15 second freeze on my PC caused by VRAM exhaustion when running the SDXL VAE decode model at 768x768.

I have thought about the placed resource fallback you suggested, but it might require some more work. The current allocator algorithm exploits the fact that reserved resources do not need to be contiguous in memory; they do not even have to be in the same heap. This is useful as memory fragmentation is not really a problem: the algorithm will prefer contiguous memory ranges, but it will use fragmented memory regions if that avoids additional heap allocations.

Another thing to note is that the evict API had a synergy with this change, which I have now removed. During eviction the reserved resources (and also the heaps) were destroyed; otherwise the resources were put in a pool for reuse (this is needed to save virtual address space). Without the evict API these are now only released when the session is destroyed. This is an issue, since heaps hold GPU memory, while reserved resources still use virtual address space, which is not infinite either - I think on my machine it is 2 TB. I think a simple solution could be to add a counter which schedules such resources for deletion if they have not been used for an entire session execution. I could apply a similar tactic to heaps, releasing them if memory use has decreased for a while. (This is mainly an issue for models with variable-size inputs; if you run the same session with the same resource dimensions, it is not a problem.)
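To make the counter idea concrete, here is a rough sketch of such run-based reclamation; the names and the exact policy are illustrative, not the PR's implementation:

#include <cstdint>
#include <d3d12.h>
#include <list>
#include <wrl/client.h>

// Illustrative only: pooled reserved resources remember the last session run in
// which they were used; anything untouched for a full run is released to give
// back virtual address space (and, for heaps, actual GPU memory).
struct CachedResource
{
    Microsoft::WRL::ComPtr<ID3D12Resource> resource;
    uint64_t lastUsedRun = 0;
};

class ReclamationSketch
{
public:
    void MarkUsed(CachedResource& entry) { entry.lastUsedRun = m_currentRun; }

    void OnSessionRunCompleted()
    {
        ++m_currentRun;
        // Drop cached resources that were not reused during the run that just finished.
        m_cache.remove_if([this](const CachedResource& entry)
        {
            return entry.lastUsedRun + 1 < m_currentRun;
        });
    }

private:
    uint64_t m_currentRun = 1;
    std::list<CachedResource> m_cache;
};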
FYI I have now moved all changes described above into this very PR. No need to check those separately.
See: #22265
And now I have also added code to reclaim unused cached resources and heap space if the required memory drops. It seems to work properly in my app; I tried multiple workflows and it behaved as expected. Let me know what you find with it.
Description
This change introduces a new memory manager for the DirectML executor with the primary goal of reducing VRAM usage, to allow models to run on lower-spec computers and achieve faster inference speeds. The actual memory savings will vary depending on the model used, for example:
The savings are achieved by utilizing reserved (aka. tiled) resources for storing pooled data, which allows assigning multiple resources to the same physical VRAM as long as they are not used at the same time.
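To illustrate the aliasing mechanism, below is a minimal D3D12 sketch (not code from this PR) that maps two reserved buffers onto the same heap tiles; error handling is omitted, and the aliasing is only valid because the two buffers are never in use at the same time:

#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void AliasTwoBuffersOnOneHeap(ID3D12Device* device, ID3D12CommandQueue* queue, UINT64 sizeInBytes)
{
    // Reserved resources are created without any backing memory.
    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width = sizeInBytes;
    desc.Height = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels = 1;
    desc.SampleDesc.Count = 1;
    desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;
    desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS;

    ComPtr<ID3D12Resource> bufferA, bufferB;
    device->CreateReservedResource(&desc, D3D12_RESOURCE_STATE_COMMON, nullptr, IID_PPV_ARGS(&bufferA));
    device->CreateReservedResource(&desc, D3D12_RESOURCE_STATE_COMMON, nullptr, IID_PPV_ARGS(&bufferB));

    // One heap large enough to back either buffer; tiles are 64 KB each.
    UINT tileCount = static_cast<UINT>(
        (sizeInBytes + D3D12_TILED_RESOURCE_TILE_SIZE_IN_BYTES - 1) / D3D12_TILED_RESOURCE_TILE_SIZE_IN_BYTES);
    D3D12_HEAP_DESC heapDesc = {};
    heapDesc.SizeInBytes = UINT64(tileCount) * D3D12_TILED_RESOURCE_TILE_SIZE_IN_BYTES;
    heapDesc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT;
    heapDesc.Flags = D3D12_HEAP_FLAG_ALLOW_ONLY_BUFFERS;
    ComPtr<ID3D12Heap> heap;
    device->CreateHeap(&heapDesc, IID_PPV_ARGS(&heap));

    // Map both buffers onto the same physical tiles; they now share the same VRAM.
    D3D12_TILE_RANGE_FLAGS rangeFlags = D3D12_TILE_RANGE_FLAG_NONE;
    UINT heapRangeStart = 0;
    queue->UpdateTileMappings(bufferA.Get(), 1, nullptr, nullptr, heap.Get(), 1,
                              &rangeFlags, &heapRangeStart, &tileCount, D3D12_TILE_MAPPING_FLAG_NONE);
    queue->UpdateTileMappings(bufferB.Get(), 1, nullptr, nullptr, heap.Get(), 1,
                              &rangeFlags, &heapRangeStart, &tileCount, D3D12_TILE_MAPPING_FLAG_NONE);
}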
This is a somewhat complex topic, so I have added a detailed explanation below.
Motivation and Context
The existing solution
The new memory manager addresses a significant weakness in the current DML memory allocator, called BucketizedBufferAllocator. The bucketized allocator creates a list of buckets; each bucket stores a number of power-of-two sized buffers, which can be reused multiple times as the temporary memory required by different operators. The issue manifests when a network uses a larger amount of temporary memory and the size of the allocations changes during model execution. For example, let's imagine a model with the following allocation pattern: first it needs 100 * 4 MB, then 50 * 8 MB, then 25 * 16 MB. One could expect the required peak VRAM to be 400 MB, since we never need more than that at once, but in actuality the required memory will be 1200 MB: each phase needs 400 MB, but the buffers from earlier phases sit in different buckets and are not reused for the later sizes, so the three phases accumulate.

At first one could think that this is to allow more parts of the model to execute in parallel; however, the DML executor uses barriers which apply to all buffers created by the executor - this applies to all nodes which produce output or use temporary memory - so parallel execution cannot take advantage of these long-lived allocations.
Another issue is that these allocations are committed resources, so a larger allocation cannot later be reused for multiple smaller allocations. Also, since the bucketized allocator rounds allocation sizes up to the next power of two, additional memory is wasted; for example, a 1100 MB allocation would round up to 2048 MB.
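As a small illustration of the rounding (the helper is made up, not the allocator's actual code):

#include <bit>      // std::bit_ceil (C++20)
#include <cstdint>

// Power-of-two bucket rounding as described above.
uint64_t RoundToBucketSize(uint64_t bytes)
{
    return std::bit_ceil(bytes);
}

// RoundToBucketSize(1100ull * 1024 * 1024) == 2048 MB, so almost half of the
// bucket is wasted on padding for this allocation.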
The new solution
The new allocator, called TiledBufferAllocator, uses a different strategy. Instead of building chains of power-of-two blocks of video memory, it builds a single list of heaps which is treated as one large contiguous buffer dynamically assigned to different resources. To take the example above, when switching from 100 * 4 MB to 50 * 8 MB, instead of allocating new memory, the existing 400 MB of heap space is reassigned to fifty 8 MB resources. In fact, we save even more memory than that, since allocations are no longer rounded up to power-of-two blocks, so the 50 * 8 MB could become 50 * 6 MB. The new allocator can also save space by mapping a large buffer to a series of heap segments (which do not need to be contiguous), so if we have two free 8 MB segments in the heap, these can be combined to back a 16 MB allocation.

When a resource is allocated, the program maps it to one or more heaps (using reserved / tiled resources). Upon deallocation it is put aside for later use; if we need the same size of resource again and the current mapping has not been assigned to anything else, it will be reused. This exploits the fact that while there can be thousands of allocations which could be assigned to the heap space in many configurations, in reality allocations conform to a limited set of sizes, and there are repeating patterns of heap usage - I discovered this by data mining the log of allocations and deallocations.
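A simplified sketch of this reuse bookkeeping is below; the class and field names are illustrative, not the actual TiledBufferAllocator internals:

#include <cstdint>
#include <d3d12.h>
#include <unordered_map>
#include <vector>
#include <wrl/client.h>

// A freed reserved resource keeps its tile mapping and is parked in a pool keyed
// by size; a later allocation of the same size can reuse it directly, as long as
// its tiles were not reassigned to another resource in the meantime.
struct PooledResource
{
    Microsoft::WRL::ComPtr<ID3D12Resource> resource;  // reserved (tiled) buffer
    bool tilesStillReserved = true;                   // false once its tiles were given away
};

class TiledPoolSketch
{
public:
    Microsoft::WRL::ComPtr<ID3D12Resource> TryReuse(uint64_t sizeInBytes)
    {
        auto& pool = m_freeBySize[sizeInBytes];
        while (!pool.empty())
        {
            PooledResource candidate = std::move(pool.back());
            pool.pop_back();
            if (candidate.tilesStillReserved)
            {
                // Mapping is intact: reuse without calling UpdateTileMappings again.
                return candidate.resource;
            }
            // Tiles were reassigned elsewhere; this candidate would need a fresh mapping.
        }
        return nullptr;  // caller creates a new reserved resource and maps it to heap space
    }

    void Free(PooledResource&& freed, uint64_t sizeInBytes)
    {
        // Keep the resource (and its mapping) around for a later same-size request.
        m_freeBySize[sizeInBytes].push_back(std::move(freed));
    }

private:
    std::unordered_map<uint64_t, std::vector<PooledResource>> m_freeBySize;
};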
When additional memory is required, the algorithm allocates additional heap space. To prevent many small heap allocations, these are rounded up to larger 64 MB blocks. This latter approach works so well that it can be applied to the static unpooled allocations (which store the model weights) as well: it saves ~1 s when running an SDXL inference (9.6 s => 8.6 s) compared to only applying the new allocator to pooled resources.
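For illustration, the kind of rounding this implies (the constant matches the 64 MB figure above; the helper name is made up):

#include <cstdint>

// Heap space grows in fixed 64 MB blocks to avoid many small heap allocations.
constexpr uint64_t kHeapBlockSize = 64ull * 1024 * 1024;

uint64_t RoundUpToHeapBlocks(uint64_t requiredBytes)
{
    return (requiredBytes + kHeapBlockSize - 1) / kHeapBlockSize * kHeapBlockSize;
}

// e.g. a 70 MB requirement grows the heap list by two 64 MB blocks (128 MB total).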
The existing barriers placed by the DirectML command executor are so conservative that, despite the resource aliasing, no performance degradation is observed.
Results
I have created a number of captures of SDXL DML inference with PIX. On the graphs below I have selected the same sequence of operations:
Before:
After (note the reduced VRAM usage as well as some inference speedup):
As can be seen, the savings are significant, but even more importantly, since I am no longer exhausting the VRAM on the machine, Windows will not hitch and pause when running the VAE or spend much of the time reloading the UNET (consider that reloading takes longer than generating the image itself).