[DML] Tiled memory manager #20303
base: main
Conversation
This looks very impressive; I'm curious why nobody from the project has responded.
Hi @axodox, sorry for the delay, this PR slipped under the radar since nobody was assigned to it and it wasn't triaged (we recommend putting the …).

We had a similar PR (without the memory eviction) a while ago that we had to pause because of unintended performance regressions on a bunch of hardware (especially older hardware), and some crashes with the legacy WinML API that is still being used in the wild, which assumes that the DML EP uses a committed resource. Unfortunately this WinML API doesn't have great testing within onnxruntime, so it's hard to know whether anything breaks without manually testing it.

With just a first glance at your PR, I can already see one thing that would break: it doesn't have a

static bool GetTilingEnabled(ID3D12Device* device)
{
    D3D12_FEATURE_DATA_D3D12_OPTIONS options = {};
    if (SUCCEEDED(device->CheckFeatureSupport(D3D12_FEATURE_D3D12_OPTIONS, &options, sizeof(options))))
    {
        return options.TiledResourcesTier >= D3D12_TILED_RESOURCES_TIER_1;
    }
    return false;
}
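For illustration, one way such a check could gate the allocator choice; the CreateBufferAllocator factory, the IBufferAllocator interface, and the allocator constructors below are assumptions for the sketch, not the actual ORT code:

#include <d3d12.h>
#include <memory>

// Hypothetical wiring: keep the existing committed-resource allocator as the
// default/fallback on hardware without tiled resource support.
std::unique_ptr<IBufferAllocator> CreateBufferAllocator(ID3D12Device* device)
{
    if (GetTilingEnabled(device))
    {
        return std::make_unique<TiledBufferAllocator>(device);
    }
    return std::make_unique<BucketizedBufferAllocator>(device);
}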
@PatriceVignola On the software compatibility side, if you have suggestions on which models / applications I should manually test the change with, I can do a run on those and fix the issues I uncover. On the hardware compatibility side, I will need to rely on others, as I do not have access to a wide array of hardware to test on.

My suggestion for compatibility is that I update the code a bit so it can switch the memory manager at runtime. Then I can make it use the existing committed memory manager as the default, and make the tiled memory manager an experimental option of the executor. This would avoid both the WinML compatibility issues and the possible performance degradation problems, in exchange for a somewhat larger codebase. It would also allow us to collect more data on the change and fix possible problems. With this option to select the memory manager, comparing performance and compatibility would also be more straightforward.

Please let me know if the above could work for you.
@oysteinkrog To answer your question I will need to refamiliarize myself with the DML code a bit - I did this whole thing in a few days around half a year ago and have not worked on the codebase since then (or before).

What I remember is that during the memory manager implementation I noticed that the fences (memory barriers) are used very conservatively in the codebase. I have seen that the GPU is not utilized to the degree I expect during inference, and I suspected that this could be caused by the overly conservative fences, either because of synchronization overhead or because some steps are too small, leaving much of the GPU idle since parallelization is prevented by the fences. This is just a theory though; I have not measured it yet.

I also thought about running the same session in parallel, but I thought that investing time into making a single session run faster - that is, more parallel internally - is a better way to spend effort on the issue, since that also helps the single-session case, and if we can fully utilize the GPU, you can simply run the model multiple times back to back and get the same result in the same time. However, I decided to wait for a review of my PRs before starting a new one. This one already builds on the previous memory eviction one, which allows switching quickly between sessions by evicting them from VRAM when not in use.

My idea so far has 3 steps:
@axodox The other PR (#16634) already makes use of both the legacy allocator and a new placed/reserved resource allocator. The only thing missing from your proposal is the opt-in option, but we already handle the WinML changes by using the old allocator. When I initially made the PR, it was passing all tests, but it was crashing on some software vendor scenarios/models that we do not easily have access to for debugging, so we were not able to keep going forward. As I mentioned, there were also a bunch of performance regressions, and due to the lack of resources internally we had to keep the change on hold.

At the very least I would like to compare the 2 PRs and see if there's a way to get the best parts of both into a complete change. Your approach to tiled resource allocation is slightly different from mine (I was going the route of inheriting the existing BFC allocator in ORT, which adds to the complexity). The first step would be to resolve the merge conflicts, move the placed resource fallback from the other branch into your branch, and run both branches in our perf lab as well as through conformance testing. It's something I can start working on in between tasks on my end and see how it goes.

As for the resource barriers, you're right that they're relatively conservative within ORT. But the best use cases for DML rely on fusing all operations into a single fused node that gets processed entirely within the DirectML binary, where barriers allow for far better parallelization. Improving the barriers in ORT would help the naive eager mode cases, but it wouldn't help the performant use cases for DML.

EDIT: One more thing, I think it would be better if we kept the memory eviction API in a separate PR.
@PatriceVignola I have looked into updating the branch to latest main, but I have found that I cannot build matmul_nbits.cc (in onnxruntime_providers) with the latest Visual Studio due to a number of errors (see attached). Reverting this and this recent commit resolves the issue. (The issue only affects the mentioned subproject; the DML provider project still builds, though that is not super useful without onnxruntime_providers.)

EDIT: In the meantime, I have created this work branch, it has the following:
Please note that this new branch does NOT have the reverts mentioned above; this is so I can later merge it into the current PR branch without side effects. I did a quick smoke test with my app: I can enable / disable the new allocator properly, and the optimization still works properly with SDXL. It looks like the new evict-less version of the change is already enough to remove the 10-15 second freeze on my PC caused by VRAM exhaustion when running the SDXL VAE decode model at 768x768.

I have thought about the placed resource fallback you suggested, but it might require some more work. The current allocator algorithm exploits the fact that reserved resources do not need to be contiguous in memory; they do not even have to be in the same heap. This is useful as memory fragmentation is not really a problem: the algorithm will prefer contiguous memory ranges, but it will use fragmented memory regions if that avoids additional heap allocations.

Another thing to note is that the evict API had a synergy with this change, which I have now removed. During eviction the reserved resources (and also the heaps) were destroyed; otherwise the resources were put in a pool for reuse (this is needed to save virtual address space). Without the evict API these are now only released when the session is destroyed. This is an issue, since heaps hold GPU memory, while reserved resources still use virtual address space, which is not infinite either - I think on my machine it is 2 TB. I think a simple solution could be to add a counter which schedules such resources for deletion if they have not been used for an entire session execution. I could apply a similar tactic to heaps, releasing them if memory use has decreased for a while. (This is mainly an issue for models with variable-size inputs; if you run the same session with the same resource dimensions, it is not a problem.)
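To make the counter idea concrete, here is a rough sketch of such run-based reclamation; the names and the exact policy are illustrative, not the PR's implementation:

#include <cstdint>
#include <d3d12.h>
#include <list>
#include <wrl/client.h>

// Illustrative only: pooled reserved resources remember the last session run in
// which they were used; anything untouched for a full run is released to give
// back virtual address space (and, for heaps, actual GPU memory).
struct CachedResource
{
    Microsoft::WRL::ComPtr<ID3D12Resource> resource;
    uint64_t lastUsedRun = 0;
};

class ReclamationSketch
{
public:
    void MarkUsed(CachedResource& entry) { entry.lastUsedRun = m_currentRun; }

    void OnSessionRunCompleted()
    {
        ++m_currentRun;
        // Drop cached resources that were not reused during the run that just finished.
        m_cache.remove_if([this](const CachedResource& entry)
        {
            return entry.lastUsedRun + 1 < m_currentRun;
        });
    }

private:
    uint64_t m_currentRun = 1;
    std::list<CachedResource> m_cache;
};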
FYI I have now moved all changes described above into this very PR. No need to check those separately.
See: #22265
And now I have also added code to reclaim unused cached resources and heap space if the required memory drops. It seems to work properly in my app; I tried multiple workflows and it behaved as expected. Let me know what you find with it.
Description
This change introduces a new memory manager for the DirectML executor with the primary goal of reducing VRAM usage, to allow models to run on lower-spec computers and achieve faster inference speeds. The actual memory savings will vary depending on the model used, for example:
The savings are achieved by utilizing reserved (aka. tiled) resources for storing pooled data, which allows assigning multiple resources to the same physical VRAM as long as they are not used at the same time.
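To illustrate the aliasing mechanism, below is a minimal D3D12 sketch (not code from this PR) that maps two reserved buffers onto the same heap tiles; error handling is omitted, and the aliasing is only valid because the two buffers are never in use at the same time:

#include <d3d12.h>
#include <wrl/client.h>
using Microsoft::WRL::ComPtr;

void AliasTwoBuffersOnOneHeap(ID3D12Device* device, ID3D12CommandQueue* queue, UINT64 sizeInBytes)
{
    // Reserved resources are created without any backing memory.
    D3D12_RESOURCE_DESC desc = {};
    desc.Dimension = D3D12_RESOURCE_DIMENSION_BUFFER;
    desc.Width = sizeInBytes;
    desc.Height = 1;
    desc.DepthOrArraySize = 1;
    desc.MipLevels = 1;
    desc.SampleDesc.Count = 1;
    desc.Layout = D3D12_TEXTURE_LAYOUT_ROW_MAJOR;
    desc.Flags = D3D12_RESOURCE_FLAG_ALLOW_UNORDERED_ACCESS;

    ComPtr<ID3D12Resource> bufferA, bufferB;
    device->CreateReservedResource(&desc, D3D12_RESOURCE_STATE_COMMON, nullptr, IID_PPV_ARGS(&bufferA));
    device->CreateReservedResource(&desc, D3D12_RESOURCE_STATE_COMMON, nullptr, IID_PPV_ARGS(&bufferB));

    // One heap large enough to back either buffer; tiles are 64 KB each.
    UINT tileCount = static_cast<UINT>(
        (sizeInBytes + D3D12_TILED_RESOURCE_TILE_SIZE_IN_BYTES - 1) / D3D12_TILED_RESOURCE_TILE_SIZE_IN_BYTES);
    D3D12_HEAP_DESC heapDesc = {};
    heapDesc.SizeInBytes = UINT64(tileCount) * D3D12_TILED_RESOURCE_TILE_SIZE_IN_BYTES;
    heapDesc.Properties.Type = D3D12_HEAP_TYPE_DEFAULT;
    heapDesc.Flags = D3D12_HEAP_FLAG_ALLOW_ONLY_BUFFERS;
    ComPtr<ID3D12Heap> heap;
    device->CreateHeap(&heapDesc, IID_PPV_ARGS(&heap));

    // Map both buffers onto the same physical tiles; they now share the same VRAM.
    D3D12_TILE_RANGE_FLAGS rangeFlags = D3D12_TILE_RANGE_FLAG_NONE;
    UINT heapRangeStart = 0;
    queue->UpdateTileMappings(bufferA.Get(), 1, nullptr, nullptr, heap.Get(), 1,
                              &rangeFlags, &heapRangeStart, &tileCount, D3D12_TILE_MAPPING_FLAG_NONE);
    queue->UpdateTileMappings(bufferB.Get(), 1, nullptr, nullptr, heap.Get(), 1,
                              &rangeFlags, &heapRangeStart, &tileCount, D3D12_TILE_MAPPING_FLAG_NONE);
}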
This is a somewhat complex topic, so I have added a detailed explanation below.
Motivation and Context
The existing solution
The new memory manager addresses a significant weakness in the current DML memory allocator, called BucketizedBufferAllocator. The bucketized allocator creates a list of buckets; each bucket stores a number of power-of-two sized buffers, which can be reused multiple times as the temporary memory required by different operators. The issue manifests when a network uses a larger amount of temporary memory and the size of the allocations changes during model execution. For example, let's imagine a model with the following allocation pattern: first it needs 100 * 4 MB, then 50 * 8 MB, then 25 * 16 MB. One could expect the required peak VRAM to be 400 MB, since we never need more than that at once, but in actuality the required memory will be 1200 MB: each phase needs 400 MB, but the buffers from earlier phases sit in different buckets and are not reused for the later sizes, so the three phases accumulate.

At first one could think that this is to allow more parts of the model to execute in parallel; however, the DML executor uses barriers which apply to all buffers created by the executor - this applies to all nodes which produce output or use temporary memory - so parallel execution cannot take advantage of these long-lived allocations.
Another issue is that these allocations are committed resources, so a larger allocation cannot later be reused for multiple smaller allocations. Also, since the bucketized allocator rounds allocation sizes up to the next power of two, additional memory is wasted; for example, a 1100 MB allocation would round up to 2048 MB.
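As a small illustration of the rounding (the helper is made up, not the allocator's actual code):

#include <bit>      // std::bit_ceil (C++20)
#include <cstdint>

// Power-of-two bucket rounding as described above.
uint64_t RoundToBucketSize(uint64_t bytes)
{
    return std::bit_ceil(bytes);
}

// RoundToBucketSize(1100ull * 1024 * 1024) == 2048 MB, so almost half of the
// bucket is wasted on padding for this allocation.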
The new solution
The new allocator, called TiledBufferAllocator, uses a different strategy. Instead of building chains of power-of-two blocks of video memory, it builds a single list of heaps which is treated as one large contiguous buffer dynamically assigned to different resources. To take the example above, when switching from 100 * 4 MB to 50 * 8 MB, instead of allocating new memory, the existing 400 MB of heap space is reassigned to fifty 8 MB resources. In fact, we save even more memory than that, since allocations are no longer rounded up to power-of-two blocks, so the 50 * 8 MB could become 50 * 6 MB. The new allocator can also save space by mapping a large buffer to a series of heap segments (which do not need to be contiguous), so if we have two free 8 MB segments in the heap, these can be combined to back a 16 MB allocation.

When a resource is allocated, the program maps it to one or more heaps (using reserved / tiled resources). Upon deallocation it is put aside for later use; if we need the same size of resource again and the current mapping has not been assigned to anything else, it will be reused. This exploits the fact that while there can be thousands of allocations which could be assigned to the heap space in many configurations, in reality allocations conform to a limited set of sizes, and there are repeating patterns of heap usage - I discovered this by data mining the log of allocations and deallocations.
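A simplified sketch of this reuse bookkeeping is below; the class and field names are illustrative, not the actual TiledBufferAllocator internals:

#include <cstdint>
#include <d3d12.h>
#include <unordered_map>
#include <vector>
#include <wrl/client.h>

// A freed reserved resource keeps its tile mapping and is parked in a pool keyed
// by size; a later allocation of the same size can reuse it directly, as long as
// its tiles were not reassigned to another resource in the meantime.
struct PooledResource
{
    Microsoft::WRL::ComPtr<ID3D12Resource> resource;  // reserved (tiled) buffer
    bool tilesStillReserved = true;                   // false once its tiles were given away
};

class TiledPoolSketch
{
public:
    Microsoft::WRL::ComPtr<ID3D12Resource> TryReuse(uint64_t sizeInBytes)
    {
        auto& pool = m_freeBySize[sizeInBytes];
        while (!pool.empty())
        {
            PooledResource candidate = std::move(pool.back());
            pool.pop_back();
            if (candidate.tilesStillReserved)
            {
                // Mapping is intact: reuse without calling UpdateTileMappings again.
                return candidate.resource;
            }
            // Tiles were reassigned elsewhere; this candidate would need a fresh mapping.
        }
        return nullptr;  // caller creates a new reserved resource and maps it to heap space
    }

    void Free(PooledResource&& freed, uint64_t sizeInBytes)
    {
        // Keep the resource (and its mapping) around for a later same-size request.
        m_freeBySize[sizeInBytes].push_back(std::move(freed));
    }

private:
    std::unordered_map<uint64_t, std::vector<PooledResource>> m_freeBySize;
};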
When additional memory is required, the algorithm allocates additional heap space. To prevent many small heap allocations, these are rounded up to larger 64 MB blocks. This latter approach works so well that it can be applied to the static unpooled allocations (which store the model weights) as well: it saves ~1 s when running an SDXL inference (9.6 s => 8.6 s) compared to only applying the new allocator to pooled resources.
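For illustration, the kind of rounding this implies (the constant matches the 64 MB figure above; the helper name is made up):

#include <cstdint>

// Heap space grows in fixed 64 MB blocks to avoid many small heap allocations.
constexpr uint64_t kHeapBlockSize = 64ull * 1024 * 1024;

uint64_t RoundUpToHeapBlocks(uint64_t requiredBytes)
{
    return (requiredBytes + kHeapBlockSize - 1) / kHeapBlockSize * kHeapBlockSize;
}

// e.g. a 70 MB requirement grows the heap list by two 64 MB blocks (128 MB total).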
The existing barriers placed by the DirectML command executor are so conservative that, despite the resource aliasing, no performance degradation is observed.
Results
I have created a number of captures of SDXL DML inference with PIX. On the graphs below I have selected the same sequence of operations:
Before:
After (note the reduced VRAM usage as well as some inference speedup):
As can be seen, the savings are significant, but even more importantly, since I am no longer exhausting the VRAM on the machine, Windows will not hitch and pause when running the VAE or spend much of the time reloading the UNET (consider that reloading takes longer than generating the image itself).