Defragmentation for TLSF algorithm #232

Open · adam-sawicki-a opened this issue Feb 7, 2022 · 25 comments
Labels: bug (Something isn't working), next release (To be done as soon as possible)

@adam-sawicki-a (Contributor)

We plan to implement defragmentation for the TLSF algorithm as well; it is the only missing feature before we can make TLSF the default and delete the old algorithm.

@JustSid, as you authored the current algorithm and use VMA with defragmentation in a big project, I would like to kindly ask you (and everyone interested) for suggestions about this. We are developing the library towards the new major release v3.0.0, which is an opportunity to make some changes that break backward compatibility.

Is there anything you would like to see changed in the current defragmentation API?

Would you help us test the new defragmentation with your project when we have it implemented?

adam-sawicki-a added the feature (Adding new feature) and next release (To be done as soon as possible) labels Feb 7, 2022
@JustSid (Contributor) commented Feb 7, 2022

Absolutely would love to test! We are actually gearing up to release a new major version of our game and are in closed alpha right now, which might be a good vehicle to throw it across a few different machines and setups and see how it holds up.

I'm actually quite happy with the defragmentation as it is (as in, I haven't had a need to touch it since the initial version). We've successfully shipped this to customers together with our general Vulkan update in early 2020 and it's been holding up quite well. We actually defragment a lot while the game is running, since we do a lot of texture and general asset paging. Since we make a flight simulator, the user has the ability to see a huge chunk of the level and also traverse it at very high speeds, requiring us to essentially constantly move assets in and out of VRAM. The thing we rely on is doing the defragmentation work in smaller chunks incrementally to avoid hiccups while rendering. The whole system is geared towards never stopping the world and gracefully handling changing VRAM budgets and requirements (our texture system allows dynamically adjusting resolution based on distance as well as overall VRAM constraints).

@adam-sawicki-a (Contributor Author)

This is great to hear, thanks a lot! By the way, what is your method to stick to the available VRAM budget? Do you have any suggestions on how to change the API for budget and statistics queries? I'm thinking about refactoring this as well.

@JustSid (Contributor) commented Feb 8, 2022

For us, the strategy for sticking to the VRAM budget mostly involves ditching texture mips, since there's a lot of memory to be won with that and our meshes are rarely big enough to make a substantial difference. We end up using the reported budget smoothed out over a couple of frames (since some drivers tend to fluctuate the budget a bit on a frame-by-frame basis, which was introducing instability in the system), minus some fixed red zone that we will never touch (500 MB on a system with 4 GB+ of VRAM, less on smaller systems). The idea is to always have some emergency memory available if we need it, since changing textures first requires making a new image, fully specifying it, and then waiting until the old image is no longer in use before its memory can be reclaimed. By the time we run out of available budget it would be too late to effectively cut memory usage, hence the emergency area.
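
To illustrate the idea, here is a minimal sketch of the smoothing-plus-red-zone logic described above, using VMA's vmaGetHeapBudgets. The smoothing factor, the red-zone thresholds, and the caller-managed smoothedBudget state are illustrative assumptions, not the game's actual implementation.

#include "vk_mem_alloc.h"

// Hypothetical red-zone sizing: 500 MB on systems with 4 GB+ of VRAM,
// half of that on smaller systems.
static VkDeviceSize RedZoneSize(VkDeviceSize heapSize)
{
    const VkDeviceSize fourGiB = 4ull * 1024 * 1024 * 1024;
    const VkDeviceSize fullRedZone = 500ull * 1024 * 1024;
    return heapSize >= fourGiB ? fullRedZone : fullRedZone / 2;
}

// Returns the budget we allow ourselves on one heap: the driver-reported
// budget, smoothed over frames to hide per-frame jitter, minus the red zone.
VkDeviceSize QueryUsableBudget(VmaAllocator allocator, uint32_t heapIndex,
    VkDeviceSize heapSize, double& smoothedBudget)
{
    VmaBudget budgets[VK_MAX_MEMORY_HEAPS] = {};
    vmaGetHeapBudgets(allocator, budgets);

    const double alpha = 0.1; // illustrative smoothing factor
    smoothedBudget += alpha * ((double)budgets[heapIndex].budget - smoothedBudget);

    const VkDeviceSize redZone = RedZoneSize(heapSize);
    const VkDeviceSize budget = (VkDeviceSize)smoothedBudget;
    return budget > redZone ? budget - redZone : 0;
}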

At runtime we then look at the available budget and also our current resource usage to determine whether our current working set is still good or we need to decrease usage. Alternatively, if the working set goes down or the budget goes up, the system might decide that it can safely fit higher-resolution textures and then kicks that off. The key thing is that this is also incremental, so before doing a step the system will check whether it should change its strategy.

Tying it all together, we have a memory manager on top of VMA whose job is to decide who can have memory. It provides an async interface to memory and can pause some allocations if we get into a very high memory pressure situation. It will always allow the pager to get some memory in order to let VRAM pressure decrease, but if the budget enters the red zone it will cut off access to more memory for just about anything that isn't critical. This works well for us because the whole loading mechanism is async and we start getting assets ready well before they need to be on screen, so adding a few frames of delay is a non-issue. It also fits the general Vulkan philosophy of getting things ready before they are needed.

Not sure how generally applicable this approach is to other titles, though. But the red zone could be something VMA offers, together with maybe a smoothed view of the budget. I know VMA already has support for not making more allocations once the budget limit is reached, but for us it's more important to stop earlier and keep some memory around for emergencies. I realize we could just oversubscribe the VRAM a bit, but we started out as a pure GL engine which had constant stutters as things paged around the world, data was moved between VRAM and system memory, or new pipelines were discovered and had to be compiled just in time, so our goal was to absolutely never have hiccups again.

@adam-sawicki-a (Contributor Author)

Thank you for describing this system. It sounds very good.

I guess you use the vmaGetHeapBudgets function but don't use VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT?

How does this system interact with the fact that VMA allocates a whole new big memory block when needed? Did you set a custom, smaller VmaAllocatorCreateInfo::preferredLargeHeapBlockSize?
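
(For context, setting that field at allocator creation looks roughly like the sketch below; the 64 MiB value and the surrounding Vulkan handles are placeholders, and leaving the field at zero keeps the 256 MiB default.)

VmaAllocatorCreateInfo allocatorInfo = {};
allocatorInfo.physicalDevice = physicalDevice; // placeholder handles
allocatorInfo.device = device;
allocatorInfo.instance = instance;
// Ask VMA for smaller device memory blocks than the 256 MiB default.
allocatorInfo.preferredLargeHeapBlockSize = 64ull * 1024 * 1024;

VmaAllocator allocator;
vmaCreateAllocator(&allocatorInfo, &allocator);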

@adam-sawicki-a (Contributor Author)

One more question: do you rely on passing the list of allocations that can be moved, while others stay immovable? Would simplifying the API to consider all allocations in the (default or specified custom) pool movable work for you?

@JustSid (Contributor) commented Feb 8, 2022

That's right, we don't use VMA_ALLOCATION_CREATE_WITHIN_BUDGET_BIT since we are doing our own limiting on allocations. Dealing with the heap block size is actually pretty easy: when looking at the memory usage, we subtract the space that is not currently allocated to anything (using the VmaBudget blockBytes vs. allocationBytes members) and treat that memory as essentially free. I realize that this glosses over things like alignment requirements, but if the defragmenter does its job right and tightly packs everything, this is close enough to working. We do size our red zone as a multiple of the heap block size, though. There is also a flag or setting, which I can't recall off the top of my head, that tells VMA to always allocate max-size blocks instead of slowly ramping up; we set it as well in order to get evenly sized 256 MB chunks of device memory.
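
A sketch of that accounting, assuming the final v3.0 layout where the byte counters live under VmaBudget::statistics (earlier versions had them directly on VmaBudget):

// Treat block space not occupied by any allocation as free: subtract the
// slack (blockBytes - allocationBytes) from the driver-reported usage, as
// if the defragmenter had already packed everything tightly.
VkDeviceSize EffectiveHeapUsage(VmaAllocator allocator, uint32_t heapIndex)
{
    VmaBudget budgets[VK_MAX_MEMORY_HEAPS] = {};
    vmaGetHeapBudgets(allocator, budgets);

    const VmaBudget& b = budgets[heapIndex];
    const VkDeviceSize slack = b.statistics.blockBytes - b.statistics.allocationBytes;
    return b.usage > slack ? b.usage - slack : 0;
}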

With regards to immovable allocations, there actually are some. This is mostly for cross-queue resources, for example buffers/images that are pending an asynchronous transfer, or resources used in async compute that are currently owned by the compute queue. This could be solved by adding queue ownership transfers, but our current design just prefers not touching any of these resources. Another workaround would be to have an extra heap with immovable resources, eat the cost of fragmentation on that one, and make everything in the main heap fair game for moving around.
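
For reference, the queue family ownership transfer mentioned here is a pair of matching barriers, one recorded on each queue. A minimal sketch for an image just uploaded on the transfer queue; the queue family indices, command buffer, and image handle are placeholders:

// Release the image from the transfer queue...
VkImageMemoryBarrier release = { VK_STRUCTURE_TYPE_IMAGE_MEMORY_BARRIER };
release.srcAccessMask = VK_ACCESS_TRANSFER_WRITE_BIT;
release.oldLayout = VK_IMAGE_LAYOUT_TRANSFER_DST_OPTIMAL;
release.newLayout = VK_IMAGE_LAYOUT_SHADER_READ_ONLY_OPTIMAL;
release.srcQueueFamilyIndex = transferQueueFamily;
release.dstQueueFamilyIndex = graphicsQueueFamily;
release.image = image;
release.subresourceRange = { VK_IMAGE_ASPECT_COLOR_BIT, 0, 1, 0, 1 };
vkCmdPipelineBarrier(transferCmd, VK_PIPELINE_STAGE_TRANSFER_BIT,
    VK_PIPELINE_STAGE_BOTTOM_OF_PIPE_BIT, 0, 0, nullptr, 0, nullptr, 1, &release);

// ...then acquire it on the graphics queue with a matching barrier that
// repeats the same queue family indices and layouts (with dstAccessMask =
// VK_ACCESS_SHADER_READ_BIT at the fragment shader stage).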

@adam-sawicki-a (Contributor Author)

With the new commit 88510e9 we implemented the new defragmentation API and algorithm. The documentation is also written. @JustSid, can you please try to use it in your project? Any feedback is welcome.

adam-sawicki-a added the input needed (Waiting for more information) label Feb 22, 2022
@JustSid (Contributor) commented Feb 25, 2022

@adam-sawicki-a I noticed you removed the ability to specify which allocations can be moved, which you mentioned would make the design simpler. In trying to implement this, though, I've run into a small issue: we use the copy queue to upload textures in the background. Our VkImages are created on the copy queue with undefined layout, get their contents copied in from a transfer buffer, and are then released to the graphics queue. Since they are images, our usage for them is queue-exclusive. How can we prevent those images from being moved while they are owned by a background queue?

I have two very similar ideas for how this could work in theory: we could defer execution of the defragmentation steps until all images are uploaded, or alternatively just not ask VMA to defragment anything while we are uploading. Both have the problem that they require locking in our asynchronous texture streaming process, which otherwise can just grab memory, copy the data to the GPU, and then do the queue transfer.

My other suggestion: instead of a whitelist of things allowed to move, would it be possible to provide VMA with a blacklist? We could tell VMA which resources we expect to stay where they are, with the implication that everything else is free to move around. Would that make the implementation on your end easier? This assumes that we only have a few small items we care about not moving while the majority are allowed to move, so it'd be fast-ish to check (assuming we can get away with a cache line or two worth of pointers).

@adam-sawicki-a (Contributor Author)

@JustSid Thank you for your response. We want to make the defragmentation feature efficient and convenient to use.

When you receive the list of allocations to move as the structure VmaDefragmentationPassMoveInfo returned by vmaBeginDefragmentationPass, I guess you look up each individual VmaAllocation in some kind of map, or retrieve its details via vmaGetAllocationInfo and inspect VmaAllocationInfo::pUserData, to know which resource in your engine it is?

If you find out that the specific texture should not be moved currently, you can set VmaDefragmentationPassMoveInfo::pMoves[i].operation = VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE while inside the pass to cancel the movement of this allocation. Would that work for you?
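
A minimal sketch of that cancellation inside a pass, where passInfo is the VmaDefragmentationPassMoveInfo returned by vmaBeginDefragmentationPass; the MyResource cast and the ownership flag are illustrative engine-side bookkeeping, not part of VMA:

for (uint32_t i = 0; i < passInfo.moveCount; ++i)
{
    VmaAllocationInfo allocInfo;
    vmaGetAllocationInfo(allocator, passInfo.pMoves[i].srcAllocation, &allocInfo);

    // Hypothetical engine-side resource record stored in pUserData.
    MyResource* res = (MyResource*)allocInfo.pUserData;
    if (res->ownedByBackgroundQueue)
        passInfo.pMoves[i].operation = VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE;
}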

Regarding queue ownership: I think the recommendation to use VK_SHARING_MODE_EXCLUSIVE and perform the "queue family ownership transfer" is mostly for color-attachment, depth-stencil, and possibly storage images, as VK_SHARING_MODE_CONCURRENT could disable internal GPU compression (DCC) on those, making them inefficient. For regular textures that are just uploaded and then used for sampling it doesn't matter, so you can just use VK_SHARING_MODE_CONCURRENT. Of course, I recommend testing and measuring the performance on a variety of GPUs to make sure.
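
For a texture that is only uploaded and then sampled, that could look like the following; the format, extent, and queue family indices are placeholders:

uint32_t families[] = { graphicsQueueFamily, transferQueueFamily };

VkImageCreateInfo imageInfo = { VK_STRUCTURE_TYPE_IMAGE_CREATE_INFO };
imageInfo.imageType = VK_IMAGE_TYPE_2D;
imageInfo.format = VK_FORMAT_R8G8B8A8_SRGB;
imageInfo.extent = { 1024, 1024, 1 };
imageInfo.mipLevels = 1;
imageInfo.arrayLayers = 1;
imageInfo.samples = VK_SAMPLE_COUNT_1_BIT;
imageInfo.tiling = VK_IMAGE_TILING_OPTIMAL;
imageInfo.usage = VK_IMAGE_USAGE_TRANSFER_DST_BIT | VK_IMAGE_USAGE_SAMPLED_BIT;
// CONCURRENT lets both queues access the image with no ownership transfer
// barriers, at a possible cost to compression on attachments (which does
// not apply to sampled-only textures, per the advice above).
imageInfo.sharingMode = VK_SHARING_MODE_CONCURRENT;
imageInfo.queueFamilyIndexCount = 2;
imageInfo.pQueueFamilyIndices = families;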

@JustSid (Contributor) commented Mar 4, 2022

@adam-sawicki-a This makes a whole lot more sense now, thanks! I was able to get the upstream changes running, although I ran into a couple of issues (see #249). I have not yet run our extensive overnight auto-test suite on this; I'm going to save that for the weekend and then also run it with validation layers and tsan/asan. It also hasn't made it to any of our testers yet, but so far it seems to work without problems.

@JustSid (Contributor) commented Mar 4, 2022

One thing I did notice: I initialize the defragmentation like this:

VmaDefragmentationInfo info = {};
info.maxBytesPerPass = 256 * 1024 * 1024;
info.maxAllocationsPerPass = 32;

Which, if I understand the docs correctly, limits the maximum pass size to 256 MB or 32 moves. However, I see much bigger move counts of 700 and more. Looking at the code, this function seems to be the offender:

bool VmaDefragmentationContext_T::IncrementCounters(uint32_t& allocations, VkDeviceSize bytes)
{
    if (++allocations >= m_MaxPassAllocations || bytes >= m_MaxPassBytes)
    {
        m_Stats.bytesMoved += bytes;
        m_Stats.allocationsMoved += allocations;
        return true;
    }
    return false;
}

This ends up not checking the global stats for the whole defragmentation step, but rather checks them on a per-block level. Is that intentional? If so, I think the docs should be changed. Otherwise, to make this truly per pass, the check should be against the m_Stats values. Bookkeeping in general seems a little off; I have seen VmaDefragmentationStats.bytesMoved underflow when VMA_DEFRAGMENTATION_MOVE_OPERATION_IGNORE is used (while allocationsMoved is not updated). If the intention of the increment function is to be per pass, then it might be best to have it update the stats member exclusively, like this:

bool VmaDefragmentationContext_T::IncrementCounters(VkDeviceSize bytes)
{
    m_Stats.allocationsMoved++;
    m_Stats.bytesMoved += bytes;

    // Compare the accumulated per-pass totals against the limits.
    return m_Stats.allocationsMoved >= m_MaxPassAllocations ||
        m_Stats.bytesMoved >= m_MaxPassBytes;
}

And then removing the currentBytesMoved and currentAllocsMoved bookkeeping from ReallocWithinBlock and friends. I don't know, just spitballing, and you might have different ideas here :) Let me know what you think and I'd be happy to make a pull request for this.

@adam-sawicki-a (Contributor Author)

Thank you for these findings. Your testing of the new defragmentation and the whole library is very valuable to us. We've made some fixes to the code. Please let us know whether they solve the problems you observed.

@adam-sawicki-a (Contributor Author)

I also improved the documentation of defragmentation. I'm happy to hear your opinion about it.
https://gpuopen-librariesandsdks.github.io/VulkanMemoryAllocator/html/defragmentation.html

@adam-sawicki-a (Contributor Author)

@JustSid I'm sorry, but we changed the defragmentation API again. Instead of dstMemory + dstOffset, moves now have dstTmpAllocation, to be used for binding new buffers/images with functions like vmaBindBufferMemory(). It is important for thread safety and synchronization. Can you please test it and see how it works for you? The documentation is also fully updated.
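
For reference, a pass loop with the reworked API looks roughly like the following sketch; CreateReplacementBuffer stands in for engine-specific resource recreation, and the GPU copy submission is elided:

VmaDefragmentationInfo defragInfo = {};
// Leaving defragInfo.pool null defragments the default pools.

VmaDefragmentationContext ctx;
vmaBeginDefragmentation(allocator, &defragInfo, &ctx);

for (;;)
{
    VmaDefragmentationPassMoveInfo pass = {};
    if (vmaBeginDefragmentationPass(allocator, ctx, &pass) == VK_SUCCESS)
        break; // VK_SUCCESS here means there is nothing left to move

    for (uint32_t i = 0; i < pass.moveCount; ++i)
    {
        // Create the replacement resource, bind it to the temporary
        // destination allocation, then record a GPU copy from the old one.
        VkBuffer newBuffer = CreateReplacementBuffer(pass.pMoves[i].srcAllocation);
        vmaBindBufferMemory(allocator, pass.pMoves[i].dstTmpAllocation, newBuffer);
    }

    // Submit and wait for the copies before ending the pass, so VMA can
    // safely free the source memory.
    if (vmaEndDefragmentationPass(allocator, ctx, &pass) == VK_SUCCESS)
        break; // no more passes needed
}

vmaEndDefragmentation(allocator, ctx, nullptr);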

@JustSid (Contributor) commented Mar 11, 2022

@adam-sawicki-a Figured I'd give you a quick heads-up: I was out of the office earlier this week and didn't get around to looking at the most recent changes yet. I'll take a look at them this weekend :)

@JustSid (Contributor) commented Mar 21, 2022

So, later than I meant to, but I actually ended up polishing the defrag implementation in our game a bit. After being a bit hesitant at first about the removal of the whitelist, I think this approach actually makes a lot more sense, and I was able to simplify our code a bit when it came to object tracking. I was also able to run our test suite with and without validation layers, plus letting the AI run through the game continuously over the weekend. So far, it seems rock solid.

One thing I did notice, though: when running with the extensive mode flag, I get a crash in ComputeDefragmentation_Extensive. The vectorState.firstFreeBlock is an insanely high value (0xc21a7fff46950e62 in my last test), but not quite SIZE_MAX. No idea where this is coming from, but that corrupt value is causing a far out-of-bounds memory access to the block data. That's the only thing I have found, though.

@adam-sawicki-a (Contributor Author)

Thank you for checking the new defragmentation. I'm glad to hear that the new API works well for you. We will investigate the issue with the EXTENSIVE algorithm. For now, please use the other defragmentation algorithms.

adam-sawicki-a added the bug (Something isn't working) label and removed the feature and input needed labels Mar 21, 2022

adam-sawicki-a added a commit that referenced this issue Mar 24, 2022:

    Unified across VMA and D3D12MA.
    Updated Python script for visualization - now called GpuMemDumpVis.py.
    Also a fix for a bug in the EXTENSIVE defragmentation algorithm - see #232

    Code by @medranSolus

@medranSolus

Hello @JustSid, and thanks for such a thorough test of the new defragmentation feature! With the new commit 3c6470c, the reported bug in the EXTENSIVE algorithm should be fixed, but I'd be thankful if you could confirm that.

Marek Machliński, Developer Technology Engineer Intern, AMD

@JustSid (Contributor) commented Mar 24, 2022

@medranSolus Yep, I can confirm that this works now.

FWIW, the internal branch with the VMA update is slated to be seeded to our alpha testers early next week. I'll keep an eye on the crash reporter and let you know if there is anything :)

@adam-sawicki-a (Contributor Author)

Thanks a lot. I'm closing the ticket for now. Please feel free to report any further issues or suggestions, here or as new tickets.

@JustSid (Contributor) commented Mar 25, 2022

Actually, sorry to dig this back out, but it looks like the EXTENSIVE flag is still broken; not sure why it worked yesterday when I did a quick test. It's the same spot as before, with last being an incredibly high value and therefore indexing out of bounds.

Also, a quick update: we started seeding this to our alpha testers today, because apparently nobody learned from the past few times we deployed on a Friday. This is without the extensive bit set, and so far there are no crashes.

@medranSolus

Thanks for pointing this out! In the latest commit I found the corner case you describe and added a guard for it; I hope the problem no longer occurs. As before, I'd be thankful for your confirmation, @JustSid.

Marek Machliński, Developer Technology Engineer Intern, AMD

@adam-sawicki-a (Contributor Author)

@JustSid Did you have a chance to test the latest code? We would like to know whether it fixes the problem with the EXTENSIVE defragmentation algorithm in your case.

@JustSid (Contributor) commented Apr 9, 2022

Hey, sorry, I just gave the changes a try. Sadly, I still get the crash; firstFreeBlock is a super high number (18374686483949813760). I'll see if I can find a plausible cause. Stepping through the code isn't easy because of the sheer number of objects and blocks involved.

@adam-sawicki-a (Contributor Author)

Could you please share with us a JSON dump taken before defragmentation? Maybe then we could reproduce the problem.
