Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Possible race condition on VmaAllocator between defragmentation and regular allocations #313

Open
sbourasse opened this issue Jan 18, 2023 · 3 comments
Labels
bug Something isn't working investigating Still to be determined whether we work on this

Comments

@sbourasse
Copy link

We've been using VMA extensively for our memory management on projects with heavy memory load. However, upon integrating the defragmentation utility, we started getting frequent memory corruption on internal VMA data. We narrowed down the source of this corruption to concurrent accesses on our VmaAllocator and (apparently) succeeded in preventing it by blocking allocations for the duration of the defragmentation. More context follows, with a couple questions at the end.

We use a single VmaAllocator from two separate threads. Most of our allocations are done from the rendering thread, which is also the one that runs the defragmentation. We also allocate from the main thread, although less frequently.

Our defragmentation integration looks something like this (it has the same structure as what can be found in the online documentation) :

vmaBeginDefragmentation(mainAllocator, ...);

for (;;)
{
	if (vmaBeginDefragmentationPass(mainAllocator, ...) == VK_SUCCESS)
	   break;

	CreateResourcesAndCopy();
	WaitForCompletion();

	if (vmaEndDefragmentationPass(mainAllocator, ...) == VK_SUCCESS)
		break;
}

vmaEndDefragmentation(mainAllocator, ...);

Running this bit of code on heavy workloads from our rendering thread while our main thread is free to use mainAllocator for other allocations, we reliably trigger memory corruptions, specifically located on the first 4 bytes of random VmaAllocation from our working VmaDefragmentationPassMoveInfo::pMoves array.

Regardless of this particular symptom, we managed to prevent such corruption by implementing a mutex targeting the region between vmaBeginDefragmentation() and vmaEndDefragmentation() as well as other VMA calls on mainAllocator, specifically those that could be made concurrently to defragmentation.

As a side note, we first prevented memory corruption by enabling VMA_DEBUG_GLOBAL_MUTEX, changing VmaMutex::m_Mutex type to std::recursive_mutex, and locking it from our side right after vmaBeginDefragmentation() and releasing it right before vmaEndDefragmentation().

Does it make sense to treat the region from vmaBeginDefragmentation() to vmaEndDefragmentation() as a critical section with regards to the used VmaAllocator ? If so, is it something that could be enforced from within the library ?

@adam-sawicki-a adam-sawicki-a added bug Something isn't working investigating Still to be determined whether we work on this labels Jan 19, 2023
@adam-sawicki-a
Copy link
Contributor

Thank you for reporting this bug and a good description. It sounds very serious. We will investigate it. Any help in debugging it is welcome.

Blocking any usage of the allocator between Begin and EndDefragmentation may be a good workaround for the bug, but it shouldn't be needed. Defragmentation was developed with the assumption that using the allocator while defragmentation is in progress is allowed.

Is your code 32- or 64-bit?

@sbourasse
Copy link
Author

We triggered this issue on x86-64 with an AMD GPU.
We can definitely lend a hand with debugging, just let me know what we can do.

@CannibalVox
Copy link

CannibalVox commented Apr 7, 2023

Forgive me if I'm wrong, but I've been looking at the defrag code recently and the map transferring code seems really sketchy to me.

  • There's a period of time after the allocation data swap but before maps are transferred to the new block when an allocation thinks it has references to a block but doesn't (unmapping during this period of time can cause explosions or cause the reference count to underflow)
  • Copies to the new memory can occur while the same memory is being written elsewhere. Results vary based on memory type
  • New maps can be opened on memory that's already been copied to the new location but before the allocation swap, resulting in a write that isn't actually going to be used. It's possible for the block to be fully unmapped or freed by the defragmenter while this is happening.

It just seems like from a logical standpoint, from the moment the copy begins to the end of the pass, a memory block being moved needs to be unwritable. I'm not saying that VMA needs to do this, it probably makes more sense for us users to take care of that. More generally, the VMA documentation says that allocations are not considered thread safe, and the defrag manipulates them.

Am I wrong here?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working investigating Still to be determined whether we work on this
Projects
None yet
Development

No branches or pull requests

3 participants