
Garbage collector in Ginkgo #993

Closed
pratikvn opened this issue Mar 23, 2022 · 12 comments
Assignees
Labels
is:idea Just a thought - if it's good, it could evolve into a proposal. is:proposal Maybe we should do something this way.

Comments

@pratikvn
Member

In many cases, especially when doing things in an asynchronous fashion, there is a need to avoid synchronizing with respect to either the host or the default execution stream (context). This is especially relevant on GPU devices. Most operations, such as kernel launches and device/host to device/host memory copies, can be made asynchronous with respect to the host/default stream because they take a stream/execution context to perform the operation on. However, two kinds of functions are not necessarily asynchronous:

  1. Allocation functions (cudaMalloc, hipMalloc etc)
  2. Free functions (cudaFree, hipFree etc).

From empirical observations, it looks like many of the programming models relevant to us (CUDA, HIP and DPCPP) can by default execute malloc functions in an asynchronous fashion, but free functions necessarily synchronize to ensure correctness and prevent use-after-free issues. Below is a simple reproducer that demonstrates this observation with the CUDA programming model. The same can be reproduced for the HIP programming model with s/cuda/hip (and the corresponding HIP header).

#include <cuda_runtime.h>


__global__ void kernel(float* data, int size)
{
#pragma unroll
    for (int i = 0; i < size; i++) {
        data[i] = .5 + i;
    }
}


void free_synchronize()
{
    cudaStream_t stream1, stream2;
    cudaStreamCreateWithFlags(&stream1, cudaStreamNonBlocking);
    cudaStreamCreateWithFlags(&stream2, cudaStreamNonBlocking);
    const size_t length = 1 << 10;
    float* dev_a1 = 0;
    float* dev_a2 = 0;
    for (int i = 0; i < 10; i++) {
        // interleaved cudaFree calls: each free synchronizes the whole device,
        // serializing the work submitted to the two streams
        cudaMalloc(&dev_a1, length * sizeof(float));
        kernel<<<1, 1, 0, stream1>>>(dev_a1, length);
        cudaFree(dev_a1);
        cudaMalloc(&dev_a2, length * sizeof(float));
        kernel<<<1, 1, 0, stream2>>>(dev_a2, length);
        cudaFree(dev_a2);
    }
}

int main()
{
    free_synchronize();
    cudaDeviceSynchronize();
    cudaDeviceReset();
    return 0;
}

resulting in a timeline that looks like this:
[timeline image: sync_free]

You can see that the streams are serialized here due to the interleaved cudaFree calls. If we, for example, change the loop to:

for (int i = 0; i < 10; i++) {
    cudaMalloc(&dev_a1, length * sizeof(float));
    kernel<<<1, 1, 0, stream1>>>(dev_a1, length);
    cudaMalloc(&dev_a2, length * sizeof(float));
    kernel<<<1, 1, 0, stream2>>>(dev_a2, length);
    cudaFree(dev_a1);
    cudaFree(dev_a2);
}

you get concurrent execution of the two streams, but due to the cudaFree calls, each loop iteration is synchronized with respect to the next, and you see a timeline like this:

[timeline image: part_sync_free]

But if you move the cudaFree calls outside the loop, with the mallocs still inside the loop, you can get concurrent execution across loop iterations as well. For example:

for (int i = 0; i < 10; i++) {
    cudaMalloc(&dev_a1, length * sizeof(float));
    kernel<<<1, 1, 0, stream1>>>(dev_a1, length);
    cudaMalloc(&dev_a2, length * sizeof(float));
    kernel<<<1, 1, 0, stream2>>>(dev_a2, length);
}
// note: only the allocations of the last iteration are freed here; the earlier
// ones would have to be collected by the deferred deleter proposed below
cudaFree(dev_a1);
cudaFree(dev_a2);

results in the following timeline:

[timeline image: async_free]

This means that a lot can be gained by deferring cudaFree to a later stage, mainly in terms of asynchronous execution.

To handle this, I would like to propose something like a garbage collector for device-allocated memory. This garbage collector would basically be some kind of deferred deleter for an object, and could be implemented through the deleter semantics, for example within the Array class.

Some things to discuss:

  1. This in some sense departs from scoped object lifetimes, since we deliberately allow allocations to outlive the scope of their objects, which is definitely not desirable in itself.
  2. If large arrays are allocated and then not subsequently freed, we may run into out-of-memory issues much earlier than in the normal case. We could make the "garbage collector" smarter and provide it with a way to free objects when necessary.

I just wanted to throw this idea out, not necessarily to advocate implementing it right now, but if we could live with some of the disadvantages, I think this could be a nice thing to have.

An additional note: I believe CUDA and HIP libraries such as cublas, hipblas etc. could be using their handle objects (cublasHandle_t, hipblasHandle_t) to implement some variant of this, so that the allocation is done when necessary, but the free is only done at the end when destroying the handle.

@pratikvn pratikvn added the is:idea Just a thought - if it's good, it could evolve into a proposal. label Mar 23, 2022
@pratikvn pratikvn self-assigned this Mar 23, 2022
@pratikvn pratikvn added the is:proposal Maybe we should do something this way. label Mar 23, 2022
@upsj
Member

upsj commented Mar 23, 2022

This would not be an issue if we were storing temporary data inside solvers, right?

@pratikvn
Member Author

Yes, in general for any class you could have temporary storage in place of this, but that would mean:

  1. You need to allocate temporary objects in the generate step and store them in the class, well before they are needed and even if they end up not being needed.
  2. Each class will need members for the temporary objects, which have to be manually managed to some extent.
  3. Every time the class needs more temporary data, the allocation has to be updated, resulting in more maintenance for the code.

I think having temporary data inside solvers (or their base classes) is a good first step, but it may not be general enough, and I think this approach is a more general alternative.

@upsj
Member

upsj commented Mar 23, 2022

What you are describing (eliminating all cudaFree calls from hot paths) does not necessarily require garbage collection functionality. The main requirement is pulling (re)allocations of temporary memory out of these hot paths, which would be feasible to achieve if the temporary vectors we are operating on are only allocated if they are not yet present (like the dense cache inside preconditioner::Ilu or distributed::matrix) and reused otherwise. That does not need to happen in the generation. The temporary objects are supposed to be stored in SolverBase or IterativeSolverBase, the reallocation happens automatically if the sizes don't match.
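
A minimal sketch of that caching pattern, with a hypothetical helper type (cached_array is not an existing Ginkgo class; the gko::Array calls are assumptions about the current API): the buffer is only reallocated when the requested size changes, so repeated applies with unchanged sizes never reach cudaMalloc/cudaFree.

#include <ginkgo/ginkgo.hpp>

#include <memory>

// Hypothetical size-cached temporary buffer, kept as a member of a solver
// (or one of its base classes) and reused across apply() calls.
template <typename ValueType>
struct cached_array {
    gko::Array<ValueType>& get(std::shared_ptr<const gko::Executor> exec,
                               gko::size_type size)
    {
        // only (re)allocate when nothing is cached yet or the size changed
        if (!buffer_ || buffer_->get_num_elems() != size) {
            buffer_ = std::make_unique<gko::Array<ValueType>>(exec, size);
        }
        return *buffer_;
    }

    std::unique_ptr<gko::Array<ValueType>> buffer_;
};

A solver would then call something like cache_.get(exec, size) inside apply() instead of constructing a fresh Array every time.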

@yhmtsai
Member

yhmtsai commented Mar 23, 2022

Some issues that may be related:
#654
#652

  1. No, the allocation can happen at the beginning of apply; the destruction will happen when the object is destructed.
  2. I think so, or we need a memory pool in the executor and then handle all memory frees there.
    If we only focus on some utility functions, we can use static storage for the temporary memory: it reallocates (including the free) when more is needed, but does not reallocate when less is needed.
  3. Same as 1.

@pratikvn
Member Author

Yes, I agree that if each class (or one of its base classes) has temporary array members, these can be used instead, and for the most part the behaviour would be equivalent (except for the deferred/selective allocation). The management for this would also have to be done individually for each class. In this case, the allocation can happen either in the generate or the (first) apply. I think that is just an implementation detail and not too important, at least for this discussion.

What I wanted to propose is more on an object level rather than a class level, in which the object takes care of the deallocation rather than having class-level temporaries. This might allow us to have more expressive code and slightly lower memory usage.

For example, consider the case below:

#include <ginkgo/ginkgo.hpp>


__global__ void kernel(int* data)
{
    // Do something
}

int main()
{
    // Initialize executor, streams and setup
    // ...
    {
        // arr1 is deallocated when exiting this scope,
        // i.e. a cudaFree is called.
        gko::Array<int> arr1(cuda, 100);

        kernel<<<1, 1, 0, stream1>>>(arr1.get_data());
    }

    // Despite operating on different data and on different
    // streams, the kernels and streams are synchronized by that cudaFree.
    gko::Array<int> arr2(cuda, 100);
    kernel<<<1, 1, 0, stream2>>>(arr2.get_data());
}

Now imagine something like a garbage collector (probably not a good name) that does not necessarily call cudaFree when exiting the scope. I think we can probably achieve this with a custom deleter which defers the deletion. Then we could write:

#include <ginkgo/ginkgo.hpp>


__global__ void kernel(int* data)
{
    // Do something
}

int main()
{
    // Initialize executor, streams and setup
    // ...
    {
        // Now arr1 is not deallocated at scope exit;
        // the deletion is deferred instead.
        gko::Array<int> arr1(cuda, 100, deferring_deleter);

        kernel<<<1, 1, 0, stream1>>>(arr1.get_data());
    }

    // With no synchronizing call between the kernels,
    // and given that they are on different streams, the operations
    // execute asynchronously.
    gko::Array<int> arr2(cuda, 100);
    kernel<<<1, 1, 0, stream2>>>(arr2.get_data());
}

There are multiple ways this deferring deleter could be implemented:

  1. Some kind of executor function which tells the executor to free all the memory allocated with a deferring deleter at some point when needed (a rough sketch of this option follows below).
  2. The Array class itself having an explicit deferred free method. In case the object is never freed, to make sure there are no memory leaks, it can be explicitly freed when the executor is destructed.
  3. Carrying some kind of "global" state with a struct or a flag and using that to decide when to free the object(s).
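
As a very rough sketch of option 1, here is what an executor-side deferred deleter could look like. All names here (deferred_free_queue, deferring_deleter, and the idea that Array accepts such a deleter) are invented for illustration and are not existing Ginkgo API:

#include <functional>
#include <mutex>
#include <utility>
#include <vector>

// Hypothetical executor-owned queue: free calls are collected here and only
// executed when flush() runs, e.g. from synchronize() or from the executor's
// destructor, so no cudaFree happens inside the hot path.
class deferred_free_queue {
public:
    void defer(std::function<void()> free_fn)
    {
        std::lock_guard<std::mutex> guard(mutex_);
        pending_.push_back(std::move(free_fn));
    }

    // Runs all deferred free calls in one batch.
    void flush()
    {
        std::lock_guard<std::mutex> guard(mutex_);
        for (auto& free_fn : pending_) {
            free_fn();
        }
        pending_.clear();
    }

private:
    std::mutex mutex_;
    std::vector<std::function<void()>> pending_;
};

// Hypothetical deleter an Array could be constructed with: letting the array
// go out of scope only enqueues the free call instead of executing it.
template <typename ValueType>
struct deferring_deleter {
    void operator()(ValueType* ptr) const
    {
        queue->defer([ptr, do_free = raw_free] { do_free(ptr); });
    }

    deferred_free_queue* queue;
    std::function<void(ValueType*)> raw_free;  // e.g. wrapping cudaFree
};

An executor could then call flush() inside synchronize() and in its destructor, so nothing leaks even if the user never triggers the deferred frees explicitly.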

@upsj
Member

upsj commented Mar 24, 2022

Thank you for the clarification! That is a much more limited view that I could potentially get on board with. It slightly reminds me of Herb Sutter's deferred_ptr proposal. I think the right place to do this kind of deferred deletion would be the Executor, since it outlives all other objects involved. We could have a deletion queue in our executors that is drained regularly, e.g. by running all deferred free calls in a flush function that is called on synchronize, or after a certain amount of time has passed inside any other member function of the executor. That makes use-after-free bugs harder to detect though, so it may complicate things w.r.t. debugging.
On the other hand, it fits really well with how SYCL should be doing things, i.e. submitting deletion to a queue.

The other question would be how this can be integrated into Array. We could either defer all free calls, or add a flag/custom deleter that calls raw_free_deferred, but that also means we need to talk about the value semantics of Array again 🙁

On terminology: garbage collection usually refers to a mechanism that marks objects for deletion based on their reachability; what you are describing is slightly different, since the Array objects notify the Executor directly of their destruction, so this is more like a memory pool without data reuse. Extending the allocator functionality of Executor is definitely worth investigating though, especially for small objects.
I guess @tcojean might also have some experience with this from the Runtime Systems perspective?

In general, I am not sure whether this is the right solution to your problem though: Due to the use of Array, all of our free calls are paired with malloc calls inside a hot loop, which is not a good basis especially if you do a large number of iterations. So just deferring the free calls will probably not be sufficient in the long run?

@tcojean
Member

tcojean commented Mar 24, 2022

Thanks for pinging me @upsj. When I saw the title I wanted to comment already :). In general, I think this is a good idea or at least it might be necessary. On the other hand, I think a proper implementation, like in many runtime systems, might be a bit tricky and we would need to find the right balance in terms of complexity/features. I also wonder if there's any library we can just reuse for this as that might be easier, but I did not check for what exists.

This essentially boils down to adding a memory pool/memory cache, but it needs:

  • One pool per memory space (thus, we need memory spaces to begin with). And they need to be thread safe, etc.
  • Under constrained memory systems like a GPU, you might need to deallocate things when allocating: if a user requests 20 GB of memory and only 15 GB is free, but there is more than enough "not yet deleted but practically useless" data around, then you should delete more than 5 GB of data (the allocation should not fail in this case! See the sketch after this list.). The question then is how much to free; it is usually best to be greedy (if there was one allocation, there could be more), but not too greedy or you are back in the previous situation.
    • Note that this system would fail in an MPI setting if multiple processes share the same GPU (you cannot get a global view of the full GPU memory).
    • It similarly fails if you have concurrent applications running on the same device/memory. This is actually a big problem, as I believe that is usually our use case: Ginkgo is an underlying library for other applications, so it is hard to get a good memory view that stays up to date in a lightweight fashion, but that is required to ensure correctness of such a system...
  • Finally, there is a usage problem: if the user keeps a dangling pointer to some Array data, they can technically still use it until the deferred free happens... Thankfully I don't think we use raw pointers anywhere in the user interface, but if we did, we would probably also need some way for the user to mark the data as now useless.
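
To illustrate the second point, here is a sketch of what such an allocation path could look like, reusing the hypothetical deferred_free_queue from the sketch further up: the allocation is retried after draining the deferred frees, so the deferral alone never turns a previously working allocation into an out-of-memory failure.

#include <cuda_runtime.h>

#include <cstddef>
#include <new>

// Hypothetical allocation path for a constrained-memory device: only if the
// first attempt fails do we drain the deferred frees and retry once.
void* alloc_with_eviction(std::size_t num_bytes, deferred_free_queue& queue)
{
    void* ptr = nullptr;
    if (cudaMalloc(&ptr, num_bytes) == cudaSuccess) {
        return ptr;
    }
    cudaGetLastError();  // clear the out-of-memory error of the failed attempt
    queue.flush();       // release everything whose deletion was only deferred
    if (cudaMalloc(&ptr, num_bytes) != cudaSuccess) {
        throw std::bad_alloc{};
    }
    return ptr;
}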

On the flip side, that would open up the ability to have permanent work buffers on the devices which we can reuse between kernels, managed by that system.

@upsj
Member

upsj commented Mar 24, 2022

I think if we want to take on allocation, one important change would be adding a size parameter to raw_free, otherwise we would need to keep track of allocation sizes ourselves. If we want to build a true memory pool, we would also need size feedback from raw_alloc (how much memory did we actually allocate? Overallocating might make sense in some cases). A really comprehensive overview of good allocator design can be found at https://www.youtube.com/watch?v=LIb3L4vKZ7U
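
To make that direction concrete, a hypothetical shape for such a size-aware interface (the names and signatures below are invented for illustration, not the current Executor API):

#include <cstddef>

// Hypothetical size-aware allocation interface: raw_free receives the size it
// releases, and raw_alloc reports how much was actually reserved, which may
// exceed the request if the pool decides to overallocate.
struct allocation_result {
    void* ptr;
    std::size_t allocated_bytes;
};

class size_aware_allocator {
public:
    virtual ~size_aware_allocator() = default;

    // Request at least num_bytes; the result reports the true block size.
    virtual allocation_result raw_alloc(std::size_t num_bytes) = 0;

    // Size-aware free, so the pool can track capacity without a side table.
    virtual void raw_free(void* ptr, std::size_t num_bytes) = 0;
};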

@yhmtsai
Member

yhmtsai commented Mar 24, 2022

@pratikvn in your example, if you run the function several times, does arr1 from different runs end up on the same memory or on different memory?

// marks the workspace as unused (or moves the indicator back) when deleted;
// it must not live beyond this scope
auto shared_ptr = exec->get_temporary_workspace<Type>(size);
exec->shrink_to_fit();  // delete all unused temporary workspace

  • The executor handles only one large space: good for reusability, but it is hard to adjust the memory size while something is still alive, and it is only available to one thread.
  • The executor handles a set or map of memory blocks/Arrays: good for memory control, but do we need to take care of reusability? (e.g. hand out an unused Array when its size matches or is larger?)

@lahwaacz
Contributor

Have you considered using cudaMallocAsync + cudaFreeAsync to solve the problem described in the original post? You can even use cudaFreeAsync to free a pointer allocated by cudaMalloc.
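
For context, here is a sketch of the reproducer loop from the original post rewritten with stream-ordered allocation (available since CUDA 11.2): both the allocation and the deallocation are enqueued on the kernel's stream, so neither call forces a device-wide synchronization.

#include <cuda_runtime.h>

// same kernel as in the original reproducer
__global__ void kernel(float* data, int size);

void stream_ordered_loop(cudaStream_t stream1, cudaStream_t stream2)
{
    const size_t length = 1 << 10;
    float* dev_a1 = nullptr;
    float* dev_a2 = nullptr;
    for (int i = 0; i < 10; i++) {
        // malloc, kernel and free are all ordered on the same stream,
        // so the two streams can execute fully concurrently
        cudaMallocAsync(reinterpret_cast<void**>(&dev_a1),
                        length * sizeof(float), stream1);
        kernel<<<1, 1, 0, stream1>>>(dev_a1, static_cast<int>(length));
        cudaFreeAsync(dev_a1, stream1);
        cudaMallocAsync(reinterpret_cast<void**>(&dev_a2),
                        length * sizeof(float), stream2);
        kernel<<<1, 1, 0, stream2>>>(dev_a2, static_cast<int>(length));
        cudaFreeAsync(dev_a2, stream2);
    }
}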

@upsj
Member

upsj commented Nov 19, 2023

We have supported this since #1315, and I believe most of the allocation overhead concerns were addressed by #1028. Not sure if we need to keep this open?

@upsj
Member

upsj commented Nov 20, 2023

Closing this after discussion in the group

@upsj upsj closed this as completed Nov 20, 2023