Realizing to a crop of a Buffer doesn't work on GPU. #8395

mcourteaux opened this issue Aug 22, 2024 · 6 comments

mcourteaux commented Aug 22, 2024

Here is a repro:

#include "Halide.h"
#include "HalideBuffer.h"

using namespace Halide;

int main(int argc, char **argv) {
    Target target = get_jit_target_from_environment();

    bool hexagon_rpc = (target.arch != Target::Hexagon) &&
                       target.has_feature(Target::HVX);

    if (!hexagon_rpc && !target.has_gpu_feature()) {
        printf("[SKIP] No GPU target enabled.\n");
        return 0;
    }

    Param param(Int(32));
    Func f;
    Var x, y;
    f(x, y) = x * 100 + y + param * 10000;
    f.gpu_single_thread();

    const halide_device_interface_t *itf = get_device_interface_for_device_api(
        get_default_device_api_for_target(target), target);
    Runtime::Buffer<int, 2> buf(20, 20);

    // Problem 1: manually triggering the device allocation is required;
    // otherwise the crops we will make will not alias into the same buffer.
    buf.device_malloc(itf);

    param.set(1);
    f.realize(Pipeline::RealizationArg(buf.cropped({{2, 8}, {4, 4}})), target);
    param.set(2);
    f.realize(Pipeline::RealizationArg(buf.cropped({{0, 1}, {2, 8}})), target);

    // Problem 2: the dirty bits of the crop don't propagate back to the parent,
    // so for copy_to_host() to even consider copying, we need to explicitly tell
    // it that the device is dirty, otherwise it no-ops.
    buf.set_device_dirty(true);

    buf.copy_to_host();
    for (int y = 0; y < buf.dim(1).extent(); ++y) {
        for (int x = 0; x < buf.dim(0).extent(); ++x) {
            printf("% 6d", buf(x, y));
        }
        printf("\n");
    }
    printf("\n");

    printf("Success!\n");
    return 0;
}

Comment out either of the two lines below the // Problem comments to see it fail.


Original context for my use case:
I'm working on a denoiser and was experimenting with denoising in YCoCg colorspace, with different filter banks for Y and for Co/Cg. So naturally, I...
  1. convert the input to YCoCg
  2. make an output buffer that will contain YCoCg (planar layout)
  3. make two crops for the output (channel Y and channels Co/Cg),
  4. run the denoising pipeline separately to produce both crops.
  5. continue working with the original (non-cropped) denoised buffer.
using U16_Image = Halide::Runtime::Buffer<uint16_t, 3>;

// (1) Convert to YCoCg
U16_Image noisy_YCoCg(noisy.width(), noisy.height(), noisy.channels());
neonraw_convert_RGB16_to_YCoCg16R(noisy, noisy_YCoCg);

// (2) Make buffer for the result
U16_Image denoised_YCoCg(noisy.width(), noisy.height(), noisy.channels());

if constexpr (true) {

  // Doesn't work on CUDA, but works on CPU.

  // (3/4a) Realize the pipeline to the Y-crop of the output buffer
  U16_Image denoised_Y = denoised_YCoCg.cropped(2, 0, 1);
  neonraw::run_denoiser(fba.filter_banks[mm], noisy_YCoCg, fib, denoised_Y);

  // (3/4b) Realize the pipeline to the CoCg-crop of the output buffer
  U16_Image denoised_CoCg = denoised_YCoCg.cropped(2, 1, 2);
  neonraw::run_denoiser(fba.filter_banks[mm], noisy_YCoCg, fib, denoised_CoCg);

} else {

  // Does work on both CPU and CUDA.
  neonraw::run_denoiser(
    fba.filter_banks[mm], noisy_YCoCg, guide_YCoCg, fib, denoised_YCoCg
  );

}

// (5) Work with the denoised_YCoCg buffer...

This strategy, as far as I understand, should work; and it does work with CPU-compiled pipelines (everything computed on the host, no device copies). I tried it on CUDA, OpenCL, and Vulkan, and it doesn't work on any of them.

mcourteaux commented Aug 30, 2024

I think I narrowed it down to the scenario where the buffer does not have a device allocation, but you realize to a crop. The cropped buffer sees there is no device allocation, and thus allocates, but only allocates the crop instead of the full buffer.
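To illustrate, a minimal sketch assuming a CUDA-enabled build (halide_cuda_device_interface() comes from HalideRuntimeCuda.h):

#include "HalideBuffer.h"
#include "HalideRuntimeCuda.h"

int main() {
    Halide::Runtime::Buffer<int, 2> parent(20, 20);  // host-only, no device allocation yet

    // Crop made while the parent is host-only: realizing into it on a GPU
    // target device_malloc()s only the crop's 8x4 extent, detached from the
    // parent, so writes never land in the parent's device buffer.
    auto detached = parent.cropped({{2, 8}, {4, 4}});

    // Crop made after the parent has a device allocation: crop_from() runs
    // and the crop aliases the parent's device memory, as expected.
    parent.device_malloc(halide_cuda_device_interface());
    auto aliased = parent.cropped({{2, 8}, {4, 4}});

    return 0;
}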

Also, dirty-bits are not updated on the underlying buffer when the cropped/sliced buffer is made dirty. @abadams Can we discuss this on the dev-meeting? It's failing in many subtle ways; so I think some input will be valuable.

mcourteaux commented Aug 30, 2024

So, just to be clear: to fix my particular issue I did:

// (2) Make buffer for the result
U16_Image denoised_YCoCg(noisy.width(), noisy.height(), noisy.channels());
denoised_YCoCg.device_malloc(halide_cuda_device_interface());

// ...

// (5) Work with the denoised_YCoCg buffer...
denoised_YCoCg.set_device_dirty(true); // Tell the parent buffer that it actually changed!

mcourteaux commented:

Summary of how things currently work:

  • AllocationHeader is used to ref-count Halide-owned buffers on the host. If the buffer is not Halide-owned, it isn't used.
  • DeviceRefCount is used to ref-count Halide-owned buffers on the device. If there is no device-side memory for a buffer, it is not used (nullptr).
  • There is a DeviceRefCountCropped for the case where the buffer is cropped. This struct keeps track of the original buffer this one is a crop of. (A simplified paraphrase of these pieces follows below.)
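For reference, a simplified paraphrase of those pieces of HalideBuffer.h (fields abbreviated; see the header for the real definitions):

#include <atomic>

// Header prepended to host allocations that Buffer owns and ref-counts.
struct AllocationHeader {
    void (*deallocate_fn)(void *);
    std::atomic<int> ref_count;
};

enum struct BufferDeviceOwnership : int {
    Allocated,               // halide_device_free is called when the ref count drops to zero
    WrappedNative,           // only halide_device_detach_native is called
    Unmanaged,               // no free routine is called
    AllocatedDeviceAndHost,  // halide_device_and_host_free is called
    Cropped,                 // the device ref of the cropped-from buffer is released
};

// Ref count for a Halide-owned device allocation; the pointer to this is
// nullptr when the buffer has no device-side memory.
struct DeviceRefCount {
    std::atomic<int> count{1};
    BufferDeviceOwnership ownership{BufferDeviceOwnership::Allocated};
};

// Used in place of DeviceRefCount when the buffer is a device-side crop:
// it additionally holds a reference to the buffer this one was cropped from.
template<typename BufferType>
struct DeviceRefCountCropped : DeviceRefCount {
    BufferType cropped_from;
    explicit DeviceRefCountCropped(const BufferType &b)
        : cropped_from(b) {
        ownership = BufferDeviceOwnership::Cropped;
    }
};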

The issue arises because:

  • When there is no device-side buffer at the time a crop is made, the crop_from() function, which would have kept track of the original buffer, is not called. As a result, the new cropped Buffer object doesn't know which other Buffer it is a crop of.

So, at the very least, we should always keep track of which other Buffer a Buffer is a crop of, even when there is no device-side memory yet.
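Concretely, here is the crop path in HalideBuffer.h (abridged) where that link gets lost:

Buffer<T, Dims, InClassDimStorage> cropped(int d, int min, int extent) const {
    // Make a fresh copy of the underlying buffer (but not of the allocation).
    Buffer<T, Dims, InClassDimStorage> im = *this;
    im.device_deallocate();
    im.crop_host(d, min, extent);
    if (buf.device_interface != nullptr) {
        // Only taken when the parent already has a device allocation. This
        // ends up calling im.crop_from(*this), which installs the
        // DeviceRefCountCropped that remembers the parent buffer.
        complete_device_crop(im);
    }
    // Otherwise the new crop keeps no record of the buffer it came from.
    return im;
}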

Additional issue:

  • The device_malloc() function doesn't take into account that a halide_buffer_t might be a crop/slice.
  • Moreover, halide_buffer_t currently has no fields pointing to any crop-related information. Yet pipelines calling out to device_malloc() act on a halide_buffer_t, not on a Halide::Runtime::Buffer<>, so right now there is no way to even know from within device_malloc() that the buffer is a crop (see the struct definition below).
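For reference, the raw descriptor from HalideRuntime.h that a compiled pipeline hands to device_malloc(); note that nothing in it links a crop back to its parent:

typedef struct halide_buffer_t {
    uint64_t device;                  // opaque device handle
    const struct halide_device_interface_t *device_interface;
    uint8_t *host;                    // host pointer
    uint64_t flags;                   // host-dirty / device-dirty bits
    struct halide_type_t type;
    int32_t dimensions;
    struct halide_dimension_t *dim;   // min/extent/stride per dimension
    void *padding;
} halide_buffer_t;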

mcourteaux commented:

Conclusion from the dev-meeting:

Either we:

  • document that the functionality presented by slice/crop in the C++ Halide::Runtime::Buffer is only consistent for CPU-side buffers, and has unintuitive behavior when the parent buffer does not have a device allocation yet; or
  • create a virtual halide_device_interface_t that serves as a proxy interface and takes care of actually allocating the parent buffer. Such a halide_device_interface_t could be provided on demand by the C++ wrapper Halide::Runtime::Buffer. Cropping and slicing a Halide::Runtime::Buffer would then let us explicitly specify whether the resulting cropped buffer gets this custom virtual device interface (and thus correctly behaves as a device-side crop) or not (and therefore behaves as just a temporary device-side allocation of a subregion, with the intent of copying it back to the host once it's done). I think a good term for this feature would be Device-side Aliasing (footnote 1). This will somehow also require reworking where the dirty bits live. A hypothetical sketch follows below.
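A hypothetical sketch of that proxy idea (every name here is invented for illustration; no such API exists in Halide today). The proxy's device_malloc routes the real allocation to the parent and then aliases the crop into it via the backend's device_crop:

#include "HalideRuntime.h"

// Hypothetical proxy; NOT an existing Halide API.
struct AliasingProxy {
    halide_device_interface_t itf;              // must be first: we downcast from itf*
    halide_buffer_t *parent;                    // the buffer crops should alias into
    const halide_device_interface_t *backend;   // the real interface (e.g. CUDA)
};

static int proxy_device_malloc(void *user_context, halide_buffer_t *crop,
                               const halide_device_interface_t *itf) {
    const AliasingProxy *self = reinterpret_cast<const AliasingProxy *>(itf);
    // 1. Make sure the *parent* owns the real device allocation.
    int err = self->backend->device_malloc(user_context, self->parent, self->backend);
    if (err != halide_error_code_success) {
        return err;
    }
    // 2. Alias the crop into the parent's allocation instead of giving it
    //    a detached allocation of its own.
    return self->backend->device_crop(user_context, self->parent, crop);
}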

Either way, we need to figure out why we don't see the error that should be reported when a crop goes out of scope while its device allocation is still dirty.

(1) This raises the question: how would you make clear that, if you specify that you do NOT want device-side aliasing but a device allocation already exists, the resulting crop is still going to alias the parent device buffer?

mcourteaux commented Aug 30, 2024

@zvookin Managing the dirty-bits is something we haven't discussed yet. I think the starting point would be to modify set_device_dirty():

HALIDE_ALWAYS_INLINE void set_host_dirty(bool v = true) {
    set_flag(halide_buffer_flag_host_dirty, v);
}
HALIDE_ALWAYS_INLINE void set_device_dirty(bool v = true) {
    set_flag(halide_buffer_flag_device_dirty, v);
}

They would somehow need to propagate this dirty bit to the parent buffer. However, the link to the parent buffer (we established) was going to go through a virtual device interface, and that interface has no functions related to dirty bits, so I think we might be stuck again with this approach...
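For illustration only, a sketch of what propagation could look like if the wrapper kept a usable parent link (the parent member is hypothetical; neither halide_buffer_t nor the wrapper provides it today):

HALIDE_ALWAYS_INLINE void set_device_dirty(bool v = true) {
    set_flag(halide_buffer_flag_device_dirty, v);
    // Hypothetical: walk up the crop chain so that copy_to_host() on the
    // parent doesn't no-op after a crop was written on the device.
    if (v && parent != nullptr) {
        parent->set_device_dirty(true);
    }
}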
