
Inject profiling for function calls to 'halide_copy_to_host' and 'halide_copy_to_device'. #7913

Merged · 27 commits · Dec 12, 2023

Conversation

@mcourteaux (Contributor)

This is the first time I'm writing code to manipulate the IR tree, and I'm not quite sure this is the right way to do it: I actually want to wrap the Call op itself, but decorating that with additional Stmts doesn't seem possible, so I'm manipulating the surrounding LetStmt, as generated by

Stmt call_extern_and_assert(const string &name, const vector<Expr> &args) {
    Expr call = Call::make(Int(32), name, args, Call::Extern);
    string call_result_name = unique_name(name + "_result");
    Expr call_result_var = Variable::make(Int(32), call_result_name);
    return LetStmt::make(call_result_name, call,
                         AssertStmt::make(EQ::make(call_result_var, 0), call_result_var));
}

Unlike the malloc and free function IDs, I opted to fetch these lazily when the function call is actually encountered, to avoid adding these lines to the profile report for CPU-only functions.
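
To make the wrapping idea concrete, here is a minimal sketch (schematic, not the PR's exact code): profiler_state, profiler_token, copy_idx, and enclosing_idx are placeholder expressions standing in for values the profiling pass already tracks, and halide_profiler_set_current_func is the runtime hook the profiler lowering already emits.

// Schematic: bill the extern copy call to its own profiler slot, then
// restore the enclosing Func's slot afterwards.
Stmt wrap_copy_call(const LetStmt *op, Expr profiler_state, Expr profiler_token,
                    Expr copy_idx, Expr enclosing_idx) {
    Expr set_copy = Call::make(Int(32), "halide_profiler_set_current_func",
                               {profiler_state, profiler_token, copy_idx},
                               Call::Extern);
    Expr set_back = Call::make(Int(32), "halide_profiler_set_current_func",
                               {profiler_state, profiler_token, enclosing_idx},
                               Call::Extern);
    // set slot; let result = copy(...) in assert(result == 0); restore slot
    return Block::make({Evaluate::make(set_copy),
                        LetStmt::make(op->name, op->value, op->body),
                        Evaluate::make(set_back)});
}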


Example of the impact on my pipeline, where I had a bug: one Func was accidentally scheduled on the CPU and required some copies back and forth, which I didn't notice. Before this PR:

neonraw_multiscale_denoiser_f12_Gk3_Nk7
 total time: 294.014740 ms  samples: 278  runs: 3  time/run: 98.004913 ms
 average threads used: 0.377698
 heap allocations: 9  peak heap usage: 73208804 bytes
  halide_malloc:                                        0.000ms (0%)    threads: 0.000 
  halide_free:                                          0.000ms (0%)    threads: 0.000 
  luminance_guide:                                      2.116ms (2%)    threads: 1.000  peak: 24421764 num: 3         avg: 10699254
  blur_pass_h:                                         10.917ms (11%)   threads: 1.000  peak: 24405596 num: 3         avg: 10689806
  blur_pass_v:                                         10.201ms (10%)   threads: 1.000  peak: 24381444 num: 3         avg: 10675708
  filter_input_guide:                                   1.056ms (1%)    threads: 1.000 
  blur_x2:                                              3.527ms (3%)    threads: 1.000 
  luminance_noisy:                                      6.703ms (6%)    threads: 1.000 
  filter_input_noisy:                                   0.353ms (0%)    threads: 1.000 
  denoised:                                            63.129ms (64%)   threads: 0.033 
  allocgroup__denoise_conv_guide_global_wrapper$0:      0.000ms (0%)    threads: 0.000  stack: 4

After this PR:

neonraw_multiscale_denoiser_f12_Gk3_Nk7
 total time: 304.260193 ms  samples: 287  runs: 3  time/run: 101.420067 ms
 average threads used: 0.365854
 heap allocations: 9  peak heap usage: 73208804 bytes
  halide_malloc:                                        0.000ms (0%)    threads: 0.000 
  halide_free:                                          0.000ms (0%)    threads: 0.000 
  copy_to_host:                                        14.468ms (14%)   threads: 1.000 
  copy_to_device:                                       9.906ms (9%)    threads: 1.000 
  luminance_guide:                                      1.058ms (1%)    threads: 0.333  peak: 24421764 num: 3         avg: 10699254
  blur_pass_h:                                          3.883ms (3%)    threads: 1.000  peak: 24405596 num: 3         avg: 10689806
  blur_pass_v:                                          4.589ms (4%)    threads: 1.000  peak: 24381444 num: 3         avg: 10675708
  filter_input_guide:                                   0.352ms (0%)    threads: 0.000 
  blur_x2:                                              2.481ms (2%)    threads: 1.000 
  luminance_noisy:                                      0.353ms (0%)    threads: 0.000 
  filter_input_noisy:                                   0.705ms (0%)    threads: 0.000 
  denoised:                                            63.620ms (62%)   threads: 0.022 
  allocgroup__denoise_conv_guide_global_wrapper$0:      0.000ms (0%)    threads: 0.000  stack: 4

As you can see, the attributed time moved from blur_pass_h and blur_pass_v to the copy-to-host and copy-to-device entries.


Discussion:

  • In the end, one could argue that the device copies are part of the producer/consumer in which they sit, but if malloc and free get their own timer slot, I think these also deserve a separate slot to profile their time.
  • Currently, restoring the profiled "current_func" to the enclosing consumer/producer might be redundant if multiple copies happen sequentially. You'd get:
    set_current_func(blur_pass_v); // <-- immediately overwritten as well.
    set_current_func(copy_to_host);
    copy_to_host();
    set_current_func(blur_pass_v); // <-- redundant!
    set_current_func(copy_to_host);
    copy_to_host();
    set_current_func(blur_pass_v);
    ...
    // actual code for blur_pass_v
    ...
    Not optimal, but not worth the effort to try to optimize this debugging feature any further, IMO.

@abadams (Member) commented Oct 23, 2023

There's a problem with using the profiler with GPU pipelines, and I worry that this PR makes it slightly worse:

Some of the GPU APIs are asynchronous, so the time taken to do the compute on the GPU doesn't immediately show up; instead it manifests when you first copy the data back from the device. If that's billed to the Func we're copying back, then that's fine, I guess. This doesn't always happen though. E.g., consider a pipeline A -> B -> C, where A and B are on the GPU, and C is on the CPU. A and B are launched asynchronously, and the profiler shows them taking no time. The actual time is in the copy-back of B, so you may falsely think that A is very cheap even if it's the real culprit.

With this PR, this will still happen, but the cost of the compute of A and B is hidden in a single global copy_to_host bucket, which may be added together with many other copies to host. In long pipelines this may make it hard to pinpoint where the time is actually spent. For profiling Halide CUDA code I usually just give up and use NVIDIA tools to get a full CPU and GPU timeline.

But thinking out loud, what if there were separate profiling IDs per Func for copy_to_host/device? So assuming that blur_pass_v is incorrectly on the CPU, and blur_pass_h comes before it in the pipeline, the profile could look like e.g.:

neonraw_multiscale_denoiser_f12_Gk3_Nk7
 total time: 294.014740 ms  samples: 278  runs: 3  time/run: 98.004913 ms
 average threads used: 0.377698
 heap allocations: 9  peak heap usage: 73208804 bytes
  halide_malloc:                                        0.000ms (0%)    threads: 0.000 
  halide_free:                                          0.000ms (0%)    threads: 0.000 
  luminance_guide:                                      2.116ms (2%)    threads: 1.000  peak: 24421764 num: 3         avg: 10699254
  blur_pass_h:                                         10.917ms (11%)   threads: 1.000  peak: 24405596 num: 3         avg: 10689806
  blur_pass_h (copy_to_host): ...
  blur_pass_v:                                         10.201ms (10%)   threads: 1.000  peak: 24381444 num: 3         avg: 10675708
  blur_pass_v (copy_to_device): ...
  filter_input_guide:                                   1.056ms (1%)    threads: 1.000 
  input (copy_to_device): ...
  blur_x2:                                              3.527ms (3%)    threads: 1.000 
  luminance_noisy:                                      6.703ms (6%)    threads: 1.000 
  filter_input_noisy:                                   0.353ms (0%)    threads: 1.000 
  denoised:                                            63.129ms (64%)   threads: 0.033 
  allocgroup__denoise_conv_guide_global_wrapper$0:      0.000ms (0%)    threads: 0.000  stack: 4

Then looking at the profile you could immediately see that a lot of time is being spent copying a specific Func from one place to another.

@mcourteaux (Contributor, Author)

I was thinking about the same problem, but I think that when the debug option is enabled in the CUDA runtime, every GPU operation is synchronized back with the CPU, so you get representative timings. So using the combination of profile and debug gives interpretable results, I think.

This is in the CUDA kernel run function:

Halide/src/runtime/cuda.cpp, lines 1208 to 1215 in d023065:

#ifdef DEBUG_RUNTIME
    err = cuCtxSynchronize();
    if (err != CUDA_SUCCESS) {
        return error_cuda(user_context, err, "cuCtxSynchronize failed");
    }
    uint64_t t_after = halide_current_time_ns(user_context);
    debug(user_context) << "    Time: " << (t_after - t_before) / 1.0e6 << " ms\n";
#endif

It seems that the device copy functions are not explicitly synchronized by means of a cuCtxSynchronize(), but maybe they implicitly are?

I agree that using the NVIDIA profiling tools such as Nsight and nsys is preferable, but the Halide built-in profile report is still a quicker way to iterate.

Overall, I think the timings are already misleading, even for the normal realizations of Funcs, if you do not have the debug feature enabled. This PR improves the profiling report when the report in its entirety is reliable, and just adds more noise to the report if the report is unreliable due to asynchronous execution.

Q: This makes me think that maybe we should have a profile option alongside a synced_profile option, or something like that. This would serve as an alternative to the debug option force-synchronizing the GPU. I think always injecting synchronization for profile-enabled pipelines would be bad, as you might want to just measure run times without artificially slowing everything down with synchronizations. The debug option right now makes timings reliable (I think), but spams a lot of stuff to stdout, which you are not always interested in.
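
For reference, the combination being discussed is just two existing target features; a minimal JIT sketch (my_output, width, and height are placeholders for your own pipeline):

#include "Halide.h"
using namespace Halide;

// Profile + Debug: the debug runtime synchronizes after GPU operations,
// which makes the profiler's samples interpretable.
Target t = get_jit_target_from_environment()
               .with_feature(Target::Profile)
               .with_feature(Target::Debug);
// my_output.realize({width, height}, t);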


But thinking out loud, what if there were separate profiling IDs per Func for copy_to_host/device? So assuming that blur_pass_v is incorrectly on the CPU, and blur_pass_h comes before it in the pipeline, the profile could look like e.g.:

This is also fine with me, although probably harder to put into place with the IRMutator. To me, your proposal sounds like a more advanced/detailed version of what I proposed. Definitely useful to know how big the buffers are. Although this level of detail might be more appropriate for nsys.

@mcourteaux (Contributor, Author)

but the cost of the compute of A and B is hidden in a single global copy_to_host bucket, which may be added together with many other copies to host.

I see what you mean here. Right now, you at least still get an upper bound of the range of where the culprit is. As in: if C is reported as expensive due to the copy-to-host waiting for the GPU to complete A and B, you now know that it is A, B, or C that is expensive: "C is the furthest stage of the pipeline where the bottleneck can occur." In this case, I'd argue again that the Halide built-in performance report is simply not to be trusted without the debug flag, due to the asynchronous nature of the compute API. You could even argue the opposite of what you sketched out: now the user mistakenly might believe that C is the culprit, whereas A is the culprit. This PR will make them scratch their head and think "why does copying 1MB of data from GPU to CPU take 200 milliseconds?" and then quickly realize it's basically a synchronization that is being timed.

@abadams (Member) commented Oct 24, 2023

-debug is pretty aggressive because it enables a bunch of validation code in the runtime, but maybe when -profile is on we should inject some device_sync calls. I'll flag profiling GPU pipelines as an issue to discuss at the next dev meeting.

@abadams added the dev_meeting label (topic to be discussed at the next dev meeting) Oct 24, 2023
@mcourteaux (Contributor, Author)

The obvious caveat here is that the synchronization is a CUDA runtime feature, whereas the profile feature is a pipeline one, currently.

@abadams (Member) commented Oct 24, 2023

We have the halide_device_sync runtime function for all gpu targets, so if we inject calls to that in the Pipeline it shouldn't be cuda-specific and it shouldn't require a different runtime.
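
For context, that runtime entry point is declared in HalideRuntime.h as:

// Wait for current GPU operations on the buffer's device to complete.
extern int halide_device_sync(void *user_context, struct halide_buffer_t *buf);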

@mcourteaux (Contributor, Author)

That sounds like a sane solution to avoid depending on the debug runtime feature. If you want, I can implement it as you suggest and make a PR.

@antonysigma (Contributor) commented Oct 26, 2023

In long pipelines this may make it hard to pinpoint where the time is actually spent. For profiling Halide CUDA code I usually just give up and use NVIDIA tools to get a full CPU and GPU timeline.

Drawing from the CUDA (not Halide) optimization experience, I often find "tracing" tools more effective than "profiling" tools.

My use cases often require repeating the same iteration N times, similar to gradient descent. To optimize the end-to-end compute time, I take advantage of system pipelining to hide the kernel compute time in the data transfer time. Even better, I utilize the duplex host-to-device and device-to-host data bus to further hide the two-way data transfer time.

What I mean by "tracing": I export the nvprof report (or nsys in the latest CUDA toolchain) to the Google Trace Format via a script. Then, visualize the Gantt-chart style multi-track time traces on the chrome://tracing online app. Finally, identify the critical path of the entire algorithm run.

Profiling is still helpful for fine-grained optimization of one specific GPU kernel; I use it only when my algorithm is compute bound.

@mcourteaux (Contributor, Author)

In long pipelines this may make it hard to pinpoint where the time is actually spent. For profiling Halide CUDA code I usually just give up and use NVIDIA tools to get a full CPU and GPU timeline.

Drawing from the CUDA (not Halide) optimization experience, I often find "tracing" tools more effective than "profiling" tools.

That is what he meant by the "full cpu and gpu timeline".

What I mean by "tracing": I export the nvprof report (or nsys in the latest CUDA toolchain) to the Google Trace Format via a script. Then, visualize the Gantt-chart style multi-track time traces on the chrome://tracing online app. Finally, identify the critical path of the entire algorithm run.

Why not use the NVIDIA GUI for this, nsys-ui? I would assume you get much more information with a similar interface.

[screenshot: nsys-ui]

@mcourteaux (Contributor, Author)

Conclusion: I'll add slots based on the buffer names, and inject synchronization calls after CUDA kernel launches. @abadams Are we gonna ignore the double synchronization if we add both debug and profile flags?

@abadams (Member) commented Oct 30, 2023

If both -profile and -debug are on, I guess we could skip the synchronization introduced by -profile, but I think it's probably cleaner to leave them decoupled and just do double synchronization in that case.

@mcourteaux (Contributor, Author)

Implementing this brings up the following obstacle: some APIs (i.e., D3D12 and Metal) strictly require a buffer to be passed to halide_device_sync() to work. When rewriting the kernel launches for the wide variety of APIs, it's unclear to me which buffer would be the most relevant one to sync on. You'd need to know which buffer is being produced (by means of an Output<> in a generator, I guess), but that information is long gone at that point during lowering, I think. I don't know which buffer to take. Perhaps there is a way to deduce it by walking up the IR tree, searching for a ProducerConsumer node, but that seems hacky in general.

Should we instead require this API function to support synchronizing without any buffer being passed? That would mostly mean fixing the D3D12 implementation, and perhaps the Metal one (which won't crash, but might not properly sync right now).

Alternatively, we don't exclude the APIs that do require the buffer pointer and pass a null buffer anyway (so they may not actually sync), which is suboptimal, for sure.

} else if (false
           // clang-format off
           || call->name == "halide_cuda_run"
           || call->name == "halide_d3d12_run"
           || call->name == "halide_metal_run"
           || call->name == "halide_vulkan_run"
           || call->name == "halide_hexagon_run"
           || call->name == "halide_opencl_run"
           || call->name == "halide_openglcompute_run"
           || call->name == "halide_webgpu_run"
           // clang-format on
) {
    // These APIs are typically asynchronous, so the CPU only spends time
    // launching the operation, which is typically very short. When profiling
    // with the built-in Halide profiler, this behaviour produces misleading
    // reports: certain Funcs appear extremely cheap, whereas they actually
    // take quite some time and cause a stall later (typically when copying a
    // buffer back from the device to the host), which gets billed for all the
    // time of the asynchronous operations that came before the buffer copy.
    // To avoid this, we synchronize with the device after every kernel
    // launch, but only when profiling.
    //
    // In most APIs, the buffer to synchronize on is not used.
    // Exceptions:
    //  - d3d12: peel_buffer will assert if no buffer is passed, although wait_until_idle() exists.
    //  - metal: seems to take one, but has an if (buffer != nullptr) check. Not sure if any synchronization will happen.
    //  - hexagon_dma: currently does not implement anything, so it won't sync.
    const Expr buffer_ptr = make_zero(type_of<void *>());
    Expr sync_result = Variable::make(Int(32), "device_sync_result");
    return LetStmt::make(op->name, op->value,
                         LetStmt::make("device_sync_result",
                                       Call::make(Int(32), "halide_device_sync", {buffer_ptr}, Call::Extern),
                                       Block::make(AssertStmt::make(sync_result == 0, sync_result),
                                                   op->body)));
}

What I don't like about this is that the assert I inject here comes after the assert that checks the result value of the halide_xxx_run():
[screenshot: generated Stmt, where the injected device-sync assert appears after the halide_xxx_run result assert]

To fix this, I'd have to pattern match the Assert inside the Block inside the LetStmt->body. This becomes increasingly ugly...

@abadams (Member) commented Oct 31, 2023

Metal device sync should work (i.e. do a global sync) with a null buffer. Tagging @shoaibkamil to take a look and verify. Let's change the d3d device_sync to do the same. Looks like it's just a matter of adding an if (buffer == nullptr) case

I think you're going to have to walk inside the assert in the let stmt body. We don't want launch failures to show up as sync errors. There should always be an assert immediately inside the let.
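
A minimal sketch of that walk, assuming the LetStmt body has the shape produced by call_extern_and_assert above (hypothetical helper, not the PR's exact code):

// Insert the sync check after the launch-result assert, so launch failures
// are still reported as launch failures rather than as sync errors.
Stmt insert_after_launch_assert(const Stmt &let_body, const Stmt &sync_check) {
    if (const Block *b = let_body.as<Block>()) {
        if (b->first.as<AssertStmt>()) {
            // assert(launch_result == 0); sync_check; rest
            return Block::make(b->first, Block::make(sync_check, b->rest));
        }
    } else if (let_body.as<AssertStmt>()) {
        return Block::make(let_body, sync_check);
    }
    return Block::make(sync_check, let_body);  // fallback; shouldn't happen
}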

@mcourteaux (Contributor, Author)

Okay, good to hear your thoughts align with mine. Will proceed as such.

@shoaibkamil (Contributor)

Metal device sync should work (i.e. do a global sync) with a null buffer. Tagging @shoaibkamil to take a look and verify.

Yes, that's correct-- you can pass a nullptr into the function and it will work; the optional buffer argument is there for syncing on a specific output buffer.

@mcourteaux (Contributor, Author)

Another problem: not passing a buffer makes it impossible to figure out the device_interface, here on L221:

WEAK int halide_device_sync(void *user_context, struct halide_buffer_t *buf) {
    auto result = debug_log_and_validate_buf(user_context, buf, "halide_device_sync");
    if (result) {
        return result;
    }
    const halide_device_interface_t *device_interface = buf->device_interface;
    if (device_interface == nullptr) {
        return halide_error_no_device_interface(user_context);
    }
    result = device_interface->impl->device_sync(user_context, buf);

Not sure if there is a correct way to get the device interface if there is no buffer...

@abadams (Member) commented Oct 31, 2023

Yuck, good point. I can't think of a simple way around that.

You can acquire a device interface Expr via make_device_interface_call in DeviceInterface.h, and you could just use the DeviceAPI of the GPUBlocks loop you're presumably just outside of (assuming this is happening in lowering before offloading gpu loops). But that doesn't give you a buffer, and halide_device_sync wants a buffer. I guess we'd have to add a new runtime function, e.g. halide_device_sync_global(void *user_context, struct halide_device_interface_t *). It could be implemented in src/runtime/device_interface.cpp using device_interface->impl->device_sync(user_context, nullptr);
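
A sketch of that proposal (roughly the shape that was later merged into src/runtime/device_interface.cpp; details may differ):

// Sync with a device interface globally, without naming a specific buffer.
WEAK int halide_device_sync_global(void *user_context,
                                   const struct halide_device_interface_t *device_interface) {
    return device_interface->impl->device_sync(user_context, nullptr);
}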

@mcourteaux marked this pull request as draft November 2, 2023 23:48
@mcourteaux (Contributor, Author) commented Nov 3, 2023

@abadams I implemented the changes as suggested and it works.

The only (minor) downside now is that all of those buffer copies show up in the report, even if they were very small or not dirty (and hence not actually copied). I get very long reports now, like this:

neonraw_multiscale_denoiser_f12_Gk3_Nk7
 total time: 157.598099 ms  samples: 149  runs: 3  time/run: 52.532700 ms
 average threads used: 0.187919
 heap allocations: 0  peak heap usage: 0 bytes
  halide_malloc:                               0.000ms (0%)    threads: 0.000
  halide_free:                                 0.000ms (0%)    threads: 0.000
  denoise_conv_guide_global_wrapper$0:         1.060ms (2%)    threads: 0.333
  copy_to_device__blurring_filters_guide:      0.000ms (0%)    threads: 0.000
  copy_to_device__guide:                       2.120ms (4%)    threads: 1.000
  luminance_guide:                             0.353ms (0%)    threads: 0.000
  copy_to_device__luminance_input_guide:       0.000ms (0%)    threads: 0.000
  copy_to_device__luminance_weights:           0.000ms (0%)    threads: 0.000
  blur_pass_h_global_wrapper$0:                0.000ms (0%)    threads: 0.000
  blur_pass_v_global_wrapper$0:                0.000ms (0%)    threads: 0.000
  filter_input_guide:                          0.000ms (0%)    threads: 0.000
  blur_x2:                                     0.705ms (1%)    threads: 0.000
  luminance_noisy:                             0.705ms (1%)    threads: 0.000
  copy_to_device__luminance_input_noisy:       6.356ms (12%)   threads: 1.000
  filter_input_noisy:                          0.000ms (0%)    threads: 0.000
  filter_response_noisy_global_wrapper$0:      9.181ms (17%)   threads: 0.076
  copy_to_device__masking_filters_noisy:       0.000ms (0%)    threads: 0.000
  denoised:                                   31.695ms (60%)   threads: 0.000
  copy_to_device__blurring_filters_noisy:      0.000ms (0%)    threads: 0.000
  copy_to_device__denoised:                    0.000ms (0%)    threads: 0.000
  copy_to_device__masking_bias:                0.000ms (0%)    threads: 0.000
  copy_to_device__masking_filters_guide:       0.352ms (0%)    threads: 1.000
  copy_to_device__noisy:                       0.000ms (0%)    threads: 0.000
  filter_response_guide:                       0.000ms (0%)    threads: 0.000  stack: 4
neonraw_bilinear_resizer_u16

Also wondering if we should pick names that are visually easier to filter. Perhaps D2H__ and H2D__, or something that starts with _?

Also, buffer copies can probably still be asynchronous, depending on how the backend implements them, but I'm a bit hesitant to make them synchronous. Copying data is typically the least of one's concerns when working on a pipeline schedule. Any thoughts?

@mcourteaux marked this pull request as ready for review November 3, 2023 00:04
@mcourteaux marked this pull request as draft November 3, 2023 16:05
@mcourteaux marked this pull request as ready for review November 3, 2023 17:30
@mcourteaux (Contributor, Author)

@abadams Ready for review. The IR looks like this (green lines manually drawn by me to indicate the blocks that profile a buffer copy and, at the end, the synchronization of a kernel launch):

[screenshot: generated IR with the buffer-copy profiling blocks and kernel-launch synchronization marked]

I tried to clean up the redundant calls to halide_profiler_set_current_func with an IRMutator, but I couldn't correctly pattern-match the structure of the IR. Not being able to inspect the parent node made it too tricky for me to find the next call to halide_profiler_set_current_func in the IR tree. My logic was to skip inserting the restoring halide_profiler_set_current_func when another one immediately follows that would overwrite the profiler token again. Because the IR is in "non-canonical" nesting order with respect to the Blocks, I couldn't figure it out. I looked through the code base for a way to IRMutate a tree into canonical form but couldn't find one.

In the end, I don't think I care that much about the redundant calls to halide_profiler_set_current_func.

@mcourteaux force-pushed the profile-device-copies branch from 0136c0a to be143ad November 3, 2023 17:45
@abadams (Member) commented Nov 9, 2023

Should we even be printing profiler entries for things with zero samples? Or maybe we should for some categories of things but not for others.

For names, I don't think we're restricted to valid C identifiers. What do you think about something like:

my_func
my_func (copy to host)
my_func (copy to device)

@abadams (Member) commented Nov 10, 2023

Conclusion from dev meeting:

We should be printing things with zero samples, because you want to know that Halide measured the thing. Also skipping them makes side-by-side comparisons of profiles annoying. If you don't like it, you can always pipe the result into grep -v

@abadams removed the dev_meeting label Nov 10, 2023
@mcourteaux (Contributor, Author)

Agreed. I'll get back to this soon. I'm thinking of maybe also sorting the buffer-copy lines together in the report. Currently working on a two-week side project for fun. 😄

@mcourteaux force-pushed the profile-device-copies branch 2 times, most recently from 8ce0fa0 to 6167639 November 30, 2023 12:40
@mcourteaux force-pushed the profile-device-copies branch from 6167639 to 7f6478f November 30, 2023 12:54
@mcourteaux (Contributor, Author) commented Nov 30, 2023

Okay, thanks @steven-johnson for pointing this out! I was unaware I had introduced some issues. I tried, but I can't fix them just by looking at the test logs and guessing what's wrong with the code. I'll delve into this a bit deeper later today/tomorrow and actually run the tests locally myself.

@mcourteaux (Contributor, Author)

@steven-johnson @abadams Okay, I investigated this a little bit and managed to run the tests locally myself, and they do work. So I'm clueless about what might be going wrong. I see segfaults and bus errors, so this is clearly a serious bug, but I haven't been able to reproduce it. Any tips for analyzing this? I tested with host-cuda as the argument to CMake's -DHalide_TARGET, mimicking what the build bots do.

@steven-johnson (Contributor)

@steven-johnson @abadams Okay, I investigated this a little bit and managed to run the tests locally myself, and they do work. So I'm clueless about what might be going wrong. I see segfaults and bus errors

On host-metal, the crash happens inside halide_device_sync_global(), because device_interface->impl is null, which, kablooey. Why this is the case isn't clear, debugging.

@steven-johnson (Contributor)

Aha:

       Expr device_interface = call->args.back();  // The last argument to the copy calls is the device_interface.

This is true for copy_to_device, but not for copy_to_host, which takes only (user_context, buf), so you are treating a buffer like a device_interface.
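
The mismatch is visible in the two declarations in HalideRuntime.h:

// No device_interface argument: it comes from the buffer itself.
extern int halide_copy_to_host(void *user_context, struct halide_buffer_t *buf);
// The device_interface is the last argument.
extern int halide_copy_to_device(void *user_context, struct halide_buffer_t *buf,
                                 const struct halide_device_interface_t *device_interface);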

@steven-johnson (Contributor)

I took the liberty of pushing a simple fix; now generator_aot_memory_profiler_mandelbrot doesn't crash, but fails with Error: Buffer has a non-zero device but no device interface. -- I presume the issue here is that we unconditionally call the injected halide_device_sync_global() on a buffer that has a null device_interface, and halide_device_sync_global() calls halide_error_no_device_interface() for that case (which isn't quite right, since the buffer has neither device nor device interface).

@steven-johnson (Contributor)

OK, I pushed a followup fix that changes the semantics of calling halide_device_sync_global() with a null device_interface -- instead of being an error, it just returns success (because there is no device to sync with). LMK if this matches your thoughts.
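
In code, the semantics change amounts to something like this (sketch, building on the halide_device_sync_global shape sketched earlier):

WEAK int halide_device_sync_global(void *user_context,
                                   const struct halide_device_interface_t *device_interface) {
    if (device_interface == nullptr) {
        // No device to sync with: treat as success rather than calling
        // halide_error_no_device_interface().
        return halide_error_code_success;
    }
    return device_interface->impl->device_sync(user_context, nullptr);
}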

@mcourteaux (Contributor, Author) commented Dec 1, 2023

Oh wow thank you for taking a look!

I presume the issue here is that we unconditionally call the injected halide_device_sync_global() on a buffer that has a null device_interface

I'm not following. How can a copy_to_host or copy_to_device get injected for a buffer that has no device_interface? So, just to recap: we only inject the halide_device_sync_global() in two cases:

  • buffer copies (in both directions)
  • after running kernels on the device

So the troublesome case is when we synchronize on a buffer copy. In which case can a buffer that gets copied (in either direction) not have a device_interface assigned to it? In other words, I don't think the second commit you added is the right solution. As far as my understanding goes, either the buffer copy is incorrectly injected, or the buffer is incorrectly missing the device_interface.

EDIT: One thing that especially confuses me is the fact that the copy_to_host() that got injected does work, which also relies on the device interface.

Additionally, I think we should add a test that makes sure we have a pipeline where all three types of synchronization are explicitly executed within the compiled pipeline. EDIT: I wrote such a test, and now I also get the error on my system.

@mcourteaux (Contributor, Author)

EDIT: One thing that especially confuses me is the fact that the copy_to_host() that got injected does work, which also relies on the device interface.

Aha, I think I see. Not all buffers have a device_interface. The corresponding copy_to_host that gets injected just gets the interface through functions like halide_cuda_device_interface().

Additionally, I believe that copy_to_host() actually is blocking, so we do not need an explicit sync with the GPU there. I'll just throw out that part.

Fixed the buffer copy bug. Added a new test that will cause buffer copies in two directions within the compiled pipeline. This will catch this better in the future. Tweaked the profile report section header printing.
@mcourteaux (Contributor, Author) commented Dec 1, 2023

I removed the explicit device sync for copy_to_host(), as that function is not asynchronous anyway.

I added a test that makes sure buffer copies are compiled into the pipeline and go in both directions. The pipeline is then run without profiling, and with the two types of profiling (Profile and ProfileByTimer).
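
A minimal sketch of such a test pipeline (illustrative; the merged test lives in test/correctness/device_buffer_copies_with_profile.cpp, and the Func names below mirror the report rather than the actual source):

#include "Halide.h"
using namespace Halide;

int main() {
    // Assumes HL_JIT_TARGET names a GPU API, e.g. host-cuda.
    Var x, y, xo, yo, xi, yi;
    Func f0("f0_on_cpu"), f1("f1_on_gpu"), f2("f2_on_cpu");

    f0(x, y) = x + y;         // CPU stage
    f1(x, y) = f0(x, y) * 2;  // GPU stage -> copy_to_device of f0
    f2(x, y) = f1(x, y) + 1;  // CPU stage -> copy_to_host of f1

    f0.compute_root();
    f1.compute_root().gpu_tile(x, y, xo, yo, xi, yi, 16, 16);
    f2.compute_root();

    // Run once without profiling, then with each profiler flavor.
    for (Target::Feature f : {Target::FeatureEnd, Target::Profile, Target::ProfileByTimer}) {
        Target t = get_jit_target_from_environment();
        if (f != Target::FeatureEnd) {
            t = t.with_feature(f);
        }
        f2.realize({2000, 2000}, t);
    }
    return 0;
}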

You can run this test:

❯ HL_JIT_TARGET=host-cuda ./test/correctness/correctness_device_buffer_copies_with_profile
Testing without profiler.
Testing thread based profiler.
f2_on_cpu_1
 total time: 0.000000 ms  samples: 0  runs: 1  time/run: 0.000000 ms
 heap allocations: 0  peak heap usage: 0 bytes
f2_on_cpu_1
 total time: 29.645176 ms  samples: 28  runs: 1  time/run: 29.645176 ms
 heap allocations: 2  peak heap usage: 32008000 bytes
  [funcs ::::::::::::::::::::::::::   17.980ms (60.6%) ::::]
    halide_malloc:                     0.000ms ( 0.0%)   
    halide_free:                       0.000ms ( 0.0%)   
    f0_on_cpu$1:                       7.407ms (24.9%)    peak: 16008000 num: 1         avg: 16008000
    f1_on_gpu$1:                       0.000ms ( 0.0%)    peak: 16000000 num: 1         avg: 16000000
    f2_on_cpu$1:                      10.572ms (35.6%)   
  [buffer copies to device ::::::::    2.140ms ( 7.2%) ::::]
    f0_on_cpu$1:                       2.140ms ( 7.2%)   
  [buffer copies to host ::::::::::    9.523ms (32.1%) ::::]
    f1_on_gpu$1:                       9.523ms (32.1%)   
    f2_on_cpu$1:                       0.000ms ( 0.0%)   
Testing timer based profiler.
f2_on_cpu_2
 total time: 0.000000 ms  samples: 0  runs: 1  time/run: 0.000000 ms
 heap allocations: 0  peak heap usage: 0 bytes
f2_on_cpu_2
 total time: 25.380686 ms  samples: 24  runs: 1  time/run: 25.380686 ms
 heap allocations: 2  peak heap usage: 32008000 bytes
  [funcs ::::::::::::::::::::::::::   14.796ms (58.2%) ::::]
    halide_malloc:                     0.000ms ( 0.0%)   
    halide_free:                       0.000ms ( 0.0%)   
    f0_on_cpu$2:                       8.454ms (33.3%)    peak: 16008000 num: 1         avg: 16008000
    f1_on_gpu$2:                       0.000ms ( 0.0%)    peak: 16000000 num: 1         avg: 16000000
    f2_on_cpu$2:                       6.341ms (24.9%)   
  [buffer copies to device ::::::::    2.114ms ( 8.3%) ::::]
    f0_on_cpu$2:                       2.114ms ( 8.3%)   
  [buffer copies to host ::::::::::    8.469ms (33.3%) ::::]
    f1_on_gpu$2:                       8.469ms (33.3%)   
    f2_on_cpu$2:                       0.000ms ( 0.0%)   
Success!

Which, rendered, looks like this for me:
[screenshot: rendered profile report with the three sections]
Which I think is a whole lot clearer than the previous version.

Thanks a bunch @steven-johnson for the initial analysis of the problem!

@steven-johnson (Contributor)

Which I think is a whole lot clearer than the previous version.

yes, beautiful and a wonderful improvement!

@mcourteaux (Contributor, Author) commented Dec 1, 2023

Do you know if:

168/461 Test  #48: correctness_boundary_conditions ............................Subprocess aborted***Exception:  46.31 sec
Error: CUDA error: CUDA_ERROR_MISALIGNED_ADDRESS cuMemcpyDtoH failed

is related to my work here?

https://buildbot.halide-lang.org/master/#/builders/68/builds/19/steps/23/logs/stdio


EDIT: Aha! It is also present in other PR build tests, such as #7967 : https://buildbot.halide-lang.org/master/#/builders/68/builds/20

@steven-johnson (Contributor)

Is this ready to land (pending green)?

@mcourteaux (Contributor, Author) commented Dec 8, 2023

@steven-johnson To me it's ready. Three remarks:

  • It's been a while since @abadams did his review, and quite a few things have changed since.
  • Feedback on the style of the section headers is welcome.
  • The global device syncs are just eyeball-implemented; I did not rigorously test them on all device APIs. Perhaps a test could validate this in a hand-wavy way, but I wouldn't know how, in particular. At least they don't seem to have broken anything.

@abadams (Member) commented Dec 8, 2023

Looks good to me!

@steven-johnson merged commit 3d5cf40 into halide:main Dec 12, 2023
19 checks passed
ardier pushed a commit to ardier/Halide-mutation that referenced this pull request Mar 3, 2024
Inject profiling for function calls to 'halide_copy_to_host' and 'halide_copy_to_device'. (halide#7913)

* Inject profiling for function calls to 'halide_copy_to_host' and 'halide_copy_to_device'.

* WIP: I get segfaults. The device_interface pointer is bogus.

* Figured it out...

* Allow global sync on d3d12.

* Cleanly time all buffer copies as well.

* Cleanup old comment.

* Following Andrews suggestion for suffixing buffer copies in the profiler.

* Sort the profiler report lines into three sections: funcs, buffer copy to device, and buffer copy to host.

* Attempt to fix output parsing.

* Fix crash for copy_to_device

* halide_device_sync_global(NULL) -> success

* Fixed the buffer copy bug. Added a new test that will cause buffer copies in two directions within the compiled pipeline. This will catch this better in the future. Tweaked the profile report section header printing.

* Clang-format, my dear friend...