Inject profiling for function calls to 'halide_copy_to_host' and 'halide_copy_to_device'. #7913
Conversation
There's a problem with using the profiler with GPU pipelines, and I worry that this PR makes it slightly worse: some of the GPU APIs are asynchronous, so the time taken to do the compute on the GPU doesn't show up immediately, but instead manifests when you first copy the data back from the device. If that's billed to the Func we're copying back, then that's fine, I guess. This doesn't always happen, though. E.g. consider a pipeline A -> B -> C, where A and B are on the GPU and C is on the CPU. A and B are launched asynchronously, and the profiler shows them taking no time. The actual time is in the copy-back of B, so you may falsely think that A is very cheap even if it's the real culprit.

With this PR, this will still happen, but the cost of the compute of A and B is hidden in a single global copy_to_host bucket, which may be added together with many other copies to host. In long pipelines this may make it hard to pinpoint where the time is actually spent. For profiling Halide CUDA code I usually just give up and use the NVIDIA tools to get a full cpu and gpu timeline.

But thinking out loud: what if there were separate profiling IDs per Func for copy_to_host/device? So assuming that blur_pass_v is incorrectly on the CPU, and blur_pass_h comes before it in the pipeline, the profile could look like e.g.:
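(Illustrative only; hypothetical names and made-up numbers, just to show the shape such a report could take:)

```
blur_pass_h:                     0.5ms   (2%)    <- GPU stage, launch cost only
blur_pass_h (copy to host):     18.7ms  (72%)    <- the async GPU work surfaces here
blur_pass_v:                     5.3ms  (20%)    <- the stage that is really on the CPU
blur_pass_v (copy to device):    1.4ms   (5%)
```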
Then looking at the profile you could immediately see that a lot of time is being spent copying a specific Func from one place to another.
I was thinking about the same problem, but I think that when the debug runtime is used, the kernel launch is already synchronized. This is in the CUDA kernel run function (lines 1208 to 1215 in d023065):
It seems that the device copy functions are not explicitly synchronized by means of a device-sync call. I agree that using the NVIDIA profiling tools is the more reliable way to get the full picture for now. Overall, I think the argument of the timings being misleading already holds for the normal realizations of Funcs as well, if you do not have the debug runtime enabled. Q: This makes me think that maybe we should have a way to inject device synchronization when profiling is enabled?
This is also fine with me, although probably harder to put into place with the IRMutator. To me, your proposal sounds like a more advanced/detailed version of what I proposed. Definitely useful to know how big the buffers are, although this level of detail might be more appropriate for a tracing tool.
I see what you mean here. Right now, you at least still get an upper bound on the range where the culprit is. As in: if C is reported as expensive due to the copy-to-host waiting for the GPU to complete A and B, you now know that it is A, B, or C that is expensive: "C is the furthest stage of the pipeline where the bottleneck can occur." In this case, I'd argue again that the Halide built-in performance report is simply not to be trusted without the debug runtime enabled.
-debug is pretty aggressive because it enables a bunch of validation code in the runtime, but maybe when -profile is on we should inject some device_sync calls. I'll flag profiling GPU pipelines as an issue to discuss at the next dev meeting.
The obvious caveat here is that the synchronization is a CUDA runtime feature, whereas the profiler should not be CUDA-specific.
We have the halide_device_sync runtime function for all GPU targets, so if we inject calls to that in the Pipeline it shouldn't be CUDA-specific and it shouldn't require a different runtime.
That sounds like a sane solution to not depend on the CUDA-specific debug-runtime behavior.
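For reference, the existing per-buffer entry point is declared in HalideRuntime.h roughly as follows (from memory; check the header for the authoritative signature). It dispatches through the buffer's device interface, which is why the buffer-less case discussed further down needs extra plumbing.

```cpp
// Synchronizes with the device associated with buf
// (declaration as remembered; see HalideRuntime.h).
extern "C" int halide_device_sync(void *user_context, struct halide_buffer_t *buf);
```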
Drawing from CUDA (not Halide) optimization experience, I often find "tracing" tools more effective than "profiling" tools. My use cases often require repeating the same iteration N times, similar to gradient descent. To optimize for the end-to-end compute time, I take advantage of pipelining to hide the kernel compute time in the data transfer time; even better, utilizing the duplex host-to-device and device-to-host data bus hides the two-way data transfer time as well.

What I mean by "tracing": I export the full timeline of kernel launches and memory transfers and inspect where things overlap (or fail to). Profiling is still helpful for fine-grained optimization of one specific GPU kernel; I use it only when my algorithm is compute bound.
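To make the pipelining idea concrete, here is a minimal, Halide-free sketch using the plain CUDA runtime API; the chunk count is arbitrary and the device compute is stubbed with cudaMemsetAsync where a real kernel launch would go.

```cpp
#include <cuda_runtime.h>

// Sketch: process `n` floats in chunks on two streams so that host-to-device
// copies, device work, and device-to-host copies can overlap.
void run_pipelined(const float *pinned_in, float *pinned_out, int n) {
    // For copies to actually overlap, the host buffers must be pinned
    // (e.g. allocated with cudaMallocHost / cudaHostAlloc).
    const int chunks = 4;          // arbitrary, for illustration
    const int chunk = n / chunks;  // assume n divides evenly here

    float *d_buf = nullptr;
    cudaMalloc((void **)&d_buf, n * sizeof(float));

    cudaStream_t streams[2];
    cudaStreamCreate(&streams[0]);
    cudaStreamCreate(&streams[1]);

    for (int c = 0; c < chunks; c++) {
        cudaStream_t s = streams[c % 2];
        const int off = c * chunk;
        // Upload this chunk.
        cudaMemcpyAsync(d_buf + off, pinned_in + off, chunk * sizeof(float),
                        cudaMemcpyHostToDevice, s);
        // Stand-in for the kernel that would process the chunk on stream s.
        cudaMemsetAsync(d_buf + off, 0, chunk * sizeof(float), s);
        // Download the result; chunks queued on the other stream overlap with this.
        cudaMemcpyAsync(pinned_out + off, d_buf + off, chunk * sizeof(float),
                        cudaMemcpyDeviceToHost, s);
    }
    cudaDeviceSynchronize();

    cudaStreamDestroy(streams[0]);
    cudaStreamDestroy(streams[1]);
    cudaFree(d_buf);
}
```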
That is what he meant with the "full cpu and gpu timeline".
Why not use the NVIDIA GUI for this?
Conclusion: I'll add slots based on the buffer names, and inject synchronization calls after CUDA kernel launches. @abadams Are we going to ignore the double synchronization if both the debug and profile flags are added?
If both -profile and -debug are on, I guess we could skip the synchronization introduced by -profile, but I think it's probably cleaner to leave them decoupled and just do double synchronization in that case.
Metal device sync should work (i.e. do a global sync) with a null buffer. Tagging @shoaibkamil to take a look and verify. Let's change the d3d device_sync to do the same; it looks like it's just a matter of adding a null-buffer check there.

I think you're going to have to walk inside the assert in the let stmt body. We don't want launch failures to show up as sync errors. There should always be an assert immediately inside the let.
Okay, good to hear your thoughts align with mine. Will proceed as such.
Yes, that's correct: you can pass a null buffer to Metal's device sync and it will do a global sync.
Another problem: not passing a buffer makes it hard to figure out the device_interface, here on L221: Halide/src/runtime/device_interface.cpp, lines 216 to 226 in 1865101
Not sure if there is a correct way to get the device interface if there is no buffer...
Yuck, good point. I can't think of a simple way around that. You can acquire a device interface Expr via make_device_interface_call in DeviceInterface.h, and you could just use the DeviceAPI of the GPUBlocks loop you're presumably just outside of (assuming this is happening in lowering before offloading GPU loops). But that doesn't give you a buffer, and halide_device_sync wants a buffer. I guess we'd have to add a new runtime function, e.g. halide_device_sync_global, that takes a device interface instead of a buffer.
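A rough sketch of what such a runtime function could look like, assuming the name halide_device_sync_global and a device-interface argument instead of a buffer; the actual implementation that ends up in the PR may differ in its error handling.

```cpp
#include "HalideRuntime.h"

extern "C" int halide_device_sync_global(void *user_context,
                                         const struct halide_device_interface_t *device_interface) {
    if (device_interface == nullptr) {
        // Nothing to synchronize with; treat it as a no-op (the thread later
        // settles on NULL -> success semantics).
        return halide_error_code_success;
    }
    // Passing a null buffer here means "synchronize the whole device".
    return device_interface->device_sync(user_context, nullptr);
}
```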
@abadams I implemented the changes as suggested and it works. The only (minor) downside now is that all of those buffer copies show up in the report, even if they were very small or not dirty (and hence not actually copied). I get very long reports now, like this:
Also wondering if we should pick names that are easier on the eyes to filter? Perhaps a dedicated suffix? Also, buffer copies can probably still be asynchronous, depending on how the backend implements them, but I'm a bit hesitant to make them synchronous: copying data is typically the least of one's concerns when working on a pipeline schedule. Any thoughts?
@abadams Ready for review. The IR looks like this (green lines manually drawn by me to indicate the blocks that profile a buffer copy, and at the end the synchronization of a kernel launch): I tried to clean up the redundant calls, but in the end I don't think I care that much about them.
Should we even be printing profiler entries for things with zero samples? Or maybe we should for some categories of things but not for others. For names, I don't think we're restricted to valid C identifiers. What do you think about something like:
Conclusion from dev meeting: we should be printing things with zero samples, because you want to know that Halide measured the thing. Also, skipping them makes side-by-side comparisons of profiles annoying. If you don't like it, you can always pipe the result into grep -v.
Agreed. I'll get back to this soon. I'm thinking of maybe also sorting the buffer-copy lines together in the report. Currently working on a two-week side project for fun. 😄
Okay, thanks @steven-johnson for pointing this out! I was unaware I had introduced some issues. I tried, but I can't fix them just by looking at the test logs and guessing what's wrong with the code. I'll delve into this a bit deeper later today/tomorrow and actually run the tests locally myself.
@steven-johnson @abadams Okay, I investigated this a little bit and managed to run the tests locally myself, and they do work, so I'm clueless about what might be going wrong. I see segfaults and bus errors, so this is clearly a serious bug, but I haven't been able to reproduce it. Any tips on how to analyze this? I only tested it on my own machine.
On host-metal, the crash happens inside the injected device-sync call.
Aha:
This is true for
I took the liberty of pushing a simple fix; now generator_aot_memory_profiler_mandelbrot doesn't crash, but fails with
OK, I pushed a followup fix that changes the semantics of calling halide_device_sync_global with a null device interface: it now returns success.
Oh wow thank you for taking a look!
I'm not following, though. How can it be that a buffer that has no device_interface assigned to it ends up in one of these injected sync calls?
So, the troublesome case is when we synchronize on a buffer copy. In which case can a buffer that gets copied (in either direction) not have a device_interface assigned to it? So, in other words, I don't think the second commit you added is the right solution. As far as my understanding goes: either the buffer copy is incorrectly injected, or the buffer is incorrectly missing the device_interface.

EDIT: One thing that especially confuses me is the fact that the copy_to_host() that got injected does work, which also relies on the device interface.

Additionally, I think we should add a test that just makes sure we have a pipeline where all three types of synchronization are explicitly executed within the compiled pipeline.

EDIT: I wrote such a test, and now I also get the error on my system.
Aha, I think I see. Not all buffers have a device_interface. The corresponding copy_to_host that gets injected just gets the interface through the runtime's helper functions, rather than requiring it up front.

Additionally, I believe that copy_to_host() actually is blocking, so we do not need an explicit sync with the GPU there. I'll just throw out that part.
I removed the explicit device synchronize for buffer copies (copy_to_host is blocking anyway). I added a test that makes sure buffer copies are compiled into the pipeline, and go in both directions. This pipeline is then run without profiling, and with the two types of profiling (Profile and ProfileByTimer). You can run this test:
Which, rendered, looks like this for me:

Thanks a bunch @steven-johnson for the initial analysis of the problem!
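For illustration, a minimal sketch (not the actual test added in this PR) of the kind of pipeline that forces buffer copies in both directions; the CUDA feature, the schedule, and the sizes are all assumptions picked for concreteness.

```cpp
#include "Halide.h"
using namespace Halide;

int main() {
    // Assumes a CUDA-capable host; swap the feature for another GPU backend if needed.
    Target t = get_host_target()
                   .with_feature(Target::CUDA)
                   .with_feature(Target::Profile);

    Func a("a"), b("b"), c("c");
    Var x("x"), y("y"), xi("xi"), yi("yi");

    a(x, y) = x + y;        // computed on the GPU
    b(x, y) = a(x, y) * 2;  // computed on the CPU -> needs a copy_to_host of a
    c(x, y) = b(x, y) + 1;  // computed on the GPU -> needs a copy_to_device of b

    a.compute_root().gpu_tile(x, y, xi, yi, 8, 8);
    b.compute_root();
    c.gpu_tile(x, y, xi, yi, 8, 8);

    Buffer<int> out = c.realize({256, 256}, t);
    out.copy_to_host();  // bring the final result back for inspection
    return 0;
}
```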
Yes, beautiful and a wonderful improvement!
Do you know if:
is related to my work here? https://buildbot.halide-lang.org/master/#/builders/68/builds/19/steps/23/logs/stdio

EDIT: Aha! It is also present in other PR build tests, such as #7967: https://buildbot.halide-lang.org/master/#/builders/68/builds/20
Is this ready to land (pending green)?
@steven-johnson To me it's ready. Three remarks:
Looks good to me!
Inject profiling for function calls to 'halide_copy_to_host' and 'halide_copy_to_device'. (halide#7913)

* Inject profiling for function calls to 'halide_copy_to_host' and 'halide_copy_to_device'.
* WIP: I get segfaults. The device_interface pointer is bogus.
* Figured it out...
* Allow global sync on d3d12.
* Cleanly time all buffer copies as well.
* Cleanup old comment.
* Following Andrew's suggestion for suffixing buffer copies in the profiler.
* Sort the profiler report lines into three sections: funcs, buffer copy to device, and buffer copy to host.
* Attempt to fix output parsing.
* Fix crash for copy_to_device
* halide_device_sync_global(NULL) -> success
* Fixed the buffer copy bug. Added a new test that will cause buffer copies in two directions within the compiled pipeline. This will catch this better in the future. Tweaked the profile report section header printing.
* Clang-format, my dear friend...
This is the first time I'm writing code to manipulate the IR tree, so I'm not quite sure if this is the right way to do it: I actually want to wrap the Call op, but decorating that with additional Stmts doesn't seem to be something I can do, so I'm manipulating the surrounding LetStmt, as generated by Halide/src/InjectHostDevBufferCopies.cpp (lines 21 to 27 in d023065).

Unlike the malloc and free function-ids, I opted to lazily fetch them when the function call is actually encountered, to avoid adding these lines to the profile report for CPU-only functions.

Example of the impact on my pipeline, where I had a bug: one Func was accidentally scheduled on the CPU and required some copies back and forth, which I didn't notice.

Before this PR:

After this PR:

As you can see, the time attributed to blur_pass_h and blur_pass_v moved to the copy-to-host and copy-to-device functions.

Discussion: just like malloc and free get their own timer-slot, I think these buffer copies also deserve a separate slot to profile their time.
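A very rough sketch of the shape of the mutation described above, not the code in this PR (the real pass presumably lives in src/Profiling.cpp); get_or_create_copy_slot() and set_current_slot() are hypothetical stand-ins for the profiler's actual bookkeeping.

```cpp
#include "Halide.h"

using namespace Halide;
using namespace Halide::Internal;

class ProfileBufferCopies : public IRMutator {
    using IRMutator::visit;

    Stmt visit(const LetStmt *op) override {
        // InjectHostDevBufferCopies emits the copy as the value of a LetStmt
        // whose body asserts on the result, so we wrap the whole LetStmt
        // rather than the Call itself.
        const Call *call = op->value.as<Call>();
        const bool is_copy =
            call && (call->name == "halide_copy_to_host" ||
                     call->name == "halide_copy_to_device");
        if (!is_copy) {
            return IRMutator::visit(op);
        }
        // Lazily create the slot, so CPU-only pipelines don't get these lines.
        int slot = get_or_create_copy_slot(call->name);
        Stmt wrapped = LetStmt::make(op->name, op->value, mutate(op->body));
        // Bill the copy to its own slot, then restore the enclosing func's slot.
        return Block::make({set_current_slot(slot),
                            wrapped,
                            set_current_slot(enclosing_func_slot)});
    }

    // --- hypothetical bookkeeping, for illustration only ---
    int enclosing_func_slot = 0;
    int get_or_create_copy_slot(const std::string &) {
        return 1;  // the real pass would allocate a per-name profiler id here
    }
    Stmt set_current_slot(int id) {
        return Evaluate::make(id);  // the real pass emits the profiler update here
    }
};
```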