Commit bda408a
[UR][CUDA] Add opportunistic queue serialize prop, impl for cuda (#18443)
Makes short kernels that don't need to see the same global memory (or
user guarantees global memory writes are complete) launch faster. See
https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#programmatic-dependent-launch-and-synchronization
Makes lots of short kernels in cutlass great again. cc @FMarno who
identified this performance gap.
---------
Signed-off-by: JackAKirk <[email protected]>
Co-authored-by: Jakub Chlanda <[email protected]>1 parent fd866a3 commit bda408a
File tree
5 files changed
+38
-0
lines changed- unified-runtime
- include
- scripts/core
- source/adapters/cuda
- test/conformance/exp_launch_properties
5 files changed
+38
-0
lines changedSome generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
Some generated files are not rendered by default. Learn more about customizing how changed files appear on GitHub.
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
36 | 36 | | |
37 | 37 | | |
38 | 38 | | |
| 39 | + | |
| 40 | + | |
39 | 41 | | |
40 | 42 | | |
41 | 43 | | |
| |||
56 | 58 | | |
57 | 59 | | |
58 | 60 | | |
| 61 | + | |
| 62 | + | |
| 63 | + | |
| 64 | + | |
59 | 65 | | |
60 | 66 | | |
61 | 67 | | |
| |||
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
560 | 560 | | |
561 | 561 | | |
562 | 562 | | |
| 563 | + | |
| 564 | + | |
| 565 | + | |
| 566 | + | |
| 567 | + | |
| 568 | + | |
| 569 | + | |
563 | 570 | | |
564 | 571 | | |
565 | 572 | | |
| |||
Lines changed: 9 additions & 0 deletions
| Original file line number | Diff line number | Diff line change | |
|---|---|---|---|
| |||
66 | 66 | | |
67 | 67 | | |
68 | 68 | | |
| 69 | + | |
| 70 | + | |
| 71 | + | |
| 72 | + | |
| 73 | + | |
| 74 | + | |
| 75 | + | |
| 76 | + | |
| 77 | + | |
69 | 78 | | |
70 | 79 | | |
71 | 80 | | |
| |||
0 commit comments