
[RFC] Updated cl_khr_defined_builtin_kernels #1007

Draft
wants to merge 9 commits into base: main

Conversation

@linehill commented Nov 23, 2023

This draft is an updated version of #867.

Notable changes:

  • Changed defined built-in kernels to use the new cl_tensor data type for data arguments. The data type is drafted here.
  • Added API functions for creating defined built-in kernels.
  • Added sample code for utilizing defined built-in kernels.
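To give a feel for how these pieces are intended to fit together, here is a minimal host-side sketch. Every DBK-specific identifier in it (cl_dbk_id, CL_DBK_GEMM, cl_dbk_gemm_attributes, clCreateProgramWithDefinedBuiltInKernels) is a placeholder invented for illustration, not a name defined by the draft; the rendered HTML linked below has the actual API.

```c
/* Placeholder sketch of the host-side flow for a defined built-in kernel
 * (DBK).  The DBK-specific names are invented; only clBuildProgram and
 * clCreateKernel below are standard OpenCL API. */
#include <CL/cl.h>

cl_int create_gemm_dbk(cl_context ctx, cl_device_id dev,
                       cl_program *prog_out, cl_kernel *kernel_out)
{
    cl_int err;

    /* 1. Describe the DBK: which built-in operation, plus its attributes
     *    (tensor shapes, element types, ...).  Placeholder types. */
    cl_dbk_id dbk_id = CL_DBK_GEMM;               /* placeholder enum   */
    cl_dbk_gemm_attributes attrs = {0};           /* placeholder struct */
    const void *attr_list[] = { &attrs };

    /* 2. Create a program from the described DBKs (placeholder entry
     *    point, analogous in spirit to clCreateProgramWithBuiltInKernels). */
    *prog_out = clCreateProgramWithDefinedBuiltInKernels(
        ctx, 1, &dev, 1, &dbk_id, attr_list, &err);
    if (err != CL_SUCCESS)
        return err;

    err = clBuildProgram(*prog_out, 1, &dev, NULL, NULL, NULL);
    if (err != CL_SUCCESS)
        return err;

    /* 3. Obtain a kernel handle for the DBK.  The draft adds a dedicated
     *    entry point for this; plain clCreateKernel is shown only to keep
     *    the sketch short. */
    *kernel_out = clCreateKernel(*prog_out, "gemm", &err);
    return err;
}
```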

HTML is rendered here. (last update: 8/22/2024)

For previous discussion about the topic, follow this link.

pjaaskel and others added 7 commits December 13, 2022 15:53
First WiP draft of a defined BiKs extension.
* Add API function for describing (and querying at the same time) a DBK.

* Add API function for creating a program from described DBKs.

* Add API function for creating a kernel handle for DBK.

* Change DBK description.

  * Replaced old DBKs with simpler ones as a starting point for illustrating the new features.

  * Use (yet-to-be-specified) tensors.

  * Add constraints for usage.

* Add sample code for queuing a DBK.

* Device-side launching needs a redesign. Added a note about it.
Co-authored-by: Pekka Jääskeläinen <[email protected]>

* Fixes and tweaks in the DBK code sample.

* Update date.

@linehill mentioned this pull request Nov 23, 2023
@linehill changed the title from [RFC] Updated cl_khr_defined_builtin_kernel to [RFC] Updated cl_khr_defined_builtin_kernels Nov 23, 2023
@FreddieWitherden

While I am extremely supportive of the idea of OpenCL getting built-in kernels, I am unsure whether this is the right approach.

Instead, I feel there are real advantages to handling this at the kernel language level. That is to say, one simply writes a kernel corresponding to an S/DGEMM (say). This kernel can then be annotated (assuming the availability of a language extension) to say "try to match this against any built-ins and, if it matches, feel free to substitute". The runtime would then be free to use its own implementation and disregard any user-provided work-group sizes and the like.
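A rough sketch of what such an annotated kernel might look like; the `substitutable` attribute below is invented purely for illustration and is not part of any existing OpenCL C extension:

```c
// Naive SGEMM written as an ordinary OpenCL C kernel.  The hypothetical
// "substitutable" attribute would tell the runtime it may swap the whole
// kernel for a built-in implementation if it recognises the pattern.
__kernel __attribute__((substitutable("sgemm")))   // invented attribute
void sgemm(__global const float *A, __global const float *B,
           __global float *C, int N, int K,
           float alpha, float beta)
{
    const int i = get_global_id(0);   // row of C
    const int j = get_global_id(1);   // column of C

    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[i * K + k] * B[k * N + j];

    // Row-major C = alpha * A * B + beta * C
    C[i * N + j] = alpha * acc + beta * C[i * N + j];
}
```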

This idea is not without precedent: Intel's C compiler has been applying this exact approach for the best part of a decade. Write a triple-for-loop S/DGEMM and the compiler will recognise it and substitute all three loops with a call out to MKL.

The advantages are numerous. Firstly, as a developer it makes my life easier: I am going to need a fallback kernel anyway, given that it is unlikely all vendors will ship built-in kernels anytime soon, so I already have my own kernel to hand. Secondly, it is more expressive. With built-ins it is going to be hard to agree on a set of functionality everyone is happy with (for S/DGEMM you have the data layout, alpha and beta and whether they are fixed or varying, leading dimensions, and all of the extra fun that comes with tensors and batching). Make your API too expansive and a vendor may be locked out just because they only support 95% of the functionality; make it too restrictive and you leave performance on the table. Kernels plus pattern matching is a far more natural means of handling this.

Additionally, if one is forward-looking, kernels enable easier bolt-ons. For example, say I want an S/DGEMM but then want to take the absolute value of each element of C. This is a trivial post-processing operation which ideally should be fused into the end of the GEMM kernel. It is difficult to express through an API, but trivial inside a kernel. The compiler can then either fuse it in (assuming it knows how, but being a compiler it is in the best place to do so) or simply crack it into two operations (S/DGEMM plus a second kernel to take the absolute values). Either way this makes fusing possible and is compatible with existing code (command buffers could potentially enable fusing, but that requires runtimes from science fiction).
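To make the bolt-on concrete, here is the same style of naive illustrative kernel with the post-processing folded into the epilogue; inside the kernel it is a single extra call:

```c
// Same naive SGEMM, with the |.| post-processing fused into the epilogue.
// A fixed-function API would need an explicit knob for this; in kernel
// code it is one extra line.
__kernel void sgemm_abs(__global const float *A, __global const float *B,
                        __global float *C, int N, int K,
                        float alpha, float beta)
{
    const int i = get_global_id(0);
    const int j = get_global_id(1);

    float acc = 0.0f;
    for (int k = 0; k < K; ++k)
        acc += A[i * K + k] * B[k * N + j];

    // Fused post-processing: absolute value of every element of C.
    C[i * N + j] = fabs(alpha * acc + beta * C[i * N + j]);
}
```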

@pjaaskel (Contributor) commented Dec 4, 2023

Thanks for your useful input. What you describe makes a lot of sense, and I agree with the software kernel writer's perspective here, but from the point of view of this DBK proposal, automated function pattern matching would be just another way of utilizing a DBK, not a completely contradictory alternative mechanism. An annotation mechanism along these lines is definitely something to add to the OpenCL C/SPIR-V side of the spec when we start on it.

Unfortunately, automated pattern matching of primitive operations in potentially complex non-DAG functions is known to be very tricky and rather limited in applicability, while with this proposal we aim to cover a wide range of useful accelerated functions from multiple application domains, not only GEMMs. If I understood your suggestion right, relying on automated matching would require writing a (potentially) complex pattern matcher for each DBK before the compiler could make use of it.

The matching task would be easier if the programmer annotated the kernel with the matching DBK directly, much as you describe, rather than delegating the "DBK detection" to the compiler. That is, what you describe would instead be a software-fallback mechanism / an assist for automated DBK selection.

A comment in the original thread asks whether we would like to make DBKs callable from software kernels by one mechanism or another, which could help kernel fusion use cases among other things. I think that makes sense. For now, the main means for automated kernel fusion written up in this proposal is via command buffers, whose contents can be optimized as a single entity, but this can be expanded and improved to cover other means.
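For concreteness, below is a minimal sketch of that command-buffer route using the existing cl_khr_command_buffer extension. The kernel handles (`gemm_dbk` for the DBK, `postproc` for a follow-up kernel) and the problem size are assumptions of the sketch and are expected to exist with their arguments already set; in real code the *KHR entry points are obtained via clGetExtensionFunctionAddressForPlatform.

```c
#include <CL/cl.h>
#include <CL/cl_ext.h>

/* Record a DBK launch and a dependent post-processing kernel into one
 * command buffer.  The finalized buffer is a closed region the runtime
 * can analyse, optimize, or fuse as a whole before it is ever enqueued. */
void record_and_run(cl_command_queue queue,
                    cl_kernel gemm_dbk, cl_kernel postproc,
                    size_t M, size_t N)
{
    cl_int err;
    cl_command_buffer_khr cmdbuf =
        clCreateCommandBufferKHR(1, &queue, NULL, &err);

    size_t gws[2] = {M, N};
    cl_sync_point_khr gemm_done;

    /* 1. Record the DBK launch. */
    err = clCommandNDRangeKernelKHR(cmdbuf, NULL, NULL, gemm_dbk,
                                    2, NULL, gws, NULL,
                                    0, NULL, &gemm_done, NULL);

    /* 2. Record the post-processing kernel, dependent via a sync point. */
    err = clCommandNDRangeKernelKHR(cmdbuf, NULL, NULL, postproc,
                                    2, NULL, gws, NULL,
                                    1, &gemm_done, NULL, NULL);

    /* 3. Finalizing closes the graph; the runtime now sees both commands
     *    as one entity. */
    err = clFinalizeCommandBufferKHR(cmdbuf);
    err = clEnqueueCommandBufferKHR(0, NULL, cmdbuf, 0, NULL, NULL);
}
```

Once the buffer is finalized, a driver (or a reusable "fuser middleware") can treat the recorded commands as a small DAG and merge compatible nodes before execution.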

You consider automated task/compute-graph node fusion to be science fiction? I disagree on this point: I think a full kernel pattern matcher working from primitive software operations is much more complex than merging successive nodes in a higher-level task/computation graph (for example, via simple rule- or pattern-based matching). Moreover, "fuser middlewares" that assist in this task can be provided as reusable (open-source) components implemented in frameworks like MLIR or TVM, which drivers can integrate if they want, since they work on generic, well-defined DAGs.

So, overall, I think it's better to support both options: a nice way to provide a software assist/fallback, and whole-task-graph optimizations. I think both can be done within the specs we are arriving at.

@pjaaskel (Contributor)

A prototype implementation has been merged in PoCL: https://github.com/pocl/pocl/pull/1521/files

> tensors of shape `(1, a, b)`.
>
> Operations of the matrix muliplication are performed in the precision
> of the `elementof\(COUT)`.

There seem to be a couple of formatting errors regarding \(COUT)

@linehill (Author)
Thanks for pointing this out. It'll be fixed.
