oclude is a command line tool and a Python 3 module to run and test arbitrary standalone OpenCL kernels, without the need to write hostcode or specify its arguments.
Besides simply running the OpenCL kernel, oclude can also:
- measure its execution time,
- count the instructions executed through an accurate and meaningful mapping from the OpenCL kernel code to the LLVM instruction set, created by using the clang toolkit,
- profile a specified OpenCL device.
The project is currently functional, with some limitations regarding mainly the complexity of the OpenCL C source code provided to it. More specifically, successful handling of OpenCL source files that are more complicated than the rodinia benchmark suite is not guaranteed.
* Keep in mind that proper behavior is not guaranteed if versions other than the ones listed below are used.
At least for now, oclude is developed and expected to work on *nix operating systems only.
- python, version >= 3.6
- setuptools is recommended
In case you want to use oclude as an OpenCL kernel driver only or measure the execution time of OpenCL kernels only:
- An OpenCL runtime environment. Keep in mind that installing oclude also installs pyopencl, which means that, depending on your case, this dependency may get resolved automatically.
In case you want to use oclude as an OpenCL kernel profiler (i.e. get LLVM instruction counts):
- The clang compiler (tested with version 10.0.0 that was installed along with LLVM)
- g++ with C++17 (or later) support
- LLVM (tested with version 10.0.0git. You can check your version by running llvm-config --version in a terminal. Version 3.8 was tested and did not work, so my guess is that you will need something considerably newer than that)
- Note: If the OpenCL kernels that you want to profile are very complex and/or large (meaning a really high count of instructions), the selected OpenCL device should support the cl_khr_int64_base_atomics OpenCL extension. If not, oclude (and, more specifically, the hostcode component) will warn you with the following message:
[hostcode] WARNING: Selected device does not support the `cl_khr_int64_base_atomics` OpenCL extension!
[hostcode] This means that instructions will not get correctly reported if they are too many!
Be aware that, if this extension is not supported by the selected OpenCL device, it is not guaranteed that the instructions reported for complex and/or large OpenCL kernels will be truthful.
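To find out ahead of time whether a device is affected, the extension string it reports can be inspected. The sketch below is not part of oclude: the helper name supports_int64_atomics is hypothetical, and the two extension strings are made up for illustration. It assumes the usual space-separated extension list that OpenCL reports for a device (pyopencl exposes it as the extensions attribute of a device object).

```python
REQUIRED_EXTENSION = "cl_khr_int64_base_atomics"


def supports_int64_atomics(extensions: str) -> bool:
    """Return True if the space-separated extension string lists the extension."""
    # Split on whitespace so that e.g. "cl_khr_int64_base_atomics_foo"
    # would not be mistaken for the extension we are looking for.
    return REQUIRED_EXTENSION in extensions.split()


# Made-up extension strings for illustration only; with pyopencl the real
# string would come from a device object's `extensions` attribute.
with_ext = "cl_khr_fp64 cl_khr_int64_base_atomics cl_khr_int64_extended_atomics"
without_ext = "cl_khr_fp64 cl_khr_global_int32_base_atomics"

print(supports_int64_atomics(with_ext))
print(supports_int64_atomics(without_ext))
```

If the check fails for the device you intend to use, treat the instruction counts of large kernels with caution, as the warning above says.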
For the time being, oclude is not available on PyPI. This is hopefully going to change in the future.
- git clone the repo and cd inside it:
git clone [email protected]:zehanort/oclude.git
cd oclude
- (optional) If you need to use oclude as a full OpenCL kernel profiler (i.e. count LLVM instructions executed), you will need to build a C++ component of oclude. Simply run make in the directory you are currently at. If any errors occur, your g++ and/or LLVM versions are not compatible with oclude. In that case, skip this step; you will not be able to use oclude as a full OpenCL kernel profiler (unless you change your g++ and/or LLVM versions, obviously).
- Install oclude on your system or inside a virtual environment (e.g. using venv). From the directory you are currently at, run:
pip install .
or
pip install -e .
in case you would like to experiment with the oclude code.
Everything you need to know about the different ways in which oclude can be used, including full documentation of all the APIs it exports, is located in the wiki. The examples in the following sections use the oclude CLI.
As a brief overview, oclude supports 2 different commands:
- the profiling of the selected device, and
- the execution and/or profiling of an OpenCL kernel.
The latter supports 2 different modes of operation, apart from simply executing the kernel:
- count the LLVM instructions that were executed, codenamed instcounts, and/or
- measure the execution time, codenamed timeit:
oclude
├── device
└── kernel
    ├── instcounts
    └── timeit
In the oclude CLI, the syntax is the following:
$ oclude <command> <command flags>
Note that command is optional and defaults to kernel.
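The defaulting rule can be pictured as a small normalization step on the argument list. This is only a sketch of the behavior described above, not oclude's actual argument parser; the function name with_default_command and the file name kernel.cl are hypothetical.

```python
COMMANDS = ("device", "kernel")


def with_default_command(argv):
    """Prepend 'kernel' when the first argument is not a known command."""
    if argv and argv[0] in COMMANDS:
        return list(argv)
    return ["kernel", *argv]


# An explicit command is left untouched.
print(with_default_command(["device", "-p", "0", "-d", "0"]))
# No command given: the flags are treated as flags of the kernel command.
print(with_default_command(["-f", "kernel.cl", "-g", "1024", "-l", "128"]))
```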
An example of the device command in the oclude CLI could be the following:
$ oclude device -p 0 -d 0
[hostcode] Collecting profiling info for the following device:
[hostcode] Platform: Intel(R) OpenCL HD Graphics
[hostcode] Device: Intel(R) Gen9 HD Graphics NEO
[hostcode] Version: OpenCL 2.1 NEO
[hostcode] Please wait, this may take a while...
Profiling info for selected OpenCL device:
profiling overhead (time) - 0.011303499341011047
profiling overhead (percentage) - 17.80%
command latency - 0.06351426243782043
host-to-device transfer latency - 0.011074915528297424
device-to-host transfer latency - 0.011512413620948792
device-to-device transfer latency - 0.06323426961898804
host-device bandwidth bandwidth @ 64 bytes - 0.005645181903735443 GB/s
host-device bandwidth bandwidth @ 256 bytes - 0.022125035974706695 GB/s
host-device bandwidth bandwidth @ 1024 bytes - 0.08657326467722175 GB/s
... a lot of bandwidth measurements follow ...
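In this sample, the overhead percentage is consistent with the profiling overhead time divided by the command latency. The short check below only reproduces that arithmetic from the numbers printed above; it is an observation about this one output, not a statement about how oclude computes the value.

```python
# Values copied from the sample output above.
overhead_time = 0.011303499341011047    # "profiling overhead (time)"
command_latency = 0.06351426243782043   # "command latency"

ratio = overhead_time / command_latency
print(f"{ratio:.2%}")  # formatted the same way as the reported percentage
```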
An example of the kernel command in the oclude CLI could be the following (note that the kernel keyword is omitted as it is implied when absent and that, besides running the kernel, nothing else really happens):
$ oclude -f tests/rodinia_kernels/dwt2d/com_dwt.cl -k c_CopySrcToComponents -g 1024 -l 128
[oclude] INFO: Input file tests/rodinia_kernels/dwt2d/com_dwt.cl is not cached
[oclude] Running kernel 'c_CopySrcToComponents' from file tests/rodinia_kernels/dwt2d/com_dwt.cl
[hostcode] Using the following device:
[hostcode] Platform: Intel(R) OpenCL HD Graphics
[hostcode] Device: Intel(R) Gen9 HD Graphics NEO
[hostcode] Version: OpenCL 2.1 NEO
[hostcode] Kernel name: c_CopySrcToComponents
[hostcode] Kernel arg 1: d_r (int*, global)
[hostcode] Kernel arg 2: d_g (int*, global)
[hostcode] Kernel arg 3: d_b (int*, global)
[hostcode] Kernel arg 4: cl_d_src (uchar*, global)
[hostcode] Kernel arg 5: pixels (int, private)
[hostcode] About to execute kernel with Global NDRange = 1024 and Local NDRange = 128
[hostcode] Number of executions (a.k.a. samples) to perform: 1
[hostcode] Kernel run completed successfully
Observe the following from the usage above:
- Firstly, an OpenCL kernel file (*.cl) is specified with the --file/-f flag
- A kernel from inside this file is chosen with --kernel/-k (optional; if it is not used, oclude will inform the user of the kernels present in the input file and they will be able to choose which one to run interactively)
- The global and local OpenCL NDRanges are specified with the --gsize/-g and --lsize/-l flags, respectively. Only 1 dimension is supported, therefore these flags accept only a single positive integer.
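For a 1-dimensional NDRange, the two sizes determine how many work-groups are launched. The sketch below assumes the classic OpenCL rule that the global size must be a multiple of the local size (OpenCL 2.0 and later can relax this with non-uniform work-groups); whether oclude performs such a check itself is not stated here, and the helper name work_group_count is hypothetical.

```python
def work_group_count(gsize: int, lsize: int) -> int:
    """Number of work-groups for a 1-D NDRange with uniform work-groups."""
    if gsize <= 0 or lsize <= 0:
        raise ValueError("global and local sizes must be positive integers")
    if gsize % lsize != 0:
        raise ValueError("global size must be a multiple of local size")
    return gsize // lsize


# Sizes from the example above: -g 1024 -l 128
print(work_group_count(1024, 128))
```

With the sizes used in the example, 1024 work-items are split into 8 work-groups of 128 work-items each.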
Nothing interesting happened though... That is why the kernel command has 2 modes of operation.
Simply use the --inst-counts/-i flag to instrument the kernel and count the LLVM instructions that correspond to the instructions that were actually run by the kernel:
$ oclude -f tests/rodinia_kernels/dwt2d/com_dwt.cl -k c_CopySrcToComponents -g 1024 -l 128 -i
[oclude] INFO: Input file tests/rodinia_kernels/dwt2d/com_dwt.cl is cached
[oclude] INFO: Using cached instrumented file
[oclude] Running kernel 'c_CopySrcToComponents' from file tests/rodinia_kernels/dwt2d/com_dwt.cl
[hostcode] Using the following device:
[hostcode] Platform: Intel(R) OpenCL HD Graphics
[hostcode] Device: Intel(R) Gen9 HD Graphics NEO
[hostcode] Version: OpenCL 2.1 NEO
[hostcode] Kernel name: c_CopySrcToComponents
[hostcode] Kernel arg 1: d_r (int*, global)
[hostcode] Kernel arg 2: d_g (int*, global)
[hostcode] Kernel arg 3: d_b (int*, global)
[hostcode] Kernel arg 4: cl_d_src (uchar*, global)
[hostcode] Kernel arg 5: pixels (int, private)
[hostcode] About to execute kernel with Global NDRange = 1024 and Local NDRange = 128
[hostcode] Number of executions (a.k.a. samples) to perform: 1
[hostcode] Collecting instruction counts...
[hostcode] Kernel run completed successfully
Instructions executed for kernel 'c_CopySrcToComponents':
26920 - load private
20776 - alloca
14336 - store private
12288 - add
11631 - getelementptr
11264 - mul
8855 - store callee
7245 - load callee
4096 - call
3072 - load global
3072 - load local
3072 - store local
3072 - zext
2415 - sub
1829 - br
1024 - ret
1024 - icmp
NOTE: The output of this mode was designed to resemble that of Oclgrind.
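Because every line of the report has the form "count - instruction", the output is easy to post-process. The sketch below parses the report shown above into a dictionary and totals it; parse_counts is a hypothetical helper, not something oclude provides.

```python
REPORT = """\
26920 - load private
20776 - alloca
14336 - store private
12288 - add
11631 - getelementptr
11264 - mul
8855 - store callee
7245 - load callee
4096 - call
3072 - load global
3072 - load local
3072 - store local
3072 - zext
2415 - sub
1829 - br
1024 - ret
1024 - icmp
"""


def parse_counts(report: str) -> dict:
    """Map each instruction name to its count, skipping unrelated lines."""
    counts = {}
    for line in report.splitlines():
        if " - " not in line:
            continue
        number, name = line.split(" - ", 1)
        counts[name.strip()] = int(number)
    return counts


counts = parse_counts(REPORT)
total = sum(counts.values())
work_items = 1024  # global NDRange used in the example run

print(total)
print(counts["ret"] / work_items)  # ret instructions per work-item
```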
Simply use the --time-it/-t flag to measure the execution time of the specified kernel:
$ oclude -f tests/rodinia_kernels/dwt2d/com_dwt.cl -k c_CopySrcToComponents -g 1024 -l 128 -t
[oclude] INFO: Input file tests/rodinia_kernels/dwt2d/com_dwt.cl is not cached
[oclude] Running kernel 'c_CopySrcToComponents' from file tests/rodinia_kernels/dwt2d/com_dwt.cl
[hostcode] Using the following device:
[hostcode] Platform: Intel(R) OpenCL HD Graphics
[hostcode] Device: Intel(R) Gen9 HD Graphics NEO
[hostcode] Version: OpenCL 2.1 NEO
[hostcode] Kernel name: c_CopySrcToComponents
[hostcode] Kernel arg 1: d_r (int*, global)
[hostcode] Kernel arg 2: d_g (int*, global)
[hostcode] Kernel arg 3: d_b (int*, global)
[hostcode] Kernel arg 4: cl_d_src (uchar*, global)
[hostcode] Kernel arg 5: pixels (int, private)
[hostcode] About to execute kernel with Global NDRange = 1024 and Local NDRange = 128
[hostcode] Number of executions (a.k.a. samples) to perform: 1
[hostcode] Collecting time profiling info...
[hostcode] Kernel run completed successfully
Time measurement info regarding the execution for kernel 'c_CopySrcToComponents' (in milliseconds):
hostcode - 1.9354820251464844
device - 0.013415999999999999
transfer - 1.9220660251464843
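In this sample, the three reported figures are related: the transfer value equals the hostcode value minus the device value. The check below reproduces that arithmetic from the numbers above; the relationship is inferred from this one output and is not taken from oclude's documentation.

```python
# Values copied from the sample output above (milliseconds).
hostcode_ms = 1.9354820251464844
device_ms = 0.013415999999999999
transfer_ms = 1.9220660251464843

derived_transfer_ms = hostcode_ms - device_ms
print(abs(derived_transfer_ms - transfer_ms) < 1e-9)

# Share of the host-side time that was spent executing on the device.
print(f"{device_ms / hostcode_ms:.2%}")
```

For such a small kernel, almost all of the time measured on the host side falls outside the device execution itself.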
The 2 modes of the kernel command can be combined to measure the execution time of the instrumented OpenCL code.
oclude exports its 2 commands, device and kernel, as 2 different functions:
- the device command is exported as the oclude.profile_opencl_device() function
- the kernel command is exported as the oclude.profile_opencl_kernel() function
Their complete documentation can be found in the respective wiki page.
- For the time being, oclude instruments the OpenCL source code directly in order to count the LLVM instructions that are executed. To achieve that, a mapping between the OpenCL C source code and the LLVM bitcode basic blocks has been designed. As you may know, a 1-1 mapping between source code and basic blocks of an IR is not a trivial problem, which means that many design choices had to be made. For this mapping to be properly designed, no optimizations could be used during the parsing of the LLVM instructions to which the input source file is compiled. This means that the instruction counts that are reported when using the kernel command with the --instcounts/-i mode of operation correspond to the unoptimized OpenCL source code.
- If there are certain sizes and/or values of the input arguments that may lead the specified kernel to a segfault, there are 3 different possible outcomes:
  - normal execution
  - empty output
  - execution of oclude hangs and you have to kill it manually
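The general idea behind counting instructions through basic blocks, as described in the first note above, can be sketched as follows: each basic block contains a fixed set of instructions, so counting how often each block runs is enough to derive per-instruction totals. This is a simplified illustration and not oclude's implementation; the block names, their contents and the hit counts are all hypothetical.

```python
from collections import Counter

# Hypothetical static profile: the LLVM instructions contained in each
# basic block of an (unoptimized) kernel.
BLOCK_INSTRUCTIONS = {
    "entry": Counter({"alloca": 2, "store private": 2, "br": 1}),
    "loop.body": Counter({"load private": 2, "add": 1, "store private": 1, "br": 1}),
    "exit": Counter({"ret": 1}),
}


def total_instruction_counts(block_hits: dict) -> Counter:
    """Combine per-block execution counts with each block's static contents."""
    totals = Counter()
    for block, hits in block_hits.items():
        for instruction, per_visit in BLOCK_INSTRUCTIONS[block].items():
            totals[instruction] += hits * per_visit
    return totals


# Hypothetical run: the loop body executed 10 times.
hits = {"entry": 1, "loop.body": 10, "exit": 1}
totals = total_instruction_counts(hits)
for instruction, count in totals.most_common():
    print(count, "-", instruction)
```

Since the static contents of a block change when the compiler optimizes the code, a scheme like this only stays meaningful for unoptimized code, which matches the limitation stated in the note.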