Standalone Patatrack pixel tracking
The purpose of this package is to explore various performance portability solutions with the Patatrack pixel tracking application. The version here corresponds to CMSSW_11_2_0_pre8_Patatrack.
The application is designed to require minimal dependencies on the system. All programs require
- GNU Make, `curl`, `md5sum`, `tar`
- C++17 capable compiler. For programs using CUDA that must work with `nvcc`, this means GCC 8, 9, 10 or 11 (since CUDA 11.4.1).
  - testing is currently done with GCC 8
  - note that due to a bug in GCC, GCC 10.3 is not supported
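The compiler constraint above can be checked before starting a long build. The following is a minimal sketch, not part of the project: the function name `gcc_nvcc_status` is hypothetical, and it only encodes the rules listed above (GCC 8, 9, 10 or 11, excluding 10.3). It does not check that GCC 11 is paired with CUDA 11.4.1 or newer.

```shell
# Classify a GCC version string for use as the nvcc host compiler.
# Prints "supported" or "unsupported"; the rules mirror the list above.
# (Hypothetical helper, not provided by the project.)
gcc_nvcc_status() {
  version=$1
  major=${version%%.*}          # text before the first dot
  rest=${version#*.}            # text after the first dot
  minor=${rest%%.*}             # second version component
  case "$major" in
    8|9|11) echo supported ;;
    10)
      if [ "$minor" = "3" ]; then
        echo unsupported        # GCC 10.3 is excluded because of a GCC bug
      else
        echo supported
      fi ;;
    *) echo unsupported ;;
  esac
}
```

One possible use is `gcc_nvcc_status "$(gcc -dumpfullversion -dumpversion)"`, which should print the full version on both older and newer GCC releases.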
In addition, the individual programs assume the following to be found on the system

Application | CMake (>= 3.16) | CUDA 11.2 | ROCm 5.0 | Intel oneAPI Base Toolkit |
---|---|---|---|---|
`cudatest` | | ✔️ | | |
`cuda` | | ✔️ | | |
`cudadev` | | ✔️ | | |
`cudauvm` | | ✔️ | | |
`cudacompat` | | ✔️ | | |
`hiptest` | | | ✔️ | |
`hip` | | | ✔️ | |
`kokkostest` | ✔️ | ✅ (1) | ✅ (2) | |
`kokkos` | ✔️ | ✅ (1) | ✅ (2) | |
`alpakatest` | | ✅ (3) | ✅ (4) | |
`alpaka` | | ✅ (3) | ✅ (4) | |
`sycltest` | | | | ✔️ |
`stdpar` | | ✔️ | | |
1. `kokkos` and `kokkostest` have an optional dependence on CUDA, by default it is required (see `kokkos` and `kokkostest` for more details)
2. `kokkos` and `kokkostest` have an optional dependence on ROCm, by default it is not required (see `kokkos` and `kokkostest` for more details)
3. `alpaka` and `alpakatest` have an optional dependence on CUDA, by default it is required (see `alpaka` and `alpakatest` for more details)
4. `alpaka` and `alpakatest` have an optional dependence on ROCm, by default it is not required (see `alpaka` and `alpakatest` for more details)
All other dependencies (listed below) are downloaded and built automatically
Application | TBB | Eigen | Kokkos | Boost (1) | Alpaka | libbacktrace | hwloc |
---|---|---|---|---|---|---|---|
fwtest |
✔️ | ||||||
serial |
✔️ | ✔️ | ✔️ | ✔️ | |||
cudatest |
✔️ | ✔️ | ✔️ | ||||
cuda |
✔️ | ✔️ | ✔️ | ✔️ | |||
cudadev |
✔️ | ✔️ | ✔️ | ✔️ | |||
cudauvm |
✔️ | ✔️ | ✔️ | ✔️ | |||
cudacompat |
✔️ | ✔️ | ✔️ | ✔️ | |||
hiptest |
✔️ | ✔️ | ✔️ | ||||
hip |
✔️ | ✔️ | ✔️ | ✔️ | |||
kokkostest |
✔️ | ✔️ | ✔️ | ✔️ | ✔️ (2) | ||
kokkos |
✔️ | ✔️ | ✔️ | ✔️ (2) | |||
alpakatest |
✔️ | ✔️ | ✔️ | ||||
alpaka |
✔️ | ✔️ | ✔️ | ||||
sycltest |
✔️ | ||||||
stdpar |
✔️ | ✔️ | ✔️ | ✔️ |
1. Boost libraries from the system can also be used, but they need to be version 1.73.0 or newer
2. `kokkos` and `kokkostest` have an optional dependence on hwloc, by default it is not required (see `kokkos` and `kokkostest` for more details)
The input data set consists of a minimal binary dump of 1000 ttbar+PU events from the /TTToHadronic_TuneCP5_13TeV-powheg-pythia8/RunIIAutumn18DR-PUAvg50IdealConditions_IdealConditions_102X_upgrade2018_design_v9_ext1-v2/FEVTDEBUGHLT dataset from the CMS Open Data. The data are downloaded automatically during the build process.
RHEL 7.x / CentOS 7.x use GCC 4.8 as their system compiler. More recent versions can be used from the "Developer Toolset" software collections:
# list available software collections
$ scl -l
devtoolset-9
# load the GCC 9.x environment
$ source scl_source enable devtoolset-9
$ gcc --version
gcc (GCC) 9.3.1 20200408 (Red Hat 9.3.1-2)
Copyright (C) 2019 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Various versions of GCC are also available from the SFT CVMFS area, for example:
$ source /cvmfs/sft.cern.ch/lcg/contrib/gcc/8.3.0/x86_64-centos7/setup.sh
$ gcc --version
gcc (GCC) 8.3.0
Copyright (C) 2018 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
RHEL 8.x / CentOS 8.x use GCC 8 as their system compiler.
Application | Description | Framework | Device framework | Test code | Raw2Cluster | RecHit | Pixel tracking | Vertex | Transfers to CPU | Validation code | Validated |
---|---|---|---|---|---|---|---|---|---|---|---|
fwtest |
Framework test | ✔️ | ✔️ | ||||||||
serial |
CPU version (via cudaCompat ) |
✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ||
cudatest |
CUDA FW test | ✔️ | ✔️ | ✔️ | |||||||
cuda |
CUDA version (frozen) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
cudadev |
CUDA version (development) | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
cudauvm |
CUDA version with managed memory | ✔️ | ✔️ | ✔️ | ✅ | ✅ | ✅ | ✅ | ✔️ | ✔️ | |
cudacompat |
cudaCompat version |
✔️ | ✔️ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✔️ | |
hiptest |
HIP FW test | ✔️ | ✔️ | ✔️ | |||||||
hip |
HIP version | ✔️ | ✔️ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ||
kokkostest |
Kokkos FW test | ✔️ | ✔️ | ✔️ | |||||||
kokkos |
Kokkos version | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | ✔️ | |
alpakatest |
Alpaka FW test | ✔️ | ✅ | ||||||||
alpaka |
Alpaka version | ✅ | ✅ | ||||||||
sycltest |
SYCL/oneAPI FW test | ✔️ | ✔️ | ✔️ | |||||||
stdpar |
std::execution::par version |
✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
The "Device framework" refers to a mechanism similar to cms::cuda::Product
and cms::cuda::ScopedContext
to support chains of modules to use the same device and the same work queue.
The column "Validated" means that the program produces the same histograms as the reference cuda
program within numerical precision (judged "by eye").
# Build application using all available CPUs
$ make -j`nproc` cuda
# For CUDA installations elsewhere than /usr/local/cuda
$ make -j`nproc` cuda CUDA_BASE=/path/to/cuda
# Source environment
$ source env.sh
# Process 1000 events in 1 thread
$ ./cuda
# Command line arguments
$ ./cuda -h
./cuda: [--numberOfThreads NT] [--numberOfStreams NS] [--maxEvents ME] [--data PATH] [--transfer] [--validation] [--empty]
Options
--numberOfThreads Number of threads to use (default 1)
--numberOfStreams Number of concurrent events (default 0=numberOfThreads)
--maxEvents Number of events to process (default -1 for all events in the input file)
--data Path to the 'data' directory (default 'data' in the directory of the executable)
--transfer Transfer results from GPU to CPU (default is to leave them on GPU)
--validation Run (rudimentary) validation at the end (implies --transfer)
--empty Ignore all producers (for testing only)
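As an illustration of how the options above combine, the sketch below prints (without running them) one command line per thread count, which can be handy for a simple scaling scan. The function name `scan_commands` and its argument order are assumptions of this example, not part of the project. The number of streams is passed explicitly even though it defaults to the number of threads.

```shell
# Print one command line per requested thread count.
# Usage: scan_commands PROGRAM MAX_EVENTS THREADS...
# (Hypothetical helper; it only echoes commands, it does not run them.)
scan_commands() {
  program=$1
  events=$2
  shift 2
  for nt in "$@"; do
    echo "./$program --numberOfThreads $nt --numberOfStreams $nt --maxEvents $events"
  done
}
```

For example, `scan_commands cuda 1000 1 2 4` prints three `./cuda` command lines that can be reviewed and then piped to `sh`.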
Note that the contents of `all`, `test`, and all `test_<arch>` targets are filtered based on the availability of compilers/toolchains. Essentially
- by default programs using only GCC (or "host compiler") are included
- if `CUDA_BASE` directory exists, programs using CUDA are included
- if `SYCL_BASE` directory exists, programs using SYCL are included
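The filtering rule above amounts to a directory-existence check. The sketch below mimics it in shell, assuming the rule is exactly as stated. The function name `enabled_toolchains` is hypothetical, the CUDA default is the `/usr/local/cuda` location mentioned earlier, and the SYCL default path is purely an assumption of this example. The authoritative logic lives in the top-level Makefile.

```shell
# Print which groups of programs would be kept, based only on whether
# the toolchain directories exist.
# Usage: enabled_toolchains [CUDA_BASE] [SYCL_BASE]
# (Hypothetical helper; the SYCL default below is an assumed path.)
enabled_toolchains() {
  cuda_base=${1:-/usr/local/cuda}
  sycl_base=${2:-/opt/intel/oneapi}
  echo host                                   # host-compiler programs are always included
  if [ -d "$cuda_base" ]; then echo cuda; fi  # CUDA programs need an existing CUDA_BASE
  if [ -d "$sycl_base" ]; then echo sycl; fi  # SYCL programs need an existing SYCL_BASE
}
```

The `print_targets` make target reports what the build system itself would select.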
Target | Description |
---|---|
`all` (default) | Build all programs |
`print_targets` | Print the programs that would be built with `all` |
`test` | Run all tests |
`test_cpu` | Run tests that use only CPU |
`test_nvidiagpu` | Run tests that require NVIDIA GPU |
`test_amdgpu` | Run tests that require AMD GPU |
`test_intelgpu` | Run tests that require Intel GPU |
`test_auto` | Run tests that auto-discover the available hardware |
`test_<program>` | Run tests for program `<program>` |
`test_<program>_<arch>` | Run tests for program `<program>` that require `<arch>` |
`format` | Format the code with `clang-format` |
`clean` | Remove all build artifacts |
`distclean` | `clean` and remove all externals |
`dataclean` | Remove downloaded data files |
`external_kokkos_clean` | Remove Kokkos build and installation directory |
The printouts can be disabled at compile time with
make fwtest ... USER_CXXFLAGS="-DFWTEST_SILENT"
This program is a fork of `cudacompat` with all dependencies on
CUDA removed in order to be a "pure CPU" version. Note that the name refers to
(the absence of) intra-algorithm parallelization and is thus
comparable to the Serial backend of Alpaka or Kokkos. The event-level
parallelism is implemented as in `fwtest`.
The use of the caching allocator can be disabled at compile time by setting the
`CUDATEST_DISABLE_CACHING_ALLOCATOR` preprocessor symbol:
make cudatest ... USER_CXXFLAGS="-DCUDATEST_DISABLE_CACHING_ALLOCATOR"
If the caching allocator is disabled and CUDA version 11.2 or greater is detected,
device allocations and deallocations will use the stream-ordered CUDA functions
`cudaMallocAsync` and `cudaFreeAsync`. Their use can be disabled explicitly at
compile time by also setting the `CUDATEST_DISABLE_ASYNC_ALLOCATOR`
preprocessor symbol:
make cudatest ... USER_CXXFLAGS="-DCUDATEST_DISABLE_CACHING_ALLOCATOR -DCUDATEST_DISABLE_ASYNC_ALLOCATOR"
This program is frozen to correspond to CMSSW_11_2_0_pre8_Patatrack.
The location of the CUDA 11 libraries can be set with the `CUDA_BASE` variable.
The use of the caching allocator can be disabled at compile time by setting the
`CUDA_DISABLE_CACHING_ALLOCATOR` preprocessor symbol:
make cuda ... USER_CXXFLAGS="-DCUDA_DISABLE_CACHING_ALLOCATOR"
If the caching allocator is disabled and CUDA version 11.2 or greater is detected,
device allocations and deallocations will use the stream-ordered CUDA functions
`cudaMallocAsync` and `cudaFreeAsync`. Their use can be disabled explicitly at
compile time by also setting the `CUDA_DISABLE_ASYNC_ALLOCATOR`
preprocessor symbol:
make cuda ... USER_CXXFLAGS="-DCUDA_DISABLE_CACHING_ALLOCATOR -DCUDA_DISABLE_ASYNC_ALLOCATOR"
This program corresponds to the updated version of the pixel tracking software integrated in CMSSW_12_0_0_pre3.
The use of the caching allocator can be disabled at compile time by setting the
`CUDADEV_DISABLE_CACHING_ALLOCATOR` preprocessor symbol:
make cudadev ... USER_CXXFLAGS="-DCUDADEV_DISABLE_CACHING_ALLOCATOR"
If the caching allocator is disabled and CUDA version 11.2 or greater is detected,
device allocations and deallocations will use the stream-ordered CUDA functions
`cudaMallocAsync` and `cudaFreeAsync`. Their use can be disabled explicitly at
compile time by also setting the `CUDADEV_DISABLE_ASYNC_ALLOCATOR`
preprocessor symbol:
make cudadev ... USER_CXXFLAGS="-DCUDADEV_DISABLE_CACHING_ALLOCATOR -DCUDADEV_DISABLE_ASYNC_ALLOCATOR"
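The allocator macros follow the same `<PROGRAM>_DISABLE_..._ALLOCATOR` naming pattern for `cudatest`, `cuda` and `cudadev`, so the flag string can be assembled mechanically. The sketch below does that. The function name `allocator_cxxflags` and the mode names `caching`, `async` and `plain` are made up for this example and are not project terminology.

```shell
# Assemble USER_CXXFLAGS for a program's device allocator choice.
#   caching : default build, no extra flags
#   async   : caching allocator disabled (stream-ordered functions are
#             then used if CUDA 11.2 or greater is detected)
#   plain   : caching and stream-ordered allocators both disabled
# (Hypothetical helper; the mode names are assumptions of this sketch.)
allocator_cxxflags() {
  prefix=$(printf '%s' "$1" | tr '[:lower:]' '[:upper:]')
  case "$2" in
    caching) echo "" ;;
    async)   echo "-D${prefix}_DISABLE_CACHING_ALLOCATOR" ;;
    plain)   echo "-D${prefix}_DISABLE_CACHING_ALLOCATOR -D${prefix}_DISABLE_ASYNC_ALLOCATOR" ;;
    *)       echo "unknown mode: $2" >&2; return 1 ;;
  esac
}
```

A build could then look like `make cudadev USER_CXXFLAGS="$(allocator_cxxflags cudadev async)"`.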
The purpose of this program is to test the performance of the CUDA
managed memory. There are various macros that can be used to switch on
and off various behaviors. The default behavior is to use managed
memory only for those memory blocks that are used for memory
transfers, call `cudaMemPrefetchAsync()`, and
`cudaMemAdvise(cudaMemAdviseSetReadMostly)`. The macros can be set at
compile time along the lines of
make cudauvm ... USER_CXXFLAGS="-DCUDAUVM_DISABLE_ADVISE"
Macro | Effect |
---|---|
`-DCUDAUVM_DISABLE_ADVISE` | Disable `cudaMemAdvise(cudaMemAdviseSetReadMostly)` |
`-DCUDAUVM_DISABLE_PREFETCH` | Disable `cudaMemPrefetchAsync` |
`-DCUDAUVM_DISABLE_CACHING_ALLOCATOR` | Disable caching allocator |
`-DCUDAUVM_MANAGED_TEMPORARY` | Use managed memory also for temporary data structures |
`-DCUDAUVM_DISABLE_MANAGED_BEAMSPOT` | Disable managed memory in `BeamSpotToCUDA` |
`-DCUDAUVM_DISABLE_MANAGED_CLUSTERING` | Disable managed memory in `SiPixelRawToClusterCUDA` |
`-DCUDAUVM_DISABLE_MANAGED_RECHIT` | Disable managed memory in `SiPixelRecHitCUDA` |
`-DCUDAUVM_DISABLE_MANAGED_TRACK` | Disable managed memory in `CAHitNtupletCUDA` |
`-DCUDAUVM_DISABLE_MANAGED_VERTEX` | Disable managed memory in `PixelVertexProducerCUDA` |
To use managed memory also for temporary device-only allocations, compile with
make cudauvm ... USER_CXXFLAGS="-DCUDAUVM_MANAGED_TEMPORARY"
This program is a fork of `cuda` that extends the use of `cudaCompat`
to clustering and RecHits. The aim is to run the same code on CPU. Currently, however, the program requires a GPU because it (still) uses pinned host memory in a few places. In the future the program could be extended to provide both CUDA and CPU flavors.
The program contains the changes from following external PRs on top of cuda
The path to ROCm can be set with the `ROCM_BASE` variable.
# If nvcc is not in $PATH, create environment file and source it
$ make environment [CUDA_BASE=...]
$ source env.sh
# Actual build command
$ make -j N kokkos [CUDA_BASE=...] [KOKKOS_CUDA_ARCH=...] [...]
$ ./kokkos --cuda
# If changing KOKKOS_HOST_PARALLEL or KOKKOS_DEVICE_PARALLEL, clean up existing build first
$ make clean external_kokkos_clean
$ make kokkos ...
- Note that if `CUDA_BASE` needs to be set, it needs to be set for both `make` commands.
- The target CUDA architecture needs to be set explicitly with `KOKKOS_CUDA_ARCH` (see table below)
- The CMake executable can be set with `CMAKE` in case the default one is too old.
- The backends to be used in the Kokkos runtime library build are set with `KOKKOS_HOST_PARALLEL` and `KOKKOS_DEVICE_PARALLEL` (see table below)
  - The Serial backend is always enabled
- When running, the backend(s) need to be set explicitly via command line parameters
  - `--serial` for CPU serial backend
  - `--pthread` for CPU pthread backend
  - `--cuda` for CUDA backend
  - `--hip` for HIP backend
- Use of multiple threads (`--numberOfThreads`) has not been tested and likely does not work correctly. Concurrent events (`--numberOfStreams`) works.
- Support for HIP backend is still work in progress
  - `kokkostest` runs
  - `kokkos` fails at run time inside the "Pixel tracking"
  - Target AMD GPU architecture needs to be set explicitly with `KOKKOS_HIP_ARCH` (see table below)
Make variable | Description |
---|---|
`CMAKE` | Path to CMake executable (by default assume `cmake` is found in `$PATH`) |
`KOKKOS_HOST_PARALLEL` | Host-parallel backend (default empty, possible values: empty, `PTHREAD`) |
`KOKKOS_DEVICE_PARALLEL` | Device-parallel backend (default `CUDA`, possible values: empty, `CUDA`, `HIP`) |
`CUDA_BASE` | Path to CUDA installation. Relevant only if `KOKKOS_DEVICE_PARALLEL=CUDA`. |
`KOKKOS_CUDA_ARCH` | Target CUDA architecture for Kokkos build (default: `70`, possible values: `50`, `70`, `75`; trivial to extend). Relevant only if `KOKKOS_DEVICE_PARALLEL=CUDA`. |
`ROCM_BASE` | Path to ROCm installation. Relevant only if `KOKKOS_DEVICE_PARALLEL=HIP`. |
`KOKKOS_HIP_ARCH` | Target AMD GPU architecture for Kokkos build (default: `VEGA900`, possible values: `VEGA900`, `VEGA909`; trivial to extend). Relevant only if `KOKKOS_DEVICE_PARALLEL=HIP`. |
`KOKKOS_KOKKOS_PTHREAD_DISABLE_HWLOC` | If defined, do not use hwloc. Relevant only if `KOKKOS_HOST_PARALLEL=PTHREAD`. |

Macro | Effect |
---|---|
`-DKOKKOS_SERIALONLY_DISABLE_ATOMICS` | Disable Kokkos (real) atomics, can be used with Serial-only build |
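Because the backends are chosen at build time with make variables but must be requested again at run time with command line flags, it can help to derive one from the other. The sketch below lists the run-time flags that a given build makes available, based only on the notes and tables above. The function name `kokkos_runtime_flags` is hypothetical.

```shell
# List the run-time backend flags made available by a given build.
# Usage: kokkos_runtime_flags KOKKOS_HOST_PARALLEL KOKKOS_DEVICE_PARALLEL
# Pass "" for an empty (disabled) setting.
# (Hypothetical helper, not part of the project.)
kokkos_runtime_flags() {
  host_parallel=$1
  device_parallel=$2
  flags="--serial"                 # the Serial backend is always enabled
  case "$host_parallel" in
    PTHREAD) flags="$flags --pthread" ;;
  esac
  case "$device_parallel" in
    CUDA) flags="$flags --cuda" ;;
    HIP)  flags="$flags --hip" ;;
  esac
  echo "$flags"
}
```

For the default build (`KOKKOS_HOST_PARALLEL` empty, `KOKKOS_DEVICE_PARALLEL=CUDA`) this prints `--serial --cuda`, from which the desired backend flag(s) can be picked when running `./kokkos`.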
The `alpaka` code base is loosely based on the `cuda` code base, with some minor changes introduced during the porting.
The `alpaka` and `alpakatest` programs always support the CPU backends (serial synchronous and oneTBB asynchronous).
They can be built with either the CUDA backend or the HIP/ROCm backend, with
make alpaka ... CUDA_BASE=path_to_cuda ROCM_BASE=
or
make alpaka ... CUDA_BASE= ROCM_BASE=path_to_rocm
Due to conflicting symbols in the two backends and in Alpaka itself, enabling both backends at the same time results in compilation errors or undefined behaviour.
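Since exactly one of `CUDA_BASE` and `ROCM_BASE` should be non-empty, the pair of make variables can be generated from a single choice so that the two backends are never enabled together by accident. This is a minimal sketch. The function name `alpaka_backend_vars` and the keywords `cuda` and `hip` are assumptions of the example.

```shell
# Print the make variables selecting exactly one Alpaka GPU backend.
# Usage: alpaka_backend_vars cuda /path/to/cuda
#        alpaka_backend_vars hip  /path/to/rocm
# (Hypothetical helper, not part of the project.)
alpaka_backend_vars() {
  case "$1" in
    cuda) echo "CUDA_BASE=$2 ROCM_BASE=" ;;   # CUDA on, ROCm explicitly emptied
    hip)  echo "CUDA_BASE= ROCM_BASE=$2" ;;   # ROCm on, CUDA explicitly emptied
    *)    echo "backend must be 'cuda' or 'hip'" >&2; return 1 ;;
  esac
}
```

It could be used as `make alpaka $(alpaka_backend_vars hip /path/to/rocm)`, relying on word splitting of the unquoted substitution, so the path must not contain spaces.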
The use of the caching allocator can be disabled at compile time by setting the
`ALPAKA_DISABLE_CACHING_ALLOCATOR` preprocessor symbol:
make alpaka ... USER_CXXFLAGS="-DALPAKA_DISABLE_CACHING_ALLOCATOR"
If the caching allocator is disabled and CUDA version 11.2 or greater is detected,
device allocations and deallocations will use the stream-ordered CUDA functions
`cudaMallocAsync` and `cudaFreeAsync`. Their use can be disabled explicitly at
compile time by also setting the `ALPAKA_DISABLE_ASYNC_ALLOCATOR`
preprocessor symbol:
make alpaka ... USER_CXXFLAGS="-DALPAKA_DISABLE_CACHING_ALLOCATOR -DALPAKA_DISABLE_ASYNC_ALLOCATOR"
The `stdpar` program is cloned from `cudauvm` and is currently intended
to experiment with the use of NVIDIA's implementation of
`std::execution::par` with `nvc++` and in conjunction with direct CUDA code.

The `stdpar` implementation requires a C++20 implementation of the C++ standard library (`atomic_ref`, ranges).
It has only been tested with the GCC 11.2.0 implementation, `libstdc++`.

As it is work in progress and contains CUDA kernels, it currently only supports `nvc++`. Other compilers will
eventually be supported once the kernels have been ported to their `stdpar` equivalent.

The `stdpar` implementation only supports a single GPU. A multi-GPU implementation would require either multiple processes
or using vendor-specific APIs.
The project is split into several programs, one (or more) for each
test case. Each test case has its own directory under the `src`
directory. A test case contains the full application: framework, data
formats, device tooling, plugins for the algorithmic modules run
by the framework, and the executable.
Each test program is structured as follows within `src/<program name>` (examples point to `cuda`)
- `Makefile` that defines the actual build rules for the program
- `Makefile.deps` that declares the external dependencies of the program, and the dependencies between shared objects within the program
- `plugins.txt` contains a simple mapping from module names to the plugin shared object names
  - In CMSSW such information is generated automatically by `scram`, in this project the original author was too lazy to automate that
- `bin/` directory that contains all the framework code for the executable binary. These files should not need to be modified, except `main.cc` for changing the set of modules to run, and possibly more command line options
- `plugin-<PluginName>/` directories contain the source code for plugins. The `<PluginName>` part specifies the name of the plugin, and the resulting shared object file is `plugin<PluginName>.so`. Note that no other library or plugin may depend on a plugin (either at link time or even through `#include`ing a header). The plugins may only be loaded through the names of the modules by the `PluginManager`.
- `<LibraryName>/`: the remaining directories are for libraries. The `<LibraryName>` specifies the name of the library, and the resulting shared object file is `lib<LibraryName>.so`. Other libraries or plugins may depend on a library, in which case the dependence must be declared in `Makefile.deps`.
  - `CondFormats/`:
  - `CUDADataFormats/`: CUDA-specific data structures that can be passed from one module to another via the `edm::Event`. A given portability technology likely needs its own data format directory, the `CUDADataFormats` can be used as an example.
  - `CUDACore/`: Various tools for CUDA. A given portability technology likely needs its own tool directory, the `CUDACore` can be used as an example.
  - `DataFormats/`: mainly CPU-side data structures that can be passed from one module to another via the `edm::Event`. Some of these are produced by the `edm::Source` by reading the binary dumps. These files should not need to be modified. New classes may be added, but they should be independent of the portability technology.
  - `Framework/`: crude approximation of the CMSSW framework. Utilizes TBB tasks to orchestrate the processing of the events by the modules. These files should not need to be modified.
  - `Geometry/`: geometry information, essentially a handful of compile-time constants. May be modified.
For more detailed description of the application structure (mostly plugins) see CodeStructure.md
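The directory-to-shared-object naming convention described above can be written down as a small function. This is only an illustration of the stated rules (`plugin-<PluginName>/` gives `plugin<PluginName>.so`, any other library directory gives `lib<LibraryName>.so`). The function name `shared_object_name` is hypothetical and the real rules are defined by the per-program `Makefile`.

```shell
# Derive the shared object name from a directory name under src/<program>/.
#   plugin-<PluginName>/ -> plugin<PluginName>.so
#   <LibraryName>/       -> lib<LibraryName>.so
# (Hypothetical helper, not part of the project.)
shared_object_name() {
  dir=${1%/}                      # tolerate a trailing slash
  case "$dir" in
    bin) echo "" ;;               # bin/ holds the executable, not a shared object
    plugin-*) echo "plugin${dir#plugin-}.so" ;;
    *) echo "lib${dir}.so" ;;
  esac
}
```

For example, `shared_object_name CUDACore/` prints `libCUDACore.so`.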
The build system is based on pure GNU Make. There are two levels of Makefiles. The top-level Makefile handles the building of the entire project: it defines general build flags, paths to external dependencies in the system, recipes to download and build the externals, and targets for the test programs.
For more information see BuildSystem.md.
Given that the approach of this project is to maintain many programs
in a single branch, in order to keep the commit history readable, each
commit should contain changes only for one test program, and the short
commit message should start with the program name, e.g. `[cuda]`. A
pull request may touch many test programs. General commits (e.g.
top-level Makefile or documentation) can be left without such a prefix.
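The prefix convention lends itself to simple tooling, for example when grouping a pull request's commits by program. The sketch below extracts the program name from a commit subject. The function name `commit_program` is hypothetical, and nothing like it is claimed to exist in the repository.

```shell
# Print the program named in a "[program] ..." commit subject,
# or nothing for a general commit without a prefix.
# (Hypothetical helper, not part of the project.)
commit_program() {
  subject=$1
  case "$subject" in
    "["*"]"*)
      rest=${subject#\[}          # drop the leading "["
      echo "${rest%%\]*}" ;;      # keep the text before the first "]"
    *) echo "" ;;
  esac
}
```

It could be combined with `git log --format=%s` to list which programs a branch touches.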
When starting work for a new portability technology, the first steps
are to figure out the installation of the necessary external software
packages and the build rules (both can be adjusted later). It is
probably best to start by cloning the `fwtest` code for the new
program (e.g. `footest` for a technology `foo`), adjust the test
modules to exercise the API of the technology (see `cudatest` for
examples), and start crafting the tools package (`CUDACore` in
`cuda`).
Pull requests are expected to build (`make all` succeeds) and pass
tests (`make test`). Programs that have build errors should primarily be
filtered out from `$(TARGETS)`, and failing tests should primarily be
removed from the set of tests run by default. Breakages can, however,
be accepted for short periods of time with a good justification.
The code is formatted with `clang-format` version 10.