\section{Preparing Code for GPU Porting}\label{sec:porting_code}
\subsection{Porting from CPU to GPU}
\par
When porting code to take advantage of the parallel processing capability of GPUs, several steps need to be followed and some additional work is required before writing actual parallel code to be executed on the GPUs:
\begin{itemize}
\item~\textbf{Identify Targeted Parts}: Begin by identifying the parts of the code that contribute most to the execution time. These are often computationally intensive sections such as loops or matrix operations. A common rule of thumb, in the spirit of the Pareto principle, is that roughly 10\% of the code accounts for 90\% of the execution time.
\item~\textbf{Equivalent GPU Libraries}: If the original code uses CPU libraries such as BLAS, FFT, etc., it is crucial to identify the equivalent GPU libraries. For example, cuBLAS or hipBLAS can replace CPU-based BLAS libraries. Relying on GPU-specific libraries ensures that the GPU is used efficiently.
\item~\textbf{Refactor Loops}: When porting loops directly to GPUs, some refactoring is necessary to suit the GPU architecture. This typically involves splitting the loop into multiple steps or modifying operations to exploit the independence between iterations and improve memory access patterns. Each step of the original loop can be mapped to a kernel, executed by multiple GPU threads, with each thread corresponding to an iteration (see the sketch after this list).
\item~\textbf{Memory Access Optimization}: Consider the memory access patterns in the code. GPUs perform best when memory access is coalesced and aligned. Minimizing global memory accesses and maximizing utilization of shared memory or registers can significantly enhance performance. Review the code to ensure optimal memory access for GPU execution.
\end{itemize}
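\par
As a minimal illustration of the loop-refactoring step, the sketch below shows how a CPU loop with independent iterations maps to a GPU kernel, with one thread per iteration. The example is written in CUDA; the kernel name, array size, and launch configuration are arbitrary illustrative choices.
\begin{lstlisting}[language=c++, caption={A sketch of mapping an independent CPU loop to a CUDA kernel.}, label={lst:11_loop_to_kernel_sketch}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]
// Original CPU loop with independent iterations:
//   for (int i = 0; i < n; ++i) y[i] = a * x[i] + y[i];

// GPU version: each thread executes one iteration of the loop.
__global__ void axpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may contain excess threads
        y[i] = a * x[i] + y[i];
}

// Host-side launch: round the grid size up so that all n iterations
// are covered, e.g. axpy<<<(n + 255) / 256, 256>>>(n, a, d_x, d_y);
\end{lstlisting}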
\par
Inspect the Fortran code in Listing~\ref{lst:11_fortran_code_discussion} (if you do not read Fortran: do-loops are the equivalent of for-loops). How would this code be ported from CPU to GPU?
\lstinputlisting[language=fortran, caption={How to port the code from CPU to GPU?}, label={lst:11_fortran_code_discussion}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_fortran_code_discussion.f}
\par
Some observations at first glance:
\begin{itemize}
\item the code has to be split into three kernels. Why is this necessary?
\item check if there are any variables that could lead to false dependencies between iterations, like the index $k2$;
\item is it efficient for GPUs to split the work over the index $i$? What about the memory access? Note that the arrays are 2D and that Fortran stores arrays in column-major order;
\item is it possible to collapse some loops? Combining nested loops can reduce overhead and improve memory access patterns, leading to better GPU performance;
\item what is the best memory access pattern on a GPU? Review the memory access patterns in the code: minimize global memory accesses by utilizing shared memory or registers where appropriate, and ensure that memory access is coalesced and aligned to maximize GPU memory throughput (see the sketch after this list).
\end{itemize}
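\par
To make the memory-access bullets concrete, here is a sketch of coalesced versus strided access to a 2D array. The sketch uses the row-major layout of C/C++; in Fortran the arrays are column-major, so the fast index is the first one and the roles below are reversed. The kernel names and thread-block mapping are illustrative choices.
\begin{lstlisting}[language=c++, caption={Coalesced versus strided access to a 2D array (illustrative).}, label={lst:11_coalesced_sketch}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]
// Coalesced: consecutive threads (threadIdx.x) touch consecutive
// addresses, so each warp needs only a few wide memory transactions.
__global__ void scale_coalesced(float *a, int nrows, int ncols, float s)
{
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // fast index
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < nrows && col < ncols)
        a[row * ncols + col] *= s;
}

// Strided: consecutive threads are a whole row apart in memory,
// so the accesses of a warp are scattered and throughput drops.
__global__ void scale_strided(float *a, int nrows, int ncols, float s)
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;   // slow index
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < nrows && col < ncols)
        a[row * ncols + col] *= s;
}
\end{lstlisting}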
\par
Key points of this subsection:
\begin{itemize}
\item Identify equivalent GPU libraries for CPU-based libraries and utilize them to ensure efficient use of the GPU;
\item Importance of identifying the computationally intensive parts of the code that contribute significantly to the execution time;
\item The need to refactor loops to suit the GPU architecture;
\item Significance of memory access optimization for efficient GPU execution, including coalesced and aligned memory access patterns.
\end{itemize}
% ---------------------------------------------------------------------- %
\subsection{Porting between different GPU frameworks}
\par
You might also find yourself in a situation where you need to port a code from one particular GPU framework to another.
This section gives an overview of different tools that enable converting CUDA and OpenACC codes to HIP and OpenMP, respectively.
This conversion process enables an application to target various GPU architectures, specifically NVIDIA and AMD GPUs.
Here we focus on~\textbf{hipify}~\cite{hipify} and~\textbf{clacc}~\cite{clacc} tools.
This guide is adapted from the~\href{https://documentation.sigma2.no/code_development/guides/cuda_translating-tools.html}{NRIS documentation}.
\subsubsection{Translating CUDA to HIP with Hipify}
\par
In this section, we cover the use of~\textbf{hipify-perl} and~\textbf{hipify-clang} tools to translate a CUDA code to HIP.
\paragraph{Hipify-perl}
\par
The~\textbf{hipify-perl} tool is a Perl-based script that translates CUDA syntax into HIP syntax~\cite{hipify}.
For instance, in a CUDA code that incorporates the CUDA functions~\textbf{\textcolor{red}{cudaMalloc}} and~\textbf{\textcolor{red}{cudaDeviceSynchronize}}, the tool will substitute~\textbf{\textcolor{red}{cudaMalloc}} with the HIP function~\textbf{\textcolor{red}{hipMalloc}}.
Similarly, the CUDA function~\textbf{\textcolor{red}{cudaDeviceSynchronize}} will be substituted with the HIP function~\textbf{\textcolor{red}{hipDeviceSynchronize}}.
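\par
To make the substitution concrete, the minimal sketch below shows a complete CUDA program; the comments indicate the HIP calls that such a translation produces for each CUDA call (the kernel and sizes are illustrative).
\begin{lstlisting}[language=c++, caption={A CUDA program and, in comments, its HIP translation.}, label={lst:11_hipify_sketch}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]
#include <cuda_runtime.h>    // -> #include <hip/hip_runtime.h>

__global__ void scale(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= 2.0f;
}

int main()
{
    const int n = 1024;
    float h_x[n] = {0.0f};
    float *d_x;
    cudaMalloc(&d_x, n * sizeof(float));      // -> hipMalloc(...)
    cudaMemcpy(d_x, h_x, n * sizeof(float),
               cudaMemcpyHostToDevice);       // -> hipMemcpy(..., hipMemcpyHostToDevice)
    scale<<<(n + 255) / 256, 256>>>(d_x, n);  // launch syntax is unchanged
    cudaDeviceSynchronize();                  // -> hipDeviceSynchronize()
    cudaFree(d_x);                            // -> hipFree(d_x)
    return 0;
}
\end{lstlisting}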
Below we list the basic steps to run~\textbf{hipify-perl} on LUMI-G.
\begin{itemize}
\item Step 1: Generating~\textbf{hipify-perl} script
\lstinputlisting[language=bash, firstline=4, lastline=5, label={lst:11_hipify_perl_clang_exercise_lumig}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_hipify_perl_clang_exercise_lumig.pbs}
\item Step 2: Running the generated~\textbf{hipify-perl}
\lstinputlisting[language=bash, firstline=7, lastline=7, label={lst:11_hipify_perl_clang_exercise_lumig1}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_hipify_perl_clang_exercise_lumig.pbs}
\item Step 3: Compiling with~\textbf{hipcc} the generated HIP code
\lstinputlisting[language=bash, firstline=9, lastline=9, label={lst:11_hipify_perl_clang_exercise_lumig2}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_hipify_perl_clang_exercise_lumig.pbs}
\end{itemize}
\par
Despite its ease of use, the~\textbf{hipify-perl} tool might not be suitable for large applications, as it relies heavily on substituting CUDA strings with HIP strings ($e.g.$, it substitutes~\textbf{\textcolor{red}{*cuda*}} with~\textbf{\textcolor{red}{*hip*}}).
In addition,~\textbf{hipify-perl} lacks the ability to distinguish device/host function calls~\cite{hipify}.
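\par
As a hedged illustration of how pure string substitution can misfire, consider a user-defined identifier that merely contains the string~\textbf{\textcolor{red}{cuda}}; the function name below is hypothetical.
\begin{lstlisting}[language=c++, caption={A hypothetical identifier that textual substitution may mangle.}, label={lst:11_hipify_pitfall_sketch}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]
// Not a CUDA API call, but the name contains "cuda", so a purely
// textual cuda->hip substitution may rename it and break the build:
void my_cuda_helper(float *x, int n);   // could become my_hip_helper(...)
\end{lstlisting}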
The alternative is the~\textbf{hipify-clang} tool, which is described in the next section.
\paragraph{Hipify-clang}
\par
As described in the HIPIFY documentation~\cite{hipify}, the~\textbf{hipify-clang} tool translates CUDA sources into HIP sources using clang.
Because it parses the code rather than substituting strings, it is more robust for translating CUDA codes than the~\textbf{hipify-perl} tool.
Furthermore, it facilitates the analysis of the code during translation.
\par
In short,~\textbf{hipify-clang} requires~\textbf{LLVM+CLANG} and~\textbf{CUDA}.
Details about building~\textbf{hipify-clang} can be found at the HIPIFY documentation~\cite{hipify}.
Note that~\textbf{hipify-clang} is available on LUMI-G.
The remaining difficulty is the installation of the CUDA toolkit, which the tool requires.
To avoid any issues with the installation procedure, we opt for a CUDA singularity container.
Herein we present a step-by-step guide for running~\textbf{hipify-clang}.
\begin{itemize}
\item Step 1: Pulling a CUDA singularity container
\lstinputlisting[language=bash, firstline=15, lastline=15, label={lst:11_hipify_perl_clang_exercise_lumig3}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_hipify_perl_clang_exercise_lumig.pbs}
\item Step 2: Loading a rocm module and launching the CUDA singularity using the commands below, where the current directory~\textbf{\textcolor{brown}{\$PWD}} in the host is mounted to that of the container, and the directory~\textbf{\textcolor{brown}{/opt}} in the host is mounted to that inside the container.
\lstinputlisting[language=bash, firstline=17, lastline=18, label={lst:11_hipify_perl_clang_exercise_lumig4}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_hipify_perl_clang_exercise_lumig.pbs}
\item Step 3: Setting the environment variable~\textbf{\textcolor{brown}{\$PATH}}. In order to run~\textbf{hipify-clang} from inside the container, one can set the environment variable~\textbf{\textcolor{brown}{\$PATH}} to include the path of the binary~\textbf{hipify-clang}. Note that the ROCm version used here is~\textbf{rocm-5.2.3}.
\lstinputlisting[language=bash, firstline=20, lastline=20, label={lst:11_hipify_perl_clang_exercise_lumig5}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_hipify_perl_clang_exercise_lumig.pbs}
\item Step 4: Running~\textbf{hipify-clang} from inside the singularity container. Here the CUDA path and the paths to the~\textbf{\textcolor{red}{*includes*}} and~\textbf{\textcolor{red}{*defines*}} files should be specified. The CUDA source code and the generated output code are~\textbf{program.cu} and~\textbf{hip\_program.cu.hip}, respectively. The compilation of the generated HIP code follows the same syntax as described in the~\textbf{hipify-perl} section.
\lstinputlisting[language=bash, firstline=22, lastline=22, label={lst:11_hipify_perl_clang_exercise_lumig6}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_hipify_perl_clang_exercise_lumig.pbs}
\end{itemize}
\par
Code examples for the~\textbf{hipify} exercises can be accessed in the~\textbf{\textcolor{brown}{content/examples/exercise\_hipify}} subdirectory by cloning this repository.
\lstinputlisting[language=bash, firstline=28, lastline=29, label={lst:11_hipify_perl_clang_exercise_lumig7}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_hipify_perl_clang_exercise_lumig.pbs}
\par
Following the steps mentioned above, there are exercises to translate CUDA codes to HIP using~\textbf{hipify-perl} and~\textbf{hipify-clang}.
\begin{itemize}
\item Exercise I: Translate a CUDA code to HIP with~\textbf{hipify-perl}
\begin{itemize}
\item 1.1 Generate the~\textbf{hipify-perl} tool.
\item 1.2 Convert the CUDA code~\textbf{\textcolor{brown}{vec\_add\_cuda.cu}} located in~\textbf{\textcolor{brown}{/exercise\_hipify/Hipify\_perl}} to HIP with the~\textbf{hipify-perl} tool.
\item 1.3 Compile the generated HIP code with the~\textbf{hipcc} compiler wrapper and run it.
\end{itemize}
\item Exercise II: Translate a CUDA code to HIP with~\textbf{hipify-clang}
\begin{itemize}
\item 2.1 Convert the CUDA code~\textbf{\textcolor{brown}{vec\_add\_cuda.cu}} located in~\textbf{\textcolor{brown}{/exercise\_hipify/Hipify\_clang}} to HIP with the~\textbf{hipify-clang} tool.
\item 2.2 Compile the generated HIP code with the~\textbf{hipcc} compiler wrapper and run it.
\end{itemize}
\end{itemize}
\subsubsection{Translating OpenACC to OpenMP with Clacc}
\par
\textbf{Clacc}~\cite{clacc} is a tool to translate an OpenACC application to OpenMP offloading with the Clang/LLVM compiler environment.
Note that the tool is specific to OpenACC C, while OpenACC Fortran is already supported on AMD GPUs.
As indicated in the GitHub repository~\cite{llvm_project}, the~\textbf{clacc} compiler is the~\textbf{Clang} executable located in the~\textbf{\textcolor{brown}{/bin}} subdirectory of the~\textbf{\textcolor{brown}{/install}} directory, as described below.
In the following part, we present a step-by-step guide for building and using~\textbf{clacc}.
\begin{itemize}
\item Step 1: Building and installing~\textbf{clacc}.
\lstinputlisting[language=bash, firstline=4, lastline=16, label={lst:11_clacc_exercise_lumig}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_clacc_exercise_lumig.pbs}
\item Step 2: Setting up environment variables to be able to work from the~\textbf{\textcolor{brown}{/install}} directory, which is the simplest way. We assume that the~\textbf{\textcolor{brown}{/install}} directory is located in the path~\textbf{\textcolor{brown}{/project/project\_xxxxxx/Clacc/llvm-project}}. For more advanced usage of~\textbf{clacc}, we refer readers to the~\href{https://github.com/llvm-doe-org/llvm-project/blob/clacc/main/README.md}{Usage from Build directory}.
\lstinputlisting[language=bash, firstline=23, lastline=24, label={lst:11_clacc_exercise_lumig1}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_clacc_exercise_lumig.pbs}
\item Step 3: Source-to-source conversion of the~\textbf{\textcolor{brown}{openACC\_code.c}} code, with the result written to the file~\textbf{\textcolor{brown}{openMP\_code.c}} (an illustrative sketch of such a translation follows this list). Here the flag~\textbf{\textcolor{brown}{-fopenacc-structured-ref-count-omp=no-ompx-hold}} is introduced to disable the~\textbf{\textcolor{brown}{ompx\_hold}} map type modifier, which is used by the translation of the OpenACC copy clause. \textbf{\textcolor{brown}{ompx\_hold}} is an OpenMP extension that might not yet be supported by other compilers.
\lstinputlisting[language=bash, firstline=31, lastline=31, label={lst:11_clacc_exercise_lumig2}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_clacc_exercise_lumig.pbs}
\item Step 4: Compiling the code with the~\href{https://docs.lumi-supercomputer.eu/development/compiling/prgenv/}{cc compiler wrapper}.
\lstinputlisting[language=bash, firstline=38, lastline=43, label={lst:11_clacc_exercise_lumig3}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_clacc_exercise_lumig.pbs}
\end{itemize}
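\par
To give a feel for the kind of translation involved, the sketch below shows a simple OpenACC C loop together with an OpenMP offloading version of the same loop. This is an illustrative hand-written mapping; the directives actually emitted by~\textbf{clacc} may differ in detail.
\begin{lstlisting}[language=c++, caption={An OpenACC loop and an equivalent OpenMP offloading version (illustrative).}, label={lst:11_acc_to_omp_sketch}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]
// OpenACC version: offload the loop and copy x to and from the device.
void scale_acc(float *x, int n, float s)
{
    #pragma acc parallel loop copy(x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= s;
}

// OpenMP offloading version of the same loop.
void scale_omp(float *x, int n, float s)
{
    #pragma omp target teams distribute parallel for map(tofrom: x[0:n])
    for (int i = 0; i < n; ++i)
        x[i] *= s;
}
\end{lstlisting}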
\par
Code examples for the~\textbf{clacc} exercise can be accessed in the~\textbf{\textcolor{brown}{content/examples/exercise\_clacc}} subdirectory by cloning this repository.
\lstinputlisting[language=bash, firstline=50, lastline=51, label={lst:11_clacc_exercise_lumig4}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_clacc_exercise_lumig.pbs}
\par
Here is an additional exercise to translate an OpenACC code to OpenMP.
\begin{itemize}
\item Convert the OpenACC code~\textbf{\textcolor{brown}{openACC\_code.c}} located in~\textbf{\textcolor{brown}{/exercise\_clacc}} with the~\textbf{clacc} compiler.
\item Compile the generated OpenMP code with the~\textbf{cc} compiler wrapper and run it.
\end{itemize}
\subsubsection{Translating CUDA to SYCL/DPC++ with SYCLomatic}
\par
Intel offers a tool for CUDA-to-SYCL code migration, included in the Intel oneAPI Base Toolkit~\cite{intel_oneapi_base_toolkit}.
Currently, it is not installed on LUMI, but the general workflow is similar to that of~\textbf{hipify-clang}, and it likewise requires an existing CUDA installation.
\lstinputlisting[language=bash, firstline=1, lastline=3, label={lst:11_sycl_lumig}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]{code_examples/11_sycl_lumig.pbs}
\par
\textbf{SYCLomatic} can migrate larger projects by using the~\textbf{\textcolor{brown}{-in-root}} and~\textbf{\textcolor{brown}{-out-root}} flags to process directories recursively.
It can also use a compilation database (supported by CMake and other build systems) to deal with more complex project layouts.
\par
Please note that the code generated by SYCLomatic relies on oneAPI-specific extensions and thus cannot be directly used with other SYCL implementations, such as hipSYCL.
The~\textbf{\textcolor{brown}{--no-incremental-migration}} flag can be added to the~\textbf{\textcolor{brown}{dpct}} command to minimize, but not completely avoid, the use of this compatibility layer.
Avoiding it entirely would require manual effort, since some CUDA concepts cannot be directly mapped to SYCL.
\par
Additionally, CUDA applications might assume certain hardware behavior, such as 32-wide warps.
If the target hardware is different ($e.g.$, the AMD MI250 GPUs used in LUMI have a warp size of 64), the algorithms might need to be adjusted manually.
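\par
One way to keep such assumptions visible is to query the warp size at run time instead of hard-coding it. Below is a minimal CUDA sketch; HIP offers the analogous query through~\textbf{hipGetDeviceProperties}.
\begin{lstlisting}[language=c++, caption={Querying the warp size instead of assuming 32 (illustrative).}, label={lst:11_warp_size_sketch}, xleftmargin=0.05\textwidth, xrightmargin=0.05\textwidth]
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Query device 0 instead of assuming a warp size of 32; the same
    // query on an AMD MI250 GPU (via HIP) reports 64.
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("warp size: %d\n", prop.warpSize);
    return 0;
}
\end{lstlisting}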
\subsubsection{Conclusion}
\par
This concludes a brief overview of the usage of available tools to convert CUDA codes to HIP and SYCL, and OpenACC codes to OpenMP offloading.
In general, the translation process for large applications might be incomplete and thus require manual modifications to complete the porting.
It is also worth noting that an accurate translation requires the application to be written correctly according to the CUDA and OpenACC syntax.
\par
More reading materials related to the topics discussed above are available at:
\begin{itemize}
\item~\href{https://github.com/ROCm-Developer-Tools/HIPIFY}{Hipify GitHub}
\item~\href{https://rocm.docs.amd.com/projects/HIPIFY/en/latest/index.html}{HIPIFY Documentation}
\item~\href{https://github.com/olcf-tutorials/simple_HIP_examples/tree/master/vector_addition}{HIP Examples}
\item~\href{https://www.admin-magazine.com/HPC/Articles/Porting-CUDA-to-HIP}{Porting CUDA to HIP}
\item~\href{https://github.com/llvm-doe-org/llvm-project/blob/clacc/main/README.md}{Clacc Repository}
\item~\href{https://www.intel.com/content/www/us/en/developer/articles/technical/syclomatic-new-cuda-to-sycl-code-migration-tool.html}{SYCLomatic Project}
\item~\href{https://oneapi-src.github.io/SYCLomatic/get_started/index.html}{SYCLomatic Documentation}
\end{itemize}