This repository is used to participate in a competition organized by Intel®.
Dependencies: the Eigen library (libeigen / eigen · GitLab), and the SYCL library and icpx compiler from the Intel® oneAPI Base Toolkit (Essential oneAPI Tools & Libraries).
The computation grid needs to be generated (for example with gen_gird.cpp) and written to X.txt and Y.txt.
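For reference, here is a minimal sketch of what a uniform-grid generator along the lines of gen_gird.cpp might look like; the grid size N, the unit-square domain, and the exact text layout of X.txt and Y.txt are assumptions, not taken from the repository.

#include <fstream>

int main() {
    const int N = 101;               // number of grid points per side (assumed)
    const double h = 1.0 / (N - 1);  // uniform spacing on a unit square (assumed)
    std::ofstream fx("X.txt"), fy("Y.txt");
    for (int j = 0; j < N; ++j) {
        for (int i = 0; i < N; ++i) {
            fx << i * h << (i + 1 < N ? ' ' : '\n');  // x-coordinate of node (i, j)
            fy << j * h << (i + 1 < N ? ' ' : '\n');  // y-coordinate of node (i, j)
        }
    }
    return 0;
}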
icpx -fsycl src/main.cpp
./a.out
If you encounter the following error:
error: Double type is not supported on this platform.
in kernel: 'xxxxxx'
error: backend compiler failed build.
Enter the following commands on the command line before building the project:
export OverrideDefaultFP64Settings=1
export IGC_EnableDPEmulation=1
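These variables enable software emulation of double precision in Intel's graphics compiler for GPUs without native FP64 hardware. To check in advance whether the selected device supports FP64 natively, SYCL 2020 exposes this as the fp64 aspect; a small standalone check (not part of this repository) could look like this:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    sycl::queue q;  // default device selection
    auto dev = q.get_device();
    std::cout << "Device: " << dev.get_info<sycl::info::device::name>() << "\n";
    std::cout << "Native FP64 support: "
              << (dev.has(sycl::aspect::fp64) ? "yes" : "no (emulation required)") << "\n";
    return 0;
}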
chmod 755 q; chmod 755 run.sh; if [ -x "$(command -v qsub)" ]; then ./q run.sh; else ./run.sh; fi
const int DIV = 1000; // The number of divisions per second
const int SEC = 200; // Total simulation time (seconds)
const double Re = 200; // Reynolds number of the fluid
const double err = 0.005; // Minimum error value required for psi iterations
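For clarity, these constants presumably combine into the time discretization as follows (the names dt and total_steps are illustrative, not taken from the source):

const double dt = 1.0 / DIV;        // time step of 0.001 s
const int total_steps = SEC * DIV;  // 200 * 1000 = 200000 iterations in total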
Results are written to the ./output/ folder in the same directory. Three kinds of data are saved: PSI, U, and V; the number in each file name indicates the simulated time in seconds.
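As an illustration of this layout, here is a minimal sketch of how one field might be written with Eigen's stream output; the exact file-name pattern (e.g. U100.txt) and number formatting are assumptions, not taken from the repository.

#include <Eigen/Dense>
#include <fstream>
#include <string>

void save_field(const Eigen::MatrixXd &field, const std::string &name, int seconds) {
    // e.g. name = "U", seconds = 100 -> "./output/U100.txt" (naming pattern assumed)
    std::ofstream out("./output/" + name + std::to_string(seconds) + ".txt");
    out << field << "\n";  // Eigen matrices stream as whitespace-separated rows
}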
You can use MATLAB or Python for data analysis, for example with plot_flt.m and rainbow_figure.m in this repository.
The following is a schematic diagram of the velocity field at Re = 200, T = 100 s (the color depth represents the speed):
The optimization focuses on three functions: the vorticity FTCS iteration push(), the stream-function iteration psi_iteration(), and the velocity solver velocity(). These three parts perform a large number of matrix operations, and each has an amortized complexity of $O(n^2)$ per iteration, whereas boundary() in the same iteration has a complexity of only $O(n)$.
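As a sanity check on this complexity claim, with n denoting the number of grid points per side (a symbol not used elsewhere in this README), one full sweep touches

$$
(n-2)^2 = O(n^2) \text{ interior points, versus } 4(n-1) = O(n) \text{ boundary points.}
$$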
Here is the discretized iterative form of the stream function:
$$
\begin{align}
\psi_{i, j}^{k+1}&=\frac{1}{2(\alpha+\gamma)}\left[\left(\alpha-\frac{\delta}{2}\right) \psi_{i-1, j}^{k}+\left(\alpha+\frac{\delta}{2}\right) \psi_{i+1, j}^{k}+\left(\gamma-\frac{\varepsilon}{2}\right) \psi_{i, j-1}^{k}+\left(\gamma+\frac{\varepsilon}{2}\right) \psi_{i, j+1}^{k}\right] \\
&+\frac{1}{2(\alpha+\gamma)}\left[2 \beta\left(\psi_{i+1, j+1}^{k}-\psi_{i+1, j-1}^{k}-\psi_{i-1, j+1}^{k}+\psi_{i-1, j-1}^{k}\right)-\omega_{i, j}\right]
\end{align}
$$
MatrixXd defaults to column-major storage, so when a matrix is converted to a raw array by MatrixXd::data(), the elements within a row are not contiguous in memory. I therefore exchange the order of summation and transform the formula so that the matrix is traversed column by column, which improves addressing efficiency.
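A minimal sketch of this column-wise traversal, assuming the coefficients α, β, γ, δ, ε and the vorticity ω are available exactly as in the formula above; the function name and signature are placeholders rather than the repository's actual code.

#include <Eigen/Dense>
using Eigen::MatrixXd;

// One Jacobi-style sweep of the stream-function update, traversed column by column
// so that the inner loop over i walks contiguous memory in Eigen's column-major layout.
void psi_iteration_step(const MatrixXd &psi, const MatrixXd &omega, MatrixXd &psi_new,
                        double alpha, double beta, double gamma,
                        double delta, double eps) {
    const Eigen::Index rows = psi.rows(), cols = psi.cols();
    const double c = 1.0 / (2.0 * (alpha + gamma));
    for (Eigen::Index j = 1; j < cols - 1; ++j) {      // outer loop over columns
        for (Eigen::Index i = 1; i < rows - 1; ++i) {  // inner loop over rows (contiguous)
            psi_new(i, j) = c * ((alpha - delta / 2) * psi(i - 1, j) +
                                 (alpha + delta / 2) * psi(i + 1, j) +
                                 (gamma - eps / 2) * psi(i, j - 1) +
                                 (gamma + eps / 2) * psi(i, j + 1))
                          + c * (2 * beta * (psi(i + 1, j + 1) - psi(i + 1, j - 1)
                                           - psi(i - 1, j + 1) + psi(i - 1, j - 1))
                                 - omega(i, j));
        }
    }
}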
We can use inline assembly to read the CPU's time-stamp counter and count clock cycles:
static __inline unsigned long long rdtsc(void) {
    unsigned hi, lo;
    // RDTSC returns the 64-bit time-stamp counter in EDX:EAX
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((unsigned long long)lo) | (((unsigned long long)hi) << 32);
}
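For example, the counter can be read before and after a code section to estimate its cost in cycles. In the sketch below (to be compiled together with the rdtsc() definition above), compute_step() is a hypothetical placeholder for the measured work, such as one call to push().

#include <cstdio>

// Placeholder for the code being measured (e.g. one call to push()).
static void compute_step(void) {
    volatile double acc = 0.0;
    for (int i = 0; i < 1000000; ++i) acc += 1e-6;
}

int main() {
    unsigned long long t0 = rdtsc();  // rdtsc() as defined above
    compute_step();
    unsigned long long t1 = rdtsc();
    std::printf("compute_step() took %llu cycles\n", t1 - t0);
    return 0;
}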
Taking the push() function as an example, a large amount of time is spent moving data into device memory: about 73% of the time goes to reading in a large matrix. A considerable amount of time is also spent on matrix addressing. In the version that does not use heterogeneous computing, each grid location in push() consumes an average of 1172 clock cycles for data reads and an average of 1543 clock cycles for floating-point computation.
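For context, here is a minimal sketch of how the interior update in push() can be offloaded with SYCL so that the whole matrix is transferred once per call instead of being re-addressed element by element on the host. The function name, the column-major flat indexing, and the simple 5-point stencil are placeholders and assumptions, not the repository's actual kernel.

#include <sycl/sycl.hpp>
#include <vector>

// w holds an n-by-n field in column-major order (size n * n); c is a stencil coefficient.
void push_device(std::vector<double> &w, int n, double c) {
    std::vector<double> w_new(w);  // copy so that boundary values are preserved
    {
        sycl::queue q;
        sycl::buffer<double, 1> buf_w(w.data(), sycl::range<1>(w.size()));
        sycl::buffer<double, 1> buf_new(w_new.data(), sycl::range<1>(w_new.size()));
        q.submit([&](sycl::handler &h) {
            sycl::accessor a_w(buf_w, h, sycl::read_only);
            sycl::accessor a_new(buf_new, h, sycl::write_only);
            h.parallel_for(sycl::range<2>(n - 2, n - 2), [=](sycl::id<2> id) {
                int i = static_cast<int>(id[0]) + 1;  // interior row index
                int j = static_cast<int>(id[1]) + 1;  // interior column index
                // Placeholder 5-point stencil standing in for the actual FTCS update
                a_new[i + j * n] = a_w[i + j * n]
                    + c * (a_w[(i - 1) + j * n] + a_w[(i + 1) + j * n]
                         + a_w[i + (j - 1) * n] + a_w[i + (j + 1) * n]
                         - 4.0 * a_w[i + j * n]);
            });
        });
    }  // buffer destructors copy the results back to host memory here
    w.swap(w_new);
}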