- Introduction
- File tree intro
- Compile and run
- Detailed explain of code
- How to figure out diff with fortran version of 3dmhd
- Further work and concerns
This is 3dmhd code rewritten using c++20. The main features are:
- The high degree of code encapsulation reduces the code size from the original 6,000 lines to about 3,000 lines.
- The global variable characteristics and basic code structure of the original Fortran code are maintained, and the naming of variables is also consistent.
- On default configuration, correctness with original Fortran code are guaranteed.
TODOs:
- Verify the correctness on other configurations.
- Optimize performance.
- Implementation with cuda accelerate (in progress).
The file tree and explaination of 3dmhd-cpp is listed below:
├── 3dmhd.cpp
├── 3dmhddata.h
├── 3dmhdio.h
├── 3dmhdmpi.h
├── 3dmhdparam.h
├── 3dmhdphsic.h
├── 3dmhdprint.h
├── 3dmhdset.h
├── 3dmhdsub.h
├── 3dmhdutils.h
├── autodiff.py
├── makefile
├── output
├── README.md
└── test.cppThis project depends on openmpi together with a g++ compiler which supports c++ 20 standard.
# compile
mpic++ -O2 -std=c++20 3dmhd.cpp -o 3dmhd
# run with 24 processes
mpirun -np 24 ./3dmhd > clog.logNOTE: 3dmhd will print massive logs to stdout for debug purpose. A comparison between Fortran and C++ logs will be done (see later sections). To remove these debug logs, just commented out the content of function print_with_tag_index_all in 3dmhdprint.h.
Fortran and C++ are two different programming languages with distinct conventions for array indexing and memory layout. These differences can be summarized as follows:
-
Array Indexing:
- Fortran uses 1-based indexing, which means that the first element of an array is accessed with the index 1. For example, in Fortran, if you have a 3-dimension array
RU, you would access an element asRU(x, y, z), with x, y, and z starting from 1. - C++ uses 0-based indexing, so the first element of an array is accessed with the index 0. In C++, if you have a 3-dimension array
RU, you would access an element asRU[x-1][y-1][z-1], with x, y, and z starting from 0.
- Fortran uses 1-based indexing, which means that the first element of an array is accessed with the index 1. For example, in Fortran, if you have a 3-dimension array
-
Memory Layout:
- In Fortran, memory is typically laid out in a column-major order, which means that elements along the first dimension are stored consecutively in memory. In the case of a 3D array
RU(z, y, x), the elements along the first dimension (z) are contiguous in memory, making it efficient for certain types of operations. - In C++, memory is laid out in a row-major order, with elements along the last dimension being stored consecutively in memory. For a 3D array RU[z][y][x], the elements along the last dimension (x) are contiguous in memory.
- In Fortran, memory is typically laid out in a column-major order, which means that elements along the first dimension are stored consecutively in memory. In the case of a 3D array
For a better performance and consistent of for loop logical. The arrays in C++ use:
- same access name and logical as Fortran but with a 1-0 offset diff. access location of (x, y, z) in C++ is logically equivalent as access location of (x+1, y+1, z+1) in Fortran array.
- same memroy layout in logical, i.e. memory is contiguous in x dimension.
In summary, expression RU(x, y, z) in Fortran is equivalent to RU[z-1][y-1][x-1] in C++ in our rewrite design.
- For 3D array, we define it as a type by
typedef double data[nz][ny][nx];. - Configuration of
lRot,lMag,lPot,lRem,lShr,izCon,iyCon,ixConetc is done by macros, this can lead to better binary generation and compile optimizations. - Nested loop on x,y,z dim is widely used, so we define macro
FOR3Dfor simplify. It extends toFor(z, zl, zr) For(y, yl, yr) For(x, xl, xr), wherexl, xr, yl, yr, zl, zrare bounds on each dimension. The bounds are commonly passed as function paramters, so we define macroFORRANGEPARAMasint zl, int zr, int yl, int yr, int xl, int xrfor short, and macroFORRANGEaszl, zr, yl, yr, xl, xr.
In this section, there will be a example showing how we re-implmented the original fortran program to c++ program.
Take the loop DO 530 in the subroutine FLUXES as the example. The fortran code writes:
DO 530 K=ILAP/2+1,NZ-ILAP/2
TMPZ1=HZ*D2ZZDZ2(K)
TMPZ2=H2Z*DZZDZ(K)*DZZDZ(K)
DO 530 J=2,NY-IY+1
TMPY1=HY*D2YYDY2(J)
TMPY2=H2Y*DYYDY(J)*DYYDY(J)
DO 530 I=2,NX-IX+1
FU(I,J,K)=FU(I,J,K)+ORE*(
2 C43*((UU(I+1,J,K)-UU(I-1,J,K))*HX*D2XXDX2(I)
3 +(UU(I+1,J,K)-2.0E00*UU(I,J,K)+UU(I-1,J,K))
4 *H2X*DXXDX(I)*DXXDX(I))
5 +(UU(I,J+1,K)-UU(I,J-1,K))*TMPY1
6 +(UU(I,J+1,K)-2.0E00*UU(I,J,K)+UU(I,J-1,K))
7 *TMPY2
8 +(UU(I,J,K+1)-UU(I,J,K-1))*TMPZ1
9 +(UU(I,J,K+1)-2.0E00*UU(I,J,K)+UU(I,J,K-1))
1 *TMPZ2)First, the code can be splitted into three identical operation in three dimensions (x, y, z). Focus on the x dimension, it is:
DO 530 K=ILAP/2+1,NZ-ILAP/2
DO 530 J=2,NY-IY+1
DO 530 I=2,NX-IX+1
FU(I,J,K)=FU(I,J,K)+ORE*(
2 C43*((UU(I+1,J,K)-UU(I-1,J,K))*HX*D2XXDX2(I)
3 +(UU(I+1,J,K)-2.0E00*UU(I,J,K)+UU(I-1,J,K))
4 *H2X*DXXDX(I)*DXXDX(I)))Then, the code can be splitted into two sections by different difference used. The first section is: FU(I,J,K) = FU(I,J,K) + ORE * C43 * (UU(I+1,J,K) - UU(I-1,J,K)) * HX * D2XXDX2(I). The second section is: FU(I,J,K) = FU(I,J,K) + ORE * C43 * (UU(I+1,J,K)-2.0E00*UU(I,J,K)+UU(I-1,J,K))*H2X*DXXDX(I)*DXXDX(I))).
We created templates for caculating difference in three dimensions. dFdu is created for caculating first order difference. We can replace UU(I+1,J,K) - UU(I-1,J,K) with dFdu<DIM::X>(UU,I,J,K). And d2Fduu is created for caculating second order difference. We can replace UU(I+1,J,K)-2.0E00*UU(I,J,K)+UU(I-1,J,K) with d2Fduu<DIM::X>(UU,I,J,K).
We also created templates Hu, d2uudu2, h2u, duudu for indexing HX, D2XXDX2,H2X, DXXDX using the DIM::X. So we can replace D2XXDX2 with d2uudu2<DIM::X>. Note that though DXXDX and D2XXDX2 are one dimension arrays, we still need indecies on three dimensions for implementing the templates for three dimensions.
Utilizing the templates, the code can be rewritten into
DO 530 K=ILAP/2+1,NZ-ILAP/2
DO 530 J=2,NY-IY+1
DO 530 I=2,NX-IX+1
FU(I,J,K)=FU(I,J,K)+ORE*C43
2 (dFdu<DIM::X>(UU,I,J,K)*hu<DIM::X>*d2uudu2<DIM::X>(I,J,K)
3 +d2Fduu<DIM::X>(UU,I,J,K)*h2u<DIM::X>*duudu<DIM::X>(I,J,K)*duudu<DIM::X>(I,J,K))We noticed that through the program, there are a lot for loop using the same computing pattern, the differnces are dimension, the scalar to be multiplied, destination array, source array and how the destination be changed(the result is added/substracted to the destination, or the destination is set to the result). In DO 530, the result is added to the destination array FU. So we abstracted this computing pattern into a template func_d2.
The source code of the template func_d2 is:
template <OPT opt, DIM dim> void func_d2(FORRANGEPARAM, data tar, const data src, double scale = 1) {
SingleFOR3D oper<opt>(tar[z][y][x],
(dFdu<dim>(src, z, y, x) * d2uudu2<dim>(z, y, x) * hu<dim>() +
d2Fdu2<dim>(src, z, y, x) * square(duudu<dim>(z, y, x)) * h2u<dim>()) *
scale);
}SingleFOR3D is a marco defining a 3D for loop.
So the computation on dimension x in DO 530 can be replaced by one function call: func_d2<OPT::ADD, DIM::X>(FORRANGE, FU, UU, ORE * C43);.
Finally, we use three line in C++ to implement DO 530 loop in fortran:
func_d2<OPT::ADD, DIM::X>(FORRANGE, FU, UU, ORE * C43); // do 530.1
func_d2<OPT::ADD, DIM::Y>(FORRANGE, FU, UU, ORE); // do 530.2
func_d2<OPT::ADD, DIM::Z>(FORRANGE, FU, UU, ORE); // do 530.3As mentioned before, we've add debug prints in the C++ code, runs will print lots of values of all major 3D matrix in the certain index at different point. As the operation is rely on neighbor values, if some operater goes wrong, all value will have an error, so when the value at certain index of all array is same with origin fortran code, we will have a high confidence of correctness of our implement, and if there any problem, it is easy to figure out which step goes wrong because we have debug print at different point.
Below we introduce how to do corectness mapping between C++ and Fortran.
We also provide a version of Fortran code that have the sample style of debug print with C++. So we could do a diff of stdout of both program to verify the correctness.
Compile the fortran with debug print just using make command or manually using following:
mpifort -O2 -std=legacy -ffixed-form -c 3dmhd.f
mpifort -O2 -std=legacy -ffixed-form -c 3dmhdset.f
mpifort -O2 -std=legacy -ffixed-form -c 3dmhdsub.f
mpifort -O2 -std=legacy -ffixed-form -c 3dmhdprint.f
mpifort -O2 -std=legacy -ffixed-form 3dmhd.o 3dmhdset.o 3dmhdsub.o 3dmhdprint.o -o 3dmhd
run the Fortran program:
mpirun -np 24 ./3dmhd > flog.log
To figure out the difference, we would like to use a python script to diff the log, then generate a html file, with the visualize of diff between flog.log and clog.log, we could easily capture the detailed infomation.
python diff.py -f /path/to/flog.log -c path/to/clog.log -o diff.html
Then using browser or HTML Preview extension of VSCode to view diff.html.
Here is an example of difference at index (10, 10, 10) after 9 iterations with all values equal in high precision.
Fig1. 14-digits difference in 9th iteration.
After turning on the lRot switch, we observed some numerical differences bewteen Fortran and C++ code. A simple example is shown below.
For C++ code, we modify line 9 in 3dmhdparam.h from #define lRot false to #define lRot true. And we set par[4] (RY) and par[5] (ANG) in 3dmhdset.cpp. i.e. we change
par[3] = 10;
par[11] = 0.1f;to
par[3] = 10;
par[4] = 10.f;
par[5] = 0.25f;
par[11] = 0.1f;For Fortran code, we modify line 8 in 3dmhdparam.f from
PARAMETER(LROT=.FALSE.,LMAG=.FALSE.,LPOT=.FALSE.,LREM=.FALSE.,
2 LSHR=.FALSE.)to
PARAMETER(LROT=.TRUE.,LMAG=.FALSE.,LPOT=.FALSE.,LREM=.FALSE.,
2 LSHR=.FALSE.)And modify line 109~110 in 3dmhdset.f from
PAR(05) = 0.0E00
PAR(06) = 0.0E00to
PAR(05) = 10.0E00
PAR(06) = 0.25E00Then we find a numerical difference at the very beginning of iteration 1, after loop DO 10 in the second call of fluxes:
Fig2. 17-digits difference in 1st iteration with lRot enabled.
This difference will spread and accumulate in the following iterations:
Fig3. 14-digits differences in 9th iteration with lRot enabled.
here is an example of difference at index (10, 10, 10) ater 9 iteration all values are equal with high precision.