-
Notifications
You must be signed in to change notification settings - Fork 350
Enable GPU exection of summarize_timestep via OpenACC #1294
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: develop
Are you sure you want to change the base?
Enable GPU exection of summarize_timestep via OpenACC #1294
Conversation
…mestep These 'integer, pointer' variables are used as loop bounds and can cause issues with OpenACC. Since OpenACC will directly copy the pointers and may not preserve the association (may not also copy the value pointed to), odd behavior can result.
This prevents having to add OpenACC routine information to mpas_log_write and routine(s) it calls.
Only for loops executed when 'config_print_detailed_minmax_vel = .true.' When running with OpenACC, it is difficult to get values to "come with" during a reduction. This means it's difficult to get both the minimum value of an array and the position (the array indices) it occurs at in one pass. Using 2 loops like this ensures the correct positions are found. Using a "linear index" to record the position of the minimum/maximum values helps in cases where the min/max occurs at multiple points in an array. The detailed prints will always use the first occurence of the min/max in the array.
Adds timing information for the OpenACC data transfers in the 'summarize_timestep [ACC_data_xfer]' timer. This timer will be removed as data regions in the MPAS-A dycore are fused.
@@ -7262,11 +7263,19 @@ subroutine summarize_timestep(domain) | |||
real (kind=RKIND) :: lonMax, lonMax_global | |||
real (kind=RKIND), dimension(5) :: localVals, globalVals | |||
|
|||
real (kind=RKIND) :: spd | |||
real (kind=RKIND), dimension(:,:), allocatable :: spd | |||
!$acc declare create(spd) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Everything looks good, but I had a question about why the declare create is required here. Wasn't fully clear about that from our previous meetings.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did this just because it seems to be what is preferred by the OpenACC standard for Fortran allocatable arrays.
acc declare create(spd)
allows the compiled code to "just do the right thing" whenever it encounters the allocate(spd(...))
or deallocate(spd)
statements later on - in this case, allocate this array on both device and host.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, since spd
won't be fused into any data region (and thus isn't a type of temporary data movement) it doesn't need to be tracked in any MPAS_ACC_TIMER_...
region.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it. Thanks for the explanation!
The code changes look reasonable and I've only compared the log file to the reference run, but that seems to match. Perhaps the commit history could be squashed into fewer commits, but also fine as it is. |
Adds timing information for the OpenACC data transfers in the temporary 'summarize_timestep [ACC_data_xfer]' timer.
Adds timing information for the OpenACC data transfers in the temporary 'summarize_timestep [ACC_data_xfer]' timer.
256463e
to
a21db2c
Compare
This PR modifies the code and adds OpenACC directives to allow the loops in the
summarize_timestep
routine can execute on GPUs.Timing information for the temporary OpenACC data transfers in this routine is captured in the log file by a new timer: 'summarize_timestep [ACC_data_xfer]'.
This PR includes a re-write of the loops associated with
config_print_detailed_minmax_vel
to ensure consistent locations are reported for the minimum and maximum values for the wind speeds.