Add UNDEFINED status and force copy methods that always transfer data #85
Thanks a lot @wertysas for this excellent piece of work 🙏 Having an expert mode that unlocks advanced functionality is a really good idea, and your implementation of exposing this via an alternate API is a very clean and simple way of achieving this.
I've left a few minor comments that I would like you to address. You've also now accrued a small conflict with a change I made to tests/async_host.F90 (sorry 😅 ), so a rebase would also be necessary.
Readme.md
Outdated
QUEUE parameter when calling the aforementioned subroutines. With this QUEUE
## Data Transfers
There are two categories of data transfer methods in Field API. The *core API* consists of methods that internally keeps track of the status of a field and the *advanced API* which consists of methods that relies on the user to
keep track of where the data is located. There is not benefit in using the advanced api if the default API can be used and it is recommended to only use the features from the advanced API when
Typo: "no benefit" instead of "not benefit"
Readme.md
Outdated
Where``DEVICE/HOST`` indicates the transfer direction and ``RDONLY/WRONLY`` indicates the mode.
The difference between the ``GET`` and ``SYNC`` method is their interface. The
``GET`` methods are called with a pointer argument that will be point to transferred data at its destination. The ``SYNC`` method is called without any arguments and
typo: "pointed to" (or even better "associated to") instead of "point to"
Readme.md
Outdated
### Advanced API
**The advanced API is still experimental and requires the user to handle the status tracking of fields.**
"The advanced API cedes all responsibility for data synchronisation to the user and must therefore be used with caution."
Readme.md
Outdated
#### Asynchronous data transfers
Using the advanced API it is possible to overlap data transfers with the computations. To do so you can add the
Asynchronous transfers are more general than the case of overlapping communication and computation so we should update the description here.
"Using the advanced API it is possible to asynchronously transfer data, i.e. issue a non-blocking instruction. To do so..."
src/core/field_RANKSUFF_module.fypp
Outdated
PROCEDURE, PRIVATE :: ${ftn}$_GET_HOST_DATA
PROCEDURE, PRIVATE :: ${ftn}$_GET_DEVICE_DATA
|
Could you please delete this new line?
src/core/field_basic_module.F90
Outdated
@@ -116,6 +124,22 @@ SUBROUTINE FIELD_BASIC_SET_DEVICE_DIRTY (SELF)

END SUBROUTINE FIELD_BASIC_SET_DEVICE_DIRTY

SUBROUTINE FIELD_BASIC_SET_DEVICE_FRESH (SELF)
`SET_DEVICE_FRESH` should set the device freshness bit, and unset the undefined and hostfresh bits. So I think it should instead be:
CALL SELF%SET_STATUS( IAND(SELF%GET_STATUS(), NOT(UNDEFINED)))
CALL SELF%SET_STATUS( IAND(SELF%GET_STATUS(), NOT(NHSTFRESH)))
CALL SELF%SET_STATUS( IOR( SELF%GET_STATUS() , NDEVFRESH))
And let's use more lines please to give my tired brain a better chance of understanding it 😅
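For illustration, the three-step update suggested above can be written as a small pure function. This is a minimal sketch and not the code in field_basic_module.F90: the module name `STATUS_BITS_SKETCH`, the function `MARK_DEVICE_FRESH` and the numeric bit values are assumptions, chosen only so that the example compiles on its own.

```fortran
MODULE STATUS_BITS_SKETCH
  IMPLICIT NONE
  PRIVATE
  PUBLIC :: NDEVFRESH, NHSTFRESH, UNDEFINED, MARK_DEVICE_FRESH

  ! Assumed bit layout for illustration only; the real constants in
  ! Field API may use different values.
  INTEGER, PARAMETER :: NDEVFRESH = 1   ! device copy is up to date
  INTEGER, PARAMETER :: NHSTFRESH = 2   ! host copy is up to date
  INTEGER, PARAMETER :: UNDEFINED = 4   ! status is not tracked

CONTAINS

  ! Hypothetical helper: returns the status after marking the device fresh.
  PURE FUNCTION MARK_DEVICE_FRESH (ISTATUS) RESULT (INEW)
    INTEGER, INTENT(IN) :: ISTATUS
    INTEGER :: INEW

    INEW = IAND (ISTATUS, NOT (UNDEFINED))  ! clear the undefined bit
    INEW = IAND (INEW,    NOT (NHSTFRESH))  ! clear the host-fresh bit
    INEW = IOR  (INEW,    NDEVFRESH)        ! set the device-fresh bit
  END FUNCTION MARK_DEVICE_FRESH

END MODULE STATUS_BITS_SKETCH
```

Under these assumed values, any combination of the three bits on input leaves only the device-fresh bit set on output, which is the behaviour the comment asks for.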
src/core/field_basic_module.F90
Outdated
END SUBROUTINE FIELD_BASIC_SET_DEVICE_FRESH

SUBROUTINE FIELD_BASIC_SET_HOST_FRESH (SELF)
Same comment as `SET_DEVICE_FRESH`.
END SUBROUTINE ${ftn}$_SYNC_HOST_RDWR

SUBROUTINE ${ftn}$_GET_HOST_DATA_FORCE (SELF, PPTR, QUEUE)
We need a `GET_HOST_DATA_OWNER_FORCE` which can handle the case of delayed allocation.
Hi @mlange05, @dareg, @pmarguinaud. Could you please review this PR too whenever you get the chance? Johan can then address all the comments together and rebase onto main. Thanks! 🙏
8bf6116 to 532fca4
Thank you for the comments and suggestions for improvement @awnawab! I have addressed the comments and rebased onto main now.
Looks good to me. Many thanks!
src/core/field_RANKSUFF_module.fypp
Outdated
${ft.type}$, POINTER, INTENT(INOUT) :: PPTR(${ft.shape}$)
INTEGER (KIND=JPIM), OPTIONAL, INTENT(IN) :: QUEUE

IF ( SELF%IS_UNINITIALIZED() ) THEN
This probably got lost in the rebase, but shouldn't we check here if `SELF%DEVPTR` is associated? Even if PTR is uninitialised, DEVPTR may be initialised, so we should copy back as long as device memory is allocated.
Thanks for the quick fix, G2G now 👌
Hello, could someone explain to me why forcing the user to keep track of the status of field data is progress? And why is removing asynchronicity from the regular methods also progress?
Hello, Do you have fields that are so big they don't fit on the GPU in the real code you are porting (not an externalised test case like cloudsc)? In the benchmark we prepared for the call for tenders, we have been able to reduce the memory consumption by simply not doing all the blocks at once, and looping over a set of blocks. It can be seen here: https://github.com/pmarguinaud/IAL/blob/3522929d65a43763ad86577986abdbbb00b1a6f5/arpifs/adiab/cpg_drv.F90#L342
Hi @pmarguinaud and @dareg, We discovered there was a flaw in the original API: the status of a field is invalid for the entire time between an asynchronous copy being issued and a wait being called. This gives the user a very easy way to break the field tracking mechanism. Consider the following:
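The snippet that originally followed is not preserved in this copy of the thread. As a stand-in, here is a self-contained toy model of the kind of sequence being described, where the status is updated when an asynchronous copy is queued rather than when it completes. Everything in it is hypothetical and is not Field API code: `TOY_FIELD`, `COPY_TO_DEVICE_ASYNC`, `WAIT_FOR_COPY`, `GET_HOST_RDONLY`, the bit values and the `-999.0` placeholder are all assumptions made for the sake of the example.

```fortran
MODULE ASYNC_TRACKING_SKETCH
  IMPLICIT NONE
  PRIVATE
  PUBLIC :: TOY_FIELD, NDEVFRESH, NHSTFRESH

  INTEGER, PARAMETER :: NDEVFRESH = 1   ! assumed value
  INTEGER, PARAMETER :: NHSTFRESH = 2   ! assumed value

  ! Hypothetical stand-in for a field: one host value and one device value.
  TYPE TOY_FIELD
    REAL    :: HOST     = 0.0
    REAL    :: DEVICE   = -999.0        ! placeholder "garbage" until a copy completes
    INTEGER :: ISTATUS  = NHSTFRESH
    LOGICAL :: LPENDING = .FALSE.       ! a host-to-device copy is queued
  CONTAINS
    PROCEDURE :: COPY_TO_DEVICE_ASYNC
    PROCEDURE :: WAIT_FOR_COPY
    PROCEDURE :: GET_HOST_RDONLY
  END TYPE TOY_FIELD

CONTAINS

  SUBROUTINE COPY_TO_DEVICE_ASYNC (SELF)
    CLASS(TOY_FIELD), INTENT(INOUT) :: SELF
    ! The copy is only queued here, but the status is updated immediately.
    SELF%LPENDING = .TRUE.
    SELF%ISTATUS  = NDEVFRESH
  END SUBROUTINE COPY_TO_DEVICE_ASYNC

  SUBROUTINE WAIT_FOR_COPY (SELF)
    CLASS(TOY_FIELD), INTENT(INOUT) :: SELF
    ! The queued copy actually lands on the device at this point.
    IF (SELF%LPENDING) THEN
      SELF%DEVICE   = SELF%HOST
      SELF%LPENDING = .FALSE.
    END IF
  END SUBROUTINE WAIT_FOR_COPY

  FUNCTION GET_HOST_RDONLY (SELF) RESULT (VAL)
    CLASS(TOY_FIELD), INTENT(INOUT) :: SELF
    REAL :: VAL
    ! The status says the host copy is stale, so copy back from the device.
    IF (IAND (SELF%ISTATUS, NHSTFRESH) == 0) THEN
      SELF%HOST    = SELF%DEVICE
      SELF%ISTATUS = IOR (SELF%ISTATUS, NHSTFRESH)
    END IF
    VAL = SELF%HOST
  END FUNCTION GET_HOST_RDONLY

END MODULE ASYNC_TRACKING_SKETCH
```

In this toy model, calling `GET_HOST_RDONLY` straight after `COPY_TO_DEVICE_ASYNC` overwrites the host value with the device placeholder, because the status already claims the device copy is fresh. Calling `WAIT_FOR_COPY` in between returns the original host value.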
The above would result in incorrect behaviour; we may end up copying garbage back to host and subsequent calls to
The core API is now limited to synchronous copies of the entire field. For the core API, FIELD_API is responsible for tracking the fields. If however the user wants to do something more sophisticated and complex, then I think it is quite fair to cede the tracking responsibility to the user and ask them to use the advanced API instead (rather than add unnecessary complexity in FIELD_API trying to cover every corner case of asynchronous offload). Such an advanced mode would also enable us to add more features to FIELD_API, namely partial offload of fields and (eventually) streamed offloads of fields, which is necessary to overlap compute and communication.
I am aware you are already able to achieve some reduction in the device memory footprint in your benchmark. Please correct me if I am wrong, but I believe this is done through allocating smaller temporaries inside the outer block loop. Non-temporary or wrapped fields are offloaded in their entirety. We would like to add the ability to do a partial offload of any field, whether temporary or not.
Whether or not we use this functionality for our production runs is beside the point. A technical infrastructure library like FIELD_API should allow greater flexibility than simply what we need for production. Especially if such "advanced"/"optional" functionality is:
I hope the above can assuage some of your concerns. Please do reach out if you have any further questions. And as always, technical feedback on the actual implementation would be greatly appreciated.
Good morning @pmarguinaud and @dareg, Have you had another chance to look at this PR? Could you please provide feedback at your earliest convenience? I would really like to merge this soon.
Hi,
As you said, the status of a field is invalid while an async transfer is pending. So it seems to me that the solution is to always call the WAIT_FOR_ASYNC_QUEUE subroutines when you want to make use of a field that has been transferred asynchronously, as stated in our documentation.
Is that not enough to prevent the use of half-transferred data?
On your second point, I think we have a divergent view on what field api should be. I think we should not aim for something very flexible, general, or versatile. On the contrary, we should aim for something tailored specifically for the porting of IFS/IAL on GPU, with the simplest[1] field api library we can get away with in order to ease the maintenance burden that will arise in the future. I think we should constrain ourselves to make field api work well, first and foremost, on production code.
[1] And I think it's already quite complex.
Hi Judicaël, Thanks for your reply. Yes sure, adding an explicit wait solves the problem. I personally think the API should by design not allow the user to make mistakes, but I don't have very strong feelings on it and I would be happy to restore the
You didn't however address one of the main points in my comment so I'll say it again more clearly. Back in December we filed #68 explaining that we want to add the capability to split a copy into multiple streams and overlap communication and computation. That issue was referenced in PRs #69 and #83, so we have been talking about it for a while. Overlapping communication and computation can potentially have significant performance benefits, so we are very keen to have this capability in FIELD_API. It has to be in the main trunk and not in a fork because evaluating the performance benefits on different kernels and architectures will be an ongoing medium-to-long-term process.
Please reevaluate the current PR in light of the above clarification. It is a key part of potentially very powerful new capability that we have been working on (and signposting to you) for months.
This PR adds new features to Field API that allow the user to make data copies without Field API tracking the status of the data. The new features address the issue with "Add support for blocked offload of fields to device" (#83), where the proposed changes could break the internal status tracking of fields.
Four new data transfer methods are added that always copy the fields:
SUBROUTINE GET_HOST_DATA_FORCE(SELF, PTR, QUEUE)
SUBROUTINE SYNC_HOST_DATA_FORCE(SELF, QUEUE)
SUBROUTINE GET_DEVICE_DATA_FORCE(SELF, PTR, QUEUE)
SUBROUTINE SYNC_DEVICE_DATA_FORCE(SELF, QUEUE)
These methods all set the field's status to `UNDEFINED`, and the user has to set the status back to a valid status with the newly added `SET_DEVICE_FRESH` and `SET_HOST_FRESH` methods before the normal Field API routines can be used on the field.
Additionally, the optional `QUEUE` argument has been removed from the normal copy methods and is now only exposed to the user through the `FORCE` copy methods. Previously, the copy methods could be called with the `QUEUE` argument and the status of the field would have been updated even if the data transfer hadn't been launched. By only allowing asynchronous copies through the `FORCE` methods, the tracking logic of Field API will always work as expected when calling the normal copy routines.
The open draft PR, "Add support for blocked offload of fields to device" (#83), will be rebased onto this and use the `FORCE` routine for offloading instead, to make it impossible for a user to break the status of the fields when using non-`FORCE` copy methods.