Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add UNDEFINED status and force copy methods that always transfer data #85

Open
wants to merge 9 commits into
base: main
Choose a base branch
from

Conversation

wertysas
Copy link
Contributor

This PR adds new features to Field API that allow the user to make data copies without Field API tracking the status of the data. The new features addresses the issue with Add support for blocked offload of fields to device #83, where the proposed changes could break the internal status tracking of fields.

Four new data transfer methods are added that always copies the fields:

  • SUBROUTINE GET_HOST_DATA_FORCE(SELF, PTR, QUEUE)
  • SUBROUTINE SYNC_HOST_DATA_FORCE(SELF, QUEUE)
  • SUBROUTINE GET_DEVICE_DATA_FORCE(SELF, PTR, QUEUE)
  • SUBROUTINE SYNC_HOST_DATA_FORCE(SELF, QUEUE)

These methods all set the field's status to UNDEFINED and the user has to set the status back to a valid status with the newly added SET_DEVICE_FRESH and SET_HOST_FRESH methods before the normal Field API routines can be used on the field.

Additionally, the optional QUEUE argument has been removed from the normal copy methods and are now only exposed to the user through the FORCE copy methods. Previously, the copy methods could be called with the QUEUE argument and the status of the field would have been updated even if the data transfer hadn't been launched. By only allowing asycnhronous copies through the FORCE methods the tracking logic of Field API will always work as expected when calling the normal copy routines.

The open draft PR, Add support for blocked offload of fields to device #83, will be rebased onto this and use the FORCE routine for offloading instead to make it impossible for a user to break the status of the fields when using non-FORCE copy methods.

@wertysas wertysas requested a review from awnawab February 20, 2025 08:57
@awnawab awnawab added the approved-for-ci Approved to run hpc-ci label Feb 20, 2025
Copy link
Collaborator

@awnawab awnawab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @wertysas for this excellent piece of work 🙏 Having an expert mode that unlocks advanced functionality is a really good idea, and your implementation of exposing this via an alternate API is a very clean and simple way of achieving this.

I've left a few minor comments that I would like you to address. You've also now accrued a small conflict with a change I made to tests/async_host.F90 (sorry 😅 ), so a rebase would also be necessary.

Readme.md Outdated
QUEUE parameter when calling the aforementioned subroutines. With this QUEUE
## Data Transfers
There are two categories of data transfer methods in Field API. The *core API* consists of methods that internally keeps track of the status of a field and the *advanced API* which consists of methods that relies on the user to
keep track of where the data is located. There is not benefit in using the advanced api if the default API can be used and it is recommended to only use the features from the advanced API when
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typo: "no benefit" instead of "not benefit"

Readme.md Outdated

Where``DEVICE/HOST`` indicates the transfer direction and ``RDONLY/WRONLY`` indicates the mode.
The difference between the ``GET`` and ``SYNC`` method is their interface. The
``GET`` methods are called with a pointer argument that will be point to transferred data at its destination. The ``SYNC`` method is called without any arguments and
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

typo: "pointed to" (or even better "associated to") instead of "point to"

Readme.md Outdated

### Advanced API

**The advanced API is still experimental and requires the user to handle the status tracking of fields.**
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"The advanced API cedes all responsibility for data synchronisation to the user and must therefore be used with caution."

Readme.md Outdated


#### Asynchronous data transfers
Using the advanced API it is possible to overlap data transfers with the computations. To do so you can add the
Copy link
Collaborator

@awnawab awnawab Feb 26, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Asynchronous transfers are more general than the case of overlapping communication and computation so we should update the description here.

"Using the advanced API it is possible to asynchronously transfer data, i.e. issue a non-blocking instruction. To do so..."

PROCEDURE, PRIVATE :: ${ftn}$_GET_HOST_DATA
PROCEDURE, PRIVATE :: ${ftn}$_GET_DEVICE_DATA

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please delete this new line?

@@ -116,6 +124,22 @@ SUBROUTINE FIELD_BASIC_SET_DEVICE_DIRTY (SELF)

END SUBROUTINE FIELD_BASIC_SET_DEVICE_DIRTY

SUBROUTINE FIELD_BASIC_SET_DEVICE_FRESH (SELF)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SET_DEVICE_FRESH should set the device freshness bit, and unset the undefined and hostfresh bits. So I think it should be instead:

CALL SELF%SET_STATUS( IAND(SELF%GET_STATUS(), NOT(UNDEFINED)))
CALL SELF%SET_STATUS( IAND(SELF%GET_STATUS(), NOT(NHSTFRESH)))
CALL SELF%SET_STATUS( IOR( SELF%GET_STATUS() , NDEVFRESH))

And let's use more lines please to give my tired brain a better chance of understanding it 😅


END SUBROUTINE FIELD_BASIC_SET_DEVICE_FRESH

SUBROUTINE FIELD_BASIC_SET_HOST_FRESH (SELF)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same comment as SET_DEVICE_FRESH.


END SUBROUTINE ${ftn}$_SYNC_HOST_RDWR

SUBROUTINE ${ftn}$_GET_HOST_DATA_FORCE (SELF, PPTR, QUEUE)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We need a GET_HOST_DATA_OWNER_FORCE which can handle the case of delayed allocation.

@awnawab
Copy link
Collaborator

awnawab commented Feb 28, 2025

Hi @mlange05, @dareg, @pmarguinaud. Could you please review this PR too whenever you get the chance? Johan can then address all the comments together and rebase onto main. Thanks! 🙏

@wertysas wertysas force-pushed the je-data-force-methods branch from 8bf6116 to 532fca4 Compare March 6, 2025 15:00
@github-actions github-actions bot removed the approved-for-ci Approved to run hpc-ci label Mar 6, 2025
@wertysas
Copy link
Contributor Author

wertysas commented Mar 7, 2025

Thanks a lot @wertysas for this excellent piece of work 🙏 Having an expert mode that unlocks advanced functionality is a really good idea, and your implementation of exposing this via an alternate API is a very clean and simple way of achieving this.

I've left a few minor comments that I would like you to address. You've also now accrued a small conflict with a change I made to tests/async_host.F90 (sorry 😅 ), so a rebase would also be necessary.

Thank you for the comments and suggestions for improvement @awnawab! I have addressed the comments and rebased onto main now.

Copy link
Collaborator

@mlange05 mlange05 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. Many thanks!

@awnawab awnawab added the approved-for-ci Approved to run hpc-ci label Mar 10, 2025
Copy link
Collaborator

@awnawab awnawab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot @wertysas for addressing the changes 🙏 This looks really good now. I just have minor comment, but other than that this looks good to go 👌 @dareg could you please give this another quick look at your earliest convenience?

${ft.type}$, POINTER, INTENT(INOUT) :: PPTR(${ft.shape}$)
INTEGER (KIND=JPIM), OPTIONAL, INTENT(IN) :: QUEUE

IF ( SELF%IS_UNINITIALIZED() ) THEN
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This probably got lost in the rebase, but shouldn't we check here if SELF%DEVPTR is associated? Even if PTR is uninitialised, DEVPTR may be initialised, so we should copy back as long as device memory is allocated.

@github-actions github-actions bot removed the approved-for-ci Approved to run hpc-ci label Mar 11, 2025
Copy link
Collaborator

@awnawab awnawab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the quick fix, G2G now 👌

@awnawab awnawab added the approved-for-ci Approved to run hpc-ci label Mar 11, 2025
@pmarguinaud
Copy link
Collaborator

Hello,

Could someone explain me why forcing the user to keep track of the status of field data is a progress ? And removing asynchronicity from the regular methods, is also a progress; why ?

@dareg dareg self-requested a review March 11, 2025 10:22
@dareg
Copy link
Collaborator

dareg commented Mar 11, 2025

Hello,
Sorry to do that kind of comment that late in the process, but we were wondering if adding that kind of complexity in field api is really necessary?

Do you have fields that are so big they don't fit on the GPU in the real code you are porting (not externalised test-case like cloudsc)?

In the benchmark we prepared for the call for tenders, we have been able to reduce the memory consumption by simply not doing all the blocks at once, and looping over a set of blocks. It can be seen here: https://github.com/pmarguinaud/IAL/blob/3522929d65a43763ad86577986abdbbb00b1a6f5/arpifs/adiab/cpg_drv.F90#L342

@awnawab
Copy link
Collaborator

awnawab commented Mar 11, 2025

Hi @pmarguinaud and @dareg,

We discovered there was a flaw in the original API; the status of a field is invalid between the entire time an asynchronous copy is issued and a wait is called. This gives the user a very easy way to break the field tracking mechanism. Consider the following:

CALL  FIELD%SYNC_DEVICE_RDWR(QUEUE=1)
CALL FIELD%SYNC_HOST_RDWR()

The above would result in incorrect behaviour; we may end up copying garbage back to host and subsequent calls to SYNC_HOST_RDWR would not update the data either since the status will already be NHSTFRESH. Rather than introducing additional complexity and introducing checks for proper synchronisation of asynchronous copies, we thought it is better to separate the "core" from the "advanced" API.

The core API is now limited to synchronous copies of the entire field. For the core API, FIELD_API is responsible for tracking the fields. If however the user wants to do something more sophisticated and complex, then I think it is quite fair to cede the tracking responsibility to the user and ask them to use the advanced API instead (rather than add unnecessary complexity in FIELD_API trying to cover every corner case of asynchronous offload).

Such an advanced mode would also enable us to add more features to FIELD_API, namely partial offload of fields and (eventually) streamed offloads of fields, which is necessary to overlap compute and communication.

I am aware you are already able to achieve some reduction in the device memory footprint in your benchmark. Please correct me if I am wrong, but I believe this is done through allocating smaller temporaries inside the outer block loop. Non-temporary or wrapped fields are offloaded in their entirety. We would like to add the ability to do a partial offload of any field, whether temporary or not.

Whether or not we use this functionality for our production runs is beside the point. A technical infrastructure library like FIELD_API should allow greater flexibility than simply what we need for production. Especially if such "advanced"/"optional" functionality is:

  • Clearly delineated from the core functionality at the API level
  • Is well explained in the Readme
  • Is thoroughly tested.

I hope the above can assuage some of your concerns. Please do reach out if you have any further questions. And as always, technical feedback on the actual implementation would be greatly appreciated.

@awnawab
Copy link
Collaborator

awnawab commented Mar 17, 2025

Good morning @pmarguinaud and @dareg,

Have you had another chance to look at this PR? Could you please provide feedback at your earliest convenience? I would really like to merge this soon.

@dareg
Copy link
Collaborator

dareg commented Mar 17, 2025

Hi,
In the example you've shown:

CALL FIELD%SYNC_DEVICE_RDWR(QUEUE=1)
CALL FIELD%SYNC_HOST_RDWR()

As you said, the status of a field is invalid while an async transfer is pending. So it seems to me that the solution is to always call the WAIT_FOR_ASYNC_QUEUE subroutines when you want to make use of a field that have been transferred asynchronously, as stated in our documentation.

CALL FIELD%SYNC_DEVICE_RDWR(QUEUE=1)
!do other stuff while the transfer is happening
CALL WAIT_FOR_ASYNC_QUEUE(1) !wait for the transfer to finish if it wasn't already
CALL FIELD%SYNC_HOST_RDWR()

Is it not enough to prevent using halfl-transferred data?

On your second point. I think we have a divergent view on what field api should be. I think we should not aim for something very flexible, general, or versatile. In the contrary we should aim for something tailored specifically for the porting of IFS/IAL on GPU, with the simplest[1] field api library we can get away with in order to ease the maintenance burden that will arise in the future.

I think we should constrain ourself to make field api work well, first and foremost, on production code.

[1] And I think it's already quite complex.

@awnawab
Copy link
Collaborator

awnawab commented Mar 17, 2025

Hi Judicaël,

Thanks for your reply. Yes sure, adding an explicit wait solves the problem. I personally think the API should by design not allow the user to make mistakes, but I don't have very strong feelings on it and I would be happy to restore the QUEUE argument to the standard API if that helps unblock things.

You didn't however address one of the main points in my comment so I'll say it again more clearly. Back in December we filed #68 explaining that we want to add the capability to split a copy into multiple streams and overlap communication and computation. That issue was referenced in PRs #69 and #83, so we have been talking about it for a while.

Overlapping communication and computation can potentially have significant performance benefits, so we are very keen to have this capability in FIELD_API. It has to be in the main trunk and not in a fork because evaluating the performance benefits on different kernels and architectures will be an ongoing medium-to-long-term process.

Please reevaluate the current PR in light of the above clarification. It is a key part of potentially very powerful new capability that we have been working on (and signposting to you) for months.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved-for-ci Approved to run hpc-ci contributor
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants