Release backward inputs per static graph ref count #20804

Merged
merged 9 commits into from
Jun 14, 2024

Conversation

pengwa
Contributor

@pengwa pengwa commented May 24, 2024

Release backward inputs per static graph ref count

For the output buffer marked as external output:

  1. Remove the additional ref count we used to prevent the buffer from being reused. Instead, when we look for an input/output buffer to reuse, we make sure the reused buffer is not generated by a node that has external outputs.
  2. Remove the ref count held by the pybind feed inputs, which previously persisted until run_backward completed. Instead, pass mutable feeds and clear the feeds vector once its contents have been copied into the session state and are no longer needed, before running the graph sequentially.
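The two points above can be illustrated with a minimal Python sketch. This is not the ONNX Runtime implementation (which is C++); the `Node` class, the `has_external_outputs` flag, and both helper functions are hypothetical names chosen only to show the idea: buffers produced by nodes with external outputs are excluded from reuse, and the caller's feeds container is emptied once its contents have been handed to the session state.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    inputs: list[str]
    outputs: list[str]
    has_external_outputs: bool = False  # hypothetical flag, for illustration


def reusable_buffers(nodes: list[Node], free_buffers: list[str]) -> list[str]:
    """Return the free buffers that are safe to reuse.

    Buffers produced by a node with external outputs are skipped, so no
    extra ref count is needed to protect them from reuse.
    """
    externally_owned = {
        out for node in nodes if node.has_external_outputs for out in node.outputs
    }
    return [buf for buf in free_buffers if buf not in externally_owned]


def move_feeds_into_state(feeds: list, names: list[str]) -> dict:
    """Hand the feeds to the session state, then clear the caller's list."""
    state = dict(zip(names, feeds))
    feeds.clear()  # the caller's list no longer keeps the tensors alive
    return state


nodes = [
    Node("PythonOp", ["x"], ["y"], has_external_outputs=True),
    Node("Relu", ["y"], ["z"]),
]
candidates = reusable_buffers(nodes, ["y", "z"])

feeds = [bytearray(8), bytearray(16)]
state = move_feeds_into_state(feeds, ["grad_a", "grad_b"])
```

After the call, `state` is the only owner of the fed values, so each one can be freed as soon as the graph no longer needs it rather than at the end of the backward run.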

Before the change:

One of the backward inputs is 3.9GB, and it lives until the backward pass ends.

With the change:

The 3.9GB tensor is released as soon as the last node that depends on it completes.
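As a rough illustration of releasing inputs by static ref count, the sketch below counts how many node inputs consume each tensor before execution, then decrements the count as nodes run in order and drops a tensor when its count reaches zero. The function, node names, and tensor names are all hypothetical; this is a simplified model, not the ONNX Runtime executor.

```python
from collections import Counter


def run_with_static_ref_counts(nodes, state):
    """Run nodes in order and release each tensor after its last consumer.

    nodes: list of (name, input_names, output_names) in execution order.
    state: dict mapping tensor name -> value for the graph inputs.
    Returns a list of (node_name, released_tensor_names) pairs.
    """
    # Static ref count: number of node inputs that consume each tensor.
    ref_count = Counter(inp for _, inputs, _ in nodes for inp in inputs)
    log = []
    for name, inputs, outputs in nodes:
        for out in outputs:
            state[out] = f"result of {name}"  # placeholder for real compute
        released = []
        for inp in inputs:
            ref_count[inp] -= 1
            if ref_count[inp] == 0:
                del state[inp]  # last consumer finished, drop the tensor
                released.append(inp)
        log.append((name, released))
    return log


nodes = [
    ("MatMulGrad", ["big_input", "dY"], ["dW"]),
    ("AddGrad", ["dY"], ["dB"]),
    ("Scale", ["dW"], ["dW_scaled"]),
]
state = {"big_input": bytearray(1024), "dY": bytearray(16)}
log = run_with_static_ref_counts(nodes, state)
```

In this toy graph, `big_input` is dropped right after its only consumer runs, while tensors that no node consumes (the final outputs) remain in `state`.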


Note: the peak memory did not change, though; more work is needed to reduce the peak.

Others

A few tests were found to have been updated with incorrect expected values during a previous code refactoring: a81faee#diff-9e8fbae7d3dff24106cd17564949f320e943cb3048eae07813c7de144f140419L382.

This PR restores the correct expected values, and I believe all test cases are now back to normal.

Motivation and Context

@pengwa pengwa added the "training" label (issues related to ONNX Runtime training; typically submitted using template) May 24, 2024
@pengwa pengwa requested a review from wschin May 24, 2024 08:25
@pengwa pengwa requested a review from souptc June 11, 2024 13:54
@pengwa
Contributor Author

pengwa commented Jun 14, 2024

Thanks @wschin !!

@pengwa pengwa merged commit 87b14ac into main Jun 14, 2024
96 checks passed
@pengwa pengwa deleted the pengwa/release_external_outputs branch June 14, 2024 06:33
Labels
training issues related to ONNX Runtime training; typically submitted using template
2 participants