[XLA] Improve GPU memory limit handling and shape size calculation #23271

Open · wants to merge 1 commit into main

Conversation

zhenying-liu (Contributor)

This change improves the accuracy of GPU memory limit calculations and provides a more maintainable interface for shape size computations.

  • Change memory limit type from int64_t to uint64_t to prevent negative values and better represent memory sizes.

  • Add memory space filtering to exclude host memory when calculating input/output sizes that impact GPU memory usage. This ensures we only count buffers that actually reside in device memory.

  • Replace direct GetSizeOfShape() calls with a ShapeSizeBytesFunction() wrapper (a minimal sketch follows this list), which provides:

    • Consistent shape size calculation across the codebase
    • Optional memory space filtering capability
    • Proper handling of dynamic shapes and their metadata
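
To make the pattern concrete, here is a minimal, self-contained C++ sketch of the idea described above. It is illustrative only, not the actual XLA code: the Shape struct, the memory-space constants, and MakeShapeSizeBytesFunction are simplified stand-ins, and the real ShapeSizeBytesFunction() additionally accounts for dynamic-shape metadata.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

// Simplified stand-ins for XLA's shape/layout machinery. The constants
// mirror the idea of a buffer annotated as living in host (pinned)
// memory rather than in device memory.
constexpr int kDefaultMemorySpace = 0;
constexpr int kHostMemorySpace = 5;

struct Shape {
  uint64_t byte_size;  // Size of the buffer in bytes.
  int memory_space;    // Where the buffer resides (device vs. host).
};

// Returns a reusable size-computation callback, analogous in spirit to
// a ShapeSizeBytesFunction() wrapper: callers share one consistent
// function instead of invoking a raw size helper directly. Buffers
// placed in host memory contribute zero bytes, so they do not count
// against the GPU memory limit.
std::function<uint64_t(const Shape&)> MakeShapeSizeBytesFunction() {
  return [](const Shape& shape) -> uint64_t {
    if (shape.memory_space == kHostMemorySpace) {
      return 0;  // Host-resident buffer: no GPU memory consumed.
    }
    return shape.byte_size;
  };
}

int main() {
  // The limit is unsigned: a memory size can never be negative, and
  // uint64_t rules out accidental negative intermediate values.
  const uint64_t memory_limit_bytes = 16ull * 1024 * 1024 * 1024;  // 16 GiB

  const std::vector<Shape> buffers = {
      {4096, kDefaultMemorySpace},  // counted: lives on the GPU
      {8192, kHostMemorySpace},     // skipped: offloaded to host memory
      {2048, kDefaultMemorySpace},  // counted
  };

  auto shape_size_bytes = MakeShapeSizeBytesFunction();
  uint64_t total = 0;
  for (const Shape& s : buffers) {
    total += shape_size_bytes(s);
  }

  std::cout << "Device bytes used: " << total << " / "
            << memory_limit_bytes << "\n";  // prints 6144 / 17179869184
  return 0;
}
```

Returning 0 for host-resident buffers is what keeps offloaded tensors out of the device-memory accounting, and making both the limit and the running total uint64_t rules out negative sizes by construction.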

@zhenying-liu (Contributor, Author)

The two CI failures and the AMD ROCm one appear to be unrelated to my changes:

Error: Error: pod failed to come online with error: Error: Pod linux-x86-n2-16-xf6x6-runner-xj7vr-workflow is unhealthy with phase status Failed
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

@zhenying-liu requested a review from frgossen on March 1, 2025 at 21:01