[XLA] Improve GPU memory limit handling and shape size calculation #23271

Open · wants to merge 1 commit into main

Conversation

zhenying-liu (Contributor)

This change improves the accuracy of GPU memory limit calculations and provides a more maintainable interface for shape size computations.

  • Change memory limit type from int64_t to uint64_t to prevent negative values and better represent memory sizes.

  • Add memory space filtering to exclude host memory when calculating input/output sizes that impact GPU memory usage. This ensures we only count buffers that actually reside in device memory.

  • Replace direct GetSizeOfShape() calls with a ShapeSizeBytesFunction() wrapper (a minimal sketch follows this list), which provides:

    • Consistent shape size calculation across the codebase
    • Optional memory space filtering capability
    • Proper handling of dynamic shapes and their metadata
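
To make the pattern concrete, here is a minimal, self-contained C++ sketch of the idea described above. It is illustrative only, not the actual XLA code: the Shape struct, the memory-space constants, and MakeShapeSizeBytesFunction are simplified stand-ins, and the real ShapeSizeBytesFunction() additionally accounts for dynamic-shape metadata.

```cpp
#include <cstdint>
#include <functional>
#include <iostream>
#include <vector>

// Simplified stand-ins for XLA's shape/layout machinery. The constants
// mirror the idea of a buffer annotated as living in host (pinned)
// memory rather than in device memory.
constexpr int kDefaultMemorySpace = 0;
constexpr int kHostMemorySpace = 5;

struct Shape {
  uint64_t byte_size;  // Size of the buffer in bytes.
  int memory_space;    // Where the buffer resides (device vs. host).
};

// Returns a reusable size-computation callback, analogous in spirit to
// a ShapeSizeBytesFunction() wrapper: callers share one consistent
// function instead of invoking a raw size helper directly. Buffers
// placed in host memory contribute zero bytes, so they do not count
// against the GPU memory limit.
std::function<uint64_t(const Shape&)> MakeShapeSizeBytesFunction() {
  return [](const Shape& shape) -> uint64_t {
    if (shape.memory_space == kHostMemorySpace) {
      return 0;  // Host-resident buffer: no GPU memory consumed.
    }
    return shape.byte_size;
  };
}

int main() {
  // The limit is unsigned: a memory size can never be negative, and
  // uint64_t rules out accidental negative intermediate values.
  const uint64_t memory_limit_bytes = 16ull * 1024 * 1024 * 1024;  // 16 GiB

  const std::vector<Shape> buffers = {
      {4096, kDefaultMemorySpace},  // counted: lives on the GPU
      {8192, kHostMemorySpace},     // skipped: offloaded to host memory
      {2048, kDefaultMemorySpace},  // counted
  };

  auto shape_size_bytes = MakeShapeSizeBytesFunction();
  uint64_t total = 0;
  for (const Shape& s : buffers) {
    total += shape_size_bytes(s);
  }

  std::cout << "Device bytes used: " << total << " / "
            << memory_limit_bytes << "\n";  // prints 6144 / 17179869184
  return 0;
}
```

Returning 0 for host-resident buffers is what keeps offloaded tensors out of the device-memory accounting, and making both the limit and the running total uint64_t rules out negative sizes by construction.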

@zhenying-liu (Contributor, Author)

The two CI failures and the AMD ROCm one appear to be unrelated to my changes:

Error: Error: pod failed to come online with error: Error: Pod linux-x86-n2-16-xf6x6-runner-xj7vr-workflow is unhealthy with phase status Failed
Error: Process completed with exit code 1.
Error: Executing the custom container implementation failed. Please contact your self hosted runner administrator.

@zhenying-liu requested a review from frgossen on March 1, 2025 at 21:01