
Investigate extra memory used by FBP whose origin is still unknown #365

Closed
yousefmoazzam opened this issue Jun 13, 2024 · 2 comments
Labels
memory-estimation GPU memory estimator related

Comments

@yousefmoazzam
Collaborator

As part of #361, the FBP memory estimation was improved to more accurately represent the memory allocations the FBP method makes. However, some memory is still being allocated whose origin within the method is not yet known.

For now, a multiplier of 2 on the output of the 1D RFFT has been added to bump up the memory estimate:

# The multiplier_heuristic comes from the empirical testing of the module for various
# data sizes. So far, the need for it cannot be explained from the algorithmic
# standpoint. Also, there is no association of it with the filtered_freq_slice array,
# as we just need to bump up the memory to avoid the OOM error.
multiplier_heuristic = 2
tot_memory_bytes = int(
    2 * input_slice_size
    + filtered_input_slice_size
    + multiplier_heuristic * filtered_freq_slice
)

and this allows 80 GB of data to be put through the method in httomo without issue.
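
For concreteness, here is a rough back-of-the-envelope version of that estimate for a hypothetical (1801, 5, 2560) float32 slab. The per-array byte counts below are assumptions made for illustration (float32 input, complex64 RFFT output of width det_x // 2 + 1), not the exact expressions used by the estimator:

import numpy as np

# Hypothetical sinogram slab: (angles, slices, det_x), float32
angles, slices, det_x = 1801, 5, 2560

# Assumed per-array sizes, for illustration only
input_slice_size = angles * slices * det_x * np.dtype(np.float32).itemsize
filtered_input_slice_size = input_slice_size  # IRFFT output, same shape as the input
filtered_freq_slice = (
    angles * slices * (det_x // 2 + 1) * np.dtype(np.complex64).itemsize
)  # 1D RFFT output

multiplier_heuristic = 2  # the empirical bump described above
tot_memory_bytes = int(
    2 * input_slice_size
    + filtered_input_slice_size
    + multiplier_heuristic * filtered_freq_slice
)
print(f"{tot_memory_bytes / 1e9:.2f} GB estimated for the slab")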

However, observations made with cupy's LineProfileHook indicate that it is most likely not the case that the 1D RFFT creates more than one array. Therefore, more investigation is needed to determine what is causing the extra memory to be allocated.
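
For reference, this is roughly how such tracing with cupy's LineProfileHook looks; the workload inside the with block is a placeholder rather than the actual FBP wrapper:

import cupy as cp
from cupy.cuda.memory_hooks import LineProfileHook

hook = LineProfileHook()
with hook:
    # placeholder GPU work standing in for the FBP wrapper under investigation
    data = cp.ones((1801, 5, 2560), dtype=cp.float32)
    freq = cp.fft.rfft(data, axis=2)

# prints per-line device allocation totals, which is how the number of arrays
# created by the 1D RFFT was checked
hook.print_report()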

@yousefmoazzam yousefmoazzam added the memory-estimation GPU memory estimator related label Jun 13, 2024
@dkazanc dkazanc added bug Something isn't working and removed bug Something isn't working labels Jun 13, 2024
@team-gpu
Contributor

This has been addressed in #393. Specifically, I've been tracing all individual allocations and dug into the astra toolbox as well. The estimator code has been adjusted to reflect the original code structure better. The full details follow, for a test case with input data of size (1801, 5, 2560), i.e. 5 slices in sinogram view. The recorded allocations are:

  • The input size at the start is captured trivially
  • For the filtersinc function, these are the allocations that happen:
    • FFT plan: 184,495,104 bytes, correctly estimated
    • complex output of the FFT of half the width: 92,283,392 bytes, which is 18,456,678.4 per slice. This is a tiny bit bigger than estimated, which looks like some padding for alignment (128 bytes)
    • filter memory: 5,632 bytes (correctly estimated)
    • IFFT plan: same size as above for FFT
    • output of IFFT: 92,211,200 bytes. This is the same as the input size, as expected
    • at exiting the function, the memory for the complex frequency domain data as well as filter is freed as expected
  • then the FFT plans (both) are destroyed, freeing the memory for them
  • then there's a swapaxis which allocates input_size again: 92,211,200
  • also the space for the recon output is allocated as expected: 28,800,000 bytes (1200 x 1200 x 5 of float32)
  • then it calls the astra toolbox code, which internally creates a cudaArray of the same size as the input for texture access. This is an allocation of around the input size (give or take a few bytes), and we assume as much in the estimator (it could not be tracked with the memory hook)
  • this cudaArray is then freed again in astra, which we take into account
  • the check_kwargs function applies a circular mask, and this is heavy on lots of tiny allocations and frees (won't list them all here). The biggest one is an allocation of a 1200 x 1200 int64 array, which we'll use for the estimates as a fixed cost independent of slices
  • at the end of the astra call wrappers, it frees 2x the input size, which would be the swapaxis allocation above and the output of filtersinc, as the latter is going out of scope and no longer used
  • then there's another swapaxis call in the return value, which allocates another 28,800,000, i.e. 1200 x 1200 x 5 x float32
  • The second astra-related part seems to be almost always smaller than the filtersinc part at the start, and memory is re-used. But the estimator has a max(...) call to make sure the larger of the two is considered (see the sketch after this list).
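
To make the bookkeeping above easier to follow, below is a minimal sketch of how those allocations could be rolled up into a two-phase estimate with the max(...) mentioned in the last bullet. It is illustrative only and not the estimator code from #393; the FFT-plan size is passed in because the 184,495,104 figure is specific to this test case, and the variable names are made up for the sketch.

def sketch_fbp_estimate(angles, slices, det_x, recon_size, fft_plan_bytes):
    """Illustrative roll-up of the allocations listed above (not the real estimator)."""
    in_slab = angles * slices * det_x * 4               # float32 input slab
    freq_slab = angles * slices * (det_x // 2 + 1) * 8  # complex64 RFFT output
    recon_out = recon_size * recon_size * slices * 4    # float32 reconstruction slab

    # phase 1: filtersinc -- FFT/IFFT plans, complex frequency data, IFFT output
    filtersinc_peak = 2 * fft_plan_bytes + freq_slab + in_slab

    # phase 2: astra wrapper -- swapaxis copy, recon output, cudaArray for texture
    # access (~input size), circular-mask scratch (fixed cost), swapaxis of the result
    mask_scratch = recon_size * recon_size * 8          # 1200 x 1200 int64
    astra_peak = in_slab + recon_out + in_slab + mask_scratch + recon_out

    # the input slab is alive throughout; the two phases largely reuse memory,
    # so only the larger of the two peaks counts
    return in_slab + max(filtersinc_peak, astra_peak)

# the test case above: (1801, 5, 2560) input, 1200 x 1200 reconstruction grid
print(sketch_fbp_estimate(1801, 5, 2560, 1200, 184_495_104))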

@dkazanc
Collaborator

dkazanc commented Aug 16, 2024

#417 resolves the issue

@dkazanc dkazanc closed this as completed Aug 16, 2024