
Large overestimation of GPU memory causing failing memory hook tests on some DLS workstations #184

Closed
yousefmoazzam opened this issue Oct 4, 2023 · 3 comments
Labels
bug Something isn't working

Comments

yousefmoazzam (Collaborator) commented Oct 4, 2023

On the DLS machine ws448, using:

  • gpuloop branch on httomo
  • library branch on httomolibgpu

and running the memory hook tests in test_httomolibgpu.py shows that one of the cases in the parametrised FBP memory hook test (namely, the recon_size_it=600 case)

@pytest.mark.parametrize("recon_size_it", [600, 1200, 2560])
def test_recon_FBP_memoryhook(slices, recon_size_it, ensure_clean_memory):
fails because the estimated number of bytes that the method uses is much larger than the peak memory usage reported by the cupy memory hook.
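
For reference, the peak figure comes from a cupy memory hook installed around the method call. Below is a minimal sketch of how such a hook can track the peak number of bytes allocated; the class name PeakMemoryHook and its bookkeeping are illustrative assumptions here, not necessarily how the test suite's own hook is implemented.

import cupy as cp

class PeakMemoryHook(cp.cuda.MemoryHook):
    """Track the peak number of bytes handed out by cupy's memory pool."""

    name = "PeakMemoryHook"

    def __init__(self):
        super().__init__()
        self.current_bytes = 0
        self.peak_bytes = 0

    def malloc_postprocess(self, device_id, size, mem_size, mem_ptr, pmem_id):
        # mem_size is the (rounded-up) size actually taken from the pool
        self.current_bytes += mem_size
        self.peak_bytes = max(self.peak_bytes, self.current_bytes)

    def free_postprocess(self, device_id, mem_size, mem_ptr, pmem_id):
        self.current_bytes -= mem_size

# usage: run the GPU method inside the hook, then read back the peak in MB
hook = PeakMemoryHook()
with hook:
    x = cp.ones((600, 600), dtype=cp.float32)  # stand-in for the FBP call
    del x
max_mem_mb = hook.peak_bytes / (1024**2)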

A threshold is defined on the size of the difference between

  • peak GPU memory usage
  • estimated memory usage

expressed as a percentage of the peak GPU memory usage. As a simple formula:

relative difference (%) = (estimated - peak) / peak * 100

This is happening in the following part of the test:

# now we compare both memory estimations
difference_mb = abs(estimated_memory_mb - max_mem_mb)
percents_relative_maxmem = round((difference_mb/max_mem_mb)*100)
# the estimated_memory_mb should be LARGER or EQUAL to max_mem_mb
# the resulting percent value should not deviate from max_mem on more than 20%
assert estimated_memory_mb >= max_mem_mb
assert percents_relative_maxmem <= 35
and it is this threshold of 35% that is being exceeded.

The output of the failing test is the following:

================================================================================== short test summary info ===================================================================================
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_FBP_memoryhook[600-3] - assert 71 <= 35
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_FBP_memoryhook[600-5] - assert 71 <= 35
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_FBP_memoryhook[600-8] - assert 71 <= 35
================================================================================ 3 failed, 6 passed in 4.69s =================================================================================

Note that manually changing the 600 case to 601 in the parametrised test produces a passing test on ws448, which potentially indicates that the 35% threshold is adequate on some machines and in some cases, but not in others.

yousefmoazzam added the bug label on Oct 4, 2023
yousefmoazzam (Collaborator, Author) commented:

Note that on other DLS machines other memory hook tests can also fail (though there's always the possibility that the CUDA OOM errors are influencing the results of the other tests).

For example, on pc0074 another recon method memory hook test fails, as does a paganin filter one (ignore the tests failing due to CUDA OOM errors; the failing tests that don't hit a CUDA OOM error instead have an assertion error about exceeding the threshold mentioned in the first comment):

================================================================================== short test summary info ===================================================================================
FAILED tests/test_backends/test_httomolibgpu.py::test_normalize_memoryhook_parametrise[512] - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 536,870,912 bytes (allocated so far: 578,813,952 bytes).
FAILED tests/test_backends/test_httomolibgpu.py::test_paganin_filter_tomopy_memoryhook[340-260-128] - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 268,435,456 bytes (allocated so far: 761,610,240 bytes).
FAILED tests/test_backends/test_httomolibgpu.py::test_paganin_filter_tomopy_memoryhook[340-320-64] - assert 76 <= 20
FAILED tests/test_backends/test_httomolibgpu.py::test_paganin_filter_tomopy_memoryhook[340-320-128] - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 268,435,456 bytes (allocated so far: 726,794,240 bytes).
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_FBP_memoryhook[2560-8] - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 295,191,552 bytes (allocated so far: 590,388,736 bytes).
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_SIRT_memoryhook[8] - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 209,715,200 bytes (allocated so far: 652,328,960 bytes).
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_CGLS_memoryhook[3] - assert 313 <= 20
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_CGLS_memoryhook[5] - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 131,072,000 bytes (allocated so far: 826,214,400 bytes).
========================================================================== 8 failed, 54 passed in 82.14s (0:01:22) ===========================================================================

despite those memory hook tests (the ones failing with assertion errors rather than CUDA OOM errors) passing on ws448.

yousefmoazzam (Collaborator, Author) commented:

Here are some concrete numbers for why recon_size_it=600 fails but recon_size_it=601 passes on ws448.

Case recon_size_it=600:

  • ~50MB peak GPU memory actually allocated
  • ~85MB estimated
  • (estimated - peak)/peak * 100 = 35/50 * 100 = 70%
  • 70 <= 35 is False, hence failing test

Case recon_size_it=601:

  • ~75MB peak GPU memory actually allocated
  • ~85MB estimated
  • (estimated - peak)/peak * 100 = 10/75 * 100 = 13.3%
  • 13.3 <= 35 is True, hence passing test

Notice that there is a "jump" of about 25MB in the peak GPU memory allocated between 600 and 601, whereas the estimate is essentially the same for both (it increases by a very small amount going from 600 to 601, which is expected).

From this it appears that, as the value of recon_size_it increases, the peak GPU memory allocated doesn't increase uniformly (ie, sometimes an increase in recon_size_it makes the peak GPU memory allocated jump, and sometimes it only bumps it up by a bit), whereas the estimated memory value appears to increase gradually as recon_size_it increases.
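
As a quick check of the arithmetic above, the snippet below plugs the approximate numbers from this comment into the same percentage computation that the test performs (the values are taken from the bullet points above, not freshly measured):

# approximate (peak, estimated) values in MB for the two cases on ws448
cases = {600: (50.0, 85.0), 601: (75.0, 85.0)}

for recon_size_it, (max_mem_mb, estimated_memory_mb) in cases.items():
    difference_mb = abs(estimated_memory_mb - max_mem_mb)
    percents_relative_maxmem = round((difference_mb / max_mem_mb) * 100)
    verdict = "passes" if percents_relative_maxmem <= 35 else "fails"
    print(recon_size_it, percents_relative_maxmem, verdict)

# prints 70/"fails" for the 600 case and 13/"passes" for the 601 case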

dkazanc (Collaborator) commented Jun 17, 2024

I guess this is not an issue now, but instead we've got #365

dkazanc closed this as not planned on Jun 17, 2024