
Large overestimation of GPU memory causing failing memory hook tests on some DLS workstations #184

Closed
yousefmoazzam opened this issue Oct 4, 2023 · 3 comments
Labels
bug Something isn't working

Comments

yousefmoazzam (Collaborator) commented Oct 4, 2023

On the DLS machine ws448, using:

  • gpuloop branch on httomo
  • library branch on httomolibgpu

and running the memory hook tests in test_httomolibgpu.py shows that one of the cases in the parametrised FBP memory hook test (namely, the recon_size_it=600 case)

@pytest.mark.parametrize("recon_size_it", [600, 1200, 2560])
def test_recon_FBP_memoryhook(slices, recon_size_it, ensure_clean_memory):
fails because the estimated number of bytes that the method uses is much larger than the peak memory usage reported by the cupy memory hook.
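
For reference, the peak figure comes from a cupy memory hook installed around the method call. Below is a minimal sketch of how such a hook can track the peak number of bytes allocated; the class name PeakMemoryHook and its bookkeeping are illustrative assumptions here, not necessarily how the test suite's own hook is implemented.

import cupy as cp

class PeakMemoryHook(cp.cuda.MemoryHook):
    """Track the peak number of bytes handed out by cupy's memory pool."""

    name = "PeakMemoryHook"

    def __init__(self):
        super().__init__()
        self.current_bytes = 0
        self.peak_bytes = 0

    def malloc_postprocess(self, device_id, size, mem_size, mem_ptr, pmem_id):
        # mem_size is the (rounded-up) size actually taken from the pool
        self.current_bytes += mem_size
        self.peak_bytes = max(self.peak_bytes, self.current_bytes)

    def free_postprocess(self, device_id, mem_size, mem_ptr, pmem_id):
        self.current_bytes -= mem_size

# usage: run the GPU method inside the hook, then read back the peak in MB
hook = PeakMemoryHook()
with hook:
    x = cp.ones((600, 600), dtype=cp.float32)  # stand-in for the FBP call
    del x
max_mem_mb = hook.peak_bytes / (1024**2)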

A threshold is defined on the size of the difference between

  • peak GPU memory usage
  • estimated memory usage

expressed as a percentage of the peak GPU memory usage. As a simple formula:

relative difference (%) = (estimated - peak) / peak * 100

This is happening in the following part of the test:

# now we compare both memory estimations
difference_mb = abs(estimated_memory_mb - max_mem_mb)
percents_relative_maxmem = round((difference_mb/max_mem_mb)*100)
# the estimated_memory_mb should be LARGER or EQUAL to max_mem_mb
# the resulting percent value should not deviate from max_mem on more than 20%
assert estimated_memory_mb >= max_mem_mb
assert percents_relative_maxmem <= 35
and it is this threshold of 35% that is being exceeded.

The output of the failing test is the following:

================================================================================== short test summary info ===================================================================================
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_FBP_memoryhook[600-3] - assert 71 <= 35
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_FBP_memoryhook[600-5] - assert 71 <= 35
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_FBP_memoryhook[600-8] - assert 71 <= 35
================================================================================ 3 failed, 6 passed in 4.69s =================================================================================

Note that manually changing the 600 case to 601 in the parametrised test produces a passing test on ws448, which potentially indicates that the 35% threshold is adequate on some machines and in some cases, but not in others.

yousefmoazzam added the bug label on Oct 4, 2023
yousefmoazzam (Collaborator, Author) commented:

Note that on other DLS machines other memory hook tests can also fail (though there's always the possibility that the CUDA OOM errors are influencing the results of the other tests).

For example, on pc0074 another recon method memory hook test fails, as does a paganin filter one (ignore the tests failing due to CUDA OOM errors; the failing tests that don't hit a CUDA OOM error instead have an assertion error about exceeding the threshold mentioned in the first comment):

================================================================================== short test summary info ===================================================================================
FAILED tests/test_backends/test_httomolibgpu.py::test_normalize_memoryhook_parametrise[512] - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 536,870,912 bytes (allocated so far: 578,813,952 bytes).
FAILED tests/test_backends/test_httomolibgpu.py::test_paganin_filter_tomopy_memoryhook[340-260-128] - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 268,435,456 bytes (allocated so far: 761,610,240 bytes).
FAILED tests/test_backends/test_httomolibgpu.py::test_paganin_filter_tomopy_memoryhook[340-320-64] - assert 76 <= 20
FAILED tests/test_backends/test_httomolibgpu.py::test_paganin_filter_tomopy_memoryhook[340-320-128] - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 268,435,456 bytes (allocated so far: 726,794,240 bytes).
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_FBP_memoryhook[2560-8] - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 295,191,552 bytes (allocated so far: 590,388,736 bytes).
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_SIRT_memoryhook[8] - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 209,715,200 bytes (allocated so far: 652,328,960 bytes).
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_CGLS_memoryhook[3] - assert 313 <= 20
FAILED tests/test_backends/test_httomolibgpu.py::test_recon_CGLS_memoryhook[5] - cupy.cuda.memory.OutOfMemoryError: Out of memory allocating 131,072,000 bytes (allocated so far: 826,214,400 bytes).
========================================================================== 8 failed, 54 passed in 82.14s (0:01:22) ===========================================================================

despite those memory hook tests (the ones failing with assertion errors rather than CUDA OOM errors) passing on ws448.

yousefmoazzam (Collaborator, Author) commented:

Here are some concrete numbers for why recon_size_it=600 fails but recon_size_it=601 passes on ws448.

Case recon_size_it=600:

  • ~50MB peak GPU memory actually allocated
  • ~85MB estimated
  • (estimated - peak)/peak * 100 = 35/50 * 100 = 70%
  • 70 <= 35 is False, hence failing test

Case recon_size_it=601:

  • ~75MB peak GPU memory actually allocated
  • ~85MB estimated
  • (estimated - peak)/peak * 100 = 10/75 * 100 = 13.3%
  • 13.3 <= 35 is True, hence passing test

Notice that there is a "jump" of about 25MB in the peak GPU memory allocated between 600 and 601, whereas the estimate is essentially the same for both (it increases by a very small amount going from 600 to 601, which is expected).

From this it appears that, as the value of recon_size_it increases, the peak GPU memory allocated doesn't increase uniformly (ie, sometimes an increase in recon_size_it makes the peak GPU memory allocated jump, and sometimes it only bumps it up by a bit), whereas the estimated memory value appears to increase gradually as recon_size_it increases.
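
As a quick check of the arithmetic above, the snippet below plugs the approximate numbers from this comment into the same percentage computation that the test performs (the values are taken from the bullet points above, not freshly measured):

# approximate (peak, estimated) values in MB for the two cases on ws448
cases = {600: (50.0, 85.0), 601: (75.0, 85.0)}

for recon_size_it, (max_mem_mb, estimated_memory_mb) in cases.items():
    difference_mb = abs(estimated_memory_mb - max_mem_mb)
    percents_relative_maxmem = round((difference_mb / max_mem_mb) * 100)
    verdict = "passes" if percents_relative_maxmem <= 35 else "fails"
    print(recon_size_it, percents_relative_maxmem, verdict)

# prints 70/"fails" for the 600 case and 13/"passes" for the 601 case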

dkazanc (Collaborator) commented Jun 17, 2024

I guess this is not an issue now, but instead we've got #365

dkazanc closed this as not planned on Jun 17, 2024