
CUDA runs out of memory during tuning of patched layer #120

Closed
Philippe-Drolet opened this issue Jan 4, 2022 · 4 comments

@Philippe-Drolet (Contributor)

Hello,

When I use transistor=False with use_bindings=True and tile_shape=(128, 128), an error is raised when trying to tune the layers in memtorch_bindings.tiled_inference(). I believe this is because too much memory is required to allocate ABCD_matrix_indices_x and ABCD_matrix_indices_y on the GPU. In tile_matmul_kernels.cu, non_zero_elements = 8 * 128 * 128 - 2 * 128 - 2 * 128 = 130560 and n_kernels = 1024 (grid.x) * 1024 (grid.y) * 64 (grid.z) * 5 (block.x) * 1 (block.y) * 1 (block.z) = 335544320. These two values are then multiplied together (and by sizeof(int)), leading to an incredibly large allocation that could never fit on the GPU. This happens during the _tune() call, after the network was successfully patched.

The network I use is a very simple MNIST recognition network where the images were shrunk to 10x10 (100 inputs), with a single hidden layer of size 50, and I only use Linear layers. Here is the patch_model function call along with the required parameters:

reference_memristor_params_dd = {'time_series_resolution': 2e-7, "r_on": 1800, "r_off": 2500, "A_p": 600.10075,
                                 "t_p": -0.0212028, "A_n": -34.5988399, "t_n": -0.05343997,
                                 "r_p": [2699.2336, -672.930205], "r_n": [649.413746, -1474.32358], "a_p": 0.32046175,
                                 "b_p": 2.71689828, "a_n": 0.32046175, "b_n": 2.71689828}

patched_model = patch_model(copy.deepcopy(network),
                                    memristor_model=memtorch.bh.memristor.Data_Driven2021,
                                    memristor_model_params=reference_memristor_params_dd,
                                    module_parameters_to_patch=[torch.nn.Linear],
                                    mapping_routine=naive_map,
                                    transistor=False,
                                    programming_routine=naive_program,
                                    programming_routine_params={"rel_tol": 0.05,
                                        "pulse_duration": 2e-7,
                                        "refactory_period": 0,
                                        "pos_voltage_level": 1.2,
                                        "neg_voltage_level": -1.2,
                                        "timeout": 5,
                                        "simulate_neighbours" : True,
                                        "force_adjustment": 1e-2,
                                        "force_adjustment_rel_tol": 1e-1,
                                        "force_adjustment_pos_voltage_threshold": 1.8,
                                        "force_adjustment_neg_voltage_threshold": -1.8, },
                                    tile_shape=(128,128),
                                    scheme=memtorch.bh.Scheme.DoubleColumn,
                                    p_l = None,
                                    max_input_voltage=1.0,
                                    ADC_resolution=32,
                                    ADC_overflow_rate=0,
                                    source_resistance=5,
                                    line_resistance=5,
                                    random_crossbar_init=False,
                                    quant_method="linear")

patched_model.tune_()  # error occurs here

Maybe this error has never occurred before because it is difficult to simulate large networks with the naive_program routine; it first occurred after I had trained my model using a CUDA data-driven simulation routine (related to another issue I am developing), which makes it faster to train large networks. Some of the programming_routine_params may appear strange because they are related to that CUDA data-driven routine. To make sure this issue was not caused by the code I have added, I replaced lines 234 to 254 of Crossbar.py with lines 228 to 232 of the same file and commented out anything that was not already there. I did not try with naive_program, as it takes far too long to program 128 x 128 tile crossbars. Thank you for your time! Do not hesitate to reach out if you have any questions.

Philippe

@coreylammie (Owner)

Hi @Philippe-Drolet,

Apologies again for the delayed response. This looks like an oversight on my end! The amount of allocable memory (using cudaMalloc) can be set using cuda_malloc_heap_size, which defaults to 50MB if CUDA is enabled.

For linear layers, see https://github.com/coreylammie/MemTorch/blob/master/memtorch/mn/Linear.py#L106 and https://github.com/coreylammie/MemTorch/blob/master/memtorch/cu/tile_matmul_kernels.cu#L242-L244 for how this is set. I considered adding cuda_malloc_heap_size as an input argument to all memtorch.mn modules, and to memtorch.mn.Module.patch_model, but chose not to, as when testing, 50MB appeared to be more than sufficient for most layers which were tested.

Instead, set_cuda_malloc_heap_size https://github.com/coreylammie/MemTorch/blob/master/memtorch/mn/Module.py#L210-L214 can be used to set this for all layers of a patched network.
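
For reference, this heap size ultimately maps onto the standard CUDA runtime device limit; a minimal standalone sketch (assuming the bindings rely on cudaDeviceSetLimit; this is not MemTorch's exact code) is:

#include <cstdio>
#include <cuda_runtime.h>

int main() {
  // Illustrative only: raise the in-kernel malloc heap limit to 50 MB,
  // the cuda_malloc_heap_size default mentioned above.
  size_t heap_size = 50 * 1024 * 1024;
  if (cudaDeviceSetLimit(cudaLimitMallocHeapSize, heap_size) != cudaSuccess) {
    printf("Failed to set the malloc heap size limit.\n");
    return 1;
  }
  size_t current_limit;
  cudaDeviceGetLimit(&current_limit, cudaLimitMallocHeapSize);
  printf("Malloc heap limit: %zu bytes\n", current_limit);
  return 0;
}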

That being said, currently, only the maximum number of threads is considered, and not the total available VRAM on the GPU being used. cudaGetDeviceProperties, which is used here: https://github.com/coreylammie/MemTorch/blob/master/memtorch/cu/tile_matmul_kernels.cu#L215, can be used to determine the total amount of global memory. Ideally, this should be taken into account when allocating memory and determining the number of kernels to execute, to avoid OOM errors. As a side note, using PyTorch, this can now be determined directly (see pytorch/pytorch#58635) before calling any bindings.

I'll try my best to get around to doing this in the future. If the total required amount of memory exceeds the free GPU memory of the device being used, and the number of kernels to execute in parallel is below a set threshold, then execution should naturally fall back to the CPU.

I have yet to re-implement/integrate the following logic myself, but https://github.com/louisprimeau/crossbar-simulator/blob/master/sim/crossbar/crossbar.py#L153-L247 can be used to construct and solve ABCD V = E quite efficiently on the CPU in Python, without bindings. A C++ binding implementing the same logic may be a great fallback option in this regard.

Kind Regards,

Corey.

@Philippe-Drolet (Contributor, Author)

Hello,

Thank you for replying! Do not worry about the time it takes you to respond; as others have pointed out, you are the sole developer of this simulator, and it is perfectly understandable. I changed cuda_malloc_heap_size to the maximum value permitted by my card and it is not enough. I do not believe the problem can be solved like this, as 130560 * 335544320 * 4 bytes (sizeof(int)) = ~175234 GB, which could never be allocated on the GPU to begin with. The problem would rather have something to do with this line of tile_matmul_kernels.cu (https://github.com/coreylammie/MemTorch/blob/master/memtorch/cu/tile_matmul_kernels.cu):

int n_kernels = grid.x * block.x * grid.y * block.y * grid.z * block.z;

or with how the grid and blocks are defined. Unless there is something I do not understand. Thank you!

Sincerely,

Philippe

@coreylammie (Owner)

Hi @Philippe-Drolet,

n_kernels is currently computed using the product of all grid and block dimensions here: https://github.com/coreylammie/MemTorch/blob/master/memtorch/cu/tile_matmul_kernels.cu#L278.

Grid and block dimensions are dependent on the maximum number of threads available on the CUDA device being used:

cudaDeviceProp prop;
cudaGetDeviceProperties(&prop, 0);
int *max_threads_dim = prop.maxThreadsDim; // maximum size of each block dimension (x, y, z)
dim3 grid;
dim3 block;
if (max_threads_dim[0] >= limit_i && max_threads_dim[1] >= limit_j &&
    max_threads_dim[2] >= limit_k) {
  // If multiple blocks are not required
  grid = {(unsigned int)limit_i, (unsigned int)limit_j,
          (unsigned int)limit_k};
  block = {1, 1, 1};
} else {
  // If multiple blocks are required
  grid = {(unsigned int)max_threads_dim[0], (unsigned int)max_threads_dim[1],
          (unsigned int)max_threads_dim[2]};
  block = {(unsigned int)ceil_int_div(limit_i, max_threads_dim[0]),
           (unsigned int)ceil_int_div(limit_j, max_threads_dim[1]),
           (unsigned int)ceil_int_div(limit_k, max_threads_dim[2])};
}

This means that n_kernels is dependent on the maximum number of available threads of the CUDA device being used, and not on the amount of allocatable GPU memory. Ideally, it should depend on both.

The total number of bytes required (of CUDA memory) to launch tile_matmul_kernel_A can be determined as a function of n, m, non_zero_elements, and n_kernels, as follows:

(3 * sizeof(int) * non_zero_elements * n_kernels) +
(2 * sizeof(double) * non_zero_elements * n_kernels) +
(sizeof(int) * (2 * n * m) * n_kernels) +
(sizeof(double) * (2 * m * n) * n_kernels)

Using cudaGetDeviceProperties, the maximum allocatable byte size on the CUDA device being used can be determined. Logic should be added to limit the total number of kernels, such that the total number of bytes required (of CUDA memory) does not exceed the maximum allocatable byte size on the CUDA device being used.
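
As a rough illustration (not MemTorch code), such a cap could be computed as follows; n, m, and non_zero_elements are assumed to be known before launch, and the helper name max_kernels_for_free_memory is hypothetical:

#include <algorithm>
#include <cuda_runtime.h>

// Sketch: cap the number of concurrently-launched kernels so that the
// per-kernel workspace (the byte count above) fits in free device memory.
int max_kernels_for_free_memory(int n, int m, int non_zero_elements) {
  size_t free_bytes, total_bytes;
  cudaMemGetInfo(&free_bytes, &total_bytes);
  // Bytes required per kernel instance, following the expression above.
  size_t bytes_per_kernel =
      (3 * sizeof(int) + 2 * sizeof(double)) * (size_t)non_zero_elements +
      (sizeof(int) + sizeof(double)) * (size_t)(2 * n * m);
  // Leave ~10% headroom for other allocations.
  size_t usable = (size_t)(0.9 * (double)free_bytes);
  return (int)std::max<size_t>(1, usable / bytes_per_kernel);
}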

This would require the CUDA kernel tile_matmul_kernel_A to be launched multiple times using a grid stride loop. I can have a further look into this in the next couple of days, and add the necessary logic.

You may find the following blog post helpful: https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/. Essentially, when it is not feasible to launch a kernel instance for each element (due to block/thread or memory constraints), grid stride loops should be used.
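
For reference, the canonical pattern from that post is the following (a generic SAXPY sketch, not tile_matmul_kernel_A itself):

// Generic grid-stride loop: each thread processes multiple elements, so the
// grid size no longer has to match the problem size.
__global__ void saxpy(int n, float a, const float *x, float *y) {
  for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += blockDim.x * gridDim.x) {
    y[i] = a * x[i] + y[i];
  }
}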

Hopefully this makes more sense!

coreylammie added a commit that referenced this issue Feb 1, 2022
Updated CUDA bindings for passive crossbar inference routines to avoid OOM errors (#120).
@coreylammie (Owner)

After looking at this more closely, I have completely redeveloped the kernels used to solve for the output currents of tiled passive crossbars during inference. This improved logic/functionality has been merged to master in #123.

Currently, SparseLU factorization is performed using Eigen. I believe performance can be improved using cuSPARSE and cuSOLVER; however, I have left this as a future improvement. The following code can be used to tune a 1,000 x 1,000 linear layer with passive crossbar tiles of shape (512, 512):

import torch
import memtorch
from memtorch.bh.crossbar.Tile import gen_tiles
from memtorch.map.Input import naive_scale
from memtorch.map.Parameter import naive_map
import memtorch_cuda_bindings as memtorch_bindings

device = torch.device('cuda:0')
linear = torch.nn.Linear(1000, 1000, bias=True).to(device)
m_linear = memtorch.mn.Linear(
    linear_layer=linear,
    memristor_model=memtorch.bh.memristor.VTEAM,
    memristor_model_params={'r_on': 1e5, 'r_off': 1e6},
    mapping_routine=naive_map,
    transistor=False,
    programming_routine=None,
    tile_shape=(512, 512),
    max_input_voltage=0.3,
    scaling_routine=naive_scale,
    source_resistance=2,
    line_resistance=2,
    ADC_resolution=8,
    ADC_overflow_rate=0.0,
    quant_method='linear',
)

m_linear.tune(input_shape=10)
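
As a rough starting point for the cuSPARSE/cuSOLVER route mentioned above (nothing like this is implemented yet), the Eigen SparseLU step could in principle be swapped for cuSOLVER's host-side sparse QR solver; the sketch below assumes the ABCD matrix of a single tile is already assembled in CSR form, and omits error checking:

#include <cusolverSp.h>
#include <cusparse.h>

// Sketch: solve ABCD V = E for one tile using cuSOLVER's host-side sparse QR
// solver (cusolverSpDcsrlsvqrHost), as a possible drop-in for Eigen SparseLU.
void solve_abcd_csr(int n, int nnz, const double *csr_values,
                    const int *csr_row_ptr, const int *csr_col_ind,
                    const double *E, double *V) {
  cusolverSpHandle_t handle;
  cusolverSpCreate(&handle);
  cusparseMatDescr_t descr;
  cusparseCreateMatDescr(&descr);
  cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
  cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);
  int singularity = 0;
  // tol = 1e-12; reorder = 1 (symrcm) to reduce zero fill-in.
  cusolverSpDcsrlsvqrHost(handle, n, nnz, descr, csr_values, csr_row_ptr,
                          csr_col_ind, E, 1e-12, 1, V, &singularity);
  cusparseDestroyMatDescr(descr);
  cusolverSpDestroy(handle);
}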

Considering that during inference, tiles are now solved in series and not parallel, simulating larger passive crossbar architectures may take a considerable amount of time. If you have any suggestions/ideas as to how this functionality can be improved, please let me know!
