CUDA runs out of memory during tuning of patched layer #120
Hi @Philippe-Drolet, apologies again for the delayed response. This looks like an oversight on my end! The amount of allocable memory is controlled by `cuda_malloc_heap_size`. For linear layers, see https://github.com/coreylammie/MemTorch/blob/master/memtorch/mn/Linear.py#L106 and https://github.com/coreylammie/MemTorch/blob/master/memtorch/cu/tile_matmul_kernels.cu#L242-L244 for how this is set. That being said, currently only the maximum number of threads is considered, and not the total VRAM available on the GPU being used. I'll try my best to get around to fixing this in the future: if the total required amount of memory exceeds the free GPU memory of the device being used, and the number of kernels to execute in parallel is below a set threshold, then execution should naturally fall back to the CPU. I have yet to re-implement/integrate the exact logic myself, but https://github.com/louisprimeau/crossbar-simulator/blob/master/sim/crossbar/crossbar.py#L153-L247 can be used as a reference for constructing and solving the underlying sparse systems. Kind regards, Corey.
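A minimal sketch of the free-memory check described above could look like the following (illustrative only; `fits_on_gpu` is a hypothetical helper, not existing MemTorch code):

```cpp
#include <cuda_runtime.h>

// Sketch: decide whether a requested allocation fits in the free device
// memory reported by the CUDA runtime; if not, the caller could fall back
// to a CPU implementation instead of failing with an OOM error.
bool fits_on_gpu(size_t required_bytes) {
    size_t free_bytes = 0, total_bytes = 0;
    cudaMemGetInfo(&free_bytes, &total_bytes);
    return required_bytes <= free_bytes;
}
```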
Hello, thank you for replying! Do not worry about the time it takes you to respond; as others have pointed out, you are the sole developer of this simulator, and it is perfectly understandable. I changed `cuda_malloc_heap_size` to the maximum value permitted by my card, and it is not enough. I do not believe the problem can be solved like this, as 130560 * 335544320 * 4 bytes (sizeof(int)) = 175,234 GB, which could never be allocated on the GPU to begin with. The problem would rather have something to do with this line of https://github.com/coreylammie/MemTorch/blob/master/memtorch/cu/tile_matmul_kernels.cu: `int n_kernels = grid.x * block.x * grid.y * block.y * grid.z * block.z;`, or with how the grid and blocks are defined. Unless there is something I do not understand. Thank you, sincerely, Philippe
Hi @Philippe-Drolet,
Grid and block dimensions are dependent on the maximum number of threads available on the CUDA device being used:
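As a rough sketch (not the exact MemTorch logic), the device limits that bound grid and block dimensions can be queried as follows:

```cpp
#include <cstdio>
#include <cuda_runtime.h>

// Sketch: query the device limits that constrain grid and block dimensions.
int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);
    printf("maxThreadsDim: (%d, %d, %d)\n", prop.maxThreadsDim[0],
           prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("maxGridSize: (%d, %d, %d)\n", prop.maxGridSize[0],
           prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}
```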
This means that `n_kernels` is derived from the device's thread limits rather than from the size of the problem. The total number of bytes required (of CUDA memory) to launch the kernel scales with `n_kernels`, so it can grow far beyond what the device can actually allocate.
Using grid-stride loops would require the CUDA kernels to be restructured so that each thread processes multiple elements. You may find the following blog post helpful: https://developer.nvidia.com/blog/cuda-pro-tip-write-flexible-kernels-grid-stride-loops/. Essentially, when it is not feasible to launch a kernel instance for each element (due to block/thread or memory constraints), grid-stride loops should be used. Hopefully this makes more sense!
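For reference, the canonical pattern from that post (a generic SAXPY kernel, not one of MemTorch's) looks like this:

```cuda
// Grid-stride loop: each thread handles several elements, striding by the
// total number of launched threads, so the launch configuration no longer
// needs one thread per element.
__global__ void saxpy(int n, float a, const float *x, float *y) {
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x) {
        y[i] = a * x[i] + y[i];
    }
}
```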
Updated CUDA bindings for passive crossbar inference routines to avoid OOM errors (#120).
After looking at this more closely, I have completely redeveloped the kernels used to solve for the output currents of tiled passive crossbars during inference. This improved logic/functionality has been merged to master in #123; with it, a 1,000x1,000 linear layer with passive crossbar tiles of size/shape (512, 512) can be tuned without exhausting GPU memory. Currently, SparseLU factorization is performed using Eigen. I believe performance can be improved using cuSPARSE and cuSOLVER; however, I have left this as a future improvement.
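For anyone curious, solving a single tile's system with Eigen's SparseLU boils down to the following pattern (a generic sketch with assumed types and an omitted matrix assembly, not the exact code merged in #123):

```cpp
#include <Eigen/Sparse>
#include <vector>

// Solve A x = b for one crossbar tile, where A is the sparse conductance
// matrix assembled elsewhere as a triplet list. Generic sketch only.
Eigen::VectorXd solve_tile(int n,
                           const std::vector<Eigen::Triplet<double>> &entries,
                           const Eigen::VectorXd &b) {
    Eigen::SparseMatrix<double> A(n, n);
    A.setFromTriplets(entries.begin(), entries.end());
    Eigen::SparseLU<Eigen::SparseMatrix<double>> solver;
    solver.analyzePattern(A);  // symbolic factorization
    solver.factorize(A);       // numeric factorization
    return solver.solve(b);
}
```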
Considering that during inference, tiles are now solved in series rather than in parallel, simulating larger passive crossbar architectures may take a considerable amount of time. If you have any suggestions/ideas as to how this functionality can be improved, please let me know!
Hello,
When I use `transistor = False` with `use_bindings = True` and `tile_shape = (128, 128)`, an error occurs when trying to tune the layers in `memtorch_bindings.tiled_inference()`. I believe this is because too much memory is required to allocate `ABCD_matrix_indices_x` and `ABCD_matrix_indices_y` on the GPU. In tile_matmul_kernels.cu, `non_zero_elements` = 8 * 128 * 128 - 2 * 128 - 2 * 128 = 130560 and `n_kernels` = 1024 (grid.x) * 1024 (grid.y) * 64 (grid.z) * 5 (block.x) * 1 (block.y) * 1 (block.z) = 335544320; these two values are then multiplied together (and by sizeof(int)), leading to an incredibly large allocation that cannot fit on the GPU. This happens during the `_tune()` call, after the network was successfully patched.
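For concreteness, the requested allocation works out as follows (values copied from above; a standalone illustration, not MemTorch code):

```cpp
#include <cstdio>

int main() {
    // Values reported above for a (128, 128) tile.
    long long non_zero_elements = 8LL * 128 * 128 - 2 * 128 - 2 * 128; // 130560
    long long n_kernels = 1024LL * 1024 * 64 * 5;                      // 335544320
    long long bytes = non_zero_elements * n_kernels * (long long)sizeof(int);
    printf("Requested allocation: %lld bytes (~%lld GB)\n",
           bytes, bytes / 1000000000LL); // ~175,234 GB
    return 0;
}
```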
The network I use is a very simple MNIST recognition network where the images were shrunk to 10x10 (100 inputs), with a single hidden layer of size 50, and I only use Linear layers. The model is patched with the patch_model function along with the required params.
Maybe this error has never occurred before because it is difficult to simulate large networks with the naive_program routine; it first occurred after I had trained my model using a CUDA data-driven simulation routine, related to another issue that I am developing, which makes it faster to train large networks. Some of the programming_routine_params may appear strange because they are related to that CUDA data-driven routine. To make sure that this issue was not caused by the code I have added, I replaced lines 234 to 254 of Crossbar.py with lines 228 to 232 of the same file and commented out anything that was not already there. I did not try with naive_program, as it takes far too long to program 128x128 crossbar tiles. Thank you for your time! Do not hesitate if you have any questions.
Philippe