
Parallelize/Optimize Device Simulation Logic #53

Open
coreylammie opened this issue Jun 14, 2021 · 5 comments
Assignees
Labels
enhancement New feature or request
Projects

Comments

@coreylammie
Owner

Currently, when performing inference, or when programming devices in passive arrays (0T1R arrangements), devices are simulated in a sequential manner. CUDA kernels and other optimization methods could be used to drastically improve performance, as some of these operations are not easily parallelized through the Python API.

@coreylammie coreylammie self-assigned this Jun 14, 2021
@coreylammie coreylammie added the enhancement New feature or request label Jun 14, 2021
@coreylammie coreylammie added this to To do in MemTorch via automation Jun 14, 2021
@stale stale bot added the stale label Jul 14, 2021
Repository owner deleted a comment from stale bot Jul 18, 2021
@stale stale bot removed the stale label Jul 18, 2021
@Philippe-Drolet
Contributor

Philippe-Drolet commented Oct 29, 2021

Hello, I was wondering if any progress has been made on this issue? Otherwise, I would like to get started on it. Thank you!

@coreylammie
Owner Author

Hi @Philippe-Drolet,

I have prioritized the implementation of the torch.nn.RNN, torch.nn.RNNCell, torch.nn.LSTM, torch.nn.LSTMCell, torch.nn.GRU, and torch.nn.GRUCell modules, so I will likely be unable to work on this issue in the near future.

You are welcome to contribute yourself! I'm happy to answer any questions you may have.

@Philippe-Drolet
Contributor

Philippe-Drolet commented Nov 15, 2021

Hello,

So I have started work on this. I was curious what you would recommend for debugging the CUDA files when using them through the Python interface. So far, I have created a new pytest using the debug networks you have defined, but when I get to debugging my new .cu files, I cannot step through them line by line as I would a regular Python file. I am currently using Visual Studio Code to run the tests; which IDE are you using (I suppose it is impossible to debug the C++ files with PyCharm)? Any guidance would help, and I am also simply curious how you do it.

Also, do you have any documentation on the purpose of the ABCD_E matrices from simulate passive? Thanks!

@coreylammie
Owner Author

Hi @Philippe-Drolet,

Sure! My preferred method of debugging is to use cuda-memcheck. It can be used to pinpoint the exact line/kernel and the respective error message, as long as the -lineinfo flag is added during compilation. This has been done here: https://github.com/coreylammie/MemTorch/blob/master/setup.py#L46.

This tool can be used when executing a Python script that calls a C++/CUDA binding which launches one or more CUDA kernels. It can be invoked as follows: cuda-memcheck python test.py. When debugging an especially problematic kernel, I would suggest setting the environment variable CUDA_LAUNCH_BLOCKING=1, so that only one kernel is executed at a time, i.e., CUDA_LAUNCH_BLOCKING=1 cuda-memcheck python test.py can be used.
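As a side note (not from the thread above, but standard CUDA runtime behavior), CUDA_LAUNCH_BLOCKING can also be set from inside the test script itself, provided it is set before the CUDA runtime is initialized, i.e., before torch or any CUDA bindings are imported. A minimal sketch:

```python
# Sketch: forcing synchronous kernel launches from within a Python test script.
# CUDA_LAUNCH_BLOCKING must be set before the CUDA runtime initializes,
# so it must precede the import of torch (and any CUDA extension modules).
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # kernels now launch synchronously

# import torch  # imports must come AFTER the environment variable is set
```

This is equivalent to prefixing the command line with CUDA_LAUNCH_BLOCKING=1, but keeps the setting inside the test itself so it cannot be forgotten.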

In addition, cudaSafeCall, which is defined in memtorch.cu.utils.cuh, can be used. Technically, breakpoints can be added using NVIDIA Nsight; however, in my experience, this is cumbersome to use, and printf statements can easily be used alongside cuda-memcheck, enabling the use of your preferred IDE.

The ABCD_E matrices were originally proposed and defined in [1]. They are used to solve for node voltages using linear algebra, while accounting for source and line resistances in crossbar architectures. Solving these systems efficiently is rather nuanced, as the ABCD matrix is sparse, and sparse linear systems are difficult to solve in a parallelized manner.
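For intuition only (this is a toy stand-in, not MemTorch's actual construction), the ABCD_E formulation of [1] reduces the crossbar to a linear system A v = e, where the sparse "ABCD" matrix A collects device, source, and line conductances, and the "E" vector e encodes the applied voltages. A minimal dense sketch with NumPy:

```python
# Toy illustration of the "solve A v = e for node voltages" step.
# The 4x4 matrix below is an arbitrary well-conditioned stand-in; the real
# ABCD matrix is large and sparse, so in practice a sparse solver
# (e.g. scipy.sparse.linalg.spsolve, or a GPU sparse solver) would be used.
import numpy as np

A = np.array([
    [ 4.0, -1.0,  0.0,  0.0],
    [-1.0,  4.0, -1.0,  0.0],
    [ 0.0, -1.0,  4.0, -1.0],
    [ 0.0,  0.0, -1.0,  4.0],
])
e = np.array([1.0, 0.0, 0.0, 1.0])  # excitation vector (the "E" part)

v = np.linalg.solve(A, e)           # node voltages
assert np.allclose(A @ v, e)        # the solution satisfies A v = e
```

The parallelization difficulty mentioned above comes precisely from this step: direct sparse factorizations have limited parallelism, which is why mapping the solve to CUDA is non-trivial.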

Hopefully, this helps! I'm happy to answer any further questions you may have.

[1] A. Chen, “A Comprehensive Crossbar Array Model With Solutions for Line Resistance and Nonlinear Device Characteristics,” IEEE Transactions on Electron Devices, vol. 60, no. 4, pp. 1318–1326, Apr. 2013, doi: 10.1109/ted.2013.2246791.

@Philippe-Drolet
Contributor

Thank you very much for this response. I will go with the good old printf approach; it seems to work so far!

coreylammie pushed a commit that referenced this issue Feb 10, 2022

Implementation of CUDA accelerated passive crossbar programming routines for the 2021 Data-Driven model (#125) as a partial solution to (#53).