
Parallelize/Optimize Device Simulation Logic #53

Open
coreylammie opened this issue Jun 14, 2021 · 5 comments
Assignees
Labels
enhancement New feature or request
Projects

Comments

@coreylammie
Owner

Currently, when performing inference, or when programming devices in passive arrays (0T1R arrangements), devices are simulated in a sequential manner. CUDA kernels and other optimization methods could be used to drastically improve performance, as some of these operations are not easily parallelized through the Python API.

@coreylammie coreylammie self-assigned this Jun 14, 2021
@coreylammie coreylammie added the enhancement New feature or request label Jun 14, 2021
@coreylammie coreylammie added this to To do in MemTorch via automation Jun 14, 2021
@stale stale bot added the stale label Jul 14, 2021
Repository owner deleted a comment from stale bot Jul 18, 2021
@stale stale bot removed the stale label Jul 18, 2021
@Philippe-Drolet
Contributor

Philippe-Drolet commented Oct 29, 2021

Hello, I was wondering if any progress has been made on this issue? Otherwise, I would like to get started on it. Thank you!

@coreylammie
Owner Author

Hi @Philippe-Drolet,

I have prioritized the implementation of the torch.nn.RNN, torch.nn.RNNCell, torch.nn.LSTM, torch.nn.LSTMCell, torch.nn.GRU, and torch.nn.GRUCell modules, so I will likely be unable to work on this issue in the near future.

You are welcome to contribute yourself! I'm happy to answer any questions you may have.

@Philippe-Drolet
Contributor

Philippe-Drolet commented Nov 15, 2021

Hello,

So I have started work on this. I was curious what you would recommend for debugging the CUDA files when using them through the Python interface. So far, I have created a new pytest using the debug networks you have defined, but when I get to debugging my new .cu files, I cannot step through them line by line as I would a regular Python file. I am currently using Visual Studio Code to run the tests; which IDE are you using (I suppose it is impossible to debug the C++ files with PyCharm)? Any guidance would help, and I am also simply curious how you do it.

Also, do you have any documentation on the purpose of the ABCD_E matrices from simulate passive? Thanks!

@coreylammie
Owner Author

Hi @Philippe-Drolet,

Sure! My preferred method of debugging is to use cuda-memcheck. It can be used to pinpoint the exact line/kernel and the respective error message, as long as the -lineinfo flag is added during compilation. This has been done here: https://github.com/coreylammie/MemTorch/blob/master/setup.py#L46.

This tool can be used when executing a Python script that calls a C++/CUDA binding which launches one or more CUDA kernels. It can be invoked as follows: cuda-memcheck python test.py. When debugging an especially problematic kernel, I would suggest setting the environment variable CUDA_LAUNCH_BLOCKING=1, so that only one kernel is executed at a time, i.e., CUDA_LAUNCH_BLOCKING=1 cuda-memcheck python test.py can be used.
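As a side note (not from the thread above, but standard CUDA runtime behavior), CUDA_LAUNCH_BLOCKING can also be set from inside the test script itself, provided it is set before the CUDA runtime is initialized, i.e., before torch or any CUDA bindings are imported. A minimal sketch:

```python
# Sketch: forcing synchronous kernel launches from within a Python test script.
# CUDA_LAUNCH_BLOCKING must be set before the CUDA runtime initializes,
# so it must precede the import of torch (and any CUDA extension modules).
import os

os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # kernels now launch synchronously

# import torch  # imports must come AFTER the environment variable is set
```

This is equivalent to prefixing the command line with CUDA_LAUNCH_BLOCKING=1, but keeps the setting inside the test itself so it cannot be forgotten.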

In addition, cudaSafeCall, which is defined in memtorch.cu.utils.cuh, can be used. Technically, breakpoints can be added using NVIDIA Nsight; however, in my experience, this is cumbersome to use, and printf statements can easily be used alongside cuda-memcheck, enabling the use of your preferred IDE.

The ABCD_E matrices were originally proposed and defined in [1]. They are used to solve for node voltages using linear algebra, while accounting for source and line resistances in crossbar architectures. Solving these systems efficiently is rather nuanced, as the ABCD matrix is sparse, and sparse linear systems are difficult to solve in a parallelized manner.
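For intuition only (this is a toy stand-in, not MemTorch's actual construction), the ABCD_E formulation of [1] reduces the crossbar to a linear system A v = e, where the sparse "ABCD" matrix A collects device, source, and line conductances, and the "E" vector e encodes the applied voltages. A minimal dense sketch with NumPy:

```python
# Toy illustration of the "solve A v = e for node voltages" step.
# The 4x4 matrix below is an arbitrary well-conditioned stand-in; the real
# ABCD matrix is large and sparse, so in practice a sparse solver
# (e.g. scipy.sparse.linalg.spsolve, or a GPU sparse solver) would be used.
import numpy as np

A = np.array([
    [ 4.0, -1.0,  0.0,  0.0],
    [-1.0,  4.0, -1.0,  0.0],
    [ 0.0, -1.0,  4.0, -1.0],
    [ 0.0,  0.0, -1.0,  4.0],
])
e = np.array([1.0, 0.0, 0.0, 1.0])  # excitation vector (the "E" part)

v = np.linalg.solve(A, e)           # node voltages
assert np.allclose(A @ v, e)        # the solution satisfies A v = e
```

The parallelization difficulty mentioned above comes precisely from this step: direct sparse factorizations have limited parallelism, which is why mapping the solve to CUDA is non-trivial.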

Hopefully, this helps! I'm happy to answer any further questions you may have.

[1] A. Chen, “A Comprehensive Crossbar Array Model With Solutions for Line Resistance and Nonlinear Device Characteristics,” IEEE Transactions on Electron Devices, vol. 60, no. 4, pp. 1318–1326, Apr. 2013, doi: 10.1109/ted.2013.2246791.

@Philippe-Drolet
Contributor

Thank you very much for this response. I will go with the good old printf approach; it seems to work so far!

coreylammie pushed a commit that referenced this issue Feb 10, 2022

Implementation of CUDA accelerated passive crossbar programming routines for the 2021 Data-Driven model (#125) as a partial solution to (#53).