Native integration of pytorch_memlab or something like it #5189

Closed
turian opened this issue Dec 19, 2020 · 5 comments
Labels: feature (Is an improvement or enhancement) · help wanted (Open to be worked on) · won't fix (This will not be worked on)

Comments

turian (Contributor) commented Dec 19, 2020

🚀 Feature

Fine-grained memory profiling in pytorch-lightning that explains:

  1. what specifically causes GPU utilization.memory
  2. what specifically causes GPU memory.used
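
(For reference, these two quantities are the counters that nvidia-smi reports per GPU. A minimal sketch of reading them directly from Python, assuming the pynvml package is installed, might look like the following; it only shows where the numbers come from, not what causes them.)

```python
# Minimal sketch: read the two counters listed above straight from NVML.
# Assumes the `pynvml` package (nvidia-ml-py) is installed; GPU index 0 is arbitrary.
from pynvml import (
    nvmlInit,
    nvmlDeviceGetHandleByIndex,
    nvmlDeviceGetUtilizationRates,
    nvmlDeviceGetMemoryInfo,
)

nvmlInit()
handle = nvmlDeviceGetHandleByIndex(0)

util = nvmlDeviceGetUtilizationRates(handle)  # util.memory -> utilization.memory (%)
mem = nvmlDeviceGetMemoryInfo(handle)         # mem.used    -> memory.used (bytes)
print(f"utilization.memory: {util.memory}%, memory.used: {mem.used / 2**20:.0f} MiB")
```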

Motivation

  1. If data is being moved back and forth unnecessarily between the CPU and the GPU, training slows down. However, it is hard to pinpoint where these transfers come from.
  2. Certain architectures and batch sizes cause OOM. It would be useful to pinpoint exactly where the memory consumption comes from, in order to design more compact networks and use larger batch sizes.

Pitch

pytorch-lightning is designed to make it easy to train networks with pytorch. However, debugging utilization.memory and memory.used is very ad-hoc and tricky. Best practices don't always work, and a very simple fine-grained profiler would be very useful, even for experts writing complicated networks. The sketch below shows what the current ad-hoc approach looks like.
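
To make "ad-hoc and tricky" concrete, the status quo is roughly hand-placed allocator probes around suspect lines. This is only a sketch of that workflow, not a Lightning API; the model and batch are stand-ins:

```python
import torch

# Stand-in model and batch, just so the probes have something to measure.
model = torch.nn.Linear(1024, 1024).cuda()
x = torch.randn(512, 1024, device="cuda")

# Diff the allocator counter around the line under suspicion.
before = torch.cuda.memory_allocated()
out = model(x)
print("forward allocated:", torch.cuda.memory_allocated() - before, "bytes")

# Full allocator snapshot; informative but coarse and easy to misread.
print(torch.cuda.memory_summary(abbreviated=True))
```

Repeating this by hand for every suspect line of a real LightningModule is exactly the busywork this request is trying to avoid.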

Alternatives

  • Instrumenting every single line of code with GPUStatsMonitor
  • pytorch_memlab, but it doesn't have good support for pytorch-lightning. Better native integration of this tool in pytorch-lightning would be very beneficial; a sketch of the manual wiring it currently requires follows this list.
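
As a rough illustration of that manual wiring, one could decorate the hot path of a LightningModule with pytorch_memlab's line profiler by hand. This is a sketch assuming pytorch_memlab is installed; MyModule and its attributes are placeholder names, not anything Lightning provides:

```python
import torch
import pytorch_lightning as pl
from pytorch_memlab import profile  # line-by-line CUDA memory reporting


class MyModule(pl.LightningModule):
    def __init__(self):
        super().__init__()
        self.model = torch.nn.Linear(1024, 10)      # stand-in network
        self.loss_fn = torch.nn.CrossEntropyLoss()

    @profile  # records per-line CUDA memory usage for training_step
    def training_step(self, batch, batch_idx):
        x, y = batch
        out = self.model(x)
        return self.loss_fn(out, y)

    def configure_optimizers(self):
        return torch.optim.Adam(self.parameters())
```

A native integration would presumably hide this decoration behind a profiler or callback option instead of requiring edits to the module itself.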

Additional context

Attached is a graph of my utilization.memory from the wandb.ai dashboard:

[screenshot: utilization.memory over time, from the wandb.ai run dashboard]

I am tearing my hair out trying to figure out why this is the case. As far as I can tell, everything is on the GPU, and I don't know where the memory accesses are coming from. I'd love a one-liner tool that explained this, rather than poking around blindly in a haphazard way.

turian added the feature (Is an improvement or enhancement) and help wanted (Open to be worked on) labels on Dec 19, 2020
awaelchli (Contributor) commented:

If you suspect that Lightning is causing the memory overhead, it would be good to have a baseline pytorch implementation to compare against. It may also be worth trying another logger with a similar GPU-logging feature (Comet, perhaps?) to make sure it isn't a visualization issue.

Related: #2080

turian (Contributor, Author) commented Dec 21, 2020

@awaelchli No, I don't think that Lightning is causing the issue; it's more likely some line of my own code that I didn't think through.

I think these are bad solutions to memory profiling:

  • rewriting your application without pytorch-lightning
  • adding a second logger, which is also not granular

This is why I'm asking for a simple, granular memory profiler. It would be of great use to many people, provided it doesn't require a ton of fiddly busywork that never gets at the root problem.

tchaton (Contributor) commented Jan 18, 2021

Hey @turian,

We will first add support for the PyTorchProfiler introduced with PyTorch 1.6.
Feel free to make a PR to support this profiler.
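
Roughly, the user-facing side could look something like the sketch below; the class name and keyword arguments are assumptions mirroring the torch.autograd.profiler API from PyTorch 1.6, not a confirmed Lightning interface:

```python
import pytorch_lightning as pl
from pytorch_lightning.profiler import PyTorchProfiler  # assumed import path

# profile_memory mirrors torch.autograd.profiler.profile(profile_memory=True),
# available since PyTorch 1.6; treat the exact kwargs as an assumption.
profiler = PyTorchProfiler(profile_memory=True)
trainer = pl.Trainer(profiler=profiler, gpus=1)
# trainer.fit(model)  # `model` is whatever LightningModule is under investigation
```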

Best,
T.C


stale bot commented Feb 17, 2021

This issue has been automatically marked as stale because it hasn't had any recent activity. This issue will be closed in 7 days if no further activity occurs. Thank you for your contributions, Pytorch Lightning Team!

stale bot added the won't fix (This will not be worked on) label on Feb 17, 2021
stale bot closed this as completed on Feb 27, 2021
turian (Contributor, Author) commented Feb 28, 2021

@tchaton Happy to see the PyTorchProfiler integration when it arrives.
