Sweep autodiff graphs #3977
Conversation
Codecov Report
❌ Patch coverage is […]

@@            Coverage Diff             @@
##             main    #3977      +/-   ##
==========================================
- Coverage   65.42%   65.40%   -0.03%
==========================================
  Files        1183     1183
  Lines      139251   139266      +15
==========================================
- Hits        91103    91084      -19
- Misses      48148    48182      +34
laggui left a comment
A little late on the review sorry 😅
This could be useful to expose for users that want more fine-grained control over the cleanup, especially if they know they are creating tracked autodiff tensors that are not used.
Though I am not sure this is a very widespread pattern: explicitly creating tensors that require grad without actually using them 🤔
As per your example in the linked issue:
let a: Tensor<AutoB, 2> = Tensor::zeros([2, 2], &train_device);
let a = a.require_grad(); // an autodiff node is created
drop(a); // the tensor is dropped but its unusable graph persists

If you don't explicitly mark that tensor to require grad, then it will not be tracked and thus should be cleaned up with the current "orphaned" cleanup strategy.
fn graph_cleanup() {
    let graphs_to_visit = {
        let graph_locator = crate::runtime::graph::STATE.lock();
        let graph_locator = graph_locator.as_ref().unwrap();
        let mut graphs_to_visit = HashMap::new();
        for (_node_id, graph) in &graph_locator.graphs {
            graphs_to_visit
                .entry(graph.origin)
                .or_insert_with(|| Arc::clone(graph));
        }
        graphs_to_visit
    };

    use crate::runtime::NodeCleaner;
    let mut cleaner = crate::runtime::graph::GraphCleaner::init();
    cleaner.cleanup_orphaned_entries();
    for (_graph_origin, graph) in graphs_to_visit {
        let mut state = graph.state.lock().unwrap();
        let server = &mut state.server;
        server
            .memory_management
            .free_unavailable_nodes(|node_id: &NodeId| {
                server.steps.remove(node_id);
                server.actions_builder.remove(node_id);
                cleaner.clean(node_id);
            });
    }
}
I would keep the implementation more local to the GraphMutexClient, and simply call the cleanup procedure here.
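Roughly what that suggestion could look like, as a sketch with stand-in types: GraphMutexClient here is a stub, and cleanup_unused_graphs is a hypothetical method name, not burn's actual API.

// Stub standing in for burn's internal client type; the real one owns the
// graph registry and locking used in the diff above.
struct GraphMutexClient;

impl GraphMutexClient {
    // Hypothetical method: the locking, GraphCleaner::init() and
    // free_unavailable_nodes() logic from the diff would move in here.
    fn cleanup_unused_graphs(&self) {
        // ... cleanup implementation kept local to the client ...
    }
}

// The backend-level entry point then only delegates to the client
// (in the real code the client would come from the global graph state
// rather than being passed in).
fn graph_cleanup(client: &GraphMutexClient) {
    client.cleanup_unused_graphs();
}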
Thanks! That makes sense, as the (autograd) device isn't used at all. Since this change is not really in demand (I'll be closing the issue and PR), I'll leave the adaptation here for posterity, in case the PR is ever meant to be merged.
Thanks @laggui, I think the majority of situations are solved by your previous solution, and I'd agree that this PR actually favors some anti-patterns (leaving non-backward-related, yet grad-requiring, tensors hanging around). On the other hand, I think this could be useful for debugging -- if a user has a RAM leak and isn't sure whether "hanging nodes" is the cause, this sort of method can help in identifying or eliminating that possibility -- but actual demand is indeed low. I think I'll be closing the issue and the PR. They can always be referred to or re-opened in case this sort of demand comes up again in the future.
Yeah, it could be useful for debugging; I think we should expose more utilities for that. Though in this case it might actually only be useful when there is a bug 😅 which hopefully shouldn't happen again. But if we want to add something like this, I would at least put it behind a feature flag so it is not exposed at the general level. Sounds good! Thanks for the overall feedback nonetheless, and for helping us find the initial memory leak issue 🙏
@laggui I ran my code with main and with the branch from this PR merged into main. I got horrible performance on main, starting at about 2.1s per step and gradually increasing to around 15s per step before settling there. With this branch merged into main, I call graph_cleanup […].
Edit: If you need it, I am maintaining main with graph-sweep merged here for now: https://github.com/oxideye/burn/tree/graph-sweep
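For reference, a minimal sketch of that calling pattern, assuming graph_cleanup ends up as a no-argument associated function on AutodiffBackend (matching the free function in the diff above); the training-loop body and the cleanup interval are placeholders, not code from the linked branch:

use burn::tensor::backend::AutodiffBackend;

// Hypothetical training loop; only the cleanup call relates to this PR.
fn training_loop<B: AutodiffBackend>(num_steps: usize) {
    for step in 0..num_steps {
        // ... forward pass, loss.backward(), optimizer update ...

        // Periodically sweep autodiff graphs whose tensors were dropped
        // without ever reaching a backward pass.
        if step % 100 == 0 {
            B::graph_cleanup();
        }
    }
}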
@vadixidav I think I agree, actually. As long as we document the usage, I think this can be useful. I was actually debugging a program yesterday where a bunch of tensors explicitly marked as require grad were never used as part of the backward pass, creating an increasing amount of tiny graphs during training that were never freed. Because some steps might require the tensors as part of the backward state, the GPU handles for these tensors were also not being freed. I was trying to think of a good way to handle this kind of "orphaned graph" cleanup, but exposing a cleanup procedure similar to the proposal in this PR would be helpful for more advanced users. I'll have a look at your code this week to validate, but you probably have behavior similar to what I described above.
Don't worry about it, my code is literally just this PR merged into main. It has no conflicts currently. You can reopen this PR as-is and get the same behavior.
Oh lol, my brain shortened the URL when reading on my phone, so I thought it linked to a repo (not a burn fork 😅). Forget my comment haha
Hey @laggui @vadixidav, I wanted to mention that I made a new PR with the […]
FYI, not sure if you've seen #4039, which now takes care of all unused graphs, including tensors explicitly marked as require grad that are never used within a backward pass. So technically the manual "sweep" should not be required.
Thanks @laggui! Yes, I've updated the sweep to match your changes. I've tested […]
Edit: but on […]
Ohhh right, I forgot that your stress test just never called backward lol. That will not trigger a cleanup; it only happens on backward. But it now also takes care of all unused tensors. As long as your program calls backward at least once for autodiff, it should clean them up (even if they're not involved in the backward graph).
Yes, but sorry @laggui for my insistence lol. I have adapted the test to create two independent tensors, both requiring grad, where one of them has a backward call. On […]

In this code snippet:

for _ in 0..10_000_000 {
    let a: Tensor<AutoB, 2> = Tensor::zeros([2, 2], &train_device);
    let b: Tensor<AutoB, 2> = Tensor::zeros([2, 2], &train_device);
    drop(a.require_grad());
    drop(b.require_grad().sum().backward());
}
Pull Request Template
Checklist
The cargo run-checks command has been executed.
Related Issues/PRs
Changes
Adds a graph_cleanup method to the burn::tensor::backend::AutodiffBackend trait.
Testing
Example of usage:
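A minimal sketch of the intended usage, assuming AutoB is any autodiff backend and that graph_cleanup is the associated function added by this PR (the loop count and tensor shape are arbitrary):

use burn::tensor::{backend::AutodiffBackend, Tensor};

fn stress_test<AutoB: AutodiffBackend>(device: &AutoB::Device) {
    for _ in 0..10_000_000 {
        // Tracked tensor that is never used in any backward pass.
        let a: Tensor<AutoB, 2> = Tensor::zeros([2, 2], device);
        drop(a.require_grad());

        // Sweep the now-unreachable graph so memory stays flat.
        AutoB::graph_cleanup();
    }
}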
Then, by watching the system's RAM usage, one can see that memory usage doesn't grow (when the call is inside the loop). Also, there are no vector/hashmap/hashset re-allocations, possibly improving runtime a little.