Skip to content

Conversation

laggui
Copy link
Member

@laggui laggui commented Aug 21, 2025

[WIP] Currently retains intermediate graph due to ref count, need to adjust state

Checklist

  • Confirmed that cargo run-checks command has been executed.

Related Issues/PRs

Fixes #3513

Nodes were marked as unavailable (parent nodes unavailable) and getting cleaned up but were still referenced in another thread for (incomplete) backward pass.

Changes

Check if node is still referenced by another thread (global autodiff, multi-device setup) when marking for deletion

Testing

Tested code in linked issue, validated that such nodes are eventually cleaned up as unused roots (only server reference remains, no more parents).

@laggui laggui requested a review from nathanielsimard August 21, 2025 12:46
@laggui laggui marked this pull request as draft August 21, 2025 13:23
Copy link
Member

@nathanielsimard nathanielsimard left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The context seems to have the same mechanism as streams in Fusion. Right now, the context mechanism doesn't support or detect merging multiple streams, which is probably problematic for multi-device training with the naive approach.

// TODO: contexts cleanup

#[derive(Default)]
pub struct AutodiffServer {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could move the context to another file context.rs

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Autodiff panic during training, when run on multiple GPUs
2 participants