Remote reconnection improvement #3305

Cielbird · 2025-06-19T14:57:07Z

burn-remote

Added reconnection attempts when the WebSocket streams fail. At every reconnection, a new session ID is created.
(Bug fix): Added a RegisterEmptyTensor compute task
Refactored client worker
Improved tests
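
As a rough illustration of the reconnection behaviour described above (retry on failure, with a new session ID for each attempt), here is a minimal std-only sketch. The names `SessionId::fresh` and `connect_with_retry`, the counter-based ID, and the fixed delay are all assumptions made for the example; they are not the burn-remote API, which is async and WebSocket-based.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::Duration;

/// Hypothetical session identifier used only in this sketch.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SessionId(pub u64);

static NEXT_SESSION: AtomicU64 = AtomicU64::new(1);

impl SessionId {
    /// Illustrative generator: a process-wide counter rather than a random ID.
    pub fn fresh() -> Self {
        SessionId(NEXT_SESSION.fetch_add(1, Ordering::Relaxed))
    }
}

/// Calls `connect` up to `max_attempts` times, passing a fresh session ID on
/// every attempt. Returns the connection together with the session ID that
/// succeeded, or the error from the last attempt.
pub fn connect_with_retry<C, E>(
    max_attempts: u32,
    delay: Duration,
    mut connect: impl FnMut(SessionId) -> Result<C, E>,
) -> Result<(C, SessionId), E> {
    assert!(max_attempts > 0, "at least one attempt is required");
    let mut attempt = 1;
    loop {
        // A new session is created for every (re)connection attempt.
        let session = SessionId::fresh();
        match connect(session) {
            Ok(conn) => return Ok((conn, session)),
            Err(err) if attempt >= max_attempts => return Err(err),
            Err(_) => {
                attempt += 1;
                thread::sleep(delay);
            }
        }
    }
}
```

Because every attempt gets its own ID, a server in this model never sees a half-initialised session reused after a dropped stream; whether the PR relies on that property is not stated here.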

burn-router

Added drop_client()

I had an issue where my tests would hang because the WebSocket client had started threads. Adding drop_client allows me to explicitly close the WebSocket client in my tests.

codecov · 2025-06-19T15:34:20Z

Codecov Report

❌ Patch coverage is 86.81672% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.69%. Comparing base (81985bd) to head (01ff4ad).
⚠️ Report is 191 commits behind head on main.

Files with missing lines	Patch %	Lines
crates/burn-remote/src/client/worker.rs	82.16%	33 Missing ⚠️
crates/burn-remote/src/server/base.rs	61.53%	5 Missing ⚠️
crates/burn-remote/src/client/channel.rs	0.00%	1 Missing ⚠️
crates/burn-remote/src/client/runner.rs	83.33%	1 Missing ⚠️
crates/burn-router/src/client/base.rs	92.85%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #3305      +/-   ##
==========================================
+ Coverage   82.66%   82.69%   +0.03%     
==========================================
  Files         995      995              
  Lines      127626   127822     +196     
==========================================
+ Hits       105498   105708     +210     
+ Misses      22128    22114      -14


laggui

Regarding your drop_client issues: should we also have a Drop implementation to clean up the tasks when dropped?

laggui · 2025-06-19T15:33:42Z

crates/burn-router/src/runner.rs

+    pub fn register_empty_tensor(&self, id: TensorId, shape: Vec<usize>, dtype: DType) -> TensorIr {
+        let mut ctx = self.context.lock().unwrap();
+
+        let shape = Shape { dims: shape };


You can just call shape.into() (Shape implements From<Vec<usize>>)
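
To make the suggestion concrete, here is a small self-contained sketch. The `Shape` type below is a stand-in defined only for this example (the real one lives in burn); it assumes, as the comment above states, that `Shape` implements `From<Vec<usize>>`. The helpers `explicit` and `converted` are hypothetical names.

```rust
/// Stand-in for burn's `Shape`, defined here so the example compiles alone.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct Shape {
    pub dims: Vec<usize>,
}

impl From<Vec<usize>> for Shape {
    fn from(dims: Vec<usize>) -> Self {
        Shape { dims }
    }
}

/// The form used in the diff: build the struct field by field.
pub fn explicit(shape: Vec<usize>) -> Shape {
    Shape { dims: shape }
}

/// The suggested form: let the `From` implementation do the conversion.
pub fn converted(shape: Vec<usize>) -> Shape {
    shape.into()
}
```

Both helpers produce the same value; the `into()` form simply avoids depending on the struct's field layout at the call site.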


Cielbird · 2025-06-19T17:39:57Z

Regarding your drop_client issues: should we also have a Drop implementation to clean up the tasks when dropped?

Drop is implemented for WsSender, which is dropped when the last WsClient is dropped. This is what triggers the "close" flag for the workers. There are no issues now with drop_client.
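
The mechanism described here (a sender whose Drop sets a "close" flag once the last client handle goes away) can be sketched with std threads. This is only an illustration under stated assumptions: `Sender`, `spawn`, and `close_flag` are hypothetical names, the worker is a plain thread rather than an async task, and clients are modelled as `Arc<Sender>` clones.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;
use std::thread::{self, JoinHandle};
use std::time::Duration;

/// Hypothetical stand-in for the sender owned (via `Arc`) by every client.
pub struct Sender {
    closed: Arc<AtomicBool>,
    worker: Option<JoinHandle<u64>>,
}

impl Sender {
    /// Starts a background worker that loops until the close flag is set.
    pub fn spawn() -> Self {
        let closed = Arc::new(AtomicBool::new(false));
        let flag = Arc::clone(&closed);
        let worker = thread::spawn(move || {
            let mut ticks = 0u64;
            while !flag.load(Ordering::Acquire) {
                ticks += 1;
                thread::sleep(Duration::from_millis(1));
            }
            ticks
        });
        Sender { closed, worker: Some(worker) }
    }

    /// Shared view of the close flag, so shutdown can be observed from outside.
    pub fn close_flag(&self) -> Arc<AtomicBool> {
        Arc::clone(&self.closed)
    }
}

impl Drop for Sender {
    fn drop(&mut self) {
        // Signal the worker, then wait for it so no thread outlives the sender.
        self.closed.store(true, Ordering::Release);
        if let Some(handle) = self.worker.take() {
            let _ = handle.join();
        }
    }
}
```

With clients holding `Arc<Sender>`, dropping all but one handle leaves the worker running, and dropping the last one runs `Drop`, sets the flag, and joins the thread. In this model that is what keeps a test from hanging on a leftover worker.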

Just tested the mnist example using the remote backend and there seems to be an issue with the handle container, which then causes the lock to be poisoned.

laggui · 2025-06-19T19:09:17Z

It looks like the tensor handle issue still exists. Just tried running one of the examples (mnist) with the remote backend:

2025-06-19T18:58:05.740970Z  INFO burn_remote::server::base: Start server 0.0.0.0:3000 on device Cuda(0)
2025-06-19T18:58:24.328371Z  INFO burn_remote::server::base: [Request Handler] On new connection.
2025-06-19T18:58:24.328531Z  INFO burn_remote::server::base: [Response Handler] On new connection.
2025-06-19T18:58:24.328639Z  INFO burn_remote::server::session: Register responder for session SessionId(8413177382708317273)
2025-06-19T18:58:24.328660Z  INFO burn_remote::server::session: Creating a new session SessionId(8413177382708317273)
2025-06-19T18:58:24.328664Z  INFO burn_remote::server::base: Response handler connection active
2025-06-19T18:58:24.328689Z  INFO burn_remote::server::session: Init requester for session SessionId(8413177382708317273)
2025-06-19T18:58:24.328695Z  INFO burn_remote::server::base: Ops session activated Some(SessionId { id: 8413177382708317273 })

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-ir/src/handle.rs:78:32:
Should have handle for tensor TensorId { value: 950715 }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:168:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:168:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-remote/src/server/stream.rs:44:14:
called `Result::unwrap()` on an `Err` value: SendError { .. }

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:168:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:168:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

2025-06-19T18:58:59.050182Z  INFO burn_remote::server::base: [Request Handler] On new connection.
2025-06-19T18:58:59.050543Z  INFO burn_remote::server::base: [Response Handler] On new connection.
2025-06-19T18:58:59.050596Z  INFO burn_remote::server::session: Register responder for session SessionId(1156240062615404436)
2025-06-19T18:58:59.050599Z  INFO burn_remote::server::session: Init requester for session SessionId(1156240062615404436)
2025-06-19T18:58:59.050617Z  INFO burn_remote::server::session: Creating a new session SessionId(1156240062615404436)
2025-06-19T18:58:59.050623Z  INFO burn_remote::server::base: Ops session activated Some(SessionId { id: 1156240062615404436 })
2025-06-19T18:58:59.050635Z  INFO burn_remote::server::base: Response handler connection active

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:168:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:168:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:74:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:74:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"
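
The cascade in the log above follows the standard library's mutex poisoning rule: once one thread panics while holding a `std::sync::Mutex`, every later `lock().unwrap()` panics too. The sketch below reproduces that pattern in isolation and shows one way to recover the guard via `PoisonError::into_inner`. The function names are hypothetical, and whether recovering (rather than failing fast) would be appropriate for the runner's context is a design decision this example does not settle.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Poisons the mutex by panicking in another thread while the lock is held.
pub fn poison(shared: &Arc<Mutex<Vec<u32>>>) {
    let clone = Arc::clone(shared);
    // The panic is confined to the spawned thread; `join` reports it as `Err`.
    let _ = thread::spawn(move || {
        let _guard = clone.lock().unwrap();
        panic!("simulated failure while holding the lock");
    })
    .join();
}

/// Pushes a value even if the mutex is poisoned, by taking the guard out of
/// the `PoisonError`. The data may be in an inconsistent state after a panic,
/// so this is only safe when the caller can tolerate or repair that.
pub fn push_recovering(shared: &Mutex<Vec<u32>>, value: u32) -> usize {
    let mut guard = shared
        .lock()
        .unwrap_or_else(|poisoned| poisoned.into_inner());
    guard.push(value);
    guard.len()
}
```

After `poison` runs, a plain `lock().unwrap()` on the same mutex would panic with a poisoned-lock error, which matches the repeated panics at runner.rs:168:43 and runner.rs:74:43 following the first missing-handle panic.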

github-actions · 2025-07-24T12:13:12Z

This PR has been marked as stale because it has not been updated for over a month

github-actions · 2025-09-02T12:12:44Z

This PR has been marked as stale because it has not been updated for over a month

Cielbird added 2 commits June 19, 2025 10:34

Remote reconnection improvement and tests

4b0281e

Clippy and fmt

1b257a1

nathanielsimard requested a review from laggui June 19, 2025 15:00

laggui reviewed Jun 19, 2025


Refactoring and CancellationToken

01ff4ad

laggui previously approved these changes Jun 19, 2025


nathanielsimard requested a review from laggui June 23, 2025 12:20

github-actions bot added the stale The issue or pr has been open for too long label Jul 24, 2025

github-actions bot removed the stale The issue or pr has been open for too long label Aug 3, 2025

github-actions bot added the stale The issue or pr has been open for too long label Sep 2, 2025
