Skip to content

Conversation

Cielbird
Copy link
Contributor

burn-remote

  • Added reconnection attemts when the Websocket streams fail. At every reconnection, a new session ID is created.

  • (Bug fix): Added a RegisterEmptyTensor compute task

  • Refactored client worker

  • Improved tests

burn-router

  • Added drop_client()

I had an issue where my tests would hang, because the websocket client had started threads. Adding drop_device allows me to explicitly close the websocket client in my tests.

@nathanielsimard nathanielsimard requested a review from laggui June 19, 2025 15:00
Copy link

codecov bot commented Jun 19, 2025

Codecov Report

❌ Patch coverage is 86.81672% with 41 lines in your changes missing coverage. Please review.
✅ Project coverage is 82.69%. Comparing base (81985bd) to head (01ff4ad).
⚠️ Report is 191 commits behind head on main.

Files with missing lines Patch % Lines
crates/burn-remote/src/client/worker.rs 82.16% 33 Missing ⚠️
crates/burn-remote/src/server/base.rs 61.53% 5 Missing ⚠️
crates/burn-remote/src/client/channel.rs 0.00% 1 Missing ⚠️
crates/burn-remote/src/client/runner.rs 83.33% 1 Missing ⚠️
crates/burn-router/src/client/base.rs 92.85% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3305      +/-   ##
==========================================
+ Coverage   82.66%   82.69%   +0.03%     
==========================================
  Files         995      995              
  Lines      127626   127822     +196     
==========================================
+ Hits       105498   105708     +210     
+ Misses      22128    22114      -14     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Copy link
Member

@laggui laggui left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Regarding your drop_client issues.. should we also have a Drop implementation to clean up the tasks when dropped?

pub fn register_empty_tensor(&self, id: TensorId, shape: Vec<usize>, dtype: DType) -> TensorIr {
let mut ctx = self.context.lock().unwrap();

let shape = Shape { dims: shape };
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just call shape.into() (Shape implements From<Vec<usize>>)

@Cielbird
Copy link
Contributor Author

Regarding your drop_client issues.. should we also have a Drop implementation to clean up the tasks when dropped?

Drop is implemented for WsSender, which is dropped when the last WsClient is dropped. This is what triggers the "close" flag for the workers. There are no issues now with drop_client.

laggui
laggui previously approved these changes Jun 19, 2025
@laggui laggui dismissed their stale review June 19, 2025 19:03

Just tested the mnist example using the remote backend and there seems to be an issue with the handle container, which then causes the lock to be poisoned.

@laggui
Copy link
Member

laggui commented Jun 19, 2025

It looks like the tensor handle issue still exists. Just tried running one of the examples (mnist) with the remote backend:

2025-06-19T18:58:05.740970Z  INFO burn_remote::server::base: Start server 0.0.0.0:3000 on device Cuda(0)
2025-06-19T18:58:24.328371Z  INFO burn_remote::server::base: [Request Handler] On new connection.
2025-06-19T18:58:24.328531Z  INFO burn_remote::server::base: [Response Handler] On new connection.
2025-06-19T18:58:24.328639Z  INFO burn_remote::server::session: Register responder for session SessionId(8413177382708317273)
2025-06-19T18:58:24.328660Z  INFO burn_remote::server::session: Creating a new session SessionId(8413177382708317273)
2025-06-19T18:58:24.328664Z  INFO burn_remote::server::base: Response handler connection active
2025-06-19T18:58:24.328689Z  INFO burn_remote::server::session: Init requester for session SessionId(8413177382708317273)
2025-06-19T18:58:24.328695Z  INFO burn_remote::server::base: Ops session activated Some(SessionId { id: 8413177382708317273 })

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-ir/src/handle.rs:78:32:
Should have handle for tensor TensorId { value: 950715 }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:168:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:168:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-remote/src/server/stream.rs:44:14:
called `Result::unwrap()` on an `Err` value: SendError { .. }

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:168:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:168:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

2025-06-19T18:58:59.050182Z  INFO burn_remote::server::base: [Request Handler] On new connection.
2025-06-19T18:58:59.050543Z  INFO burn_remote::server::base: [Response Handler] On new connection.
2025-06-19T18:58:59.050596Z  INFO burn_remote::server::session: Register responder for session SessionId(1156240062615404436)
2025-06-19T18:58:59.050599Z  INFO burn_remote::server::session: Init requester for session SessionId(1156240062615404436)
2025-06-19T18:58:59.050617Z  INFO burn_remote::server::session: Creating a new session SessionId(1156240062615404436)
2025-06-19T18:58:59.050623Z  INFO burn_remote::server::base: Ops session activated Some(SessionId { id: 1156240062615404436 })
2025-06-19T18:58:59.050635Z  INFO burn_remote::server::base: Response handler connection active

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:168:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:168:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:74:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

thread 'tokio-runtime-worker' panicked at /home/laggui/workspace/burn/crates/burn-router/src/runner.rs:74:43:
called `Result::unwrap()` on an `Err` value: "poisoned lock: another task failed inside"

@nathanielsimard nathanielsimard requested a review from laggui June 23, 2025 12:20
Copy link
Contributor

This PR has been marked as stale because it has not been updated for over a month

@github-actions github-actions bot added the stale The issue or pr has been open for too long label Jul 24, 2025
@github-actions github-actions bot removed the stale The issue or pr has been open for too long label Aug 3, 2025
Copy link
Contributor

github-actions bot commented Sep 2, 2025

This PR has been marked as stale because it has not been updated for over a month

@github-actions github-actions bot added the stale The issue or pr has been open for too long label Sep 2, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
stale The issue or pr has been open for too long
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants