[tool] fix: make rate-limiter actor detached to prevent ActorDiedError#5327
[tool] fix: make rate-limiter actor detached to prevent ActorDiedError#5327mirrorboat wants to merge 1 commit intoverl-project:mainfrom
Conversation
There was a problem hiding this comment.
Code Review
This pull request correctly identifies and addresses a critical stability issue where the rate-limiter actor was being terminated prematurely due to its ownership by a non-detached worker. By transitioning the TokenBucketWorker to a detached actor, its lifecycle is now independent of the creating worker, which effectively prevents the ActorDiedError. My feedback focuses on improving the robustness of this fix by avoiding potential name collisions in the global Ray actor registry and simplifying the initialization logic using Ray's built-in 'get or create' functionality.
| namespace = ray.get_runtime_context().namespace | ||
| try: | ||
| actor = ray.get_actor("rate-limiter", namespace=namespace) | ||
| logger.info("Reusing existing detached rate-limiter actor.") | ||
| return actor | ||
| except ValueError: | ||
| logger.info("Creating new detached rate-limiter actor.") | ||
| return TokenBucketWorker.options( | ||
| name="rate-limiter", | ||
| lifetime="detached", | ||
| namespace=namespace, | ||
| get_if_exists=True | ||
| ).remote(rate_limit) |
There was a problem hiding this comment.
The use of the generic name "rate-limiter" for a detached actor is risky because it can collide with other tools (such as SandboxFusionTool in this repository) that might also use the same name within the same Ray namespace. If another tool creates a non-detached actor with this name first, this code will reuse it, and the ActorDiedError will still occur when that other tool is destroyed. Additionally, the try...except block and manual namespace handling are redundant; Ray's get_if_exists=True option in options() already handles the 'get or create' logic safely and respects the current namespace by default.
# Use a specific name to avoid collisions with other tools in the same namespace.
return TokenBucketWorker.options(
name="search-tool-rate-limiter",
lifetime="detached",
get_if_exists=True
).remote(rate_limit)
What does this PR do?
This PR addresses a critical stability issue in the
SearchToolwhere concurrent executions would intermittently fail withray.exceptions.ActorDiedError. The root cause was that theTokenBucketWorker(named "rate-limiter") was created as a regular Ray actor, tying its lifecycle to the creatingSearchExecutionWorkerinstance. When theSearchExecutionWorkerwas garbage-collected or released, Ray would automatically terminate the rate-limiter actor, causing subsequent search executions to fail.To resolve this, the
TokenBucketWorkeris now explicitly created as a detached actor usinglifetime="detached". This ensures the rate limiter persists independently of any individual tool instance and remains available for the entire duration of the Ray cluster session. The initialization logic has been updated to safely reuse an existing detached actor if it already exists, enabling shared global rate limiting across multipleSearchToolinstances.Error log:
Ray log of TokenBucketWorker(pid: 194826)
Ray log of the owner of TokenBucketWorker(pid: 194826), i.e., Id: 0c9085d20c7700587d14ba3be8178e9f62fcf32ff0001eb872e76430