
[tool] fix: make rate-limiter actor detached to prevent ActorDiedError #5327

Open

mirrorboat wants to merge 1 commit into verl-project:main from mirrorboat:main

Conversation

@mirrorboat

What does this PR do?

This PR addresses a critical stability issue in the SearchTool where concurrent executions would intermittently fail with ray.exceptions.ActorDiedError. The root cause was that the TokenBucketWorker (named "rate-limiter") was created as a regular Ray actor, tying its lifecycle to the creating SearchExecutionWorker instance. When the SearchExecutionWorker was garbage-collected or released, Ray would automatically terminate the rate-limiter actor, causing subsequent search executions to fail.

To resolve this, the TokenBucketWorker is now explicitly created as a detached actor using lifetime="detached". This ensures the rate limiter persists independently of any individual tool instance and remains available for the entire duration of the Ray cluster session. The initialization logic has been updated to safely reuse an existing detached actor if it already exists, enabling shared global rate limiting across multiple SearchTool instances.
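
For illustration, here is a minimal sketch of the pattern this PR moves to: a named, detached token-bucket actor that any SearchTool instance can create or reuse. The semaphore-based actor body and the get_rate_limiter helper are assumptions added for clarity, not the exact verl implementation; only the TokenBucketWorker name, the "rate-limiter" actor name, the rate_limit argument, and the acquire method come from the PR.

import asyncio

import ray


@ray.remote
class TokenBucketWorker:
    """Illustrative async actor that caps the number of concurrent searches."""

    def __init__(self, rate_limit: int):
        self._sem = asyncio.Semaphore(rate_limit)

    async def acquire(self):
        # Callers block in ray.get(...) until a slot is available.
        await self._sem.acquire()

    async def release(self):
        self._sem.release()


def get_rate_limiter(rate_limit: int):
    # Hypothetical helper. lifetime="detached" decouples the actor from
    # whichever worker happens to create it, and get_if_exists=True lets
    # concurrent creators share the single named instance instead of racing.
    return TokenBucketWorker.options(
        name="rate-limiter",
        lifetime="detached",
        get_if_exists=True,
    ).remote(rate_limit)

A tool instance would then call ray.get(limiter.acquire.remote()) before issuing a search and limiter.release.remote() afterwards; the traceback below fails at exactly that acquire call.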

Error log:

(AgentLoopWorker pid=118037) ERROR:2026-02-15 13:40:19,925:[SearchTool] Execution failed: ray::SearchExecutionWorker.execute() (pid=119104, ip=10.48.90.221, actor_id=cf1f33ebd4264d1f35620fbf01000000, repr=<verl.tools.search_tool.SearchExecutionWorker object at 0x7fba8fca83b0>)
(AgentLoopWorker pid=118037)   File "/verl/tools/search_tool.py", line 93, in execute
(AgentLoopWorker pid=118037)     ray.get(self.rate_limit_worker.acquire.remote())
(AgentLoopWorker pid=118037) ray.exceptions.ActorDiedError: The actor died unexpectedly before finishing this task.
(AgentLoopWorker pid=118037) 	class_name: TokenBucketWorker
(AgentLoopWorker pid=118037) 	actor_id: c6ac09706cd7e07b4f06144701000000
(AgentLoopWorker pid=118037) 	pid: 194826
(AgentLoopWorker pid=118037) 	name: rate-limiter
(AgentLoopWorker pid=118037) 	namespace: d99e26b4-e0a3-4627-8821-46f695dc8641
(AgentLoopWorker pid=118037) 	ip: 10.48.90.221
(AgentLoopWorker pid=118037) The actor is dead because its owner has died. Owner Id: 0c9085d20c7700587d14ba3be8178e9f62fcf32ff0001eb872e76430 Owner Ip address: 10.48.90.221 Owner worker exit type: INTENDED_SYSTEM_EXIT Worker exit detail: Owner's worker process has crashed.

Ray log of TokenBucketWorker (pid: 194826):

[2026-02-15 13:40:09,282 I 194826 194826] event.cc:500: Ray Event initialized for CORE_WORKER
[2026-02-15 13:40:09,282 I 194826 194826] event.cc:500: Ray Event initialized for EXPORT_TASK
[2026-02-15 13:40:09,282 I 194826 194826] event.cc:331: Set ray event level to warning
[2026-02-15 13:40:09,282 I 194826 195235] accessor.cc:784: Received notification for node, IsAlive = 1 node_id=fd9f113d528bec5c0dcd031d1d073a882a6f13e8236d59e05571bc7e
[2026-02-15 13:40:09,282 I 194826 195235] core_worker.cc:5148: Number of alive nodes:1
[2026-02-15 13:40:09,284 I 194826 194826] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=c6ac09706cd7e07b4f06144701000000
[2026-02-15 13:40:09,284 I 194826 194826] core_worker.cc:3405: Creating actor actor_id=c6ac09706cd7e07b4f06144701000000
[2026-02-15 13:40:09,991 I 194826 194826] task_receiver.cc:160: Actor creation task finished, task_id: ffffffffffffffffc6ac09706cd7e07b4f06144701000000, actor_id: c6ac09706cd7e07b4f06144701000000, actor_repr_name: 
[2026-02-15 13:40:19,350 W 194826 195230] metric_exporter.cc:105: [1] Export metrics to agent failed: RpcError: RPC Error message: ; RPC Error details:  rpc_code: 12. This won't affect Ray, but you can lose metrics from the cluster.
[2026-02-15 13:40:19,892 I 194826 195235] core_worker.cc:4583: Force kill actor request has received. exiting immediately... The actor is dead because its owner has died. Owner Id: 0c9085d20c7700587d14ba3be8178e9f62fcf32ff0001eb872e76430 Owner Ip address: 10.48.90.221 Owner worker exit type: INTENDED_SYSTEM_EXIT Worker exit detail: Owner's worker process has crashed.
[2026-02-15 13:40:19,892 W 194826 195235] core_worker.cc:1263: Force exit the process.  Details: Worker exits because the actor is killed. The actor is dead because its owner has died. Owner Id: 0c9085d20c7700587d14ba3be8178e9f62fcf32ff0001eb872e76430 Owner Ip address: 10.48.90.221 Owner worker exit type: INTENDED_SYSTEM_EXIT Worker exit detail: Owner's worker process has crashed.
[2026-02-15 13:40:19,902 I 194826 194826] core_worker.cc:1172: Exit signal received, this process will exit after all outstanding tasks have finished, exit_type=SYSTEM_ERROR, detail=Worker exits unexpectedly by a signal. SystemExit is raised (sys.exit is called). Exit code: 1. The process receives a SIGTERM.
[2026-02-15 13:40:19,902 I 194826 194826] core_worker.cc:1208: Wait for currently executing tasks in the underlying thread pools to finish.
[2026-02-15 13:40:19,902 I 194826 194826] concurrency_group_manager.cc:99: Default executor is joining. If the 'Default executor is joined.' message is not printed after this, the worker is probably hanging because the actor task is running an infinite loop.
[2026-02-15 13:40:19,903 I 194826 194826] concurrency_group_manager.cc:103: Default executor is joined.
[2026-02-15 13:40:19,903 I 194826 194826] core_worker.cc:1250: Not draining reference counter since this is an actor worker.
[2026-02-15 13:40:20,281 I 194826 194826] core_worker.cc:1143: Try killing all child processes of this worker as it exits. Child process pids: 
[2026-02-15 13:40:20,281 I 194826 195235] core_worker.cc:1143: Try killing all child processes of this worker as it exits. Child process pids: 
[2026-02-15 13:40:20,281 I 194826 194826] core_worker.cc:1097: Sending disconnect message to the local raylet.
[2026-02-15 13:40:20,281 I 194826 194826] raylet_client.cc:73: RayletClient::Disconnect, exit_type=SYSTEM_ERROR, exit_detail=Worker exits unexpectedly by a signal. SystemExit is raised (sys.exit is called). Exit code: 1. The process receives a SIGTERM., has creation_task_exception_pb_bytes=0

Ray log of the owner of TokenBucketWorker (pid: 194826), i.e., the worker with Id 0c9085d20c7700587d14ba3be8178e9f62fcf32ff0001eb872e76430:

[2026-02-15 13:40:05,077 W 119097 119097] actor_manager.cc:110: Failed to look up actor with name 'rate-limiter'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
[2026-02-15 13:40:05,078 W 119097 119097] actor_manager.cc:110: Failed to look up actor with name 'rate-limiter'. This could because 1. You are trying to look up a named actor you didn't create. 2. The named actor died. 3. You did not use a namespace matching the namespace of the actor.
[2026-02-15 13:40:05,087 I 119097 119097] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=c6ac09706cd7e07b4f06144701000000
[2026-02-15 13:40:05,096 I 119097 119097] task_receiver.cc:160: Actor creation task finished, task_id: ffffffffffffffff9642452fdfa500cefef93d7001000000, actor_id: 9642452fdfa500cefef93d7001000000, actor_repr_name: 
[2026-02-15 13:40:05,174 I 119097 194730] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:05,176 I 119097 119464] actor_manager.cc:218: received notification on actor, state: PENDING_CREATION, ip address: , port: 0, num_restarts: 0, death context type=CONTEXT_NOT_SET actor_id=c6ac09706cd7e07b4f06144701000000 worker_id=NIL_ID node_id=fd9f113d528bec5c0dcd031d1d073a882a6f13e8236d59e05571bc7e
[2026-02-15 13:40:09,995 I 119097 119464] actor_manager.cc:218: received notification on actor, state: ALIVE, ip address: 10.48.90.221, port: 44425, num_restarts: 0, death context type=CONTEXT_NOT_SET actor_id=c6ac09706cd7e07b4f06144701000000 worker_id=58f98ec45f5b889b1118fcd5ac929cc23b7f4383744f1b02c8a251e4 node_id=fd9f113d528bec5c0dcd031d1d073a882a6f13e8236d59e05571bc7e
[2026-02-15 13:40:10,361 I 119097 194736] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:10,771 I 119097 194738] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:11,036 I 119097 194739] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:11,765 I 119097 194741] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:12,570 I 119097 194742] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:13,144 I 119097 194745] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:13,625 I 119097 194748] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:14,389 I 119097 194749] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:14,762 I 119097 194731] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:15,458 I 119097 194730] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:16,026 I 119097 194736] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:16,495 I 119097 194738] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:18,287 I 119097 194739] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:18,583 I 119097 194741] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:19,072 I 119097 194742] actor_task_submitter.cc:73: Set actor max pending calls to -1 actor_id=9642452fdfa500cefef93d7001000000
[2026-02-15 13:40:19,513 I 119097 119464] core_worker.cc:4583: Force kill actor request has received. exiting immediately... The actor is dead because all references to the actor were removed.
[2026-02-15 13:40:19,513 W 119097 119464] core_worker.cc:1263: Force exit the process.  Details: Worker exits because the actor is killed. The actor is dead because all references to the actor were removed.
[2026-02-15 13:40:19,884 I 119097 119464] core_worker.cc:1143: Try killing all child processes of this worker as it exits. Child process pids: 
[2026-02-15 13:40:19,885 I 119097 119464] core_worker.cc:1097: Sending disconnect message to the local raylet.
[2026-02-15 13:40:19,885 I 119097 119464] raylet_client.cc:73: RayletClient::Disconnect, exit_type=INTENDED_SYSTEM_EXIT, exit_detail=Worker exits because the actor is killed. The actor is dead because all references to the actor were removed., has creation_task_exception_pb_bytes=0
[2026-02-15 13:40:19,885 I 119097 119464] core_worker.cc:1103: Disconnected from the local raylet.


@gemini-code-assist bot left a comment


Code Review

This pull request correctly identifies and addresses a critical stability issue where the rate-limiter actor was being terminated prematurely due to its ownership by a non-detached worker. By transitioning the TokenBucketWorker to a detached actor, its lifecycle is now independent of the creating worker, which effectively prevents the ActorDiedError. My feedback focuses on improving the robustness of this fix by avoiding potential name collisions in the global Ray actor registry and simplifying the initialization logic using Ray's built-in 'get or create' functionality.

Comment on lines +82 to +94
        namespace = ray.get_runtime_context().namespace
        try:
            actor = ray.get_actor("rate-limiter", namespace=namespace)
            logger.info("Reusing existing detached rate-limiter actor.")
            return actor
        except ValueError:
            logger.info("Creating new detached rate-limiter actor.")
            return TokenBucketWorker.options(
                name="rate-limiter",
                lifetime="detached",
                namespace=namespace,
                get_if_exists=True
            ).remote(rate_limit)


high

The use of the generic name "rate-limiter" for a detached actor is risky because it can collide with other tools (such as SandboxFusionTool in this repository) that might also use the same name within the same Ray namespace. If another tool creates a non-detached actor with this name first, this code will reuse it, and the ActorDiedError will still occur when that other tool is destroyed. Additionally, the try...except block and manual namespace handling are redundant; Ray's get_if_exists=True option in options() already handles the 'get or create' logic safely and respects the current namespace by default.

        # Use a specific name to avoid collisions with other tools in the same namespace.
        return TokenBucketWorker.options(
            name="search-tool-rate-limiter",
            lifetime="detached",
            get_if_exists=True
        ).remote(rate_limit)
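
For context, a detached named actor created this way can later be retrieved by name from any worker in the same namespace, which is what makes it shareable across SearchTool instances. A minimal illustrative lookup, using the name from the suggestion above:

# Illustrative usage from another worker in the same namespace.
limiter = ray.get_actor("search-tool-rate-limiter")
ray.get(limiter.acquire.remote())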

