[core] Cancelling a task that is waiting for dependencies hangs its result objects #46315

Open
rynewang opened this issue Jun 28, 2024 · 0 comments
Labels
bug (Something that is supposed to be working; but isn't), core (Issues that should be addressed in Ray Core), P0 (Issues that should be fixed in short order)

Comments

@rynewang
Contributor

What happened + What you expected to happen

When a task that returns a Ray object is in the "waiting for dependencies" state and you call ray.cancel(ref) on its ref, the next ray.get(ref) should raise TaskCancelledError. Instead, it currently hangs.

This issue also applies to streaming generators, and to tasks that are resubmitted via lineage reconstruction.

This comes from

// This case is reached for tasks that have unresolved dependencies.
// No executing tasks, so cancelling is a noop.

where, if the task being cancelled has unresolved dependencies, the cancel is a no-op.

This may have been true four years ago, but now we need to fail the task and mark all of its return objects as failed with the exception.
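The difference between the two behaviors can be illustrated with a toy model. This is a sketch only, not Ray's implementation: `ToyObjectStore`, `ToyTaskManager`, and the local `TaskCancelledError` class are hypothetical stand-ins invented for illustration, and the real fix would live in Ray Core's task management code.

```python
import threading


class TaskCancelledError(Exception):
    """Stand-in for ray.exceptions.TaskCancelledError in this toy model."""


class ToyObjectStore:
    """Maps object IDs to a value or an error; get() blocks until one is set."""

    def __init__(self):
        self._lock = threading.Lock()
        self._ready = {}    # object_id -> threading.Event
        self._results = {}  # object_id -> (value, error)

    def _event(self, object_id):
        with self._lock:
            return self._ready.setdefault(object_id, threading.Event())

    def put(self, object_id, value=None, error=None):
        with self._lock:
            self._results[object_id] = (value, error)
        self._event(object_id).set()

    def get(self, object_id, timeout=None):
        # Blocks until the object is resolved, like a get on a pending ref.
        if not self._event(object_id).wait(timeout):
            raise TimeoutError(f"{object_id} is still pending")
        value, error = self._results[object_id]
        if error is not None:
            raise error
        return value


class ToyTaskManager:
    """Tracks tasks that are waiting for their dependencies (hypothetical)."""

    def __init__(self, store):
        self._store = store
        self._waiting = {}  # task_id -> list of return object IDs

    def submit_with_unresolved_deps(self, task_id, return_ids):
        self._waiting[task_id] = list(return_ids)

    def cancel_noop(self, task_id):
        """Behavior described in the issue: nothing is executing, so do nothing."""

    def cancel_and_fail(self, task_id):
        """Proposed behavior: fail the task and every one of its return objects."""
        for object_id in self._waiting.pop(task_id, []):
            self._store.put(object_id, error=TaskCancelledError(task_id))
```

With `cancel_noop`, a later `store.get("obj1")` without a timeout would block forever because nothing ever resolves the object. With `cancel_and_fail`, the same call raises `TaskCancelledError` immediately, which is the outcome the issue asks for.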

Versions / Dependencies

master

Reproduction script

Run this script multiple times; it hangs about 70% of the time and passes the other 30%.

import ray
import pytest
import time

def test_cancel_pending_arg_running():
    with ray.init():
        @ray.remote(max_retries=-1)
        def wait_forever():
            print("wait_forever starting")
            while True:
                time.sleep(10000)

        @ray.remote(max_retries=-1)
        def has_deps(may_block):
            big = may_block
            return big * 2

        may_block = wait_forever.remote()
        ref = has_deps.remote(may_block)
        print(f"{ref=}")
        
        ready, not_ready = ray.wait([may_block, ref], timeout=1)
        assert not ready
        assert len(not_ready) == 2
        print(f"both are not ready: {may_block=}, {ref=}")

        # Now has_deps is pending on its argument (may_block). Cancel it.
        print("Cancelling")
        ray.cancel(ref)

        print("Getting")
        with pytest.raises(ray.exceptions.TaskCancelledError):
            ray.get(ref)  # This should raise TaskCancelledError, but hangs


if __name__ == "__main__":
    import os,sys

    if os.environ.get("PARALLEL_CI"):
        sys.exit(pytest.main(["-n", "auto", "--boxed", "-vs", __file__]))
    else:
        sys.exit(pytest.main(["-sv", __file__]))
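Because the hang is intermittent, it can help to run the script above in a loop with a per-run timeout and count the outcomes. The helper below is a minimal sketch using only the standard library; `run_repeatedly` is a hypothetical name, and treating a timeout as a hang is an assumption, so the timeout should be set well above the time a passing run needs.

```python
import subprocess
import sys


def run_repeatedly(cmd, runs=10, timeout_s=60.0):
    """Run cmd `runs` times and count passes, failures, and timeouts (hangs)."""
    counts = {"passed": 0, "failed": 0, "hung": 0}
    for _ in range(runs):
        try:
            result = subprocess.run(
                cmd,
                timeout=timeout_s,
                stdout=subprocess.DEVNULL,
                stderr=subprocess.DEVNULL,
            )
        except subprocess.TimeoutExpired:
            # subprocess.run kills the child before re-raising the timeout.
            counts["hung"] += 1
            continue
        counts["passed" if result.returncode == 0 else "failed"] += 1
    return counts
```

For example, if the script were saved as `repro.py` (a hypothetical filename), `run_repeatedly([sys.executable, "repro.py"], runs=10)` would return a dictionary of counts. Killing a hung run only terminates the driver process, so Ray processes started by `ray.init()` may be left behind and may need to be cleaned up between runs.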

Issue Severity

High: It blocks me from completing my task.

@rynewang added the bug and triage (Needs triage, e.g. priority, bug/not-bug, and owning component) labels on Jun 28, 2024
@jjyao added the core and P0 labels and removed the triage label on Jun 28, 2024