handle job log directory deleted for active task #6425

Closed
oliver-sanders opened this issue Oct 17, 2024 · 4 comments · Fixed by #6577

@oliver-sanders
Member

Spotted in the wild!

If you delete the job log directory for an active task, Cylc will preserve its last known status indefinitely. I.e., Cylc will consider the job to be submitted/running forever.

In this case it was caused by a housekeep task being triggered whilst other tasks in the cycle were still running. The housekeep task tarred up the log/job/<cycle> dir, removing the job status files in the process.

This situation should be handled similarly to the job no longer appearing in the queue, i.e., the job is dead, long live the job. Stick it into the failed/submit-failed state as appropriate.
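
As a rough illustration of "failed/submit-failed as appropriate", here is a minimal Python sketch of the mapping from a task's last known status to the state it would be given once its job can no longer be polled. The `Status` enum and `status_after_lost_job` are hypothetical names made up for this example; they are not Cylc's internal API, and the submitted/running mapping is an assumption based on how a job vanishing from the queue is described above.

```python
from enum import Enum


class Status(str, Enum):
    # Hypothetical status names for illustration only (not Cylc internals).
    SUBMITTED = "submitted"
    RUNNING = "running"
    SUBMIT_FAILED = "submit-failed"
    FAILED = "failed"


def status_after_lost_job(last_known: Status) -> Status:
    """Return the state to assign when a job's status can no longer be polled.

    Assumption: a job that never reported starting is treated as a
    submission failure; a job that had started is treated as failed.
    """
    if last_known is Status.SUBMITTED:
        return Status.SUBMIT_FAILED
    if last_known is Status.RUNNING:
        return Status.FAILED
    # Already in a final state: nothing to change.
    return last_known
```

The point of the sketch is only that the scheduler stops holding the last known status forever and instead picks the matching failure state.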

@oliver-sanders oliver-sanders added the bug Something is wrong :( label Oct 17, 2024
@oliver-sanders oliver-sanders added this to the 8.3.x milestone Oct 17, 2024
@MetRonnie MetRonnie modified the milestones: 8.4.x, 8.4.1 Jan 9, 2025
@wxtim wxtim self-assigned this Jan 23, 2025
@wxtim
Member

wxtim commented Jan 24, 2025

@oliver-sanders - do you have a recipe for reproducing this bug? From the description I tried:

[scheduling]
    cycling mode = integer
    [[graph]]
        R1 = task:started => housekeep

[runtime]
    [[task]]
        script = sleep 12
        platform = remote  # Tried this in case it were necessary
    [[housekeep]]
        script = """
            RMTHIS=${CYLC_WORKFLOW_RUN_DIR}/log/job/1/task
            echo "Housekeeping ${RMTHIS}"
            rm -fr "${RMTHIS}"
        """

@oliver-sanders
Member Author

No, I don't. Your example seems along the right lines; you might want to try using a batch system rather than background, as the polling is different.

@wxtim
Member

wxtim commented Jan 28, 2025

Finally have a reproducible example (thank you @oliver-sanders):

[scheduling]
    cycling mode = integer
    [[graph]]
        R1 = task

[runtime]
    [[task]]
        script = """
            rm ${CYLC_WORKFLOW_RUN_DIR}/.service/contact
            rm -r "${CYLC_WORKFLOW_RUN_DIR}/log/job/${CYLC_TASK_CYCLE_POINT}/${CYLC_TASK_NAME}"
        """
        platform = _remote_pbs

@MetRonnie
Member

Just to note from #6577:

> If you delete the job log directory for an active task, Cylc will preserve its last known status indefinitely. I.e, Cylc will consider the job to be submitted/running forever.

This does not seem to be true. It is only true if both the job log dir and the contact file are removed.
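
To make that condition concrete, here is a small Python sketch that checks whether a task is in the "both removed" situation. The function name `would_get_stuck` is hypothetical, and the paths are taken from the reproducer above (`.service/contact` and `log/job/<cycle>/<task>` under the run directory). It only inspects a local filesystem, so for a remote platform it would need to run against the copy of the run directory the job actually uses; that is an assumption of the sketch, not a statement about how Cylc checks this.

```python
from pathlib import Path


def would_get_stuck(run_dir: Path, cycle: str, task: str) -> bool:
    """Return True if both the contact file and the job log dir are missing.

    Illustrative only: mirrors the condition described in this thread,
    not Cylc's actual polling logic.
    """
    contact_file = run_dir / ".service" / "contact"
    job_log_dir = run_dir / "log" / "job" / cycle / task
    return not contact_file.exists() and not job_log_dir.exists()
```

For example, `would_get_stuck(Path(run_dir), "1", "task")` is true for the reproducer once its script has removed both paths, and false if either one is still present.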

Discussed today:

  • Although the job may yet succeed even if the job log dir is deleted, we have decided that it is best to put it in the failed state, as this is the best we can do if we can no longer poll it.
  • We will leave the job log retrieval as it is. It is user error to delete the job log dir prematurely, so users who do so will have to suffer the consequences.
