Complete execution control in the client API ? #296

Open
denisri opened this issue Oct 6, 2023 · 2 comments

Comments

@denisri
Collaborator

denisri commented Oct 6, 2023

In the engine API (which I'm trying to document a little bit, by the way), I see start(), stop(), run(), wait() methods, but I don't see stop() or kill(). Isn't there a way to force-stop a running workflow?
What about restart()?
If I understand the way it works, there is no "pending" state in the database: as soon as a workflow is inserted, its jobs can be queried by workers and start to run.
We must think about the case where a job fails for some reason but can be started again. Execution should then be restarted for all workflow jobs that depend on it. Could we change their status to "not started" so that they can run again?

@sapetnioc
Collaborator

No, there is no such method yet. Interrupting an ongoing job may not be easy, although it is not impossible. However, it is quite easy to restart a job. Internally, there are five lists of jobs in the database: ready, waiting, ongoing, done and failed. Each job is in exactly one list and moves to another during processing. The status of a job indicates which list it is in. When a job is created, it goes either to ready or to waiting, depending on whether it needs other jobs to finish before it can run. When a worker is not executing a job, it takes a job from ready, puts it in ongoing and starts processing it. When a job is finished, it goes either to done or to failed. If it succeeded, waiting jobs depending only on the finished job go to ready. If it failed, all dependent jobs go to failed.
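To make the transitions above concrete, here is a toy in-memory model of the five lists. It is only a sketch, not Capsul's actual code: the `JobLists` class, its method names and the dict-based storage are all made up for illustration, and I am assuming that a waiting job becomes ready once every job it depends on is done.

```python
from collections import defaultdict


class JobLists:
    """Toy model of the five job lists (hypothetical, not Capsul's implementation)."""

    def __init__(self):
        self.status = {}                    # job id -> name of the list it is in
        self.wait_for = {}                  # job id -> set of job ids it depends on
        self.dependents = defaultdict(set)  # job id -> job ids that depend on it

    def add(self, job, wait_for=()):
        """Insert a job: "waiting" if it has unfinished dependencies, else "ready"."""
        self.wait_for[job] = set(wait_for)
        for dep in wait_for:
            self.dependents[dep].add(job)
        pending = [d for d in wait_for if self.status.get(d) != "done"]
        self.status[job] = "waiting" if pending else "ready"

    def take(self):
        """A free worker moves one ready job to "ongoing" and returns its id."""
        for job, state in self.status.items():
            if state == "ready":
                self.status[job] = "ongoing"
                return job
        return None

    def finish(self, job, success):
        """Move a finished job to "done" or "failed" and update its dependents."""
        if success:
            self.status[job] = "done"
            # Assumption: a waiting job is released when all its dependencies are done.
            for child in self.dependents[job]:
                if self.status[child] == "waiting" and all(
                    self.status[d] == "done" for d in self.wait_for[child]
                ):
                    self.status[child] = "ready"
        else:
            self.status[job] = "failed"
            # Failure propagates to every direct and indirect dependent.
            stack = list(self.dependents[job])
            while stack:
                child = stack.pop()
                if self.status[child] != "failed":
                    self.status[child] = "failed"
                    stack.extend(self.dependents[child])


jobs = JobLists()
jobs.add("a")
jobs.add("b", wait_for=["a"])
running = jobs.take()            # the worker picks "a"
jobs.finish(running, success=True)
print(jobs.status)               # "b" has moved from waiting to ready
```

In the real engine these moves happen inside the database, so this only shows the bookkeeping, not the concurrency handling.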

Restarting a failed job can be done by putting it back in ready and moving all its dependent jobs from failed to waiting. This has to be an atomic database operation, therefore it has to be done in Lua. The first step would be to define the user API, including the information we would like to add to a restarted job (for instance, it would be easy to add an execution count for each job). Then I can do the implementation on my own or with you, so that you get a better knowledge of the Capsul v3 internal structure.
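As a starting point for the API discussion, the restart logic could look roughly like the sketch below. Everything here is hypothetical: `restart_job`, the `status` and `dependents` dicts and the optional `restart_count` stand in for whatever the database really stores. It is plain Python, so it shows the state changes only; it does nothing about atomicity, which would still need the Lua implementation.

```python
def restart_job(status, dependents, job, restart_count=None):
    """Put a failed job back in "ready" and its failed dependents in "waiting".

    status:        dict, job id -> list name (hypothetical stand-in for the database)
    dependents:    dict, job id -> iterable of job ids that depend on it
    restart_count: optional dict, job id -> number of restarts (illustrative only)

    Returns the list of dependent jobs that were moved to "waiting".
    """
    if status.get(job) != "failed":
        raise ValueError(f"job {job!r} is not in the failed list")

    status[job] = "ready"
    if restart_count is not None:
        restart_count[job] = restart_count.get(job, 0) + 1

    moved = []
    seen = set()
    stack = list(dependents.get(job, ()))
    while stack:
        child = stack.pop()
        if child in seen:
            continue
        seen.add(child)
        # Only jobs currently in "failed" are touched; they go back to "waiting".
        if status.get(child) == "failed":
            status[child] = "waiting"
            moved.append(child)
            stack.extend(dependents.get(child, ()))
    return moved


status = {"a": "done", "b": "failed", "c": "failed"}
dependents = {"a": ["b"], "b": ["c"]}
counts = {}
restart_job(status, dependents, "b", restart_count=counts)
print(status, counts)
```

A dependent that also depends on another failed job simply stays in waiting until that other job is restarted and succeeds.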

@denisri
Collaborator Author

denisri commented Oct 9, 2023

OK, let's do that.
The "interrupt job" feature might become mandatory in the future, to kill stuck jobs, long jobs submitted by mistake with wrong parameters, etc. If workers run jobs as sub-processes (OS processes), there is obviously a way to kill them (if not the worker itself). If not, it's more complex, I know. Killing workers should also be an option, in order to completely shut down an engine.
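For the sub-process case, the usual pattern with Python's standard `subprocess` module is to ask the process to terminate and kill it only if it does not exit in time. This is a generic sketch under the assumption that a worker holds a `subprocess.Popen` object for the job; `stop_process` and the `grace` delay are hypothetical names, not part of Capsul.

```python
import subprocess
import sys


def stop_process(proc, grace=5.0):
    """Ask a child process to exit, then kill it if it is still alive after `grace` seconds."""
    proc.terminate()  # SIGTERM on POSIX, TerminateProcess() on Windows
    try:
        return proc.wait(timeout=grace)
    except subprocess.TimeoutExpired:
        proc.kill()  # SIGKILL on POSIX
        return proc.wait()


# Stand-in for a stuck job: a child Python process that sleeps for a minute.
job = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
exit_code = stop_process(job, grace=2.0)
print("job stopped with exit code", exit_code)
```

If a job spawns children of its own, killing only the top-level process may leave them running, so a real implementation would probably need process groups or something equivalent.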
