`verdi process kill` takes long time for erroneous processes #6524
Comments
I think it should still say that it is in the waiting state, although it would be better if it was updated with the number of failures. So ideally it would show:
or something like that. Once it hits the maximum number of retries (5 by default) it will go in the paused state. The reason for this functionality is that this way transient problems, such as failing to connect due to connection problems, won't fully except the calcjob. Once the problem is fixed, it can be resumed again.
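The retry behaviour described above can be sketched roughly as follows. This is only an illustration of the idea (retry with a growing wait, give up after a maximum number of attempts so the process can be paused), not the actual aiida-core code. The names `retry_with_backoff`, `PausedAfterRetries`, `make_flaky_task` and `run`, as well as the very short intervals, are assumptions made for the example.

```python
import asyncio


class PausedAfterRetries(Exception):
    """Hypothetical signal: every attempt failed, so the caller would pause the process."""


async def retry_with_backoff(task, initial_interval=0.01, max_attempts=5):
    """Await ``task()`` until it succeeds, doubling the wait after each failure."""
    interval = initial_interval
    for attempt in range(1, max_attempts + 1):
        try:
            return await task()
        except Exception as exc:
            if attempt == max_attempts:
                # Out of attempts: stop retrying instead of failing the job for good.
                raise PausedAfterRetries(f"failed {attempt} times") from exc
            # Transient problem: wait, then try again with a longer interval.
            await asyncio.sleep(interval)
            interval *= 2


def make_flaky_task(failures):
    """Build a hypothetical task that fails ``failures`` times, then succeeds."""
    calls = {"count": 0}

    async def task():
        calls["count"] += 1
        if calls["count"] <= failures:
            raise ConnectionError("transient connection problem")
        return calls["count"]

    return task


def run(failures):
    """Run the retry loop against a task that fails ``failures`` times."""
    return asyncio.run(retry_with_backoff(make_flaky_task(failures)))
```

With two transient failures, `run(2)` succeeds on the third call. With five failures, the fifth attempt raises `PausedAfterRetries`, which is the point where the real engine would pause the process rather than let it except.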
The real problem I suspect here is in the handling of the
Since the process is already waiting for the next try of the failed upload task, the RPC to call |
Thanks for the clarification, @sphuber, and sorry for the slow reply.
Actually, that makes sense, yes.
Pinging @khsrali here for one example how and why RMQ is used, interesting!
Will look into this! |
I just faced this issue again, in a different scenario, but apparently the same cause (connection): Steps to reproduce:
Note: In my particular case,
And the following reports that no job between RMQ and the daemon was found:
verdi daemon stop
verdi process repair
verdi daemon start
There is just no way to get rid of these parasitic jobs. [UPDATE: it is still possible to delete a running process node; I'm not sure what consequences it might have] |
Hold on, how can you do |
Ah sorry, I had one other profile in my script than the one I deleted, stupid 🤣 Anyways, the bug is reproducible. Note: @sphuber What is the consequence of deleting a process node instead of killing it? |
I added breakpoints here and there, and it seems the point where it gets stuck is here:
|
Remember that a node is just a proxy for the state of a process. When a process is running, it just persists its state on the node in the database. What actually determines the lifetime of a process is its associated task with RabbitMQ. If you delete the node, the task is not (immediately) affected. If the task was not yet with a daemon worker, the task will be destroyed as soon as a daemon worker picks it up, as it won't be able to load the associated node. If the process was already running with a worker, it will hit an exception as soon as it tries to store a change in state on the node as it no longer exists. At that point the process task is destroyed. So, in short: you can delete a process node and if it had an existing task with RMQ, it will eventually be destroyed. NOTE: if it is a
I don't understand what you mean with "parasitic jobs" here.
In your test case here, the problem is that you have deleted the node and most likely by the time that you call |
Thanks a lot for explanation. Very nice.
sorry, a wrong term.
I see. To understand this, I tried again, this time without deleting any nodes, and the problem is still there. Steps to reproduce:
|
Ok, that is a good concrete example. What is happening here is an exception that happens during one of the
When the process is paused, I think the code may contain a bug where the kill action can never actually be called because the process needs to be playing for that to happen. But I am not 100% sure; we would have to check. In your example, though, it looks like the process hasn't hit the 5 retries yet and is just waiting for the next try. Perhaps something here is also already preventing the kill command from being executed. Again, I am not sure exactly what, but it is certainly plausible.
In both cases, we have to verify what (if anything) is blocking the RPC from being executed and, once we find it, find a way to let the RPC take precedence and still go through, despite the process being handled by the EBM. I am not sure when I will have time to look into this myself, but hopefully this gives you more information to start digging into how this works yourself. |
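One possible way to let a kill request take precedence over a pending retry is to make the backoff wait interruptible. The sketch below shows the general pattern with plain `asyncio`. It is not how plumpy or aiida-core implement this, and `wait_or_kill`, `demo` and the `kill_event` object are hypothetical names used only for illustration.

```python
import asyncio


async def wait_or_kill(interval, kill_event):
    """Wait up to ``interval`` seconds, returning early if a kill is requested.

    Returns True if the kill arrived first, False if the full interval elapsed.
    """
    try:
        await asyncio.wait_for(kill_event.wait(), timeout=interval)
        return True
    except asyncio.TimeoutError:
        return False


async def demo(kill_after, interval):
    """Simulate a kill request arriving ``kill_after`` seconds into the wait."""
    kill_event = asyncio.Event()
    if kill_after is not None:
        # Stand-in for the RPC handler setting the flag while the retry is pending.
        asyncio.get_running_loop().call_later(kill_after, kill_event.set)
    return await wait_or_kill(interval, kill_event)
```

In this pattern the retry loop would check the return value after each wait and stop retrying as soon as it is `True`, so the kill does not have to wait for the next scheduled attempt.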
As noted by @khsrali and me when trying out the new FirecREST implementation. So maybe not the most general example, but this is where it occurred for me: Say, one installs `aiida-firecrest` and submits a job using that, but forgets to run `verdi daemon restart` before, leading to the requested transport plugin not being available. `verdi daemon logshow` shows the expected exception (though it doesn't actually matter) and the job is stuck in the state
⏵ Waiting Waiting for transport task: upload
as expected. (though, actually, here it could also show that the job excepted, rather than being stuck in the waiting state?)
Now, running `verdi process kill` leads to the command being stuck for minutes on end, while `verdi daemon logshow` gives the following output
showing that `plumpy` reports that it has killed the process, but it's not being picked up by the daemon. Opening this issue for future reference. Please correct me if I got something wrong, @khsrali, or if you have another, simpler and more general example where this occurs.
EDIT: To add here, running `verdi daemon stop` then leads to an update in `verdi daemon logshow`
and starting the daemon again then leads to the process actually entering the `QUEUED` state (rather than being killed), so the `verdi process kill`
command is somewhat lost.