[14.0][IMP] queue_job: cancel job before it retries #773

florentx · 2025-06-02T07:26:32Z

Use case:

A job hangs because of concurrent access in the database, and it prevents other jobs from running.
After some time it fails with a retryable error.
It is retried shortly afterwards, and this cycle can continue for some time.
Since the "Cancel" button is not allowed on running jobs, the operator cannot cancel this job.

This patch allows running jobs to be cancelled:

The job is started, then the operator chooses to cancel it (because it has already been retried 2 times, for example, and is blocking other jobs).
When the button is pressed, the job record is set to "Cancelled".
The running job is not interrupted.
If the job ends successfully, its state is saved as "Done".
If it fails with an error, its state stays "Cancelled" and it is not retried.
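
The state transitions described above can be sketched in plain Python. This is only an illustrative simulation, not the patch itself: `JobRecord`, `RetryableError` and `run_job` are hypothetical names, and the real module stores the state on an Odoo record rather than on a dataclass.

```python
from dataclasses import dataclass


class RetryableError(Exception):
    """Hypothetical stand-in for a retryable job error."""


@dataclass
class JobRecord:
    # Simplified state field; values mirror the states named in this PR.
    state: str = "pending"


def run_job(record, work, cancel_while_running=False):
    """Run `work` and return the final state of `record`."""
    record.state = "started"
    if cancel_while_running:
        # The operator presses Cancel; the running work is not interrupted.
        record.state = "cancelled"
    try:
        work()
    except RetryableError:
        # A cancelled job stays cancelled; otherwise it goes back to the queue.
        if record.state != "cancelled":
            record.state = "pending"
        return record.state
    # A job that finishes successfully is saved as done, even if cancelled.
    record.state = "done"
    return record.state


def succeeds():
    return None


def fails():
    raise RetryableError("could not serialize access")
```

Under these assumptions, cancelling a job that later fails leaves it in "cancelled", while cancelling a job that completes still ends in "done", which is the behaviour questioned later in the discussion.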

OCA-git-bot · 2025-06-02T07:26:36Z

Hi @guewen,
some modules you are maintaining are being modified, check this out!

guewen

Makes sense, thanks!

sbidoul

I'm not convinced we should do this.

From a UX perspective this is not cancelling the job; it merely prevents it from being retried. Could we have something doing that explicitly? Such as limiting the retry count?

As a side note, I still think the concept of cancelling a job is unclear. In my mind, a job is conceptually part of the transaction that created it, and we expect it to be eventually completed. That is why there was also a Set Done button initially.

Also, last time I checked it did not work with job graphs.

florentx · 2025-06-03T06:40:17Z

I understand that it is not a perfect solution; however, it helps in real situations.

Using the Cancel action is a last-resort solution when there is a production issue. It is not so important that the job finishes running (trying) before it is really cancelled.

Moreover, one of the cases tackled here is not subject to the max_retries parameter, so an alternative based on limiting retries would not work.
These errors are PostgreSQL concurrency errors, like SERIALIZATION_FAILURE. They are retried indefinitely.

In my opinion, the Cancel button should be a convenient tool for the system admin who needs to unblock production.

Then, when the system load goes down, the administrator can easily find the "cancelled" jobs and use the "Requeue" button to process them, one at a time.
With the "Set to done" button, it is not so easy to identify which jobs should be requeued later (or dealt with in a different manner).

sbidoul · 2025-06-03T08:09:27Z

But if I understand correctly, in your use case, you could cancel a running job, and it could still end up in done state? Is that not counterintuitive?

> These errors are PostgreSQL concurrency errors, like SERIALIZATION_FAILURE. They are retried indefinitely.

I did not know these were retried indefinitely. Is that intended? Shouldn't we address that instead?

florentx · 2025-06-03T08:32:21Z

> But if I understand correctly, in your use case, you could cancel a running job, and it could still end up in done state? Is that not counterintuitive?

Yes, if the job has just started, it can finish successfully even if the operator presses the button to cancel.
This was also the case previously, because the job state was not re-checked before setting "Cancelled".
In my opinion, this behavior is acceptable, since the Cancel button is for abnormal situations only.

> I did not know these were retried indefinitely. Is that intended? Shouldn't we address that instead?

Regarding the infinite retry, code is here:
https://github.com/OCA/queue/blob/18.0/queue_job/controllers/main.py#L115-L122

Maybe it's legitimate to keep retrying: if the database load decreases later, the job can terminate successfully. I don't know.
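
To make the difference concrete, here is a minimal simulation of a job that fails on every attempt. It is a sketch under stated assumptions, not the code behind the link above: `attempts_until_failure`, the `counted` flag and `safety_cap` are hypothetical, and the exact counting rule in queue_job may differ.

```python
def attempts_until_failure(max_retries, counted, safety_cap=1000):
    """Return the attempt on which the job is marked failed, or None.

    Every attempt raises a retryable error in this simulation.
    `counted` says whether the error counts toward `max_retries`.
    `safety_cap` only exists so the simulation itself terminates.
    """
    retry = 0
    for attempt in range(1, safety_cap + 1):
        if counted:
            retry += 1
            if retry >= max_retries:
                return attempt
        # An uncounted error is simply postponed and tried again.
    return None
```

With `counted=True` the job is marked failed once the limit is reached; with `counted=False` the loop only stops because of the artificial cap, which is the "retried indefinitely" situation discussed here.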

amh-mw · 2025-06-03T11:43:57Z

> I did not know these were retried indefinitely. Is that intended? Shouldn't we address that instead?
>
> Regarding the infinite retry, code is here: https://github.com/OCA/queue/blob/18.0/queue_job/controllers/main.py#L115-L122
>
> Maybe it's legit to keep retrying, if the database load decreases later, and it can terminate successfully. I don't know.

TIL. I have been catching SerializationFailure and rethrowing RetryableJobError myself for years! I have never seen a job actually make it all the way to "infinite" retries. Even my most congested jobs running in tens of thousands with a couple dozen workers in parallel still resolve after my retry_pattern spaces them out.

Alternate perspective: Would it make more sense to have the job run out of retries and fail (with some sort of notification), then an admin could restart the job later, rather than have to recover it in real time?
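
A rough sketch of that alternative, assuming a retry pattern that maps a retry count to a delay in seconds and a hard limit after which the job fails and someone is notified. The names `retry_delay`, `on_retryable_error` and `notify`, the default delay and the example values are all hypothetical; this is not queue_job's implementation.

```python
def retry_delay(retry_count, retry_pattern, default=600):
    """Pick the delay of the highest threshold that `retry_count` has reached."""
    delay = default
    for threshold, seconds in sorted(retry_pattern.items()):
        if retry_count >= threshold:
            delay = seconds
        else:
            break
    return delay


def on_retryable_error(retry_count, max_retries, retry_pattern, notify):
    """Return (new_state, delay_in_seconds) after a retryable error."""
    if retry_count >= max_retries:
        # Out of retries: fail the job and tell someone about it.
        notify(f"job failed after {retry_count} attempts")
        return "failed", None
    # Otherwise postpone the job, spacing retries out further over time.
    return "pending", retry_delay(retry_count, retry_pattern)


# Example pattern: 10 s for the first retries, 60 s from the 3rd, 300 s from the 5th.
pattern = {1: 10, 3: 60, 5: 300}
messages = []
early = on_retryable_error(2, 5, pattern, messages.append)
final = on_retryable_error(5, 5, pattern, messages.append)
```

In this sketch a failed job stays in "failed" until an admin restarts it, so nothing has to be recovered in real time.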

github-actions · 2025-10-19T12:39:17Z

There hasn't been any activity on this pull request in the past 4 months, so it has been marked as stale and it will be closed automatically if no further activity occurs in the next 30 days.
If you want this PR to never become stale, please ask a PSC member to apply the "no stale" label.

florentx changed the title ~~[IMP] queue_job: cancel job before it retries~~ [14.0][IMP] queue_job: cancel job before it retries Jun 2, 2025

florentx mentioned this pull request Jun 2, 2025

[18.0][UPD] Forward port changes from 14.0 #776

Merged

guewen approved these changes Jun 2, 2025

twalter-c2c approved these changes Jun 2, 2025

OCA-git-bot added the approved label Jun 2, 2025

sbidoul requested changes Jun 3, 2025

OCA-git-bot removed the approved label Jun 3, 2025

florentx force-pushed the 14.0_enh_job_cancel branch from dc2571c to ca3e0dd Compare June 3, 2025 12:55

[IMP] queue_job: cancel job before it retries

b7bcdc3

florentx force-pushed the 14.0_enh_job_cancel branch from ca3e0dd to b7bcdc3 Compare June 18, 2025 12:35

github-actions bot added the stale label Oct 19, 2025

github-actions bot closed this Nov 23, 2025
