Skip to content

Conversation

florentx
Copy link
Contributor

@florentx florentx commented Jun 2, 2025

Use case:

  • a job hangs due to some concurrent access in the database, it prevents other job from running
  • after some time it fails with retryable error
  • it is retried short time after, and it can continue like this for some time
  • since the button "Cancel" is not authorized on running jobs, the operator cannot cancel this job

This patch allows to Cancel running jobs:

  • job is started, then operator chooses to cancel it (because it already retried 2 times, for example, and is blocking other jobs)
  • when button is pressed, job record is set to "Cancelled"
  • job will not be interrupted
  • if job ends successfully: its state is saved to "Done"
  • if it fails with an error, its state stays "Cancelled" and it is not retried

@OCA-git-bot
Copy link
Contributor

Hi @guewen,
some modules you are maintaining are being modified, check this out!

@florentx florentx changed the title [IMP] queue_job: cancel job before it retries [14.0][IMP] queue_job: cancel job before it retries Jun 2, 2025
Copy link
Member

@guewen guewen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense, thanks!

Copy link
Member

@sbidoul sbidoul left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not convinced we should do this.

From a UX perspective this is not cancelling the job, it merely prevents it to be retried. Could we have something doing that explicitly? Such as limiting the retry count?

As a side note, I still think the concept of canceling job is unclear. In my mind, a job is conceptually part of the transaction that created it and we expect it to be eventually completed. That is why there was also a Set Done button initially.

Also, last time I checked it did not work with job graphs.

@florentx
Copy link
Contributor Author

florentx commented Jun 3, 2025

I understand that it is not perfect solution, however it helps in real situations.

Using action Cancel is a last resort solution when there's some Production issue. It is not so important that the job will finish running (trying) before it is really cancelled.

Moreover, one of the case which is tackled here does not comply with the max_retries parameter. So an alternative solution will not work.
These errors are PostgreSQL concurrency errors, like SERIALIZATION_FAILURE. They are retried indefinitely.

In my opinion this button Cancel should be convenient for the system admin which needs some tool to unblock Production.

Then when system load goes down, then Administrator can find easily the "cancelled" jobs and use button "Requeue" in order to process them, one at a time.
With button "set to done", it is not so easy to identify which ones should be requeued later (or dealt with in a different manner).

@sbidoul
Copy link
Member

sbidoul commented Jun 3, 2025

But if I understand correctly, in your use case, you could cancel a running job, and it could still end up in done state? Is that not counter intuitive?

These errors are PostgreSQL concurrency errors, like SERIALIZATION_FAILURE. They are retried indefinitely.

I did not know these were retried indefinitely. Is that intended? Shouldn't we address that instead?

@florentx
Copy link
Contributor Author

florentx commented Jun 3, 2025

But if I understand correctly, in your use case, you could cancel a running job, and it could still end up in done state? Is that not counter intuitive?

Yes, if the job is just started, it can finish successfully, even if operator presses button to cancel.
This was also the case previously, because Job state was not re-checked before setting "Cancelled".
In my opinion, this behavior is acceptable, since button Cancel is for abnormal situations only.

I did not know these were retried indefinitely. Is that intended? Shouldn't we address that instead?

Regarding the infinite retry, code is here:
https://github.com/OCA/queue/blob/18.0/queue_job/controllers/main.py#L115-L122

Maybe it's legit to keep retrying, if the database load decreases later, and it can terminate successfully. I don't know.

@amh-mw
Copy link
Member

amh-mw commented Jun 3, 2025

I did not know these were retried indefinitely. Is that intended? Shouldn't we address that instead?

Regarding the infinite retry, code is here: https://github.com/OCA/queue/blob/18.0/queue_job/controllers/main.py#L115-L122

Maybe it's legit to keep retrying, if the database load decreases later, and it can terminate successfully. I don't know.

TIL. I have been catching SerializationFailure and rethrowing RetryableJobError myself for years! I have never seen a job actually make it all the way to "infinite" retries. Even my most congested jobs running in tens of thousands with a couple dozen workers in parallel still resolve after my retry_pattern spaces them out.

Alternate perspective: Would it make more sense to have the job run out of retries and fail (with some sort of notification), then an admin could restart the job later, rather than have to recover it in real time?

@florentx florentx force-pushed the 14.0_enh_job_cancel branch from dc2571c to ca3e0dd Compare June 3, 2025 12:55
@florentx florentx force-pushed the 14.0_enh_job_cancel branch from ca3e0dd to b7bcdc3 Compare June 18, 2025 12:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants