Possible Duplicate Job Execution in Cluster After DB Connection loss and resume #585

KBA2024 · 2024-12-17T18:15:37Z

I am wondering how db-scheduler behaves in the following scenario:

Node A starts consuming a job but loses the database connection, so its heartbeat stops.
The lock expires, and Node B starts processing the same job.
Node A regains the connection and commits the transaction.
Does this lead to the job being processed twice, once by Node A and once by Node B?
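
The sequence above can be reproduced with a small in-memory model. This is only an illustrative sketch: the class, the tick-based clock, and the `DEAD_AFTER` threshold are hypothetical and are not part of db-scheduler.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical in-memory model of one job row; not db-scheduler code. */
final class DuplicateRunSimulation {
    // Assumed: ticks without a heartbeat before a lock counts as dead.
    static final long DEAD_AFTER = 10;

    private String pickedBy;    // node holding the lock, or null if free
    private long lastHeartbeat; // logical time of the most recent heartbeat
    private final List<String> completedBy = new ArrayList<>(); // who ran the side effect

    /** A node may pick the job if it is free or its lock has expired. */
    boolean tryPick(String node, long now) {
        boolean expired = pickedBy != null && now - lastHeartbeat > DEAD_AFTER;
        if (pickedBy == null || expired) {
            pickedBy = node;
            lastHeartbeat = now;
            return true;
        }
        return false;
    }

    /** Commits the work without checking who currently owns the lock. */
    void commit(String node) {
        completedBy.add(node);
    }

    static List<String> run() {
        DuplicateRunSimulation job = new DuplicateRunSimulation();
        job.tryPick("A", 0);        // Node A picks the job, then stops heartbeating
        if (job.tryPick("B", 5)) {  // lock has not expired yet, so Node B is refused
            job.commit("B");
        }
        if (job.tryPick("B", 11)) { // lock expired: Node B takes over and runs the job
            job.commit("B");
        }
        job.commit("A");            // Node A regains its connection and commits too
        return job.completedBy;
    }
}
```

In this model `run()` records a commit from both nodes, because nothing stops Node A from committing after it has lost the lock.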

kagkarlsson · 2024-12-18T10:27:51Z

kagkarlsson
Dec 18, 2024
Maintainer

Node A regains the connection and commits the transaction.

It is a bit unlikely that Node A just resumes working, i.e. that it sits idle waiting for a healthy connection.
But should that happen, then yes, the code will run twice.

You can choose to not allow reviving dead executions, or define some special handler for them by overriding DeadExecutionHandler for the Task.
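
The two options can be pictured as two policies for a dead execution. The sketch below is an assumption-heavy illustration: `DeadExecutionOps`, `DeadExecutionPolicy`, and `DeadExecutionPolicies` are hypothetical stand-ins written for this example, not db-scheduler's own types, so check the library's `DeadExecutionHandler` for the real method signatures.

```java
import java.util.List;

/** Hypothetical stand-in for the operations available on a dead execution. */
interface DeadExecutionOps {
    void reschedule(String instanceId); // make the execution due again
    void remove(String instanceId);     // drop the execution so nobody re-runs it
}

/** Hypothetical stand-in for a per-task dead-execution policy. */
interface DeadExecutionPolicy {
    void onDead(String instanceId, DeadExecutionOps ops);
}

final class DeadExecutionPolicies {
    /** Revive: the job runs again, accepting that it may already have run in part. */
    static DeadExecutionPolicy revive() {
        return (instanceId, ops) -> ops.reschedule(instanceId);
    }

    /** Do not revive: record the instance for manual review, then remove it. */
    static DeadExecutionPolicy cancelAndReport(List<String> needsReview) {
        return (instanceId, ops) -> {
            needsReview.add(instanceId);
            ops.remove(instanceId);
        };
    }
}
```

For a job that must not run twice, the second policy trades automatic recovery for a manual decision about whether the work actually completed.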

3 replies

KBA2024 Dec 18, 2024
Author

Can I, just before committing the transaction, run an additional query to make sure that the current node still owns the lock, or is that not good practice?

kagkarlsson Dec 18, 2024
Maintainer

If you lose the db-connection to such a degree that heartbeating also stops (it runs in another thread on another db-connection), then most likely the ongoing transaction will never commit, since the app probably needs to re-establish all database connections.

There are scenarios where it still might happen, like stop-the-world events (GC or similar) with a duration longer than the limit for dead-execution discovery. You could in theory fetch the execution from the database before committing and check that the version number is unchanged. I don't know of anyone doing this though (I think it is overly paranoid), but you could.

(The version number is incremented on most updates to the execution, e.g. rescheduling/reviving.)
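
A minimal sketch of such a pre-commit check, in plain JDBC, might look like the following. The table and column names (`scheduled_tasks`, `task_name`, `task_instance`, `version`) are assumptions about the schema in use, and `VersionLookup` and `ExecutionVersionGuard` are hypothetical names introduced for this example. How the handler obtains the version it saw when the execution was picked is also left as an assumption.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.OptionalLong;

/** Returns the stored version of an execution, or empty if the row is gone. */
interface VersionLookup {
    OptionalLong currentVersion(String taskName, String taskInstance);
}

final class ExecutionVersionGuard {
    // Assumed table and column names; adjust them to the schema you actually run.
    static final String SELECT_VERSION =
        "SELECT version FROM scheduled_tasks WHERE task_name = ? AND task_instance = ?";

    /** JDBC-backed lookup that reads the version through the given connection. */
    static VersionLookup jdbc(Connection connection) {
        return (taskName, taskInstance) -> {
            try (PreparedStatement ps = connection.prepareStatement(SELECT_VERSION)) {
                ps.setString(1, taskName);
                ps.setString(2, taskInstance);
                try (ResultSet rs = ps.executeQuery()) {
                    return rs.next() ? OptionalLong.of(rs.getLong(1)) : OptionalLong.empty();
                }
            } catch (SQLException e) {
                throw new IllegalStateException("Could not read execution version", e);
            }
        };
    }

    /** True only if the row still exists and its version equals the one seen at pick time. */
    static boolean versionUnchanged(VersionLookup lookup, String taskName,
                                    String taskInstance, long versionWhenPicked) {
        OptionalLong current = lookup.currentVersion(taskName, taskInstance);
        return current.isPresent() && current.getAsLong() == versionWhenPicked;
    }
}
```

The caller would roll back instead of committing when `versionUnchanged` returns `false`. Note that a plain read narrows the window rather than closing it: another node could still update the row between the check and the commit unless the read also locks the row.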

Answer selected by KBA2024

KBA2024 Dec 18, 2024
Author

I agree with you that this is a corner case, but I have a critical job where I must ensure it never runs twice under any circumstances (as it involves payment processing, and duplicate execution could lead to double payments).

The version-check approach you mentioned sounds promising, as it resembles the @Version column mechanism in JPA, which unfortunately I can't leverage in my current setup.

Thanks
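
For readers without JPA, the `@Version` idea mentioned above boils down to an update that only succeeds when the stored version still matches the one that was read. The following is a minimal sketch under stated assumptions: `VersionedRow` is a hypothetical in-memory stand-in for a database row, and the table `t` in the comment is made up for illustration.

```java
/** Hypothetical in-memory row that mimics a version-guarded UPDATE. */
final class VersionedRow {
    private String value;
    private long version;

    VersionedRow(String value, long version) {
        this.value = value;
        this.version = version;
    }

    /**
     * Mirrors: UPDATE t SET value = ?, version = version + 1 WHERE id = ? AND version = ?
     * Returns true if the update was applied, false if the version had moved on.
     */
    synchronized boolean updateIfVersion(long expectedVersion, String newValue) {
        if (version != expectedVersion) {
            return false; // another writer updated the row first
        }
        value = newValue;
        version++;
        return true;
    }

    synchronized String value() {
        return value;
    }

    synchronized long version() {
        return version;
    }
}
```

In SQL the same check is done by inspecting the affected-row count of the conditional `UPDATE`: a count of zero means the version changed and the transaction should be rolled back.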
