Multiple Airflow Schedulers Processing Same Dagruns? #53470
JakeKandell asked this question in Q&A (unanswered, 0 replies)
Hi all, we run a large Airflow cluster and recently encountered some strange scheduling issues. We have a distributed Kubernetes setup with multiple scheduler pods and a MySQL DB on Airflow 2.9.2.
In the scheduler logs, we see multiple schedulers processing the same finished dagrun simultaneously. They all try to mark this dagrun as failed.
Furthermore, some of the schedulers report `active_runs=1`. Other schedulers try to set the `next_dagrun` (meaning `active_runs` was correctly identified as 0).
We believe this may have led to a race condition which caused `next_dagrun_create_after` to be set to NULL, preventing this DAG from being properly scheduled moving forward.
Here are the full logs for this DAG.
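For anyone wanting to check whether other DAGs are in the same state, a query along these lines might help. This is only a sketch: the column names (`dag_id`, `is_paused`, `is_active`, `next_dagrun_create_after`) follow Airflow's `dag` table, but the helper names are hypothetical, and an in-memory SQLite table stands in for the real MySQL metadata DB so the snippet runs on its own. A NULL value can also be legitimate (for example, a DAG with no schedule), so the results are candidates to inspect, not confirmed failures.

```python
import sqlite3

# Active, unpaused DAGs with no "create after" timestamp. The scheduler
# only creates new runs for DAGs where this column is set, so these are
# the DAGs that will not get new scheduled runs.
STUCK_DAG_QUERY = """
SELECT dag_id
FROM dag
WHERE is_paused = 0
  AND is_active = 1
  AND next_dagrun_create_after IS NULL
ORDER BY dag_id
"""


def find_stuck_dags(conn):
    """Return dag_ids that match STUCK_DAG_QUERY (hypothetical helper)."""
    return [row[0] for row in conn.execute(STUCK_DAG_QUERY)]


def demo_connection():
    """Build a tiny stand-in for the `dag` table with made-up rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE dag ("
        "dag_id TEXT PRIMARY KEY, is_paused INTEGER, is_active INTEGER, "
        "next_dagrun TEXT, next_dagrun_create_after TEXT)"
    )
    conn.executemany(
        "INSERT INTO dag VALUES (?, ?, ?, ?, ?)",
        [
            ("healthy_dag", 0, 1, "2024-07-01 00:00:00", "2024-07-02 00:00:00"),
            ("stuck_dag", 0, 1, "2024-07-01 00:00:00", None),
            ("paused_dag", 1, 1, None, None),
        ],
    )
    return conn


if __name__ == "__main__":
    print(find_stuck_dags(demo_connection()))
```

Against a real deployment, the same SQL text could be run directly in a MySQL client instead of going through SQLite.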
From my review of the code, I see that the dagrun rows are supposed to be locked when selected, so it's not clear to me how the same dagrun is being processed by different schedulers. I haven't been able to find a way to reproduce this in a test cluster.
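One thing that may be worth ruling out: the row locking relies on `SELECT ... FOR UPDATE SKIP LOCKED`, which MySQL only supports from 8.0.1 onward, and Airflow has a `use_row_level_locking` option in the `[scheduler]` config section that must be enabled for the locks to be taken. The sketch below checks a version string such as the one returned by `SELECT VERSION()`. The function name is hypothetical, and it assumes a plain MySQL version string (not MariaDB).

```python
import re


def supports_skip_locked(version: str) -> bool:
    """Return True if a MySQL version string is 8.0.1 or newer.

    Hypothetical helper: assumes the string starts with
    "major.minor.patch", as in "8.0.36" or "5.7.44-log".
    """
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parsed = tuple(int(part) for part in match.groups())
    # SKIP LOCKED and NOWAIT were introduced in MySQL 8.0.1.
    return parsed >= (8, 0, 1)


if __name__ == "__main__":
    for candidate in ("8.0.36", "5.7.44-log"):
        print(candidate, supports_skip_locked(candidate))
```

If both the version and the config option check out, locking is probably not simply disabled, and the cause is more likely somewhere in the scheduler logic itself.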
Has anyone experienced this? Do we know how this can happen?