Create a celery worker liveness probe that will:

- Honor long-running tasks (it will not deem celery unhealthy while it is processing tasks).
- Always restart the pod/container when celery is no longer healthy / not processing any tasks, for whatever reason. "Error: No nodes replied within time constraint" should definitely NOT update /app/tmp/celery_worker_heartbeat.
Problem description:
There have been multiple incidents relating to the issue where celery stops processing tasks, see celery/celery#7276. The issue was first reported for openforms (open-formulieren/open-forms#2927), and a liveness probe health check was created to work around it (open-formulieren/open-forms#3014). This health check has now been implemented in all components using celery.
Unfortunately, the health check does not always work, deeming the celery worker "healthy" while it is not. We have been unable to replicate this consistently; it seems to happen during cluster reboots or full re-deployments, and it happens quite regularly for objecten.
What we have observed is that the celery worker is no longer processing tasks, confirmed by running a celery inspect command (all celery commands fail with this error):

celery inspect active
Error: No nodes replied within time constraint
Meanwhile, the health check continues to update the celery_worker_heartbeat file in /app/tmp/. The celery worker is therefore deemed healthy and the pod is not restarted, requiring manual intervention once the issue is detected by a user.
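For context, a heartbeat-style probe of this kind generally boils down to checking that the heartbeat file was touched recently. A minimal sketch of such a probe follows; the shell one-liner and the 2-minute freshness window are assumptions for illustration, not the actual implementation in our charts:

```yaml
# Hypothetical sketch of a heartbeat-style liveness probe: it only verifies that
# /app/tmp/celery_worker_heartbeat was modified recently, so it keeps passing as
# long as whatever touches the file keeps running, even if the worker no longer
# answers "celery inspect" or consumes tasks.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # "file exists and was modified within the last 2 minutes" (window is an assumption)
      - test -n "$(find /app/tmp/celery_worker_heartbeat -mmin -2)"
  initialDelaySeconds: 60
  periodSeconds: 60
```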
Second, the health check does not seem to honor long-running tasks, causing the pod to be restarted while it is still processing them. Avoiding exactly that was the main reason the health check was created, instead of simply using celery inspect ping -d celery@$HOSTNAME as described in https://medium.com/ambient-innovation/health-checks-for-celery-in-kubernetes-cf3274a3e106 .
We therefore now advise using the native celery ping with a high failure threshold / periodSeconds, so the celery worker has time to finish longer-running tasks; see the sketch below.
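A minimal sketch of what that advised probe could look like; the timing values are assumptions that should be tuned to the longest task you expect, and depending on the project the celery app (-A) argument may need to be added to the command:

```yaml
# Hypothetical sketch of the advised probe: use celery's own ping instead of a
# heartbeat file, with generous timing so a worker that is busy with a
# long-running task is not restarted prematurely.
livenessProbe:
  exec:
    command:
      - /bin/sh
      - -c
      # the -A/--app argument may be required depending on the project layout
      - celery inspect ping -d "celery@$HOSTNAME"
  initialDelaySeconds: 60   # give the worker time to boot (assumed value)
  periodSeconds: 120        # probe infrequently (assumed value)
  timeoutSeconds: 30        # allow a slow reply (assumed value)
  failureThreshold: 5       # restart only after repeated failures (assumed value)
```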