[improve][fn] Add API endpoint for function-worker for liveness check with configurable flag #23829
base: master
Conversation
@mukesh154 Please add the following content to your PR description and select a checkbox:
Although adding a liveness check could be useful for many reasons, it would be better to primarily address this issue. Are you able to isolate and reproduce the issue? What Pulsar version are you running? It looks like the leader election for the function worker uses a consumer with the setup shown in pulsar/pulsar-functions/worker/src/main/java/org/apache/pulsar/functions/worker/LeaderService.java, lines 76 to 87 (commit 3396065).
There might be a bug in the Pulsar implementation of notifying the active consumer change: lines 213 to 223 (commit 49aa308) and lines 238 to 241 (commit 49aa308).
At least by reading the code, it's hard to see how that could work.
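For context on the mechanism being discussed: the function worker's leader election relies on a failover subscription whose consumer registers a `ConsumerEventListener`, so whichever worker holds the active consumer acts as leader. A minimal sketch of that pattern is below (illustrative only, not the actual `LeaderService` code; the topic and subscription names are made up):

```java
import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.ConsumerEventListener;
import org.apache.pulsar.client.api.PulsarClient;
import org.apache.pulsar.client.api.PulsarClientException;
import org.apache.pulsar.client.api.SubscriptionType;

public class LeaderElectionSketch {
    public static void main(String[] args) throws PulsarClientException {
        PulsarClient client = PulsarClient.builder()
                .serviceUrl("pulsar://localhost:6650")
                .build();

        // Failover subscription: only one consumer is "active" at a time.
        // The listener callbacks are how a worker learns it became (or stopped being) leader.
        Consumer<byte[]> consumer = client.newConsumer()
                .topic("persistent://public/functions/coordinate")  // made-up topic name
                .subscriptionName("participants")                    // made-up subscription name
                .subscriptionType(SubscriptionType.Failover)
                .consumerEventListener(new ConsumerEventListener() {
                    @Override
                    public void becameActive(Consumer<?> c, int partitionId) {
                        // This worker now holds leadership.
                        System.out.println("Became leader for partition " + partitionId);
                    }

                    @Override
                    public void becameInactive(Consumer<?> c, int partitionId) {
                        // Another worker took over; step down from leadership.
                        System.out.println("Lost leadership for partition " + partitionId);
                    }
                })
                .subscribe();
    }
}
```

The bug being referenced concerns whether those `becameActive`/`becameInactive` notifications are reliably delivered when the active consumer changes.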
Now I can see it. There are multiple references to
How many function worker instances do you have when you encounter this problem?
Thanks for the feedback! I understand your point about addressing the primary issue first. Regarding the issue reproduction, I haven't been able to isolate it myself yet. I'm using Pulsar 3.0, so it could potentially be related to that version.
I have 2 function worker instances when I encounter the problem.
Motivation
This pull request introduces health check functionality for Kubernetes deployments, specifically adding a liveness probe for the function worker. A liveness probe is crucial for Kubernetes-based applications, enabling automated pod restarts in case of failure. This change ensures that the function worker recovers when a `ProducerFencedException` occurs, a condition that currently leaves the worker stuck with no way to recover. For instance, when a client makes a request like:

For `POST`, `PUT`, and `DELETE` operations, the following error is returned under heavy load:

And when the following error occurs in the function worker:

The function worker does not recover, leading to an ongoing failure. With this update, the worker will automatically restart upon encountering this error, with the help of a liveness-probe health check, ensuring proper recovery and continuity of operations.
Modifications
This update introduces an API endpoint to perform a liveness check on the function worker pod. The API returns an HTTP status of `200 (OK)` when the `isLive` flag within `FunctionImpl` is true. If the flag is false, typically after a `ProducerFencedException` occurs, the API returns `503 (Service Unavailable)`.
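A minimal sketch of such an endpoint is shown below. The class name, resource path, and `markNotLive` helper are illustrative assumptions, not necessarily the names introduced by this PR:

```java
import java.util.concurrent.atomic.AtomicBoolean;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.core.Response;

// Illustrative sketch only: WorkerLivenessResource, /worker/liveness, and markNotLive
// are assumed names, not necessarily those used in this PR.
@Path("/worker")
public class WorkerLivenessResource {

    // Flag the worker flips to false when it hits an unrecoverable state,
    // e.g. after a ProducerFencedException.
    private static final AtomicBoolean isLive = new AtomicBoolean(true);

    public static void markNotLive() {
        isLive.set(false);
    }

    @GET
    @Path("/liveness")
    public Response checkLiveness() {
        // 200 (OK) while healthy, 503 (Service Unavailable) once the flag is cleared,
        // which lets a Kubernetes liveness probe restart the pod.
        return isLive.get()
                ? Response.ok("OK").build()
                : Response.status(Response.Status.SERVICE_UNAVAILABLE).build();
    }
}
```

Returning a plain `503` keeps the check cheap for the kubelet to poll; per the PR title, the endpoint itself sits behind a configurable flag.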
The Kubernetes deployment configuration has been updated to use this new API endpoint in the liveness probe alongside the existing `metrics` endpoint, allowing the system to monitor the health and availability of the function worker.
Verifying this change
(Please pick either of the following options)
This change is a trivial rework / code cleanup without any test coverage.
Does this pull request potentially affect one of the following parts:

If `yes` was chosen, please highlight the changes
Documentation
Check the box below or label this PR directly (if you have committer privilege).
Need to update docs?

- [ ] `doc`
- [ ] `doc-required`
- [ ] `doc-not-needed`
- [ ] `doc-complete`