
[improve][fn] Add API endpoint for function-worker for liveness check with configurable flag #23829

Open · wants to merge 1 commit into master
Conversation

@mukesh154 (Contributor) commented on Jan 9, 2025

Motivation

This pull request introduces health-check functionality for Kubernetes deployments, specifically a liveness probe for the function worker. A liveness probe is essential for Kubernetes-based applications because it enables automated pod restarts on failure. This change ensures that the function worker recovers when a ProducerFencedException occurs, an error that otherwise leaves the worker stuck with no way to recover on its own.

For instance, when a client makes a request like:

curl --location --request PUT 'https://localhost:6651/admin/v3/functions/test/test/test' --header 'Authorization: Bearer <token>' --header '...' --form '[email protected]'

For POST, PUT, and DELETE operations, the following error is returned under heavy load:

{"reason":"Internal Error updating function at the leader"}

Currently, when the following error occurs in the function worker:

ERROR org.apache.pulsar.functions.worker.FunctionMetaDataManager - Could not write into Function Metadata topic
org.apache.pulsar.client.api.PulsarClientException$ProducerFencedException: Producer was fenced

The function worker does not recover, leading to an ongoing failure. With this update, the liveness check fails after this error, so Kubernetes restarts the worker automatically, ensuring proper recovery and continuity of operations.

Modifications

This update introduces an API endpoint that performs a liveness check on the function worker pod. The endpoint returns HTTP 200 (OK) when the isLive flag within FunctionImpl is true. If the flag is false, typically after a ProducerFencedException has occurred, the endpoint returns HTTP 503 (Service Unavailable).
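
As a rough sketch of what such an endpoint could look like (illustrative only, not this PR's actual code; the /worker/liveness path and the static flag wiring are assumptions, while the isLive semantics mirror the description above):

import java.util.concurrent.atomic.AtomicBoolean;

import javax.ws.rs.GET;
import javax.ws.rs.Path;
import javax.ws.rs.core.Response;

@Path("/worker")
public class WorkerLivenessResource {

    // Mirrors the isLive flag described above; a real implementation would
    // read the worker's actual state rather than a static field.
    private static final AtomicBoolean isLive = new AtomicBoolean(true);

    // Hypothetical hook, invoked when the metadata producer is fenced.
    public static void markNotLive() {
        isLive.set(false);
    }

    @GET
    @Path("/liveness")
    public Response checkLiveness() {
        if (isLive.get()) {
            return Response.ok("OK").build(); // HTTP 200
        }
        return Response.status(Response.Status.SERVICE_UNAVAILABLE) // HTTP 503
                .entity("Function worker is not live")
                .build();
    }
}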

The Kubernetes deployment configuration has been updated to use this new API endpoint in the liveness probe, alongside the existing metrics endpoint, allowing the system to monitor the health and availability of the function worker.
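
For context, a liveness probe pointing at such an endpoint might look like the following Kubernetes snippet (the path and port are illustrative assumptions, not values taken from this PR):

livenessProbe:
  httpGet:
    path: /worker/liveness   # hypothetical path from the sketch above
    port: 6750               # assumed worker HTTP port; adjust to your deployment
  initialDelaySeconds: 30
  periodSeconds: 10
  failureThreshold: 3        # pod is restarted after 3 consecutive failures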

Verifying this change

  • Make sure that the change passes the CI checks.


This change is a trivial rework / code cleanup without any test coverage.

Does this pull request potentially affect one of the following parts:


  • Dependencies (does it add or upgrade a dependency): (no)
  • The public API: (no)
  • The schema: (no)
  • The default values of configurations: (no)
  • The wire protocol: (no)
  • The rest endpoints: (no)
  • The admin cli options: (no)
  • Anything that affects deployment: (no)

Documentation

Check the box below or label this PR directly (if you have committer privilege).

Need to update docs?

  • doc
  • doc-required
  • doc-not-needed
  • doc-complete

github-actions bot commented on Jan 9, 2025

@mukesh154 Please add the following content to your PR description and select a checkbox:

- [ ] `doc` <!-- Your PR contains doc changes -->
- [ ] `doc-required` <!-- Your PR changes impact docs and you will update later -->
- [ ] `doc-not-needed` <!-- Your PR changes do not impact docs -->
- [ ] `doc-complete` <!-- Docs have been already added -->

github-actions bot added the doc-not-needed label and removed the doc-label-missing label on Jan 9, 2025
@lhotari (Member) commented on Jan 9, 2025

The function worker does not recover, leading to an ongoing failure. With this update, the liveness check fails after this error, so Kubernetes restarts the worker automatically, ensuring proper recovery and continuity of operations.

Although adding a liveness check could be useful for many reasons, it would be better to address the underlying issue directly. Are you able to isolate and reproduce the issue? What Pulsar version are you running?

It looks like the leader election for the function worker uses a consumer with a consumerEventListener to determine which worker is the leader:

// the leaders service is using a `coordination` topic for leader election.
// we don't produce any messages into this topic, we only use the `failover` subscription
// to elect an active consumer as the leader worker. The leader worker will be responsible
// for scheduling snapshots for FMT and doing task assignment.
consumer = (ConsumerImpl<byte[]>) pulsarClient.newConsumer()
        .topic(workerConfig.getClusterCoordinationTopic())
        .subscriptionName(COORDINATION_TOPIC_SUBSCRIPTION)
        .subscriptionType(SubscriptionType.Failover)
        .consumerEventListener(this)
        .property(WORKER_IDENTIFIER, consumerName)
        .consumerName(consumerName)
        .subscribe();
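
For readers unfamiliar with this mechanism: with a Failover subscription, the broker elects one active consumer and notifies each consumer's ConsumerEventListener through the becameActive/becameInactive callbacks. A minimal sketch of tracking leadership this way (the class and field names are illustrative, not Pulsar's actual worker code):

import org.apache.pulsar.client.api.Consumer;
import org.apache.pulsar.client.api.ConsumerEventListener;

// Illustrative listener: the broker invokes becameActive on the consumer it
// elects as active for the failover subscription, and becameInactive on the rest.
public class LeaderElectionListener implements ConsumerEventListener {

    private volatile boolean isLeader = false;

    @Override
    public void becameActive(Consumer<?> consumer, int partitionId) {
        // This worker's consumer was chosen as the active (leader) consumer.
        isLeader = true;
    }

    @Override
    public void becameInactive(Consumer<?> consumer, int partitionId) {
        // Another worker's consumer is now active; this worker is a follower.
        isLeader = false;
    }

    public boolean isLeader() {
        return isLeader;
    }
}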

There might be a bug in the Pulsar implementation of notifying the active consumer change:

if (!pickAndScheduleActiveConsumer()) {
    // the active consumer is not changed
    Consumer currentActiveConsumer = getActiveConsumer();
    if (null == currentActiveConsumer) {
        if (log.isDebugEnabled()) {
            log.debug("Current active consumer disappears while adding consumer {}", consumer);
        }
    } else {
        consumer.notifyActiveConsumerChange(currentActiveConsumer);
    }
}

if (closeFuture == null && !consumers.isEmpty()) {
    pickAndScheduleActiveConsumer();
    return;
}

At least by reading the code, it's hard to see how that could work.

@lhotari (Member) commented on Jan 9, 2025

At least by reading the code, it's hard to see how that could work.

Now I can see it. There are multiple references to notifyActiveConsumerChange in other code locations. It seems that the solution works, but the code is just hard to understand. I found #1818, which has some explanations.

@lhotari (Member) commented on Jan 9, 2025

How many function worker instances do you have when you encounter this problem?

@mukesh154 (Contributor, Author) commented:

Although adding a liveness check could be useful for many reasons, it would be better to address the underlying issue directly. Are you able to isolate and reproduce the issue? What Pulsar version are you running?

Thanks for the feedback! I understand your point about addressing the primary issue first. As for reproduction, I haven't been able to isolate the issue myself yet. I'm using Pulsar 3.0, so it could potentially be related to that version.

@mukesh154 (Contributor, Author) commented:

How many function worker instances do you have when you encounter this problem?

I have 2 function worker instances when I encounter the problem.

Labels: doc-not-needed