Describe the problem to be solved

Hello, thanks again for this great library. Some of our clusters have a large number of nodes (>1000), and leader election has become a significant portion of the requests made to the k8s API server. As a stopgap, we're considering tweaking the leader election parameters to reduce the number of lease calls made to the API server. Currently those values are hard-coded here.
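For context, here is a rough sketch of how those parameters map onto client-go's leaderelection package. This is not Spegel's actual code; the lease name, namespace, identity, and the specific durations are made up just to show the knobs involved:

```go
package main

import (
	"context"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Lease name, namespace, and identity below are placeholders.
	lock := &resourcelock.LeaseLock{
		LeaseMeta:  metav1.ObjectMeta{Name: "spegel-leader-election", Namespace: "spegel"},
		Client:     client.CoordinationV1(),
		LockConfig: resourcelock.ResourceLockConfig{Identity: "pod-or-node-name"},
	}

	// Longer durations mean fewer Lease get/update calls per pod, at the cost
	// of a slower failover when the current leader disappears.
	leaderelection.RunOrDie(context.Background(), leaderelection.LeaderElectionConfig{
		Lock:          lock,
		LeaseDuration: 120 * time.Second, // how long a lease is considered valid
		RenewDeadline: 60 * time.Second,  // how long the leader keeps trying to renew
		RetryPeriod:   30 * time.Second,  // how often followers retry acquisition
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: func(ctx context.Context) { /* publish bootstrap peer info */ },
			OnStoppedLeading: func() { /* step down */ },
		},
	})
}
```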
Proposed solution to the problem

A couple of questions/notes:

- Happy to make the PR; I see enough examples in the codebase to handle this.
- If we tweak the numbers to reduce the number of lease calls, we widen the window during which there is no leader. What happens during that period of no leadership? Existing nodes will have a potentially stale peer list, while new nodes will not be able to discover their peers?
- Another option we're considering is using the Kubernetes Endpoints API to discover peers (the equivalent of kubectl get endpoints spegel, but for port 5001), possibly circumventing the need for a leader to discover peers. I'm not clear whether this would cause weird behavior and would appreciate feedback there; a rough sketch follows this list.
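A sketch of that Endpoints-based idea with client-go, assuming the service is named spegel, lives in the spegel namespace, and exposes the relevant port on 5001 as in the kubectl example above:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

// discoverPeers is the equivalent of `kubectl get endpoints spegel`,
// filtered to port 5001. Note that newer clusters prefer EndpointSlices,
// but the Endpoints API still works for this purpose.
func discoverPeers(ctx context.Context, client kubernetes.Interface) ([]string, error) {
	ep, err := client.CoreV1().Endpoints("spegel").Get(ctx, "spegel", metav1.GetOptions{})
	if err != nil {
		return nil, err
	}
	var peers []string
	for _, subset := range ep.Subsets {
		for _, port := range subset.Ports {
			if port.Port != 5001 {
				continue
			}
			for _, addr := range subset.Addresses {
				peers = append(peers, fmt.Sprintf("%s:%d", addr.IP, port.Port))
			}
		}
	}
	return peers, nil
}

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	peers, err := discoverPeers(context.Background(), client)
	if err != nil {
		panic(err)
	}
	fmt.Println(peers)
}
```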
Another possibility is to split the leader election into a separate deployment from the worker daemonset:

- A small set of leader-election pods vies for the lease. The leader publishes its key in a ConfigMap (ideally a Service, but I noticed the lease holder ID is not a straightforward IP).
- The daemonset acts as a set of worker pods that never vie for leadership and instead watch the ConfigMap to get the leader key; a sketch of that watch follows this list.
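What the worker side could look like, assuming a hypothetical ConfigMap named spegel-leader in the spegel namespace with the leader's key stored under a leaderKey entry:

```go
package main

import (
	"context"
	"fmt"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/watch"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	// Watch only the ConfigMap the leader writes to (name and namespace are
	// hypothetical). Each worker keeps its view of the leader key up to date
	// without ever touching the Lease API.
	w, err := client.CoreV1().ConfigMaps("spegel").Watch(context.Background(), metav1.ListOptions{
		FieldSelector: "metadata.name=spegel-leader",
	})
	if err != nil {
		panic(err)
	}
	defer w.Stop()

	for event := range w.ResultChan() {
		if event.Type != watch.Added && event.Type != watch.Modified {
			continue
		}
		cm, ok := event.Object.(*corev1.ConfigMap)
		if !ok {
			continue
		}
		// "leaderKey" is a hypothetical data key holding the leader's peer ID.
		fmt.Println("current leader key:", cm.Data["leaderKey"])
	}
}
```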
This brings up an interesting point. Leader election was probably never built for applications running in a daemonset, especially in such a large cluster.
I do see how this is a problem for you. Using leader election is probably not the best long-term solution for Spegel. It was chosen as the best solution to two problems that had to be solved for bootstrapping. The first is that for a peer to connect to another, it needs that peer's ID, which includes a randomly generated public key. The second is that all peers need to agree on the same set of peer(s) to initially connect to. If a random peer were selected, a split cluster could in theory be created.

I am happy to discuss alternatives to solve this problem. The two main things any solution needs to provide are that the public key is shared and that the same peers are selected. One option is, for example, to choose the oldest Spegel instances, but that does not solve the public key sharing part.
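For illustration, a sketch of the oldest-instance idea using client-go. The namespace and label selector are assumptions, and as noted this only covers agreeing on a peer, not sharing its public key:

```go
package main

import (
	"context"
	"fmt"
	"sort"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func main() {
	cfg, err := rest.InClusterConfig()
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)

	pods, err := client.CoreV1().Pods("spegel").List(context.Background(), metav1.ListOptions{
		LabelSelector: "app.kubernetes.io/name=spegel",
	})
	if err != nil {
		panic(err)
	}
	if len(pods.Items) == 0 {
		panic("no spegel pods found")
	}

	// Every peer sorting by the same deterministic criterion agrees on the
	// same bootstrap candidate without a lease. Ties on the creation
	// timestamp could additionally be broken by pod name.
	sort.Slice(pods.Items, func(i, j int) bool {
		return pods.Items[i].CreationTimestamp.Before(&pods.Items[j].CreationTimestamp)
	})
	oldest := pods.Items[0]
	fmt.Printf("bootstrap peer candidate: %s (%s)\n", oldest.Name, oldest.Status.PodIP)
}
```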