High availability deployment for istio #399

Open
nishant-dash opened this issue Apr 2, 2024 · 3 comments
Labels
bug Something isn't working

Comments

@nishant-dash

Bug Description

Is there a way we can deploy Istio in a high availability setup for a single cluster?

Given just one Kubeflow cluster, does it make sense to have Istio be a DaemonSet, as proposed by @kimwnasptd?

I went through [0]; its model of high availability usually refers to multiple clusters using Istio, with multiple Istio control planes, so that the failure of a single control plane can be tolerated, for example.
However, [0] does not really say anything about availability in the context of a single cluster (which will contain a single Istio control plane).

[0] https://istio.io/latest/docs/ops/deployment/deployment-models/

To Reproduce

N/A

Environment

N/A

Relevant Log Output

N/A

Additional Context

N/A

nishant-dash added the bug label on Apr 2, 2024

Thank you for reporting your feedback to us!

The internal ticket has been created: https://warthogs.atlassian.net/browse/KF-5502.

This message was autogenerated

@kimwnasptd
Contributor

The above Istio docs describe a combination of situations (single/multi-cluster and single/multi-network). From an initial look, though, I can't tell how they suggest we configure Istio in each case.

I also see that at some point they describe running Istio's control plane in a separate cluster, but this would need a bit of investigation.

@kimwnasptd
Contributor

kimwnasptd commented May 2, 2024

We'll try to tackle this in steps, as discussed with @ca-scribner

The first step will be to ensure that the IngressGateway Pods have HA. This will ensure that if a Pod that handles the Gateway Istio CR goes down, the rest of Kubeflow can still be accessed.

The first approach we discussed was to:

  1. Keep having our ingressgateway charm create a Deployment
  2. In that Deployment, use affinity.podAntiAffinity
    https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#more-practical-use-cases

With the above we can increase the number of replicas and ensure that the Pods will not get scheduled on the same node.

(The extreme of this would be to convert the Deployment to a DaemonSet, which would create a Pod on every node.)
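
For illustration only, a DaemonSet variant could look roughly like the sketch below. This is a minimal sketch, not the charm's actual manifest; the name, namespace, labels, image, and ports are assumptions:

    apiVersion: apps/v1
    kind: DaemonSet
    metadata:
      name: istio-ingressgateway        # assumed name, for illustration only
      namespace: istio-system           # assumed namespace
    spec:
      selector:
        matchLabels:
          app: istio-ingressgateway
      template:
        metadata:
          labels:
            app: istio-ingressgateway
        spec:
          containers:
          - name: istio-proxy
            image: docker.io/istio/proxyv2:1.17.2   # assumed image/tag
            ports:
            - containerPort: 8080                   # assumed gateway ports
            - containerPort: 8443

Since a DaemonSet schedules one Pod per (eligible) node, no replica count or anti-affinity rules would be needed in that case.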

I tried configuring the Istio IngressGateway deployment like this (in an upstream KF) and indeed it worked as expected:

    spec:
      affinity:
        nodeAffinity: {}
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: app
                operator: In
                values:
                - istio-ingressgateway
            topologyKey: kubernetes.io/hostname   # required by the API; omitted in the original snippet, presumably kubernetes.io/hostname for per-node spreading

In a 2-node cluster, when I set the replicas to 2, the Pods indeed got scheduled on different nodes.
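
For reference, this check can be reproduced with plain kubectl, assuming the gateway Deployment is named istio-ingressgateway and lives in the istio-system namespace (adjust the name/namespace to your install):

    # Scale the gateway Deployment to 2 replicas (name/namespace are assumptions)
    kubectl -n istio-system scale deployment istio-ingressgateway --replicas=2

    # The NODE column should show a different node for each replica
    kubectl -n istio-system get pods -l app=istio-ingressgateway -o wide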

Then, when increasing the replicas to 3, it showed the expected error:

Events:
  Type     Reason            Age                  From               Message
  ----     ------            ----                 ----               -------
  Warning  FailedScheduling  2m39s (x3 over 12m)  default-scheduler  0/3 nodes are available: 1 node(s) had untolerated taint {node-role.kubernetes.io/control-plane: }, 2 node(s) didn't match pod anti-affinity rules. preemption: 0/3 nodes are available: 1 Preemption is not helpful for scheduling, 2 No preemption victims found for incoming pod..
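
As a side note (not something decided in this thread), if we ever want extra replicas to still be schedulable when there are more replicas than nodes, the anti-affinity could be expressed as a preference instead of a hard requirement. A rough sketch of that variant:

    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: app
                  operator: In
                  values:
                  - istio-ingressgateway
              topologyKey: kubernetes.io/hostname

With the preferred form the scheduler spreads Pods across nodes when it can, but still places a Pod on an already-used node rather than leaving it Pending.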
