
Milestone 3, Part 2


Problem Statement

  • A microservice architecture solves many problems on its own, but it is worth examining the additional capabilities a service mesh can offer. Our problem statement is threefold. The first phase involved studying the implementation of an open-source tool in our Kubernetes cluster and, more generally, using it to compare different service meshes.
  • The second phase addressed the issues pertaining to our application, most importantly isolating and analyzing problems through visual representation.
  • The third phase involved studying our application and applying chaos engineering to it. The intent was to learn how our API calls respond to failures and to find ways to make the entire system more reliable by injecting chaos into the Kubernetes cluster. In all, we researched how a service mesh can reduce the operational complexity of microservices and implemented several of its capabilities in our application.

Differences from the initial problem statement: our initial problem statement described the expected outcomes of deploying a service mesh on our application, but did not address the actual pressure points faced when running a microservice architecture, the research supporting how to fix them, or a comparative analysis of technologies. The initial project proposal was to compare service meshes and pick the best one; however, simply reading about their architectures would not solve our problem. We instead propose a solution that produces concrete statistics for your application by benchmarking the service meshes themselves. Although the project has not digressed from the initial plan, it now provides substantive insight into service mesh deployment, which bolsters the proposal's cause.

Phase 1 - Bench-marking service meshes

Hypothesis:

Several open-source tools can interactively report the capabilities a service mesh offers in terms of throughput, requests per second, and so on. Meshery is one of them, and we decided to use it to compare service meshes through observable statistics.

Implementation Methodology: Service Mesh

  1. Our Meshery dashboard, available at http://149.165.170.140:31986, can be configured to add multiple service meshes.
  2. In the Performance Test tab, choose a service mesh. To keep the results comparable, we considered only Istio and Linkerd.
  3. Add the endpoint to test; we used the curl command to add http://149.165.170.140:30030.
  4. Set the test duration in seconds.
  5. The Performance view then shows the results of applying the various service mesh architectures to our Kubernetes cluster.

Select the results you would like to compare and click the double-arrow icon to compare performance runs. Meshery's configuration validator can also assess your service mesh configuration against deployment and operational best practices.

By using this open-source software we were able to virtually deploy both Linkerd and Istio onto our cluster, gather statistics through Meshery, and generate performance histograms:

Linkerd

Istio

Analysis

We compared the performance of Linkerd and Istio on our application. On average, Istio produced less delay and lower latency, and performed better with an average maximum throughput of 0.0912 compared to Linkerd's 0.134. Meshery also allows you to download the statistics in XML format to parse the data and add it to any external graphs needed.

Learning Outcomes

  1. Meshery or similar software can successfully be used inside a cluster to derive important statistics about different types of service meshes.
  2. A live, application-specific comparison like this is much more concrete than simply weighing the broadly stated advantages and disadvantages of each service mesh.
  3. A service mesh can be stress-tested by increasing the number of hits, the requests per second, and the total time for which the application workload is hit.

Phase 2 - Istio installation, Ingress gateway and dashboard visualizations.

After the initial research through Meshery into the available service mesh technologies, we decided to go ahead with Istio. We installed Istio, which creates the istio-system namespace, and enabled sidecar injection. We then created an ingress gateway and a virtual service to access the application from an external source, since the cloud provider does not support a built-in load balancer.
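Enabling automatic sidecar injection amounts to labeling the application namespace. A minimal sketch, assuming the workloads run in the default namespace:

```yaml
# Label the application namespace so Istio injects the Envoy sidecar
# into every pod scheduled in it.
apiVersion: v1
kind: Namespace
metadata:
  name: default              # assumed application namespace
  labels:
    istio-injection: enabled
```

The same label can also be applied in place with `kubectl label namespace default istio-injection=enabled`.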

The application can be accessed via the ingress gateway at http://149.165.170.140:32214/
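A minimal sketch of the ingress gateway and virtual service pair described above; the resource names, hosts, and the front-end service name and port are illustrative assumptions rather than our exact manifests:

```yaml
apiVersion: networking.istio.io/v1alpha3
kind: Gateway
metadata:
  name: app-gateway            # hypothetical name
spec:
  selector:
    istio: ingressgateway      # bind to Istio's default ingress gateway pods
  servers:
  - port:
      number: 80
      name: http
      protocol: HTTP
    hosts:
    - "*"                      # accept traffic for any host
---
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: app-routes             # hypothetical name
spec:
  hosts:
  - "*"
  gateways:
  - app-gateway                # attach the routing rules to the gateway above
  http:
  - route:
    - destination:
        host: frontend         # hypothetical front-end service
        port:
          number: 80
```

Because the cloud provider has no load balancer, the istio-ingressgateway service itself is exposed as a NodePort, which is why the application is reached on a high port such as 32214.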

Kiali highlights the network and architectural topology. It helps identify and visualize how your application's services are connected, which makes it easier to isolate a point of failure. We visualized the different services and their flow, noting the inflow and outflow of HTTP traffic, as shown in the picture below; this helps us isolate errors as well as check the health of each of the microservices. Overall, Kiali offers a good overview of the whole application alongside the service mesh, letting us identify high-level issues, navigate between services, and drill down further to the root cause.

Kiali Dashboard is accessible on http://149.165.170.140:31987/kiali

Username: admin Password: admin

Dashboard

Prometheus provides a web-based interface for querying metric values. The screenshot below shows a latency query for the application.

Dashboard

Grafana provides insights for analyzing and visualizing metrics such as memory, CPU, disk, and I/O utilization over time, using the data gathered by Prometheus.

Dashboard

Learning Outcomes

  1. The main learning from this phase was enabling Istio and getting it to work with our application, specifically when the cloud provider did not support a load balancer.
  2. Enabling the ingress gateway to access the application from outside through a NodePort, and connecting the ingress gateway to the virtual service.
  3. Istio provides varied features such as traffic management, security, and policies. We focused on the observability aspect through dashboard visualization, gathering insights about points of failure from the metrics and plots.

Phase 3 - Chaos Engineering

Hypothesis:

We know that when a certain microservice fails, or the front end crashes, we can't access our website: our service is down. A microservice-based architecture can create uncertainty when resolving such an issue. Issues are bound to occur, but detecting the root cause is just as important, and without identifying the actual problem it is hard to resolve it. Hence, in order to anticipate what actually happens, we decided to go ahead and apply chaos engineering to our model. The Istio service mesh on your Kubernetes cluster(s) can give you more control and observability over network traffic, but it can also help you break things deliberately. Chaos engineering is a term coined at Netflix, and it can be boiled down to breaking your systems in production and designing solutions to remediate the side effects before things have a chance to break unexpectedly.

Chaos engineering is the practice of intentionally introducing faults and failures into your microservice architecture to test the resilience and stability of your system. Istio lets you inject errors at the HTTP layer instead of delaying packets or killing pods at the network layer.

Here we implemented the following virtual services. Each one serves its own purpose of adding stability to the application and creating a more reliable microservice-based structure. Using these three virtual services we tried to minimize the flaws we observed while working on the application; in particular, we observed that requests from the front end to the gateway were failing fairly often. To handle the issue we implemented three virtual services (a configuration sketch follows the list):

  1. Retry
  • Sometimes random problems cause a site to throw a 503 error and then the site works normally when you retry the connection moments later.
  • When an HTTP call arrives at the gateway and fails, the virtual service retries the same call up to 5 times, until we get response code 200 for a successful call.
  • With the retry virtual service applied to the gateway microservice, we observed more successful calls than before.
  • To quantify this: earlier, the login call typically needed a minimum of 3 tries to succeed. With this virtual service in place, the application is noticeably more stable.
  2. Delays
  • When request-responses to a downstream system get slower, we can handle such calls by running latency experiments and thereby increase the reliability of the system.
  • Here we added a latency of 7 seconds to 50% of the calls.
  • The main motivation of this implementation is to achieve more reliability by inserting a small amount of delay into the HTTP calls and observing how the system copes.
  3. Timeout Limits
  • What if, instead of forcing everything in our application to wait when one microservice is slow, we simply canceled the request after a certain amount of time and moved on?
  • In the real world, an application faces most failures due to timeouts, whether because of increased load on the application or some other latency in serving the request. Your application should have proper timeouts defined before declaring any request as "Failed".
  • Wait only for N seconds before failing and giving up.
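A minimal sketch of the three virtual services described above, assuming a backend service named gateway; the resource names, hosts, per-try timeout, and the timeout value are illustrative assumptions rather than our exact manifests. In practice, rules for the same host would usually be merged into a single VirtualService; they are kept separate here to mirror the list:

```yaml
# 1. Retry: retry failed calls to the gateway up to 5 times.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: gateway-retry
spec:
  hosts:
  - gateway                  # hypothetical service name
  http:
  - route:
    - destination:
        host: gateway
    retries:
      attempts: 5            # retry the same call up to 5 times
      perTryTimeout: 2s      # assumed per-attempt timeout
      retryOn: 5xx           # retry on server errors such as 503
---
# 2. Delays: inject a fixed 7-second delay into 50% of the calls.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: gateway-delay
spec:
  hosts:
  - gateway
  http:
  - fault:
      delay:
        percentage:
          value: 50          # affect half of the requests
        fixedDelay: 7s       # injected latency
    route:
    - destination:
        host: gateway
---
# 3. Timeout limits: give up on a slow call after N seconds instead of waiting.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: gateway-timeout
spec:
  hosts:
  - gateway
  http:
  - route:
    - destination:
        host: gateway
    timeout: 5s              # assumed value of N
```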

Learning Outcomes

  1. With the implementation of chaos engineering, we can increase the stability of the application.
  2. Fault injection is a testing method to introduce errors into your microservice architecture to ensure it can withstand the error conditions.

Extended Research on Security

Adding authorization and authentication is a big part of what Istio has to offer; however, given the size and use case of our application, security was not too big of a concern for our project. We still decided to go ahead and implement a few security features mentioned in the Istio documentation. We used the documentation to try out various authentication mechanisms to understand which could make our application more stable, and ran into various challenges associated with adding mTLS authorization.

mTLS

Istio shifts the burden of configuring security for each individual service away from developers. Istio supports mutual TLS, which validates the identity of both the client and the server services. We applied a strict policy to our Kubernetes cluster and observed that no requests were passed, since strict mode requires Istio-provisioned certificates for mutual TLS when connecting to upstream services; this caused our application to fail because we had not provisioned for it. An issue on the Istio GitHub repository inspired us to test how and why an HTTPS request fails on a TLS-configured service gateway. We tested it and realized that even though HTTPS is specified in the configuration files, our server fails with an SSL error.
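A minimal sketch of the kind of strict mTLS policy we applied, assuming Istio 1.5+ and a mesh-wide PeerAuthentication placed in istio-system:

```yaml
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system   # placing it here makes the policy mesh-wide
spec:
  mtls:
    mode: STRICT            # only mTLS connections with Istio-issued certificates are accepted
```

With STRICT mode in place, plain-text calls (including ones from workloads without a sidecar) are rejected, which matches the failures we observed before provisioning the application for mTLS.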

Authorization via namespace:

We used the default Kubernetes namespace and a duplicate namespace, and tested injecting authorization policies on both in order to allow and deny access to the dedicated namespaces. Thus, various security measures can be used to ensure that only authenticated users and/or namespaces are able to access the Istio service mesh and configure it.
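A minimal sketch of the kind of namespace-scoped authorization policy we experimented with; the policy and namespace names are illustrative assumptions:

```yaml
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: allow-from-default
  namespace: app-duplicate       # hypothetical duplicate namespace
spec:
  action: ALLOW
  rules:
  - from:
    - source:
        namespaces: ["default"]  # only workloads in the default namespace may call in
```

Switching action to DENY with the same source inverts the effect, blocking calls from that namespace instead.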

Research and References

Research Links

Team Member Contributions

Contribution by percentage 👍

  • Arjun Bhavsar: 33%
  • Ishita Kumar: 33%
  • Kasturi Nikharge: 33%

GitHub Issues

  • Arjun Bhavsar: 45,46,47,48
  • Ishita Kumar: 44, 50
  • Kasturi Nikharge: 42,43,49,51

GitHub Related Commits

  • Arjun Bhavsar: 426e795c10bcaeb178e053c8c9a27957ac676593, 431e5fcdc3514094607d3dd42c2ee5700d0766a5, aee6c9aa9454b82e723d97c683e250b1ef85bc2f, 78bc132446407482ca13d730d559bc748fe175e8, ec334d3fd055d5a74b036b95ee876aa3528512a1

  • Ishita Kumar: 0e11667fe8ebdb35e10eb0c3e94368c457489ff0, 43c3a3089424b200fbc7bff30b8e119115912926, acb2d1f17f3eba6e606a12bbf08beb7914f11897, b31e1e94663423e947f1d946dad35354a5292816

  • Kasturi Nikharge: 6910e0057c24b0bc9b7af1811c3ca06611b4707a, 1e77d0cdbe359ef895f059471d10e809146e6c88