-
Notifications
You must be signed in to change notification settings - Fork 0
Project Description
This project will implement a load balancing system in an Internet of Things (IoT) environment. We have millions of devices with modest processing capabilities. These devices, the clients, maintain a persistent TCP connection with "the cloud", a cluster of interconnected servers in a datacenter. Devices can restart, crash, lose power, etc. The servers are implemented in Erlang.
There are many types of clients. For example, a client could be a wireless router, a temperature sensor or a mobile phone application where a user issues commands to administer one of these devices. Clients are assigned to an organization. Clients communicate with other clients in the same organization.
Send clients in the same organization to the same server.
We want to minimize inter-node communication in the cluster. While it is possible for two separate nodes in the cluster to communicate, there is a cost associated with it. It's always cheaper to call a local method than a remote method. Additionally, if all (or most) of the machines in an organization connect to the same machine, we can cache information about their state locally.
At first glance, the problem seems pretty simple. Let's assume a device comes on line and connects to the cluster for the first time. Assume that the device knows which group it belongs to and includes the group name in the request.
Client1: Hi, I'm device "1" and I belong to group "Banana"
Server: I haven't your group before, so group "Banana" is now assigned to server "A".
Client1 connects to server "A"
Client2: Hi, I'm device "2" and I belong to group "Banana"
Server: The group Banana is assigned to server "A"
Client 2 connects to server "A"
What if two devices connect at the same time? How do we make sure a group doesn't get assigned to two separate servers? A common solution would be to use the transactional capabilities of a relational database. However, a relational database is a heavyweight solution to route a request. If the database goes down, nobody can connect. This may be a good case for eventual consistency. It is probably fine if the devices in the same organization eventually end up on the same server.
What if server "A" crashes? There has to be a way to detect when a server crashes and redirect clients in a group to a different server while keeping them in the same group.
What if server "A" is using 98% of available RAM? The load balancer needs to take into account system resources when deciding where to send clients.
Let's say a new server is added to the cluster. How do we move some groups off of existing servers onto the new server to evenly distribute load? Rebalancing can occur periodically, when a server is added or when a server crashes.
Where does the load balancer sit? Is every node in the cluster capable of acting as a load balancer? Or, do clients connect to dedicated load balancer servers and then redirect to a server? How do we prevent the load balancer from becoming a single point of failure?
How do we abstract the load balancing system from different protocols? Ideally, the system should work with devices connected over Websockets or MQTT. Because Websockets are just a pipe, you'd want to know if the clients are speaking JSON-RPC, MQTT or a custom protocol over the Websockets. These protocols don't include a redirect command within their standards. So, you'll need a way to abstract the redirect command. For example, you could pass in a callback function that would send the correct redirect command that the client understands.
You cannot always communicate with other nodes in the network. Net splits are a fact of life in distributed computing. Let's say the load balancer cannot reach server "A" in the cluster, but a client can. The load balancer may redirect clients in the group to a new server, while other members of the group remain on server "A".
It seems like we'll need to periodically scan the cluster, identify groups that may be split across machines and rebalance. Executing the functionality at a fixed interval (let's say 5 minutes for now). Which server keeps track of the elapsed time for rebalancing? What if it crashes? Maybe any machine in the cluster should be capable of stepping up and serving the role to eliminate a single point of failure. This is a good place for a leader election algorithm. The most recent discussion of leader election in Erlang I have seen recently is here,