We need to know what we're aiming for, and users need to know what they can expect from using RIG in their infrastructure. Given new requirements or change requests, it should then be easier to decide whether or not they match the goals of the project.
The subject of this issue is to formulate this vision, but also to track the progress of related tasks (spoiler: we're not quite there yet).
The tasks are defined inline in the following vision statement, which will become a document on the website eventually. Feedback welcome.
Update README with the vision statement
Update repository tagline with the vision statement
Update the website with the text below
Reactive Interaction Gateway a.k.a. RIG
RIG enables event-driven frontends in a secure, reliable and scalable way.
What does that mean, exactly? Well, let's break it down.
Event-driven frontends
Frontends react to what happens on the backend. Upon receiving business events, they can change their state, adapt their UI or use HTTP requests to fetch new data.
This is not just Server-Sent Events versus polling for data: it also decouples the frontends from the backends, as frontends describe the kind of events they're interested in but don't need to know where and when to fetch them.
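To make that decoupling concrete, here is a minimal sketch of subscription matching by event type. The module name and event shape are illustrative only, not RIG's actual API; the event follows the CloudEvents convention of carrying a `type` field.

```elixir
# Illustrative only -- not RIG's actual API. A subscription names an event
# type; the gateway matches incoming (CloudEvents-style) events against it.
# Repeating `type` in both patterns forces the two values to be equal.
defmodule SubscriptionMatch do
  def matches?(%{event_type: type}, %{"type" => type}), do: true
  def matches?(_subscription, _event), do: false
end

subscription = %{event_type: "com.example.account.credited"}
event = %{"type" => "com.example.account.credited", "data" => %{"amount" => 100}}

SubscriptionMatch.matches?(subscription, event)
# => true -- the frontend never asked *which* service produced the event
```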
Secure
RIG employs process supervisors, which makes it very unlikely that a node fails as a whole. For example, a fault in the Kafka connection might cause the Kafka subsystem to fail, but the rest of the system would be unaffected. The respective supervisor would restart the Kafka process automatically; as soon as the connection is re-established, the node is fully operational again.
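A minimal sketch of that supervision idea, with hypothetical module names (this is not RIG's actual supervision tree): under the :one_for_one strategy, a crashed child is restarted on its own while its siblings keep running.

```elixir
# Hypothetical sketch, not RIG's actual supervision tree. With
# :one_for_one, a crashing Kafka consumer is restarted independently;
# the connection registry next to it is unaffected.
defmodule Gateway.Supervisor do
  use Supervisor

  def start_link(opts), do: Supervisor.start_link(__MODULE__, :ok, opts)

  @impl true
  def init(:ok) do
    children = [
      Gateway.KafkaConsumer,      # may crash and be restarted on its own
      Gateway.ConnectionRegistry  # keeps running through Kafka failures
    ]

    Supervisor.init(children, strategy: :one_for_one)
  end
end
```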
Running on the BEAM (the Erlang virtual machine), RIG comes with tooling for introspection, code updates and instrumentation (e.g., tracing) at runtime, in production.
Describe, and link to, the tooling mentioned here.
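As a first taste of that tooling, here are a few stock Erlang/Elixir introspection calls that work on any BEAM node, e.g. from an IEx shell attached to a running instance; none of them are RIG-specific.

```elixir
# Standard BEAM introspection, runnable from an IEx shell attached to a
# live node (e.g. via `iex --remsh`). These are stock Erlang/Elixir calls.
length(Process.list())                    # number of live processes
:erlang.system_info(:process_count)       # same count, via system_info
:erlang.memory(:total)                    # total VM memory usage, in bytes
Process.info(self(), :message_queue_len)  # mailbox size of a given process
:observer.start()                         # GUI for processes, ETS tables, apps
```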
Verbose logging and a Prometheus metrics endpoint also help with monitoring a node in production.
Describe how to change log levels (using env vars and at runtime).
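For runtime changes, Elixir's standard Logger can be reconfigured on a live node; at boot, reading an environment variable is the usual pattern. The LOG_LEVEL variable name below is an assumption for illustration, not necessarily the name RIG uses.

```elixir
# At runtime, from an IEx shell attached to the node: Logger.configure/1
# is part of Elixir's standard Logger and takes effect immediately.
Logger.configure(level: :debug)
```

```elixir
# At boot, a common pattern is to read an environment variable in
# config/runtime.exs. LOG_LEVEL is a hypothetical variable name here.
config :logger,
  level: String.to_existing_atom(System.get_env("LOG_LEVEL", "info"))
```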
To limit a cluster's resource consumption, an administrator can configure a number of soft limits.
Plugins are sandboxed.
Reliable
Multiple RIG nodes form a self-healing cluster, as described in the following two scenarios.
Scenario 1: A RIG node goes offline and leaves the cluster
After a node has failed, the frontends that were connected to that node reconnect to another node. They will be notified upon reconnect whether their subscriptions are still available and whether events have been lost (that depends on whether a frontend's session was hosted on the failed node). If subscriptions need to be set up again or missed data has to be requested, the frontend can act accordingly.
Any Kafka or Kinesis partitions the node had handled before its failure are reassigned to other nodes in the cluster automatically. As a consequence of how Kafka/Kinesis consumer groups work, messages may be delivered more than once.
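This means event consumers should be prepared to see duplicates. Below is a minimal deduplication sketch, assuming each event carries a unique "id" field (as CloudEvents do); the module is illustrative, not part of RIG, and in practice the set of seen IDs would be bounded, e.g. with a TTL.

```elixir
# Sketch of consumer-side deduplication under at-least-once delivery.
# Assumes each event carries a unique "id"; illustrative, not RIG code.
defmodule Dedup do
  def new, do: MapSet.new()

  # Returns {:duplicate, seen} or {:fresh, updated_seen},
  # so redelivered events can be dropped.
  def check(seen, %{"id" => id}) do
    if MapSet.member?(seen, id) do
      {:duplicate, seen}
    else
      {:fresh, MapSet.put(seen, id)}
    end
  end
end
```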
Scenario 2: A RIG node comes back online and (re-)joins the cluster
When a node (re-)joins the cluster, it replicates the subscriptions table as well as the JWT blacklist from the other nodes. It may then be assigned Kafka or Kinesis partitions and starts accepting frontend connections.
Scalable
RIG is designed to be scalable with respect to two key metrics: the number of connected frontends, and the number of backend events it can process without introducing a noticeable* delay in transmission.
*For our tests we assume 1 second as the highest tolerable delay between RIG's consumption of a message from Kafka and the reception of that message on the frontend via an established SSE connection.
Scalability with respect to the number of connected frontends
For handling large numbers of frontend connections per node, we rely on the light-weight processes ("actors") the BEAM, the Erlang runtime, provides. Each of these processes has its own lifecycle (and even its own garbage collector) and is scheduled by the BEAM across all available CPU cores. They are memory-efficient and designed to be created and destroyed very quickly, which makes them ideal for the long-lived TCP connections RIG maintains: each connection gets the isolation and runtime efficiency of its own process.
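To get a feel for how cheap these processes are, the following illustration (not RIG code) spawns 100,000 idle "connection" processes and times the operation; on typical hardware this completes in a fraction of a second.

```elixir
# Illustration of BEAM process cost: spawn 100,000 processes, each
# idling like a held-open connection, and measure the elapsed time.
{micros, pids} =
  :timer.tc(fn ->
    for _ <- 1..100_000 do
      spawn(fn ->
        receive do
          :close -> :ok  # each process waits until told to close
        end
      end)
    end
  end)

IO.puts("spawned #{length(pids)} processes in #{div(micros, 1000)} ms")
Enum.each(pids, &send(&1, :close))
```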
Benchmarks
Scalability with respect to the number of processed backend events
Designed as the event-handling backbone for "everything UI" in a microservice architecture, RIG must be able to process large amounts of data. The trick: RIG immediately discards events that no users are subscribed to.
For example, in a banking system, the backend might generate a million events per second that need to be consumed by RIG. At the same time, only a few thousand users are online at any point in time, which allows RIG to drop most of the messages right at the consuming node.
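A sketch of that early-discard logic (illustrative only; RIG's actual lookup against the subscriptions table differs in detail): if the lookup yields no subscribers for an event's type, the event is dropped immediately at the consuming node. The `subscribers_fun` argument stands in for that lookup.

```elixir
# Sketch of dropping events nobody listens to, right at the consumer.
# Illustrative, not RIG code: `subscribers_fun` stands in for a lookup
# against the replicated subscriptions table.
defmodule Ingest do
  def handle(event, subscribers_fun) do
    case subscribers_fun.(event["type"]) do
      [] ->
        :drop  # no subscribers: discard immediately, no further work

      pids ->
        Enum.each(pids, &send(&1, {:event, event}))
    end
  end
end
```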
Benchmarks