_posts/2015-09-15-streaming-systems.md

There are three key ways in which Spark Streaming differs from previous approaches:

* Previous approaches have primarily employed a record-at-a-time model, where each node continuously receives records, updates internal state, and sends new records. In contrast, in Spark Streaming, all records that arrive within a time interval are stored reliably as an RDD, and processed via deterministic parallel operations to produce system output or input to the next interval (see the sketch after this list).

* As a consequence of a record-at-a-time approach, all nodes need to maintain complex state; this makes recovery challenging
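
To make the micro-batch model concrete, here is a minimal sketch using the PySpark Streaming (DStream) API: the records arriving in each interval become one RDD, and the same deterministic transformations are applied to every batch. The socket source, host, port, and the two-second batch interval are illustrative assumptions, not details taken from the paper.

```python
# Minimal PySpark Streaming sketch (assumes the classic DStream API, Spark 1.x/2.x).
from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext(appName="MicroBatchSketch")
ssc = StreamingContext(sc, batchDuration=2)  # records arriving in each 2 s interval form one RDD

# Hypothetical source: lines of text arriving on a local socket.
lines = ssc.socketTextStream("localhost", 9999)

# Deterministic parallel operations applied to each interval's RDD.
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))

counts.pprint()  # per-interval output; it could equally become input to the next stage

ssc.start()
ssc.awaitTermination()
```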
_Is the Problem Real?_

The problem addressed in this paper is the design of a stream processing platform that provides the following properties:

* Exactly-once guarantees
* Low latency
* High throughput
* Expressive computation model
* Low overhead of fault-tolerance in the absence of failures
* Flow control, i.e., the ability to tolerate straggling operators

The problem itself is quite similar to the one addressed in the Spark Streaming paper, and hence, the discussion of how real the problem is applies here as well.

_What is the solution's main idea?_

The solution's key idea is to exploit distributed snapshots based on the Chandy-Lamport algorithm, guaranteeing "exactly-once" semantics. In particular, Flink periodically creates snapshots of _all_ the state in the streaming computation. The snapshots are stored on persistent storage, and recovery proceeds by reading the latest snapshot and reconstructing the stream.
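
The snapshotting idea can be illustrated with a toy version of the barrier mechanism: sources inject numbered barriers into the stream, and an operator snapshots its state once it has seen the barrier on all of its inputs. This is a simplified, self-contained sketch of the Chandy-Lamport-style approach, not Flink's actual API or internals; the operator, its state, and the in-memory "snapshot store" are hypothetical.

```python
# Toy barrier-based snapshotting (illustrative only; not Flink's API or internals).
# Simplification: records arriving on a channel after its barrier are not buffered here,
# whereas real barrier alignment would hold them back until the snapshot is taken.
from collections import defaultdict

BARRIER = "BARRIER"  # marker injected periodically by the sources

class CountingOperator:
    """Counts words from several input channels; snapshots state on aligned barriers."""

    def __init__(self, num_inputs=2):
        self.state = defaultdict(int)   # operator state: word -> count
        self.barriers_seen = set()      # channels whose barrier has already arrived
        self.num_inputs = num_inputs
        self.snapshots = {}             # snapshot_id -> copy of state ("persistent storage")

    def on_element(self, channel, element, snapshot_id=None):
        if element == BARRIER:
            self.barriers_seen.add(channel)
            # Barrier alignment: snapshot only once the barrier arrived on every input.
            if len(self.barriers_seen) == self.num_inputs:
                self.snapshots[snapshot_id] = dict(self.state)
                self.barriers_seen.clear()
        else:
            self.state[element] += 1    # normal record processing

    def recover(self, snapshot_id):
        # Recovery: reload the latest completed snapshot and resume from there.
        self.state = defaultdict(int, self.snapshots[snapshot_id])
        self.barriers_seen.clear()

op = CountingOperator()
for ch, elem in [(0, "a"), (1, "b"), (0, BARRIER), (1, "a"), (1, BARRIER)]:
    op.on_element(ch, elem, snapshot_id=1)
print(op.snapshots[1])  # {'a': 2, 'b': 1} -- state captured at barrier 1
```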
_Why is the solution different from previous work?_

Micro-batching suffers from the following issues:

* Micro-batch systems employ a modified programming model, which limits expressiveness; for example, users cannot compute on window data in periods other than multiples of the checkpoint interval. Moreover, there is no support for count-based or session windows, which are popular among many applications.
* It is hard for them to deal with backpressure.
* The micro-batch latency limits applications that require sub-second latencies.

When compared with Google's Cloud Dataflow model, Flink uses distributed snapshots while Cloud Dataflow uses distributed transactions. In distributed transactions, record deliveries are logged along with the updates to the state. After a failure, both state and record deliveries are replayed from the log, making recovery expensive. Distributed snapshots overcome this issue by letting the periodicity of snapshots determine the time spent in recovery.

_Does the paper identify any fundamental/hard trade-offs?_

The key trade-off lies in the fault-tolerance scheme employed, which plays an important role in the end-to-end design of such a system. With a snapshot-based scheme, the trade-off is essentially between recovery time and the overhead of periodic snapshot creation. If snapshots are taken too often, and the operators have a lot of state, then data processing performance (latency, throughput) degrades; on the other hand, if snapshots are taken too rarely, then a significant amount of time is spent recovering from a failure.
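
A back-of-envelope calculation makes this trade-off concrete. The numbers below (snapshot cost, checkpoint interval, ingest rate) are purely illustrative assumptions, not measurements from the paper: steady-state overhead grows as the interval shrinks, while the expected recovery work grows as the interval widens.

```python
# Illustrative trade-off between checkpoint interval, overhead, and recovery work.
# All numbers are made up for the sake of the example.
snapshot_cost_s = 2.0        # time to write one snapshot of operator state
records_per_s = 100_000      # ingest rate
for interval_s in (5, 30, 120):
    overhead = snapshot_cost_s / interval_s       # fraction of time spent snapshotting
    # On failure we lose, on average, half an interval of progress and must reprocess it.
    replay_records = records_per_s * interval_s / 2
    print(f"interval={interval_s:>4}s  overhead={overhead:5.1%}  "
          f"avg records to replay={replay_records:,.0f}")
```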
_Will the paper be influential in 10 years?_

The key contribution in Apache Flink is its choice of fault-tolerance mechanism for stream processing, in that it does not constrain the system's ability to achieve the other requirements of a stream processing framework. The fault-tolerance mechanism itself is a simple adaptation of the Chandy-Lamport algorithm. While the identified requirements for a stream processing system are interesting (although not new -- the Dataflow model paper had already identified a number of them), the choice of fault-tolerance scheme itself isn't fundamental to the design of a stream processing system (unless it employs a radically different data processing framework such as micro-batch processing). Since the system is still quite new, it is hard to predict how impactful it will be in industry. It is unclear if it will be academically influential in the next 10 years.

## The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

_Is the Problem Real?_

The problem addressed in this paper is the correct modeling of data processing for unbounded, unordered, large-scale datasets with sophisticated requirements, such as low-latency execution, event-time ordering, windowing, etc. A large number of systems have evolved over time to cater to these widely contrasting requirements, typically achieving different degrees of correctness, latency, and cost. Correctly modeling the semantics of a system that can cater to such problems could provide the ability to cater to all of these requirements, while the degree of correctness, latency, and cost would still be governed by the underlying implementation. In this respect, I feel this is an interesting and real problem to solve.

_What is the solution's main idea?_

The key aspect of this paper isn't the system design itself, but the real-world experiences and the set of principles that governed the system's design:

* Systems dealing with unbounded datasets shouldn't rely on any notion of completeness.
* It is important to be flexible in terms of the set of operations that can be supported by the system.
* The interface for using the system to analyze unbounded datasets should be simple.
* The system should support robust analysis of data in the context in which it occurred.

In particular, the Dataflow model provides a way to compute "event-time" ordered results, windowed by features of the unbounded data, with tunable correctness, latency, and cost across a spectrum of combinations. This is made possible by decomposing the pipeline across the following dimensions:

- What results are being computed
- Where in event time they are being computed
- When in processing time they are materialized
- How earlier results relate to their later refinements

and also by separating the logical data processing from the underlying implementation (e.g., batch, micro-batch, or stream processing). Different degrees of correctness, latency, and cost are left to the choice of the underlying implementation.
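
As a concrete illustration of the what/where/when/how decomposition, here is a small windowing-and-triggering sketch in the Apache Beam Python SDK, which grew out of the Dataflow model; the in-memory source, the 60-second windows, and the early-firing trigger are assumptions chosen for the example rather than anything prescribed by the paper.

```python
# Sketch of the what/where/when/how decomposition using the Apache Beam Python SDK
# (Beam implements the Dataflow model; source, window size, and trigger are illustrative).
import apache_beam as beam
from apache_beam import window
from apache_beam.transforms.trigger import (AfterWatermark, AfterProcessingTime,
                                             AccumulationMode)

with beam.Pipeline() as p:
    scores = (
        p
        | "Read" >> beam.Create([("user1", 5), ("user2", 3), ("user1", 7)])  # stand-in source
        # Where in event time: fixed 60-second event-time windows.
        # When in processing time: fire early every 30 s, then again at the watermark.
        # How refinements relate: accumulate, so later firings refine earlier ones.
        | "Window" >> beam.WindowInto(
            window.FixedWindows(60),
            trigger=AfterWatermark(early=AfterProcessingTime(30)),
            accumulation_mode=AccumulationMode.ACCUMULATING)
        # What is being computed: a per-key sum.
        | "Sum" >> beam.CombinePerKey(sum)
        | "Print" >> beam.Map(print))
```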
_Why is the solution different from previous work?_

The authors provide a detailed description of existing systems that cater to analyzing unbounded and bounded datasets. The key observation is that while different existing systems achieve some degree of correctness, low latency, low cost, flexibility, and simplicity, they often sacrifice a number of these aspects that could be attained together by employing a different model for handling unbounded data. Moreover, these systems are too closely tied to the underlying physical implementation to achieve the clarity, composability, and flexibility expected from them. The Dataflow model addresses these shortcomings. However, the limitations of the underlying implementation are still left in place; what the Dataflow model facilitates is a common platform that can express parallel computation independently of the underlying implementation, without sacrificing the expressiveness of the model itself.

_Does the paper identify any fundamental/hard trade-offs?_

The paper identifies three key metrics in data processing of unbounded datasets that are always in tension, namely, correctness, latency, and cost. Existing systems also trade off flexibility and simplicity along with the metrics above; however, that is simply an artifact of a strong coupling between the data processing model and the underlying implementation. By separating the model from the implementation, the authors show that the model can achieve the required flexibility, simplicity, and composability for processing such datasets; the trade-off is reduced to that between correctness, latency, and cost, which depends on the selected framework (batch, micro-batch, or record-at-a-time streaming).

_Will the paper be influential in 10 years?_

The paper is motivated by strong real-world applications. The Dataflow model provides expressiveness sufficient to cater to a large number of such applications, and shows promise in being able to support new applications as well. As such, since there are no constraints on the underlying implementation for adopting the Dataflow model itself, I feel this paper will be influential both in academia and in industry for designing processing engines for unbounded datasets.
## Omega: flexible, scalable schedulers for large compute clusters

_Is the Problem Real?_

The problem addressed in this paper is cluster scheduling in a large-scale environment (scalability), with rapidly changing requirements (flexibility). The target is high utilization and efficiency in shared, large-scale compute clusters running a mix of:

* CPU-intensive and memory-intensive jobs
* batch and low-latency jobs
* user-facing and infrastructure services

As clusters and workloads keep growing, and since the scheduler's workload is roughly proportional to the cluster size, the scheduler becomes a scalability bottleneck.

The authors state that Omega's design is driven by Google's real-life production workloads. In particular:

> "Google's production job scheduler has experienced all of this. Over the years, it has evolved into a
> complicated, sophisticated system that is hard to change."

_What is the solution's main idea?_

The authors employ a parallel scheduler architecture with shared state, using lock-free optimistic concurrency control, to achieve both implementation extensibility and performance scalability.
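
The shared-state, optimistic-concurrency idea can be sketched as follows: each scheduler plans against its own private copy of the cluster state ("cell state") and then tries to commit its resource claims atomically, retrying when another scheduler got there first. This is a toy, single-process illustration of the concept; the class names, the per-machine version counters, and the conflict rule are hypothetical and much simpler than Omega's actual resilient master copy of the cell state.

```python
# Toy shared-state scheduling with optimistic concurrency control (illustrative only).
import copy

class CellState:
    """Shared master copy of the cluster: free CPUs and a version number per machine."""
    def __init__(self, machines):
        self.free_cpus = dict(machines)            # machine -> free CPUs
        self.versions = {m: 0 for m in machines}   # machine -> version counter

    def snapshot(self):
        return copy.deepcopy(self)                 # each scheduler plans on its own copy

    def try_commit(self, claims, base):
        """Atomically apply {machine: cpus} claims planned against `base`; fail on conflict."""
        for m, cpus in claims.items():
            if self.versions[m] != base.versions[m]:   # someone else changed this machine
                return False
            if self.free_cpus[m] < cpus:
                return False
        for m, cpus in claims.items():                 # no conflicts: apply and bump versions
            self.free_cpus[m] -= cpus
            self.versions[m] += 1
        return True

def schedule(cell, machine, cpus, max_retries=5):
    """A scheduler framework: plan on a private snapshot, then optimistically commit."""
    for _ in range(max_retries):
        local = cell.snapshot()
        if local.free_cpus[machine] >= cpus and cell.try_commit({machine: cpus}, local):
            return True
    return False                                       # wasted work: retries exhausted

cell = CellState({"m1": 8, "m2": 16})
print(schedule(cell, "m1", 4), cell.free_cpus)         # True {'m1': 4, 'm2': 16}
```

The failed-commit path in `try_commit` is exactly where the wasted work discussed below shows up: the planning effort behind a conflicting claim is thrown away and redone.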
_Why is the solution different from previous work?_

The authors categorize previous work in cluster scheduling as:

* _Monolithic schedulers_ — a single, centralized scheduling algorithm for all jobs (Google's existing scheduler).
* _Two-level schedulers_ — a single, active resource manager that offers compute resources to multiple parallel, independent "scheduler frameworks" (Mesos, Hadoop-on-Demand).

The drawbacks of previous work included:

Monolithic schedulers:
* harder to add new policies
* may not scale up to large cluster sizes

Two-level schedulers:
* conservative resource visibility and locking algorithms limit flexibility and parallelism
* hard to schedule "picky" jobs
* hard to make decisions that require access to the state of the entire cluster

Omega overcomes these by using a parallel scheduler architecture with shared state, using lock-free optimistic concurrency control, to achieve both implementation extensibility and performance scalability.

_Does the paper identify any fundamental/hard trade-offs?_

The primary trade-off explored in this work is between the degree of parallelism in scheduling cluster resources enabled by shared-state, optimistic concurrency control and the amount of wasted work when the optimistic concurrency assumptions fail.

_Will the paper be influential in 10 years?_

Omega has already had significant impact, since it is the cluster management system used in Google's production systems. I believe the taxonomy of the option space for cluster scheduler development, together with the idea of shared state for scalable and flexible cluster scheduling, will have a long-lasting impact on resource scheduling research.

## Sparrow: Distributed, Low Latency Scheduling

_Is the Problem Real?_

_What is the solution's main idea?_

_Why is the solution different from previous work?_

_Does the paper identify any fundamental/hard trade-offs?_