Commit 8852d37 (committed Oct 2, 2015)
Updated posts
1 parent 5ba919c
5 files changed: +287 −2
‎_posts/2015-09-15-streaming-systems.md

+95 −2
@@ -28,7 +28,7 @@ There are three key ways in which Spark streaming differs from previous approach

* Previous approaches have primarily employed a record-at-a-time model, where each node continuously receives records,
updates internal state, and sends new records. In contrast, in Spark Streaming, all records that arrive within a time interval
are stored reliably as an RDD, and processed via deterministic parallel operations to produce system output or input to the next
interval.

* As a consequence of a record-at-a-time approach, all nodes need to maintain complex state; this makes recovery challenging
@@ -59,22 +59,115 @@ advantages, which work in favor of Spark Streaming's adoption.

_Is the Problem Real?_

The problem addressed in this paper is the design of a stream processing platform that provides the following properties:

* Exactly-once guarantees
* Low latency
* High throughput
* Expressive computation model
* Low overhead of fault tolerance in the absence of failures
* Flow control, i.e., the ability to tolerate straggling operators.

The problem itself is quite similar to the one addressed in the Spark Streaming paper, and hence, the discussion of
whether the problem is real is quite similar.

_What is the solution's main idea?_

The solution's key idea is to exploit distributed snapshots based on the Chandy-Lamport algorithm, guaranteeing "exactly-once"
semantics. In particular, Flink periodically creates snapshots of _all_ the states the streaming computation is in.
The snapshots are stored on persistent storage, and recovery proceeds by reading the latest snapshot and
reconstructing the stream.
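
Below is a minimal, illustrative sketch of the barrier-based snapshotting idea (in the spirit of Chandy-Lamport, as Flink adapts it); the class and method names here are invented for illustration and are not Flink's actual API:

```python
# Sketch: an operator that persists its state when a barrier marker
# flows past, so that all operator snapshots tagged with the same id
# form a consistent global snapshot. Names are illustrative only.
import copy

class Barrier:
    """Marker injected into the stream by the sources; carries a snapshot id."""
    def __init__(self, snapshot_id):
        self.snapshot_id = snapshot_id

class Operator:
    def __init__(self, name, snapshot_store):
        self.name = name
        self.state = {}                     # operator-local state (e.g., counts)
        self.snapshot_store = snapshot_store

    def process(self, record, emit):
        if isinstance(record, Barrier):
            # Persist a copy of the current state for this snapshot id,
            # then forward the barrier so downstream operators snapshot too.
            # (With multiple inputs, barriers must first be aligned.)
            self.snapshot_store[(self.name, record.snapshot_id)] = copy.deepcopy(self.state)
            emit(record)
        else:
            # Normal record: update state deterministically and emit a result.
            key, value = record
            self.state[key] = self.state.get(key, 0) + value
            emit((key, self.state[key]))
```

On recovery, each operator would reload its state from the latest completed snapshot, and the sources would replay the stream from the corresponding positions.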
_Why is the solution different from previous work?_

Micro-batch approaches suffer from the following issues:

* They employ a modified programming model, which limits expressiveness; for example, users cannot compute on window data in periods other than multiples of the checkpoint interval. Moreover, there is no support for count-based or session windows that are popular among many applications.
* It is hard for them to deal with backpressure.
* The micro-batch latency limits applications that require sub-second latencies.

When compared with Google's Cloud Data Flow model, Flink uses distributed snapshots while Cloud Data Flow uses distributed
transactions. In distributed transactions, record deliveries are logged along with the updates to the state. After a failure,
both state and record deliveries are replayed from the log, making recovery expensive. Distributed snapshots overcome this issue
by letting the periodicity of snapshots determine the time spent in recovery.

_Does the paper identify any fundamental/hard trade-offs?_

The key tradeoff lies in the fault tolerance scheme employed, which plays an important role in the end-to-end design
of such a system. By employing a snapshot-based scheme, the tradeoff is essentially between recovery time and
the overhead of periodic snapshot creation. If there are too many snapshots, and the operators have a lot of state,
then the data processing performance (latency, throughput) degrades; on the other hand, if there are too few snapshots,
then a significant amount of time is spent recovering from a failure.
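
As a back-of-the-envelope illustration of this tradeoff (all numbers below are made up):

```python
# Sketch: steady-state snapshot overhead vs. expected replay on failure.
# All numbers are hypothetical, for illustration only.
snapshot_cost_s = 2.0    # time to write one snapshot of operator state
interval_s = 60.0        # time between consecutive snapshots

# Fraction of processing time spent taking snapshots.
overhead = snapshot_cost_s / interval_s          # ~3.3%

# On failure, roughly half an interval of input must be replayed on average.
expected_replay_s = interval_s / 2               # 30 s

print(f"overhead: {overhead:.1%}, expected replay: {expected_replay_s:.0f}s")
```

Halving the interval halves the expected replay but doubles the snapshotting overhead.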
_Will the paper be influential in 10 years?_

The key contribution in Apache Flink is its choice of fault-tolerance mechanism for stream processing, in that it does
not constrain the system's ability to achieve other requirements for a stream processing framework. The mechanism for
fault tolerance itself is a simple adaptation of the Chandy-Lamport algorithm. While the identified requirements for a stream
processing system are interesting (although not new -- the Data Flow model paper had already identified a number of them),
the choice of fault-tolerance scheme itself isn't fundamental to the design of the stream processing system (unless it
employs a radically different data processing framework such as micro-batch processing). Since the system is still quite
new, it is hard to predict how impactful it will be in industry. It is unclear if it will be academically influential
in the next 10 years.
## The Dataflow Model: A Practical Approach to Balancing Correctness, Latency, and Cost in Massive-Scale, Unbounded, Out-of-Order Data Processing

_Is the Problem Real?_

The problem addressed in this paper is the correct modeling of data processing for unbounded, unordered, large-scale
datasets with sophisticated requirements, such as low-latency execution, event-time ordering, windowing, etc. There
are a large number of systems that have evolved over time to cater to these widely contrasting requirements,
typically achieving different degrees of correctness, latency, and cost. Correctly modeling the semantics of a system
that can cater to such problems could provide the ability to cater to all of these requirements, while the degree of
correctness, latency and cost would still be governed by the underlying implementation. In this aspect, I feel this
is an interesting and real problem to solve.

_What is the solution's main idea?_

The key aspect in this paper isn't the system design itself, but the real-world experiences and a set of principles
that governed their system design:

* Systems dealing with unbounded datasets shouldn't rely on any notion of completeness.
* It is important to be flexible in terms of the set of operations that can be supported by the system.
* The interface for using the system for analyzing unbounded datasets should be simple.
* The system should support robust analysis of data in the context in which it occurred.

In particular, the Data Flow model provides a way to compute "event-time" ordered results, windowed by features of the
unbounded data, with tunable correctness, latency and cost across a spectrum of combinations. This is made possible by
decomposing the pipeline across the following dimensions:

- What results are being computed
- Where in event time are they being computed
- When in processing time are they materialized
- How earlier results relate to their later refinements

and also by segregating the logical data processing from the underlying implementation (e.g., via batch, micro-batch
or stream processing). Different degrees of correctness, latency and cost are left to the choice of the underlying
implementation.
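
As an illustrative sketch of this decomposition (invented names, not the actual Dataflow API): summing per key is the _what_, fixed event-time windows are the _where_, and firing on the watermark is the _when_; refinements for late data (the _how_) are omitted for brevity.

```python
# Sketch: per-key sums (what) in fixed event-time windows (where),
# emitted once the watermark passes the window end (when).
# All names are invented for illustration.
from collections import defaultdict

WINDOW_S = 60  # fixed 60-second event-time windows

def window_of(event_time):
    """Fixed window [start, end) containing the record's event time."""
    start = (event_time // WINDOW_S) * WINDOW_S
    return (start, start + WINDOW_S)

sums = defaultdict(int)  # (key, window) -> running sum

def on_record(key, value, event_time):
    sums[(key, window_of(event_time))] += value

def on_watermark(watermark, emit):
    # Fire every window whose end the watermark has passed.
    for (key, window), total in list(sums.items()):
        if window[1] <= watermark:
            emit(key, window, total)
            del sums[(key, window)]
```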
_Why is the solution different from previous work?_

The authors provide a detailed description of existing systems that cater to analyzing unbounded and bounded datasets.
The key observation is that while different existing systems achieve some degree of correctness, low latency, low cost,
flexibility and simplicity, they often sacrifice a number of these aspects that could be attained together by employing
a different model for handling unbounded data. Moreover, the systems are too closely tied to the underlying physical
implementation in order to achieve the desired clarity, composability and flexibility that is expected from these systems.
The Data Flow model addresses these shortcomings. However, the limitations of the underlying implementation are still left
in place; what the Data Flow model facilitates is a common platform that can express parallel computation independently of the
underlying implementation, without sacrificing the expressiveness of the model itself.

_Does the paper identify any fundamental/hard trade-offs?_

The paper identifies three key metrics in data processing of unbounded datasets that are always in tension, namely,
correctness, latency, and cost. Existing systems also trade off flexibility and simplicity along with the metrics above;
however, that is simply an artifact of a strong coupling between the data processing model and the underlying
implementation. By separating the model from implementation, the authors show that the model can achieve the
required flexibility, simplicity and composability for processing such datasets; the tradeoff is reduced to that
between correctness, latency and cost, which is dependent on the selected framework (batch, micro-batch or record-at-a-time
streaming).
_Will the paper be influential in 10 years?_

The paper is motivated by strong real-world applications. The Data Flow model provides expressiveness sufficient to
cater to a large number of such applications, and shows promise in being able to support new applications as well. As
such, since there are no constraints on the underlying implementation for adopting the Data Flow model itself, I feel
this paper will be influential both in academia and in industry for designing processing engines for unbounded datasets.
@@ -0,0 +1,40 @@
---
layout: post
title: Storage and File Systems
---

## The Google File System

_Is the Problem Real?_

_What is the solution's main idea?_

_Why is the solution different from previous work?_

_Does the paper identify any fundamental/hard trade-offs?_

_Will the paper be influential in 10 years?_

## Flat Datacenter Storage

_Is the Problem Real?_

_What is the solution's main idea?_

_Why is the solution different from previous work?_

_Does the paper identify any fundamental/hard trade-offs?_

_Will the paper be influential in 10 years?_

## IOFlow: A Software-Defined Storage Architecture

_Is the Problem Real?_

_What is the solution's main idea?_

_Why is the solution different from previous work?_

_Does the paper identify any fundamental/hard trade-offs?_

_Will the paper be influential in 10 years?_
+28
@@ -0,0 +1,28 @@
---
layout: post
title: Memory Based Systems
---

## FaRM: Fast Remote Memory

_Is the Problem Real?_

_What is the solution's main idea?_

_Why is the solution different from previous work?_

_Does the paper identify any fundamental/hard trade-offs?_

_Will the paper be influential in 10 years?_

## The Case for RAMClouds: Scalable High-Performance Storage Entirely in DRAM

_Is the Problem Real?_

_What is the solution's main idea?_

_Why is the solution different from previous work?_

_Does the paper identify any fundamental/hard trade-offs?_

_Will the paper be influential in 10 years?_
+40
@@ -0,0 +1,40 @@
---
layout: post
title: Resource Management
---

## Mesos: A Platform for Fine-Grained Resource Sharing in the Data Center

_Is the Problem Real?_

_What is the solution's main idea?_

_Why is the solution different from previous work?_

_Does the paper identify any fundamental/hard trade-offs?_

_Will the paper be influential in 10 years?_

## Apache Hadoop YARN: Yet Another Resource Negotiator

_Is the Problem Real?_

_What is the solution's main idea?_

_Why is the solution different from previous work?_

_Does the paper identify any fundamental/hard trade-offs?_

_Will the paper be influential in 10 years?_

## Kubernetes: Open Source Container Cluster Orchestration

_Is the Problem Real?_

_What is the solution's main idea?_

_Why is the solution different from previous work?_

_Does the paper identify any fundamental/hard trade-offs?_

_Will the paper be influential in 10 years?_
@@ -0,0 +1,84 @@
---
layout: post
title: Resource Allocation and Scheduling
---

## Dominant Resource Fairness (DRF)

_Is the Problem Real?_

_What is the solution's main idea?_

_Why is the solution different from previous work?_

_Does the paper identify any fundamental/hard trade-offs?_

_Will the paper be influential in 10 years?_

## Omega: flexible, scalable schedulers for large compute clusters

_Is the Problem Real?_

The problem addressed in this paper is cluster scheduling in a large-scale environment (scalability),
with rapidly changing requirements (flexibility). The target is high utilization and efficiency in shared
large-scale compute clusters, running a mix of:

* CPU-intensive and memory-intensive jobs
* batch and low-latency jobs
* user-facing and infrastructure services

As clusters and workloads keep growing, and since the scheduler's workload is roughly proportional to cluster size,
the scheduler becomes a scalability bottleneck.

The authors state that Omega's design is driven by Google's real-life production workloads. In particular:

> "Google's production job scheduler has experienced all of this. Over the years, it has evolved into a
> complicated, sophisticated system that is hard to change."

_What is the solution's main idea?_

The authors employ a parallel scheduler architecture with shared state, using lock-free optimistic concurrency
control, to achieve both implementation extensibility and performance scalability.
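
A minimal sketch of the shared-state, optimistic-concurrency idea follows; the names (`CellState`, `try_schedule`) are invented for illustration, and a real implementation would make the version check and commit a single atomic step:

```python
# Sketch: each scheduler works from its own view of the shared cell
# state and commits claims optimistically; a version mismatch means
# another scheduler won the race, and the work must be redone.
class CellState:
    """Shared, versioned view of the cluster's free resources."""
    def __init__(self, free_cpus):
        self.free_cpus = free_cpus
        self.version = 0

def try_schedule(cell, job_cpus, max_retries=3):
    for _ in range(max_retries):
        # 1. Take a private snapshot of the shared state.
        seen_version, seen_free = cell.version, cell.free_cpus
        if seen_free < job_cpus:
            return False  # not enough resources visible right now

        # 2. Optimistic commit: succeeds only if nobody else changed
        #    the cell since our snapshot (atomic in a real system).
        if cell.version == seen_version:
            cell.free_cpus -= job_cpus
            cell.version += 1
            return True
        # Conflict: this round of scheduling work is wasted; retry.
    return False
```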
_Why is the solution different from previous work?_

The authors categorize previous work in cluster scheduling as:

* _Monolithic schedulers_ — a single, centralized scheduling algorithm for all jobs (Google's existing scheduler).
* _Two-level schedulers_ — a single, active resource manager that offers compute resources to multiple parallel, independent "scheduler frameworks" (Mesos, Hadoop-on-Demand).

The drawbacks of previous work included:

Monolithic schedulers:

* harder to add new policies
* may not scale up to large cluster sizes

Two-level schedulers:

* conservative resource visibility and locking algorithms limit flexibility and parallelism
* hard to schedule "picky" jobs
* hard to make decisions that require access to the state of the entire cluster

Omega overcomes these by using a parallel scheduler architecture with shared state, using lock-free optimistic concurrency
control, to achieve both implementation extensibility and performance scalability.

_Does the paper identify any fundamental/hard trade-offs?_

The primary tradeoff explored in this work is between the degree of parallelism in scheduling cluster resources,
enabled by shared-state optimistic concurrency control, and the amount of work wasted when the optimistic concurrency
assumptions fail.

_Will the paper be influential in 10 years?_

Omega has already had significant impact, since it is the cluster management system used in production at Google.
I believe the taxonomy of the option space for cluster scheduling development, together with the ideas of shared state
for scalable and flexible cluster scheduling, will have a long-lasting impact on resource scheduling research.
## Sparrow: Distributed, Low Latency Scheduling

_Is the Problem Real?_

_What is the solution's main idea?_

_Why is the solution different from previous work?_

_Does the paper identify any fundamental/hard trade-offs?_

_Will the paper be influential in 10 years?_
