latex compatibility fixes #14

Open: wants to merge 4 commits into base: master
1 change: 1 addition & 0 deletions .gitignore
@@ -0,0 +1 @@
*.pdf
77 changes: 32 additions & 45 deletions README.markdown
@@ -9,13 +9,13 @@ lecture and discussion. Participants will gain an intuitive understanding of
key distributed systems terms, an overview of the algorithmic landscape, and
explore production concerns.

## What makes a thing distributed?
## What makes a thing distributed

Lamport, 1987:

> A distributed system is one in which the failure of a computer
> you didn't even know existed can render your own computer
> unusable.
> A distributed system is one in which the failure of a computer
> you didn't even know existed can render your own computer
> unusable.

- First glance: \*nix boxen in our colo, running processes communicating via
TCP or UDP.
@@ -162,7 +162,7 @@ Lamport, 1987:
- This causes all kinds of havoc in, say, metrics collection
- And debugging it is *hard*
- TCP gives you flow control and repacks logical messages into packets
- You'll need to re-build flow-control and backpressure
- You'll need to re-build flow-control and back-pressure
- TLS over UDP is a thing, but tough
- UDP is really useful where TCP FSM overhead is prohibitive
- Memory pressure
@@ -187,15 +187,15 @@ Lamport, 1987:
- Caveat: Hardware can drift
- Caveat: By *centuries*
- NTP might not care
- http://rachelbythebay.com/w/2017/09/27/2153/
- <http://rachelbythebay.com/w/2017/09/27/2153/>
- Caveat: NTP can still jump the clock backwards (default: delta > 128 ms)
- https://www.eecis.udel.edu/~mills/ntp/html/clock.html
- <https://www.eecis.udel.edu/~mills/ntp/html/clock.html>
- Caveat: POSIX time is not monotonic by *definition*
- Cloudflare 2017: Leap second at midnight UTC meant time flowed backwards
- At the time, Go didn't offer access to CLOCK_MONOTONIC
- Computed a negative duration, then fed it to rand.Int63n(), which panicked (see the Go sketch after this list)
- Caused DNS resolutions to fail: 1% of HTTP requests affected for several hours
- https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/
- <https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/>
- Caveat: The timescales you want to measure may not be attainable
- Caveat: Threads can sleep
- Caveat: Runtimes can sleep
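
A minimal Go sketch of the monotonic-clock fix (illustrative only): from Go 1.9 on, `time.Now` also records a monotonic reading, and `time.Since` prefers it, so measured durations can't go negative when NTP steps the wall clock.

```go
package main

import (
	"fmt"
	"time"
)

func main() {
	// Wall-clock readings can jump backwards (NTP steps, leap seconds),
	// so subtracting two wall timestamps may yield a negative duration.
	start := time.Now() // also records a monotonic reading (Go >= 1.9)

	time.Sleep(10 * time.Millisecond) // stand-in for real work

	// time.Since subtracts via the monotonic component when present, so
	// elapsed stays non-negative even if the wall clock stepped back.
	elapsed := time.Since(start)
	fmt.Println("elapsed:", elapsed)
}
```
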
@@ -252,7 +252,6 @@ Lamport, 1987:
- I don't know who's doing it yet, but I'd bet datacenters in the
future will offer dedicated HW interfaces for bounded-accuracy time.


## Review

We've covered the fundamental primitives of distributed systems. Nodes
@@ -261,9 +260,6 @@ various ways. Protocols like TCP and UDP give us primitive channels for
processes to communicate, and we can order events using clocks. Now, we'll
discuss some high-level *properties* of distributed systems.




## Availability

- Availability is basically the fraction of attempted operations which succeed.
@@ -309,7 +305,6 @@ discuss some high-level *properties* of distributed systems.
- "Apdex for the user service just dropped to 0.5; page ops!"
- Ideally: integral of happiness delivered by your service?


## Consistency

- A consistency model is the set of "safe" histories of events in the system
@@ -465,7 +460,6 @@ discuss some high-level *properties* of distributed systems.
- Monotonic Reads
- Monotonic Writes


### Harvest and Yield

- Fox & Brewer, 1999: Harvest, Yield, and Scalable Tolerant Systems
@@ -482,8 +476,8 @@ discuss some high-level *properties* of distributed systems.
- e.g. "99% of the time, you can read 90% of your prior writes"
- Strongly dependent on workload, HW, topology, etc
- Can tune harvest vs yield on a per-request basis (see the Go sketch after this section)
- "As much as possible in 10ms, please"
- "I need everything, and I understand you might not be able to answer"
- "As much as possible in 10ms, please"
- "I need everything, and I understand you might not be able to answer"

### Hybrid systems

@@ -505,7 +499,6 @@ consistency models generally come at the cost of performance and availability.
Next, we'll talk about different ways to build systems, from weak to strong
consistency.


## Avoid Consensus Wherever Possible

### CALM conjecture
@@ -534,7 +527,6 @@ consistency.
- Unordered programming with flow analysis
- Can tell you where coordination *would* be required


### Gossip

- Message broadcast system
@@ -552,7 +544,7 @@ consistency.
- Hop up to a connector node which relays to other connector nodes
- Reduces superfluous messages
- Reduces latency
- Plumtree (Leitão, Pereira, & Rodrigues, 2007: Epidemic Broadcast Trees)
- Plumtree (Leitao, Pereira, & Rodrigues, 2007: Epidemic Broadcast Trees)
- Push-Sum et al (see the sketch after this list)
- Sum inputs from everyone you've received data from
- Broadcast that to a random peer
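
A toy, synchronous Go sketch of the push-sum idea (after Kempe, Dobra & Gehrke, 2003; simplified to a single-process simulation): each node keeps half of its (sum, weight) pair each round and pushes half to a random peer, and every node's sum/weight estimate converges to the network-wide mean.

```go
package main

import (
	"fmt"
	"math/rand"
)

// pushSum simulates synchronous push-sum rounds over n nodes. Each
// node's estimate sum/weight converges to the mean of the inputs.
func pushSum(inputs []float64, rounds int) []float64 {
	n := len(inputs)
	sum := append([]float64(nil), inputs...)
	weight := make([]float64, n)
	for i := range weight {
		weight[i] = 1
	}

	for r := 0; r < rounds; r++ {
		nextSum := make([]float64, n)
		nextWeight := make([]float64, n)
		for i := 0; i < n; i++ {
			// Keep half, push half to a uniformly random peer.
			peer := rand.Intn(n)
			nextSum[i] += sum[i] / 2
			nextWeight[i] += weight[i] / 2
			nextSum[peer] += sum[i] / 2
			nextWeight[peer] += weight[i] / 2
		}
		sum, weight = nextSum, nextWeight
	}

	est := make([]float64, n)
	for i := range est {
		est[i] = sum[i] / weight[i]
	}
	return est
}

func main() {
	// Every estimate approaches 4.5, the true mean.
	fmt.Println(pushSum([]float64{1, 2, 3, 4, 5, 6, 7, 8}, 30))
}
```
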
@@ -603,7 +595,6 @@ consistency.
- Probably best in concert with stronger transactional systems
- See also: COPS, Swift, Eiger, Calvin, etc


## Fine, We Need Consensus, What Now?

- The consensus problem:
@@ -648,7 +639,6 @@ consistency.
majority of nodes.
- More during cluster transitions.


### Paxos

- Paxos is the Gold Standard of consensus algorithms
@@ -740,8 +730,6 @@ transactions. Serializability and linearizability require *consensus*, which we
can obtain through Paxos, ZAB, VR, or Raft. Now, we'll talk about different
*scales* of distributed systems.



## Characteristic latencies

- Latency is *never* zero
@@ -791,12 +779,11 @@ can obtain through Paxos, ZAB, VR, or Raft. Now, we'll talk about different
- Network is within an order of mag compared to uncached disk seeks
- Or faster, in EC2
- EC2 disk latencies can routinely hit 20ms
- 200ms?
- *20,000* ms???
- Because EBS is actually other computers
- LMAO if you think anything in EC2 is real
- Wait, *real disks do this too*?
- What even are IO schedulers?
- 200ms? *20,000* ms???
- Because EBS is actually other computers
- LMAO if you think anything in EC2 is real
- Wait, *real disks do this too*?
- What even are IO schedulers?
- But network is waaaay slower than memory/computation
- If your aim is *throughput*, work units should probably take longer than a
  millisecond (see the batching sketch below)
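
One way to read that advice is batching: amortize each network round-trip over many work items. A hedged Go sketch (all names invented), flushing on batch size or a small delay:

```go
package main

import (
	"fmt"
	"time"
)

// flush stands in for one network round-trip, amortized over the batch.
func flush(batch []int) {
	fmt.Printf("flushed %d items\n", len(batch))
}

// batcher drains items and flushes either when the batch is full or when
// maxDelay expires, keeping per-item overhead small without letting
// latency grow unboundedly.
func batcher(items <-chan int, maxBatch int, maxDelay time.Duration) {
	batch := make([]int, 0, maxBatch)
	timer := time.NewTimer(maxDelay)
	defer timer.Stop()
	for {
		select {
		case it, ok := <-items:
			if !ok {
				if len(batch) > 0 {
					flush(batch) // drain the tail on shutdown
				}
				return
			}
			batch = append(batch, it)
			if len(batch) == maxBatch {
				flush(batch)
				batch = batch[:0]
				timer.Reset(maxDelay)
			}
		case <-timer.C:
			if len(batch) > 0 {
				flush(batch)
				batch = batch[:0]
			}
			timer.Reset(maxDelay)
		}
	}
}

func main() {
	items := make(chan int)
	done := make(chan struct{})
	go func() { batcher(items, 64, time.Millisecond); close(done) }()
	for i := 0; i < 1000; i++ {
		items <- i
	}
	close(items)
	<-done
}
```
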
@@ -849,7 +836,6 @@ latencies are short enough for many network hops before users take notice. In
geographically replicated systems, high latencies drive eventually consistent
and datacenter-pinned solutions.


## Common distributed systems

### Outsourced heaps
@@ -951,7 +937,6 @@ low-latency processing of datasets, and tend to look more like frameworks than
databases. Their dual, distributed queues, focus on the *messages* rather
than the *transformations*.


## A Pattern Language

- General recommendations for building distributed systems
@@ -1093,7 +1078,7 @@ than the *transformations*.
- Sharding for scalability
- Avoiding coordination via CRDTs
- Flake IDs: *mostly* time-ordered identifiers, zero-coordination (sketched below)
- See http://yellerapp.com/posts/2015-02-09-flake-ids.html
- See <http://yellerapp.com/posts/2015-02-09-flake-ids.html>
- Partial availability: users can still use some parts of the system
- Processing a queue: more consumers reduces the impact of expensive events
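
A hedged Go sketch of a flake-style generator (the 42/10/12 bit layout is illustrative, not any particular library's): a millisecond timestamp in the high bits makes IDs roughly time-ordered; a per-node ID plus a sequence makes them unique without coordination.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// FlakeGen mints roughly time-ordered 64-bit IDs with zero coordination:
// 42 bits of millisecond timestamp, 10 bits of node ID, 12 bits of
// per-millisecond sequence.
type FlakeGen struct {
	mu     sync.Mutex
	nodeID uint64 // unique per node (e.g. from config); must fit in 10 bits
	lastMs uint64
	seq    uint64
}

func (g *FlakeGen) Next() uint64 {
	g.mu.Lock()
	defer g.mu.Unlock()
	now := uint64(time.Now().UnixMilli())
	if now < g.lastMs {
		now = g.lastMs // clock stepped backwards; keep IDs monotonic
	}
	if now == g.lastMs {
		g.seq++
		if g.seq >= 1<<12 { // sequence exhausted; borrow the next millisecond
			g.lastMs++
			g.seq = 0
		}
	} else {
		g.lastMs, g.seq = now, 0
	}
	return g.lastMs<<22 | g.nodeID<<12 | g.seq
}

func main() {
	g := &FlakeGen{nodeID: 7}
	for i := 0; i < 3; i++ {
		fmt.Println(g.Next())
	}
}
```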

@@ -1290,8 +1275,6 @@ scale. As software grows, different components must scale independently,
and we break out libraries into distinct services. Service structure goes
hand-in-hand with teams.



## Production Concerns

- More than design considerations
Expand Down Expand Up @@ -1362,8 +1345,8 @@ hand-in-hand with teams.
- In relation to its dependencies
- Which can, in turn, drive new tests
- In a way, good monitoring is like continuous testing
- But not a replacement: these are distinct domains
- Both provide assurance that your changes are OK
- But not a replacement: these are distinct domains
- Both provide assurance that your changes are OK
- Want high-frequency monitoring
- Production behaviors can take place on 1ms scales
- TCP incast
@@ -1381,13 +1364,13 @@ hand-in-hand with teams.
- Key metrics for most systems
- Apdex: successful response WITHIN latency SLA
- Latency profiles: 0, 0.5, 0.95, 0.99, 1 (see the sketch at the end of this section)
- Percentiles, not means
- BTW you can't take the mean of percentiles either
- ^ Percentiles, not means
- ^ BTW you can't take the mean of percentiles either
- Overall throughput
- Queue statistics
- Subjective experience of other systems latency/throughput
- The DB might think it's healthy, but clients could see it as slow
- Combinatorial explosion--best to use this when drilling into a failure
- ^ The DB might think it's healthy, but clients could see it as slow
- ^ Combinatorial explosion--best to use this when drilling into a failure
- You probably have to write this instrumentation yourself
- Invest in a metrics library
- Out-of-the-box monitoring usually doesn't measure what really matters: your
@@ -1504,7 +1487,6 @@ hand-in-hand with teams.
- Ask Jeff Hodges why it's hard: see his RICON West 2013 talk
- See Zach Tellman - Everything Will Flow
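
A small Go sketch of those two metrics (simplified: real systems usually keep histograms or streaming sketches per window rather than sorting raw samples):

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// profile reports a latency profile from raw samples. Percentiles must
// come from the raw data (or a histogram): averaging the p99s of two
// windows does not give you a p99.
func profile(samples []time.Duration, quantiles []float64) []time.Duration {
	sorted := append([]time.Duration(nil), samples...)
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	out := make([]time.Duration, len(quantiles))
	for i, q := range quantiles {
		out[i] = sorted[int(q*float64(len(sorted)-1))]
	}
	return out
}

// apdex, simplified here: the fraction of requests that succeeded AND
// landed within the latency SLA.
func apdex(samples []time.Duration, succeeded []bool, sla time.Duration) float64 {
	good := 0
	for i, d := range samples {
		if succeeded[i] && d <= sla {
			good++
		}
	}
	return float64(good) / float64(len(samples))
}

func main() {
	samples := []time.Duration{
		2 * time.Millisecond, 3 * time.Millisecond, 5 * time.Millisecond,
		8 * time.Millisecond, 120 * time.Millisecond,
	}
	succeeded := []bool{true, true, true, true, false}
	fmt.Println(profile(samples, []float64{0, 0.5, 0.95, 0.99, 1}))
	fmt.Println("apdex:", apdex(samples, succeeded, 10*time.Millisecond))
}
```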


## Review

Running distributed systems requires cooperation between developers, QA, and
@@ -1519,10 +1501,15 @@ special care.

### Online

- Mixu has a delightful book on distributed systems with incredible detail. http://book.mixu.net/distsys/
- Jeff Hodges has some excellent, production-focused advice. https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/
- The Fallacies of Distributed Computing is a classic text on mistaken assumptions we make designing distributed systems. http://www.rgoarchitects.com/Files/fallacies.pdf
- Christopher Meiklejohn has a list of key papers in distributed systems. http://christophermeiklejohn.com/distributed/systems/2013/07/12/readings-in-distributed-systems.html
- Mixu has a delightful book on distributed systems with incredible detail.
<http://book.mixu.net/distsys/>
- Jeff Hodges has some excellent, production-focused advice.
<https://www.somethingsimilar.com/2013/01/14/notes-on-distributed-systems-for-young-bloods/>
- The Fallacies of Distributed Computing is a classic text
on mistaken assumptions we make designing distributed systems.
<http://www.rgoarchitects.com/Files/fallacies.pdf>
- Christopher Meiklejohn has a list of key papers in distributed systems.
<http://christophermeiklejohn.com/distributed/systems/2013/07/12/readings-in-distributed-systems.html>

### Trees

3 changes: 3 additions & 0 deletions pandoc_pdf.sh
@@ -0,0 +1,3 @@
#!/bin/sh
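# Render README.markdown as a standalone two-column PDF (needs pandoc and a LaTeX engine).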

pandoc --verbose --from=markdown_github --output=aphyr-distsys-intro.pdf --variable classoption=twocolumn --standalone README.markdown