Commit

auto
elimelt committed Jan 1, 2025
1 parent ef61cd0 commit 758457d
Showing 50 changed files with 6,361 additions and 91 deletions.
1 change: 1 addition & 0 deletions deploy.sh
@@ -0,0 +1 @@
python scripts/clean.py && python scripts/build.py . site && git add . && git commit -am 'auto' && git push
@@ -1,3 +1,10 @@
---
title: Fundamentals of Data-Intensive Application Design and Scalability
category: Software Engineering
tags: reliability, scalability, maintainability, data systems, performance
description: This document explores the key principles of designing data-intensive applications, focusing on reliability, scalability, and maintainability. It discusses various aspects of system design, including handling different types of faults, measuring and managing load, and ensuring long-term system health through good operational practices and simplicity.
---

# Chapter 1
## Reliable, Scalable, and Maintainable Applications

@@ -1,3 +1,10 @@
---
title: Data Models and Relationships in Database Systems
category: Database Design
tags: relational databases, document databases, graph databases, data modeling, query languages
description: This document explores various data models including relational, document, and graph databases. It discusses the strengths and weaknesses of each model, focusing on how they handle relationships between data. The chapter also covers query languages, data locality, and the concept of MapReduce for data processing.
---

# Chapter 2

## Data Models and Query Languages
@@ -1,4 +1,12 @@
---
title: Storage and Retrieval Techniques for Database Systems
category: Database Systems
tags: data structures, indexing, oltp vs olap, column-oriented storage
description: Explores various storage and retrieval techniques for database systems, including log-structured storage, indexing methods, and optimizations. Compares OLTP and OLAP systems, discussing their respective storage strategies and performance considerations.
---

# Chapter 3

## Storage and Retrieval

```bash
db_set () {
    echo "$1,$2" >> database
}

db_get () {
    grep "^$1," database | sed -e "s/^$1,//" | tail -n 1
}
```
@@ -22,6 +30,7 @@ To make reads faster, we can use an **index**. An index is an additional structure derived from the primary data.
Map keys to offsets in the data file. This is fast for equality queries, but not for range queries. Store log-structured key-value data like above in binary format, and use a hash index to find the offset of the key in the data file. **Delete** by marking the key as deleted in the data file (sometimes with a "tombstone"), and periodically reindex the data file to remove the deleted keys.
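
A minimal sketch of this pattern (a toy only: no log segmentation, compaction, or checksums, and it assumes keys and values contain no commas or newlines):

```python
class HashIndexLog:
    """Append-only log file with an in-memory hash index of byte offsets."""

    TOMBSTONE = "__tombstone__"

    def __init__(self, path):
        self.path = path
        self.index = {}                  # key -> offset of the latest record
        open(path, "ab").close()         # make sure the log file exists
        self.rebuild_index()             # crash recovery: full scan of the log

    def put(self, key, value):
        record = f"{key},{value}\n".encode("utf-8")
        with open(self.path, "ab") as f: # append mode: position is end of file
            offset = f.tell()
            f.write(record)
        self.index[key] = offset         # point the key at its newest record

    def get(self, key):
        offset = self.index.get(key)
        if offset is None:
            return None
        with open(self.path, "rb") as f:
            f.seek(offset)               # one seek, one read
            _, value = f.readline().decode("utf-8").rstrip("\n").split(",", 1)
        return None if value == self.TOMBSTONE else value

    def delete(self, key):
        self.put(key, self.TOMBSTONE)    # tombstone; compaction reclaims space

    def rebuild_index(self):
        """Re-scan the log from the beginning, rebuilding the index in memory."""
        self.index.clear()
        with open(self.path, "rb") as f:
            while True:
                offset = f.tell()
                line = f.readline()
                if not line:
                    break
                key = line.decode("utf-8").split(",", 1)[0]
                self.index[key] = offset
```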

For **crash recovery**:

- reread log file from beginning to end, building hash table indices in memory (slow, but no additional storage)
- store snapshots of the hash table indices to disk periodically (fast, but requires additional storage)
- use checksums to detect partial corruption of the hash table indices (fast, but requires additional storage)
@@ -81,22 +90,22 @@ B-trees are good for one-dimensional indexes, but not for **multi-dimensional indexes**.

As RAM gets cheaper, it makes more sense to keep data in memory. **In-memory databases** are good for applications that need low latency and high throughput, but make crash recovery more difficult. **Write-ahead logs** are a good way to make crash recovery easier, but can slow down writes. Some databases (e.g. Redis) let you choose between durability and performance, and have "weak durability" by asynchronously writing to disk. Some in-memory databases can even exceed the amount of RAM available by using an eviction policy similar to a cache.


### Transaction Processing or Analytics?

**OLTP** (online transaction processing) is good for real-time stateful workloads, where low latency and high throughput are important. **OLAP** (online analytical processing) is good for async batch processing, where high throughput is important, but latency is not.

#### OLTP:

- transactional databases, relational databases, key-value stores, etc.
- ACID transactions, concurrency control, indexes, etc.
- enterprise typically made up of several OLTP systems that require high-availability and low-latency for reads and writes.

#### OLAP:

- data warehouses, batch processing, batch analytics, Hadoop, Spark, etc.
- read-only copy of data, typically loaded and queried in batches
- typically used for business intelligence, reporting, and data mining. Not as critical to keep up and running 24/7, and queries are able to hog system resources without consequence.


Most of the previous indexes are more OLTP-focused, whereas "data warehouses" are more OLAP-focused, and often use a different schema and index model.

### Stars and Snowflakes: Schemas for Analytics
@@ -121,7 +130,7 @@
```sql
WHERE
dim_product.category IN ('Fresh fruit', 'Candy')
GROUP BY
dim_date.weekday, dim_product.category;
```

In transactional databases, storage is "row-oriented", so an entire row needs to be loaded, including columns that aren't used. **Column-oriented** storage is better for OLAP, since it only needs to load the columns a query actually uses, which can save a lot of I/O over large datasets. It also allows for better compression, since neighboring values within a column are often similar.
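
A toy illustration of the layout difference, with invented rows; note how a column of similar neighboring values lends itself to run-length encoding:

```python
# Hypothetical fact-table rows: (date_key, product_sk, quantity).
rows = [
    (20250101, 31, 1),
    (20250101, 31, 3),
    (20250102, 68, 1),
    (20250102, 69, 2),
]

# Row-oriented: summing `quantity` still touches every column of every row.
total = sum(r[2] for r in rows)

# Column-oriented: each column stored contiguously; scan only what you need.
columns = {
    "date_key":   [r[0] for r in rows],
    "product_sk": [r[1] for r in rows],
    "quantity":   [r[2] for r in rows],
}
total = sum(columns["quantity"])

def run_length_encode(values):
    """Compress runs of repeated values; sorted columns are full of them."""
    runs = []
    prev, count = values[0], 0
    for v in values:
        if v == prev:
            count += 1
        else:
            runs.append((prev, count))
            prev, count = v, 1
    runs.append((prev, count))
    return runs

print(run_length_encode(columns["date_key"]))  # [(20250101, 2), (20250102, 2)]
```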

@@ -144,5 +153,3 @@ These optimizations are good for reads and make sense in a data warehouse, but complicate writes.
**Data cubes** are a way to precompute aggregations over multiple dimensions: essentially a multi-dimensional array, where each cell is an aggregation over a subset of the dimensions. They are good for speeding up the queries they cover, but are expensive to maintain and inflexible for queries that aren't covered by the precomputed aggregations.
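
A sketch of the idea with invented sales facts: materialize every cell of a two-dimensional (date, category) cube, then answer covered queries by lookup and roll-ups by summing a slice:

```python
from collections import defaultdict

# Hypothetical sales facts: (date, product_category, amount).
facts = [
    ("2025-01-01", "Fresh fruit", 9.50),
    ("2025-01-01", "Candy",       3.25),
    ("2025-01-02", "Fresh fruit", 4.00),
]

# Two-dimensional cube: precompute the aggregate for every (date, category) cell.
cube = defaultdict(float)
for date, category, amount in facts:
    cube[(date, category)] += amount

# Queries covered by the cube are cheap lookups...
fruit_on_jan_1 = cube[("2025-01-01", "Fresh fruit")]

# ...and roll-ups along one dimension just sum a slice of cells.
jan_1_total = sum(v for (d, _), v in cube.items() if d == "2025-01-01")
```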

Oftentimes, it makes more sense to store raw data, and then benchmark queries to see which ones are slow, and then precompute aggregations for those queries if they need to be faster.


@@ -1,4 +1,12 @@
---
title: Encoding, Evolution, and Data Flow in Distributed Systems
category: Distributed Systems
tags: data serialization, schema evolution, compatibility, message passing, encoding formats
description: This document explores various aspects of data encoding and evolution in distributed systems. It covers different serialization formats, schema evolution strategies, and modes of data flow including databases, services, and message passing systems, with a focus on maintaining compatibility as systems change over time.
---

# Chapter 4

## Encoding and Evolution

**evolvability**: the ability to evolve as requirements change.
@@ -13,7 +21,8 @@

### Formats for Encoding Data

Programs typically work with at least two different representations of data (see the sketch after this list):

- **in-memory data structures** (objects, arrays, hash maps, etc.)
- **serialized** (encoded) **bytes** (often stored on disk or sent over the network).
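
For example, a round trip between the two representations, using JSON as the encoding (the data is invented):

```python
import json

# In-memory representation: a plain Python object graph.
user = {"name": "Martin", "interests": ["databases", "sailing"]}

# Encoding (serialization): in-memory structure -> self-contained byte sequence.
encoded = json.dumps(user).encode("utf-8")

# Decoding (deserialization): bytes -> a fresh in-memory structure.
decoded = json.loads(encoded.decode("utf-8"))
assert decoded == user
```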

@@ -35,11 +44,10 @@ Many built-in serialization formats that are included with languages are not performant.

**Avro**: another binary encoding, made as an alternative to protobuf and Thrift for Hadoop. Uses a schema to specify the encoding. Has two schema languages: an IDL for human editing, and a JSON-based one for machine reading. Doesn't use tag numbers or any identifying field info; just lengths and data values. Uses variable-length integers to encode lengths. Uses a writer/reader schema setup.


### What makes protobuf, Thrift, and Avro good?

- **Compact structure**: data is encoded as a sequence of fields, each of which contains a key (field tag) and a value. The key indicates the field's name and data type. Field tags are used to identify fields from the schema, and variable-length integers are used to encode the field tags (see the sketch after this list).
- **Schema evolution**: new fields can be added to the schema, but must have a new tag number and either be optional or have a default value. Old code can read data with new fields (it ignores them), and new code can read data with old fields. Fields can be removed or renamed, but only if the field tag is not reused. Fields can be reordered without breaking compatibility.
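
As a sketch of the field-tag idea: a protobuf-style varint and one length-delimited field encoded by hand. The field number and value here are made up; real libraries generate this from a schema.

```python
def encode_varint(n: int) -> bytes:
    """Base-128 varint: 7 bits per byte, high bit set means more bytes follow."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def encode_field(tag: int, value: bytes) -> bytes:
    """Length-delimited field: key = (tag << 3) | wire type 2, then length, then bytes."""
    key = (tag << 3) | 2
    return encode_varint(key) + encode_varint(len(value)) + value

# Field tag 1 holding the UTF-8 string "Martin":
record = encode_field(1, "Martin".encode("utf-8"))
print(record.hex())  # 0a064d617274696e
```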

### Writer/Reader Schema Setup

@@ -56,7 +64,6 @@ However, in a database where data is written over a long period of time, you can include a version number with each record and look up the writer's schema for that version.

In an inter-service communication context, schemas can be negotiated between client and server on connection setup. Using a database of schema versions might be a good idea here.
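
A sketch of the writer/reader resolution rules, using plain dicts as stand-in schemas (not the real Avro library or wire format): match fields by name, fill in defaults for fields the writer didn't know about, and drop fields the reader doesn't know about.

```python
writer_schema = {"fields": [{"name": "user", "type": "string"}]}
reader_schema = {"fields": [
    {"name": "user",  "type": "string"},
    {"name": "email", "type": "string", "default": ""},  # added later
]}

def resolve(record: dict, writer: dict, reader: dict) -> dict:
    """Avro-style resolution between a writer's schema and a reader's schema."""
    writer_names = {f["name"] for f in writer["fields"]}
    out = {}
    for field in reader["fields"]:
        name = field["name"]
        if name in writer_names:
            out[name] = record[name]      # present in the data: take it
        elif "default" in field:
            out[name] = field["default"]  # missing but defaulted: fill it
        else:
            raise ValueError(f"no value or default for {name}")
    return out

print(resolve({"user": "martin"}, writer_schema, reader_schema))
# {'user': 'martin', 'email': ''}
```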


### Dynamically Generated Schemas

**Situation:** Want to dump database to file on disk using a binary encoding format.
@@ -69,7 +76,6 @@ This kind of dynamic schema generation was a design goal of Avro, and is less so for Thrift and protobuf, which require hand-assigned field tags.

Keeping a database of schema versions is a good idea, and can be used to generate Avro schemas for each version of the schema. This can be used to generate a schema for each version of the database, and then used to encode/decode data from each version of the database.


## Modes of Dataflow

Generally, one process encodes, and another decodes.
@@ -98,16 +104,14 @@ Encoder: client process that sends request to server. Decoder: server process that receives and handles the request.

I'm not even going to get into **SOAP**.

**RPC** is a design philosophy built on the idea of calling a function on a remote server. It tries to provide _location transparency_ by abstracting away network communication when triggering remote method calls. Many RPC frameworks are overly complex and not compatible with other languages. **gRPC** is a modern RPC framework that uses protobuf as the interface definition language, and HTTP/2 as the underlying protocol. It is a good choice for internal services, but not for public APIs.

#### Problems with RPC

Network requests are fundamentally different from local function calls. They can fail, be delayed, or be delivered multiple times. This makes designing systems that use RPC more complex, since you need to take into account these failure modes. **Idempotence** is a good way to allow transparent retries, since it means that the same request can be sent multiple times without causing unintended side effects. REST doesn't try to hide the fact that it is a network protocol, and thus is more transparent about failure modes.
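
One common pattern for idempotent retries (a sketch with invented names, not any particular framework's API): the client attaches a unique request ID, and the server records outcomes so replayed requests are harmless.

```python
import uuid

# Server side: remember each request ID's outcome so retries are harmless.
processed: dict[str, str] = {}

def handle_payment(request_id: str, amount: int) -> str:
    if request_id in processed:           # duplicate delivery or client retry
        return processed[request_id]      # replay the outcome; no double charge
    result = f"charged {amount}"          # ...the real side effect goes here...
    processed[request_id] = result
    return result

# Client side: reuse the same ID for every retry of one logical request.
request_id = str(uuid.uuid4())
for _ in range(3):                        # naive retry loop (e.g. after timeouts)
    response = handle_payment(request_id, 100)
print(response)                           # the charge happened exactly once
```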

Furthermore, the platform lock-in of RPC is a problem. This ties you to a specific language, and makes it difficult to use other languages. This is especially true for public APIs, where you have no control over the client.


#### Current Direction of RPC

Various RPC frameworks exist on top of all the previously mentioned encodings. New generations of RPC frameworks use futures/promises to represent asynchronous responses, and use streaming to represent long lived connections. Some frameworks also include service discovery, which allows clients to find servers without hardcoding their location.
@@ -143,7 +147,6 @@ Several advantages:

Usually one-way communication, but can be used for request/response. This is useful for decoupling the sender from the receiver, and for handling high volumes of messages.


### Distributed Actor Frameworks

**Actor model**: a model of concurrent computation that treats actors as the universal primitives of concurrent computation. In response to a message that it receives, an actor can: make local decisions, create more actors, send more messages, and determine how to respond to the next message received. Actors are essentially state machines that communicate by sending messages to each other.
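
A toy actor in Python, with threads and queues standing in for a real framework's scheduler and transport: private state, a mailbox, and strictly sequential message processing.

```python
import queue
import threading

class CounterActor:
    """Toy actor: private state plus a mailbox, processed one message at a time."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._count = 0  # local state; only this actor's thread touches it
        threading.Thread(target=self._run, daemon=True).start()

    def send(self, message):
        """Messages are the only way to interact with the actor."""
        self._mailbox.put(message)

    def _run(self):
        while True:
            kind, payload = self._mailbox.get()
            if kind == "increment":
                self._count += 1          # make a local decision
            elif kind == "get":
                payload.put(self._count)  # reply by sending a message back

counter = CounterActor()
for _ in range(5):
    counter.send(("increment", None))

reply_box = queue.Queue()                 # stand-in for the sender's mailbox
counter.send(("get", reply_box))
print(reply_box.get())                    # 5
```
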
@@ -155,4 +158,3 @@ The same message passing system is used for communication between actors on the same node and on different nodes.
- **Akka** uses Java's serialization by default, but can be configured to use protobuf for rolling upgrades.
- **Orleans** uses custom encoding by default, and can be configured to use other encodings. Need to set up and shut down clusters when migrating schemas/versioning.
- **Erlang OTP** experimental support for mapping Erlang data types to protobuf. Still need to be careful about schema evolution.

@@ -1,3 +1,10 @@
---
title: Replication Strategies in Distributed Data Systems
category: Distributed Systems
tags: data replication, leader-follower model, synchronous vs asynchronous, failover, replication logs
description: This document explores various replication strategies in distributed data systems, focusing on leader-follower models, synchronous and asynchronous replication, and handling node failures. It also discusses different implementations of replication logs and their implications for system consistency and durability.
---

# Chapter 5
## Replication

@@ -1,3 +1,10 @@
---
title: Batch Processing Systems and MapReduce Fundamentals
category: Distributed Systems
tags: batch processing, mapreduce, distributed filesystems, data analysis, etl
description: This document explores batch processing systems, focusing on MapReduce and distributed filesystems. It covers Unix tools for log analysis, the MapReduce programming model, and various join techniques in distributed environments. The document also discusses applications of batch processing in search indexing and recommendation systems.
---

# Chapter 10
## Batch Processing

193 changes: 193 additions & 0 deletions site/categories/database design.html
@@ -0,0 +1,193 @@
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Category: Database Design</title>
<meta name="description" content="Pages in category Database Design">
<style>
:root {
--text-color: #1a1a1a;
--background-color: #ffffff;
--accent-color: #2563eb;
--border-color: #e5e7eb;
--nav-background: rgba(255, 255, 255, 0.95);
}

@media (prefers-color-scheme: dark) {
:root {
--text-color: #f3f4f6;
--background-color: #1a1a1a;
--accent-color: #60a5fa;
--border-color: #374151;
--nav-background: rgba(26, 26, 26, 0.95);
}
}

body {
font-family: -apple-system, BlinkMacSystemFont, "Segoe UI", Roboto, sans-serif;
line-height: 1.6;
max-width: 50rem;
margin: 0 auto;
padding: 2rem;
color: var(--text-color);
background: var(--background-color);
}

nav {
position: sticky;
top: 0;
background: var(--nav-background);
backdrop-filter: blur(10px);
border-bottom: 1px solid var(--border-color);
padding: 1rem 0;
margin-bottom: 2rem;
display: flex;
flex-wrap: wrap;
gap: 1rem;
z-index: 1000;
}

nav a {
color: var(--accent-color);
text-decoration: none;
padding: 0.5rem 1rem;
border-radius: 4px;
transition: background-color 0.2s;
}

nav a:hover {
background-color: var(--border-color);
}

.breadcrumbs {
margin-bottom: 2rem;
color: var(--text-color);
opacity: 0.8;
}

.breadcrumbs a {
color: var(--accent-color);
text-decoration: none;
}

.content {
margin-top: 2rem;
}

h1, h2, h3, h4, h5, h6 {
margin-top: 2rem;
margin-bottom: 1rem;
line-height: 1.3;
}

code {
background: var(--border-color);
padding: 0.2rem 0.4rem;
border-radius: 3px;
font-size: 0.9em;
font-family: 'SFMono-Regular', Consolas, 'Liberation Mono', Menlo, Courier, monospace;
}

pre {
background: var(--border-color);
padding: 1rem;
border-radius: 4px;
overflow-x: auto;
margin: 1.5rem 0;
}

pre code {
background: none;
padding: 0;
border-radius: 0;
}

img {
max-width: 100%;
height: auto;
border-radius: 4px;
margin: 1.5rem 0;
}

.meta {
color: var(--text-color);
opacity: 0.8;
font-size: 0.9em;
margin-bottom: 2rem;
display: flex;
flex-wrap: wrap;
gap: 1rem;
}

.tags {
margin-top: 2rem;
padding-top: 1rem;
border-top: 1px solid var(--border-color);
}

.tags a {
display: inline-block;
background: var(--border-color);
color: var(--text-color);
padding: 0.2rem 0.6rem;
border-radius: 3px;
text-decoration: none;
font-size: 0.9em;
margin-right: 0.5rem;
margin-bottom: 0.5rem;
}

.tags a:hover {
background: var(--accent-color);
color: white;
}

table {
width: 100%;
border-collapse: collapse;
margin: 1.5rem 0;
}

th, td {
padding: 0.75rem;
border: 1px solid var(--border-color);
}

th {
background: var(--border-color);
}

blockquote {
margin: 1.5rem 0;
padding-left: 1rem;
border-left: 4px solid var(--accent-color);
color: var(--text-color);
opacity: 0.8;
}
</style>
</head>
<body>
<nav>
<a href="/index.html">Home</a>
<a href="/categories/index.html">Categories</a>
<a href="/tags/index.html">Tags</a>
</nav>
<main>
<div class="breadcrumbs">
<a href="/index.html">Home</a> » Category: Database Design
</div>
<h1>Category: Database Design</h1>
<div class="meta">
<span>Last modified: 2025-01-01</span>

</div>
<div class="content">
<h2>Category: Database Design</h2>
<ul>
<li><a href="/designing-data-intensive-applications/part-1-foundations-of-data-systems/ch2-data-models-and-query-languages.html">Data Models and Relationships in Database Systems</a></li></ul>
</div>

</main>
</body>
</html>
