# Sinker: Synchronize Postgres to Elasticsearch

What are you [sinking about](https://www.youtube.com/watch?v=yR0lWICH3rY)?

[![CI status](https://github.com/paradigm-operations/sinker/actions/workflows/test.yml/badge.svg)](https://github.com/paradigm-operations/sinker/actions/workflows/test.yml)
[![License: MIT](https://img.shields.io/badge/License-MIT-yellow.svg)](https://opensource.org/licenses/MIT)
[![License: Apache 2.0](https://img.shields.io/badge/License-Apache_2.0-blue.svg)](https://opensource.org/licenses/Apache-2.0)
[![codecov](https://codecov.io/gh/paradigm-operations/sinker/branch/main/graph/badge.svg?token=AIGMBZR0IG)](https://codecov.io/gh/paradigm-operations/sinker)

## What is Sinker?

Sinker is middleware that synchronizes relational data from a Postgres database to Elasticsearch.
It is simple to operate, requires minimal RAM, and handles arbitrarily complex schemas.

### For Example

In Postgres, you might have a normalized schema like this:

- A Student and a Teacher each refer to a Person
- A Course is taught by a Teacher
- Students have a many-to-many relationship with Courses through the Enrollment join table

![schema](sinker_schema.png)

In Elasticsearch, you might want to index the Course data in an index called `courses` like this:

```json
{
  "name": "Reth",
  "description": "How to build a modern Ethereum node",
  "teacher": {
    "salary": 100000.0,
    "person": {
      "name": "Prof Georgios"
    }
  },
  "enrollments": [
    {
      "grade": 3.14,
      "student": {
        "gpa": 3.99,
        "person": {
          "name": "Loren"
        }
      }
    },
    {
      "grade": 3.50,
      "student": {
        "gpa": 4.00,
        "person": {
          "name": "Abigail"
        }
      }
    }
  ]
}
```

Now you can easily query Elasticsearch for courses taught by Prof Georgios, or for students with high GPAs named Loren
(see the example query after this list). To do this, you need to do two things reliably:

1. Denormalize the normalized data from the five Postgres tables into a single Elasticsearch document with the Course as
   the parent and the other four tables nested appropriately inside it.
2. Keep the Elasticsearch document in sync with the Postgres data, so that if Abigail changes her name in the database
   to Abby, it's reflected in the `Course->Enrollments->Student->Person.name` field.
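
For instance, once the document above is indexed, finding courses taught by Prof Georgios takes a single search. This
is a sketch rather than anything from the Sinker docs: the request body below is for `GET /courses/_search`, and it
assumes the default object mapping so that `teacher.person.name` is addressable as a dotted field:

```json
{
  "query": {
    "match": {
      "teacher.person.name": "Prof Georgios"
    }
  }
}
```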

## How it Works

Sinker transforms the normalized Postgres data into JSON documents stored in a simple key-value materialized view where
the key is the Elasticsearch document ID and the value is the JSON document to be stored in Elasticsearch.
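
In other words, each materialized view boils down to two columns. A quick, purely illustrative way to see this for the
`person_mv` defined later in this README (run it once Sinker has created the view):

```sql
-- The first column is the Elasticsearch document ID; the second is the document body.
SELECT id, person
FROM person_mv
LIMIT 1;
```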

Sinker creates triggers on the Postgres tables that you want to synchronize (e.g., the five tables in the example
above). When a row is inserted, updated, or deleted in any of these tables, the trigger schedules the materialized view
to be refreshed at the next interval.
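
Conceptually, such a trigger only needs to record that a view has gone stale so the next poll refreshes it. A
hypothetical sketch of that mechanism (the table, function, and trigger names here are invented; Sinker's actual
internals may differ):

```sql
-- Bookkeeping table: which views need a refresh at the next interval.
CREATE TABLE IF NOT EXISTS sinker_pending_refresh (
    view_name text PRIMARY KEY
);

-- Flag course_mv as stale whenever the enrollment table changes.
CREATE OR REPLACE FUNCTION mark_course_mv_stale() RETURNS trigger AS $$
BEGIN
    INSERT INTO sinker_pending_refresh (view_name)
    VALUES ('course_mv')
    ON CONFLICT (view_name) DO NOTHING;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER enrollment_marks_course_mv_stale
    AFTER INSERT OR UPDATE OR DELETE ON enrollment
    FOR EACH STATEMENT EXECUTE FUNCTION mark_course_mv_stale();
```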

The changes to the materialized view are sent to a logical replication slot. Sinker reads from this slot and indexes the
documents in Elasticsearch.
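
If you're curious what flows through the slot, Postgres lets you peek at it without consuming anything (the slot name
below is a placeholder; check Sinker's settings for the real one):

```sql
SELECT lsn, xid, data
FROM pg_logical_slot_peek_changes('sinker_slot', NULL, NULL);
```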

You define the query behind the materialized view, so you can denormalize the data however you want, filter out unwanted
documents, transform some fields, etc. If you can express it in SQL, you can build your materialized view around it.

You also configure the Elasticsearch index settings and mappings however you like.

## Installation

```shell
pip install sinker
```

### Environment Variables

Here are some of the environment variables that you'll want to set:

| Environment Variable    | Example Value |
|-------------------------|---------------|
| SINKER_DEFINITIONS_PATH | .             |
| SINKER_SCHEMA           | public        |
| SINKER_POLL_INTERVAL    | 10            |
| SINKER_LOG_LEVEL        | DEBUG         |
| PGPASSWORD              | secret!       |
| PGHOST                  | localhost     |
| PGUSER                  | dev           |
| PGDATABASE              | dev_db        |
| ELASTICSEARCH_HOST      | localhost     |
| ELASTICSEARCH_SCHEME    | http          |

See `sinker/settings.py` for the full list.
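
For a local development setup, that might look like the following shell snippet (the values are the examples from the
table above, not defaults):

```shell
export SINKER_DEFINITIONS_PATH=.
export SINKER_SCHEMA=public
export SINKER_POLL_INTERVAL=10
export SINKER_LOG_LEVEL=DEBUG
export PGPASSWORD='secret!'
export PGHOST=localhost
export PGUSER=dev
export PGDATABASE=dev_db
export ELASTICSEARCH_HOST=localhost
export ELASTICSEARCH_SCHEME=http
```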

## Configuration

Sinker's main configuration file, `views_to_indices.json`, specifies the mapping between the root Postgres materialized
view names and the Elasticsearch indexes that will get populated by them, e.g.:

```json
{
  "person_mv": "people",
  "course_mv": "courses"
}
```

This tells Sinker to define a Postgres materialized view called `person_mv` based on the query in the `person_mv.sql`
file, and an Elasticsearch index called `people` based on the settings and mappings in the `people.json` file. It will
then populate the `people` index with the documents from the `person_mv` materialized view, and do the same for the
`course_mv` materialized view and the `courses` index.
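
For the example above, the directory pointed to by `SINKER_DEFINITIONS_PATH` would therefore hold five files (a
hypothetical layout following the naming convention just described):

```
views_to_indices.json
person_mv.sql
people.json
course_mv.sql
courses.json
```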

### Materialized View Configuration

The `person_mv` materialized view is defined by the SQL in the `person_mv.sql` file, e.g.:

```sql
select id,
       json_build_object(
           'name', "name") as "person"
from "person"
```
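
Behind the scenes, you can expect Sinker to wrap this query in a materialized view definition along these lines (a
sketch; the exact DDL Sinker emits may differ):

```sql
CREATE MATERIALIZED VIEW person_mv AS
select id,
       json_build_object('name', "name") as "person"
from "person";
```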

The `course_mv` SQL is more complex, but you can see how it denormalizes the data from the five tables into a single
JSON document:

```sql
select id,
       json_build_object(
           'name', "name",
           'description', "description",
           'teacher', (
               select json_build_object(
                   'salary', "salary",
                   'person', (
                       select json_build_object('name', "name")
                       from person
                       where person.id = person_id))
               from teacher
               where teacher.id = teacher_id),
           'enrollments', (
               select json_agg(json_build_object(
                   'grade', "grade",
                   'student', (
                       select json_build_object(
                           'gpa', "gpa",
                           'person', (
                               select json_build_object('name', "name")
                               from person
                               where person.id = person_id))
                       from student
                       where student.id = student_id)))
               from enrollment
               where enrollment.course_id = course.id)
       ) as "course"
from "course";
```

### Index Configuration

The Elasticsearch index configurations are stored in the `people.json` and `courses.json` files, e.g.:

```json
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "name": {
        "type": "keyword"
      }
    }
  },
  "settings": {
    "index": {
      "number_of_shards": "1",
      "number_of_replicas": "0"
    }
  }
}
```

## Running

Once you have the environment variables and configuration files set up, you can run Sinker with:

```shell
sinker
```

### Performance

Once you have Sinker running, you may well want it to run faster. Here are some things you can do to improve
performance:

1. Decrease `SINKER_POLL_INTERVAL`. This will make Sinker refresh the materialized views more
   frequently (and thus keep Elasticsearch in closer sync), but it will also increase the load on the database. Note
   that the materialized views are only refreshed when one of the underlying tables has changed, so this won't increase
   the load on the database if there are no changes.
2. Increase `PGCHUNK_SIZE`. This will make Sinker read more rows from the logical replication slot at a time, which
   will reduce the number of round trips to the database. However, it will also increase Sinker's memory usage.
3. Increase `ELASTICSEARCH_CHUNK_SIZE`. This will make Sinker index more documents in a single Elasticsearch
   bulk request, which will reduce the number of round trips to Elasticsearch. However, it will also increase Sinker's
   memory usage and the CPU load on the Elasticsearch cluster.
4. Run `EXPLAIN ANALYZE` on your materialized view queries to see if you can optimize them (e.g., by adding indexes on
   the foreign keys); see the sketch after this list.
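
A minimal sketch of point 4, using the example schema (the index names here are made up):

```sql
-- Inspect the plan for the query behind a materialized view, e.g. person_mv:
EXPLAIN ANALYZE
select id,
       json_build_object(
           'name', "name") as "person"
from "person";

-- If the plan for the course_mv query shows sequential scans on the join columns,
-- index the foreign keys:
CREATE INDEX IF NOT EXISTS enrollment_course_id_idx ON enrollment (course_id);
CREATE INDEX IF NOT EXISTS enrollment_student_id_idx ON enrollment (student_id);
```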

## Contributing

Contributions are welcome! Please open an issue or submit a pull request.

## Acknowledgements

Sinker was inspired by [pgsync](https://github.com/toluaina/pgsync) and [debezium](https://debezium.io/). Each project
takes a different approach to the problem, so check them out to see which one is best for you.
