In this tutorial we'll work through the basic steps of setting up a minimal Druzhba pipeline.
To run this example you will need a source database to move data from and a, target Redshift database to move data to, and an S3 location to use as a staging area for temporary files. If you do not already have those set up, or if you run into any permissions errors, see: :ref:`Demo Environment Setup <demosetup>`.
Install locally in a Python3 virtual environment or wherever you like:
pip install druzhba
Or clone the source code and install from there
git clone [email protected]:seatgeek/druzhba.git
pip install -e .
A Druzhba pipeline is defined by a directory of YAML configuration files. At run time, Druzhba will read these files and a special tracking table in the destination database to determine what data to extract from source databases
As minimal example we're going to configure Druzhba to transfer the contents of a single table in a PostgreSQL database to our data warehouse. We'll start by creating a directory to hold our pipeline configuration.
Using your favorite text editor, create a file pipeline/_pipeline.yaml
:
---
connection:
host: testserver.123456789012.us-east-1.redshift.amazonaws.com
port: 5439
database: testserver
user: ${REDSHIFT_USER}
password: ${REDSHIFT_PASSWORD}
index:
schema: druzhba_raw
table: pipeline_index
s3:
bucket: my-bucket
prefix: druzhba/
iam_copy_role: arn:aws:iam::123456789012:role/RedshiftCopyUnload
sources:
- alias: demodb
type: postgres
This file defines a pipeline. A pipeline definition consists of a destination
database connection (currently only `Amazon Redshift`_ is supported), an
optional index table definition, a mandatory S3 location that the Druzhba
process will have read/write access to (temporary files will be written here
before calling COPY
on the Redshift instance), an `IAM copy role`_, and a
list of source databases to pull. Each source has a unique alias
and a
type
, which must be one of postgres
, mysql
, or mssql
.
Druzhba supports limited templating of YAML configuration files and to allow
injection of environment variables into the configuration. For example, the
user
field will be populated by the value of the REDSHIFT_USER
environment variable.
Replace host
, database
, and iam_copy_role
with appropriate values
for the Redshift instance you'll be using in this test.
Next we will create a file for each source database in our pipeline -- in this case, one. Similarly to above this configuration will define a connection to the source database, but will also contain a list of tables to copy from that source database -- again, only one in this example.
Create a file pipeline/demodb.yaml
:
---
connection_string: postgresql://postgres:docker@localhost:54320/postgres
tables:
- source_table_name: starter_table
destination_table_name: starter_table
destination_schema_name: druzhba_raw
index_column: updated_at
primary_key:
- id
For each table we define the table in the source database to use, the schema and table to create in the target database, and two special columns that we usually want on every table in the pipeline. The index column is a column on the source table that is only increasing -- generally an auto-incrementing identifier for append-only tables, or an updated timestamp or row version for update-in-place tables. The primary key field is mandatory and used by Druzhba to uniquely identify a row. Druzhba will ignore any primary key defined in the source database. Updated rows where the primary key already exists in the destination table will result in updates rather than inserts.
See :ref:`configuration` for more on the configuration files, and |example-link| for more examples.
Now we are ready to finish configuring our environment. We'll need to make sure
we have appropriate AWS credentials available to Druzhba, through the default
provider chain. Then we need to create environment variables to hold our
destination database credentials that our config file was set up to read.
Finally we set the DRUZHBA_CONFIG_DIR
variable to point at the configuration
we want to run.
export DRUZHBA_CONFIG_DIR=pipeline
export REDSHIFT_USER=druzhba_test
export REDSHIFT_PASSWORD=Druzhba123
Extract and load starter_table
from demodb
with:
druzhba -d demodb -t starter_table
Your data is now in Redshift! Subsequent invocations will incrementally pull updated rows from the source table. Of course, this is just the beginning of your pipeline.
Note that you could also just run the command druzhba
with no arguments to
run the entire pipeline. See :ref:`CLI Help <cli-help>` for more on the command
line interface.
That's it! you should now have a working Druzhba pipeline. Next consider reading the :ref:`configuration guide <configuration>`.