Week 1

Docker

Run the ubuntu image in interactive (-it) mode:

docker run -it ubuntu bash

Here bash is a parameter: everything that comes after the image name is passed as parameters to the container.
If you exit the container and then run the image again, all changes will be lost: it will be a new container. That's isolation.
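A quick demo of that isolation (create a file, exit, start again: the file is gone):

docker run -it ubuntu bash
# inside the container:
touch /tmp/hello.txt
exit
# a new container from the same image will not have the file:
docker run -it ubuntu bash
ls /tmp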

To run Python and interact with it:

docker run -it python:3.10.12

To run Python with a custom entrypoint (the entrypoint is what is executed when the container starts):

docker run -it --entrypoint=bash python:3.10.12

And install the libs into the container:

pip install pandas
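A quick check that the install worked (run inside the container):

python -c "import pandas; print(pandas.__version__)"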

After running a container, we need to install the libs inside it. To automate this, we can use a Dockerfile:

# which image to use
FROM python:3.10.12
RUN pip install pandas
ENTRYPOINT ["bash"]

Then we build an image from this Dockerfile with name test_name, tag ver_01, and the current path (.) as the build context:

docker build -t test_name:ver_01 .
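To verify the image was built, list images for that name:

docker images test_name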

Now we can run a new container:

docker run -it test_name:ver_01

Here is an example of a more useful container:

# which image we want to use
FROM python:3.10.12
RUN pip install pandas
# working directory inside the container, where our files will be copied
WORKDIR /app
# copy the file from the current location to the /app path in the container
COPY pipeline.py pipeline.py
ENTRYPOINT ["python", "pipeline.py"]

And pipeline.py itself:

import sys
import pandas as pd

# print all command-line arguments (argv[0] is the script name)
print(sys.argv)
first_param = sys.argv[1]
second_param = sys.argv[2]
print(first_param)
print(second_param)
print('Pipeline has finished!')

And here are the commands to build and run this container:

docker build -t test_name:ver_01 .
docker run -it test_name:ver_01 hello! 42

The output:
>> ['pipeline.py', 'hello!', '42']
>> hello!
>> 42

Docker + Postgres

Install a client to access Postgres (via pip or apt):

pip install pgcli
sudo apt install pgcli

Then run pgcli. If you get these errors:

ImportError: no pq wrapper available.
Attempts made:
- couldn't import psycopg 'c' implementation: No module named 'psycopg_c'
- couldn't import psycopg 'binary' implementation: No module named 'psycopg_binary'
- couldn't import psycopg 'python' implementation: libpq library not found

Install postgresql on your system:

sudo apt-get install postgresql

Run Docker with Postgres:

docker run -it \
    -e POSTGRES_USER="root" \
    -e POSTGRES_PASSWORD="root" \
    -e POSTGRES_DB="ny_taxi" \
    -v $(pwd)/postgresql/ny_taxi_data:/var/lib/postgresql/data \
    -p 5432:5432 \
    postgres:16.2

ny_taxi will be the name of the database, and $(pwd)/postgresql/ny_taxi_data is the host folder mounted as a volume so the data survives container restarts. Be careful with the Postgres path inside the container. The correct path is /var/lib/postgresql/data!
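To check that the container is up, list running containers and, if needed, tail the logs (CONTAINER_ID is a placeholder):

docker ps
docker logs CONTAINER_ID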

Access to Postgres via terminal

To access the ny_taxi database:

pgcli -h localhost -p 5432 -u root -d ny_taxi
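For a quick check inside the pgcli prompt, the standard \dt command lists tables, and plain SQL works as usual:

\dt
SELECT 1;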

Access to Postgres via Jupyter

Here is the data: https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/yellow

To get this data:

wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
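The notebook below reads the uncompressed CSV, so unpack the archive first (adjust the path to wherever you saved it):

gunzip yellow_tripdata_2021-01.csv.gz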

To create a table in Postgres with the same schema as the DataFrame (via Jupyter):

import pandas as pd
from sqlalchemy import create_engine
from time import time

# 2. Get data and DDL for Postgres 
# I load only 1000 rows to keep it fast
df = pd.read_csv('./../yellow_tripdata_2021-01.csv', nrows = 1000)
df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
df.head()

df.dtypes

# Create DDL. I want to create a table in Postgres with the same schema as df
print(pd.io.sql.get_schema(df, name='yellow_tripdata_2021-01'))

# Create an engine to connect
user = "root"
password = "root"
port = 5432
db = "ny_taxi"
engine = create_engine(f"postgresql://{user}:{password}@localhost:{port}/{db}")
table_name = 'yellow_tripdata_2021-01'

# Create the DDL again, now passing the engine so the types use the Postgres dialect
print(pd.io.sql.get_schema(df, name=table_name, con=engine))


# 3. Read the CSV and write the df to the DB in chunks
df_iter = pd.read_csv('./../yellow_tripdata_2021-01.csv', iterator=True, chunksize=100_000)
df_iter


%%time
# Process the first chunk separately to create/replace the table structure
t_start = time()
try:
    df = next(df_iter)
    df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
    df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
    df.head(0).to_sql(name=table_name, con=engine, if_exists='replace')
    df.to_sql(name=table_name, con=engine, if_exists='append')
    t_end = time()
    print(f"Inserted first chunk..., took {t_end-t_start:.2f} seconds")
except StopIteration:
    print('No data to process.')


%%time
while True:
    t_start = time()
    try:
        df = next(df_iter)
        df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
        df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
        df.to_sql(name=table_name, con=engine, if_exists='append')
        t_end = time()
        print(f"Inserted chunk..., took {t_end-t_start:.2f} seconds")
    except StopIteration:
        print('Finished processing all chunks.')
        break

# Check data
query = f"""
SELECT COUNT(1) FROM "{table_name}"
"""

pd.read_sql(query, con=engine)

If you get the error: No module named 'psycopg2', run pip install psycopg2-binary.

To convert a Jupyter notebook to a script:

jupyter nbconvert --to=script %NOTEBOOK_TO_CONVERT%
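For example, with a notebook named upload_data.ipynb (a hypothetical name), this produces upload_data.py:

jupyter nbconvert --to=script upload_data.ipynb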

pgAdmin

It's a tool to access the database via a web browser:

docker run -it \
    -e PGADMIN_DEFAULT_EMAIL="[email protected]" \
    -e PGADMIN_DEFAULT_PASSWORD="root" \
    -p 8080:80 \
    dpage/pgadmin4:latest

To use it, open in a browser:

localhost:8080

Docker Network

Now we have to connect the pgAdmin container with the Postgres container. We run both containers and connect them via a Docker network:

docker network create pg-network

Run 2 containers:

docker run -it \
    -e POSTGRES_USER="root" \
    -e POSTGRES_PASSWORD="root" \
    -e POSTGRES_DB="ny_taxi" \
    -v $(pwd)/postgresql/ny_taxi_data:/var/lib/postgresql/data \
    -p 5432:5432 \
    --network=pg-network \
    --name pg-database \
    postgres:16.2

The --name parameter lets you specify a name for the container.

docker run -it \
    -e PGADMIN_DEFAULT_EMAIL="[email protected]" \
    -e PGADMIN_DEFAULT_PASSWORD="root" \
    -p 8080:80 \
    --network=pg-network \
    --name pgadmin \
    dpage/pgadmin4:latest

Now, in pgAdmin, we can register a server and connect to the database using pg-database as the host name.

Upgrade pipeline

Here is ingest_data.py:

# # 0. Description

# https://github.com/DataTalksClub/nyc-tlc-data/releases/tag/yellow
# 
# wget https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz
# 

# Yellow Trips: 
# * https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_yellow.pdf
#   
# Green Trips:
# * https://www1.nyc.gov/assets/tlc/downloads/pdf/data_dictionary_trip_records_green.pdf

import os
import pandas as pd
import argparse 
from sqlalchemy import create_engine
from time import time

def main(params):
    user = params.user
    password = params.password
    host = params.host
    port = params.port
    db = params.db
    table_name = params.table_name
    url = params.url
    csv_name = 'output.csv'

    # download file and write in csv_name
    os.system(f"wget {url} -O {csv_name}")

    engine = create_engine(f"postgresql://{user}:{password}@{host}:{port}/{db}")
    df_iter = pd.read_csv(f'{csv_name}', compression='gzip', low_memory=False, iterator=True, chunksize=100_000)

    # Process the first chunk separately to create/replace the table structure
    t_start = time()
    try:
        df = next(df_iter)
        df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
        df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
        df.head(0).to_sql(name=table_name, con=engine, if_exists='replace')
        df.to_sql(name=table_name, con=engine, if_exists='append')
        t_end = time()
        print(f"Inserted first chunk..., took {t_end-t_start:.2f} seconds")
    except StopIteration:
        print('No data to process.')

    while True:
        t_start = time()
        try:
            df = next(df_iter)
            df['tpep_pickup_datetime'] = pd.to_datetime(df['tpep_pickup_datetime'])
            df['tpep_dropoff_datetime'] = pd.to_datetime(df['tpep_dropoff_datetime'])
            df.to_sql(name=table_name, con=engine, if_exists='append')
            t_end = time()
            print(f"Inserted chunk..., took {t_end-t_start:.2f} seconds")
        except StopIteration:
            print('Finished processing all chunks.')
            break

if __name__ == '__main__':
    parser = argparse.ArgumentParser(description='Ingest CSV data to Postgres')
    parser.add_argument('--user', help='username for Postgres')
    parser.add_argument('--password', help='password for Postgres')
    parser.add_argument('--host', help='host for Postgres')
    parser.add_argument('--port', help='port for Postgres')
    parser.add_argument('--db', help='database for Postgres')
    parser.add_argument('--table_name', help='name of the table where results will be written')
    parser.add_argument('--url', help='url to CSV with data')
    args = parser.parse_args()

    main(args)

To run via python:

python3 ingest_data.py \
    --user=root \
    --password=root \
    --host=localhost \
    --port=5432 \
    --db=ny_taxi \
    --table_name=yellow_tripdata \
    --url="https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"
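To verify the ingestion, a quick check via pgcli (connect, then count the rows):

pgcli -h localhost -p 5432 -u root -d ny_taxi
SELECT COUNT(1) FROM yellow_tripdata;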

Dockerfile:

FROM python:3.10.12

RUN apt-get update && apt-get install -y wget
RUN pip install pandas sqlalchemy psycopg2-binary

# Work path where our files will be copied 
WORKDIR /app
# Copy file from current location to container /app path
COPY ingest_data.py ingest_data.py

ENTRYPOINT ["python", "ingest_data.py"]

To build it:

docker build -t taxi_ingest:v001 .

Run 2 containers with Postgres and pgAdmin:

docker start pg-database
docker start pgadmin

To run via Docker:

docker run -it \
    --network=pg-network \
    taxi_ingest:v001 \
    --user=root \
    --password=root \
    --host=pg-database \
    --port=5432 \
    --db=ny_taxi \
    --table_name=yellow_tripdata \
    --url="https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"

Be careful with the host parameter: it has to be the name of the Docker Postgres container!

Docker compose

Instead of running several Docker containers by hand, we can use a Docker Compose file:

services:
  pgdatabase:
    image: postgres:16.2
    environment:
      - POSTGRES_USER=root
      - POSTGRES_PASSWORD=root
      - POSTGRES_DB=ny_taxi
    volumes:
      - "./../../postgresql/ny_taxi_data:/var/lib/postgresql/data:rw"
    ports:
      - "5432:5432"
  pgadmin:
    image: dpage/pgadmin4:latest
    environment:
      - [email protected]
      - PGADMIN_DEFAULT_PASSWORD=root
    ports:
      - "8080:80"

To install Docker Compose: click.

We don't have to think about networking. Docker Compose creates it for us.

docker compose up -d

The -d (detached) parameter means the terminal stays available after the run. To stop, use docker compose down (or Ctrl+C if running without -d!).
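To check what Compose started and follow a service's logs (standard Compose subcommands):

docker compose ps
docker compose logs -f pgdatabase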

There can also be some problems:

  • cannot stop container ... permission denied: to solve it, run sudo systemctl restart docker.socket docker.service
  • PORT_NUMBER bind: address already in use: to solve it, run sudo kill -9 $(sudo lsof -t -i:PORT_NUMBER)

After that, we can use the previous code to ingest the data:

docker build -t taxi_ingest:v001 .

To find the Docker Compose network:

docker network ls

In my case it was third_pipeline_default. And the host will be the service name from the Docker Compose file:

docker run -it \
    --network=third_pipeline_default \
    taxi_ingest:v001 \
    --user=root \
    --password=root \
    --host=pgdatabase \
    --port=5432 \
    --db=ny_taxi \
    --table_name=yellow_tripdata \
    --url="https://github.com/DataTalksClub/nyc-tlc-data/releases/download/yellow/yellow_tripdata_2021-01.csv.gz"

And a note about ports: I can use other ports. For example, let's use "5431:5432" for pgdatabase. To access this database from the host, I have to use port 5431: pgcli -h localhost -p 5431 -u root -d ny_taxi

But inside the Docker Compose network I still use port 5432.
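A sketch of the changed pgdatabase service; only the ports mapping differs from the Compose file above:

  pgdatabase:
    image: postgres:16.2
    ports:
      - "5431:5432"  # host port 5431 -> container port 5432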

And there is a good video about ports: click

Terraform

Infrastructure as code: video 1, video 2, video 3

Install.

Connect to Google Cloud Platform (GCP) via terminal

A good video

  1. Create an ssh key (see the keygen sketch below): link
  2. Add the ssh key to the VM compute engine metadata
  3. To connect: ssh -i ~/.ssh/NAME_OF_FILE_WITH_SSH_KEY YOUR_LOGIN@IP_OF_VM
  4. Also you can create a config file in the ~/.ssh directory:
Host de-zoomcamp
    HostName IP_OF_VM
    User YOUR_LOGIN
    IdentityFile ~/.ssh/gcp

In my case NAME_OF_FILE_WITH_SSH_KEY is gcp. Now you can connect with ssh de-zoomcamp.
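For step 1, the key can be generated with standard ssh-keygen (YOUR_LOGIN is a placeholder):

ssh-keygen -t rsa -f ~/.ssh/gcp -C YOUR_LOGIN -b 2048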

And it's easy to send files from localhost to GCP:

scp hello.txt de-zoomcamp:~/tmp

Or from GCP to localhost:

scp de-zoomcamp:~/tmp/hello.txt ~/tmp
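To copy a whole directory, the standard -r flag works:

scp -r de-zoomcamp:~/tmp ~/tmp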