Commit

Merge pull request #12 from wednesday-solutions/feat/update-documentation

Feat: update documentation
idipanshu authored Feb 13, 2024
2 parents 92d77a8 + ef83459 commit 484c21d
Showing 7 changed files with 167 additions and 43 deletions.
26 changes: 26 additions & 0 deletions Makefile
@@ -0,0 +1,26 @@
.PHONY: build

setup-glue-local:
	chmod +x automation/glue_setup.sh
	. automation/glue_setup.sh $(SOURCE_FILE_PATH)

glue-demo-env:
	cp app/.custom_env .env

install:
	pip3 install -r requirements.txt

type-check:
	mypy ./ --ignore-missing-imports

lint:
	pylint app tests jobs setup.py

test:
	export KAGGLE_KEY=MOCKKEY
	export KAGGLE_USERNAME=MOCKUSERNAME
	coverage run --source=app -m unittest discover -s tests

coverage-report:
	coverage report
	coverage html
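As a sketch of how these targets chain together for a local setup (the `~/.zshrc` path is only an example; use whichever shell profile your shell actually sources):

```bash
# one-time local Glue setup, then pick up the exported variables
make setup-glue-local SOURCE_FILE_PATH=~/.zshrc
source ~/.zshrc

# create the local .env from app/.custom_env and install Python dependencies
make glue-demo-env
make install

# optional code-quality checks
make type-check
make lint
```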
112 changes: 77 additions & 35 deletions README.md
@@ -1,73 +1,115 @@
# Multi-cloud ETL Pipeline

## Objective

- To run the same ETL code in multiple cloud services based on your preference, thus saving time.
- To develop ETL scripts for different environments and clouds.

## Note

- This repository currently supports Azure Databricks + AWS Glue.
- Azure Databricks can't be configured locally; we can only connect a local IDE to a running cluster in Databricks. It works by pushing code to a GitHub repository and then adding a workflow in Databricks that points to the repo & file URL.
- For AWS Glue, we will set up a local environment using the Glue Docker image, then deploy it to AWS Glue using GitHub Actions.
- The "tasks.txt" file contains the details of the transformations done in the main file.

## Requirements for Azure Databricks (for local connect only)

- [Unity Catalog](https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/enable-workspaces) enabled workspace.
- [Databricks Connect](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect/python/install) configured on your local machine, with a running cluster.

## Pre-requisites for AWS Glue (local setup)

1. [Python 3.7 with PIP](https://www.python.org/downloads/)
2. [AWS CLI configured locally](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html)
3. [Install Java 8](https://www.oracle.com/in/java/technologies/downloads/#java8-mac).
```bash
# Make sure to export JAVA_HOME, for example:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_261.jdk/Contents/Home
```
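Before continuing, you can sanity-check these prerequisites with standard CLI commands such as (a sketch; exact version output will vary per machine):

```bash
python3 --version              # expect a Python 3.7.x interpreter
pip3 --version
aws sts get-caller-identity    # confirms the AWS CLI has working credentials
java -version                  # expect 1.8.x
echo $JAVA_HOME                # should print the path you exported above
```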


## Quick Start

1. Clone this repo _(for Windows use WSL)_.

2. To set up the required libraries and packages locally, run:
```bash
# If default SHELL is zsh use
make setup-glue-local SOURCE_FILE_PATH=~/.zshrc
# If default SHELL is bash use
make setup-glue-local SOURCE_FILE_PATH=~/.bashrc
```

3. Source your SHELL profile using:

```bash
# For zsh
source ~/.zshrc
# For bash
source ~/.bashrc
```

4. Install the dependencies:

```bash
make install
```
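After these Quick Start steps, the variables exported by ```automation/glue_setup.sh``` should be available in your shell; a quick check (the printed paths will differ per machine):

```bash
echo $AWS_GLUE_HOME
echo $SPARK_HOME
which gluesparksubmit   # should resolve inside aws-glue-libs/bin
```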

## Change Your Paths

1. Enter your S3, ADLS & Kaggle (optional) paths in the ```app/.custom_env``` file. This file will be used by Databricks.

2. Similarly, we'll make a ```.env``` file in the root folder. This file will be used by the local Glue job. To create it, run:

```bash
make glue-demo-env
```

This command copies your paths from ```app/.custom_env``` into the ```.env``` file.

3. _(Optional)_ If you want to extract data from Kaggle, enter KAGGLE_KEY & KAGGLE_USERNAME in the ```.env``` file only. Note: don't enter any sensitive keys in the ```app/.custom_env``` file.
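For reference, a minimal root ```.env``` might look like the sketch below, mirroring the keys in ```app/.custom_env``` shown further down in this diff (bucket names and Kaggle values are placeholders, not real credentials):

```bash
GLUE_READ_PATH="s3://your-bucket/rawdata/"
GLUE_WRITE_PATH="s3://your-bucket/transformed/"
DATABRICKS_READ_PATH="/mnt/rawdata/"
DATABRICKS_WRITE_PATH="/mnt/transformed/"
KAGGLE_PATH="mastmustu/insurance-claims-fraud-data"

# Kaggle credentials go in .env only, never in app/.custom_env
KAGGLE_USERNAME="your-kaggle-username"
KAGGLE_KEY="your-kaggle-key"
```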

## Setup Check
Finally, check if everything is working correctly by running:
```bash
gluesparksubmit jobs/demo.py
```
Ensure "Execution Complete" is printed.

## Make New Jobs

Write your jobs in the ```jobs``` folder. Refer to the ```demo.py``` file; one example is the ```jobs/main.py``` file.
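Once you add a new script (the file name below is hypothetical), you can run it locally the same way as the demo job:

```bash
# hypothetical new job following the structure of jobs/demo.py
gluesparksubmit jobs/my_new_job.py
```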

## Deployment

1. Set up a GitHub Action for AWS Glue. Make sure to pass the following secrets in your repository (see the GitHub CLI sketch after this list):

```
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
S3_BUCKET_NAME
S3_SCRIPTS_PATH
AWS_REGION
AWS_GLUE_ROLE
```

For the rest of the key-value pairs that you entered in the ```.env``` file, make sure to pass them using the ```automation/deploy_glue_jobs.sh``` file.

2. For Azure Databricks, make a workflow with the link to your repo & main file. Pass the following parameters with their correct values:

```
kaggle_username
kaggle_token
storage_account_name
datalake_access_key
```
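If you manage repository secrets with the GitHub CLI, the secrets listed in step 1 can be set with something like the following (all values are placeholders):

```bash
gh secret set AWS_ACCESS_KEY_ID --body "AKIA..."
gh secret set AWS_SECRET_ACCESS_KEY --body "..."
gh secret set S3_BUCKET_NAME --body "your-bucket"
gh secret set S3_SCRIPTS_PATH --body "scripts/"
gh secret set AWS_REGION --body "us-east-1"
gh secret set AWS_GLUE_ROLE --body "your-glue-service-role"
```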

## Documentation

[Multi-cloud Pipeline Documentation](https://docs.google.com/document/d/1npCpT_FIpw7ZuxAzQrEH3IsPKCDt7behmF-6VjrSFoQ/edit?usp=sharing)

## References

[Glue Programming libraries](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html)

Note that the AWS Glue libraries are not available to download, so use the AWS Glue 4 Docker container.

## Run Tests & Coverage Report

To run the tests & generate a coverage report, run the following commands in the root folder of the project:

```bash
make test

# To see the coverage report
make coverage-report
```
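These targets wrap the plain coverage commands from the Makefile added in this commit, so you can also run them directly:

```bash
# what `make test` runs (with mock Kaggle credentials exported)
export KAGGLE_KEY=MOCKKEY
export KAGGLE_USERNAME=MOCKUSERNAME
coverage run --source=app -m unittest discover -s tests

# what `make coverage-report` runs
coverage report
coverage html
```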
5 changes: 4 additions & 1 deletion app/.custom_env
@@ -1,4 +1,5 @@
# this is my custom file for read & write path based on environment
# this is env file for paths, read only on databricks
# for local glue, make a similar one in root named as ".env"

GLUE_READ_PATH="s3://glue-bucket-vighnesh/rawdata/"
GLUE_WRITE_PATH="s3://glue-bucket-vighnesh/transformed/"
@@ -7,3 +8,5 @@ DATABRICKS_READ_PATH="/mnt/rawdata/"
DATABRICKS_WRITE_PATH="/mnt/transformed/"

KAGGLE_PATH="mastmustu/insurance-claims-fraud-data"

# Give KAGGLE_KEY & KAGGLE_USERNAME Below
51 changes: 51 additions & 0 deletions automation/glue_setup.sh
@@ -0,0 +1,51 @@
# Parameter 1 --> Shell profile path
SOURCE_FILE=$1
echo $SOURCE_FILE

echo -e "FIRST RUN TIME ESTIMATION: 30-45 MINS\nPlease do NOT exit"

export PROJECT_ROOT=$(pwd)

# Doing all the work in separate folder "glue-libs"
cd ~
mkdir glue-libs
cd glue-libs

# Clone AWS Glue Python Lib
git clone https://github.com/awslabs/aws-glue-libs.git
export AWS_GLUE_HOME=$(pwd)/aws-glue-libs

# Install Apache Maven
curl https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz -o apache-maven-3.6.0-bin.tar.gz
tar -xvf apache-maven-3.6.0-bin.tar.gz
ln -s apache-maven-3.6.0 maven
export MAVEN_HOME=$(pwd)/maven

# Install Apache Spark
curl https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz -o spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
tar -xvf spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
ln -s spark-3.1.1-amzn-0-bin-3.2.1-amzn-3 spark
export SPARK_HOME=$(pwd)/spark

# Export Path
export PATH=$PATH:$SPARK_HOME/bin:$MAVEN_HOME/bin:$AWS_GLUE_HOME/bin
export PYTHONPATH=$PROJECT_ROOT

# Download Glue ETL .jar files
cd $AWS_GLUE_HOME
chmod +x bin/glue-setup.sh
./bin/glue-setup.sh
mvn install dependency:copy-dependencies
cp $AWS_GLUE_HOME/jarsv1/AWSGlue*.jar $SPARK_HOME/jars/
cp $AWS_GLUE_HOME/jarsv1/aws*.jar $SPARK_HOME/jars/

echo "export AWS_GLUE_HOME=$AWS_GLUE_HOME
export MAVEN_HOME=$MAVEN_HOME
export SPARK_HOME=$SPARK_HOME
export PATH=$PATH:$SPARK_HOME/bin:$MAVEN_HOME/bin:$AWS_GLUE_HOME/bin
export PYTHONPATH=$PROJECT_ROOT" >> $SOURCE_FILE


cd $PROJECT_ROOT

echo -e "\nGLUE LOCAL SETUP COMPLETE"
3 changes: 2 additions & 1 deletion jobs/demo.py
@@ -2,7 +2,8 @@
from dotenv import load_dotenv
import app.environment as env

load_dotenv("../app/.custom-env")
load_dotenv("../app/.custom_env") # Loading env for databricks
load_dotenv() # Loading env for glue

# COMMAND ----------

3 changes: 2 additions & 1 deletion jobs/main.py
@@ -9,7 +9,8 @@
import app.environment as env
import app.spark_wrapper as sw

load_dotenv("../app/.custom_env")
load_dotenv("../app/.custom_env") # Loading env for databricks
load_dotenv() # Loading env for glue

# COMMAND ----------

10 changes: 5 additions & 5 deletions requirements.txt
@@ -1,6 +1,6 @@
mypy~=1.7.1
pylint~=3.0.2
coverage~=7.3.2
python-dotenv~=1.0.0
mypy
pylint
coverage
python-dotenv
kaggle~=1.5.16
pre-commit~=3.6.0
pre-commit
