Commit

Merge pull request #12 from wednesday-solutions/feat/update-documentation

Feat: update documentation
idipanshu authored Feb 13, 2024
2 parents 92d77a8 + ef83459 commit 484c21d
Showing 7 changed files with 167 additions and 43 deletions.
26 changes: 26 additions & 0 deletions Makefile
@@ -0,0 +1,26 @@
.PHONY: build

setup-glue-local:
	chmod +x automation/glue_setup.sh
	. automation/glue_setup.sh $(SOURCE_FILE_PATH)

glue-demo-env:
	cp app/.custom_env .env

install:
	pip3 install -r requirements.txt

type-check:
	mypy ./ --ignore-missing-imports

lint:
	pylint app tests jobs setup.py

test:
	export KAGGLE_KEY=MOCKKEY
	export KAGGLE_USERNAME=MOCKUSERNAME
	coverage run --source=app -m unittest discover -s tests

coverage-report:
	coverage report
	coverage html
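As a sketch of how these targets chain together for a local setup (the `~/.zshrc` path is only an example; use whichever shell profile your shell actually sources):

```bash
# one-time local Glue setup, then pick up the exported variables
make setup-glue-local SOURCE_FILE_PATH=~/.zshrc
source ~/.zshrc

# create the local .env from app/.custom_env and install Python dependencies
make glue-demo-env
make install

# optional code-quality checks
make type-check
make lint
```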
112 changes: 77 additions & 35 deletions README.md
@@ -1,73 +1,115 @@
# Multi-cloud ETL Pipeline

## Objective

- To run the same ETL code in multiple cloud services based on your preference, thus saving time.
- To develop ETL scripts for different environments and clouds.

## Note

- This repository currently supports Azure Databricks + AWS Glue.
- Azure Databricks can't be configured locally; we can only connect a local IDE to a running cluster in Databricks. It works by pushing code to a GitHub repository and then adding a workflow in Databricks that points to the repo & file URL.
- For AWS Glue, we will set up a local environment using the Glue Docker image, then deploy it to AWS Glue using GitHub Actions.
- The "tasks.txt" file contains the details of the transformations done in the main file.

## Requirements for Azure Databricks (for local connect only)

- [Unity Catalog](https://learn.microsoft.com/en-us/azure/databricks/data-governance/unity-catalog/enable-workspaces) enabled workspace.
- [Databricks Connect](https://learn.microsoft.com/en-us/azure/databricks/dev-tools/databricks-connect/python/install) configured on your local machine, with a running cluster.

## Pre-requisites for AWS Glue (local setup)

1. [Python 3.7 with PIP](https://www.python.org/downloads/)
2. [AWS CLI configured locally](https://docs.aws.amazon.com/cli/latest/userguide/cli-configure-quickstart.html)
3. [Install Java 8](https://www.oracle.com/in/java/technologies/downloads/#java8-mac).
```bash
# Make sure to export JAVA_HOME, for example:
export JAVA_HOME=/Library/Java/JavaVirtualMachines/jdk1.8.0_261.jdk/Contents/Home
```
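Before continuing, you can sanity-check these prerequisites with standard CLI commands such as (a sketch; exact version output will vary per machine):

```bash
python3 --version              # expect a Python 3.7.x interpreter
pip3 --version
aws sts get-caller-identity    # confirms the AWS CLI has working credentials
java -version                  # expect 1.8.x
echo $JAVA_HOME                # should print the path you exported above
```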


## Quick Start

1. Clone this repo _(for Windows use WSL)_.

2. To set up the required libraries and packages locally, run:
```bash
# If default SHELL is zsh use
make setup-glue-local SOURCE_FILE_PATH=~/.zshrc
# If default SHELL is bash use
make setup-glue-local SOURCE_FILE_PATH=~/.bashrc
```

3. Source your SHELL profile using:

```bash
# For zsh
source ~/.zshrc
# For bash
source ~/.bashrc
```

4. Install the dependencies:

```bash
make install
```
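After these Quick Start steps, the variables exported by ```automation/glue_setup.sh``` should be available in your shell; a quick check (the printed paths will differ per machine):

```bash
echo $AWS_GLUE_HOME
echo $SPARK_HOME
which gluesparksubmit   # should resolve inside aws-glue-libs/bin
```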

## Change Your Paths

1. Enter your S3, ADLS & Kaggle (optional) paths in the ```app/.custom_env``` file. This file will be used by Databricks.

2. Similarly, we'll make a ```.env``` file in the root folder. This file will be used by the local Glue job. To create it, run:

```bash
make glue-demo-env
```

This command copies your paths from ```app/.custom_env``` into the ```.env``` file.

3. _(Optional)_ If you want to extract data from Kaggle, enter KAGGLE_KEY & KAGGLE_USERNAME in the ```.env``` file only. Note: don't enter any sensitive keys in the ```app/.custom_env``` file.
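For reference, a minimal root ```.env``` might look like the sketch below, mirroring the keys in ```app/.custom_env``` shown further down in this diff (bucket names and Kaggle values are placeholders, not real credentials):

```bash
GLUE_READ_PATH="s3://your-bucket/rawdata/"
GLUE_WRITE_PATH="s3://your-bucket/transformed/"
DATABRICKS_READ_PATH="/mnt/rawdata/"
DATABRICKS_WRITE_PATH="/mnt/transformed/"
KAGGLE_PATH="mastmustu/insurance-claims-fraud-data"

# Kaggle credentials go in .env only, never in app/.custom_env
KAGGLE_USERNAME="your-kaggle-username"
KAGGLE_KEY="your-kaggle-key"
```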

## Setup Check
Finally, check if everything is working correctly by running:
```bash
gluesparksubmit jobs/demo.py
```
Ensure "Execution Complete" is printed.

## Make New Jobs

Write your jobs in the ```jobs``` folder. Refer to the ```demo.py``` file; one example is the ```jobs/main.py``` file.
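Once you add a new script (the file name below is hypothetical), you can run it locally the same way as the demo job:

```bash
# hypothetical new job following the structure of jobs/demo.py
gluesparksubmit jobs/my_new_job.py
```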

## Deployment

1. Set up a GitHub Action for AWS Glue. Make sure to pass the following secrets in your repository (see the GitHub CLI sketch after this list):

```
AWS_ACCESS_KEY_ID
AWS_SECRET_ACCESS_KEY
S3_BUCKET_NAME
S3_SCRIPTS_PATH
AWS_REGION
AWS_GLUE_ROLE
```

For the rest of the key-value pairs that you entered in the ```.env``` file, make sure to pass them using the ```automation/deploy_glue_jobs.sh``` file.

2. For Azure Databricks, make a workflow with the link to your repo & main file. Pass the following parameters with their correct values:

```
kaggle_username
kaggle_token
storage_account_name
datalake_access_key
```
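If you manage repository secrets with the GitHub CLI, the secrets listed in step 1 can be set with something like the following (all values are placeholders):

```bash
gh secret set AWS_ACCESS_KEY_ID --body "AKIA..."
gh secret set AWS_SECRET_ACCESS_KEY --body "..."
gh secret set S3_BUCKET_NAME --body "your-bucket"
gh secret set S3_SCRIPTS_PATH --body "scripts/"
gh secret set AWS_REGION --body "us-east-1"
gh secret set AWS_GLUE_ROLE --body "your-glue-service-role"
```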

## Documentation

[Multi-cloud Pipeline Documentation](https://docs.google.com/document/d/1npCpT_FIpw7ZuxAzQrEH3IsPKCDt7behmF-6VjrSFoQ/edit?usp=sharing)

## References

[Glue Programming libraries](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html)

Note that the AWS Glue libraries are not available to download, so use the AWS Glue 4 Docker container.

## Run Tests & Coverage Report

To run the tests & generate a coverage report, run the following commands in the root folder of the project:

```bash
make test

# To see the coverage report
make coverage-report
```
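These targets wrap the plain coverage commands from the Makefile added in this commit, so you can also run them directly:

```bash
# what `make test` runs (with mock Kaggle credentials exported)
export KAGGLE_KEY=MOCKKEY
export KAGGLE_USERNAME=MOCKUSERNAME
coverage run --source=app -m unittest discover -s tests

# what `make coverage-report` runs
coverage report
coverage html
```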
5 changes: 4 additions & 1 deletion app/.custom_env
@@ -1,4 +1,5 @@
# this is my custom file for read & write path based on environment
# this is env file for paths, read only on databricks
# for local glue, make a similar one in root named as ".env"

GLUE_READ_PATH="s3://glue-bucket-vighnesh/rawdata/"
GLUE_WRITE_PATH="s3://glue-bucket-vighnesh/transformed/"
@@ -7,3 +8,5 @@ DATABRICKS_READ_PATH="/mnt/rawdata/"
DATABRICKS_WRITE_PATH="/mnt/transformed/"

KAGGLE_PATH="mastmustu/insurance-claims-fraud-data"

# Give KAGGLE_KEY & KAGGLE_USERNAME Below
51 changes: 51 additions & 0 deletions automation/glue_setup.sh
@@ -0,0 +1,51 @@
# Parameter 1 --> Shell profile path
SOURCE_FILE=$1
echo $SOURCE_FILE

echo -e "FIRST RUN TIME ESTIMATION: 30-45 MINS\nPlease do NOT exit"

export PROJECT_ROOT=$(pwd)

# Doing all the work in separate folder "glue-libs"
cd ~
mkdir glue-libs
cd glue-libs

# Clone AWS Glue Python Lib
git clone https://github.com/awslabs/aws-glue-libs.git
export AWS_GLUE_HOME=$(pwd)/aws-glue-libs

# Install Apache Maven
curl https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz -o apache-maven-3.6.0-bin.tar.gz
tar -xvf apache-maven-3.6.0-bin.tar.gz
ln -s apache-maven-3.6.0 maven
export MAVEN_HOME=$(pwd)/maven

# Install Apache Spark
curl https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-3.0/spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz -o spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
tar -xvf spark-3.1.1-amzn-0-bin-3.2.1-amzn-3.tgz
ln -s spark-3.1.1-amzn-0-bin-3.2.1-amzn-3 spark
export SPARK_HOME=$(pwd)/spark

# Export Path
export PATH=$PATH:$SPARK_HOME/bin:$MAVEN_HOME/bin:$AWS_GLUE_HOME/bin
export PYTHONPATH=$PROJECT_ROOT

# Download Glue ETL .jar files
cd $AWS_GLUE_HOME
chmod +x bin/glue-setup.sh
./bin/glue-setup.sh
mvn install dependency:copy-dependencies
cp $AWS_GLUE_HOME/jarsv1/AWSGlue*.jar $SPARK_HOME/jars/
cp $AWS_GLUE_HOME/jarsv1/aws*.jar $SPARK_HOME/jars/

echo "export AWS_GLUE_HOME=$AWS_GLUE_HOME
export MAVEN_HOME=$MAVEN_HOME
export SPARK_HOME=$SPARK_HOME
export PATH=$PATH:$SPARK_HOME/bin:$MAVEN_HOME/bin:$AWS_GLUE_HOME/bin
export PYTHONPATH=$PROJECT_ROOT" >> $SOURCE_FILE


cd $PROJECT_ROOT

echo -e "\nGLUE LOCAL SETUP COMPLETE"
3 changes: 2 additions & 1 deletion jobs/demo.py
@@ -2,7 +2,8 @@
from dotenv import load_dotenv
import app.environment as env

load_dotenv("../app/.custom-env")
load_dotenv("../app/.custom_env") # Loading env for databricks
load_dotenv() # Loading env for glue

# COMMAND ----------

3 changes: 2 additions & 1 deletion jobs/main.py
@@ -9,7 +9,8 @@
import app.environment as env
import app.spark_wrapper as sw

load_dotenv("../app/.custom_env")
load_dotenv("../app/.custom_env") # Loading env for databricks
load_dotenv() # Loading env for glue

# COMMAND ----------

10 changes: 5 additions & 5 deletions requirements.txt
@@ -1,6 +1,6 @@
mypy~=1.7.1
pylint~=3.0.2
coverage~=7.3.2
python-dotenv~=1.0.0
mypy
pylint
coverage
python-dotenv
kaggle~=1.5.16
pre-commit~=3.6.0
pre-commit
