How to properly develop and deploy an AWS Glue job using AWS Glue interactive sessions and AWS CDK

PS: A SAM version is available on the sam branch.

This repo aims to demonstrate how to develop an AWS Glue job efficiently:

  • Be able to develop locally
  • Get a fast feedback loop
  • Be able to commit with no manual copy paste between tools

In addition, this repo shows how to deploy this AWS Glue job through a proper CI/CD pipeline, leveraging infrastructure as code.

Two options are proposed here: "Use this repo" or "Do it yourself".

Use this repo

Prerequisites

  1. Clone this repo
    git clone https://github.com/flochaz/aws-glue-job-e2e-dev-life-cycle.git
    cd aws-glue-job-e2e-dev-life-cycle
  2. Set up a virtual env
    python3 -m venv .venv
    source .venv/bin/activate
  3. Install CDK
    npm install -g aws-cdk
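
Before going further, it can help to confirm that the tools these steps rely on are on your PATH. The following is a minimal sketch, not part of this repo: the helper name `missing_tools` and the tool list are assumptions made for illustration.

```python
import shutil

# Tools invoked by the prerequisite steps above (assumed list; adjust to your setup).
REQUIRED_TOOLS = ["git", "python3", "npm", "cdk"]


def missing_tools(tools):
    """Return the subset of `tools` that cannot be found on PATH."""
    return [tool for tool in tools if shutil.which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools(REQUIRED_TOOLS)
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All prerequisite tools found.")
```

If `cdk` is reported missing right after `npm install -g aws-cdk`, the npm global bin directory is probably not on your PATH.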

Deploy dev env

To run the Glue job locally, we need a few supporting resources, such as:

  • an IAM role to assume while running the local notebook
  • a Glue database to store the data
  • a Glue crawler to extract the schema and data from the raw source CSV files
  • a mechanism to trigger the crawler ...

This CDK app deploys all of these for you, so that you are ready to work on the Glue job itself.

  1. Install deps
    pip install -r requirements.txt
  2. Bootstrap account
    cdk bootstrap
  3. Deploy Glue role, crawler etc.
    cdk deploy infrastructure

Local dev experience

The AWS Glue service offers a way to run your job remotely while developing locally, through the Interactive Sessions feature.

  1. Set up interactive sessions:
    pip install -r requirements-dev.txt
    SITE_PACKAGES=$(pip show aws-glue-sessions | grep Location | awk '{print $2}')
    jupyter kernelspec install "$SITE_PACKAGES/aws_glue_interactive_sessions_kernel/glue_pyspark" # Add "--user" if you get "[Errno 13] Permission denied: '/usr/local/share/jupyter'"
    jupyter kernelspec install "$SITE_PACKAGES/aws_glue_interactive_sessions_kernel/glue_spark" # Add "--user" if you get "[Errno 13] Permission denied: '/usr/local/share/jupyter'"
  2. Set up the Glue role by copying the output named awsConfigUPDATE from the previous cdk deploy command into ~/.aws/config under [default]
    cat ~/.aws/config
    [default]
    glue_role_arn=xxxxxx
  3. Launch the notebook
    jupyter notebook # Add "--ip 0.0.0.0" if running in a remote IDE such as Cloud9 (PS: you will also need to open your security group for TCP connections on port 8888)
  4. Play with glue_job_source/data_cleaning_and_lambda.ipynb
  5. Commit your changes to git
  6. Optionally deploy your changes to the dev env
    cdk deploy infrastructure
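
If you would rather not edit ~/.aws/config by hand, the glue_role_arn entry can be written programmatically. The sketch below uses only the Python standard library. The function name `set_glue_role_arn` is hypothetical, and the role ARN must be the value printed by your own cdk deploy. Note that `configparser` drops comments when it rewrites a file, so back up your config first.

```python
import configparser
from pathlib import Path


def set_glue_role_arn(role_arn, config_path=Path.home() / ".aws" / "config", profile="default"):
    """Set glue_role_arn under [profile] in an AWS config file, keeping other keys."""
    config_path = Path(config_path)
    # RawConfigParser avoids '%' interpolation, which AWS config values do not use.
    parser = configparser.RawConfigParser()
    parser.read(config_path)  # a missing file is silently skipped
    if not parser.has_section(profile):
        parser.add_section(profile)
    parser.set(profile, "glue_role_arn", role_arn)
    config_path.parent.mkdir(parents=True, exist_ok=True)
    with config_path.open("w") as handle:
        parser.write(handle)


# Example call (placeholder value, replace with the ARN from your deploy output):
# set_glue_role_arn("xxxxxx")
```

The file is written back in `key = value` form, which the AWS config format accepts.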

Deploy through pipeline

If you are deploying to the same account and region, first destroy your dev stack to avoid resource collisions (especially the Glue role, crawler, database, etc.):

    cdk destroy infrastructure
  1. Create a repo by deploying the pipeline stack
    cdk deploy GlueJobPipelineStack
  2. Push code to repo
    # Remove the GitHub origin
    git remote remove origin
    # Add the CodeCommit repo as origin
    git remote add origin <YOUR CODECOMMIT REPO URL (THE COMMAND SHOULD BE FOUND IN THE PREVIOUS "cdk deploy GlueJobPipelineStack" OUTPUT)>
    git push -u origin master
  3. Observe the deployment through CodePipeline

Do it yourself

  1. Log in to your AWS account
  2. Set up your online IDE: Cloud9
  3. Add your Glue job (you can take this one, for instance: https://github.com/aws-samples/aws-glue-samples/blob/master/examples/data_cleaning_and_lambda.py)
  4. Add interactive sessions + notebook CI/CD (optional)
  5. See the interactive sessions documentation: https://docs.aws.amazon.com/glue/latest/dg/interactive-sessions.html
  6. Quick hack
    1. vim ~/.aws/config (add glue_role_arn)
    2. vim ~/.aws/credentials
    3. jupyter notebook --ip 0.0.0.0
    4. jupyter nbconvert --to script ./data_cleaning_and_lambda.ipynb
  7. Create your first CDK app
  8. Add glue infrastructure: https://docs.aws.amazon.com/cdk/api/v2/python/aws_cdk.aws_glue_alpha/README.html
  9. Glue database
  10. Glue Role
  11. Glue Crawler
  12. Glue Job
  13. Add CI/CD using the official doc or workshop
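
The jupyter nbconvert step in the quick hack above turns the notebook into a script. If the notebook uses interactive-session magics (lines starting with `%`, such as `%idle_timeout`), you may want them stripped so the result is plain Python. The following standard-library sketch is an alternative to nbconvert, not what this repo does. The function name `notebook_to_script` is hypothetical, and it assumes the notebook is in the usual nbformat JSON layout.

```python
import json
from pathlib import Path


def notebook_to_script(notebook_path):
    """Return the notebook's code cells as one script, dropping magic lines."""
    notebook = json.loads(Path(notebook_path).read_text(encoding="utf-8"))
    chunks = []
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # skip markdown and raw cells
        source = cell.get("source", [])
        # nbformat stores source either as a list of lines or as a single string.
        text = "".join(source) if isinstance(source, list) else source
        if text.lstrip().startswith("%%"):
            continue  # a cell magic makes the whole cell non-Python
        # Drop line magics (%...) and shell escapes (!...).
        kept = [line for line in text.splitlines()
                if not line.lstrip().startswith(("%", "!"))]
        body = "\n".join(kept).strip()
        if body:
            chunks.append(body)
    return "\n\n".join(chunks) + "\n"


# Example: write the script next to the notebook.
# Path("data_cleaning_and_lambda.py").write_text(
#     notebook_to_script("data_cleaning_and_lambda.ipynb"))
```

One caveat: a continuation line of a multi-line expression that happens to start with `%` would also be dropped, so review the generated script before deploying it as a job.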

TODO

  • Inject config (such as output_bucket, stage, database name, etc.)
  • Add dev life cycle diagram and screenshots
  • Add an example of external file inclusion in the notebook with aws s3 sync, %extra_py_files, etc.
  • Add integration tests to pipeline
  • Describe how to add stage with manual approval
  • Fix CDK unit tests

Feel free to contribute!