The titanic-survival-prediction.py sample runs a Spark ML pipeline to train a classification model using a random forest on AWS Elastic MapReduce (EMR).
It is based on the Kaggle Titanic: Machine Learning from Disaster problem, a copy of Jeffwan's code for the GitOps workshop, and the pipeline sample from kubeflow/pipelines.
# Set default region
export AWS_REGION=$(curl -s 169.254.169.254/latest/dynamic/instance-identity/document | jq -r '.region') # For ec2 client or cloud9
export AWS_DEFAULT_REGION=$AWS_REGION
# I encountered issues with EMR in eu-west-2 but it works fine in eu-west-1.
# Use this variable to set the region for the EMR cluster if you encounter issues in your local/default region.
export EMR_REGION=$AWS_REGION
export BUCKET_NAME=mlops-kubeflow-pipeline-data2
aws iam create-user --user-name mlops-user
aws iam create-access-key --user-name mlops-user > $HOME/mlops-user.json
export THE_ACCESS_KEY_ID=$(jq -r '."AccessKey"["AccessKeyId"]' $HOME/mlops-user.json)
echo $THE_ACCESS_KEY_ID
export THE_SECRET_ACCESS_KEY=$(jq -r '."AccessKey"["SecretAccessKey"]' $HOME/mlops-user.json)
export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)
aws iam create-policy --policy-name mlops-s3-access \
--policy-document https://raw.githubusercontent.com/paulcarlton-ww/ml-workshop/master/resources/s3-policy.json > s3-policy.json
aws iam create-policy --policy-name mlops-emr-access \
--policy-document https://raw.githubusercontent.com/paulcarlton-ww/ml-workshop/master/resources/emr-policy.json > emr-policy.json
aws iam create-policy --policy-name mlops-iam-access \
--policy-document https://raw.githubusercontent.com/paulcarlton-ww/ml-workshop/master/resources/iam-policy.json > iam-policy.json
aws iam attach-user-policy --user-name mlops-user --policy-arn $(jq -r '."Policy"["Arn"]' s3-policy.json)
aws iam attach-user-policy --user-name mlops-user --policy-arn $(jq -r '."Policy"["Arn"]' emr-policy.json)
curl https://raw.githubusercontent.com/paulcarlton-ww/ml-workshop/master/resources/kubeflow-aws-secret.yaml | \
sed s/YOUR_BASE64_SECRET_ACCESS/$(echo -n "$THE_SECRET_ACCESS_KEY" | base64)/ | \
sed s/YOUR_BASE64_ACCESS_KEY/$(echo -n "$THE_ACCESS_KEY_ID" | base64)/ | kubectl apply -f -;echo
aws s3api create-bucket --bucket $BUCKET_NAME --region $AWS_DEFAULT_REGION --create-bucket-configuration LocationConstraint=$AWS_DEFAULT_REGION
Install SDKMAN, then use it to install Java and sbt:
curl -s "https://get.sdkman.io" | bash
source "$HOME/.sdkman/bin/sdkman-init.sh"
sdk install java
sdk install sbt
Clone the repository and build the Spark job:
git clone [email protected]:paulcarlton-ww/mlops-titanic
cd mlops-titanic/
sbt clean package
aws s3api put-object --bucket $BUCKET_NAME --key emr/titanic/titanic-survivors-prediction_2.11-1.0.jar --body target/scala-2.11/titanic-survivors-prediction_2.11-1.0.jar
Note: EMR provides all the Spark libraries and this project does not rely on any third-party libraries, so there is no need to build a fat JAR.
Check the Kaggle Titanic: Machine Learning from Disaster competition for more details about this problem. 70% of the dataset is used to train the model and the remaining 30% is used for validation.
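The Spark job in this repository is written in Scala; purely as an illustration of the approach (this is not the repository's actual code, and the feature columns and missing-value handling are assumptions based on the Kaggle dataset), a minimal PySpark sketch of a random-forest pipeline with a 70/30 split might look like this:
# Illustrative sketch only; the real job is the Scala code built above.
# The feature columns and missing-value handling are assumptions, not the repository's exact logic.
from pyspark.sql import SparkSession
from pyspark.ml import Pipeline
from pyspark.ml.feature import StringIndexer, VectorAssembler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

spark = SparkSession.builder.appName("titanic-sketch").getOrCreate()
# On EMR this would be the s3:// location of the train.csv uploaded below
df = spark.read.csv("train.csv", header=True, inferSchema=True).na.drop(subset=["Age", "Embarked"])

# Encode the categorical Sex column and assemble the numeric features into a vector
sex_indexer = StringIndexer(inputCol="Sex", outputCol="SexIndex")
assembler = VectorAssembler(inputCols=["Pclass", "SexIndex", "Age", "SibSp", "Parch", "Fare"], outputCol="features")
rf = RandomForestClassifier(labelCol="Survived", featuresCol="features")
pipeline = Pipeline(stages=[sex_indexer, assembler, rf])

# 70% of the data for training, the remaining 30% for validation
train, validate = df.randomSplit([0.7, 0.3], seed=42)
model = pipeline.fit(train)
predictions = model.transform(validate)

accuracy = MulticlassClassificationEvaluator(labelCol="Survived", predictionCol="prediction", metricName="accuracy").evaluate(predictions)
print(f"validation accuracy: {accuracy:.3f}")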
A copy of train.csv is included in this repository; it needs to be uploaded to S3:
aws s3api put-object --bucket $BUCKET_NAME --key emr/titanic/train.csv --body train.csv
See building a pipeline to install the Kubeflow Pipelines SDK. The following commands will install the tools required:
wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
source /home/ec2-user/.bashrc
conda create --name mlpipeline python=3.7
pip3 install --user kfp --upgrade
rm Miniconda3-latest-Linux-x86_64.sh
cd aws-titanic
sed s/mlops-kubeflow-pipeline-data/$BUCKET_NAME/g titanic-survival-prediction.py | sed s/aws-region/$EMR_REGION/ > build/titanic-survival-prediction.py
dsl-compile --py build/titanic-survival-prediction.py --output build/titanic-survival-prediction.tar.gz
aws s3api put-object --bucket $BUCKET_NAME --key emr/titanic/titanic-survival-prediction.tar.gz --body build/titanic-survival-prediction.tar.gz
When using port-forwarded access to the Kubeflow UI the upload-from-URL option does not work, so it is necessary to download the compiled file to your workstation. A shell script is provided to generate the commands required:
get-tar-cmds.sh
Now use the Kubeflow UI to upload the pipeline file and run an experiment.
Open the Kubeflow Pipelines UI. Create a new pipeline, and then upload the compiled specification (.tar.gz file) as a new pipeline template.
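If you prefer the SDK to the UI, the compiled package can also be uploaded and run from Python with kfp.Client. This is only a sketch: it assumes the Pipelines API has been port-forwarded to localhost:8080, and the experiment and run names used here are made up:
# Sketch: upload and run the compiled pipeline with the Kubeflow Pipelines SDK.
# The host assumes a kubectl port-forward to the Pipelines API on localhost:8080;
# adjust the host, names and paths to match your environment.
import kfp

client = kfp.Client(host="http://localhost:8080")

# Register the compiled package as a pipeline template
client.upload_pipeline(
    pipeline_package_path="build/titanic-survival-prediction.tar.gz",
    pipeline_name="titanic-survival-prediction",
)

# Start a one-off run of the same package in a named experiment
run = client.create_run_from_pipeline_package(
    "build/titanic-survival-prediction.tar.gz",
    arguments={},
    run_name="titanic-survival-prediction-run",
    experiment_name="titanic",
)
print(run.run_id)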
Once the pipeline run has completed, you can go to the S3 path specified in the output to check your prediction results. There are three columns: PassengerId, prediction, and Survived (the ground-truth value).
...
4,1,1
5,0,0
6,0,0
7,0,0
...
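Once a result file has been downloaded (see the commands below), a short Python sketch such as the following could be used to score it; it assumes the three-column layout shown above:
# Sketch: compute accuracy from a downloaded result file.
# Assumes rows of "PassengerId,prediction,Survived" as shown above.
import csv
import sys

path = sys.argv[1]  # e.g. the file fetched with "aws s3api get-object" below
total = correct = 0
with open(path, newline="") as f:
    for row in csv.reader(f):
        if not row or not row[0].isdigit():
            continue  # skip a header row or blank lines
        total += 1
        # prediction may be written as a double (e.g. 1.0), so compare numerically
        if float(row[1]) == float(row[2]):
            correct += 1

print(f"{correct}/{total} correct ({correct / total:.1%})")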
Find the result file name:
aws s3api list-objects --bucket $BUCKET_NAME --prefix emr/titanic/output
Download and analyse it:
export RESULT_FILE=<result file>
aws s3api get-object --bucket $BUCKET_NAME --key emr/titanic/output/$RESULT_FILE $HOME/$RESULT_FILE.csv
grep ",1,1\|,0,0" $HOME/$RESULT_FILE.csv | wc -l # To count correct results
wc -l $HOME/$RESULT_FILE.csv # To count items in file
Create Cluster: source code
Submit Spark Job: source code
Delete Cluster: source code
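For reference, the pipeline itself is built from the AWS EMR components listed above (create cluster, submit Spark job, delete cluster) using the Kubeflow Pipelines DSL. The following is only a hedged sketch of that structure, not the repository's actual titanic-survival-prediction.py: the component parameter and output names are assumptions, and the bucket, keys and region simply reuse the values from earlier steps, so check the component.yaml files and the source code links above for the real interface:
# Sketch of the pipeline structure only; see titanic-survival-prediction.py for the real definition.
# The component parameter and output names below are assumptions: check the component.yaml files
# under components/aws/emr in the kubeflow/pipelines repository for the exact interface.
import kfp
from kfp import components, dsl

_BASE = "https://raw.githubusercontent.com/kubeflow/pipelines/master/components/aws/emr"
create_cluster_op = components.load_component_from_url(_BASE + "/create_cluster/component.yaml")
submit_spark_job_op = components.load_component_from_url(_BASE + "/submit_spark_job/component.yaml")
delete_cluster_op = components.load_component_from_url(_BASE + "/delete_cluster/component.yaml")

@dsl.pipeline(name="titanic-survival-prediction", description="Train a random forest on EMR")
def titanic_pipeline(region="eu-west-1",
                     jar_path="s3://mlops-kubeflow-pipeline-data2/emr/titanic/titanic-survivors-prediction_2.11-1.0.jar",
                     input_path="s3://mlops-kubeflow-pipeline-data2/emr/titanic/train.csv",
                     output_path="s3://mlops-kubeflow-pipeline-data2/emr/titanic/output"):
    # 1. Create Cluster
    cluster = create_cluster_op(region=region, name="titanic-emr")
    # 2. Submit Spark Job against the new cluster (cluster.output assumed to be the job-flow id)
    job = submit_spark_job_op(region=region, jobflow_id=cluster.output, job_name="titanic",
                              jar_path=jar_path, input=input_path, output=output_path)
    # 3. Delete Cluster once the job has finished
    delete_cluster_op(region=region, jobflow_id=cluster.output).after(job)

if __name__ == "__main__":
    # dsl-compile (used above) performs the same compilation as this programmatic call
    kfp.compiler.Compiler().compile(titanic_pipeline, "titanic-survival-prediction.tar.gz")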