Amazon SageMaker reusable components with SageMaker Projects

This solution shows how to deliver reusable and self-contained ML components to an Amazon SageMaker environment using AWS Service Catalog, AWS CloudFormation, SageMaker Projects, and SageMaker Pipelines.

Refer to the blog post "Enhance your machine learning development by using a modular architecture with Amazon SageMaker projects" for more details.

Solution overview

As an example of an ML workflow that spans several development domains, the proposed solution implements a use case of an automated pipeline for data transformation, feature extraction, and ingestion into Amazon SageMaker Feature Store.

At a high level, the workflow comprises the following steps:

  1. An upstream data ingestion component uploads one or more data files to an Amazon Simple Storage Service (Amazon S3) bucket
  2. The data upload event launches a data processing and transformation process
  3. The data transformation process extracts, processes, and transforms features, and ingests them into a designated feature group in Feature Store

Solution architecture

The detailed component architecture of the solution is presented in the following diagram.

A product portfolio (1) defines the automated Feature Store data ingestion product (2) together with the associated user roles that are allowed to use the portfolio and the products it contains. CloudFormation templates define both the product portfolio (1) and the product (2). A CloudFormation template (3) contains all the resources, source code, configuration, and permissions that are needed to provision the product in your SageMaker environment.

When AWS CloudFormation deploys the product, it creates a new SageMaker project (4).

The SageMaker project implements the feature ingestion workflow (5). The workflow contains an AWS Lambda function, which is launched by an Amazon EventBridge rule each time new objects are uploaded into a monitored S3 bucket. The Lambda function starts an Amazon SageMaker Pipeline (6), which is defined and provisioned as a part of the SageMaker project. The pipeline implements data transformation and ingestion in Feature Store.
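The Lambda step of this workflow can be sketched as follows. This is a minimal illustration, not the function shipped with the project: it assumes the EventBridge event carries the bucket name and object key under `detail`, and both the `PIPELINE_NAME` environment variable and the `InputDataUrl` pipeline parameter are hypothetical names chosen for the example.

```python
import json
import os


def build_start_request(event, pipeline_name):
    """Translate an S3 "Object Created" event into start_pipeline_execution arguments.

    Assumes the event shape event["detail"]["bucket"]["name"] and
    event["detail"]["object"]["key"].
    """
    detail = event["detail"]
    s3_uri = f"s3://{detail['bucket']['name']}/{detail['object']['key']}"
    return {
        "PipelineName": pipeline_name,
        # "InputDataUrl" is a hypothetical pipeline parameter name.
        "PipelineParameters": [{"Name": "InputDataUrl", "Value": s3_uri}],
    }


def lambda_handler(event, context, sagemaker_client=None):
    """Start the SageMaker pipeline for the object named in the event."""
    if sagemaker_client is None:
        # Imported lazily so the helper above can be exercised without boto3.
        import boto3

        sagemaker_client = boto3.client("sagemaker")
    # PIPELINE_NAME is a hypothetical environment variable set on the function.
    request = build_start_request(event, os.environ["PIPELINE_NAME"])
    response = sagemaker_client.start_pipeline_execution(**request)
    return {"statusCode": 200, "body": json.dumps(response["PipelineExecutionArn"])}
```

Passing the client as an optional argument keeps the handler easy to test with a stub; in a deployed function the default boto3 client would be used.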

The project also provisions CI/CD automation (7) with an AWS CodeCommit repository with source code, AWS CodeBuild with a pipeline build script, and AWS CodePipeline to orchestrate the build and deployment of the SageMaker pipeline (6).

ML pipeline

This solution implements an ML pipeline by using Amazon SageMaker Pipelines, an ML workflow creation and orchestration framework. The pipeline contains a single step with an Amazon SageMaker Data Wrangler processor for data transformation and ingestion into a feature group in Feature Store. The following diagram shows a data processing pipeline implemented by this solution.

IAM roles and permissions

The following diagram shows all the IAM roles involved and which service or resource assumes which role.

The IAM setup contains the following roles and policies:

  1. The SageMaker Service Catalog products launch role. This role calls the iam:PassRole API for the SageMaker Service Catalog products use role (2) and the Lambda execution role (4).
  2. The SageMaker Service Catalog products use role. Project resources assume this role to perform their tasks.
  3. The SageMaker execution role. Studio notebooks use this role to access all resources, including S3 buckets.
  4. The Lambda execution role. The Lambda function assumes this role.
  5. The Lambda function resource policy allows EventBridge to invoke the function.
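To make the relationship between roles and policies more concrete, the sketch below builds two generic IAM policy documents as Python dictionaries: a trust policy that lets a service assume a role (as Lambda does with its execution role) and a resource policy statement that lets an EventBridge rule invoke a function. The helper names and the ARNs in the usage note are hypothetical; the actual policies are defined in the solution's CloudFormation templates and may differ.

```python
import json


def trust_policy(service_principal):
    """Trust policy that lets one AWS service assume a role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": service_principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def eventbridge_invoke_statement(function_arn, rule_arn):
    """Resource policy statement that lets one EventBridge rule invoke a function."""
    return {
        "Effect": "Allow",
        "Principal": {"Service": "events.amazonaws.com"},
        "Action": "lambda:InvokeFunction",
        "Resource": function_arn,
        # Restrict invocation to events that originate from the given rule.
        "Condition": {"ArnLike": {"AWS:SourceArn": rule_arn}},
    }


def to_json(document):
    """Render a policy document as indented JSON for inspection."""
    return json.dumps(document, indent=2)
```

For example, `to_json(trust_policy("lambda.amazonaws.com"))` renders a trust policy for a Lambda execution role, and `eventbridge_invoke_statement(function_arn, rule_arn)` takes the ARNs of your own function and rule.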

Refer to SageMaker Studio Permissions Required to Use Projects for more details on the Studio permission setup for projects.

Deployment

To deploy the solution, you must have Administrator (or Power User) permissions to package the CloudFormation templates, upload the templates to your Amazon S3 bucket, and run the deployment commands.

You must also have the AWS CLI installed. If you do not have it, see Installing, updating, and uninstalling the AWS CLI.

Clone the solution GitHub repository to your local development environment:

git clone https://github.com/aws-samples/amazon-sagemaker-reusable-components.git
cd amazon-sagemaker-reusable-components

Create SageMaker Studio

If you don't have SageMaker Studio created and configured in your account, you must create a new instance of Studio. If you already have Studio, continue from the Deploy SageMaker project portfolio step.

  1. Open the Amazon SageMaker console.
  2. Select Amazon SageMaker Studio.
  3. In the configuration dialog, select Standard setup and AWS Identity and Access Management (IAM), and make sure SageMaker Projects and JumpStart are enabled for the account and Studio users. In Permission configuration, select Create a new role. Review, amend if needed, and accept the permissions for the Studio IAM role.
  4. In the Network and storage configuration, select the VPC and Network Access for Studio. If you select the VPC only option, make sure your VPC has internet access.
  5. Click Submit.

❗ You don't need to wait until Studio becomes available. You can move on to the next deployment step right away.

Deploy SageMaker project portfolio

This solution includes a SageMaker custom project template to demonstrate the usage of reusable governed components in Studio, more specifically for automation of feature transformation and ingestion into the SageMaker Feature Store. This project template is delivered as an AWS Service Catalog product and is available in the Studio SageMaker resources menu under Projects.

Complete the following deployment steps to provision all necessary artifacts before starting Studio:

  1. Package CloudFormation templates
  2. Get the ARN of the SageMaker execution role:
    a. Get the SageMaker DomainId:
    export SM_DOMAIN_ID=$(aws sagemaker list-domains \
        --output text --query 'Domains[0].DomainId')
    b. Get the SageMaker execution role ARN:
    export SM_EXECUTION_ROLE=$(aws sagemaker describe-domain \
        --domain-id $SM_DOMAIN_ID \
        --output text --query 'DefaultUserSettings.ExecutionRole')
  3. Deploy the AWS Service Catalog product portfolio. The command expects AWS_DEFAULT_REGION to be set in your shell:
S3_BUCKET_NAME=<S3 bucket name you used to package CloudFormation templates in step 1>
SC_PORTFOLIO_STACK_NAME=sm-project-sc-portfolio

aws cloudformation create-stack \
    --template-url https://s3.$AWS_DEFAULT_REGION.amazonaws.com/$S3_BUCKET_NAME/amazon-sagemaker-reusable-components/sm-project-sc-portfolio.yaml \
    --region $AWS_DEFAULT_REGION \
    --stack-name $SC_PORTFOLIO_STACK_NAME  \
    --disable-rollback \
    --capabilities CAPABILITY_NAMED_IAM \
    --parameters \
        ParameterKey=SCPortfolioPrincipalRoleArn,ParameterValue=$SM_EXECUTION_ROLE

Wait until the CloudFormation stack is successfully deployed into your account, then proceed to the next step.
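If you prefer to script the wait instead of watching the console, a polling loop along these lines could be used. It is a sketch under stated assumptions: `cfn_client` is expected to behave like a boto3 CloudFormation client (only `describe_stacks` is called), and the function name, polling interval, and retry limit are illustrative choices.

```python
import time


def wait_for_stack(cfn_client, stack_name, poll_seconds=15, max_polls=120, sleep=time.sleep):
    """Poll describe_stacks until the stack reaches CREATE_COMPLETE.

    Raises RuntimeError if the stack settles in any other final state
    (for example CREATE_FAILED, which is what --disable-rollback leaves behind)
    and TimeoutError if the stack is still in progress after max_polls checks.
    """
    for _ in range(max_polls):
        stack = cfn_client.describe_stacks(StackName=stack_name)["Stacks"][0]
        status = stack["StackStatus"]
        if status == "CREATE_COMPLETE":
            return status
        if not status.endswith("_IN_PROGRESS"):
            raise RuntimeError(f"Stack {stack_name} ended in state {status}")
        sleep(poll_seconds)
    raise TimeoutError(f"Stack {stack_name} did not finish after {max_polls} polls")
```

With boto3 installed, `wait_for_stack(boto3.client("cloudformation"), "sm-project-sc-portfolio")` would block until the portfolio stack is ready. The boto3 CloudFormation client also provides a built-in `stack_create_complete` waiter that serves the same purpose.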

Add permissions to Service Catalog launch and SageMaker execution IAM roles

AWS Service Catalog uses a default AmazonSageMakerServiceCatalogProductsLaunchRole IAM role to launch CloudFormation templates with SageMaker projects. This role is automatically created during provisioning of SageMaker Studio if you enable SageMaker Projects for Studio users.

To deploy our Feature Store ingestion product as a SageMaker project, this role needs additional permissions. All needed permissions are defined in the managed policy resource AmazonSageMakerServiceCatalogFSIngestionProductPolicy, which we must attach to the AmazonSageMakerServiceCatalogProductsLaunchRole role before we can start the SageMaker project deployment.

To run some code cells in the provided notebooks, for example those that call the CloudFormation API, the SageMaker execution role needs additional permissions. These permissions are defined in the managed policy resource AmazonSageMakerExecutionRolePolicy and must be attached to the SageMaker execution role.

Run the following CLI commands to attach the created managed policies to the AmazonSageMakerServiceCatalogProductsLaunchRole and SageMaker execution IAM roles.

First, retrieve the managed policy ARNs and the SageMaker execution role name:

export SM_SC_FS_INGESTION_POLICY_ARN=$(aws cloudformation describe-stacks \
    --stack-name $SC_PORTFOLIO_STACK_NAME \
    --output text \
    --query 'Stacks[0].Outputs[?OutputKey==`FSIngestionProductPolicyArn`].OutputValue')

export SM_EXECUTION_ROLE_POLICY_ARN=$(aws cloudformation describe-stacks \
    --stack-name $SC_PORTFOLIO_STACK_NAME \
    --output text \
    --query 'Stacks[0].Outputs[?OutputKey==`AmazonSageMakerExecutionRolePolicyArn`].OutputValue')

export SM_EXECUTION_ROLE_NAME=$(aws cloudformation describe-stacks \
    --stack-name $SC_PORTFOLIO_STACK_NAME \
    --output text \
    --query 'Stacks[0].Outputs[?OutputKey==`AmazonSageMakerExecutionRoleName`].OutputValue')
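The `--query` expressions above select a single output value by its key. For readers who script this step in Python, the same lookup can be sketched as a small helper that operates on the response of a `describe_stacks` call. The function name is hypothetical; the response shape (`Stacks[0].Outputs[].OutputKey/OutputValue`) is the one the CLI queries rely on.

```python
def get_stack_output(describe_stacks_response, output_key):
    """Return the OutputValue for output_key from a describe_stacks response.

    Mirrors the CLI query Stacks[0].Outputs[?OutputKey==`...`].OutputValue,
    but raises KeyError instead of returning an empty result when the key
    is missing.
    """
    outputs = describe_stacks_response["Stacks"][0].get("Outputs", [])
    for output in outputs:
        if output["OutputKey"] == output_key:
            return output["OutputValue"]
    raise KeyError(f"Stack output not found: {output_key}")
```

Failing loudly on a missing key is a deliberate choice here: with the CLI, a misspelled output key yields an empty variable, which only surfaces later as a confusing error in the attach commands.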

Attach the policies to the roles:

aws iam attach-role-policy \
    --role-name AmazonSageMakerServiceCatalogProductsLaunchRole \
    --policy-arn $SM_SC_FS_INGESTION_POLICY_ARN

aws iam attach-role-policy \
    --role-name $SM_EXECUTION_ROLE_NAME \
    --policy-arn $SM_EXECUTION_ROLE_POLICY_ARN
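When the attachment step is automated, it can help to know whether a policy was already attached. The following sketch checks first and attaches only when needed. It assumes `iam_client` behaves like a boto3 IAM client (`list_attached_role_policies` and `attach_role_policy`); the helper name is illustrative and is not part of the solution.

```python
def attach_policy_if_missing(iam_client, role_name, policy_arn):
    """Attach policy_arn to role_name unless it is already attached.

    Returns True if the policy was attached by this call, False if it was
    already present. Follows the Marker-based pagination of
    list_attached_role_policies.
    """
    kwargs = {"RoleName": role_name}
    while True:
        page = iam_client.list_attached_role_policies(**kwargs)
        if any(p["PolicyArn"] == policy_arn for p in page["AttachedPolicies"]):
            return False
        if not page.get("IsTruncated"):
            break
        kwargs["Marker"] = page["Marker"]
    iam_client.attach_role_policy(RoleName=role_name, PolicyArn=policy_arn)
    return True
```

The boolean return value makes it straightforward to log which roles were changed during a deployment run.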

Start Studio

To launch Studio, go to the SageMaker Dashboard, click Open SageMaker Studio, and check that Status is Ready. Make sure that Amazon SageMaker project templates are enabled for your account and Studio users in the Studio Summary pane.

If projects are disabled, click Edit Settings, enable the projects in the settings window, and click Submit.

If you don't have a Studio user, click Add user in the SageMaker Studio Control panel. Otherwise, click Open Studio.

To add a new Studio user, go to the Add user panel, enter the SageMaker execution role that was created in the Create SageMaker Studio step, and click Submit.

After the user profile is created, click the Open Studio link. You are redirected to a new browser window with the Studio IDE.

Wait until Studio finishes creating the default JupyterServer application and presents the Launcher landing page.

Clone code repository

To use the provided notebooks you must clone the source code repository into your Studio environment. Open a system terminal in Studio in the Launcher window:

Run the following command in the terminal:

git clone https://github.com/aws-samples/amazon-sagemaker-reusable-components.git

The code repository is downloaded and saved in your home directory in Studio. Now go to the file browser and open the 00-setup notebook.

The first start of the notebook kernel on a new KernelGateway app takes about 5 minutes. Continue with the instructions in the notebook after the kernel is ready.

Solution walk-through

Use the provided setup and feature-store-ingest-pipeline notebooks to run all solution steps.

The delivered notebooks take you through the following implementation:

  • Setup
    • set up the working environment, create an S3 bucket for data upload, explore the test dataset
    • optional - create a Data Wrangler flow for data transformation and feature ingestion
    • create a feature group in Feature Store where features are stored
    • query the data from the feature group
  • Feature Store ingestion pipeline
    • provision a SageMaker project with a data pipeline
    • explore the project resources
    • test the data pipeline by uploading new data into the monitored S3 bucket
    • run the data pipeline on demand via Python SDK
    • query the data from the feature group
  • Clean up
    • delete the project and project's resources
    • delete the feature group
    • delete project-provisioned S3 buckets and S3 objects
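As an illustration of the "run the data pipeline on demand" step, the sketch below starts a pipeline execution and polls until it reaches a final state. It is not taken from the notebooks, which may use the SageMaker Python SDK instead. It assumes `sm_client` behaves like a boto3 SageMaker client (`start_pipeline_execution` and `describe_pipeline_execution`); the function name and parameter handling are hypothetical.

```python
import time

# Final states reported in PipelineExecutionStatus.
TERMINAL_STATUSES = {"Succeeded", "Failed", "Stopped"}


def run_pipeline_and_wait(sm_client, pipeline_name, parameters=None,
                          poll_seconds=30, max_polls=240, sleep=time.sleep):
    """Start a pipeline execution and wait for it to finish.

    parameters is an optional dict of pipeline parameter names to values.
    Returns (execution_arn, final_status).
    """
    pipeline_parameters = [
        {"Name": name, "Value": str(value)} for name, value in (parameters or {}).items()
    ]
    execution_arn = sm_client.start_pipeline_execution(
        PipelineName=pipeline_name,
        PipelineParameters=pipeline_parameters,
    )["PipelineExecutionArn"]
    for _ in range(max_polls):
        status = sm_client.describe_pipeline_execution(
            PipelineExecutionArn=execution_arn
        )["PipelineExecutionStatus"]
        if status in TERMINAL_STATUSES:
            return execution_arn, status
        sleep(poll_seconds)
    raise TimeoutError(f"Execution {execution_arn} still running after {max_polls} polls")
```

The caller decides how to react to a "Failed" or "Stopped" result, for example by inspecting the execution steps in Studio before querying the feature group.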

Clean up

To avoid charges, you must remove all project-provisioned and generated resources from your AWS account.

  1. Run all steps in the provided clean-up notebook
  2. Detach the managed policies from the roles:
aws iam detach-role-policy \
    --role-name AmazonSageMakerServiceCatalogProductsLaunchRole \
    --policy-arn $SM_SC_FS_INGESTION_POLICY_ARN

aws iam detach-role-policy \
    --role-name $SM_EXECUTION_ROLE_NAME \
    --policy-arn $SM_EXECUTION_ROLE_POLICY_ARN
  3. Delete the SageMaker product portfolio stack:
aws cloudformation delete-stack --stack-name $SC_PORTFOLIO_STACK_NAME
  4. If you used the provided CloudFormation package and deployment stack, delete it:
aws cloudformation delete-stack --stack-name amazon-sagemaker-reusable-components-package-cfn
  5. Delete the SageMaker Studio Domain.
    Refer to the detailed step-by-step instructions in the Amazon SageMaker Developer Guide.

Dataset

We use the well-known Abalone dataset in this solution. The dataset contains 4,177 rows and 8 features.

Dua, D. and Graff, C. (2019). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

You can download the dataset from the UCI website:

wget -t inf http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data
wget -t inf http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.names

Instructions for downloading and using the dataset are provided in the 00-setup notebook.
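The abalone.data file has no header row, so column names must be supplied when it is loaded. The sketch below parses the comma-separated text into typed records using only the Python standard library. The snake_case column names are an assumption based on the attribute list in abalone.names (eight features followed by the rings target), and the notebooks may load the data differently.

```python
import csv
import io

# Eight features followed by the target column (rings).
ABALONE_COLUMNS = [
    "sex", "length", "diameter", "height",
    "whole_weight", "shucked_weight", "viscera_weight", "shell_weight",
    "rings",
]


def parse_abalone(text):
    """Parse the contents of abalone.data into a list of typed dictionaries."""
    rows = []
    for record in csv.reader(io.StringIO(text)):
        if not record:
            continue  # skip blank lines
        row = dict(zip(ABALONE_COLUMNS, record))
        # All columns between sex and rings are continuous measurements.
        for name in ABALONE_COLUMNS[1:-1]:
            row[name] = float(row[name])
        row["rings"] = int(row["rings"])
        rows.append(row)
    return rows
```

For example, `parse_abalone(open("abalone.data").read())` returns one dictionary per row after the file has been downloaded with the commands above.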

License

This library is licensed under the MIT-0 License. See the LICENSE file.

Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. SPDX-License-Identifier: MIT-0
