Guidance for Deploying Secure High-Performance Computing Clusters on AWS

This guidance shows how to build a cloud HPC environment aligned with the NIST SP 800-223 standard using AWS CloudFormation and AWS ParallelCluster. The National Institute of Standards and Technology (NIST) has published NIST SP 800-223, High-Performance Computing (HPC) Security: Architecture, Threat Analysis, and Security Posture. This standard provides guidance on how to configure and secure an HPC cluster.

Table of Contents

Overview

Amazon Web Services (AWS) provides the most elastic and scalable cloud infrastructure to run your HPC workloads. With virtually unlimited capacity, engineers, researchers, HPC system administrators, and organizations can innovate beyond the limitations of on-premises HPC infrastructure.

High performance computing (HPC) on AWS removes the long wait times and lost productivity often associated with on-premises HPC clusters. Flexible HPC cluster configurations and virtually unlimited scalability allow you to grow and shrink your infrastructure as your workloads dictate, not the other way around.

This guidance provides a comprehensive approach to deploying a secure, compliant, and high-performance HPC environment on AWS. It addresses the unique security challenges of HPC systems while maintaining the performance requirements critical for computationally intensive workloads.

We developed this guidance in response to the growing need for secure HPC environments in cloud settings. Many organizations, especially those in research, engineering, and data-intensive fields, require immense computational power but struggle to balance this with stringent security and compliance requirements. The NIST SP 800-223 publication provides an excellent framework for addressing these challenges, and we wanted to demonstrate how to implement these recommendations using AWS services.

Architecture Overview

Architecture diagrams

The architecture diagrams below show a sample NIST SP 800-223 based infrastructure architecture, the provisioning and deployment process using AWS CloudFormation, the HPC cluster deployment, and user interactions via AWS ParallelCluster. Depending on the Region you deploy the guidance in, it automatically scales across two to four AZs to maximize the availability and redundancy of your cluster.

Reference Architecture

(1) Admin/DevOps users can deploy this architecture using a series of AWS CloudFormation templates. These templates provision networking resources, including Amazon Virtual Private Cloud (Amazon VPC) and subnets. The templates also provision resources for security and storage, such as Amazon Simple Storage Service (Amazon S3), Amazon Elastic File System (Amazon EFS), and Amazon FSx for Lustre. There are optional templates included to deploy a Slurm accounting database (DB) and a Microsoft Active Directory user directory.
(2) Four logical subnets (zones) are created, each in multiple Availability Zones (AZs), based on the target AWS Region. All required networking, network access control lists (ACLs), routes, and security resources are deployed. The four zones are: 1) Access Zone (public subnet), 2) Compute Zone, 3) Management Zone, and 4) Storage Zone (the last three are private subnets).
(3) An Amazon RDS for MySQL instance is created that will be used as the Slurm accounting database. It is set up in a single AZ, or can be modified to be multi-AZ if preferred. One AWS Directory Service user directory is created across two AZs.
(4) An Amazon EFS file system is created for shared cluster storage that is mounted in all of the deployed subnets for the Storage Zone. An FSx for Lustre file system is created that is used as a highly performant scratch file system in the preferred AZ.
(5) Two Amazon S3 buckets are created: one for campaign storage using Amazon S3 Intelligent-Tiering, and one for archival storage using Amazon S3 Glacier.
(6) Random passwords are generated for both the Slurm accounting database and the Directory Service and are stored securely in AWS Secrets Manager.

Reference Architecture

(1) Admin/DevOps users use the AWS ParallelCluster AWS CloudFormation stack to deploy HPC resources. Resources can reference the network, storage, security, database, and user directory from the previously launched CloudFormation stacks.
(2) The AWS ParallelCluster CloudFormation template provisions a sample cluster configuration, which includes a head node deployed in a single Availability Zone within the Management zone. It also provisions a login node deployed in a single Availability Zone within the Access zone.
(3) The Slurm workload manager is deployed on the head node and used for managing the HPC workflow processes.
(4) The included sample cluster configuration creates two Slurm queues that provision compute nodes within the Compute zone. One queue uses compute-optimized Amazon Elastic Compute Cloud (Amazon EC2) instances, while the other queue uses GPU-accelerated EC2 instances.
(5) Users access this guidance by establishing a connection to the deployed login node within the Access zone, using NICE DCV, SSH, or AWS Systems Manager Session Manager.
(6) Users authenticate to the login node using a username and password stored in the AWS Managed Microsoft Active Directory.

Cost

You are responsible for the cost of the AWS services deployed and used running this guidance. As of December 2024, the cost for running this guidance with the default settings in the US East (N. Virginia) region is approximately $1,156 per month.

We recommend creating a Budget through AWS Cost Explorer to help manage costs. Prices are subject to change. For full details, refer to the pricing webpage for each AWS service used in this Guidance.

Cost Table

The following table provides a sample cost breakdown for deploying this guidance with the default parameters in the US East (N. Virginia) us-east-1 Region for one month.

| Stack Name | AWS Services | Cost [USD] |
|---|---|---|
| Network | VPC, Subnets, NAT Gateway, VPC Endpoints | $596.85/month |
| Security | Security Groups | $0.00/month |
| Storage | S3, EFS, FSx, EBS | $172.19/month |
| Slurm Accounting | RDS Database | $73.84/month |
| Active Directory | Managed AD (Enterprise) | $288.00/month |
| Cluster | Head node, Login node | $25.00/month |
| Total | | $1,155.88/month |
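
As a quick arithmetic check, the per-stack figures above can be summed on the command line. This is only a sketch: the values are copied from the cost table, and the variable names are illustrative.

```shell
# Monthly cost per stack in USD, copied from the cost table above
# (Network, Security, Storage, Slurm Accounting, Active Directory, Cluster).
costs="596.85 0.00 172.19 73.84 288.00 25.00"

# Sum the values with awk, since plain shell arithmetic is integer-only.
total=$(printf '%s\n' $costs | awk '{ sum += $1 } END { printf "%.2f", sum }')
echo "Total: \$${total}/month"
```

The result matches the Total row of the table. If you change the default parameters, replace the values with estimates from the AWS Pricing Calculator.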

Note: The focus of this Guidance is to provide an example of securing the underlying AWS services and infrastructure that an HPC cluster will eventually run on. It does not include any costs related to running actual HPC workloads. Please use the AWS Pricing Calculator to estimate any additional costs related to your specific HPC workload use case.

Security

When you build systems on AWS infrastructure, security responsibilities are shared between you and AWS. This shared responsibility model reduces your operational burden because AWS operates, manages, and controls the components including the host operating system, the virtualization layer, and the physical security of the facilities in which the services operate. For more information about AWS security, visit AWS Cloud Security.

AWS ParallelCluster users can be securely authenticated and authorized using AWS Managed Microsoft Active Directory. HPC cluster EC2 components are deployed into a virtual private cloud (VPC), which provides additional network isolation for all contained components. The login node is deployed into a public subnet and the head node is deployed into a private subnet; both are available for access via secure connections (SSH and SSM). Compute nodes are deployed into a private subnet and managed from the head node by the Slurm workload manager. The Slurm accounting database is deployed into a private subnet and managed from the head node using Slurm. Data stored in Amazon S3, Amazon EFS, and Amazon FSx for Lustre is encrypted at rest and in transit. Access to other AWS services from AWS ParallelCluster components is secured over VPC endpoints from a private management subnet.

See CONTRIBUTING for more information.

Prerequisites

Secure Shell (SSH) Access

If you prefer to use SSH to access the login node or head node, you will need to create a new SSH key pair in your account before launching the ParallelCluster CloudFormation template.

To do that, please follow the steps below:

  1. Log in to your AWS account
  2. In the search bar at the top of the screen, type in EC2
  3. In the list of services, select EC2
  4. In the left-hand menu, select Key Pairs under the Network & Security section
  5. Click Create key pair
  6. Enter a key pair name
  7. Select your preferred key pair type and format, and click Create key pair
  8. This automatically starts a download of the private key for the key pair you just created
  9. Save this key in a secure location (this key can act as your password to log in to the nodes launched by this template)
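
If you prefer the command line, the console steps above can be approximated with the AWS CLI. This is a minimal sketch that assumes the AWS CLI is installed and configured for the target account and Region. The function name `create_key_pair` and the key name in the usage comment are illustrative and not part of this guidance.

```shell
# Create an EC2 key pair and save the private key next to the current directory.
# Assumption: the AWS CLI is installed and has credentials for the target account.
create_key_pair() {
  key_name=$1
  # KeyMaterial is the private key; it is only returned once, at creation time.
  aws ec2 create-key-pair --key-name "$key_name" \
    --query 'KeyMaterial' --output text > "${key_name}.pem" || return 1
  # Restrict the file to owner read-only, as SSH clients require.
  chmod 400 "${key_name}.pem"
}

# Usage (the key name is a placeholder):
#   create_key_pair my-hpc-key
```

As with the console download, store the resulting .pem file in a secure location.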

AWS Account Requirements

This deployment requires that you have access to AWS CloudFormation in your AWS account with permissions to create the following resources:

AWS Services in this Guidance:

Deployment Steps

  1. Clone the sample code repository:
    • git clone https://github.com/aws-solutions-library-samples/guidance-for-building-nist-sp-800-223-hpc-on-aws.git
  2. Change directory to the deployment folder inside the repo
    • cd guidance-for-building-nist-sp-800-223-hpc-on-aws/deployment
  3. Locate the six AWS CloudFormation templates and review them in order in a text editor of your choice or in the AWS CloudFormation console
    • In most cases you will want to use the default settings. However, you can modify these templates to fit your specific needs. The default template file names and their corresponding CloudFormation stack names are listed below:

Default Stack Names

| Template File Name | Default Stack Name |
|---|---|
| 0_network.yaml | nist-network |
| 1_security.yaml | nist-security |
| 2_storage.yaml | nist-storage |
| 3_slurm_db.yaml | nist-database |
| 4_active_directory.yaml | nist-ad |
| 5_pcluster.yaml | nist-hpc |

  4. Open a browser and log in to your AWS account
  5. Locate the search bar at the top of your screen and type in CloudFormation
  6. When presented with a list of services, click CloudFormation to open the CloudFormation console:

CloudFormation

  7. Click the Create Stack button:

Create Stack

  8. In the Prepare template section, select Choose an existing template:

Prepare Template

  9. In the Specify template section, select Upload a template file
  10. Click the Choose file button:

Choose File

  11. Navigate to the location on your local computer where you cloned the sample code and open the deployment folder. There you will find the CloudFormation templates, each prefixed with a number that indicates the order in which to deploy them.
  12. Select the first template, titled 0_network.yaml
  13. For each template you will be asked to provide a stack name. This name must be unique within the Region you are deploying in.

Important: Note the stack name for use when deploying the other templates. Downstream stacks need this stack name in order to reference the Amazon Resource Names (ARNs) or resource IDs that each template exports as outputs.

Stack Name

  14. For the Network stack, review the parameters and adjust them as needed based on your specific use case or requirements
  15. Once you have reviewed and validated the parameters, click the Next button at the bottom of the page
  16. Leave the default options on the Configure stack options page
  17. Scroll to the bottom of this page and select the check box to allow CloudFormation to create IAM resources on your behalf
  18. Click Next

Choose File

  19. On the Review and create screen, review your selections again and then click the Submit button at the bottom of the page:

Submit

  20. Your CloudFormation stack will begin deploying
  21. You can monitor the deployment progress in the AWS CloudFormation console

Choose File

  22. Wait until the stack status changes from "CREATE_IN_PROGRESS" to "CREATE_COMPLETE" before deploying the next template
  23. You can review the outputs generated by the stack by going to the Outputs tab for each stack or going to the Exports page on the left-hand menu
    • Note: The export values will be used by later templates to reference resources created in earlier templates

Stack Outputs View:

Outputs

Stack Exports View:

Exports

  24. Repeat the steps above, starting with step 7, for the rest of the guidance stacks, moving on to the next template file in the deployment folder (listed in step 3 above)

Important: Stacks 1-5 have input parameters that ask for the previously deployed stack names. If you modify the stack names from their default values in Step 3, you will need to also update the parameters in each subsequent stack with the appropriate name so that the relevant services can be referenced.

Below is an example of modified input parameters used for deployment of the hpc-pc-cluster stack:

HPC cluster input parameters

Note: The storage, Slurm database, Active Directory, and AWS ParallelCluster stacks are intended to be simple examples for testing the NIST SP 800-223 reference architecture. For more production-ready versions of these templates, please see the AWS HPC Recipes repository.
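
The console walkthrough above can also be scripted. The sketch below deploys the six templates in order with `aws cloudformation deploy`, which waits for each stack to finish before returning. It rests on several assumptions that you should verify against the templates: the AWS CLI is configured, the current directory is the deployment folder, every template parameter has a usable default (otherwise add `--parameter-overrides`), and the two capabilities shown are sufficient. The names `STACK_PLAN` and `deploy_all` are illustrative.

```shell
# Template files and default stack names from this guidance, in deployment order.
STACK_PLAN="0_network.yaml:nist-network
1_security.yaml:nist-security
2_storage.yaml:nist-storage
3_slurm_db.yaml:nist-database
4_active_directory.yaml:nist-ad
5_pcluster.yaml:nist-hpc"

# Deploy every stack in order and stop at the first failure.
# Assumption: CAPABILITY_NAMED_IAM and CAPABILITY_AUTO_EXPAND cover what the templates need.
deploy_all() {
  for entry in $STACK_PLAN; do
    template=${entry%%:*}   # text before the colon
    stack=${entry##*:}      # text after the colon
    echo "Deploying $template as $stack"
    aws cloudformation deploy \
      --template-file "$template" \
      --stack-name "$stack" \
      --capabilities CAPABILITY_NAMED_IAM CAPABILITY_AUTO_EXPAND || return 1
  done
}

# Print the plan for review; nothing is deployed until deploy_all is called.
for entry in $STACK_PLAN; do
  echo "${entry%%:*} -> ${entry##*:}"
done
```

Run `deploy_all` only after reviewing the printed plan. If you change a stack name, remember that later templates take the earlier stack names as input parameters. Large template files may also require the `--s3-bucket` option.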

Deployment Validation

  • Open CloudFormation console and verify the status of the guidance templates with the names listed in Step 3.

Validate deployment

  • Make sure that all CloudFormation stacks are deployed successfully, with resulting status of "CREATE_COMPLETE"
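
The same check can be sketched with the AWS CLI. The example assumes the default stack names from this guidance and a configured AWS CLI. The function names `is_deployed` and `check_stacks` are illustrative.

```shell
# Succeeds only when the given status is the success state named above.
is_deployed() {
  [ "$1" = "CREATE_COMPLETE" ]
}

# Query the status of each default stack and print one line per stack.
# Assumption: the stacks were deployed with their default names.
check_stacks() {
  for stack in nist-network nist-security nist-storage nist-database nist-ad nist-hpc; do
    status=$(aws cloudformation describe-stacks --stack-name "$stack" \
      --query 'Stacks[0].StackStatus' --output text)
    if is_deployed "$status"; then
      echo "OK      $stack"
    else
      echo "PENDING $stack ($status)"
    fi
  done
}

# Usage:
#   check_stacks
```

Any stack reported as PENDING is either still being created or has failed. The Events tab of that stack in the CloudFormation console shows the details.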

You can also open a nested stack with the name starting with c-nist and look into its Outputs tab to get values of some important HPC cluster parameters:

Nested Stack Cluster parameters

If you need to verify that the specified high-performance FSx storage was provisioned, navigate to the FSx section of the AWS Management Console and look for an FSx for Lustre entry like this one:

FSxL Storage

Details of other services and resources deployed by this guidance can also be obtained from the outputs of the respective CloudFormation stacks.

Next Steps

You have now successfully deployed the infrastructure needed to comply with the guidelines and recommendations outlined in NIST SP 800-223.

You can begin using the cluster by logging in to the head node in the Management zone to review or modify any of the Slurm settings. You can use SSM to securely open a terminal session to either the login node or the head node by following the instructions below:

HPC Login and Head Nodes

Login via AWS Systems Manager (SSM)

  1. In the search bar above type in EC2
  2. In the list of services select EC2
  3. On the left hand menu select Instances
  4. Locate either the head node or the Login node and select that instance by checking the box to the left of it
  5. Locate the Connect button near the top of the screen
  6. In the window that opens click the Session Manager tab
  7. Click the Connect button to open a secure terminal session in your browser

SSM Connect Head Node

Login via SSH

Alternatively, when you launch the 5_pcluster.yaml CloudFormation template, you can select an SSH key pair that already exists in your AWS account. If you completed the prerequisite steps to create a key pair, you will see it in this list.

  1. Locate your SSH key pair
  2. Ensure you have the proper permissions set on the key pair (read-only access): chmod 400 /path/key/ssh_key.pem
  3. In the list of services select EC2
  4. On the left hand menu select Instances
  5. Locate either the head node or the login node and select one instance by checking the box to the left of the instance
  6. Locate the Connect button near the top of the screen
  7. In the window that opens click the SSH client tab
  8. Follow the instructions on the screen to log in to your instance

Run Test SLURM Job

You can follow the steps in the AWS documentation to run a simple job using the Slurm workload scheduler. Run the commands as the ec2-user user to get access to all of the pre-installed HPC utilities. You can do that by switching users:

sudo su ec2-user

After that, you can run the commands below to check the HPC cluster nodes and capacity, list the available environment modules, and create and submit a test Hello World job:

[ec2-user@ip-10-0-2-161 ~]$ sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
cpu*         up   infinite      2  idle~ cpu-dy-cpu-[1-2]
gpu          up   infinite      2  idle~ gpu-dy-gpu-[1-2]
[ec2-user@ip-10-0-2-161 ~]$ module avail

------------------------------------------------------------------------ /opt/amazon/modules/modulefiles -------------------------------------------------------------------------
libfabric-aws/1.22.0amzn1.0 openmpi/4.1.6               openmpi5/5.0.3

------------------------------------------------------------------------- /usr/share/Modules/modulefiles -------------------------------------------------------------------------
dot         module-git  module-info modules     null        use.own

-------------------------------------------------------------------- /opt/intel/mpi/2021.13/etc/modulefiles/ ---------------------------------------------------------------------
intelmpi/2021.13
[ec2-user@ip-10-0-2-161 ~]$ cat hellojob.sh
#!/bin/bash
sleep 30
echo "Hello World from $(hostname)"
[ec2-user@ip-10-0-2-161 ~]$ sbatch ./hellojob.sh
Submitted batch job 1
[ec2-user@ip-10-0-2-161 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1       cpu hellojob ec2-user CF       0:08      1 cpu-dy-cpu-1
[ec2-user@ip-10-0-2-161 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1       cpu hellojob ec2-user CF       0:26      1 cpu-dy-cpu-1
[ec2-user@ip-10-0-2-161 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1       cpu hellojob ec2-user CF       0:28      1 cpu-dy-cpu-1
[ec2-user@ip-10-0-2-161 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
                 1       cpu hellojob ec2-user CF       0:33      1 cpu-dy-cpu-1

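Slurm launches a properly sized compute node (EC2 instance) to run the scheduled job. The CF (configuring) state in the squeue output above means that the job has been allocated a node that is still starting up. Wait until the job completes (squeue lists no jobs) and then check the completed job output: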

[ec2-user@ip-10-0-2-161 ~]$ squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
[ec2-user@ip-10-0-2-161 ~]$ ls -l
total 8
-rwxrwxr-x 1 ec2-user ec2-user 57 Dec 16 23:43 hellojob.sh
-rw-rw-r-- 1 ec2-user ec2-user 32 Dec 16 23:47 slurm-1.out
[ec2-user@ip-10-0-2-161 ~]$ cat slurm-1.out
Hello World from ip-10-0-54-243

This confirms that your provisioned HPC cluster can run a compute job on its nodes under the control of Slurm.
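
Instead of rerunning squeue by hand, the submit, wait, and check sequence above can be wrapped in a small helper. This sketch assumes a single-cluster Slurm setup, where `sbatch --parsable` prints only the job ID, and the default output file name `slurm-<jobid>.out` in the submission directory. The function names and the `POLL_SECONDS` variable are illustrative.

```shell
# Block until the given Slurm job ID no longer appears in the queue.
wait_for_job() {
  job_id=$1
  # -h suppresses the header, so empty output means the job has left the queue.
  while [ -n "$(squeue -h -j "$job_id" 2>/dev/null)" ]; do
    sleep "${POLL_SECONDS:-5}"
  done
}

# Submit a job script, wait for it to finish, and print its output file.
run_and_show() {
  job_id=$(sbatch --parsable "$1") || return 1
  wait_for_job "$job_id"
  cat "slurm-${job_id}.out"
}

# Usage, from the directory that contains the job script:
#   run_and_show ./hellojob.sh
```

The first job on a dynamic node can take several minutes because the EC2 instance has to launch before the job starts.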

Cleanup

When you no longer need to use the guidance, you should delete the AWS resources deployed by it in order to prevent ongoing charges for their use.

  1. In the AWS Management Console, navigate to CloudFormation and locate the six guidance stacks you deployed.
  2. Starting with the most recently deployed stack (not including any nested stacks), select the stack and click the Delete button:

Initiate Stack Deletion

     Confirm your intent by clicking the Delete button on the pop-up screen:

Confirm Stack Deletion

  3. Repeat this for each of the six CloudFormation stacks to remove all resources from your account, and confirm that every stack is deleted.
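
The same teardown can be sketched with the AWS CLI by deleting the stacks in the reverse of their deployment order and waiting for each deletion to finish. The example assumes the default stack names and a configured AWS CLI. The names `DEPLOYED`, `reverse_order`, and `delete_all` are illustrative.

```shell
# Default stack names in the order they were deployed.
DEPLOYED="nist-network nist-security nist-storage nist-database nist-ad nist-hpc"

# Print the stack names in reverse order, most recently deployed first.
reverse_order() {
  reversed=""
  for stack in $DEPLOYED; do
    reversed="$stack $reversed"
  done
  echo "${reversed% }"   # drop the trailing space
}

# Delete each stack and wait for the deletion to complete before moving on.
delete_all() {
  for stack in $(reverse_order); do
    echo "Deleting $stack"
    aws cloudformation delete-stack --stack-name "$stack" || return 1
    aws cloudformation wait stack-delete-complete --stack-name "$stack" || return 1
  done
}

# Print the deletion order for review; nothing is deleted until delete_all is called.
reverse_order
```

Deleting in reverse order matters because later stacks import values exported by earlier ones, and CloudFormation does not delete a stack whose exports are still in use.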

Notices

Customers are responsible for making their own independent assessment of the information in this Guidance. This Guidance: (a) is for informational purposes only, (b) represents AWS current product offerings and practices, which are subject to change without notice, and (c) does not create any commitments or assurances from AWS and its affiliates, suppliers or licensors. AWS products or services are provided “as is” without warranties, representations, or conditions of any kind, whether express or implied. AWS responsibilities and liabilities to its customers are controlled by AWS agreements, and this Guidance is not part of, nor does it modify, any agreement between AWS and its customers.

Authors

Chris Riddle - Sr. Solutions Architect AWS
Daniel Zilberman - Sr. Solutions Architect AWS

License

This library is licensed under the MIT-0 License. See the LICENSE file.