Distributed Training Workshop on Amazon SageMaker

Welcome to the art and science of optimizing neural networks at scale! In this workshop you'll get hands-on experience with our high-performance distributed training libraries and learn how to get the best training performance on AWS.

Workshop Content

Today you'll walk through two hands-on labs. The first one focuses on data parallelism, and the second one is about model parallelism.
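As a warm-up for the first lab, here is a minimal pure-Python sketch of the idea behind data parallelism: each worker computes a gradient on its own shard of the batch, and the results are averaged. It is a toy illustration under stated assumptions, not the workshop's code. The function names (`grad_mse`, `shard`, `data_parallel_grad`) and the 1-D linear model are hypothetical, and the averaging step only stands in for the all-reduce that a real cluster would perform.

```python
# Toy illustration of data parallelism for a 1-D linear model y = w * x.
# Pure Python: no GPUs or distributed runtime are involved.

def grad_mse(w, xs, ys):
    """Gradient of the mean squared error (1/n) * sum((w*x - y)^2) w.r.t. w."""
    n = len(xs)
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / n


def shard(items, num_workers):
    """Split a list into num_workers equal, contiguous shards."""
    size = len(items) // num_workers
    return [items[i * size:(i + 1) * size] for i in range(num_workers)]


def data_parallel_grad(w, xs, ys, num_workers):
    """Each simulated worker computes a local gradient on its shard.

    Averaging the local gradients plays the role of an all-reduce.
    """
    local_grads = [
        grad_mse(w, shard_x, shard_y)
        for shard_x, shard_y in zip(shard(xs, num_workers), shard(ys, num_workers))
    ]
    return sum(local_grads) / num_workers


xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0 * x for x in xs]  # data generated with a true weight of 2.0
w = 0.5                     # current (wrong) weight

full_batch = grad_mse(w, xs, ys)
parallel = data_parallel_grad(w, xs, ys, num_workers=4)
print(full_batch, parallel)
```

With equally sized shards, the averaged per-worker gradient matches the full-batch gradient, which is why adding workers can speed up training without changing the update that is applied.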

Prerequisites

This lab is self-contained: all of the content you need is either produced by the notebooks themselves or included in the directory. If you are attending an AWS-led workshop, you will most likely use the Event Engine to manage your AWS account.

If not, make sure you have an AWS account with a SageMaker Studio domain already created. In that account, request a service limit increase for the ml.g4dn.12xlarge instance type for SageMaker training.
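If you prefer to check the limit from code rather than the console, the lookup could be sketched roughly as follows. This is a hedged example, not an official procedure. It assumes the AWS Service Quotas API as exposed by boto3 (`list_service_quotas` and `request_service_quota_increase`), the service code `sagemaker`, and a quota name containing `ml.g4dn.12xlarge for training job usage`. The helper names and the desired value are hypothetical, so verify the quota name in your own account before relying on it.

```python
def find_quota(client, service_code, name_fragment):
    """Return the first quota whose name contains name_fragment, or None.

    `client` is expected to behave like boto3.client("service-quotas").
    """
    kwargs = {"ServiceCode": service_code}
    while True:
        page = client.list_service_quotas(**kwargs)
        for quota in page.get("Quotas", []):
            if name_fragment in quota["QuotaName"]:
                return quota
        token = page.get("NextToken")
        if not token:
            return None
        kwargs["NextToken"] = token  # fetch the next page of results


def request_increase_if_needed(client, service_code, quota, desired_value):
    """Request an increase only when the current limit is below desired_value."""
    if quota["Value"] >= desired_value:
        return None  # limit is already high enough
    return client.request_service_quota_increase(
        ServiceCode=service_code,
        QuotaCode=quota["QuotaCode"],
        DesiredValue=float(desired_value),
    )


# Example usage (requires boto3 and AWS credentials; values are assumptions):
#   import boto3
#   client = boto3.client("service-quotas")
#   quota = find_quota(client, "sagemaker", "ml.g4dn.12xlarge for training job usage")
#   if quota is not None:
#       request_increase_if_needed(client, "sagemaker", quota, desired_value=2)
```

Passing the client in as an argument keeps the helpers easy to try against a stub before pointing them at a real account.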

Top papers and case studies

Some relevant papers for your reference:

SageMaker Data Parallel, aka Herring. In this paper we introduce a custom high performance computing configuration for distributed gradient descent on AWS, available within Amazon SageMaker Training.
SageMaker Model Parallel. In this paper we propose a model parallelism framework, available within Amazon SageMaker Training, that reduces out-of-memory errors and enables training of GPT-3-sized models and larger. See our case study achieving 32 samples per second with 175B parameters on SageMaker over 140 p4d nodes.
Amazon Search speeds up training by 7.3x on SageMaker. In this blog post we introduce two new features on Amazon SageMaker: support for native PyTorch DDP and PyTorch Lightning integration with SM DDP. We also discuss how Amazon Search sped up their overall training time by 7.3x by moving to distributed training.
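To make the model parallel case study above more concrete, the following back-of-the-envelope calculation shows why a 175B-parameter model cannot fit on a single GPU. The parameter count, node count, and throughput come from the case study. The hardware figures (8 GPUs per p4d node, 40 GB of memory per GPU) and the 2-byte half-precision weights are assumptions made for this sketch, and the estimate covers weights only, ignoring gradients, optimizer state, and activations.

```python
import math

# Figures quoted in the case study
PARAMS = 175e9             # model parameters
NODES = 140                # p4d nodes
SAMPLES_PER_SECOND = 32    # reported training throughput

# Assumptions for this sketch (not stated in the text above)
GPUS_PER_NODE = 8          # assumed GPUs per p4d node
GPU_MEMORY_GB = 40         # assumed memory per GPU, in GB
BYTES_PER_PARAM = 2        # assumed half-precision (fp16/bf16) weights

# Memory needed just to store the weights, in GB
weights_gb = PARAMS * BYTES_PER_PARAM / 1e9

# Smallest number of GPUs whose combined memory holds the weights alone
min_gpus_for_weights = math.ceil(weights_gb / GPU_MEMORY_GB)

# Cluster size and per-GPU share of the reported throughput
total_gpus = NODES * GPUS_PER_NODE
samples_per_second_per_gpu = SAMPLES_PER_SECOND / total_gpus

print(f"weights: {weights_gb:.0f} GB")
print(f"GPUs needed for weights alone: {min_gpus_for_weights}")
print(f"total GPUs in the cluster: {total_gpus}")
print(f"throughput per GPU: {samples_per_second_per_gpu:.4f} samples/s")
```

Under these assumptions the weights alone exceed the memory of one GPU several times over, which is the gap that partitioning a model across devices is meant to close.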

