Serverless computing has revolutionized the way distributed machine learning (ML) tasks are performed by eliminating the need for managing servers and allowing developers to focus purely on code. This document explores various serverless frameworks that have been specifically designed to enhance distributed ML tasks in a serverless environment.
SPIRT operates on a peer-to-peer design in which each worker maintains its own database and a serverless workflow orchestrated by AWS Step Functions. The workflow encompasses:
- Fetching minibatches.
- Parallel gradient computation.
- Storing gradients in the worker's database.
- Averaging gradients within the database.
- Notifying completion via a synchronization queue.
- Polling the synchronization queue.
- Retrieving averaged gradients from peer databases.
- Aggregating these averages.
- Updating local models.
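The per-worker steps above can be sketched in plain Python, with dicts standing in for each worker's Redis database and a list standing in for the synchronization queue. All names here are illustrative, not the actual SPIRT API.

```python
def spirt_step(worker_id, databases, sync_queue, minibatches, compute_grad):
    """One SPIRT compute phase for a single worker."""
    db = databases[worker_id]

    # 1-3. Fetch minibatches, compute gradients in parallel, store them.
    grads = [compute_grad(mb) for mb in minibatches]
    db["gradients"] = grads

    # 4. Average gradients "within the database" (here: in place).
    db["avg_gradient"] = sum(grads) / len(grads)

    # 5. Notify completion via the synchronization queue.
    sync_queue.append(worker_id)


def spirt_aggregate(worker_id, databases, sync_queue, num_workers):
    """One SPIRT aggregation phase for a single worker."""
    # 6. Poll the queue until every peer has reported completion.
    assert len(set(sync_queue)) == num_workers

    # 7-8. Retrieve averaged gradients from peer databases and aggregate.
    peer_avgs = [databases[w]["avg_gradient"] for w in databases]
    global_grad = sum(peer_avgs) / len(peer_avgs)

    # 9. Update the local model (a plain SGD step for illustration).
    lr = 0.1
    databases[worker_id]["model"] -= lr * global_grad
    return global_grad
```

In the real system each phase is a Lambda function and the "database" is a per-peer Redis instance; this sketch only captures the data flow.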
In MLLESS, the workflow includes:
- Fetching a minibatch.
- Computing the gradient.
- Storing significant gradients in a shared database.
- Monitoring queues for updates.
- Waiting for all updates as communicated by the supervisor.
- Fetching and aggregating the corresponding gradients.
- Updating the model with the aggregated gradients.
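The distinguishing step in MLLESS is publishing only significant gradients. A minimal sketch of that idea, assuming a simple magnitude threshold (MLLESS's actual significance criterion may differ), looks like this:

```python
def significant_updates(gradient, threshold):
    """Keep only gradient entries whose magnitude is at least `threshold`,
    returning a sparse index -> value mapping to store in the shared database."""
    return {i: g for i, g in enumerate(gradient) if abs(g) >= threshold}


def apply_updates(model, updates, lr=0.1):
    """Update the model with the (sparse) aggregated gradients;
    entries that were filtered out contribute nothing."""
    return [w - lr * updates.get(i, 0.0) for i, w in enumerate(model)]
```

Filtering insignificant entries reduces what each worker writes to the shared database and what peers must fetch, at the cost of a slightly stale model for the dropped coordinates.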
The ScatterReduce approach involves:
- Fetching minibatches.
- Computing gradients.
- Dividing and distributing gradient chunks.
- Fetching and aggregating assigned chunks.
- Sending aggregated chunks back to the database.
- Retrieving and concatenating all aggregated chunks.
- Updating the model with the complete gradient.
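The chunking scheme above can be sketched as follows: each worker's gradient is split into one chunk per worker, worker k aggregates chunk k from every peer, and the aggregated chunks are concatenated back into the full gradient. The function names and the use of plain lists are illustrative.

```python
def split_chunks(grad, num_workers):
    """Divide a gradient into num_workers roughly equal chunks."""
    size = (len(grad) + num_workers - 1) // num_workers
    return [grad[i * size:(i + 1) * size] for i in range(num_workers)]


def scatter_reduce(worker_grads):
    """Simulate one scatter-reduce round across all workers' gradients."""
    num_workers = len(worker_grads)
    # Divide and distribute gradient chunks.
    chunked = [split_chunks(g, num_workers) for g in worker_grads]
    # Worker k fetches chunk k from every peer and averages it.
    reduced = []
    for k in range(num_workers):
        peer_chunks = [chunked[w][k] for w in range(num_workers)]
        reduced.append([sum(vals) / num_workers for vals in zip(*peer_chunks)])
    # Retrieve and concatenate all aggregated chunks into the full gradient.
    return [v for chunk in reduced for v in chunk]
```

Because each worker only aggregates one chunk, the reduction work is spread evenly instead of falling on a single master.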
The AllReduce-LambdaML framework proceeds as follows:
- Fetching a minibatch.
- Computing gradients.
- Sending gradients to a shared database.
- Aggregating all gradients into a single unified gradient (performed by a designated master worker).
- Sending the aggregated gradient back to the database.
- Each worker updates their local models with the aggregated gradient.
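A minimal sketch of the master-based aggregation above, with a dict standing in for the shared database (the key names and averaging rule are illustrative):

```python
def master_aggregate(database, num_workers):
    """Designated master: average all workers' gradients and write the
    single unified gradient back to the shared store."""
    grads = [database[f"grad_{w}"] for w in range(num_workers)]
    merged = [sum(vals) / num_workers for vals in zip(*grads)]
    database["aggregated"] = merged
    return merged


def worker_update(model, database, lr=0.1):
    """Each worker applies the same aggregated gradient to its local model."""
    agg = database["aggregated"]
    return [w - lr * g for w, g in zip(model, agg)]
```

Unlike ScatterReduce, the entire reduction here runs on one designated master worker, which simplifies coordination but concentrates the aggregation cost.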
The GPU-based training framework proceeds as follows:
- Fetching a minibatch.
- Transferring the minibatch to GPU memory.
- Executing the forward propagation on the GPU to compute predictions.
- Performing the backward propagation on the GPU to calculate gradients.
- Updating the model parameters directly on the GPU via the optimizer.
- Logging metrics (e.g., loss, accuracy) and proceeding to the next minibatch.
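The loop above can be illustrated with a CPU stand-in: a one-parameter linear model trained with hand-written SGD. In the actual baseline these steps run in PyTorch on the GPU (tensors moved with `.to(device)`); here the forward pass, backward pass, and optimizer update are spelled out numerically.

```python
def train_epoch(weights, minibatches, lr=0.1):
    """One epoch over (inputs, targets) minibatches for y = w * x."""
    logs = []
    for xs, ys in minibatches:                      # fetch a minibatch
        # Forward propagation: predictions of the 1-D linear model.
        preds = [weights[0] * x for x in xs]
        # Mean squared error loss.
        loss = sum((p - y) ** 2 for p, y in zip(preds, ys)) / len(xs)
        # Backward propagation: gradient of the loss w.r.t. the weight.
        grad = sum(2 * (p - y) * x for p, y, x in zip(preds, ys, xs)) / len(xs)
        # Optimizer update, done in place on the parameter.
        weights[0] -= lr * grad
        logs.append(loss)                           # log metrics, next batch
    return weights, logs
```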
This work was implemented on the AWS platform.
This section provides a configuration setup for AWS Lambda Layers. Lambda functions require libraries and dependencies to fulfill their functionalities, which poses challenges due to AWS Lambda's deployment package size limitations: packages must adhere to a 50 MB limit when compressed, and the uncompressed size must not exceed 250 MB. ARM-based layers are inherently smaller, which helps stay within these limits. To facilitate seamless deployment on our custom ARM architecture, we packaged the ML dependencies, including the PyTorch library, in a zip file. Where additional dependencies are necessary, they can be integrated as separate layers within the AWS Lambda service.
To use the provided Lambda layers, follow these steps to download and integrate them into your AWS Lambda function:
- Navigate to the Lambda Layers Directory:

  cd TrainingML_P2P_Serverless/Lambda_Layers

- Download the Required Layers:

  # Example: download a specific layer (replace 'your_layer.zip' with the actual file name)
  wget https://github.com/AmineBarrak/TrainingML_P2P_Serverless/raw/main/Lambda_Layers/your_layer.zip

- Upload Layers to AWS Lambda: Use the AWS CLI or the AWS Management Console to upload the downloaded ZIP file as a new layer in Lambda.

  aws lambda publish-layer-version --layer-name "YourLayerName" --zip-file fileb://your_layer.zip --compatible-architectures "arm64"
The directory 'TrainingML_P2P_Serverless/SPIRT/Functions/' contains a collection of AWS Lambda functions specifically designed to handle the workflow of a peer-to-peer (P2P) machine learning training process on a serverless architecture. Each function plays a critical role in various stages of the ML model lifecycle, from initialization through training to convergence checks.
- init_peers: Initializes RSA key pairs and signatures for peer verification.
- init_model: Standardizes model initialization across peers and starts the AWS Step Function for training.
- Auth: Verifies signatures from peers and manages their inclusion in the network.
- compute_gradients: Computes gradients for all batches.
- Trigger_Average_Gradient_sync: Averages gradients within the database.
- check_heartbeat: Monitors the health of peers' databases for fault tolerance.
- Aggregation: Gathers and aggregates gradients from all peers.
- Trigger_update_model: Triggers updates to the local model in the database.
- check-convergence: Checks for model convergence to finalize training.
- Update_Trigger_Next_epoch: Triggers the next training epoch via a new AWS Step Function.
To deploy these Lambda functions, follow these steps:
- Navigate to the Functions Directory: Change into the directory containing the Lambda functions.
cd TrainingML_P2P_Serverless/SPIRT/Functions
- Create Lambda Functions in AWS: For each function, use the AWS CLI to create a new Lambda function. Below is an example command to create one Lambda function. Repeat this for each script.
aws lambda create-function --function-name init_peers \
--runtime python3.8 \
--role [Role-ARN] \
--handler init_peers.lambda_handler \
--zip-file fileb://init_peers.zip \
--timeout 15 \
--memory-size 128
Note: Replace '[Role-ARN]' with the actual ARN of an IAM role that has Lambda execution permissions. Ensure each function's handler and file name are correctly specified.
For the orchestration of various tasks within the training process, we utilize AWS Step Functions. A dynamic state machine is initiated for each training epoch, coordinating tasks such as gradient computations and model updates. The state machine is defined in JSON format, which outlines the sequence of tasks and the conditions for transitions between states. Each state within the machine corresponds to a specific Lambda function, ensuring a structured and automated workflow for the machine learning training process.
To generate the JSON description of the state machine and to execute it specifying the number of batches, follow these steps:
- Navigate to the following directory:
cd TrainingML_P2P_Serverless/SPIRT/Step_Function/
- Run the Script: Execute the script with the desired number of batches. For example, to create a state machine for five batches:
python create_step_function.py --batches 5
- Create the State Machine in AWS Step Functions: Use the JSON output from the script to create a state machine in AWS Step Functions via the AWS Management Console or the AWS CLI.
- Assign IAM Roles: Ensure that the IAM roles associated with the state machine have the necessary permissions to invoke the Lambda functions.
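To make the dynamic state-machine idea concrete, here is a small generator that chains one task state per batch in Amazon States Language. The state names and the Lambda ARN are placeholders; the real create_step_function.py may emit a different structure.

```python
import json


def build_state_machine(num_batches,
                        lambda_arn="arn:aws:lambda:REGION:ACCOUNT:function:compute_gradients"):
    """Build a linear chain of Task states, one per batch."""
    states = {}
    for i in range(num_batches):
        name = f"ComputeGradients_{i}"
        state = {"Type": "Task", "Resource": lambda_arn}
        if i < num_batches - 1:
            state["Next"] = f"ComputeGradients_{i + 1}"  # transition condition
        else:
            state["End"] = True                          # terminal state
        states[name] = state
    return {"StartAt": "ComputeGradients_0", "States": states}


# JSON definition ready to pass to Step Functions (console or CLI).
definition = json.dumps(build_state_machine(5), indent=2)
```

The resulting JSON can then be supplied to `aws stepfunctions create-state-machine` along with a role ARN that can invoke the Lambda functions.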
Following this, we compiled the RedisAI module with enhanced capabilities using these steps:
- Clear any previous builds: make -C opt clean ALL=1
- Build the module: make -C opt

The current version is already compiled; you do not need to recompile unless you want to add more features.
To deploy RedisAI on an AWS EC2 instance, you would first need to set up an EC2 instance on Amazon Web Services. Ensure that Redis is installed on the EC2 instance. You also need to copy the downloaded RedisAI source code from your repository to the EC2 server.
To integrate the RedisAI module when initiating the Redis server, utilize the --loadmodule command line option and provide the accurate path to the module’s library, as shown below:
redis-server --port 6380 --loadmodule ./bin/linux-x64-release/src/redisai.so
We have added new functionality to RedisAI that allows the updating of model parameters directly within the database environment. The new command provided by RedisAI for updating model parameters directly in Redis is as follows:
AI.ModelUpdate <Params key> <Lr> <Grads key>
Parameters:
- Params key: The key in Redis where the model parameters are stored.
- Lr: Learning rate to be applied during the update.
- Grads key: The key where the gradients are stored that will be used to update the model.
This setup allows you to manage and update your AI models directly on your EC2 instance, leveraging the power and flexibility of cloud computing with AWS.
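Given its arguments, the in-database update performed by AI.ModelUpdate presumably amounts to a plain SGD step over the stored tensors. A sketch of that semantics, with a dict standing in for Redis key storage:

```python
def model_update(store, params_key, lr, grads_key):
    """Apply params <- params - lr * grads using the values stored
    under the two keys, writing the result back under params_key."""
    params = store[params_key]
    grads = store[grads_key]
    store[params_key] = [p - lr * g for p, g in zip(params, grads)]
    return store[params_key]
```

Performing this step inside the database avoids shipping the full parameter tensor to a worker and back for every update.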
The MLLESS framework can be deployed with or without the significant-update mechanism. Follow these instructions based on your specific requirements:

- Choosing the Deployment Type
  - For deployments with significant updates, use the files located in ./Replication/Mlless_replication/MLLESS_significant_update.
  - For deployments without significant updates, use the files located in ./Replication/Mlless_replication/MLLESS_non_significant.

- Lambda Function Deployment
  - Server Function: deploy all the content from the appropriate server folder in the MLLESS_significant_update or MLLESS_non_significant path to a new Lambda function named 'Server'.
  - Worker Function: similarly, deploy all the content from the appropriate worker folder to a new Lambda function named 'Worker'.

- Automated Parallel Triggering
  - Server: use the trigger_server.py script to manually or automatically trigger the server function.
  - Worker: use the trigger_worker.py script to trigger parallel worker functions. Modify the number of workers within the script as needed to match your deployment's scale.
To deploy and run the ScatterReduce framework, please follow these instructions:

- Prepare Files for Lambda Function
  - Locate the required files within the folder ./Replication/ScatterReduce_replication.
  - Upload these files to a new Lambda function in your cloud environment.

- Lambda Function Deployment
  - Create a Lambda function and ensure that all files from the ./Replication/ScatterReduce_replication folder are included as the source code for the function.

- Add Required Layers
  - The Lambda function requires specific layers to execute properly. Ensure to add the following layers:
    - numpy - for numerical operations.
    - torch - for machine learning computations.
    - PIL (Pillow) - for image processing capabilities.

- Executing the Lambda Function
  - Trigger the Lambda function using the following event structure:

    event_data = {
        "data_bucket": "your-bucket-name-here",
        "rank": rank,                 # Worker index for this process
        "num_workers": num_workers    # Total number of workers
    }

  - Replace "your-bucket-name-here" with the actual name of your data bucket, and set rank and num_workers as per your specific use case.

- Automated Parallel Triggering
  - To trigger the Lambda function in parallel for multiple workers automatically, use the trigger_scatter_reduce.py script.
  - This script manages and automates the triggering process for all Lambda instances concurrently, ensuring efficient parallel execution.
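The parallel triggering done by a script like trigger_scatter_reduce.py can be sketched as follows: invoke the same Lambda once per worker, each call carrying its own rank. A stub `invoke` callable replaces boto3 so the pattern is self-contained; in the real script the call would be `client.invoke(FunctionName=..., InvocationType='Event', Payload=...)`.

```python
import json
from concurrent.futures import ThreadPoolExecutor


def trigger_workers(invoke, data_bucket, num_workers):
    """Fire one asynchronous invocation per worker rank, in parallel."""
    def launch(rank):
        payload = {"data_bucket": data_bucket,
                   "rank": rank,
                   "num_workers": num_workers}
        return invoke(json.dumps(payload))

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        return list(pool.map(launch, range(num_workers)))
```

Triggering all ranks concurrently matters here: the workers synchronize through the shared database, so a sequentially launched cohort would leave early workers idle while waiting for the last one to start.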
To set up and run the AllReduce framework, follow these steps:

- Lambda Function Deployment
  - Navigate to the source code located at ./Replication/AllReduce_replication/lambda_function.py.
  - Create a new Lambda function in your cloud environment and upload this file as the source code.

- Add Required Layers
  - The Lambda function requires the following layers to execute properly:
    - numpy - a fundamental package for scientific computing with Python.
    - torch - an open-source machine learning library.
    - PIL (Pillow) - a Python imaging library that adds image processing capabilities.

- Executing the Lambda Function
  - Trigger the Lambda function using the following event structure:

    event_data = {
        "data_bucket": "your-bucket-name-here",
        "rank": rank,                 # Worker index for this process
        "num_workers": num_workers    # Total number of workers
    }

  - Replace "your-bucket-name-here" with the actual name of your data bucket, and set rank and num_workers as per your specific use case.

- Automated Parallel Triggering
  - To trigger the Lambda function in parallel for multiple workers automatically, use the trigger_all_reduce.py script.
  - This script manages and automates the triggering process for all Lambda instances concurrently, ensuring efficient parallel execution.
To run the GPU-based training baseline, follow these steps:

- Launch a GPU Instance
  - Use a GPU-enabled instance (e.g., g4dn.xlarge on AWS EC2).
  - Make sure Python 3.8+ is installed.

- Install Required Packages
  - Install the following packages:

    pip install torch torchvision boto3 pillow

- Prepare the Training Script
  - Script location: ./Replication/GPU_base_replication/train_worker.py

- S3 Configuration
  - Make sure your AWS credentials are configured:

    aws configure

  - In the train_worker.py script, replace the S3 bucket name with your own bucket name where the dataset or logs are stored.

- Run the Training
  - Execute the script with:

    python train_worker.py




