
Auto Scaling Central Issue #174

Open
suhlrich opened this issue Apr 19, 2024 · 7 comments

@suhlrich
Member

suhlrich commented Apr 19, 2024

We'd like to have surge GPU capacity using AWS auto-scaling. We will have base capacity that is always running, so this will only activate if the queue is a certain length.

@olehkorkh-planeks @sashasimkin @antoinefalisse please read over and update this.

@sashasimkin

Hi @suhlrich, I have a few small comments on the logic:

add a variable in cloudwatch desired_asg_gpu_instances that will get updated by the celery queue check and checked by the auto-scaling rule. @sashasimkin

There are no variables in CloudWatch per se; all you need to do is call put_metric_data (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/cloudwatch/client/put_metric_data.html) from the celery task.

add celery task that checks number of trials and updates desired_asg_gpu_instances on cloudwatch

I advise that instead of maintaining a desired_asg_gpu_instances metric, we simply publish the raw count of pending trials with put_metric_data, something like the example below.

import datetime

import boto3

# Initialize the CloudWatch client
cloudwatch = boto3.client('cloudwatch')

metric_data = [{
    'MetricName': 'opencap_trials_pending',
    'Dimensions': [
        {
            'Name': 'Environment',
            'Value': env  # placeholder, e.g. 'dev' or 'prod'
        }
    ],
    'Timestamp': datetime.datetime.now(),
    'Value': 100,  # the count of pending trials
    'Unit': 'Count'
}]

try:
    response = cloudwatch.put_metric_data(
        Namespace='YourApplicationNamespace',
        MetricData=metric_data
    )
    print("Metric successfully uploaded")
except Exception as e:
    print("Failed to upload metric:", e)

automatically start EC2 machine with opencap-core docker + IAM roles

I advise that we use ECS on EC2 to simplify running the image that you are pushing to ECR. I saw some code related to this in the infra repo, but it needs checking and polishing to make it work in general and with auto-scaling.
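
As a rough sketch of the ECS side (the family name, image URI, and resource sizes here are placeholders, not final values), registering a task definition that reserves a GPU could look like:

import boto3

ecs = boto3.client('ecs')

response = ecs.register_task_definition(
    family='opencap-core-gpu',  # placeholder family name
    requiresCompatibilities=['EC2'],
    networkMode='bridge',
    containerDefinitions=[{
        'name': 'opencap-core',
        'image': '<account>.dkr.ecr.<region>.amazonaws.com/opencap-core:latest',  # image pushed to ECR
        'cpu': 4096,
        'memory': 16384,  # adjust to the chosen instance type
        'essential': True,
        # Reserve one GPU on the EC2 container instance for this task.
        'resourceRequirements': [{'type': 'GPU', 'value': '1'}]
    }]
)
print(response['taskDefinition']['taskDefinitionArn'])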

create ASG scaling logic that gets desired_asg_gpu_instances from cloudwatch and spins up/down machines. Spun up machines should have scale-in protection.

This will be just target tracking scaling (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-autoscaling-targettracking.html) on an agreed target value of opencap_trials_pending.
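
As a sketch only (cluster/service names, capacities, and the target value are placeholders to be agreed on), the target tracking policy could be set up like this:

import boto3

appscaling = boto3.client('application-autoscaling')

resource_id = 'service/opencap-cluster/opencap-core-gpu'  # placeholder cluster/service

# Let Application Auto Scaling manage the ECS service's DesiredCount.
appscaling.register_scalable_target(
    ServiceNamespace='ecs',
    ResourceId=resource_id,
    ScalableDimension='ecs:service:DesiredCount',
    MinCapacity=1,   # base capacity that is always running
    MaxCapacity=10   # surge ceiling
)

# Track the custom queue metric published by the celery task.
appscaling.put_scaling_policy(
    PolicyName='opencap-trials-target-tracking',
    ServiceNamespace='ecs',
    ResourceId=resource_id,
    ScalableDimension='ecs:service:DesiredCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 5.0,  # agreed value for opencap_trials_pending
        'CustomizedMetricSpecification': {
            'MetricName': 'opencap_trials_pending',
            'Namespace': 'YourApplicationNamespace',
            'Dimensions': [{'Name': 'Environment', 'Value': 'prod'}],
            'Statistic': 'Average'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 60
    }
)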

@suhlrich
Member Author

@sashasimkin: Is it not possible to use an n_desired_asg_instances variable for the ASG target? This way, we can implement whatever logic we like here (#173), accessible both by the ASG and within the GPU servers so they know when to shut down.

@sashasimkin

@suhlrich it is possible to use a desired_asg_gpu_instances metric for scaling the number of instances, but that's not how it's usually done, which is why I suggested a different, simpler approach.

In general, the application doesn't manage the number of instances that process the jobs; that logic is implemented in the infrastructure layer based on various factors.

I've replied here about termination logic.

@suhlrich
Member Author

suhlrich commented Apr 24, 2024

@sashasimkin So we can implement logic similar to #173 at the infrastructure level?

@sashasimkin

@suhlrich yes, exactly, and the logic will be simpler.

I.e. instead of calculating the number of instances and tracking the numbers before/after scaling, we will have simpler target tracking that periodically checks whether the number of jobs is less or more than 5*n_machines and scales in/out accordingly between the min & max Auto Scaling group size.
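
To make the 5*n_machines rule concrete, one way (a sketch; the metric name, helper, and ASG name are illustrative, not decided) is to publish pending trials per running instance and let a target tracking policy hold that value near 5:

import boto3

autoscaling = boto3.client('autoscaling')
cloudwatch = boto3.client('cloudwatch')

ASG_NAME = 'opencap-gpu-asg'  # placeholder Auto Scaling group name

def publish_backlog_per_instance(pending_trials):
    # Count instances currently in the Auto Scaling group (at least 1 to avoid division by zero).
    groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    n_machines = max(len(groups['AutoScalingGroups'][0]['Instances']), 1)

    cloudwatch.put_metric_data(
        Namespace='YourApplicationNamespace',
        MetricData=[{
            'MetricName': 'opencap_trials_pending_per_instance',
            'Value': pending_trials / n_machines,
            'Unit': 'Count'
        }]
    )

# A target tracking policy on this metric with TargetValue=5.0 then scales out when
# pending > 5 * n_machines and scales in when pending < 5 * n_machines, bounded by
# the group's min and max size.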

@antoinefalisse
Collaborator

@sashasimkin let's use g5.2xlarge instances.
