
Auto Scaling Central Issue #174

Open
suhlrich opened this issue Apr 19, 2024 · 7 comments

@suhlrich
Member

suhlrich commented Apr 19, 2024

We'd like to have surge GPU capacity using AWS auto-scaling. We will have base capacity that is always running, so this will only activate if the queue is a certain length.

@olehkorkh-planeks @sashasimkin @antoinefalisse please read over and update this.

@sashasimkin

Hi @suhlrich, I have a few small comments on the logic:

add a variable in cloudwatch desired_asg_gpu_instances that will get updated by the celery queue check and checked by the auto-scaling rule. @sashasimkin

There are no variables in CloudWatch per se; all you need to do is call put_metric_data (https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/cloudwatch/client/put_metric_data.html) from the celery task.

add celery task that checks number of trials and updates desired_asg_gpu_instances on cloudwatch

I advise that instead of maintaining a desired_asg_gpu_instances metric, we simply publish the raw count of pending trials with put_metric_data, something like the example below.

import datetime

import boto3

# Initialize the CloudWatch client
cloudwatch = boto3.client('cloudwatch')

metric_data = [{
    'MetricName': 'opencap_trials_pending',
    'Dimensions': [
        {
            'Name': 'Environment',
            'Value': env  # placeholder, e.g. 'dev' or 'prod'
        }
    ],
    'Timestamp': datetime.datetime.now(),
    'Value': 100,  # the count of pending trials
    'Unit': 'Count'
}]

try:
    response = cloudwatch.put_metric_data(
        Namespace='YourApplicationNamespace',
        MetricData=metric_data
    )
    print("Metric successfully uploaded")
except Exception as e:
    print("Failed to upload metric:", e)

automatically start EC2 machine with opencap-core docker + IAM roles

I advise that we use ECS on EC2 to simplify running the image that you are pushing to ECR. I saw some code related to this in the infra repo, but it needs checking and polishing to make it work in general and with auto-scaling.
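
As a rough sketch of the ECS side (the family name, image URI, and resource sizes here are placeholders, not final values), registering a task definition that reserves a GPU could look like:

import boto3

ecs = boto3.client('ecs')

response = ecs.register_task_definition(
    family='opencap-core-gpu',  # placeholder family name
    requiresCompatibilities=['EC2'],
    networkMode='bridge',
    containerDefinitions=[{
        'name': 'opencap-core',
        'image': '<account>.dkr.ecr.<region>.amazonaws.com/opencap-core:latest',  # image pushed to ECR
        'cpu': 4096,
        'memory': 16384,  # adjust to the chosen instance type
        'essential': True,
        # Reserve one GPU on the EC2 container instance for this task.
        'resourceRequirements': [{'type': 'GPU', 'value': '1'}]
    }]
)
print(response['taskDefinition']['taskDefinitionArn'])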

create ASG scaling logic that gets desired_asg_gpu_instances from cloudwatch and spins up/down machines. Spun up machines should have scale-in protection.

This will be just target tracking scaling (https://docs.aws.amazon.com/AmazonECS/latest/developerguide/service-autoscaling-targettracking.html) on an agreed target value of opencap_trials_pending.
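
As a sketch only (cluster/service names, capacities, and the target value are placeholders to be agreed on), the target tracking policy could be set up like this:

import boto3

appscaling = boto3.client('application-autoscaling')

resource_id = 'service/opencap-cluster/opencap-core-gpu'  # placeholder cluster/service

# Let Application Auto Scaling manage the ECS service's DesiredCount.
appscaling.register_scalable_target(
    ServiceNamespace='ecs',
    ResourceId=resource_id,
    ScalableDimension='ecs:service:DesiredCount',
    MinCapacity=1,   # base capacity that is always running
    MaxCapacity=10   # surge ceiling
)

# Track the custom queue metric published by the celery task.
appscaling.put_scaling_policy(
    PolicyName='opencap-trials-target-tracking',
    ServiceNamespace='ecs',
    ResourceId=resource_id,
    ScalableDimension='ecs:service:DesiredCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 5.0,  # agreed value for opencap_trials_pending
        'CustomizedMetricSpecification': {
            'MetricName': 'opencap_trials_pending',
            'Namespace': 'YourApplicationNamespace',
            'Dimensions': [{'Name': 'Environment', 'Value': 'prod'}],
            'Statistic': 'Average'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 60
    }
)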

@suhlrich
Member Author

@sashasimkin: Is it not possible to use an n_desired_asg_instances variable for the ASG target? This way, we can implement whatever logic we like here (#173), accessible both by the ASG and within the GPU servers so they know when to shut down.

@sashasimkin

@suhlrich it is possible to use a desired_asg_gpu_instances metric for scaling the number of instances, but that's not how it's usually done, which is why I suggested a different, simpler approach.

In general, the application doesn't manage the number of instances that process the jobs; that logic is implemented in the infrastructure layer based on various factors.

I've replied here about termination logic.

@suhlrich
Member Author

suhlrich commented Apr 24, 2024

@sashasimkin So we can implement logic similar to #173 at the infrastructure level?

@sashasimkin

@suhlrich yes, exactly, and the logic will be simpler.

I.e. instead of calculating the number of instances and tracking the numbers before/after scaling, we will have simpler target tracking that periodically checks whether the number of jobs is less or more than 5*n_machines and scales in/out accordingly between the min & max Auto Scaling group size.
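
To make the 5*n_machines rule concrete, one way (a sketch; the metric name, helper, and ASG name are illustrative, not decided) is to publish pending trials per running instance and let a target tracking policy hold that value near 5:

import boto3

autoscaling = boto3.client('autoscaling')
cloudwatch = boto3.client('cloudwatch')

ASG_NAME = 'opencap-gpu-asg'  # placeholder Auto Scaling group name

def publish_backlog_per_instance(pending_trials):
    # Count instances currently in the Auto Scaling group (at least 1 to avoid division by zero).
    groups = autoscaling.describe_auto_scaling_groups(AutoScalingGroupNames=[ASG_NAME])
    n_machines = max(len(groups['AutoScalingGroups'][0]['Instances']), 1)

    cloudwatch.put_metric_data(
        Namespace='YourApplicationNamespace',
        MetricData=[{
            'MetricName': 'opencap_trials_pending_per_instance',
            'Value': pending_trials / n_machines,
            'Unit': 'Count'
        }]
    )

# A target tracking policy on this metric with TargetValue=5.0 then scales out when
# pending > 5 * n_machines and scales in when pending < 5 * n_machines, bounded by
# the group's min and max size.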

@antoinefalisse
Collaborator

@sashasimkin let's use g5.2xlarge instances.
