
Backend scaling -- stop instance #113

Open
3 of 5 tasks
suhlrich opened this issue Sep 25, 2023 · 8 comments · Fixed by #117

Comments

@suhlrich
Member

suhlrich commented Sep 25, 2023

We need to make sure the backend machines aren't processing a job when they are stopped. When a machine has not received a job for a certain amount of time, it should exit the app.py loop, remove its scale-in protection, and be ready for autoscaling to turn it off. There also needs to be an environment variable distinguishing always-on machines from ASG machines.
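
A rough sketch of what that could look like in the app.py loop; the INSTANCE_TYPE and IDLE_TIMEOUT_MIN environment variables and the job helpers below are hypothetical placeholders, not names from the codebase:

import os
import time
from datetime import datetime, timedelta

# Hypothetical names: these env vars and job helpers stand in for whatever app.py already uses.
IDLE_TIMEOUT = timedelta(minutes=int(os.getenv("IDLE_TIMEOUT_MIN", "10")))
IS_ASG_MACHINE = os.getenv("INSTANCE_TYPE", "always_on") == "asg"

def get_next_job():
    return None  # stub: poll the queue/API for the next trial to process

def process_job(job):
    pass  # stub: run the actual processing

def remove_scale_in_protection():
    pass  # stub: see the boto3 snippet later in this thread

def main_loop():
    last_job_time = datetime.utcnow()
    while True:
        job = get_next_job()
        if job is not None:
            process_job(job)
            last_job_time = datetime.utcnow()
        elif IS_ASG_MACHINE and datetime.utcnow() - last_job_time > IDLE_TIMEOUT:
            remove_scale_in_protection()
            break  # exit the app.py loop so autoscaling can stop the instance
        else:
            time.sleep(5)  # idle wait before polling again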

@sashasimkin

  1. I'll provide the bash script to disable the protection later; the important part is configuring the IAM role & instance profile to have permissions to execute that command.
  2. I believe there's no need to access any metric; the logic should be: if no jobs for X minutes --> pause work.
  3. It's important to sleep in the cycle rather than stop the process, because upon stopping, ECS would restart the container, which might have the adverse effect of resetting counters (see the sketch below).
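
A minimal sketch of that pause-instead-of-exit pattern from point 3; the helper names are hypothetical:

import time

PAUSE_SECONDS = 60  # hypothetical back-off interval while idle

def run(no_jobs_for_x_minutes, do_work):
    """Hypothetical worker cycle: sleep while idle instead of exiting, so ECS does not restart the container."""
    while True:
        if no_jobs_for_x_minutes():
            time.sleep(PAUSE_SECONDS)  # stay alive; in-memory counters and state are preserved
            continue
        do_work()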

@suhlrich
Member Author

  1. The problem lies with our base capacity. We want to turn off the EC2 instances once the number of jobs falls below some threshold that we think is manageable on our on-prem server. If we turn off EC2 instances only when they don't get a job, then we only turn them off when the queue is empty, which would leave them running longer than necessary whenever they happen to pick up a job while the on-prem servers are not busy. If we query the number of desired machines, then we can have more complex logic for deciding when to turn off.

@sashasimkin

Hey @suhlrich, I understand the base capacity problem and I was wrong regarding the inactivity logic.
At the same time, what we can do is change the logic to something like:

if number_of_jobs < settings.ONPREM_CAPACITY:
    unprotect_instance()
    pause_work()

This approach leaves the auto-scaling logic to the policies configured in AWS and separates concerns between what the application does and what the infra control plane does.

@suhlrich
Member Author

Sounds good. It sounds like we should create an API endpoint that returns number_of_jobs (unless there's an easy way to get it from CloudWatch, but an HTTP request sounds easy to me).
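
For reference, a minimal sketch of what polling such an endpoint from the worker could look like, assuming a hypothetical /api/jobs/pending route and response shape (neither exists yet):

import requests

API_BASE = "https://api.example.org"  # hypothetical base URL for the opencap-api backend

def get_number_of_jobs():
    """Ask the backend API (the source of truth) how many jobs are currently pending."""
    response = requests.get(f"{API_BASE}/api/jobs/pending", timeout=10)
    response.raise_for_status()
    return response.json()["number_of_jobs"]  # hypothetical response shape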

@sashasimkin

Yes, the endpoint would work; its biggest value is that it provides real-time and the most precise data possible (as it is the source of truth).

But CloudWatch might be easier to implement (just add AWS permissions to the ECS role) and a better fit: we can get a value aggregated over the last X minutes with get_metric_statistics/get_metric_data.
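
For comparison, here is a minimal sketch of the get_metric_data alternative mentioned above, reading the same Custom/opencap namespace and opencap_trials_pending metric used in the snippet below; the averaging window and function name are my own:

import boto3
from datetime import datetime, timedelta

def get_pending_trials_via_metric_data(minutes=5):
    """Average of the pending-trials metric over the last `minutes`, using get_metric_data."""
    client = boto3.client('cloudwatch')
    end_time = datetime.utcnow()
    response = client.get_metric_data(
        MetricDataQueries=[{
            'Id': 'pending_trials',
            'MetricStat': {
                'Metric': {
                    'Namespace': 'Custom/opencap',
                    'MetricName': 'opencap_trials_pending',
                },
                'Period': 60,
                'Stat': 'Average',
            },
        }],
        StartTime=end_time - timedelta(minutes=minutes),
        EndTime=end_time,
    )
    values = response['MetricDataResults'][0]['Values']
    return sum(values) / len(values) if values else None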

@suhlrich
Member Author

How would we query CloudWatch? Would this be another bash script? I guess it is probably better to look at the same value that the ASG rule is looking at.

@sashasimkin

Hey @suhlrich, here's the code that will read the metric submitted by stanfordnmbl/opencap-api#173 (comment):

This code requires the cloudwatch:GetMetricStatistics permission; similar to the above, I'll add it to the worker ECS service permissions.

import boto3
from datetime import datetime, timedelta

def get_metric_average(namespace, metric_name, start_time, end_time, period):
    """
    Fetch the average value of a specific metric from AWS CloudWatch.

    Parameters:
    - namespace (str): The namespace for the metric data.
    - metric_name (str): The name of the metric.
    - start_time (datetime): Start time for the data retrieval.
    - end_time (datetime): End time for the data retrieval.
    - period (int): The granularity, in seconds, of the data points returned.
    """
    client = boto3.client('cloudwatch')
    response = client.get_metric_statistics(
        Namespace=namespace,
        MetricName=metric_name,
        StartTime=start_time,
        EndTime=end_time,
        Period=period,
        Statistics=['Average']  # request the average statistic for each period
    )
    return response

def get_number_of_pending_trials():
    # Time range setup for the last 1 minute
    end_time = datetime.utcnow()
    start_time = end_time - timedelta(minutes=1)
    period = 60  # Period in seconds
    metric_name = 'opencap_trials_pending'

    # Fetch the metric data
    stats = get_metric_average(
        'Custom/opencap',  # namespace the custom metric is published under (adjust per environment if needed)
        metric_name,
        start_time, end_time, period
    )

    if stats['Datapoints']:
        average = stats['Datapoints'][0]['Average']
        print(f"Average value of '{metric_name}' over the last minute: {average}")
    else:
        print("No data points found.")
        # Maybe raise an exception or do nothing to have the control loop retry this call later
        return None

    return average

@sashasimkin

@suhlrich here's a bash script (and boto3 code, which should be a better fit) for removing scale-in protection from the instance:

#!/bin/bash
# Retrieve the Instance ID
INSTANCE_ID=$(curl -s http://169.254.169.254/latest/meta-data/instance-id)
# Retrieve the Auto Scaling Group name associated with this instance
ASG_NAME=$(aws autoscaling describe-auto-scaling-instances --instance-ids $INSTANCE_ID --query 'AutoScalingInstances[0].AutoScalingGroupName' --output text)
# Remove protection from this instance in its first (and only) Auto Scaling group
aws autoscaling set-instance-protection --instance-ids $INSTANCE_ID --auto-scaling-group-name $ASG_NAME --no-protected-from-scale-in

The same approach with boto3, to be called from Python (as in #117), because you will need boto3 for the CloudWatch interactions anyway:

import boto3
import requests

def get_instance_id():
    """Retrieve the instance ID from EC2 metadata."""
    response = requests.get("http://169.254.169.254/latest/meta-data/instance-id")
    return response.text

def get_auto_scaling_group_name(instance_id):
    """Retrieve the Auto Scaling Group name using the instance ID."""
    client = boto3.client('autoscaling')
    response = client.describe_auto_scaling_instances(InstanceIds=[instance_id])
    asg_name = response['AutoScalingInstances'][0]['AutoScalingGroupName']
    return asg_name

def set_instance_protection(instance_id, asg_name, protect):
    """Set or remove instance protection."""
    client = boto3.client('autoscaling')
    client.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=protect
    )

def unprotect_current_instance():
    instance_id = get_instance_id()
    asg_name = get_auto_scaling_group_name(instance_id)
    set_instance_protection(instance_id, asg_name, protect=False)

Either version will require the following permissions for the processing worker process, but this is just for reference as I'll be adding them in Terraform.

            "Action": [
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:SetInstanceProtection"
            ],

P.S.: Regarding your original point 2), you don't need to configure any dedicated AWS credentials for this, as the permissions will be picked up from the environment through the ECS task role.
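
Putting the pieces together, a rough sketch of how the worker could combine the CloudWatch read with the unprotect call, using the settings.ONPREM_CAPACITY threshold proposed earlier in this thread; pause_work() is a hypothetical stub:

import time

def pause_work():
    pass  # hypothetical: stop pulling new jobs but keep the process alive

def scale_in_check(onprem_capacity, check_interval_seconds=60):
    """Unprotect and pause this instance once pending work fits on the on-prem server."""
    while True:
        pending = get_number_of_pending_trials()  # from the CloudWatch snippet above
        if pending is not None and pending < onprem_capacity:
            unprotect_current_instance()  # from the boto3 snippet above
            pause_work()
        time.sleep(check_interval_seconds)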
