Description
Migrate KEDA Autoscaling from CloudWatch Metrics to SQS Queue Depth
Problem Statement
Currently, HTC Grid uses KEDA to autoscale agent pods based on CloudWatch metrics. This approach has several limitations:
- Indirect Scaling Signal: CloudWatch metrics are published by a Lambda function (lambda_scaling_metrics) that queries SQS queue depth and pushes it to CloudWatch. This adds latency and complexity.
- Additional Lambda Cost: Running a Lambda function periodically just to publish metrics incurs unnecessary costs.
- Delayed Response: CloudWatch metric resolution and KEDA polling intervals compound to create slower autoscaling response times.
- Increased Complexity: More moving parts (Lambda, CloudWatch, KEDA) means more potential failure points.
KEDA natively supports direct SQS queue depth monitoring via the aws-sqs-queue scaler, which would eliminate the Lambda middleman and provide faster, more reliable autoscaling.
Current Implementation
KEDA ScaledObject Configuration
Location: deployment/grid/charts/agent-htc-lambda/templates/hpa.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: {{ include "agent-htc-lambda.fullname" . }}-scaling-metrics
spec:
  scaleTargetRef:
    name: {{ include "agent-htc-lambda.fullname" . }}
  minReplicaCount: {{ .Values.hpa.minAgent }}
  maxReplicaCount: {{ .Values.hpa.maxAgent }}
  triggers:
    - type: aws-cloudwatch
      metadata:
        identityOwner: operator
        namespace: {{ .Values.hpa.metric.namespace }}
        dimensionName: {{ .Values.hpa.metric.dimensionName }}
        dimensionValue: {{ .Values.hpa.metric.dimensionValue }}
        metricName: {{ .Values.hpa.metric.name }}
        metricUnit: "Count"
        metricStatPeriod: "30"
        targetMetricValue: {{ .Values.hpa.metric.targetValue | quote }}
        minMetricValue: "0"
        awsRegion: {{ .Values.hpa.metric.region }}
Current IAM Permissions
Location: deployment/grid/terraform/compute_plane/aws_iam.tf
resource "aws_iam_policy" "keda_permissions" {
  name        = "keda_permissions_policy_${local.suffix}"
  path        = "/"
  description = "IAM policy for KEDA Permissions"
  policy      = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
EOF
}
Lambda Scaling Metrics Function
Location: deployment/grid/terraform/control_plane/lambda_scaling_metrics.tf
This Lambda function periodically:
- Queries SQS queue depth using GetQueueAttributes
- Publishes the metric to CloudWatch using PutMetricData
- Runs on a schedule (EventBridge/CloudWatch Events)
Proposed Solution
Migrate to KEDA's native aws-sqs-queue scaler to directly monitor SQS queue depth.
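As a rough sketch of what the migrated trigger in hpa.yaml might look like, the template below swaps the aws-cloudwatch trigger for aws-sqs-queue. The keys under .Values.hpa.sqs are hypothetical and do not exist in the current chart. The sketch also assumes the existing region value can be reused and that the KEDA operator role is granted sqs:GetQueueAttributes on the queue.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: {{ include "agent-htc-lambda.fullname" . }}-scaling-metrics
spec:
  scaleTargetRef:
    name: {{ include "agent-htc-lambda.fullname" . }}
  minReplicaCount: {{ .Values.hpa.minAgent }}
  maxReplicaCount: {{ .Values.hpa.maxAgent }}
  triggers:
    - type: aws-sqs-queue
      metadata:
        # Use the KEDA operator's own identity (IRSA); that role needs
        # sqs:GetQueueAttributes on the target queue.
        identityOwner: operator
        # Hypothetical values keys, shown for illustration only.
        queueURL: {{ .Values.hpa.sqs.queueURL | quote }}
        # Target number of messages per agent pod.
        queueLength: {{ .Values.hpa.sqs.targetQueueLength | quote }}
        # Assumes the region value used by the CloudWatch trigger still applies.
        awsRegion: {{ .Values.hpa.metric.region }}
        # Count in-flight (not yet deleted) messages as well as visible ones.
        scaleOnInFlight: "true"
```

Here queueLength takes over the role of targetMetricValue in the CloudWatch trigger, so the existing target value may be a reasonable starting point.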
Benefits of Migration
- Faster Autoscaling: Direct SQS monitoring eliminates Lambda and CloudWatch metric delays
- Cost Reduction: No Lambda invocations for metric publishing
- Simplified Architecture: Fewer components to maintain and monitor
- Better Reliability: Native KEDA scaler is well-tested and maintained
- More Accurate Scaling: Direct queue depth measurement vs. periodic sampling
Additional Considerations
Multiple Priority Queues
If HTC Grid uses multiple SQS queues for priority levels (e.g., queue__0, queue__1), you may need:
- Multiple ScaledObject resources (one per queue)
- OR aggregate queue depth across all queues in a single ScaledObject (requires custom logic)
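A third possibility, sketched below, is a single ScaledObject with one aws-sqs-queue trigger per priority queue. Note that this does not sum the queue depths: the HPA evaluates each trigger separately and scales to the largest resulting replica count. The .Values.hpa.sqs.queueURLs list and .Values.hpa.sqs.targetQueueLength are hypothetical values keys, not part of the current chart.

```yaml
  triggers:
    # One trigger per priority queue; "." is the current queue URL inside
    # the range, so "$" is needed to reach the chart's root values.
    {{- range .Values.hpa.sqs.queueURLs }}
    - type: aws-sqs-queue
      metadata:
        identityOwner: operator
        queueURL: {{ . | quote }}
        queueLength: {{ $.Values.hpa.sqs.targetQueueLength | quote }}
        awsRegion: {{ $.Values.hpa.metric.region }}
    {{- end }}
```

If scaling must follow the total backlog across all queues, this approach is not sufficient and some form of aggregation would still be needed.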
KEDA Version Compatibility
Ensure the KEDA version deployed supports the aws-sqs-queue scaler features needed:
- scaleOnInFlight: Available in KEDA 2.0+
- identityOwner: operator: Available in KEDA 2.10+
Current KEDA configuration shows IRSA is enabled, which is compatible with the identityOwner: operator approach.
Files to Modify
Kubernetes/Helm
- deployment/grid/charts/agent-htc-lambda/templates/hpa.yaml
- deployment/grid/charts/agent-htc-lambda/values.yaml
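For values.yaml, the change would probably amount to adding SQS settings alongside the existing hpa block. The fragment below is only an illustration: every key under sqs is a hypothetical name, and the values are placeholders rather than recommended settings.

```yaml
hpa:
  # Existing keys (minAgent, maxAgent, metric.*) are omitted here.
  sqs:
    # Hypothetical: URL of the task queue, likely injected by Terraform.
    queueURL: ""
    # Hypothetical: target messages per agent pod (placeholder value).
    targetQueueLength: "5"
    # Hypothetical: whether in-flight messages count toward queue depth.
    scaleOnInFlight: "true"
```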
Terraform - Compute Plane
- deployment/grid/terraform/compute_plane/aws_iam.tf
- deployment/grid/terraform/compute_plane/variables.tf
- deployment/grid/terraform/compute_plane/helm.tf
Terraform - Control Plane
- deployment/grid/terraform/control_plane/outputs.tf
Terraform - Main
- deployment/grid/terraform/main.tf