feat: Switch KEDA scaling to use SQS Queue directly #94

@fgogolli

Migrate KEDA Autoscaling from CloudWatch Metrics to SQS Queue Depth

Problem Statement

Currently, HTC Grid uses KEDA to autoscale agent pods based on CloudWatch metrics. This approach has several limitations:

  1. Indirect Scaling Signal: CloudWatch metrics are published by a Lambda function (lambda_scaling_metrics) that queries SQS queue depth and pushes it to CloudWatch. This adds latency and complexity.
  2. Additional Lambda Cost: Running a Lambda function periodically just to publish metrics incurs unnecessary costs.
  3. Delayed Response: CloudWatch metric resolution and KEDA polling intervals compound to create slower autoscaling response times.
  4. Increased Complexity: More moving parts (Lambda, CloudWatch, KEDA) means more potential failure points.

KEDA natively supports direct SQS queue depth monitoring via the aws-sqs-queue scaler, which would eliminate the Lambda middleman and provide faster, more reliable autoscaling.

Current Implementation

KEDA ScaledObject Configuration

Location: deployment/grid/charts/agent-htc-lambda/templates/hpa.yaml

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: {{ include "agent-htc-lambda.fullname" . }}-scaling-metrics
spec:
  scaleTargetRef:
    name: {{ include "agent-htc-lambda.fullname" . }}
  minReplicaCount: {{ .Values.hpa.minAgent }}
  maxReplicaCount: {{ .Values.hpa.maxAgent }}
  triggers:
    - type: aws-cloudwatch
      metadata:
        identityOwner: operator
        namespace: {{ .Values.hpa.metric.namespace }}
        dimensionName: {{ .Values.hpa.metric.dimensionName }}
        dimensionValue: {{ .Values.hpa.metric.dimensionValue }}
        metricName: {{ .Values.hpa.metric.name }}
        metricUnit: "Count"
        metricStatPeriod: "30"
        targetMetricValue: {{ .Values.hpa.metric.targetValue | quote }}
        minMetricValue: "0"
        awsRegion: {{ .Values.hpa.metric.region }}

Current IAM Permissions

Location: deployment/grid/terraform/compute_plane/aws_iam.tf

resource "aws_iam_policy" "keda_permissions" {
  name        = "keda_permissions_policy_${local.suffix}"
  path        = "/"
  description = "IAM policy for KEDA Permissions"
  policy      = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "cloudwatch:GetMetricData",
        "cloudwatch:GetMetricStatistics",
        "cloudwatch:ListMetrics"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
EOF
}

Lambda Scaling Metrics Function

Location: deployment/grid/terraform/control_plane/lambda_scaling_metrics.tf

This Lambda function periodically:

  1. Queries SQS queue depth using GetQueueAttributes
  2. Publishes the metric to CloudWatch using PutMetricData
  3. Runs on a schedule (EventBridge/CloudWatch Events)

Proposed Solution

Migrate to KEDA's native aws-sqs-queue scaler to directly monitor SQS queue depth.
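
As a rough illustration of the change in hpa.yaml, the aws-cloudwatch trigger could be swapped for an aws-sqs-queue trigger along the following lines. This is a minimal sketch, not the final template: the .Values.hpa.sqs.queueUrl key is hypothetical and would have to be added to values.yaml and wired through from Terraform, and reusing the existing targetValue and region values as queueLength and awsRegion is an assumption.

```yaml
# Sketch only: .Values.hpa.sqs.queueUrl is a hypothetical key, not present in the chart today.
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: {{ include "agent-htc-lambda.fullname" . }}-scaling-metrics
spec:
  scaleTargetRef:
    name: {{ include "agent-htc-lambda.fullname" . }}
  minReplicaCount: {{ .Values.hpa.minAgent }}
  maxReplicaCount: {{ .Values.hpa.maxAgent }}
  triggers:
    - type: aws-sqs-queue
      metadata:
        # Use the KEDA operator's own identity, as the aws-cloudwatch trigger does today.
        identityOwner: operator
        # Full URL of the task queue (assumed to be passed in from Terraform).
        queueURL: {{ .Values.hpa.sqs.queueUrl | quote }}
        # Target number of messages per agent pod (assumption: reuse the current target value).
        queueLength: {{ .Values.hpa.metric.targetValue | quote }}
        awsRegion: {{ .Values.hpa.metric.region | quote }}
        # Count in-flight (received but not yet deleted) messages as well as visible ones.
        scaleOnInFlight: "true"
```

With this trigger, the role used by the KEDA operator needs sqs:GetQueueAttributes on the queue in place of the CloudWatch read permissions in keda_permissions.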

Benefits of Migration

  1. Faster Autoscaling: Direct SQS monitoring eliminates Lambda and CloudWatch metric delays
  2. Cost Reduction: No Lambda invocations for metric publishing
  3. Simplified Architecture: Fewer components to maintain and monitor
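  4. Better Reliability: The aws-sqs-queue scaler is maintained upstream as part of KEDA, so there is no custom metric-publishing code to keep working
  5. More Accurate Scaling: KEDA reads the queue attributes itself at each polling interval instead of consuming a value republished through CloudWatch on a separate schedule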

Additional Considerations

Multiple Priority Queues

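If HTC Grid uses multiple SQS queues for priority levels (e.g., queue__0, queue__1), you may need one of the following:

  • Multiple triggers in a single ScaledObject (one per queue). KEDA expects one ScaledObject per scale target, so separate ScaledObject resources only fit if each queue is served by its own Deployment. With multiple triggers, the HPA scales on the trigger that demands the most replicas, not on the sum.
  • OR aggregate queue depth across all queues into a single value (requires custom logic)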
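
One way to cover several priority queues without custom aggregation is a single ScaledObject with one aws-sqs-queue trigger per queue. The fragment below is a sketch of the triggers section only, assuming a hypothetical .Values.hpa.sqs.queueUrls list that holds one URL per priority queue. Note that the HPA evaluates each trigger separately and uses the highest resulting replica count, so this does not sum the depth of all queues.

```yaml
# Sketch only: .Values.hpa.sqs.queueUrls is a hypothetical list of queue URLs.
  triggers:
    {{- range .Values.hpa.sqs.queueUrls }}
    - type: aws-sqs-queue
      metadata:
        identityOwner: operator
        # "." is the current queue URL; "$" reaches back to the chart's root values.
        queueURL: {{ . | quote }}
        queueLength: {{ $.Values.hpa.metric.targetValue | quote }}
        awsRegion: {{ $.Values.hpa.metric.region | quote }}
    {{- end }}
```

If the agents should scale on the total backlog across all priorities, this approach is not sufficient and an aggregated value is still required.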

KEDA Version Compatibility

Ensure the deployed KEDA version supports the aws-sqs-queue scaler features this change relies on:

  • scaleOnInFlight: Available in KEDA 2.0+
  • identityOwner: operator: Available in KEDA 2.10+

The current KEDA configuration has IRSA enabled, and the existing aws-cloudwatch trigger already sets identityOwner: operator, so the same approach carries over to the aws-sqs-queue trigger.

References

Files to Modify

Kubernetes/Helm

  • deployment/grid/charts/agent-htc-lambda/templates/hpa.yaml
  • deployment/grid/charts/agent-htc-lambda/values.yaml

Terraform - Compute Plane

  • deployment/grid/terraform/compute_plane/aws_iam.tf
  • deployment/grid/terraform/compute_plane/variables.tf
  • deployment/grid/terraform/compute_plane/helm.tf

Terraform - Control Plane

  • deployment/grid/terraform/control_plane/outputs.tf

Terraform - Main

  • deployment/grid/terraform/main.tf
