opea-project · poussa · Jun 26, 2025 · May 2, 2025 · May 19, 2025 · May 28, 2025
@@ -12,6 +12,10 @@
   - [Install](#install)
   - [Post-install](#post-install)
 - [Verify](#verify)
+- [Scaling metric considerations](#scaling-metric-considerations)
+  - [Autoscaling principles](#autoscaling-principles)
+  - [Current scaling metrics](#current-scaling-metrics)
+  - [Other potential metrics](#other-potential-metrics)
 
 ## Introduction
 
@@ -133,3 +137,64 @@ watch -n 5 scale-monitor-helm.sh default chatqna
 **NOTE**: inferencing services provide metrics only after they've processed their first request.
 The reranking service is used only after the query context data has been uploaded. Until then,
 no metrics will be available for them.
+
+## Scaling metric considerations
+
+### Autoscaling principles
+
+The model, underlying HW, and engine parameters should be selected so that each engine
+instance can satisfy the service SLA (Service Level Agreement) requirements for its own requests,
+even when it is becoming saturated. Autoscaling is then intended to scale up the service so that
+requests can be directed to unsaturated instances.
+
+The problem is finding a good metric, and a threshold for it, that indicates this saturation point.
+Preferably, the metric should anticipate this point, so that the startup delay of
+new engine instances does not cause SLA breakage (or, in the worst case, requests being rejected
+if the engine queue fills up).
+
+Note: Another problem is that Kubernetes service routing sends requests also to already saturated
+instances, instead of only to idle ones. Using [KubeAI](../kubeai/#readme) (instead of HPA) to manage
+both engine scaling and query routing can solve that.
+
+### Current scaling metrics
+
+Currently, the following inference engine metrics are used to autoscale engine replica counts:
+
+- vLLM: Active requests, i.e. count of waiting (queued) + (already) running requests
+  - Good overall scaling metric, also used by [KubeAI](../kubeai/#readme) for scaling vLLM
+  - Threshold depends on how many requests the underlying HW / engine config can process in parallel for the given model
+- TGI / TEI: Queue size, i.e. how many requests are waiting to be processed
+  - Used because TGI and TEI do not offer a metric for (already) running requests, only for waiting ones
+  - Independent of the used model, so it works well as an example, but it is less suitable for production because
+    scaling happens late and fluctuates a lot (the metric drops to zero whenever the engine is not saturated)
+
+### Other potential metrics
+
+All the metrics provided by the inference engines are listed in their documentation:
+
+- [vLLM metrics](https://docs.vllm.ai/en/v0.8.5/serving/metrics.html)
+  - [Metric design](https://docs.vllm.ai/en/v0.8.5/design/v1/metrics.html)
+- [TGI metrics](https://huggingface.co/docs/text-generation-inference/en/reference/metrics)
+  - TEI (embed and reranking) services provide a subset of these TGI metrics
+
+The OPEA application [dashboard](monitoring.md#dashboards) provides (Prometheus query) examples
+for deriving service performance metrics from the engine Histogram metrics.
+
+Suitability of these performance metrics for autoscaling:
+
+- Request latency, requests per second (RPS) - not suitable
+  - Depend completely on input and output token counts, and indicate past performance, not incoming load
+- First token latency (TTFT) - potential
+  - Relevance depends on the use-case: the number of tokens used and which aspect of performance matters
+- Next token latency (TPOT, ITL), tokens per second (TPS) - potential
+  - Relevance depends on the use-case: the number of tokens used and which aspect of performance matters
+
+Performance metrics are capped by the performance of the underlying engine setup, so
+at some point they stop corresponding to the incoming load, i.e. to how much scaling would be needed.
+
+Therefore, such metrics could be used in production only _when_ their thresholds are carefully
+fine-tuned, and rechecked every time the underlying setup (model, HW, engine config) changes.
+In OPEA Helm charts that setup is user selectable, so such metrics are unsuitable for
+autoscaling examples.
+
+(General [explanation](https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html) on how these metrics are measured.)
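
The difference between the two current scaling metrics (active requests for vLLM, queue size for TGI / TEI) can be illustrated with a small Python sketch. It is not part of the PR: the `engine_metrics` helper and the capacity of 8 parallel requests are hypothetical, and a real engine's batching behavior is more complex than this simple split.

```python
def engine_metrics(incoming: int, capacity: int) -> dict:
    """Split concurrent requests into running and waiting ones.

    Simplified model of a single engine instance (an assumption):
    up to `capacity` requests run in parallel, the rest wait in the queue.
    """
    running = min(incoming, capacity)
    waiting = max(0, incoming - capacity)
    return {"running": running, "waiting": waiting, "active": running + waiting}


if __name__ == "__main__":
    capacity = 8  # hypothetical number of requests processed in parallel
    for load in (2, 6, 8, 10, 16):
        m = engine_metrics(load, capacity)
        # "waiting" (queue size) stays at zero until the instance is saturated,
        # whereas "active" (running + waiting) follows the load all the way.
        print(f"load={load:2d} active={m['active']:2d} waiting={m['waiting']:2d}")
```

In this model the queue size gives no signal before saturation, which matches the note that queue-based scaling happens late, while the active request count can be compared against a threshold set below the saturation point.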
@@ -1,44 +1,64 @@
-# Copyright (C) 2024 Intel Corporation
+# Copyright (C) 2024-2025 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0
 
-# Enable HorizontalPodAutoscaler (HPA)
+# Enable HorizontalPodAutoscaler (HPA) for ChatQnA and its components
 #
-# That will overwrite named PrometheusAdapter configMap with ChatQnA specific
-# custom metric queries for embedding, reranking, and LLM services.
+# Creates a configMap with ChatQnA specific custom metric queries for embedding, reranking,
+# and LLM inferencing services, which can be used to overwrite the current PrometheusAdapter rules.
+# These rules then provide the custom metrics used by the HorizontalPodAutoscaler rules of each service.
 #
-# Default upstream configMap is in:
+# Default upstream adapter configMap is in:
 #  - https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/config-map.yaml
 
-dashboard:
-  scaling: true
-
 autoscaling:
   enabled: true
 
 global:
-  # K8s custom metrics (used for scaling thresholds) are based on metrics from service monitoring
+  # Both Grafana dashboards and k8s custom metrics need (Prometheus) metrics for services
   monitoring: true
 
 # Override values in specific subcharts
+#
+# Note: enabling "autoscaling" for any of the subcharts requires enabling it also above!
+
+dashboard:
+  # add also scaling metrics dashboard to Grafana
+  scaling: true
 
-# Enabling "autoscaling" for any of the subcharts requires enabling it also above!
 vllm:
+  # vLLM startup takes too long for autoscaling, especially with Gaudi
+  VLLM_SKIP_WARMUP: "true"
   autoscaling:
+    enabled: true
     minReplicas: 1
     maxReplicas: 4
-    enabled: true
+    activeRequestsTarget:
+      accel: 120
+      cpu: 10
+
 tgi:
   autoscaling:
+    enabled: true
     minReplicas: 1
     maxReplicas: 4
-    enabled: true
+    queueSizeTarget:
+      accel: 10
+      cpu: 10
+
 teirerank:
   autoscaling:
+    enabled: true
     minReplicas: 1
     maxReplicas: 3
-    enabled: true
+    queueSizeTarget:
+      accel: 10
+      cpu: 10
+
 tei:
   autoscaling:
+    enabled: true
     minReplicas: 1
     maxReplicas: 2
-    enabled: true
+    queueSizeTarget:
+      accel: 10
+      cpu: 10
@@ -1,30 +1,28 @@
-# Copyright (C) 2024 Intel Corporation
+{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
+# Copyright (C) 2024-2025 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0
 
-{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
 apiVersion: v1
 kind: ConfigMap
 metadata:
-  # easy to find for the required manual step
+  # easy to find for the manual step required to install this for Prometheus-adapter
   namespace: default
   name: {{ include "chatqna.fullname" . }}-custom-metrics
   labels:
     app.kubernetes.io/name: prometheus-adapter
 data:
   config.yaml: |
     rules:
-    {{- if and .Values.vllm.enabled .Values.vllm.autoscaling.enabled }}
     # check metric with:
     # kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/<metric> | jq
     #
-    - seriesQuery: '{__name__="vllm:time_per_output_token_seconds_sum",service="{{ include "vllm.fullname" .Subcharts.vllm }}"}'
-      # Average output token latency from vLLM histograms, over 1 min
-      # (interval should be at least 4x serviceMonitor query interval,
-      # 0.001 divider add is to make sure there's always a valid value)
-      metricsQuery: 'rate(vllm:time_per_output_token_seconds_sum{service="{{ include "vllm.fullname" .Subcharts.vllm }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(vllm:time_per_output_token_seconds_count{service="{{ include "vllm.fullname" .Subcharts.vllm }}",<<.LabelMatchers>>}[1m]))'
+    {{- if and .Values.vllm.enabled .Values.vllm.autoscaling.enabled }}
+    - seriesQuery: '{__name__="vllm:num_requests_waiting",service="{{ include "vllm.fullname" .Subcharts.vllm }}"}'
+      # Sum of active requests in pods, both ones already being processed, and ones waiting to be processed
+      metricsQuery: 'sum by (<<.GroupBy>>)(vllm:num_requests_running{<<.LabelMatchers>>} + <<.Series>>{<<.LabelMatchers>>})'
       name:
-        matches: ^vllm:time_per_output_token_seconds_sum
-        as: "{{ include "vllm.metricPrefix" .Subcharts.vllm }}_token_latency"
+        matches: ^vllm:num_requests_waiting
+        as: "{{ include "vllm.metricPrefix" .Subcharts.vllm }}_active_request_sum"
       resources:
         # HPA needs both namespace + suitable object resource for its query paths:
         # /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/<metric>
@@ -34,63 +32,37 @@ data:
           service:   {resource: "service"}
     {{- end }}
     {{- if and .Values.tgi.enabled .Values.tgi.autoscaling.enabled }}
-    {{- if .Values.tgi.accelDevice }}
     - seriesQuery: '{__name__="tgi_queue_size",service="{{ include "tgi.fullname" .Subcharts.tgi }}"}'
       # TGI instances queue_size sum
-      metricsQuery: 'sum by (namespace,service) (tgi_queue_size{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>})'
+      # - GroupBy/LabelMatchers provide labels from resources section
+      metricsQuery: 'sum by (<<.GroupBy>>)(<<.Series>>{<<.LabelMatchers>>})'
       name:
         matches: ^tgi_queue_size
         as: "{{ include "tgi.metricPrefix" .Subcharts.tgi }}_queue_size_sum"
-    {{- else }}
-    - seriesQuery: '{__name__="tgi_request_inference_duration_sum",service="{{ include "tgi.fullname" .Subcharts.tgi }}"}'
-      # Average request latency from TGI histograms, over 1 min
-      metricsQuery: 'rate(tgi_request_inference_duration_sum{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(tgi_request_inference_duration_count{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]))'
-      name:
-        matches: ^tgi_request_inference_duration_sum
-        as: "{{ include "tgi.metricPrefix" .Subcharts.tgi }}_request_latency"
-    {{- end }}
       resources:
         overrides:
           namespace: {resource: "namespace"}
           service:   {resource: "service"}
     {{- end }}
-    {{- if .Values.teirerank.autoscaling.enabled }}
-    {{- if .Values.teirerank.accelDevice }}
+    {{- if and .Values.teirerank.enabled .Values.teirerank.autoscaling.enabled }}
     - seriesQuery: '{__name__="te_queue_size",service="{{ include "teirerank.fullname" .Subcharts.teirerank }}"}'
       # TEI instances queue_size sum
-      metricsQuery: 'sum by (namespace,service) (te_queue_size{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>})'
+      metricsQuery: 'sum by (<<.GroupBy>>)(<<.Series>>{<<.LabelMatchers>>})'
       name:
         matches: ^te_queue_size
         as: "{{ include "teirerank.metricPrefix" .Subcharts.teirerank }}_queue_size_sum"
-    {{- else }}
-    - seriesQuery: '{__name__="te_request_inference_duration_sum",service="{{ include "teirerank.fullname" .Subcharts.teirerank }}"}'
-      # Average request latency from TEI histograms, over 1 min
-      metricsQuery: 'rate(te_request_inference_duration_sum{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>}[1m]))'
-      name:
-        matches: ^te_request_inference_duration_sum
-        as: "{{ include "teirerank.metricPrefix" .Subcharts.teirerank }}_request_latency"
-    {{- end }}
       resources:
         overrides:
           namespace: {resource: "namespace"}
           service:   {resource: "service"}
     {{- end }}
     {{- if .Values.tei.autoscaling.enabled }}
-    {{- if .Values.tei.accelDevice }}
     - seriesQuery: '{__name__="te_queue_size",service="{{ include "tei.fullname" .Subcharts.tei }}"}'
       # TEI instances queue_size sum
-      metricsQuery: 'sum by (namespace,service) (te_queue_size{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>})'
+      metricsQuery: 'sum by (<<.GroupBy>>)(<<.Series>>{<<.LabelMatchers>>})'
       name:
         matches: ^te_queue_size
         as: "{{ include "tei.metricPrefix" .Subcharts.tei }}_queue_size_sum"
-    {{- else }}
-    - seriesQuery: '{__name__="te_request_inference_duration_sum",service="{{ include "tei.fullname" .Subcharts.tei }}"}'
-      # Average request latency from TEI histograms, over 1 min
-      metricsQuery: 'rate(te_request_inference_duration_sum{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>}[1m]))'
-      name:
-        matches: ^te_request_inference_duration_sum
-        as: "{{ include "tei.metricPrefix" .Subcharts.tei }}_request_latency"
-    {{- end }}
       resources:
         overrides:
           namespace: {resource: "namespace"}
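
The templated `metricsQuery` values above rely on the adapter filling in `<<.Series>>`, `<<.LabelMatchers>>` and `<<.GroupBy>>`. The following Python sketch only imitates that substitution with plain string replacement, to show roughly what the resulting PromQL looks like. The `expand_rule` helper and the `chatqna-tgi` service name are hypothetical, and the exact label order produced by the real adapter may differ.

```python
def expand_rule(template: str, series: str, matchers: dict, group_by: list) -> str:
    """Imitate how the placeholders in a metricsQuery get filled in.

    Illustration only: the real adapter uses Go templates, and derives the
    matchers and grouping labels from the rule's "resources" section.
    """
    label_matchers = ",".join(f'{key}="{value}"' for key, value in matchers.items())
    return (
        template.replace("<<.Series>>", series)
        .replace("<<.LabelMatchers>>", label_matchers)
        .replace("<<.GroupBy>>", ",".join(group_by))
    )


TEMPLATE = "sum by (<<.GroupBy>>)(<<.Series>>{<<.LabelMatchers>>})"

if __name__ == "__main__":
    # Hypothetical HPA query for a service named "chatqna-tgi" in "default" namespace
    query = expand_rule(
        TEMPLATE,
        series="tgi_queue_size",
        matchers={"namespace": "default", "service": "chatqna-tgi"},
        group_by=["service"],
    )
    print(query)
```

Because the series name and labels come from the placeholders, the same `metricsQuery` string can be shared by the TGI, TEI and reranking rules, which is what the change above does.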

@@ -1137,7 +1137,7 @@ data:
                 "uid": "${Metrics}"
               },
               "editorMode": "code",
-              "expr": "sum by (service)(rate(tgi_request_mean_time_per_token_duration_count{service=\"$release-tgi\",namespace=\"$namespace\"}[$__rate_interval]))",
+              "expr": "sum by (service)(rate(tgi_request_generated_tokens_sum{service=\"$release-tgi\",namespace=\"$namespace\"}[$__rate_interval]))",
               "hide": false,
               "instant": false,
               "legendFormat": "TGI",

@@ -1,7 +1,7 @@
+{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
 # Copyright (C) 2024 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0
 
-{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
 apiVersion: autoscaling/v2
 kind: HorizontalPodAutoscaler
 metadata:
@@ -22,24 +22,19 @@ spec:
         kind: Service
         name: {{ include "tei.fullname" . }}
       target:
-{{- if .Values.accelDevice }}
         # Metric is sum from all pods. "AverageValue" divides value returned from
-        # the custom metrics API by the number of Pods before comparing to the target:
+        # the custom metrics API by the number of Pods before comparing to the target
+        # (pods need to be in Ready state faster than specified stabilization window):
         #  https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details
         #  https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-multiple-metrics-and-custom-metrics
         type: AverageValue
-        averageValue: 15
-      metric:
-        name: {{ include "tei.metricPrefix" . }}_queue_size_sum
+{{- if .Values.accelDevice }}
+        averageValue: {{ .Values.autoscaling.queueSizeTarget.accel }}
 {{- else }}
-        # Metric is average for all the pods. To avoid replica fluctuation when pod
-        # startup + request processing takes longer than HPA evaluation period, this uses
-        # "Value" (replicas = metric.value / target.value), instead of "AverageValue" type.
-        type: Value
-        value: 4 # seconds
-      metric:
-        name: {{ include "tei.metricPrefix" . }}_request_latency
+        averageValue: {{ .Values.autoscaling.queueSizeTarget.cpu }}
 {{- end }}
+      metric:
+        name: {{ include "tei.metricPrefix" . }}_queue_size_sum
   behavior:
     scaleDown:
       stabilizationWindowSeconds: 180
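
How the `AverageValue` target above translates into a replica count can be sketched in Python, following the formula in the linked Kubernetes algorithm documentation. This is a simplified illustration, not the HPA implementation: the `desired_replicas` helper is hypothetical, the 10% tolerance is the documented Kubernetes default, and stabilization windows and not-yet-Ready pods are ignored.

```python
import math


def desired_replicas(metric_sum: float, current: int, target_avg: float,
                     min_replicas: int, max_replicas: int,
                     tolerance: float = 0.1) -> int:
    """Simplified HPA calculation for an Object metric with AverageValue target.

    The metric (sum over all pods) is divided by the current pod count and
    compared to the target; the replica count changes only when the ratio
    is outside the tolerance.
    """
    ratio = (metric_sum / current) / target_avg
    if abs(ratio - 1.0) <= tolerance:
        return current
    desired = math.ceil(current * ratio)
    return max(min_replicas, min(max_replicas, desired))


if __name__ == "__main__":
    # Queue size sum of 35 over 2 pods, with target of 10 per pod, max 4 replicas
    print(desired_replicas(35, current=2, target_avg=10, min_replicas=1, max_replicas=4))
    # Empty queues scale back down to the minimum replica count
    print(desired_replicas(0, current=3, target_avg=10, min_replicas=1, max_replicas=4))
```

With a summed metric, `current * ratio` reduces to `metric_sum / target_avg`, so the target value in `queueSizeTarget` is effectively the per-pod queue size at which another replica is requested.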

@@ -12,9 +12,12 @@ replicaCount: 1
 # - Requires custom metrics ConfigMap available in the main application chart
 # - https://kubernetes.io/docs/concepts/workloads/autoscaling/
 autoscaling:
+  enabled: false
   minReplicas: 1
   maxReplicas: 2
-  enabled: false
+  queueSizeTarget:
+    accel: 10
+    cpu: 10
 
 port: 2081
 shmSize: 1Gi

@@ -1,7 +1,7 @@
+{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
 # Copyright (C) 2024 Intel Corporation
 # SPDX-License-Identifier: Apache-2.0
 
-{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
 apiVersion: autoscaling/v2
 kind: HorizontalPodAutoscaler
 metadata:
@@ -22,24 +22,19 @@ spec:
         kind: Service
         name: {{ include "teirerank.fullname" . }}
       target:
-{{- if .Values.accelDevice }}
         # Metric is sum from all pods. "AverageValue" divides value returned from
-        # the custom metrics API by the number of Pods before comparing to the target:
+        # the custom metrics API by the number of Pods before comparing to the target
+        # (pods need to be in Ready state faster than specified stabilization window):
         #  https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details
         #  https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-multiple-metrics-and-custom-metrics
         type: AverageValue
-        averageValue: 15
-      metric:
-        name: {{ include "teirerank.metricPrefix" . }}_queue_size_sum
+{{- if .Values.accelDevice }}
+        averageValue: {{ .Values.autoscaling.queueSizeTarget.accel }}
 {{- else }}
-        # Metric is average for all the pods. To avoid replica fluctuation when pod
-        # startup + request processing takes longer than HPA evaluation period, this uses
-        # "Value" (replicas = metric.value / target.value), instead of "AverageValue" type.
-        type: Value
-        value: 4 # seconds
-      metric:
-        name: {{ include "teirerank.metricPrefix" . }}_request_latency
+        averageValue: {{ .Values.autoscaling.queueSizeTarget.cpu }}
 {{- end }}
+      metric:
+        name: {{ include "teirerank.metricPrefix" . }}_queue_size_sum
   behavior:
     scaleDown:
       stabilizationWindowSeconds: 180

@@ -12,9 +12,12 @@ replicaCount: 1
 # - Requires custom metrics ConfigMap available in the main application chart
 # - https://kubernetes.io/docs/concepts/workloads/autoscaling/
 autoscaling:
+  enabled: false
   minReplicas: 1
   maxReplicas: 3
-  enabled: false
+  queueSizeTarget:
+    accel: 10
+    cpu: 10
 
 port: 2082
 shmSize: 1Gi