76 changes: 71 additions & 5 deletions helm-charts/HPA.md
@@ -12,6 +12,10 @@
- [Install](#install)
- [Post-install](#post-install)
- [Verify](#verify)
- [Scaling metric considerations](#scaling-metric-considerations)
- [Autoscaling principles](#autoscaling-principles)
- [Current scaling metrics](#current-scaling-metrics)
- [Other potential metrics](#other-potential-metrics)

## Introduction

@@ -62,8 +66,8 @@ $ helm install prometheus-adapter prometheus-community/prometheus-adapter --ver
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
```

NOTE: the service name given above in `prometheus.url` must match the listed Prometheus service name,
otherwise adapter cannot access it!
> **NOTE**: the service name given above in `prometheus.url` must match the listed Prometheus
> service name, otherwise adapter cannot access it!

(Alternative for setting the above `prometheusSpec` variable to `false` is making sure that
`prometheusRelease` value in top-level chart matches the release name given to the Prometheus
@@ -130,6 +134,68 @@ watch -n 5 scale-monitor-helm.sh default chatqna

(Assumes that HPA scaled chart is installed to `default` namespace with `chatqna` release name.)

**NOTE**: inferencing services provide metrics only after they've processed their first request.
The reranking service is used only after the query context data has been uploaded. Until then,
no metrics will be available for them.
> **NOTE**: inferencing services provide metrics only after they've processed their first request.
> The reranking service is used only after the query context data has been uploaded. Until then,
> no metrics will be available for them.

## Scaling metric considerations

### Autoscaling principles

The model, underlying HW, and engine parameters should be selected so that a single engine
instance can satisfy the service SLA (Service Level Agreement) requirements for its own requests,
even when it is becoming saturated. Autoscaling is then intended to scale up the service so that
requests can be directed to unsaturated instances.

The problem is finding a good metric, and a threshold for it, that indicates this saturation
point. Preferably the metric should anticipate that point, so that the startup delay of new
engine instances does not cause SLA breakage (or, in the worst case, requests being rejected
once the engine queue fills up).

> **NOTE**: Another problem is Kubernetes service routing sending requests (also) to already
> saturated instances instead of idle ones. Using [KubeAI](../kubeai/#readme) (instead of HPA)
> to manage both engine scaling and query routing can solve that.
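
To make this concrete, below is a minimal sketch of an HPA that scales on such a saturation
metric. All names and the threshold value are placeholders chosen for illustration; the actual
HPA and adapter rules are generated by the subchart templates.

```yaml
# Minimal sketch only - names and the threshold are placeholders.
# Scales an inference Deployment between 1 and 4 replicas, targeting an
# average of 10 "active requests" per replica. The custom metric must be
# exposed through the custom metrics API (e.g. by prometheus-adapter).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-llm-engine
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-llm-engine
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Object
    object:
      metric:
        name: my_llm_active_request_sum
      describedObject:
        # Service whose labels the custom metric query is matched against
        apiVersion: v1
        kind: Service
        name: my-llm-engine
      target:
        # Metric is a sum over all pods; "AverageValue" divides it by the
        # current replica count before comparing to the target
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleDown:
      # Avoid scaling down immediately when load drops only momentarily
      stabilizationWindowSeconds: 180
```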

### Current scaling metrics

The following inference engine metrics are used to autoscale their replica counts:

- vLLM: active requests, i.e. the count of waiting (queued) + (already) running requests
  - A good overall scaling metric, also used by [KubeAI](../kubeai/#readme) for scaling vLLM; see the adapter rule sketch after this list
  - The threshold depends on how many requests the underlying HW / engine config can process in parallel for the given model
- TGI / TEI: queue size, i.e. how many requests are waiting to be processed
  - Used because TGI and TEI do not offer a metric for (already) running requests, only for waiting ones
  - Independent of the used model, so it works well as an example, but not that good for production, because
    scaling happens late and fluctuates a lot (the metric stays at zero until the engine is saturated)
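
As a sketch of how the vLLM metric can be exposed for HPA, the prometheus-adapter rule below sums
the engine's running and waiting request gauges into a single custom metric. It is a simplified,
hand-edited version of the rule this chart generates; the service name and the exposed metric
name are placeholders.

```yaml
rules:
  # Placeholder service name; the chart templates in the real one
- seriesQuery: '{__name__="vllm:num_requests_waiting",service="my-vllm"}'
  # Active requests = requests already running + requests waiting in the queue
  metricsQuery: 'sum by (<<.GroupBy>>)(vllm:num_requests_running{<<.LabelMatchers>>} + <<.Series>>{<<.LabelMatchers>>})'
  name:
    matches: ^vllm:num_requests_waiting
    as: "my_vllm_active_request_sum"
  resources:
    # HPA needs namespace + Service object resolution for its query paths
    overrides:
      namespace: {resource: "namespace"}
      service: {resource: "service"}
```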

### Other potential metrics

All the metrics provided by the inference engines are listed in their documentation:

- [vLLM metrics](https://docs.vllm.ai/en/v0.8.5/serving/metrics.html)
- [Metric design](https://docs.vllm.ai/en/v0.8.5/design/v1/metrics.html)
- [TGI metrics](https://huggingface.co/docs/text-generation-inference/en/reference/metrics)
- TEI (embed and reranking) services provide a subset of these TGI metrics

The OPEA application [dashboard](monitoring.md#dashboards) provides (Prometheus query) examples
for deriving service performance metrics from the engines' histogram metrics.

Their suitability for autoscaling:

- Request latency, requests per second (RPS) - not suitable
  - Depend completely on input and output token counts, and indicate past performance rather than incoming load
- First token latency (TTFT) - potential
  - Relevance depends on the use case: how many tokens are used and what matters for it
- Next token latency (TPOT, ITL), tokens per second (TPS) - potential
  - Relevance depends on the use case: how many tokens are used and what matters for it

Performance metrics will be capped by the performance of the underlying engine setup.
Beyond a certain point, they no longer reflect the actual incoming load or indicate how
much scaling is needed.

Therefore such metrics could be used in production _when_ their thresholds are carefully
fine-tuned and rechecked every time the underlying setup (model, HW, engine config) changes.
In OPEA Helm charts that setup is user selectable, so such metrics are unsuitable for the
autoscaling examples.
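
If you nevertheless want to experiment with a latency-based metric, the sketch below shows one
way to expose average per-request inference latency via prometheus-adapter, using the usual
`rate(sum) / rate(count)` pattern over TGI's histogram counters. The service and metric names
are placeholders, and the resulting metric would still need a per-setup threshold.

```yaml
rules:
- seriesQuery: '{__name__="tgi_request_inference_duration_sum",service="my-tgi"}'
  # Average request latency over the last minute: rate of summed durations
  # divided by rate of request count (0.001 keeps the divisor non-zero)
  metricsQuery: >-
    rate(tgi_request_inference_duration_sum{<<.LabelMatchers>>}[1m]) /
    (0.001 + rate(tgi_request_inference_duration_count{<<.LabelMatchers>>}[1m]))
  name:
    matches: ^tgi_request_inference_duration_sum
    as: "my_tgi_request_latency"
  resources:
    overrides:
      namespace: {resource: "namespace"}
      service: {resource: "service"}
```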

(General [explanation](https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html) on how these metrics are measured.)
48 changes: 34 additions & 14 deletions helm-charts/chatqna/hpa-values.yaml
@@ -1,44 +1,64 @@
# Copyright (C) 2024 Intel Corporation
# Copyright (C) 2024-2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Enable HorizontalPodAutoscaler (HPA)
# Enable HorizontalPodAutoscaler (HPA) for ChatQnA and its components
#
# That will overwrite named PrometheusAdapter configMap with ChatQnA specific
# custom metric queries for embedding, reranking, and LLM services.
# Will create configMap with ChatQnA specific custom metric queries for embedding, reranking,
# and LLM inferencing services, which can be used to overwrite current PrometheusAdapter rules.
# This will then provide custom metrics used by HorizontalPodAutoscaler rules of each service.
#
# Default upstream configMap is in:
# Default upstream adapter configMap is in:
# - https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/config-map.yaml

dashboard:
scaling: true

autoscaling:
enabled: true

global:
# K8s custom metrics (used for scaling thresholds) are based on metrics from service monitoring
# Both Grafana dashboards and k8s custom metrics need (Prometheus) metrics for services
monitoring: true

# Override values in specific subcharts
#
# Note: enabling "autoscaling" for any of the subcharts requires enabling it also above!

dashboard:
# add also scaling metrics dashboard to Grafana
scaling: true

# Enabling "autoscaling" for any of the subcharts requires enabling it also above!
vllm:
# vLLM startup takes too long for autoscaling, especially with Gaudi
VLLM_SKIP_WARMUP: "true"
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 4
enabled: true
activeRequestsTarget:
accel: 120
cpu: 10

tgi:
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 4
enabled: true
queueSizeTarget:
accel: 10
cpu: 10

teirerank:
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 3
enabled: true
queueSizeTarget:
accel: 10
cpu: 10

tei:
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 2
enabled: true
queueSizeTarget:
accel: 10
cpu: 10
56 changes: 14 additions & 42 deletions helm-charts/chatqna/templates/custom-metrics-configmap.yaml
@@ -1,30 +1,28 @@
# Copyright (C) 2024 Intel Corporation
{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
# Copyright (C) 2024-2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
# easy to find for the required manual step
# easy to find for the manual step required to install this for Prometheus-adapter
namespace: default
name: {{ include "chatqna.fullname" . }}-custom-metrics
labels:
app.kubernetes.io/name: prometheus-adapter
data:
config.yaml: |
rules:
{{- if and .Values.vllm.enabled .Values.vllm.autoscaling.enabled }}
# check metric with:
# kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/<metric> | jq
#
- seriesQuery: '{__name__="vllm:time_per_output_token_seconds_sum",service="{{ include "vllm.fullname" .Subcharts.vllm }}"}'
# Average output token latency from vLLM histograms, over 1 min
# (interval should be at least 4x serviceMonitor query interval,
# 0.001 divider add is to make sure there's always a valid value)
metricsQuery: 'rate(vllm:time_per_output_token_seconds_sum{service="{{ include "vllm.fullname" .Subcharts.vllm }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(vllm:time_per_output_token_seconds_count{service="{{ include "vllm.fullname" .Subcharts.vllm }}",<<.LabelMatchers>>}[1m]))'
{{- if and .Values.vllm.enabled .Values.vllm.autoscaling.enabled }}
- seriesQuery: '{__name__="vllm:num_requests_waiting",service="{{ include "vllm.fullname" .Subcharts.vllm }}"}'
# Sum of active requests in pods, both ones already being processed, and ones waiting to be processed
metricsQuery: 'sum by (<<.GroupBy>>)(vllm:num_requests_running{<<.LabelMatchers>>} + <<.Series>>{<<.LabelMatchers>>})'
name:
matches: ^vllm:time_per_output_token_seconds_sum
as: "{{ include "vllm.metricPrefix" .Subcharts.vllm }}_token_latency"
matches: ^vllm:num_requests_waiting
as: "{{ include "vllm.metricPrefix" .Subcharts.vllm }}_active_request_sum"
resources:
# HPA needs both namespace + suitable object resource for its query paths:
# /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/<metric>
@@ -34,63 +32,37 @@ data:
service: {resource: "service"}
{{- end }}
{{- if and .Values.tgi.enabled .Values.tgi.autoscaling.enabled }}
{{- if .Values.tgi.accelDevice }}
- seriesQuery: '{__name__="tgi_queue_size",service="{{ include "tgi.fullname" .Subcharts.tgi }}"}'
# TGI instances queue_size sum
metricsQuery: 'sum by (namespace,service) (tgi_queue_size{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>})'
# - GroupBy/LabelMatches provide labels from resources section
metricsQuery: 'sum by (<<.GroupBy>>)(<<.Series>>{<<.LabelMatchers>>})'
name:
matches: ^tgi_queue_size
as: "{{ include "tgi.metricPrefix" .Subcharts.tgi }}_queue_size_sum"
{{- else }}
- seriesQuery: '{__name__="tgi_request_inference_duration_sum",service="{{ include "tgi.fullname" .Subcharts.tgi }}"}'
# Average request latency from TGI histograms, over 1 min
metricsQuery: 'rate(tgi_request_inference_duration_sum{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(tgi_request_inference_duration_count{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]))'
name:
matches: ^tgi_request_inference_duration_sum
as: "{{ include "tgi.metricPrefix" .Subcharts.tgi }}_request_latency"
{{- end }}
resources:
overrides:
namespace: {resource: "namespace"}
service: {resource: "service"}
{{- end }}
{{- if .Values.teirerank.autoscaling.enabled }}
{{- if .Values.teirerank.accelDevice }}
{{- if and .Values.teirerank.enabled .Values.teirerank.autoscaling.enabled }}
- seriesQuery: '{__name__="te_queue_size",service="{{ include "teirerank.fullname" .Subcharts.teirerank }}"}'
# TEI instances queue_size sum
metricsQuery: 'sum by (namespace,service) (te_queue_size{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>})'
metricsQuery: 'sum by (<<.GroupBy>>)(<<.Series>>{<<.LabelMatchers>>})'
name:
matches: ^te_queue_size
as: "{{ include "teirerank.metricPrefix" .Subcharts.teirerank }}_queue_size_sum"
{{- else }}
- seriesQuery: '{__name__="te_request_inference_duration_sum",service="{{ include "teirerank.fullname" .Subcharts.teirerank }}"}'
# Average request latency from TEI histograms, over 1 min
metricsQuery: 'rate(te_request_inference_duration_sum{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>}[1m]))'
name:
matches: ^te_request_inference_duration_sum
as: "{{ include "teirerank.metricPrefix" .Subcharts.teirerank }}_request_latency"
{{- end }}
resources:
overrides:
namespace: {resource: "namespace"}
service: {resource: "service"}
{{- end }}
{{- if .Values.tei.autoscaling.enabled }}
{{- if .Values.tei.accelDevice }}
- seriesQuery: '{__name__="te_queue_size",service="{{ include "tei.fullname" .Subcharts.tei }}"}'
# TEI instances queue_size sum
metricsQuery: 'sum by (namespace,service) (te_queue_size{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>})'
metricsQuery: 'sum by (<<.GroupBy>>)(<<.Series>>{<<.LabelMatchers>>})'
name:
matches: ^te_queue_size
as: "{{ include "tei.metricPrefix" .Subcharts.tei }}_queue_size_sum"
{{- else }}
- seriesQuery: '{__name__="te_request_inference_duration_sum",service="{{ include "tei.fullname" .Subcharts.tei }}"}'
# Average request latency from TEI histograms, over 1 min
metricsQuery: 'rate(te_request_inference_duration_sum{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>}[1m]))'
name:
matches: ^te_request_inference_duration_sum
as: "{{ include "tei.metricPrefix" .Subcharts.tei }}_request_latency"
{{- end }}
resources:
overrides:
namespace: {resource: "namespace"}
@@ -1137,7 +1137,7 @@
"uid": "${Metrics}"
},
"editorMode": "code",
"expr": "sum by (service)(rate(tgi_request_mean_time_per_token_duration_count{service=\"$release-tgi\",namespace=\"$namespace\"}[$__rate_interval]))",
"expr": "sum by (service)(rate(tgi_request_generated_tokens_sum{service=\"$release-tgi\",namespace=\"$namespace\"}[$__rate_interval]))",
"hide": false,
"instant": false,
"legendFormat": "TGI",
21 changes: 8 additions & 13 deletions helm-charts/common/tei/templates/horizontal-pod-autoscaler.yaml
@@ -1,7 +1,7 @@
{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
@@ -22,24 +22,19 @@ spec:
kind: Service
name: {{ include "tei.fullname" . }}
target:
{{- if .Values.accelDevice }}
# Metric is sum from all pods. "AverageValue" divides value returned from
# the custom metrics API by the number of Pods before comparing to the target:
# the custom metrics API by the number of Pods before comparing to the target
# (pods need to be in Ready state faster than specified stabilization window):
# https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details
# https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-multiple-metrics-and-custom-metrics
type: AverageValue
averageValue: 15
metric:
name: {{ include "tei.metricPrefix" . }}_queue_size_sum
{{- if .Values.accelDevice }}
averageValue: {{ .Values.autoscaling.queueSizeTarget.accel }}
{{- else }}
# Metric is average for all the pods. To avoid replica fluctuation when pod
# startup + request processing takes longer than HPA evaluation period, this uses
# "Value" (replicas = metric.value / target.value), instead of "AverageValue" type.
type: Value
value: 4 # seconds
metric:
name: {{ include "tei.metricPrefix" . }}_request_latency
averageValue: {{ .Values.autoscaling.queueSizeTarget.cpu }}
{{- end }}
metric:
name: {{ include "tei.metricPrefix" . }}_queue_size_sum
behavior:
scaleDown:
stabilizationWindowSeconds: 180
5 changes: 4 additions & 1 deletion helm-charts/common/tei/values.yaml
@@ -12,9 +12,12 @@ replicaCount: 1
# - Requires custom metrics ConfigMap available in the main application chart
# - https://kubernetes.io/docs/concepts/workloads/autoscaling/
autoscaling:
enabled: false
minReplicas: 1
maxReplicas: 2
enabled: false
queueSizeTarget:
accel: 10
cpu: 10

port: 2081
shmSize: 1Gi
helm-charts/common/teirerank/templates/horizontal-pod-autoscaler.yaml
@@ -1,7 +1,7 @@
{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
@@ -22,24 +22,19 @@ spec:
kind: Service
name: {{ include "teirerank.fullname" . }}
target:
{{- if .Values.accelDevice }}
# Metric is sum from all pods. "AverageValue" divides value returned from
# the custom metrics API by the number of Pods before comparing to the target:
# the custom metrics API by the number of Pods before comparing to the target
# (pods need to be in Ready state faster than specified stabilization window):
# https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details
# https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-multiple-metrics-and-custom-metrics
type: AverageValue
averageValue: 15
metric:
name: {{ include "teirerank.metricPrefix" . }}_queue_size_sum
{{- if .Values.accelDevice }}
averageValue: {{ .Values.autoscaling.queueSizeTarget.accel }}
{{- else }}
# Metric is average for all the pods. To avoid replica fluctuation when pod
# startup + request processing takes longer than HPA evaluation period, this uses
# "Value" (replicas = metric.value / target.value), instead of "AverageValue" type.
type: Value
value: 4 # seconds
metric:
name: {{ include "teirerank.metricPrefix" . }}_request_latency
averageValue: {{ .Values.autoscaling.queueSizeTarget.cpu }}
{{- end }}
metric:
name: {{ include "teirerank.metricPrefix" . }}_queue_size_sum
behavior:
scaleDown:
stabilizationWindowSeconds: 180