76 changes: 71 additions & 5 deletions helm-charts/HPA.md
@@ -12,6 +12,10 @@
- [Install](#install)
- [Post-install](#post-install)
- [Verify](#verify)
- [Scaling metric considerations](#scaling-metric-considerations)
- [Autoscaling principles](#autoscaling-principles)
- [Current scaling metrics](#current-scaling-metrics)
- [Other potential metrics](#other-potential-metrics)

## Introduction

@@ -62,8 +66,8 @@ $ helm install prometheus-adapter prometheus-community/prometheus-adapter --ver
--set prometheus.prometheusSpec.serviceMonitorSelectorNilUsesHelmValues=false
```

NOTE: the service name given above in `prometheus.url` must match the listed Prometheus service name,
otherwise adapter cannot access it!
> **NOTE**: the service name given above in `prometheus.url` must match the listed Prometheus
> service name, otherwise adapter cannot access it!

(Alternative for setting the above `prometheusSpec` variable to `false` is making sure that
`prometheusRelease` value in top-level chart matches the release name given to the Prometheus
@@ -130,6 +134,68 @@ watch -n 5 scale-monitor-helm.sh default chatqna

(Assumes that HPA scaled chart is installed to `default` namespace with `chatqna` release name.)

**NOTE**: inferencing services provide metrics only after they've processed their first request.
The reranking service is used only after the query context data has been uploaded. Until then,
no metrics will be available for them.
> **NOTE**: inferencing services provide metrics only after they've processed their first request.
> The reranking service is used only after the query context data has been uploaded. Until then,
> no metrics will be available for them.

## Scaling metric considerations

### Autoscaling principles

The model, underlying HW, and engine parameters should be selected so that a single engine
instance can satisfy the service SLA (Service Level Agreement) requirements for its own requests,
even when it is becoming saturated. Autoscaling is then intended to scale up the service so that
requests can be directed to unsaturated instances.

The problem is finding a good metric, and a threshold for it, that indicates this saturation
point. Preferably the metric should anticipate that point, so that the startup delay of new
engine instances does not cause SLA breakage (or, in the worst case, requests being rejected
once the engine queue fills up).

> **NOTE**: Another problem is Kubernetes service routing sending requests (also) to already
> saturated instances instead of idle ones. Using [KubeAI](../kubeai/#readme) (instead of HPA)
> to manage both engine scaling and query routing can solve that.
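
To make this concrete, below is a minimal sketch of an HPA that scales on such a saturation
metric. All names and the threshold value are placeholders chosen for illustration; the actual
HPA and adapter rules are generated by the subchart templates.

```yaml
# Minimal sketch only - names and the threshold are placeholders.
# Scales an inference Deployment between 1 and 4 replicas, targeting an
# average of 10 "active requests" per replica. The custom metric must be
# exposed through the custom metrics API (e.g. by prometheus-adapter).
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: my-llm-engine
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: my-llm-engine
  minReplicas: 1
  maxReplicas: 4
  metrics:
  - type: Object
    object:
      metric:
        name: my_llm_active_request_sum
      describedObject:
        # Service whose labels the custom metric query is matched against
        apiVersion: v1
        kind: Service
        name: my-llm-engine
      target:
        # Metric is a sum over all pods; "AverageValue" divides it by the
        # current replica count before comparing to the target
        type: AverageValue
        averageValue: "10"
  behavior:
    scaleDown:
      # Avoid scaling down immediately when load drops only momentarily
      stabilizationWindowSeconds: 180
```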

### Current scaling metrics

The following inference engine metrics are used to autoscale their replica counts:

- vLLM: active requests, i.e. the count of waiting (queued) + (already) running requests
  - A good overall scaling metric, also used by [KubeAI](../kubeai/#readme) for scaling vLLM; see the adapter rule sketch after this list
  - The threshold depends on how many requests the underlying HW / engine config can process in parallel for the given model
- TGI / TEI: queue size, i.e. how many requests are waiting to be processed
  - Used because TGI and TEI do not offer a metric for (already) running requests, only for waiting ones
  - Independent of the used model, so it works well as an example, but not that good for production, because
    scaling happens late and fluctuates a lot (the metric stays at zero until the engine is saturated)
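
As a sketch of how the vLLM metric can be exposed for HPA, the prometheus-adapter rule below sums
the engine's running and waiting request gauges into a single custom metric. It is a simplified,
hand-edited version of the rule this chart generates; the service name and the exposed metric
name are placeholders.

```yaml
rules:
  # Placeholder service name; the chart templates in the real one
- seriesQuery: '{__name__="vllm:num_requests_waiting",service="my-vllm"}'
  # Active requests = requests already running + requests waiting in the queue
  metricsQuery: 'sum by (<<.GroupBy>>)(vllm:num_requests_running{<<.LabelMatchers>>} + <<.Series>>{<<.LabelMatchers>>})'
  name:
    matches: ^vllm:num_requests_waiting
    as: "my_vllm_active_request_sum"
  resources:
    # HPA needs namespace + Service object resolution for its query paths
    overrides:
      namespace: {resource: "namespace"}
      service: {resource: "service"}
```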

### Other potential metrics

All the metrics provided by the inference engines are listed in their documentation:

- [vLLM metrics](https://docs.vllm.ai/en/v0.8.5/serving/metrics.html)
- [Metric design](https://docs.vllm.ai/en/v0.8.5/design/v1/metrics.html)
- [TGI metrics](https://huggingface.co/docs/text-generation-inference/en/reference/metrics)
- TEI (embed and reranking) services provide a subset of these TGI metrics

The OPEA application [dashboard](monitoring.md#dashboards) provides (Prometheus query) examples
for deriving service performance metrics from the engines' histogram metrics.

Their suitability for autoscaling:

- Request latency, requests per second (RPS) - not suitable
  - Depend completely on input and output token counts, and indicate past performance rather than incoming load
- First token latency (TTFT) - potential
  - Relevance depends on the use case: how many tokens are used and what matters for it
- Next token latency (TPOT, ITL), tokens per second (TPS) - potential
  - Relevance depends on the use case: how many tokens are used and what matters for it

Performance metrics will be capped by the performance of the underlying engine setup.
Beyond a certain point, they no longer reflect the actual incoming load or indicate how
much scaling is needed.

Therefore such metrics could be used in production _when_ their thresholds are carefully
fine-tuned and rechecked every time the underlying setup (model, HW, engine config) changes.
In OPEA Helm charts that setup is user selectable, so such metrics are unsuitable for the
autoscaling examples.
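
If you nevertheless want to experiment with a latency-based metric, the sketch below shows one
way to expose average per-request inference latency via prometheus-adapter, using the usual
`rate(sum) / rate(count)` pattern over TGI's histogram counters. The service and metric names
are placeholders, and the resulting metric would still need a per-setup threshold.

```yaml
rules:
- seriesQuery: '{__name__="tgi_request_inference_duration_sum",service="my-tgi"}'
  # Average request latency over the last minute: rate of summed durations
  # divided by rate of request count (0.001 keeps the divisor non-zero)
  metricsQuery: >-
    rate(tgi_request_inference_duration_sum{<<.LabelMatchers>>}[1m]) /
    (0.001 + rate(tgi_request_inference_duration_count{<<.LabelMatchers>>}[1m]))
  name:
    matches: ^tgi_request_inference_duration_sum
    as: "my_tgi_request_latency"
  resources:
    overrides:
      namespace: {resource: "namespace"}
      service: {resource: "service"}
```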

(General [explanation](https://docs.nvidia.com/nim/benchmarking/llm/latest/metrics.html) on how these metrics are measured.)
48 changes: 34 additions & 14 deletions helm-charts/chatqna/hpa-values.yaml
@@ -1,44 +1,64 @@
# Copyright (C) 2024 Intel Corporation
# Copyright (C) 2024-2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

# Enable HorizontalPodAutoscaler (HPA)
# Enable HorizontalPodAutoscaler (HPA) for ChatQnA and its components
#
# That will overwrite named PrometheusAdapter configMap with ChatQnA specific
# custom metric queries for embedding, reranking, and LLM services.
# Will create configMap with ChatQnA specific custom metric queries for embedding, reranking,
# and LLM inferencing services, which can be used to overwrite current PrometheusAdapter rules.
# This will then provide custom metrics used by HorizontalPodAutoscaler rules of each service.
#
# Default upstream configMap is in:
# Default upstream adapter configMap is in:
# - https://github.com/kubernetes-sigs/prometheus-adapter/blob/master/deploy/manifests/config-map.yaml

dashboard:
scaling: true

autoscaling:
enabled: true

global:
# K8s custom metrics (used for scaling thresholds) are based on metrics from service monitoring
# Both Grafana dashboards and k8s custom metrics need (Prometheus) metrics for services
monitoring: true

# Override values in specific subcharts
#
# Note: enabling "autoscaling" for any of the subcharts requires enabling it also above!

dashboard:
# add also scaling metrics dashboard to Grafana
scaling: true

# Enabling "autoscaling" for any of the subcharts requires enabling it also above!
vllm:
# vLLM startup takes too long for autoscaling, especially with Gaudi
VLLM_SKIP_WARMUP: "true"
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 4
enabled: true
activeRequestsTarget:
accel: 120
cpu: 10

tgi:
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 4
enabled: true
queueSizeTarget:
accel: 10
cpu: 10

teirerank:
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 3
enabled: true
queueSizeTarget:
accel: 10
cpu: 10

tei:
autoscaling:
enabled: true
minReplicas: 1
maxReplicas: 2
enabled: true
queueSizeTarget:
accel: 10
cpu: 10
56 changes: 14 additions & 42 deletions helm-charts/chatqna/templates/custom-metrics-configmap.yaml
@@ -1,30 +1,28 @@
# Copyright (C) 2024 Intel Corporation
{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
# Copyright (C) 2024-2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
apiVersion: v1
kind: ConfigMap
metadata:
# easy to find for the required manual step
# easy to find for the manual step required to install this for Prometheus-adapter
namespace: default
name: {{ include "chatqna.fullname" . }}-custom-metrics
labels:
app.kubernetes.io/name: prometheus-adapter
data:
config.yaml: |
rules:
{{- if and .Values.vllm.enabled .Values.vllm.autoscaling.enabled }}
# check metric with:
# kubectl get --raw /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/<metric> | jq
#
- seriesQuery: '{__name__="vllm:time_per_output_token_seconds_sum",service="{{ include "vllm.fullname" .Subcharts.vllm }}"}'
# Average output token latency from vLLM histograms, over 1 min
# (interval should be at least 4x serviceMonitor query interval,
# 0.001 divider add is to make sure there's always a valid value)
metricsQuery: 'rate(vllm:time_per_output_token_seconds_sum{service="{{ include "vllm.fullname" .Subcharts.vllm }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(vllm:time_per_output_token_seconds_count{service="{{ include "vllm.fullname" .Subcharts.vllm }}",<<.LabelMatchers>>}[1m]))'
{{- if and .Values.vllm.enabled .Values.vllm.autoscaling.enabled }}
- seriesQuery: '{__name__="vllm:num_requests_waiting",service="{{ include "vllm.fullname" .Subcharts.vllm }}"}'
# Sum of active requests in pods, both ones already being processed, and ones waiting to be processed
metricsQuery: 'sum by (<<.GroupBy>>)(vllm:num_requests_running{<<.LabelMatchers>>} + <<.Series>>{<<.LabelMatchers>>})'
name:
matches: ^vllm:time_per_output_token_seconds_sum
as: "{{ include "vllm.metricPrefix" .Subcharts.vllm }}_token_latency"
matches: ^vllm:num_requests_waiting
as: "{{ include "vllm.metricPrefix" .Subcharts.vllm }}_active_request_sum"
resources:
# HPA needs both namespace + suitable object resource for its query paths:
# /apis/custom.metrics.k8s.io/v1beta1/namespaces/default/service/*/<metric>
@@ -34,63 +32,37 @@ data:
service: {resource: "service"}
{{- end }}
{{- if and .Values.tgi.enabled .Values.tgi.autoscaling.enabled }}
{{- if .Values.tgi.accelDevice }}
- seriesQuery: '{__name__="tgi_queue_size",service="{{ include "tgi.fullname" .Subcharts.tgi }}"}'
# TGI instances queue_size sum
metricsQuery: 'sum by (namespace,service) (tgi_queue_size{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>})'
# - GroupBy/LabelMatches provide labels from resources section
metricsQuery: 'sum by (<<.GroupBy>>)(<<.Series>>{<<.LabelMatchers>>})'
name:
matches: ^tgi_queue_size
as: "{{ include "tgi.metricPrefix" .Subcharts.tgi }}_queue_size_sum"
{{- else }}
- seriesQuery: '{__name__="tgi_request_inference_duration_sum",service="{{ include "tgi.fullname" .Subcharts.tgi }}"}'
# Average request latency from TGI histograms, over 1 min
metricsQuery: 'rate(tgi_request_inference_duration_sum{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(tgi_request_inference_duration_count{service="{{ include "tgi.fullname" .Subcharts.tgi }}",<<.LabelMatchers>>}[1m]))'
name:
matches: ^tgi_request_inference_duration_sum
as: "{{ include "tgi.metricPrefix" .Subcharts.tgi }}_request_latency"
{{- end }}
resources:
overrides:
namespace: {resource: "namespace"}
service: {resource: "service"}
{{- end }}
{{- if .Values.teirerank.autoscaling.enabled }}
{{- if .Values.teirerank.accelDevice }}
{{- if and .Values.teirerank.enabled .Values.teirerank.autoscaling.enabled }}
- seriesQuery: '{__name__="te_queue_size",service="{{ include "teirerank.fullname" .Subcharts.teirerank }}"}'
# TEI instances queue_size sum
metricsQuery: 'sum by (namespace,service) (te_queue_size{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>})'
metricsQuery: 'sum by (<<.GroupBy>>)(<<.Series>>{<<.LabelMatchers>>})'
name:
matches: ^te_queue_size
as: "{{ include "teirerank.metricPrefix" .Subcharts.teirerank }}_queue_size_sum"
{{- else }}
- seriesQuery: '{__name__="te_request_inference_duration_sum",service="{{ include "teirerank.fullname" .Subcharts.teirerank }}"}'
# Average request latency from TEI histograms, over 1 min
metricsQuery: 'rate(te_request_inference_duration_sum{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="{{ include "teirerank.fullname" .Subcharts.teirerank }}",<<.LabelMatchers>>}[1m]))'
name:
matches: ^te_request_inference_duration_sum
as: "{{ include "teirerank.metricPrefix" .Subcharts.teirerank }}_request_latency"
{{- end }}
resources:
overrides:
namespace: {resource: "namespace"}
service: {resource: "service"}
{{- end }}
{{- if .Values.tei.autoscaling.enabled }}
{{- if .Values.tei.accelDevice }}
- seriesQuery: '{__name__="te_queue_size",service="{{ include "tei.fullname" .Subcharts.tei }}"}'
# TEI instances queue_size sum
metricsQuery: 'sum by (namespace,service) (te_queue_size{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>})'
metricsQuery: 'sum by (<<.GroupBy>>)(<<.Series>>{<<.LabelMatchers>>})'
name:
matches: ^te_queue_size
as: "{{ include "tei.metricPrefix" .Subcharts.tei }}_queue_size_sum"
{{- else }}
- seriesQuery: '{__name__="te_request_inference_duration_sum",service="{{ include "tei.fullname" .Subcharts.tei }}"}'
# Average request latency from TEI histograms, over 1 min
metricsQuery: 'rate(te_request_inference_duration_sum{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>}[1m]) / (0.001+rate(te_request_inference_duration_count{service="{{ include "tei.fullname" .Subcharts.tei }}",<<.LabelMatchers>>}[1m]))'
name:
matches: ^te_request_inference_duration_sum
as: "{{ include "tei.metricPrefix" .Subcharts.tei }}_request_latency"
{{- end }}
resources:
overrides:
namespace: {resource: "namespace"}
@@ -1137,7 +1137,7 @@
"uid": "${Metrics}"
},
"editorMode": "code",
"expr": "sum by (service)(rate(tgi_request_mean_time_per_token_duration_count{service=\"$release-tgi\",namespace=\"$namespace\"}[$__rate_interval]))",
"expr": "sum by (service)(rate(tgi_request_generated_tokens_sum{service=\"$release-tgi\",namespace=\"$namespace\"}[$__rate_interval]))",
"hide": false,
"instant": false,
"legendFormat": "TGI",
21 changes: 8 additions & 13 deletions helm-charts/common/tei/templates/horizontal-pod-autoscaler.yaml
@@ -1,7 +1,7 @@
{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
@@ -22,24 +22,19 @@ spec:
kind: Service
name: {{ include "tei.fullname" . }}
target:
{{- if .Values.accelDevice }}
# Metric is sum from all pods. "AverageValue" divides value returned from
# the custom metrics API by the number of Pods before comparing to the target:
# the custom metrics API by the number of Pods before comparing to the target
# (pods need to be in Ready state faster than specified stabilization window):
# https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details
# https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-multiple-metrics-and-custom-metrics
type: AverageValue
averageValue: 15
metric:
name: {{ include "tei.metricPrefix" . }}_queue_size_sum
{{- if .Values.accelDevice }}
averageValue: {{ .Values.autoscaling.queueSizeTarget.accel }}
{{- else }}
# Metric is average for all the pods. To avoid replica fluctuation when pod
# startup + request processing takes longer than HPA evaluation period, this uses
# "Value" (replicas = metric.value / target.value), instead of "AverageValue" type.
type: Value
value: 4 # seconds
metric:
name: {{ include "tei.metricPrefix" . }}_request_latency
averageValue: {{ .Values.autoscaling.queueSizeTarget.cpu }}
{{- end }}
metric:
name: {{ include "tei.metricPrefix" . }}_queue_size_sum
behavior:
scaleDown:
stabilizationWindowSeconds: 180
5 changes: 4 additions & 1 deletion helm-charts/common/tei/values.yaml
@@ -12,9 +12,12 @@ replicaCount: 1
# - Requires custom metrics ConfigMap available in the main application chart
# - https://kubernetes.io/docs/concepts/workloads/autoscaling/
autoscaling:
enabled: false
minReplicas: 1
maxReplicas: 2
enabled: false
queueSizeTarget:
accel: 10
cpu: 10

port: 2081
shmSize: 1Gi
helm-charts/common/teirerank/templates/horizontal-pod-autoscaler.yaml
@@ -1,7 +1,7 @@
{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
# Copyright (C) 2024 Intel Corporation
# SPDX-License-Identifier: Apache-2.0

{{- if and .Values.global.monitoring .Values.autoscaling.enabled }}
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
@@ -22,24 +22,19 @@ spec:
kind: Service
name: {{ include "teirerank.fullname" . }}
target:
{{- if .Values.accelDevice }}
# Metric is sum from all pods. "AverageValue" divides value returned from
# the custom metrics API by the number of Pods before comparing to the target:
# the custom metrics API by the number of Pods before comparing to the target
# (pods need to be in Ready state faster than specified stabilization window):
# https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/#algorithm-details
# https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale-walkthrough/#autoscaling-on-multiple-metrics-and-custom-metrics
type: AverageValue
averageValue: 15
metric:
name: {{ include "teirerank.metricPrefix" . }}_queue_size_sum
{{- if .Values.accelDevice }}
averageValue: {{ .Values.autoscaling.queueSizeTarget.accel }}
{{- else }}
# Metric is average for all the pods. To avoid replica fluctuation when pod
# startup + request processing takes longer than HPA evaluation period, this uses
# "Value" (replicas = metric.value / target.value), instead of "AverageValue" type.
type: Value
value: 4 # seconds
metric:
name: {{ include "teirerank.metricPrefix" . }}_request_latency
averageValue: {{ .Values.autoscaling.queueSizeTarget.cpu }}
{{- end }}
metric:
name: {{ include "teirerank.metricPrefix" . }}_queue_size_sum
behavior:
scaleDown:
stabilizationWindowSeconds: 180