
Conversation

@eero-t (Collaborator) commented Jun 3, 2025

Description

  • Fix the OPEA app metrics Grafana dashboard query for the TGI next-token rate
  • Disable vLLM warmup when HPA is used; otherwise vLLM startup takes too long
  • Use .Series, .GroupBy and .LabelMatchers to simplify the Prometheus-adapter custom metric rules
  • Document autoscaling metric considerations, the currently used metrics, and the pros & cons of other metrics
  • Change vLLM to be scaled by the same metric KubeAI uses (active requests), and add a Helm variable for its threshold (see the values sketch after this list)
  • Drop the request latency metric for TGI/TEI scaling; it is too dependent on input & output token counts to be useful
    • CPU scaling now uses the same metric as accelerated scaling
  • Add a Helm variable for the TGI/TEI queue size scaling threshold, with separate CPU and acceleration values
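
As a hedged illustration of the new vLLM-side variables, the sketch below shows how they might be set in a Helm values override; VLLM_SKIP_WARMUP and activeRequestsTarget are names used by this PR, but the exact nesting and the numeric value are assumptions:

```yaml
# Illustrative sketch only: nesting and numbers are assumptions;
# VLLM_SKIP_WARMUP and activeRequestsTarget are names used by this PR.
vllm:
  VLLM_SKIP_WARMUP: "true"     # warmup disabled so HPA-added replicas become ready faster
  autoscaling:
    enabled: true
    activeRequestsTarget: 32   # running + waiting requests per replica; tune per model/HW
```

The TGI/TEI queueSizeTarget variables follow the same pattern, with separate values for CPU and accelerated deployments (a CPU-side sketch is shown in the Tests section below).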

Issues

n/a.

Type of change

List the type of change as below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)

Nothing should break, but scaling will change somewhat.

Dependencies

n/a.

Tests

Manually tested Gaudi scaling and its thresholds.

The changes also affect CPU scaling, but its threshold variable values were not tested. It would be good if somebody could test those too (see the override sketch below for one possible starting point).
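
As one hedged sketch of what a CPU-side test override could look like (passed to Helm with -f); the queueSizeTarget.cpu path follows this PR's variable naming, but the threshold numbers and per-subchart nesting are assumptions:

```yaml
# Hypothetical CPU-scaling test override; threshold numbers are illustrative.
tgi:
  autoscaling:
    enabled: true
    queueSizeTarget:
      cpu: 4        # queued requests per CPU replica before scaling up
tei:
  autoscaling:
    enabled: true
    queueSizeTarget:
      cpu: 4
```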

@eero-t eero-t requested review from lianhao and yongfengdu as code owners June 3, 2025 18:38
@eero-t eero-t requested review from Copilot and removed request for lianhao and yongfengdu June 3, 2025 18:42
Copilot AI left a comment

Pull Request Overview

This PR reworks the autoscaling metrics and their configuration for multiple inference services by standardizing threshold parameters and updating HPA templates and queries.

  • Updated Helm chart values to include new autoscaling parameters (activeRequestsTarget, queueSizeTarget) for each service.
  • Modified HPA templates to reference new metric names and threshold values.
  • Enhanced documentation to detail autoscaling metric considerations and principles.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
| --- | --- |
| helm-charts/common/vllm/values.yaml | Added new autoscaling flag and activeRequestsTarget thresholds for vLLM. |
| helm-charts/common/vllm/templates/horizontal-pod-autoscaler.yaml | Updated metric name and threshold retrieval for active requests in vLLM. |
| helm-charts/common/tgi/values.yaml | Introduced queueSizeTarget thresholds for TGI. |
| helm-charts/common/tgi/templates/horizontal-pod-autoscaler.yaml | Revised metric type and threshold usage for TGI autoscaling logic. |
| helm-charts/common/teirerank/values.yaml and templates | Similar changes applied to teirerank as in TGI. |
| helm-charts/common/tei/values.yaml and templates | Similar changes applied to tei as in TGI and teirerank. |
| helm-charts/common/dashboard/templates/configmap-metrics.yaml | Updated Prometheus query for TGI dashboards. |
| helm-charts/chatqna/templates/custom-metrics-configmap.yaml | Aligned custom metrics queries with new templating for autoscaling metrics. |
| helm-charts/chatqna/hpa-values.yaml | Updated autoscaling configuration and added vLLM-specific warmup override variable. |
| helm-charts/HPA.md | Expanded autoscaling documentation with detailed scaling metric considerations. |
Comments suppressed due to low confidence (2)

helm-charts/chatqna/hpa-values.yaml:29

  • [nitpick] Consider aligning the naming convention of the 'VLLM_SKIP_WARMUP' variable with other autoscaling configuration variables to improve consistency.
VLLM_SKIP_WARMUP: "true"

helm-charts/common/tgi/templates/horizontal-pod-autoscaler.yaml:34

  • The CPU branch now uses 'AverageValue' instead of the previously used 'Value' type. Please confirm that this change reflects the intended autoscaling behavior for CPU-based scaling.
averageValue: {{ .Values.autoscaling.queueSizeTarget.cpu }}
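
For reference, a minimal sketch of the kind of autoscaling/v2 metric stanza that line belongs to; the Pods metric type and the tgi_queue_size metric name are assumptions here, not necessarily what the chart template renders:

```yaml
# Hedged HPA metric sketch; only the averageValue line is taken from the PR.
metrics:
- type: Pods
  pods:
    metric:
      name: tgi_queue_size
    target:
      type: AverageValue
      averageValue: {{ .Values.autoscaling.queueSizeTarget.cpu }}
```

Note that Pods-type metrics only accept AverageValue targets in the autoscaling/v2 API, whereas Object metrics can use either Value or AverageValue, which is relevant to the Value vs. AverageValue question above.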

@eero-t (Collaborator, Author) commented Jun 4, 2025

@lianhao, @yongfengdu a dozen CI tests fail due to issues unrelated to this PR. Any ideas?

Most app-level CI tests fail with:

+ .github/workflows/scripts/e2e/chart_test.sh check_local_opea_image 100.83.122.251:5000/opea/nginx:latest
Failed to get image manifest 100.83.122.251:5000/opea/nginx:latest

And vLLM-level tests fail with similar errors, for both Gaudi and CPU. Gaudi:

 [pod/vllm03185935-5ff9cc58ff-9d2cd/vllm]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post2.dev0+g6af2f675.d20250603.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 134, in __init__
[pod/vllm03185935-5ff9cc58ff-9d2cd/vllm]     self.q_size = self.num_heads * self.head_dim
[pod/vllm03185935-5ff9cc58ff-9d2cd/vllm] TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'

And the CPU one:

[pod/vllm03185407-854d874cdf-g4gfr/vllm]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 135, in __init__
[pod/vllm03185407-854d874cdf-g4gfr/vllm]     self.rotary_dim = int(partial_rotary_factor * self.head_dim)
[pod/vllm03185407-854d874cdf-g4gfr/vllm]                           ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
[pod/vllm03185407-854d874cdf-g4gfr/vllm] TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'

eero-t added 6 commits June 23, 2025 12:36
* Use .Series, .GroupBy and .LabelMatchers to simplify rules
  (see the rule sketch after this commit message)
* Drop the request latency metric for TGI/TEI.  Because it depends on
  the number of generated tokens, it is unsuitable as a generic metric
  - With that, support for the HPA Value type could also be dropped
    (leaving only queue size AverageValue)
* Because the vLLM mean token latency metric does not react much
  to vLLM load, and for consistency with TGI/TEI, switch vLLM
  to also be scaled based on queue size
  - KubeAI also scales vLLM based on queue size
* Add queue size target Helm variables for all inferencing engines

Signed-off-by: Eero Tamminen <[email protected]>
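
For context, <<.Series>>, <<.LabelMatchers>> and <<.GroupBy>> are prometheus-adapter's template placeholders for custom-metric rules; a minimal rule using them could look like the sketch below, where the tgi_queue_size metric name and the label-to-resource mapping are illustrative assumptions rather than the chart's actual configuration:

```yaml
# Hedged prometheus-adapter rule sketch; the <<...>> placeholders are
# adapter syntax, the metric name and label overrides are assumptions.
rules:
- seriesQuery: '{__name__="tgi_queue_size",namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```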
The TGI and TEI queue size metric counts only requests that are waiting
to be processed, so that number can fluctuate a lot: it is non-zero
only when a pod is fully utilized and needs to buffer requests.  On the
plus side, it is agnostic to how fast an engine instance can process
queries for a given model.

vLLM also provides a gauge metric for how many requests are currently
being processed (running).  Adding that to the waiting requests count
(queue size) makes the resulting metric much more stable, and allows
scaling up extra replicas before the current ones are full.  KubeAI
autoscaling is also based on the number of active requests, so results
will be more comparable (see the query sketch after this commit message).

However, this means that the suitable threshold will be model and
engine config specific (depending on how many request batches the HW
can run in parallel).

Signed-off-by: Eero Tamminen <[email protected]>
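
As a hedged sketch, the combined "active requests" expression could be wired into the custom-metrics rule roughly like this; vllm:num_requests_running and vllm:num_requests_waiting are vLLM's gauge names for running and queued requests, while the aggregation shown here is an assumption, not the PR's exact query:

```yaml
# Illustrative metricsQuery summing running + waiting requests per group;
# gauge names assumed from vLLM's Prometheus metrics.
metricsQuery: >-
  sum(vllm:num_requests_running{<<.LabelMatchers>>}) by (<<.GroupBy>>)
  + sum(vllm:num_requests_waiting{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```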
@marquiz (Collaborator) left a comment

Mostly proofreading from my side. Cannot comment on the specific helm parameters or metrics.

Some suggestions, but this can be merged even as is.

Use the same formatting for all notes.

Signed-off-by: Eero Tamminen <[email protected]>
@marquiz (Collaborator) left a comment

Thank you @eero-t. LGTM from me, FWIW

@eero-t (Collaborator, Author) commented Jun 23, 2025

@chensuyue, @lianhao, @yongfengdu the earlier listed CI issues are gone, but these still remain...

The ROCm CI tests remain in a pending state until CI gives up.

The ChatQnA Gaudi vLLM test just exits with error code 1, without the log showing any errors.

FaqGen Gaudi TGI test:

[pod/chatqna23094245-tgi-5556fcbb78-pczg9/tgi] Error: ShardCannotStart
+ exit 1

The Qdrant test fails with:

Error: could not download https://github.com/qdrant/qdrant-helm/releases/download/qdrant-1.13.1/qdrant-1.13.1.tgz: no cached repo found. (try 'helm repo update'): error loading /home/sdp/.cache/helm/repository/bitnami-index.yaml: empty index.yaml file
Error: Process completed with exit code 1.

And what's worse, although the Helm install step failed, CI still tried to run the e2e test + Helm uninstall!

@eero-t (Collaborator, Author) commented Jun 25, 2025

@lianhao Could you review this?

@lianhao (Collaborator) commented Jun 26, 2025

PR #1132 should make the vLLM CI happy.

@poussa poussa merged commit b2990c4 into opea-project:main Jun 26, 2025
48 of 64 checks passed
@eero-t eero-t deleted the hpa-warmup branch June 26, 2025 08:59