
Conversation

@eero-t (Collaborator) commented Jun 3, 2025

Description

  • Fix the OPEA app metrics Grafana dashboard query for the TGI next-token rate
  • Disable vLLM warmup when HPA is used; otherwise vLLM startup takes too long
  • Use .Series, .GroupBy and .LabelMatchers to simplify the Prometheus-adapter custom metric rules
  • Document autoscaling metric considerations, the currently used metrics, and the pros & cons of other metrics
  • Change vLLM to be scaled by the same metric KubeAI uses (active requests), and add a Helm variable for its threshold (see the values sketch after this list)
  • Drop the request latency metric for TGI/TEI scaling; it is too dependent on input & output token counts to be useful
    • CPU scaling now uses the same metric as accelerated scaling
  • Add a Helm variable for the TGI/TEI queue size scaling threshold, with separate CPU and acceleration values
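
As a hedged illustration of the new vLLM-side variables, the sketch below shows how they might be set in a Helm values override; VLLM_SKIP_WARMUP and activeRequestsTarget are names used by this PR, but the exact nesting and the numeric value are assumptions:

```yaml
# Illustrative sketch only: nesting and numbers are assumptions;
# VLLM_SKIP_WARMUP and activeRequestsTarget are names used by this PR.
vllm:
  VLLM_SKIP_WARMUP: "true"     # warmup disabled so HPA-added replicas become ready faster
  autoscaling:
    enabled: true
    activeRequestsTarget: 32   # running + waiting requests per replica; tune per model/HW
```

The TGI/TEI queueSizeTarget variables follow the same pattern, with separate values for CPU and accelerated deployments (a CPU-side sketch is shown in the Tests section below).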

Issues

n/a.

Type of change

List the type of change as below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)

Nothing should break, but scaling will change somewhat.

Dependencies

n/a.

Tests

Manually tested Gaudi scaling and its thresholds.

The changes also affect CPU scaling, but its threshold variable values were not tested. It would be good if somebody could test those too (see the override sketch below for one possible starting point).
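
As one hedged sketch of what a CPU-side test override could look like (passed to Helm with -f); the queueSizeTarget.cpu path follows this PR's variable naming, but the threshold numbers and per-subchart nesting are assumptions:

```yaml
# Hypothetical CPU-scaling test override; threshold numbers are illustrative.
tgi:
  autoscaling:
    enabled: true
    queueSizeTarget:
      cpu: 4        # queued requests per CPU replica before scaling up
tei:
  autoscaling:
    enabled: true
    queueSizeTarget:
      cpu: 4
```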

@eero-t eero-t requested review from lianhao and yongfengdu as code owners June 3, 2025 18:38
@eero-t eero-t requested review from Copilot and removed request for lianhao and yongfengdu June 3, 2025 18:42
Copilot AI left a comment

Pull Request Overview

This PR reworks the autoscaling metrics and their configuration for multiple inference services by standardizing threshold parameters and updating HPA templates and queries.

  • Updated Helm chart values to include new autoscaling parameters (activeRequestsTarget, queueSizeTarget) for each service.
  • Modified HPA templates to reference new metric names and threshold values.
  • Enhanced documentation to detail autoscaling metric considerations and principles.

Reviewed Changes

Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.

Summary per file:

| File | Description |
| --- | --- |
| helm-charts/common/vllm/values.yaml | Added new autoscaling flag and activeRequestsTarget thresholds for vLLM. |
| helm-charts/common/vllm/templates/horizontal-pod-autoscaler.yaml | Updated metric name and threshold retrieval for active requests in vLLM. |
| helm-charts/common/tgi/values.yaml | Introduced queueSizeTarget thresholds for TGI. |
| helm-charts/common/tgi/templates/horizontal-pod-autoscaler.yaml | Revised metric type and threshold usage for TGI autoscaling logic. |
| helm-charts/common/teirerank/values.yaml and templates | Similar changes applied to teirerank as in TGI. |
| helm-charts/common/tei/values.yaml and templates | Similar changes applied to tei as in TGI and teirerank. |
| helm-charts/common/dashboard/templates/configmap-metrics.yaml | Updated Prometheus query for TGI dashboards. |
| helm-charts/chatqna/templates/custom-metrics-configmap.yaml | Aligned custom metrics queries with new templating for autoscaling metrics. |
| helm-charts/chatqna/hpa-values.yaml | Updated autoscaling configuration and added vLLM-specific warmup override variable. |
| helm-charts/HPA.md | Expanded autoscaling documentation with detailed scaling metric considerations. |
Comments suppressed due to low confidence (2)

helm-charts/chatqna/hpa-values.yaml:29

  • [nitpick] Consider aligning the naming convention of the 'VLLM_SKIP_WARMUP' variable with other autoscaling configuration variables to improve consistency.
VLLM_SKIP_WARMUP: "true"

helm-charts/common/tgi/templates/horizontal-pod-autoscaler.yaml:34

  • The CPU branch now uses 'AverageValue' instead of the previously used 'Value' type. Please confirm that this change reflects the intended autoscaling behavior for CPU-based scaling.
averageValue: {{ .Values.autoscaling.queueSizeTarget.cpu }}
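
For reference, a minimal sketch of the kind of autoscaling/v2 metric stanza that line belongs to; the Pods metric type and the tgi_queue_size metric name are assumptions here, not necessarily what the chart template renders:

```yaml
# Hedged HPA metric sketch; only the averageValue line is taken from the PR.
metrics:
- type: Pods
  pods:
    metric:
      name: tgi_queue_size
    target:
      type: AverageValue
      averageValue: {{ .Values.autoscaling.queueSizeTarget.cpu }}
```

Note that Pods-type metrics only accept AverageValue targets in the autoscaling/v2 API, whereas Object metrics can use either Value or AverageValue, which is relevant to the Value vs. AverageValue question above.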

@eero-t (Collaborator, Author) commented Jun 4, 2025

@lianhao, @yongfengdu a dozen CI tests fail due to issues unrelated to this PR. Any ideas?

Most app-level CI tests fail with:

+ .github/workflows/scripts/e2e/chart_test.sh check_local_opea_image 100.83.122.251:5000/opea/nginx:latest
Failed to get image manifest 100.83.122.251:5000/opea/nginx:latest

And vLLM-level tests fail with similar errors, for both Gaudi and CPU. Gaudi:

 [pod/vllm03185935-5ff9cc58ff-9d2cd/vllm]   File "/usr/local/lib/python3.10/dist-packages/vllm-0.6.6.post2.dev0+g6af2f675.d20250603.gaudi000-py3.10.egg/vllm/model_executor/models/llama.py", line 134, in __init__
[pod/vllm03185935-5ff9cc58ff-9d2cd/vllm]     self.q_size = self.num_heads * self.head_dim
[pod/vllm03185935-5ff9cc58ff-9d2cd/vllm] TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'

And the CPU one:

[pod/vllm03185407-854d874cdf-g4gfr/vllm]   File "/opt/venv/lib/python3.12/site-packages/vllm/model_executor/models/llama.py", line 135, in __init__
[pod/vllm03185407-854d874cdf-g4gfr/vllm]     self.rotary_dim = int(partial_rotary_factor * self.head_dim)
[pod/vllm03185407-854d874cdf-g4gfr/vllm]                           ~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~~~~~~~
[pod/vllm03185407-854d874cdf-g4gfr/vllm] TypeError: unsupported operand type(s) for *: 'int' and 'NoneType'

eero-t added 6 commits June 23, 2025 12:36
* Use .Series, .GroupBy and .LabelMatchers to simplify rules
  (see the rule sketch after this commit message)
* Drop the request latency metric for TGI/TEI.  Because it depends on
  the number of generated tokens, it is unsuitable as a generic metric
  - With that, support for the HPA Value type could also be dropped
    (leaving only queue size AverageValue)
* Because the vLLM mean token latency metric does not react much
  to vLLM load, and for consistency with TGI/TEI, switch vLLM
  to also be scaled based on queue size
  - KubeAI also scales vLLM based on queue size
* Add queue size target Helm variables for all inferencing engines

Signed-off-by: Eero Tamminen <[email protected]>
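
For context, <<.Series>>, <<.LabelMatchers>> and <<.GroupBy>> are prometheus-adapter's template placeholders for custom-metric rules; a minimal rule using them could look like the sketch below, where the tgi_queue_size metric name and the label-to-resource mapping are illustrative assumptions rather than the chart's actual configuration:

```yaml
# Hedged prometheus-adapter rule sketch; the <<...>> placeholders are
# adapter syntax, the metric name and label overrides are assumptions.
rules:
- seriesQuery: '{__name__="tgi_queue_size",namespace!="",pod!=""}'
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```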
The TGI and TEI queue size metric counts only requests that are waiting
to be processed, so that number can fluctuate a lot: it is non-zero
only when a pod is fully utilized and needs to buffer requests.  On the
plus side, it is agnostic to how fast an engine instance can process
queries for a given model.

vLLM also provides a gauge metric for how many requests are currently
being processed (running).  Adding that to the waiting requests count
(queue size) makes the resulting metric much more stable, and allows
scaling up extra replicas before the current ones are full.  KubeAI
autoscaling is also based on the number of active requests, so results
will be more comparable (see the query sketch after this commit message).

However, this means that the suitable threshold will be model and
engine config specific (depending on how many request batches the HW
can run in parallel).

Signed-off-by: Eero Tamminen <[email protected]>
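
As a hedged sketch, the combined "active requests" expression could be wired into the custom-metrics rule roughly like this; vllm:num_requests_running and vllm:num_requests_waiting are vLLM's gauge names for running and queued requests, while the aggregation shown here is an assumption, not the PR's exact query:

```yaml
# Illustrative metricsQuery summing running + waiting requests per group;
# gauge names assumed from vLLM's Prometheus metrics.
metricsQuery: >-
  sum(vllm:num_requests_running{<<.LabelMatchers>>}) by (<<.GroupBy>>)
  + sum(vllm:num_requests_waiting{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```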
@marquiz (Collaborator) left a comment

Mostly proofreading from my side. Cannot comment on the specific helm parameters or metrics.

Some suggestions, but this can be merged even as is.

Use the same formatting for all notes.

Signed-off-by: Eero Tamminen <[email protected]>
@marquiz (Collaborator) left a comment

Thank you @eero-t. LGTM from me, FWIW

@eero-t (Collaborator, Author) commented Jun 23, 2025

@chensuyue, @lianhao, @yongfengdu the earlier listed CI issues are gone, but these still remain...

The ROCm CI tests remain in a pending state until CI gives up.

The ChatQnA Gaudi vLLM test just exits with error code 1, without the log showing any errors.

FaqGen Gaudi TGI test:

[pod/chatqna23094245-tgi-5556fcbb78-pczg9/tgi] Error: ShardCannotStart
+ exit 1

The Qdrant test fails with:

Error: could not download https://github.com/qdrant/qdrant-helm/releases/download/qdrant-1.13.1/qdrant-1.13.1.tgz: no cached repo found. (try 'helm repo update'): error loading /home/sdp/.cache/helm/repository/bitnami-index.yaml: empty index.yaml file
Error: Process completed with exit code 1.

And what's worse, although the Helm install step failed, CI still tried to run the e2e test + Helm uninstall!

@eero-t (Collaborator, Author) commented Jun 25, 2025

@lianhao Could you review this?

@lianhao (Collaborator) commented Jun 26, 2025

PR #1132 should make the vLLM CI happy.

@poussa poussa merged commit b2990c4 into opea-project:main Jun 26, 2025
48 of 64 checks passed
@eero-t eero-t deleted the hpa-warmup branch June 26, 2025 08:59