Rework HPA scaling metrics and their configuration #1090
Pull Request Overview
This PR reworks the autoscaling metrics and their configuration for multiple inference services by standardizing threshold parameters and updating HPA templates and queries.
- Updated Helm chart values to include new autoscaling parameters (activeRequestsTarget, queueSizeTarget) for each service (see the sketch after this list).
- Modified HPA templates to reference new metric names and threshold values.
- Enhanced documentation to detail autoscaling metric considerations and principles.
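As a rough sketch of what those per-service thresholds could look like in a service's values.yaml (key names other than `autoscaling.queueSizeTarget.cpu` and all numeric values here are illustrative assumptions, not the chart's actual defaults):

```yaml
# Hypothetical values.yaml fragment for one inference service; only the
# autoscaling.queueSizeTarget.cpu path appears verbatim in this PR.
autoscaling:
  enabled: true
  minReplicas: 1
  maxReplicas: 4
  # TGI / TEI / teirerank style: target number of queued requests per replica
  queueSizeTarget:
    cpu: 2        # assumed CPU threshold
    accel: 10     # assumed accelerator (e.g. Gaudi) threshold
  # vLLM style: target number of active (waiting + running) requests per replica
  activeRequestsTarget:
    cpu: 4
    accel: 120
```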
Reviewed Changes
Copilot reviewed 12 out of 12 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| helm-charts/common/vllm/values.yaml | Added new autoscaling flag and activeRequestsTarget thresholds for vLLM. |
| helm-charts/common/vllm/templates/horizontal-pod-autoscaler.yaml | Updated metric name and threshold retrieval for active requests in vLLM. |
| helm-charts/common/tgi/values.yaml | Introduced queueSizeTarget thresholds for TGI. |
| helm-charts/common/tgi/templates/horizontal-pod-autoscaler.yaml | Revised metric type and threshold usage for TGI autoscaling logic. |
| helm-charts/common/teirerank/values.yaml and templates | Similar changes applied to teirerank as in TGI. |
| helm-charts/common/tei/values.yaml and templates | Similar changes applied to tei as in TGI and teirerank. |
| helm-charts/common/dashboard/templates/configmap-metrics.yaml | Updated Prometheus query for TGI dashboards. |
| helm-charts/chatqna/templates/custom-metrics-configmap.yaml | Aligned custom metrics queries with new templating for autoscaling metrics. |
| helm-charts/chatqna/hpa-values.yaml | Updated autoscaling configuration and added vLLM-specific warmup override variable. |
| helm-charts/HPA.md | Expanded autoscaling documentation with detailed scaling metric considerations. |
Comments suppressed due to low confidence (2)
helm-charts/chatqna/hpa-values.yaml:29
- [nitpick] Consider aligning the naming convention of the 'VLLM_SKIP_WARMUP' variable with other autoscaling configuration variables to improve consistency.
VLLM_SKIP_WARMUP: "true"
helm-charts/common/tgi/templates/horizontal-pod-autoscaler.yaml:34
- The CPU branch now uses 'AverageValue' instead of the previously used 'Value' type. Please confirm that this change reflects the intended autoscaling behavior for CPU-based scaling.
averageValue: {{ .Values.autoscaling.queueSizeTarget.cpu }}
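For context on the `AverageValue` question above, here is a minimal sketch of an autoscaling/v2 Pods-type metric entry with an `AverageValue` target, assuming a custom metric named `tgi_queue_size` exposed via prometheus-adapter (the metric name and the Pods-type structure are assumptions; only the `.Values.autoscaling.queueSizeTarget.cpu` reference comes from the PR):

```yaml
# Sketch only: for Pods-type metrics, the per-pod average of the metric is
# compared against averageValue (AverageValue is the only target type that
# Pods metrics support).
metrics:
- type: Pods
  pods:
    metric:
      name: tgi_queue_size            # assumed custom metric name
    target:
      type: AverageValue
      averageValue: {{ .Values.autoscaling.queueSizeTarget.cpu }}
```

This is consistent with the commit note further below about dropping `Value` support and leaving only the queue size `AverageValue`.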
@lianhao, @yongfengdu a dozen CI tests fail due to issues unrelated to this PR. Any ideas? Most app-level CI tests fail with: And vLLM-level tests fail with similar types of errors, both for Gaudi: And the CPU one:
Signed-off-by: Eero Tamminen <[email protected]>
* Use .Series, .GroupBy and .LabelMatchers to simplify rules (see the rule sketch after this commit message)
* Drop the request latency metric for TGI/TEI. Because it depends on
the number of generated tokens, it is unsuitable as a generic metric
- With that, support for the HPA Value type could also be dropped
(leaving only the queue size AverageValue)
* Because the vLLM mean token latency metric does not react much
to vLLM load, and for consistency with TGI/TEI, switch vLLM
to be scaled based on queue size as well
- KubeAI also scales vLLM based on queue size
* Add queue size target Helm variables for all inferencing engines
Signed-off-by: Eero Tamminen <[email protected]>
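As referenced in the first bullet above, a hedged example of how prometheus-adapter's `.Series`, `.LabelMatchers` and `.GroupBy` template variables can simplify a custom-metric rule; the series name and label filters are assumptions, not the rules actually shipped in this PR:

```yaml
# Illustrative prometheus-adapter rule, not the one in this PR's configmap.
rules:
- seriesQuery: 'tgi_queue_size{namespace!="",pod!=""}'   # assumed metric/labels
  resources:
    overrides:
      namespace: {resource: "namespace"}
      pod: {resource: "pod"}
  # .Series, .LabelMatchers and .GroupBy are filled in by prometheus-adapter,
  # so the same query template works for any matched series and API query.
  metricsQuery: 'avg(<<.Series>>{<<.LabelMatchers>>}) by (<<.GroupBy>>)'
```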
Signed-off-by: Eero Tamminen <[email protected]>
The TGI and TEI queue size metric covers only requests that are waiting to be processed, but that number can fluctuate a lot, as it's non-zero only when the pod is fully utilized and needs to buffer requests. On the plus side, it's agnostic to how fast an engine instance can process queries for a given model. vLLM also provides a gauge metric for how many requests are currently being processed (running). Adding that to the waiting requests count (queue size) makes the resulting metric much more stable, and allows scaling up extra replicas before the current ones are full. KubeAI autoscaling is also based on the number of active requests, so results will be more comparable. However, this means that the suitable threshold will be model and engine config specific (depending on how large request batches the HW can run in parallel). Signed-off-by: Eero Tamminen <[email protected]>
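As a hedged sketch of how the waiting and running requests could be combined into one "active requests" metric for vLLM, using vLLM's `vllm:num_requests_waiting` and `vllm:num_requests_running` gauges (the actual query in this PR may differ):

```yaml
# Illustrative prometheus-adapter metricsQuery: per-pod sum of waiting +
# running requests; not necessarily the exact query used by this PR.
metricsQuery: >-
  sum(vllm:num_requests_waiting{<<.LabelMatchers>>}
  + vllm:num_requests_running{<<.LabelMatchers>>}) by (<<.GroupBy>>)
```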
Signed-off-by: Eero Tamminen <[email protected]>
Signed-off-by: Eero Tamminen <[email protected]>
marquiz
left a comment
Mostly proofreading from my side. Cannot comment on the specific helm parameters or metrics.
Some suggestions but can be merged even as is
And use the same formatting for all notes. Signed-off-by: Eero Tamminen <[email protected]>
marquiz
left a comment
Thank you @eero-t. LGTM from me, FWIW
@chensuyue, @lianhao, @yongfengdu the earlier listed CI issues are gone, but there are still these ones... ROCM CI tests remain in a pending state until CI gives up. The ChatQnA Gaudi vLLM test just exits with error code 1, without the log showing any errors. FaqGen Gaudi TGI test: The Qdrant test fails with: And what's worse, although the Helm install step failed, CI still tried to run the e2e test + Helm uninstall!
@lianhao Could you review this?
PR #1132 should make the vllm CI happy.
Description
- Use `.Series`, `.GroupBy` and `.LabelMatchers` to simplify Prometheus-adapter custom metric rules

Issues
n/a.

Type of change
List the type of change like below. Please delete options that are not relevant.
Nothing should break, but scaling will change somewhat.
Dependencies
n/a.

Tests
Manual testing of Gaudi scaling and the thresholds for it.
Changes also affect CPU scaling, but the threshold variable values for it were not tested. It would be good if somebody could test those too.
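For anyone who wants to try the CPU thresholds, a hypothetical override file could look like the following; the top-level service keys and the numbers are guesses, only the `autoscaling.queueSizeTarget.cpu` path appears verbatim in this PR:

```yaml
# cpu-hpa-test-values.yaml - hypothetical, untested CPU scaling overrides.
tgi:
  autoscaling:
    enabled: true
    queueSizeTarget:
      cpu: 2            # guessed threshold, not a validated default
vllm:
  autoscaling:
    enabled: true
    activeRequestsTarget:
      cpu: 4            # guessed threshold, not a validated default
```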