Autoscaling for ChatQnA megaservice #1098
Conversation
Marked as draft: while I've tested the dashboard changes, I've not tested them with the OPEA configMap. I'll do that tomorrow.
Pull Request Overview
This PR addresses performance bottlenecks in ChatQnA by introducing autoscaling support for the megaservice, updating dashboards with corresponding metrics, and removing the unused HPA settings from the AgentQnA chart.
- Added new Prometheus metric panels for tracking megaservice instance counts and latency in the dashboard configmap.
- Introduced an HPA manifest and updated autoscaling values for ChatQnA.
- Removed legacy HPA configuration from AgentQnA.
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| helm-charts/common/dashboard/templates/configmap-metrics.yaml | Added new metric panels for megaservice instances and token latency. |
| helm-charts/chatqna/templates/horizontal-pod-autoscaler.yaml | Defined a new HPA for ChatQnA using updated autoscaling values. |
| helm-charts/chatqna/hpa-values.yaml | Configured autoscaling parameters for ChatQnA including resource requests. |
| helm-charts/agentqna/values.yaml | Removed unused HPA configuration that is no longer required. |
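For orientation, a minimal sketch of what a ChatQnA megaservice HPA could look like. This is an assumption-based illustration, not the PR's actual template; the deployment name, replica bounds, and target metric are all hypothetical:

```yaml
# Hypothetical sketch; the real template in
# helm-charts/chatqna/templates/horizontal-pod-autoscaler.yaml may differ.
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: chatqna
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: chatqna          # megaservice deployment (assumed name)
  minReplicas: 1
  maxReplicas: 4           # assumed upper bound
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80   # assumed threshold
```

CPU utilization targets only work when the scaled pods declare CPU requests, which is presumably why hpa-values.yaml also configures resource requests.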
Comments suppressed due to low confidence (2)
helm-charts/common/dashboard/templates/configmap-metrics.yaml:1714
- [nitpick] Consider using a more descriptive label for this metric (e.g. 'MegaService: token latency count') to reduce potential confusion with any other 'used' metrics in the dashboard.
"legendFormat": "MegaService: used",
helm-charts/chatqna/hpa-values.yaml:25
- [nitpick] Consider explicitly specifying CPU units (for example, '1000m') to ensure clarity and consistency with Kubernetes resource requests standards.
cpu: 1
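The suggested explicit-units form would look like this; the surrounding keys are assumed from the standard Kubernetes resource-request structure, not copied from the PR:

```yaml
resources:
  requests:
    cpu: 1000m    # equivalent to 'cpu: 1', but unambiguous
    memory: 2Gi   # assumed value, for illustration only
```

Kubernetes treats `1` and `1000m` identically; the millicpu form just makes the unit explicit to readers.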
This is related to the HPA rework PR #1090.
@lianhao, @yongfengdu Gaudi CI tests fail due to timeout during vLLM warmup, and ROCM tests are just stuck in pending state. Is there some fix for these?
The pending ROCM tests are probably an issue with their runners; @chensuyue should be able to contact them. For the vLLM warmup issue, warmup time differs between models; sometimes just retriggering the single test will pass the CI check.
We should disable the warmup for CI/CD. Warmup affects performance, which is not the main focus of functional tests.
Signed-off-by: Eero Tamminen <[email protected]>
@yongfengdu, @lianhao, @chensuyue There's still something wrong on the CI side. The Gaudi vLLM ChatQnA test fails to: The Gaudi TGI AgentQnA test fails to: And ROCM tests are all still in pending state?
This indicates some docker container is consuming the Gaudi device that the k8s Gaudi device plugin has no knowledge of.
I tried rerunning the failed test, but it turns out the Gaudi CI node is not available now. Will ping Suyue to figure out why.
lianhao
left a comment
@eero-t The gaudi CI is resumed.
Description
When the rate of requests increases and the backend inference engines get scaled, the megaservice becomes a performance bottleneck (as its query processing is single threaded), so it needs to be scaled too. This is already the case after scaling up to a few Gaudi vLLM instances with the default 8B Llama model.
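Assuming the chart follows the usual Helm pattern of gating the HPA template on a values flag, enabling megaservice autoscaling might look like the fragment below. The key names and numbers are assumptions for illustration, not the PR's actual hpa-values.yaml contents:

```yaml
# Hypothetical hpa-values.yaml fragment; the PR's actual keys may differ.
autoscaling:
  enabled: true     # renders the HPA template
  minReplicas: 1
  maxReplicas: 4    # assumed bound; sized to match backend scaling
resources:
  requests:
    cpu: 1000m      # CPU request needed for utilization-based HPA
```

Such a file would typically be passed at install time with `-f helm-charts/chatqna/hpa-values.yaml` on top of the default values.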
Other changes:
Issues
n/a.
Type of change
(Fixes scaling performance bottleneck.)
Dependencies
n/a.
Tests
Tested manually.