Conversation

@eero-t
Collaborator

@eero-t eero-t commented Jun 5, 2025

Description

When the rate of requests increases and the backend inference engines are scaled out, the megaservice becomes a performance bottleneck (its query processing is single-threaded), so it needs to be scaled too. This is already the case after scaling to a few Gaudi vLLM instances with the default 8B Llama model.
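
For illustration, a minimal sketch of the kind of HorizontalPodAutoscaler this enables for the megaservice (the names, replica bounds, and CPU target below are assumptions for the example, not the chart's actual template):

    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    metadata:
      name: chatqna                  # hypothetical name; the chart templates the real one
    spec:
      scaleTargetRef:
        apiVersion: apps/v1
        kind: Deployment
        name: chatqna                # the megaservice deployment to scale
      minReplicas: 1
      maxReplicas: 4                 # illustrative upper bound
      metrics:
        - type: Resource
          resource:
            name: cpu
            target:
              type: Utilization
              averageUtilization: 70 # scale out when average CPU utilization exceeds 70%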

Other changes:

  • Add Megaservice scaling info to OPEA application dashboard
  • Remove unused HPA setting from AgentQnA

Issues

n/a.

Type of change

  • New feature (non-breaking change which adds new functionality)

(Fixes scaling performance bottleneck.)

Dependencies

n/a.

Tests

Tested manually.

@eero-t eero-t requested review from lianhao and yongfengdu as code owners June 5, 2025 20:06
@eero-t eero-t marked this pull request as draft June 5, 2025 20:07
@eero-t
Collaborator Author

eero-t commented Jun 5, 2025

Marked as draft because, while I've tested the dashboard changes, I've not tested them with the OPEA configMap. I'll do that tomorrow.

@eero-t eero-t marked this pull request as ready for review June 10, 2025 14:20
@eero-t eero-t requested a review from Copilot June 10, 2025 14:22

Copilot AI left a comment

Pull Request Overview

This PR addresses performance bottlenecks in ChatQnA by introducing autoscaling support for the megaservice, updating dashboards with corresponding metrics, and removing the unused HPA settings from the AgentQnA chart.

  • Added new Prometheus metric panels for tracking megaservice instance counts and latency in the dashboard configmap.
  • Introduced an HPA manifest and updated autoscaling values for ChatQnA.
  • Removed legacy HPA configuration from AgentQnA.

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

File Description

  • helm-charts/common/dashboard/templates/configmap-metrics.yaml: Added new metric panels for megaservice instances and token latency.
  • helm-charts/chatqna/templates/horizontal-pod-autoscaler.yaml: Defined a new HPA for ChatQnA using updated autoscaling values.
  • helm-charts/chatqna/hpa-values.yaml: Configured autoscaling parameters for ChatQnA, including resource requests.
  • helm-charts/agentqna/values.yaml: Removed unused HPA configuration that is no longer required.
Comments suppressed due to low confidence (2)

helm-charts/common/dashboard/templates/configmap-metrics.yaml:1714

  • [nitpick] Consider using a more descriptive label for this metric (e.g. 'MegaService: token latency count') to reduce potential confusion with any other 'used' metrics in the dashboard.
"legendFormat": "MegaService: used",

helm-charts/chatqna/hpa-values.yaml:25

  • [nitpick] Consider explicitly specifying CPU units (for example, '1000m') to ensure clarity and consistency with Kubernetes resource requests standards.
cpu: 1
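
Following that suggestion, a hedged example of how the request could be written in millicores (the surrounding keys are assumptions, matching the usual Kubernetes resources block rather than the chart's exact layout):

    resources:
      requests:
        cpu: 1000m   # equivalent to "cpu: 1", but the unit is explicit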

@eero-t eero-t requested a review from poussa June 10, 2025 14:24
@eero-t
Collaborator Author

eero-t commented Jun 10, 2025

This is related to the HPA rework PR #1090.

@eero-t
Collaborator Author

eero-t commented Jun 10, 2025

@lianhao, @yongfengdu Gaudi CI tests fail due to a timeout during vLLM warmup, and the ROCM tests are just stuck in the pending state. Is there a fix for these?

@yongfengdu
Collaborator

@lianhao, @yongfengdu Gaudi CI tests fail due to a timeout during vLLM warmup, and the ROCM tests are just stuck in the pending state. Is there a fix for these?

The ROCM tests being stuck in pending is likely something wrong with their runners; @chensuyue should be able to contact them.

For the vLLM warmup issue, the warmup time differs per model; sometimes just retriggering the single test is enough to pass the CI check.
I've discussed with @lianhao disabling the warmup with VLLM_SKIP_WARMUP: "true" for all workloads' CI tests; the concern is inconsistency with the compose deployment and the user's environment.
If we don't do that, the only option is to extend the timeout (https://github.com/opea-project/GenAIInfra/blob/main/.github/workflows/_helm-e2e.yaml#L126); it has already been increased from 600 seconds to 900 seconds for vLLM warmup.
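
If warmup were disabled for CI, a minimal sketch of the corresponding container environment entry (the exact Helm value key that exposes this in the GenAIInfra charts may differ; this is just the plain Kubernetes form):

    env:
      - name: VLLM_SKIP_WARMUP   # vLLM on Gaudi option discussed above
        value: "true"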

@eero-t eero-t requested a review from marquiz June 11, 2025 12:21
@poussa
Member

poussa commented Jun 13, 2025

For the vLLM warmup issue, the warmup time differs per model; sometimes just retriggering the single test is enough to pass the CI check. I've discussed with @lianhao disabling the warmup with VLLM_SKIP_WARMUP: "true" for all workloads' CI tests; the concern is inconsistency with the compose deployment and the user's environment.

We should disable the warmup for the CI/CD. The warmup affects performance, which is not the main focus of functional tests.

@yongfengdu
Collaborator

We should disable the warmup for the CI/CD. The warmup affects performance, which is not the main focus of functional tests.

#1126

@eero-t
Collaborator Author

eero-t commented Jun 23, 2025

@yongfengdu, @lianhao, @chensuyue There's still something wrong on the CI side.

The Gaudi vLLM ChatQnA test fails with:
[pod/chatqna23100548-vllm-75b86d4d99-7l8vx/vllm] RuntimeError: synStatus=8 [Device not found] Device acquire failed.

The Gaudi TGI AgentQnA test fails with:

[pod/agentqna23095455-tgi-579879b8d4-rbgxb/tgi] 2025-06-23T10:05:05.048254Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
+ exit 1

And the ROCM tests are all still stuck in the pending state?

@lianhao
Collaborator

lianhao commented Jun 24, 2025

@yongfengdu, @lianhao, @chensuyue There's still something wrong on the CI side.

The Gaudi vLLM ChatQnA test fails with: [pod/chatqna23100548-vllm-75b86d4d99-7l8vx/vllm] RuntimeError: synStatus=8 [Device not found] Device acquire failed.

This indicates that some Docker container is consuming the Gaudi device without the k8s Gaudi device plugin's knowledge.
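
For context, the device plugin only accounts for devices that pods explicitly request through its extended resource; a minimal sketch of such a request (the resource name matches the Gaudi device plugin, the rest of the spec is illustrative):

    resources:
      limits:
        habana.ai/gaudi: 1   # allocated and tracked by the k8s device plugin

A container started directly through Docker with the device mapped in bypasses this accounting, so a scheduled pod can then fail to acquire the device.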

The Gaudi TGI AgentQnA test fails with:

[pod/agentqna23095455-tgi-579879b8d4-rbgxb/tgi] 2025-06-23T10:05:05.048254Z  INFO shard-manager: text_generation_launcher: Waiting for shard to be ready... rank=0
+ exit 1

And the ROCM tests are all still stuck in the pending state?

I tried rerunning the failed test, but it turns out that the Gaudi CI node is not available now. I will ping Suyue to figure out why.

Collaborator

@lianhao lianhao left a comment

@eero-t The Gaudi CI has resumed.

@poussa poussa merged commit 4a0f386 into opea-project:main Jun 26, 2025
70 of 99 checks passed
@eero-t eero-t deleted the cpu-scale branch July 2, 2025 17:56