
Conversation


@chensuyue chensuyue commented Aug 18, 2025

Description

  1. Align helm charts values with compose yaml
  2. Fix the model-downloader pod's "Operation not permitted" issue: https://github.com/opea-project/GenAIExamples/actions/runs/17060819842/job/48367160723#step:6:381

Issues

#1174

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)

Dependencies

List any newly introduced 3rd party dependencies, if they exist.

Tests

Describe the tests that you ran to verify your changes.

@poussa poussa requested review from eero-t and poussa and removed request for lianhao and yongfengdu August 18, 2025 06:30

poussa commented Aug 18, 2025

What about the changes in GenAIExamples/AudioQnA/kubernetes/helm ?

@chensuyue (Collaborator, Author)

What about the changes in GenAIExamples/AudioQnA/kubernetes/helm ?

The values yaml will be synced to GenAIExamples after the GenAIInfra code freeze.


eero-t commented Aug 18, 2025

What about the changes in GenAIExamples/AudioQnA/kubernetes/helm ?

Which ones: https://github.com/opea-project/GenAIExamples/commits/main/AudioQnA/kubernetes/helm ?

(I don't see any recent changes / anything that would not be already in GenAIInfra.)

@eero-t eero-t left a comment

Align helm charts values with compose yaml

Based on comments in the corresponding PR opea-project/GenAIExamples#2061 and the linked vLLM ticket, some things have been marked privileged, and some others use VLLM_CPU_OMP_THREADS_BIND: all, because vLLM nowadays tries to do NUMA alignment and the all option disables that.

Whereas the correct way to handle that seems to be adding the CAP_SYS_NICE capability to the container, to allow page migration: https://man.archlinux.org/man/capabilities.7.en#CAP_SYS_NICE ?

(Otherwise latest vLLM will just give a warning about page migration failing: vllm-project/vllm#19241)
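
As an illustration, a minimal sketch of granting that capability in a container securityContext (hypothetical snippet, not taken from the charts; Kubernetes capability names drop the CAP_ prefix):

```yaml
# Hypothetical sketch: grant only CAP_SYS_NICE so vLLM can migrate pages
# for NUMA alignment, instead of marking the whole container privileged.
securityContext:
  privileged: false
  capabilities:
    add: ["SYS_NICE"]
```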

=> I'll file bugs to the other projects for things being set as privileged.

@chensuyue (Collaborator, Author)

=> I'll file bugs to the other projects for things being set as privileged.

This PR aims to align with the docker compose changes; if the config itself is not correct, we can file an issue for GenAIExamples and merge this PR first. What do you think?

@chensuyue
Copy link
Collaborator Author

All CI passes except ROCm; there is no ROCm k8s cluster for testing.

@chensuyue chensuyue changed the title Align helm charts values with compose yaml Align helm charts values with compose yaml & release bug fix Aug 19, 2025
  # Need to run as root until upstream fixed and released.
  securityContext:
-   readOnlyRootFilesystem: true
+   readOnlyRootFilesystem: false
@ftian1 ftian1 (Collaborator) Aug 19, 2025

Is this change mandatory?

@chensuyue chensuyue (Collaborator, Author) Aug 19, 2025

To fix this issue:
[pod/chatqna-1755584254-vllm-7f44887799-jjtsj/model-downloader] chmod: /data/models--meta-llama--Meta-Llama-3-8B-Instruct: Operation not permitted
Without this update the test is not able to execute chmod for the data path.

Member

Why is this an issue now? It was not before.

Collaborator

To fix this issue: [pod/chatqna-1755584254-vllm-7f44887799-jjtsj/model-downloader] chmod: /data/models--meta-llama--Meta-Llama-3-8B-Instruct: Operation not permitted. Without this update the test is not able to execute chmod for the data path.

That's clearly the wrong thing to do. This is the security context for the vLLM container itself, and that should not be modifying anything model related. All model related updates are done by the downloader init container:
https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/vllm/templates/deployment.yaml#L33

And that initContainer has a hard-coded security context (not one coming from the values file).

Additionally, models are on a separate volume from the root file system, and the init container has the necessary capabilities to chmod etc. the model files there, in case extra (vLLM) writes are necessary with some of the models.
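
A simplified sketch of the structure being described (illustrative only, not the actual deployment.yaml; the capability list is an assumption):

```yaml
initContainers:
  - name: model-downloader
    # securityContext is hard-coded in the template, not taken from values.yaml
    securityContext:
      capabilities:
        add: ["CHOWN", "FOWNER", "DAC_OVERRIDE"]  # assumed: enough for chmod/chown
containers:
  - name: vllm
    # securityContext is rendered from values.yaml; this is where
    # readOnlyRootFilesystem belongs
    securityContext:
      {{- toYaml .Values.securityContext | nindent 8 }}
```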

Collaborator

Potential reasons why things might fail now:

  • vLLM container is configured with a different user/group => user/group should be updated (see the sketch after this list)
  • vLLM is configured to read model data from a different path => path should be fixed
  • vLLM needs to download additional files => the init container downloader should be asked to download those too, or if this is due to a too-old downloader, the HF downloader image should be updated
  • vLLM now writes extra files to some other path => redirect that to a suitable path, or mount something appropriate there
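
For the first bullet, a hypothetical values.yaml override (the key names depend on the chart; fsGroup is a pod-level setting):

```yaml
# Hypothetical sketch: align the vLLM container's user/group with the
# owner of the downloaded model files.
podSecurityContext:
  fsGroup: 1000        # pod-level: mounted volumes become group-accessible
securityContext:
  runAsUser: 1000      # container-level UID
  runAsGroup: 1000     # container-level GID
```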

@chensuyue chensuyue (Collaborator, Author) Aug 19, 2025

Several tests block on this line:

chmod -R g+w /data/models--{{ replace "/" "--" .Values.LLM_MODEL_ID }};

If the code change here is not correct, we need to find a proper way to make this work.
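
For example, with LLM_MODEL_ID set to meta-llama/Meta-Llama-3-8B-Instruct (the model from the failing log above), that line renders to:

```sh
chmod -R g+w /data/models--meta-llama--Meta-Llama-3-8B-Instruct
```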

Collaborator Author

You will see that only 2 of the chatqna vllm related tests failed with this issue; I don't know why.

And if you search the helm charts files, there are 30+ readOnlyRootFilesystem: false settings; please also check whether those are reasonable.
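
For reference, one way to list those occurrences from a GenAIInfra checkout:

```sh
grep -rn "readOnlyRootFilesystem: false" helm-charts/
```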


Collaborator

And if you search the helm charts files, there are 30+ readOnlyRootFilesystem: false settings; please also check whether those are reasonable.

Ouch. That's a clear regression from when they were last fixed by Lianhao, see: #815 (comment)

@eero-t eero-t (Collaborator) Aug 19, 2025

What permission should the model path have? I gave all the models chmod -R 777, but still got this issue. What should the user/group be? Should it be the user deploying the test, or root?

Looking at the error log: https://github.com/opea-project/GenAIExamples/actions/runs/17060819842/job/48367160723#step:6:381

The error is from the init container. It can download data, but cannot change access rights for the downloaded data:
[pod/chatqna-1755584254-vllm-7f44887799-jjtsj/model-downloader] chmod: /data/models--meta-llama--Meta-Llama-3-8B-Instruct: Operation not permitted

With the chmod -R g+w /data/models--$LLM_MODEL_ID command in: https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/vllm/templates/deployment.yaml#L60

Although the (hard-coded) initContainer securityContext should have all the necessary capabilities to do that (chmod requires either owning the file or the CAP_FOWNER capability): https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/vllm/templates/deployment.yaml#L38

as it has been working earlier...

InitContainer's /data path is at root of model-volume volume: https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/vllm/templates/deployment.yaml#L108

Which according to the error log is in:

   model-volume:
    Type:          HostPath (bare host directory volume)
    Path:          /data2/hf_model
    HostPathType:  Directory

=> @chensuyue please provide the output of ls -la /data2/hf_model for all the Gaudi hosts where CI could currently run these pods.

(Do those host directory access rights differ from what was used on CI Gaudi hosts earlier?)
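
Something like the following would answer both questions (GNU coreutils; path as shown in the pod description above):

```sh
ls -la /data2/hf_model
stat -c '%U:%G %a' /data2/hf_model   # owner:group and octal mode of the directory
```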

Collaborator Author

=> @chensuyue please provide the output of ls -la /data2/hf_model for all the Gaudi hosts where CI could currently run these pods.

(Do those host directory access rights differ from what was used on CI Gaudi hosts earlier?)

I have given the current data folder the most lenient permissions. I didn't apply any special settings for those data paths earlier besides applying chmod 777; maybe the cloud team did.
[screenshots omitted]


eero-t commented Aug 19, 2025

This PR aims to align with the docker compose changes; if the config itself is not correct, we can file an issue for GenAIExamples and merge this PR first. What do you think?

Unless this PR is blocking the OPEA v1.4 release, I would rather not propagate incorrect fixes.


poussa commented Aug 19, 2025

This PR aims to align with the docker compose changes; if the config itself is not correct, we can file an issue for GenAIExamples and merge this PR first. What do you think?

Unless this PR is blocking the OPEA v1.4 release, I would rather not propagate incorrect fixes.

I totally agree.

@poussa poussa (Member) left a comment

We need another way to solve the issue, or drop this PR.


chensuyue commented Aug 19, 2025

This PR aims to align with the docker compose changes; if the config itself is not correct, we can file an issue for GenAIExamples and merge this PR first. What do you think?

Unless this PR is blocking the OPEA v1.4 release, I would rather not propagate incorrect fixes.

Yes, those issues block the v1.4 release. I am not the right person to fix them. Feel free to close this PR and fix all the issues in another one. They need to be fixed ASAP!
Gaudi issues found in release test: https://github.com/opea-project/GenAIExamples/actions/runs/17060819842
Xeon issue found in release test: https://github.com/opea-project/GenAIExamples/actions/runs/17062358312

  • AgentQnA, gaudi
  • ChatQnA, gaudi
  • CodeTrans, gaudi
  • ChatQnA, xeon


chensuyue commented Aug 19, 2025

Unless this PR is blocking the OPEA v1.4 release, I would rather not propagate incorrect fixes.

As for the VLLM_CPU_OMP_THREADS_BIND: all alignment change: if it doesn't block the tests, you can drop it for this release.
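
For context, a hypothetical compose-style snippet showing the alignment in question (service name is illustrative):

```yaml
services:
  vllm:
    environment:
      # per the discussion above, "all" disables vLLM's NUMA thread binding
      VLLM_CPU_OMP_THREADS_BIND: all
```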

@chensuyue chensuyue added this to the v1.4 milestone Aug 20, 2025
@chensuyue chensuyue closed this Aug 20, 2025