
Conversation


@chensuyue chensuyue commented Aug 18, 2025

Description

  1. Align helm charts values with compose yaml
  2. Fix the model-downloader pod's "Operation not permitted" issue: https://github.com/opea-project/GenAIExamples/actions/runs/17060819842/job/48367160723#step:6:381

Issues

#1174

Type of change

List the type of change like below. Please delete options that are not relevant.

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds new functionality)
  • Breaking change (fix or feature that would break existing design and interface)

Dependencies

List any newly introduced 3rd party dependencies, if they exist.

Tests

Describe the tests that you ran to verify your changes.

@poussa poussa requested review from eero-t and poussa and removed request for lianhao and yongfengdu August 18, 2025 06:30

poussa commented Aug 18, 2025

What about the changes in GenAIExamples/AudioQnA/kubernetes/helm ?

@chensuyue (Collaborator, Author)

What about the changes in GenAIExamples/AudioQnA/kubernetes/helm ?

The values yaml will be synced to GenAIExamples after the GenAIInfra code freeze.


eero-t commented Aug 18, 2025

What about the changes in GenAIExamples/AudioQnA/kubernetes/helm ?

Which ones: https://github.com/opea-project/GenAIExamples/commits/main/AudioQnA/kubernetes/helm ?

(I don't see any recent changes / anything that would not be already in GenAIInfra.)

@eero-t eero-t left a comment

Align helm charts values with compose yaml

Based on comments in the corresponding PR opea-project/GenAIExamples#2061 and the linked vLLM ticket, some things have been marked privileged, and some others use VLLM_CPU_OMP_THREADS_BIND: all, because vLLM nowadays tries to do NUMA alignment and the all option disables that.

Whereas the correct way to handle that seems to be adding the CAP_SYS_NICE capability to the container, to allow page migration: https://man.archlinux.org/man/capabilities.7.en#CAP_SYS_NICE ?

(Otherwise latest vLLM will just give a warning about page migration failing: vllm-project/vllm#19241)
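
As an illustration, a minimal sketch of granting that capability in a container securityContext (hypothetical snippet, not taken from the charts; Kubernetes capability names drop the CAP_ prefix):

```yaml
# Hypothetical sketch: grant only CAP_SYS_NICE so vLLM can migrate pages
# for NUMA alignment, instead of marking the whole container privileged.
securityContext:
  privileged: false
  capabilities:
    add: ["SYS_NICE"]
```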

=> I'll file bugs to the other projects for things being set as privileged.

@chensuyue (Collaborator, Author)

=> I'll file bugs to the other projects for things being set as privileged.

This PR aims to align with the docker compose changes; if the config itself is not correct, we can file an issue for GenAIExamples and merge this PR first. What do you think?

@chensuyue
Copy link
Collaborator Author

All CI passes except ROCm; there is no ROCm k8s cluster for testing.

@chensuyue chensuyue changed the title Align helm charts values with compose yaml Align helm charts values with compose yaml & release bug fix Aug 19, 2025
  # Need to run as root until upstream fixed and released.
  securityContext:
-   readOnlyRootFilesystem: true
+   readOnlyRootFilesystem: false
@ftian1 ftian1 (Collaborator) Aug 19, 2025

Is this change mandatory?

@chensuyue chensuyue (Collaborator, Author) Aug 19, 2025

To fix this issue:
[pod/chatqna-1755584254-vllm-7f44887799-jjtsj/model-downloader] chmod: /data/models--meta-llama--Meta-Llama-3-8B-Instruct: Operation not permitted
Without this update the test is not able to execute chmod for the data path.

Member

Why is this an issue now? It was not before.

Collaborator

To fix this issue: [pod/chatqna-1755584254-vllm-7f44887799-jjtsj/model-downloader] chmod: /data/models--meta-llama--Meta-Llama-3-8B-Instruct: Operation not permitted. Without this update the test is not able to execute chmod for the data path.

That's clearly the wrong thing to do. This is the security context for the vLLM container itself, and that should not be modifying anything model related. All model related updates are done by the downloader init container:
https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/vllm/templates/deployment.yaml#L33

And that initContainer has a hard-coded security context (not one coming from the values file).

Additionally, models are on a separate volume from the root file system, and the init container has the necessary capabilities to chmod etc. the model files there, in case extra (vLLM) writes are necessary with some of the models.
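
A simplified sketch of the structure being described (illustrative only, not the actual deployment.yaml; the capability list is an assumption):

```yaml
initContainers:
  - name: model-downloader
    # securityContext is hard-coded in the template, not taken from values.yaml
    securityContext:
      capabilities:
        add: ["CHOWN", "FOWNER", "DAC_OVERRIDE"]  # assumed: enough for chmod/chown
containers:
  - name: vllm
    # securityContext is rendered from values.yaml; this is where
    # readOnlyRootFilesystem belongs
    securityContext:
      {{- toYaml .Values.securityContext | nindent 8 }}
```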

Collaborator

Potential reasons why things might fail now:

  • vLLM container is configured with a different user/group => user/group should be updated (see the sketch after this list)
  • vLLM is configured to read model data from a different path => path should be fixed
  • vLLM needs to download additional files => the init container downloader should be asked to download those too, or if this is due to a too-old downloader, the HF downloader image should be updated
  • vLLM now writes extra files to some other path => redirect that to a suitable path, or mount something appropriate there
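
For the first bullet, a hypothetical values.yaml override (the key names depend on the chart; fsGroup is a pod-level setting):

```yaml
# Hypothetical sketch: align the vLLM container's user/group with the
# owner of the downloaded model files.
podSecurityContext:
  fsGroup: 1000        # pod-level: mounted volumes become group-accessible
securityContext:
  runAsUser: 1000      # container-level UID
  runAsGroup: 1000     # container-level GID
```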

@chensuyue chensuyue (Collaborator, Author) Aug 19, 2025

Several tests block on this line:

chmod -R g+w /data/models--{{ replace "/" "--" .Values.LLM_MODEL_ID }};

If the code change here is not correct, we need to find a proper way to make this work.
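
For example, with LLM_MODEL_ID set to meta-llama/Meta-Llama-3-8B-Instruct (the model from the failing log above), that line renders to:

```sh
chmod -R g+w /data/models--meta-llama--Meta-Llama-3-8B-Instruct
```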

Collaborator Author

You will see that only 2 of the chatqna vllm related tests failed with this issue; I don't know why.

And if you search the helm charts files, there are 30+ readOnlyRootFilesystem: false settings; please also check whether those are reasonable.
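
For reference, one way to list those occurrences from a GenAIInfra checkout:

```sh
grep -rn "readOnlyRootFilesystem: false" helm-charts/
```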


Collaborator

And if you search the helm charts files, there are 30+ readOnlyRootFilesystem: false settings; please also check whether those are reasonable.

Ouch. That's a clear regression from when they were last fixed by Lianhao, see: #815 (comment)

@eero-t eero-t (Collaborator) Aug 19, 2025

What permission should the model path have? I gave all the models chmod -R 777, but still got this issue. What should the user/group be? Should it be the user deploying the test, or root?

Looking at the error log: https://github.com/opea-project/GenAIExamples/actions/runs/17060819842/job/48367160723#step:6:381

The error is from the init container. It can download data, but cannot change access rights for the downloaded data:
[pod/chatqna-1755584254-vllm-7f44887799-jjtsj/model-downloader] chmod: /data/models--meta-llama--Meta-Llama-3-8B-Instruct: Operation not permitted

With the chmod -R g+w /data/models--$LLM_MODEL_ID command in: https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/vllm/templates/deployment.yaml#L60

Although the (hard-coded) initContainer securityContext should have all the necessary capabilities to do that (chmod requires either owning the file or the CAP_FOWNER capability): https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/vllm/templates/deployment.yaml#L38

as it has been working earlier...

InitContainer's /data path is at root of model-volume volume: https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/vllm/templates/deployment.yaml#L108

Which according to the error log is in:

   model-volume:
    Type:          HostPath (bare host directory volume)
    Path:          /data2/hf_model
    HostPathType:  Directory

=> @chensuyue please provide the output of ls -la /data2/hf_model for all the Gaudi hosts where CI could currently run these pods.

(Do those host directory access rights differ from what was used on CI Gaudi hosts earlier?)
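
Something like the following would answer both questions (GNU coreutils; path as shown in the pod description above):

```sh
ls -la /data2/hf_model
stat -c '%U:%G %a' /data2/hf_model   # owner:group and octal mode of the directory
```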

Collaborator Author

=> @chensuyue please provide the output of ls -la /data2/hf_model for all the Gaudi hosts where CI could currently run these pods.

(Do those host directory access rights differ from what was used on CI Gaudi hosts earlier?)

I have given the current data folder the most lenient permissions. I didn't apply any special settings for those data paths earlier besides applying chmod 777; maybe the cloud team did.
[screenshots omitted]


eero-t commented Aug 19, 2025

This PR aims to align with the docker compose changes; if the config itself is not correct, we can file an issue for GenAIExamples and merge this PR first. What do you think?

Unless this PR is blocking the OPEA v1.4 release, I would rather not propagate incorrect fixes.


poussa commented Aug 19, 2025

This PR aims to align with the docker compose changes; if the config itself is not correct, we can file an issue for GenAIExamples and merge this PR first. What do you think?

Unless this PR is blocking the OPEA v1.4 release, I would rather not propagate incorrect fixes.

I totally agree.

@poussa poussa (Member) left a comment

We need another way to solve the issue, or drop this PR.


chensuyue commented Aug 19, 2025

This PR aims to align with the docker compose changes; if the config itself is not correct, we can file an issue for GenAIExamples and merge this PR first. What do you think?

Unless this PR is blocking the OPEA v1.4 release, I would rather not propagate incorrect fixes.

Yes, those issues block the v1.4 release. I am not the right person to fix them. Feel free to close this PR and fix all the issues in another one. They need to be fixed ASAP!
Gaudi issues found in release test: https://github.com/opea-project/GenAIExamples/actions/runs/17060819842
Xeon issue found in release test: https://github.com/opea-project/GenAIExamples/actions/runs/17062358312

  • AgentQnA, gaudi
  • ChatQnA, gaudi
  • CodeTrans, gaudi
  • ChatQnA, xeon


chensuyue commented Aug 19, 2025

Unless this PR is blocking the OPEA v1.4 release, I would rather not propagate incorrect fixes.

As for the VLLM_CPU_OMP_THREADS_BIND: all alignment change: if it doesn't block the tests, you can drop it for this release.
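
For context, a hypothetical compose-style snippet showing the alignment in question (service name is illustrative):

```yaml
services:
  vllm:
    environment:
      # per the discussion above, "all" disables vLLM's NUMA thread binding
      VLLM_CPU_OMP_THREADS_BIND: all
```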

@chensuyue chensuyue added this to the v1.4 milestone Aug 20, 2025
@chensuyue chensuyue closed this Aug 20, 2025