Align helm charts values with compose yaml & release bug fix #1189
Conversation
Signed-off-by: chensuyue <[email protected]>
What about the changes in
Values yaml sync to
Which ones: https://github.com/opea-project/GenAIExamples/commits/main/AudioQnA/kubernetes/helm ? (I don't see any recent changes / anything that would not already be in GenAIInfra.)
eero-t left a comment
Align helm charts values with compose yaml
Based on comments in the corresponding PR opea-project/GenAIExamples#2061 and the linked vLLM ticket, some things have been marked privileged, and some others use VLLM_CPU_OMP_THREADS_BIND: all because vLLM nowadays tries to do NUMA alignment and the all option disables that.
Whereas the correct way to handle that seems to be adding the CAP_SYS_NICE capability to the container to allow page migration: https://man.archlinux.org/man/capabilities.7.en#CAP_SYS_NICE ?
(Otherwise latest vLLM will just give a warning about page migration failing: vllm-project/vllm#19241)
=> I'll file bugs to the other projects for things being set as privileged.
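For reference, a minimal sketch of what granting just that capability (instead of running the container privileged) could look like through a chart's securityContext values; the exact key layout used by the vLLM chart is an assumption here:

```yaml
# Hypothetical values override, not the chart's actual defaults:
# grant only CAP_SYS_NICE so vLLM can migrate pages for NUMA alignment,
# instead of setting privileged: true.
securityContext:
  privileged: false
  allowPrivilegeEscalation: false
  capabilities:
    drop:
      - ALL
    add:
      - SYS_NICE
```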
This PR aims to align with the docker compose changes; if the config itself is not correct we can file an issue for GenAIExamples and merge this PR first. What do you think?
Signed-off-by: chensuyue <[email protected]>
Signed-off-by: chensuyue <[email protected]>
for more information, see https://pre-commit.ci
All CI passes except ROCm; there is no ROCm k8s cluster for testing.
```diff
 # Need to run as root until upstream fixed and released.
 securityContext:
-  readOnlyRootFilesystem: true
+  readOnlyRootFilesystem: false
```
is this change mandatory?
To fix this issue:
[pod/chatqna-1755584254-vllm-7f44887799-jjtsj/model-downloader] chmod: /data/models--meta-llama--Meta-Llama-3-8B-Instruct: Operation not permitted
Without this update the test is not able to execute chmod for the data path.
Why is this an issue now? It was not before.
> To fix this issue:
> [pod/chatqna-1755584254-vllm-7f44887799-jjtsj/model-downloader] chmod: /data/models--meta-llama--Meta-Llama-3-8B-Instruct: Operation not permitted
> Without this update the test is not able to execute chmod for the data path.
That's clearly a wrong thing to do. This is the security context for the vLLM container itself, and that should not be modifying anything model related. All model related updates are done by the downloader init container:
https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/vllm/templates/deployment.yaml#L33
And that initContainer has a hard-coded security context (not one coming from the values file).
Additionally, models are on a separate volume from the root file system, and the init container has the necessary capabilities to chmod etc. the model files there, in case extra (vLLM) writes may be necessary with some of the models.
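To illustrate what such a hard-coded init container security context typically looks like (the capability list below is an assumption for illustration, not a copy of the linked deployment.yaml):

```yaml
# Hypothetical sketch of a downloader init container with its own fixed
# securityContext; values-file overrides do not apply to it.
initContainers:
  - name: model-downloader
    securityContext:
      readOnlyRootFilesystem: true
      allowPrivilegeEscalation: false
      capabilities:
        drop:
          - ALL
        add:
          - DAC_OVERRIDE  # access files regardless of their permission bits
          - CHOWN         # change ownership of downloaded files
          - FOWNER        # chmod files it does not own
```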
Potential reasons why things might fail now:
- vLLM container is configured with a different user/group => user/group should be updated
- vLLM is configured to read model data from a different path => path should be fixed
- vLLM needs to download additional files => the init container downloader should be asked to download those too, or if this is due to a too old downloader, the HF downloader image should be updated
- vLLM now writes extra files to some other path => redirect that to a suitable path, or mount something appropriate there (a sketch follows below)
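For the last case, here is a sketch of one way to keep readOnlyRootFilesystem enabled while still giving vLLM a writable scratch path; the volume name and mount path are illustrative only, not taken from the chart:

```yaml
# Hypothetical deployment fragment: root filesystem stays read-only,
# extra writes are redirected to an ephemeral emptyDir volume.
containers:
  - name: vllm
    securityContext:
      readOnlyRootFilesystem: true
    volumeMounts:
      - name: scratch
        mountPath: /tmp
volumes:
  - name: scratch
    emptyDir: {}
```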
Several tests block on this line:
`chmod -R g+w /data/models--{{ replace "/" "--" .Values.LLM_MODEL_ID }};`
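For the model used in the failing CI run, LLM_MODEL_ID is meta-llama/Meta-Llama-3-8B-Instruct, so this template line presumably renders to `chmod -R g+w /data/models--meta-llama--Meta-Llama-3-8B-Instruct;`, which is exactly the path in the Operation not permitted error quoted above.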
And if you search the helm chart files, there are 30+ readOnlyRootFilesystem: false settings; please also check whether those are reasonable.
Ouch. That's a clear regression from when they were last fixed by Lianhao, see: #815 (comment)
What permission should the model path have? I gave all the models chmod -R 777, but still got this issue. What should the user/group be? Should it be the user deploying the test, or root?
Looking at the error log: https://github.com/opea-project/GenAIExamples/actions/runs/17060819842/job/48367160723#step:6:381
The error is for the init container. It can download data, but cannot change access rights for the downloaded data:
[pod/chatqna-1755584254-vllm-7f44887799-jjtsj/model-downloader] chmod: /data/models--meta-llama--Meta-Llama-3-8B-Instruct: Operation not permitted
With the chmod -R g+w /data/models--$LLM_MODEL_ID command in: https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/vllm/templates/deployment.yaml#L60
Although the (hard-coded) initContainer securityContext should have all the necessary capabilities to do that: https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/vllm/templates/deployment.yaml#L38
as it has been working earlier...
The initContainer's /data path is at the root of the model-volume volume: https://github.com/opea-project/GenAIInfra/blob/main/helm-charts/common/vllm/templates/deployment.yaml#L108
Which according to the error log is in:
```
model-volume:
  Type:          HostPath (bare host directory volume)
  Path:          /data2/hf_model
  HostPathType:  Directory
```
=> @chensuyue please provide the output of ls -la /data2/hf_model for all the Gaudi hosts where CI could currently run these pods.
(Do those host directory access rights differ from what was used on CI Gaudi hosts earlier?)
> => @chensuyue please provide the output of ls -la /data2/hf_model for all the Gaudi hosts where CI could currently run these pods. (Do those host directory access rights differ from what was used on CI Gaudi hosts earlier?)
I have given the current data folder the most lenient permissions. I didn't apply any special settings for those data paths earlier besides chmod 777; maybe the cloud team did.



Unless this is a blocking (OPEA v1.4 release) PR, I would rather not propagate incorrect fixes.
I totally agree.
poussa left a comment
We need another way to solve the issue, or drop this PR.
Yes, those issues block the v1.4 release. I am not the right person to fix those issues. Feel free to close this PR and fix all the issues with another one. They need to be fixed ASAP! AgentQnA, gaudi
For this

Description
Issues
#1174
Type of change
List the type of change like below. Please delete options that are not relevant.
Dependencies
List the newly introduced 3rd party dependencies, if any.
Tests
Describe the tests that you ran to verify your changes.