Conversation

@chyundunovDatamonsters
Contributor

Description

Adds files for deploying the application on ROCm vLLM and ROCm TGI with Helm.

Issues

Type of change

List the type of change like below. Please delete options that are not relevant.

  • [*] New feature (non-breaking change which adds new functionality)

Dependencies

Tests

Collaborator

@lianhao lianhao left a comment

Please also update valuesfiles.yaml, which is used to sync the Helm values files from GenAIInfra to GenAIExamples.

@lianhao
Collaborator

lianhao commented Apr 7, 2025

@chensuyue @yongfengdu I think CI for AMD ROCm should be added in GenAIInfra too. Do you know how to do that?

@chensuyue chensuyue added this to the v1.4 milestone Apr 10, 2025
@chensuyue
Collaborator

@chensuyue @yongfengdu I think CI for AMD ROCm should be added in GenAIInfra too. Do you know how to do that?

Yes, we need it. I will talk with the AMD team about adding the test machine to OPEA CI. We also need to modify the CI workflow to accommodate the ROCm tests.

@yongfengdu
Collaborator

Could you rebase the PR with latest changes and address the comments?
Unless there is a specific reason to keep it, please remove the "tag: cpu-1.5" override so that the default defined in values.yaml, which is cpu-1.6, is used.
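For readers unfamiliar with why removing the override is enough: Helm layers user-supplied values on top of the chart's values.yaml, merging nested maps key by key, so a key that the override file omits keeps its chart default. The following is a minimal sketch of that behavior, not Helm's actual code. The `image.repository` and `image.tag` key names, the `example/server` repository, and the helper names `mergeValues` and `tagOf` are hypothetical; the thread does not say which image the tag belongs to. Helm's real merge also treats `null` specially, which this sketch ignores.

```go
package main

import "fmt"

// mergeValues returns a new map holding defaults with overrides applied on top.
// Nested maps are merged key by key; any other value is replaced outright.
// This is a simplified model of Helm's values merging, not Helm's code.
func mergeValues(defaults, overrides map[string]any) map[string]any {
	out := make(map[string]any, len(defaults))
	for k, v := range defaults {
		out[k] = v
	}
	for k, v := range overrides {
		if ov, ok := v.(map[string]any); ok {
			if dv, ok := out[k].(map[string]any); ok {
				out[k] = mergeValues(dv, ov)
				continue
			}
		}
		out[k] = v
	}
	return out
}

// tagOf reads the hypothetical image.tag key, returning "" if it is absent.
func tagOf(values map[string]any) string {
	image, _ := values["image"].(map[string]any)
	tag, _ := image["tag"].(string)
	return tag
}

func main() {
	// Hypothetical stand-in for the chart's values.yaml.
	defaults := map[string]any{
		"image": map[string]any{"repository": "example/server", "tag": "cpu-1.6"},
	}
	// Override file that pins the older tag.
	pinned := map[string]any{"image": map[string]any{"tag": "cpu-1.5"}}
	// Override file with the tag line removed.
	unpinned := map[string]any{"image": map[string]any{}}

	fmt.Println(tagOf(mergeValues(defaults, pinned)))   // cpu-1.5
	fmt.Println(tagOf(mergeValues(defaults, unpinned))) // cpu-1.6
}
```

With the pin removed, the override file no longer has to be edited each time the chart default moves forward.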

…sing AMD GPU using Helm

Signed-off-by: Chingis Yundunov <[email protected]>
Collaborator

@lianhao lianhao left a comment

Besides the embedded comment, please also fix the rebase conflict.

@lianhao lianhao added the rocm label Apr 25, 2025
…sing AMD GPU using Helm

Signed-off-by: Chingis Yundunov <[email protected]>
@lianhao
Collaborator

lianhao commented Apr 25, 2025

@chyundunovDatamonsters please do a manual rebase locally and fix the following conflict:

image

Also, your manual local rebase should fix the CI failure too.

…sing AMD GPU using Helm

Signed-off-by: Chingis Yundunov <[email protected]>
…o feature/ChatQnA_k8s

# Conflicts:
#	helm-charts/chatqna/README.md
…sing AMD GPU using Helm

Signed-off-by: Chingis Yundunov <[email protected]>
@lianhao
Collaborator

lianhao commented Apr 28, 2025

It seems the k8s-rocm K8s cluster has some issues: all running pods are automatically stopped and killed. @chyundunovDatamonsters please check the K8s cluster to make sure there is no node-level resource pressure (e.g. CPU, memory, or disk).

Collaborator

@lianhao lianhao left a comment

Please fix the following yaml syntax error:

Error: parse error at (tgi/templates/deployment.yaml:101): unexpected "/" in operand
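Helm charts are rendered with Go's text/template package, and this message comes from its parser rather than from a YAML parser. The thread does not show the offending line, so the following is only a sketch of one way to produce the same message: a bare `/` placed between two operands inside a `{{ ... }}` action. The template strings and the helper names `parseErr` and `isOperandError` are hypothetical.

```go
package main

import (
	"fmt"
	"strings"
	"text/template"
)

// parseErr parses src as a Go template and returns the parse error, if any.
func parseErr(name, src string) error {
	_, err := template.New(name).Parse(src)
	return err
}

// isOperandError reports whether err is the "/" operand error quoted above.
func isOperandError(err error) bool {
	return err != nil && strings.Contains(err.Error(), `unexpected "/" in operand`)
}

func main() {
	// Hypothetical broken line: the slash sits inside the action, where the
	// parser expects another operand or the closing delimiter.
	broken := `image: {{ .Values.image.repository / .Values.image.tag }}`
	// One possible fix: close the action first so the slash is literal text.
	fixed := `image: {{ .Values.image.repository }}/{{ .Values.image.tag }}`

	fmt.Println(isOperandError(parseErr("broken", broken)), parseErr("broken", broken))
	fmt.Println(isOperandError(parseErr("fixed", fixed)), parseErr("fixed", fixed))
}
```

Running `helm template` or `helm lint` on the chart locally surfaces this class of error before CI does, since both parse the templates.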

@chyundunovDatamonsters
Contributor Author

The Gaudi tests fail. Please pay attention to this problem.

…sing AMD GPU using Helm

Signed-off-by: Chingis Yundunov <[email protected]>
@lianhao
Collaborator

lianhao commented May 16, 2025

The Gaudi tests fail. Please pay attention to this problem.

@chyundunovDatamonsters The Gaudi test environment should be fine now.

Please pay attention to my embedded comment above. Thanks!

…sing AMD GPU using Helm

Signed-off-by: Chingis Yundunov <[email protected]>
@chyundunovDatamonsters
Contributor Author

@chyundunovDatamonsters please do a manual rebase locally and fix the following conflict:

image

Also, your manual local rebase should fix the CI failure too.

Fixed

Collaborator

@lianhao lianhao left a comment

All seems ok except for my last unresolved comment. @chyundunovDatamonsters

…sing AMD GPU using Helm

Signed-off-by: Chingis Yundunov <[email protected]>
@lianhao lianhao merged commit 4fdc4bb into opea-project:main May 16, 2025
29 checks passed
eero-t added a commit to eero-t/GenAIInfra that referenced this pull request May 16, 2025
@eero-t eero-t mentioned this pull request May 16, 2025
1 task
yongfengdu pushed a commit that referenced this pull request May 19, 2025
eero-t added a commit to eero-t/GenAIInfra that referenced this pull request May 21, 2025
Introduced also by opea-project#949, and update first README clause to indicate
that some of the subservices are conditional.

Signed-off-by: Eero Tamminen <[email protected]>
@eero-t eero-t mentioned this pull request May 21, 2025
1 task
eero-t added a commit to eero-t/GenAIInfra that referenced this pull request May 21, 2025
Introduced also by opea-project#949, and update first README clause to indicate
that some of the subservices are conditional.

Signed-off-by: Eero Tamminen <[email protected]>
poussa pushed a commit that referenced this pull request May 22, 2025
Introduced also by #949, and update first README clause to indicate
that some of the subservices are conditional.

Signed-off-by: Eero Tamminen <[email protected]>
@eero-t eero-t mentioned this pull request Jun 27, 2025
1 task
4 participants