Add basic example of NIM with Run.ai inference #81
Status: Open. Wants to merge 7 commits into `main`.
2 changes: 1 addition & 1 deletion README.md
| | **Open Source Platforms** | |
| | | [KServe](https://github.com/NVIDIA/nim-deploy/tree/main/kserve) | |
| | **Independent Software Vendors** | |
| | | Run.ai (coming soon) | |
| | | [Run.ai](./run.ai/README.md) | |
> **Review comment:** FYI, I've made a lot of changes to this README in a prior draft PR. I'm abandoning that in favor of yours in general... will wait to amend this top-level README until yours merges.

> **Review comment (Collaborator):** Just FYI, I had been hard-linking these URLs to better support the current copy/paste workflow for posting updates to NGC. This really only applies to the KServe README.

| **Cloud Service Provider Deployments** | **Azure** | |
| | | [AKS Managed Kubernetes](https://github.com/NVIDIA/nim-deploy/tree/main/cloud-service-providers/azure/aks) | |
| | | [Azure ML](https://github.com/NVIDIA/nim-deploy/tree/main/cloud-service-providers/azure/azureml) | |
77 changes: 77 additions & 0 deletions run.ai/README.md
# NIMs on Run.ai

Run.ai provides a fast and efficient platform for running AI workloads. It sits on top of a group of Kubernetes clusters and provides a UI, GPU-aware scheduling, container orchestration, node pooling, organizational resource quota management, and more. It gives customers the tools to manage resources across multiple Kubernetes clusters, subdivide them across projects and departments, and automate Kubernetes primitives with its own AI-optimized resources.

## InferenceWorkload

Run.ai provides an [InferenceWorkload](https://docs.run.ai/latest/Researcher/workloads/inference-overview/) resource to help automate inference services like NIMs. It leverages Knative to automate the underlying service and routing of traffic.

Note that InferenceWorkload is an optional add-on for Run.ai. Consult your Run.ai UI portal or cluster administrator to determine which clusters support it.
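One hedged way to check from the command line, assuming you have `kubectl` access to the cluster (the `run.ai` API group name is taken from the example manifests in this repo):

```shell
# Check whether the cluster serves the run.ai API group that backs
# InferenceWorkload; falls back to a message when no cluster is reachable.
if kubectl api-resources --api-group=run.ai 2>/dev/null | grep -qi inferenceworkload; then
  echo "InferenceWorkload is available"
else
  echo "InferenceWorkload not available (or no cluster access)"
fi
```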

### Basic Example

At its core, running NIMs with InferenceWorkload is quite simple. Many customizations are possible, however, such as adding environment variables, PVCs to cache models, health checks, and other configuration that passes through to the pods backing the services. The `examples` directory can evolve over time with more complex deployment examples. The following example is a bare-minimum configuration.

This example, including creating the secret and the InferenceWorkload, can also be deployed through the [UI](https://docs.run.ai/latest/Researcher/workloads/inference-overview/).

**Prerequisites**:
* A Run.ai project (and its corresponding Kubernetes namespace, which is the project name prefixed with `runai-`). You should be able to run `kubectl` commands against the target cluster and namespace.
* An NGC API Key
* `curl` and `jq` for the test script
* A Docker registry secret for `nvcr.io` must exist in your Run.ai project. This can only be created through the UI, via the "Credentials" section: add a new docker-registry credential, set the scope to your project, set the username to `$oauthtoken` and the password to your NGC API key, and set the registry URL to `nvcr.io`. This only has to be done once per scope; Run.ai will detect and use it when needed.

1. Deploy the InferenceWorkload to your current Kubernetes context via Helm, with the working directory set to the directory containing this README, setting the necessary environment variables:

```
% export NAMESPACE=[namespace]
% export NGC_KEY=[ngc key]
% helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY my-llama-1 examples/basic-llama
```

Now, wait for the InferenceWorkload's ksvc to become ready.

```
% kubectl get ksvc basic-llama -o wide --watch
NAME URL LATESTCREATED LATESTREADY READY REASON
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 Unknown RevisionMissing
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 basic-llama-00001 Unknown RevisionMissing
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 basic-llama-00001 Unknown IngressNotConfigured
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 basic-llama-00001 Unknown Uninitialized
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 basic-llama-00001 True
```

2. Query your new inference service

As seen above, you get a new Knative service accessible via hostname-based routing. Pass the hostname from this URL to the test script by setting the `LHOST` environment variable.

```
% export LHOST="basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com"
% ./examples/query-llama.sh
Here's a song about pizza:

**Verse 1**
I'm walkin' down the street, smellin' something sweet
Followin' the aroma to my favorite treat
A slice of heaven in a box, or so I've been told
Gimme that pizza love, and my heart will be gold
```
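The test script (`examples/query-llama.sh`) discovers the served model id from the OpenAI-compatible `/v1/models` endpoint before posting the chat request. A sketch of that parse against a hand-written sample response (illustrative, not captured from a live NIM):

```shell
# Parse the first model id out of a /v1/models response the same way
# query-llama.sh does; RESPONSE here is an illustrative sample.
RESPONSE='{"object":"list","data":[{"id":"meta/llama-3.1-8b-instruct","object":"model"}]}'
echo "$RESPONSE" | jq -r '.data[0].id'
# prints: meta/llama-3.1-8b-instruct
```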

3. Remove the inference service

```
% helm uninstall my-llama-1
release "my-llama-1" uninstalled
```
### PVC Example

The PVC example runs in much the same way. It mounts a PVC into the example NIM container at `/opt/nim/.cache`, where it serves as a model cache, and the claim is configured to be retained between `helm uninstall` and `helm install`, so the model data only needs to be downloaded on first use.

```
% helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY my-llama-pvc examples/basic-llama-pvc

% kubectl get ksvc basic-llama-pvc --watch
```
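Because the PVC template carries the `helm.sh/resource-policy: "keep"` annotation, the claim should survive `helm uninstall`. A hedged way to confirm (requires cluster access; `$NAMESPACE` as set earlier):

```shell
# The nim-cache PVC should still be listed even after the release is removed;
# without cluster access this just prints a fallback message.
kubectl get pvc nim-cache -n "$NAMESPACE" 2>/dev/null || echo "no cluster access (run against your Run.ai cluster)"
```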

### Troubleshooting

Users can troubleshoot workloads by inspecting the underlying resources that are created: there should be deployments, pods, and ksvcs to describe or view logs from.
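A hedged starting point (the namespace and the Knative service label selector are illustrative; adjust them to your project and workload):

```shell
NS="runai-myproject"  # your project namespace (runai-<project>)
# List the Knative service, deployments, and pods behind the workload, then
# tail pod logs; each step degrades to a message without cluster access.
kubectl get ksvc,deployment,pods -n "$NS" 2>/dev/null || echo "cannot list resources (no cluster access?)"
kubectl logs -n "$NS" -l serving.knative.dev/service=basic-llama --tail=50 2>/dev/null || echo "cannot fetch logs (no cluster access?)"
```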
23 changes: 23 additions & 0 deletions run.ai/examples/basic-llama-pvc/.helmignore
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
24 changes: 24 additions & 0 deletions run.ai/examples/basic-llama-pvc/Chart.yaml
apiVersion: v2
name: basic-llama-pvc
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.0.0"
39 changes: 39 additions & 0 deletions run.ai/examples/basic-llama-pvc/templates/inferenceworkload.yaml
apiVersion: run.ai/v2alpha1
kind: InferenceWorkload
metadata:
  name: basic-llama-pvc
  namespace: {{ .Values.namespace }}
spec:
  name:
    value: basic-llama-pvc
  environment:
    items:
      NGC_API_KEY:
        value: SECRET:ngc-secret-pvc,NGC_API_KEY
  gpu:
    value: "1"
  image:
    value: "nvcr.io/nim/meta/llama-3.1-8b-instruct"
  minScale:
    value: 1
  maxScale:
    value: 2
  runAsUid:
    value: 1000
  runAsGid:
    value: 1000
  ports:
    items:
      serving-port:
        value:
          container: 8000
          protocol: http
          serviceType: ServingPort
  pvcs:
    items:
      pvc:
        value:
          claimName: nim-cache
          existingPvc: true
          path: /opt/nim/.cache
          readOnly: false
8 changes: 8 additions & 0 deletions run.ai/examples/basic-llama-pvc/templates/ngc-secret.yaml
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: ngc-secret-pvc
  namespace: {{ .Values.namespace }}
data:
  NGC_API_KEY: {{ .Values.ngcKey | b64enc }}
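Helm's `b64enc` in the template above base64-encodes the key, as Kubernetes `Secret` `data` fields require. The equivalent by hand (the key value here is a placeholder, not a real NGC key):

```shell
# Kubernetes stores Secret data base64-encoded; this mirrors what the
# b64enc template function does with .Values.ngcKey.
printf '%s' "fake-ngc-key" | base64
# prints: ZmFrZS1uZ2Mta2V5
```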
14 changes: 14 additions & 0 deletions run.ai/examples/basic-llama-pvc/templates/pvc.yaml
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nim-cache
  namespace: {{ .Values.namespace }}
  annotations:
    helm.sh/resource-policy: "keep"
spec:
  storageClassName: {{ .Values.storageClassName }}
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 32Gi
8 changes: 8 additions & 0 deletions run.ai/examples/basic-llama-pvc/values.yaml

# These can be edited here locally, but should be overridden like so:
# helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY
namespace: override-with-flag
ngcKey: override-with-flag

## optional to override
storageClassName: standard-rwx
23 changes: 23 additions & 0 deletions run.ai/examples/basic-llama/.helmignore
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
24 changes: 24 additions & 0 deletions run.ai/examples/basic-llama/Chart.yaml
apiVersion: v2
name: basic-llama
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.0.0"
27 changes: 27 additions & 0 deletions run.ai/examples/basic-llama/templates/inferenceworkload.yaml
apiVersion: run.ai/v2alpha1
kind: InferenceWorkload
metadata:
  name: basic-llama
  namespace: {{ .Values.namespace }}
spec:
  name:
    value: basic-llama
  environment:
    items:
      NGC_API_KEY:
        value: SECRET:ngc-secret,NGC_API_KEY
  gpu:
    value: "1"
  image:
    value: "nvcr.io/nim/meta/llama-3.1-8b-instruct"
  minScale:
    value: 1
  maxScale:
    value: 2
  ports:
    items:
      serving-port:
        value:
          container: 8000
          protocol: http
          serviceType: ServingPort
8 changes: 8 additions & 0 deletions run.ai/examples/basic-llama/templates/ngc-secret.yaml
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: ngc-secret
  namespace: {{ .Values.namespace }}
data:
  NGC_API_KEY: {{ .Values.ngcKey | b64enc }}
5 changes: 5 additions & 0 deletions run.ai/examples/basic-llama/values.yaml

# These can be edited here locally, but should be overridden like so:
# helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY
namespace: override-with-flag
ngcKey: override-with-flag
29 changes: 29 additions & 0 deletions run.ai/examples/query-llama.sh
#!/bin/bash

if [[ -z $LHOST ]]; then
  echo "please provide an LHOST env var"
  exit 1
fi

Q="Write a song about pizza"
MODEL=$(curl -s "http://${LHOST}/v1/models" | jq -r '.data[0].id')

curl -s "http://${LHOST}/v1/chat/completions" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "content": "'"${Q}"'",
        "role": "user"
      }
    ],
    "model": "'"${MODEL}"'",
    "max_tokens": 500,
    "top_p": 0.8,
    "temperature": 0.9,
    "seed": '$RANDOM',
    "stream": false,
    "stop": ["hello\n"],
    "frequency_penalty": 1.0
  }' | jq -r '.choices[0].message.content'