Add basic example of NIM with Run.ai inference #81

Open · wants to merge 7 commits into base: main
2 changes: 1 addition & 1 deletion README.md
@@ -14,7 +14,7 @@ This repo showcases different ways NVIDIA NIMs can be deployed. This repo contai
| | **Open Source Platforms** | |
| | | [KServe](https://github.com/NVIDIA/nim-deploy/tree/main/kserve) | |
| | **Independent Software Vendors** | |
| | | Run.ai (coming soon) | |
| | | [Run.ai](./docs/run.ai/README.md) | |
| **Cloud Service Provider Deployments** | **Azure** | |
| | | [AKS Managed Kubernetes](https://github.com/NVIDIA/nim-deploy/tree/main/cloud-service-providers/azure/aks) | |
| | | [Azure ML](https://github.com/NVIDIA/nim-deploy/tree/main/cloud-service-providers/azure/azureml) | |
94 changes: 94 additions & 0 deletions docs/run.ai/README.md
@@ -0,0 +1,94 @@
# NIMs on Run.ai

[Run.ai](https://www.run.ai/) provides a platform for accelerating AI development, delivering life-cycle support for AI workloads from concept to deployment. It layers on top of Kubernetes, starting with a single cluster and extending to centralized multi-cluster management, and provides a UI, GPU-aware scheduling, container orchestration, node pooling, organizational resource-quota management, and more. It gives administrators, researchers, and developers tools to manage resources across multiple Kubernetes clusters, subdivide them across projects and departments, and automate Kubernetes primitives with its own AI-optimized resources.

## Run.ai Deployment Options

The Run:ai Control Plane is available as a [hosted service](https://docs.run.ai/latest/home/components/#runai-control-plane-on-the-cloud) or as a [self-hosted](https://docs.run.ai/latest/home/components/#self-hosted-control-plane) option (including in disconnected "air-gapped" environments). In either case, the control plane can manage clusters equipped with the Run:ai cluster engine, whether local or hosted remotely in the cloud.

## Prerequisites

1. A conformant Kubernetes cluster ([RunAI K8s version requirements](https://docs.run.ai/latest/admin/overview-administrator/))
2. RunAI Control Plane and cluster(s) [installed](https://docs.run.ai/latest/admin/runai-setup/cluster-setup/cluster-install/) and operational
3. [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) installed
4. General NIM requirements: [NIM Prerequisites](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#prerequisites)
5. An NVIDIA AI Enterprise (NVAIE) License: [Sign up for NVAIE license](https://build.nvidia.com/meta/llama-3-8b-instruct?snippet_tab=Docker&signin=true&integrate_nim=true&self_hosted_api=true) or [Request a Free 90-Day NVAIE License](https://enterpriseproductregistration.nvidia.com/?LicType=EVAL&ProductFamily=NVAIEnterprise) through the NVIDIA Developer Program.
6. An NVIDIA NGC API Key: please follow the guidance in the [NVIDIA NIM Getting Started](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#option-2-from-ngc) documentation to generate a properly scoped API key if you haven't already. A quick verification sketch follows this list.
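
A quick sanity check of a few of these prerequisites from a machine with cluster access might look like the following. This is only a sketch: the `gpu-operator` namespace and the `NGC_API_KEY` variable name are assumptions based on common defaults, so adjust them to your environment.

```
% # Confirm the GPU Operator pods are running (namespace may differ in your install)
% kubectl get pods -n gpu-operator
% # Confirm the NGC API key can authenticate against nvcr.io
% echo "$NGC_API_KEY" | docker login nvcr.io --username '$oauthtoken' --password-stdin
```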

@resker commented on Aug 30, 2024:
Prerequisites matrix (Required: ✔️, Provided: ✅):

| Prerequisite | Run.ai SaaS | Air-gapped | NVIDIA DGX Cloud |
| --- | --- | --- | --- |
| A conformant Kubernetes cluster ([Run.ai K8s version requirements](https://docs.run.ai/latest/admin/overview-administrator/)) | ✔️ | ✔️ | |
| Run.ai Control Plane and cluster(s) installed and operational | ✔️ | ✔️ | |
| Knative Serving installed and configured for the Run.ai scheduler | ✔️ | ✔️ | |
| NVIDIA GPU Operator installed | ✔️ | ✔️ | |
| General NIM requirements: [NIM Prerequisites](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#prerequisites) | ✔️ | ✔️ | ✔️ |
| An NVIDIA AI Enterprise (NVAIE) License: Sign up for NVAIE license or Request a Free 90-Day NVAIE License through the NVIDIA Developer Program | ✔️ | ✔️ | ✔️ |
| An NVIDIA NGC API Key: follow the guidance in the NVIDIA NIM Getting Started documentation to generate a properly scoped API key if you haven't already | ✔️ | ✔️ | ✔️ |

## InferenceWorkload

Run.ai provides an [InferenceWorkload](https://docs.run.ai/latest/Researcher/workloads/inference-overview/) resource to help automate inference services like NIMs. It leverages [Knative](https://github.com/knative) to automate the underlying service and routing of traffic. YAML examples can be found [here](https://docs.run.ai/latest/developer/cluster-api/submit-yaml/#inference-workload-example).

Note that InferenceWorkload is an optional Run.ai add-on. Consult your Run.ai UI portal or cluster administrator to determine which clusters support InferenceWorkload.
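
One quick way to check from the command line whether the add-on is available on a cluster is to look for the InferenceWorkload API resource. A minimal sketch, assuming `kubectl` access to the target cluster (the `run.ai` API group is taken from the example manifests below):

```
% kubectl api-resources --api-group=run.ai | grep -i inferenceworkload
```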

### Basic Example

At its core, running NIMs with InferenceWorkload is quite simple. However, many customizations are possible, such as adding environment variables, PVCs for model caching, health checks, and other configurations that pass through to the pods backing the services. The `examples` directory can evolve over time with more complex deployment examples; the following example is a bare-minimum configuration.

This example can also be deployed through the [UI](https://docs.run.ai/latest/Researcher/workloads/inference-overview/), including creating the secret and the InferenceWorkload.

**Preparation**:
* A Run.ai project (and its corresponding Kubernetes namespace, which is the project name prefixed with `runai-`). You should be set up to run `kubectl` commands against the target cluster and namespace; a quick setup sketch follows this list.
* An NGC API Key
* `curl` and `jq` for the test script
* A Docker registry secret for `nvcr.io` must exist in your Run.ai project. This can only be created through the UI, in the "Credentials" section: add a new docker-registry credential, scope it to your project, set the username to `$oauthtoken` and the password to your NGC API key, and set the registry URL to `nvcr.io`. This only has to be done once per scope, and Run.ai will detect and use it when needed.
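
As a sketch of the setup above, assuming a hypothetical Run.ai project named `myproject` (so the namespace is `runai-myproject`); the exact name of the secret Run.ai creates for the registry credential may vary:

```
% kubectl config set-context --current --namespace=runai-myproject
% kubectl get secrets | grep nvcr     # the docker-registry credential created via the UI
```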

1. Deploy the InferenceWorkload to your current Kubernetes context via Helm, with the working directory set to the directory containing this README, after setting the necessary environment variables:

```
% export NAMESPACE=[namespace]
% export NGC_KEY=[ngc key]
% helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY my-llama-1 examples/basic-llama
```

Now, wait for the InferenceWorkload's ksvc to become ready.

```
% kubectl get ksvc basic-llama -o wide --watch
NAME URL LATESTCREATED LATESTREADY READY REASON
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 Unknown RevisionMissing
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 basic-llama-00001 Unknown RevisionMissing
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 basic-llama-00001 Unknown IngressNotConfigured
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 basic-llama-00001 Unknown Uninitialized
basic-llama http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com basic-llama-00001 basic-llama-00001 True
```

2. Query your new inference service

As seen above, you get a new Knative service accessible via hostname-based routing. Pass the hostname from this URL to the test script by setting the `LHOST` environment variable.

```
% export LHOST="basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com"
% ./examples/query-llama.sh
Here's a song about pizza:

**Verse 1**
I'm walkin' down the street, smellin' something sweet
Followin' the aroma to my favorite treat
A slice of heaven in a box, or so I've been told
Gimme that pizza love, and my heart will be gold
```
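
The helper script simply wraps the NIM's OpenAI-compatible API. If you prefer to query it directly, a minimal sketch using the same `LHOST` follows; the model id is whatever the first call reports, shown in the request below only for illustration:

```
% curl -s "http://${LHOST}/v1/models" | jq -r '.data[0].id'    # e.g. meta/llama-3.1-8b-instruct
% curl -s "http://${LHOST}/v1/chat/completions" \
    -H "Content-Type: application/json" \
    -d '{"model": "meta/llama-3.1-8b-instruct",
         "messages": [{"role": "user", "content": "Say hello in one sentence"}],
         "max_tokens": 64}' | jq -r '.choices[0].message.content'
```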

3. Remove inference service

```
% helm uninstall my-llama-1
release "my-llama-1" uninstalled
```
### PVC Example

The PVC example runs in much the same way. It mounts a PVC into the example NIM container at `/opt/nim/.cache`, where it serves as a model cache, and the PVC is configured to be retained across `helm uninstall` and `install`, so the model data only needs to be downloaded on first use.

```
% helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY my-llama-pvc examples/basic-llama-pvc

% kubectl get ksvc basic-llama-pvc --watch
```
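
Because the PVC carries the `helm.sh/resource-policy: "keep"` annotation, it (and the cached model) survives `helm uninstall`. A quick way to confirm this, assuming the release name and namespace from above:

```
% helm uninstall my-llama-pvc
% kubectl get pvc nim-cache          # still present after uninstall
% helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY my-llama-pvc examples/basic-llama-pvc
```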

### Troubleshooting

Users can troubleshoot workloads by inspecting the underlying resources that are created; there should be deployments, pods, and ksvcs to describe or view logs from.
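
For example, for the basic example above (this assumes the InferenceWorkload CRD is installed; resource names other than the ksvc are illustrative, so use whatever `kubectl get` reports in your namespace):

```
% kubectl get inferenceworkload,ksvc,deployments,pods
% kubectl describe ksvc basic-llama
% kubectl logs deploy/basic-llama-00001-deployment   # hypothetical Knative revision deployment name
```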

## Air-gapped operations

For scenarios in which Run:ai clusters are operating in air-gapped (disconnected) environments, please see NVIDIA NIM documentation for [serving models from local assets](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#serving-models-from-local-assets).
23 changes: 23 additions & 0 deletions docs/run.ai/examples/basic-llama-pvc/.helmignore
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
24 changes: 24 additions & 0 deletions docs/run.ai/examples/basic-llama-pvc/Chart.yaml
@@ -0,0 +1,24 @@
apiVersion: v2
name: basic-llama-pvc
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.0.0"
39 changes: 39 additions & 0 deletions docs/run.ai/examples/basic-llama-pvc/templates/inferenceworkload.yaml
@@ -0,0 +1,39 @@
apiVersion: run.ai/v2alpha1
kind: InferenceWorkload
metadata:
  name: basic-llama-pvc
  namespace: {{ .Values.namespace }}
spec:
  name:
    value: basic-llama-pvc
  environment:
    items:
      NGC_API_KEY:
        value: SECRET:ngc-secret-pvc,NGC_API_KEY
  gpu:
    value: "1"
  image:
    value: "nvcr.io/nim/meta/llama-3.1-8b-instruct"
  minScale:
    value: 1
  maxScale:
    value: 2
  runAsUid:
    value: 1000
  runAsGid:
    value: 1000
  ports:
    items:
      serving-port:
        value:
          container: 8000
          protocol: http
          serviceType: ServingPort
  pvcs:
    items:
      pvc:
        value:
          claimName: nim-cache
          existingPvc: true
          path: /opt/nim/.cache
          readOnly: false
8 changes: 8 additions & 0 deletions docs/run.ai/examples/basic-llama-pvc/templates/ngc-secret.yaml
@@ -0,0 +1,8 @@
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: ngc-secret-pvc
  namespace: {{ .Values.namespace }}
data:
  NGC_API_KEY: {{ .Values.ngcKey | b64enc }}
14 changes: 14 additions & 0 deletions docs/run.ai/examples/basic-llama-pvc/templates/pvc.yaml
@@ -0,0 +1,14 @@
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: nim-cache
  namespace: {{ .Values.namespace }}
  annotations:
    helm.sh/resource-policy: "keep"
spec:
  storageClassName: {{ .Values.storageClassName }}
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      storage: 32Gi
8 changes: 8 additions & 0 deletions docs/run.ai/examples/basic-llama-pvc/values.yaml
@@ -0,0 +1,8 @@

# These can be edited here locally, but should be overridden like so:
# helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY
namespace: override-with-flag
ngcKey: override-with-flag

## optional to override
storageClassName: standard-rwx
23 changes: 23 additions & 0 deletions docs/run.ai/examples/basic-llama/.helmignore
@@ -0,0 +1,23 @@
# Patterns to ignore when building packages.
# This supports shell glob matching, relative path matching, and
# negation (prefixed with !). Only one pattern per line.
.DS_Store
# Common VCS dirs
.git/
.gitignore
.bzr/
.bzrignore
.hg/
.hgignore
.svn/
# Common backup files
*.swp
*.bak
*.tmp
*.orig
*~
# Various IDEs
.project
.idea/
*.tmproj
.vscode/
24 changes: 24 additions & 0 deletions docs/run.ai/examples/basic-llama/Chart.yaml
@@ -0,0 +1,24 @@
apiVersion: v2
name: basic-llama
description: A Helm chart for Kubernetes

# A chart can be either an 'application' or a 'library' chart.
#
# Application charts are a collection of templates that can be packaged into versioned archives
# to be deployed.
#
# Library charts provide useful utilities or functions for the chart developer. They're included as
# a dependency of application charts to inject those utilities and functions into the rendering
# pipeline. Library charts do not define any templates and therefore cannot be deployed.
type: application

# This is the chart version. This version number should be incremented each time you make changes
# to the chart and its templates, including the app version.
# Versions are expected to follow Semantic Versioning (https://semver.org/)
version: 0.1.0

# This is the version number of the application being deployed. This version number should be
# incremented each time you make changes to the application. Versions are not expected to
# follow Semantic Versioning. They should reflect the version the application is using.
# It is recommended to use it with quotes.
appVersion: "1.0.0"
27 changes: 27 additions & 0 deletions docs/run.ai/examples/basic-llama/templates/inferenceworkload.yaml
@@ -0,0 +1,27 @@
apiVersion: run.ai/v2alpha1
kind: InferenceWorkload
metadata:
  name: basic-llama
  namespace: {{ .Values.namespace }}
spec:
  name:
    value: basic-llama
  environment:
    items:
      NGC_API_KEY:
        value: SECRET:ngc-secret,NGC_API_KEY
  gpu:
    value: "1"
  image:
    value: "nvcr.io/nim/meta/llama-3.1-8b-instruct"
  minScale:
    value: 1
  maxScale:
    value: 2
  ports:
    items:
      serving-port:
        value:
          container: 8000
          protocol: http
          serviceType: ServingPort
8 changes: 8 additions & 0 deletions docs/run.ai/examples/basic-llama/templates/ngc-secret.yaml
@@ -0,0 +1,8 @@
apiVersion: v1
kind: Secret
type: Opaque
metadata:
  name: ngc-secret
  namespace: {{ .Values.namespace }}
data:
  NGC_API_KEY: {{ .Values.ngcKey | b64enc }}
5 changes: 5 additions & 0 deletions docs/run.ai/examples/basic-llama/values.yaml
@@ -0,0 +1,5 @@

# These can be edited here locally, but should be overridden like so:
# helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY
namespace: override-with-flag
ngcKey: override-with-flag
29 changes: 29 additions & 0 deletions docs/run.ai/examples/query-llama.sh
@@ -0,0 +1,29 @@
#!/bin/bash

if [[ -z $LHOST ]]; then
  echo "please provide an LHOST env var"
  exit 1
fi

Q="Write a song about pizza"
MODEL=$(curl -s "http://${LHOST}/v1/models" | jq -r '.data[0]|.id')

curl -s "http://${LHOST}/v1/chat/completions" \
  -H "Accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {
        "content": "'"${Q}"'",
        "role": "user"
      }
    ],
    "model": "'"${MODEL}"'",
    "max_tokens": 500,
    "top_p": 0.8,
    "temperature": 0.9,
    "seed": '$RANDOM',
    "stream": false,
    "stop": ["hello\n"],
    "frequency_penalty": 1.0
  }' | jq -r '.choices[0]|.message.content'