diff --git a/README.md b/README.md
index 55488fd..aa72d59 100644
--- a/README.md
+++ b/README.md
@@ -14,7 +14,7 @@ This repo showcases different ways NVIDIA NIMs can be deployed. This repo contai
 | | **Open Source Platforms** | |
 | | | [KServe](https://github.com/NVIDIA/nim-deploy/tree/main/kserve) | |
 | | **Independent Software Vendors** | |
-| | | Run.ai (coming soon) | |
+| | | [Run.ai](./docs/run.ai/README.md) | |
 | **Cloud Service Provider Deployments** | **Azure** | |
 | | | [AKS Managed Kubernetes](https://github.com/NVIDIA/nim-deploy/tree/main/cloud-service-providers/azure/aks) | |
 | | | [Azure ML](https://github.com/NVIDIA/nim-deploy/tree/main/cloud-service-providers/azure/azureml) | |
diff --git a/docs/run.ai/README.md b/docs/run.ai/README.md
new file mode 100644
index 0000000..b188ace
--- /dev/null
+++ b/docs/run.ai/README.md
@@ -0,0 +1,94 @@
+# NIMs on Run.ai
+
+[Run.ai](https://www.run.ai/) provides a platform for accelerating AI development, delivering life-cycle support from concept to deployment of AI workloads. It layers on top of Kubernetes, starting with a single cluster and extending to centralized multi-cluster management. It provides a UI, GPU-aware scheduling, container orchestration, node pooling, organizational resource quota management, and more. It also offers administrators, researchers, and developers tools to manage resources across multiple Kubernetes clusters and subdivide them across projects and departments, and it automates Kubernetes primitives with its own AI-optimized resources.
+
+## Run.ai Deployment Options
+
+The Run:ai Control Plane is available as a [hosted service](https://docs.run.ai/latest/home/components/#runai-control-plane-on-the-cloud) or as a [self-hosted](https://docs.run.ai/latest/home/components/#self-hosted-control-plane) option (including in disconnected "air-gapped" environments). In either case, the control plane can manage clusters equipped with the Run:ai "cluster engine", whether they are local or hosted remotely in the cloud.
+
+## Prerequisites
+
+1. A conformant Kubernetes cluster ([Run.ai K8s version requirements](https://docs.run.ai/latest/admin/overview-administrator/))
+2. Run.ai Control Plane and cluster(s) [installed](https://docs.run.ai/latest/admin/runai-setup/cluster-setup/cluster-install/) and operational
+3. [NVIDIA GPU Operator](https://github.com/NVIDIA/gpu-operator) installed
+4. General NIM requirements: [NIM Prerequisites](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#prerequisites)
+5. An NVIDIA AI Enterprise (NVAIE) License: [Sign up for NVAIE license](https://build.nvidia.com/meta/llama-3-8b-instruct?snippet_tab=Docker&signin=true&integrate_nim=true&self_hosted_api=true) or [Request a Free 90-Day NVAIE License](https://enterpriseproductregistration.nvidia.com/?LicType=EVAL&ProductFamily=NVAIEnterprise) through the NVIDIA Developer Program.
+6. An NVIDIA NGC API Key: please follow the guidance in the [NVIDIA NIM Getting Started](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#option-2-from-ngc) documentation to generate a properly scoped API key if you haven't already.
+
+## InferenceWorkload
+
+Run.ai provides an [InferenceWorkload](https://docs.run.ai/latest/Researcher/workloads/inference-overview/) resource to help automate inference services like NIMs. It leverages [Knative](https://github.com/knative) to automate the underlying service and routing of traffic. YAML examples can be found [here](https://docs.run.ai/latest/developer/cluster-api/submit-yaml/#inference-workload-example).
+
+Note that InferenceWorkload is an optional add-on for Run.ai. Consult your Run.ai UI portal or cluster administrator to determine which clusters support InferenceWorkload.
+
+### Basic Example
+
+At the core, running NIMs with InferenceWorkload is quite simple. However, many customizations are possible, such as adding variables, PVCs to cache models, health checks, and other special configurations that will pass through to the pods backing the services. The `examples` directory can evolve over time with more complex deployment examples. The following example is a bare-minimum configuration.
+
+This example can also be deployed through the [UI](https://docs.run.ai/latest/Researcher/workloads/inference-overview/), including creating the secret and InferenceWorkload.
+
+**Preparation**:
+* A Run.ai project (and corresponding Kubernetes namespace, which is the project name prefixed with `runai-`). You should be set up to run `kubectl` commands against the target cluster and namespace.
+* An NGC API Key
+* `curl` and `jq` for the test script
+* A Docker registry secret for `nvcr.io` needs to exist in your Run.ai project. This can only be created through the UI, via the "credentials" section. Add a new docker-registry credential, set the scope to your project, set the username to `$oauthtoken` and the password to your NGC API key, and set the registry URL to `nvcr.io`. This only has to be done once per scope, and Run.ai will detect and use it when it is needed.
+
+1. Deploy the InferenceWorkload to your current Kubernetes context via Helm, with the working directory set to the directory containing this README, after setting the necessary environment variables.
+
+```
+% export NAMESPACE=[namespace]
+% export NGC_KEY=[ngc key]
+% helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY my-llama-1 examples/basic-llama
+```
+
+Now, wait for the InferenceWorkload's ksvc to become ready.
+
+```
+% kubectl get ksvc basic-llama -o wide --watch
+NAME          URL                                                                         LATESTCREATED       LATESTREADY         READY     REASON
+basic-llama   http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com   basic-llama-00001                       Unknown   RevisionMissing
+basic-llama   http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com   basic-llama-00001   basic-llama-00001   Unknown   RevisionMissing
+basic-llama   http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com   basic-llama-00001   basic-llama-00001   Unknown   IngressNotConfigured
+basic-llama   http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com   basic-llama-00001   basic-llama-00001   Unknown   Uninitialized
+basic-llama   http://basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com   basic-llama-00001   basic-llama-00001   True
+```
+
+2. Query your new inference service
+
+As seen above, you will get a new Knative service accessible via hostname-based routing. Pass the hostname from this URL to the test script by setting the `LHOST` environment variable.
+
+```
+% export LHOST="basic-llama.runai-myproject.inference.12345678.dgxc.ngc.nvidia.com"
+% ./examples/query-llama.sh
+Here's a song about pizza:
+
+**Verse 1**
+I'm walkin' down the street, smellin' something sweet
+Followin' the aroma to my favorite treat
+A slice of heaven in a box, or so I've been told
+Gimme that pizza love, and my heart will be gold
+```
+
+3. Remove inference service
+
+```
+% helm uninstall my-llama-1
+release "my-llama-1" uninstalled
+```
+### PVC Example
+
+The PVC example runs in much the same way. It mounts a PVC in the example NIM container at `/opt/nim/.cache`, where it can be used as a cache, and configures it to be retained between `helm uninstall` and `helm install`, so that the model data only needs to be downloaded on first use.
+
+```
+% helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY my-llama-pvc examples/basic-llama-pvc
+
+% kubectl get ksvc basic-llama-pvc --watch
+```
+
+### Troubleshooting
+
+Users can troubleshoot workloads by looking at the underlying resources that are created. There should be deployments, pods, and ksvcs to describe or view logs from.
+
+## Air-gapped operations
+
+For scenarios in which Run:ai clusters are operating in air-gapped (disconnected) environments, please see the NVIDIA NIM documentation for [serving models from local assets](https://docs.nvidia.com/nim/large-language-models/latest/getting-started.html#serving-models-from-local-assets).
diff --git a/docs/run.ai/examples/basic-llama-pvc/.helmignore b/docs/run.ai/examples/basic-llama-pvc/.helmignore
new file mode 100644
index 0000000..0e8a0eb
--- /dev/null
+++ b/docs/run.ai/examples/basic-llama-pvc/.helmignore
@@ -0,0 +1,23 @@
+# Patterns to ignore when building packages.
+# This supports shell glob matching, relative path matching, and
+# negation (prefixed with !). Only one pattern per line.
+.DS_Store
+# Common VCS dirs
+.git/
+.gitignore
+.bzr/
+.bzrignore
+.hg/
+.hgignore
+.svn/
+# Common backup files
+*.swp
+*.bak
+*.tmp
+*.orig
+*~
+# Various IDEs
+.project
+.idea/
+*.tmproj
+.vscode/
diff --git a/docs/run.ai/examples/basic-llama-pvc/Chart.yaml b/docs/run.ai/examples/basic-llama-pvc/Chart.yaml
new file mode 100644
index 0000000..2602149
--- /dev/null
+++ b/docs/run.ai/examples/basic-llama-pvc/Chart.yaml
@@ -0,0 +1,24 @@
+apiVersion: v2
+name: basic-llama-pvc
+description: A Helm chart for Kubernetes
+
+# A chart can be either an 'application' or a 'library' chart.
+#
+# Application charts are a collection of templates that can be packaged into versioned archives
+# to be deployed.
+#
+# Library charts provide useful utilities or functions for the chart developer. They're included as
+# a dependency of application charts to inject those utilities and functions into the rendering
+# pipeline. Library charts do not define any templates and therefore cannot be deployed.
+type: application
+
+# This is the chart version. This version number should be incremented each time you make changes
+# to the chart and its templates, including the app version.
+# Versions are expected to follow Semantic Versioning (https://semver.org/)
+version: 0.1.0
+
+# This is the version number of the application being deployed. This version number should be
+# incremented each time you make changes to the application. Versions are not expected to
+# follow Semantic Versioning. They should reflect the version the application is using.
+# It is recommended to use it with quotes.
+appVersion: "1.0.0"
diff --git a/docs/run.ai/examples/basic-llama-pvc/templates/inferenceworkload.yaml b/docs/run.ai/examples/basic-llama-pvc/templates/inferenceworkload.yaml
new file mode 100644
index 0000000..4acc3f1
--- /dev/null
+++ b/docs/run.ai/examples/basic-llama-pvc/templates/inferenceworkload.yaml
@@ -0,0 +1,39 @@
+apiVersion: run.ai/v2alpha1
+kind: InferenceWorkload
+metadata:
+  name: basic-llama-pvc
+  namespace: {{ .Values.namespace }}
+spec:
+  name:
+    value: basic-llama-pvc
+  environment:
+    items:
+      NGC_API_KEY:
+        value: SECRET:ngc-secret-pvc,NGC_API_KEY
+  gpu:
+    value: "1"
+  image:
+    value: "nvcr.io/nim/meta/llama-3.1-8b-instruct"
+  minScale:
+    value: 1
+  maxScale:
+    value: 2
+  runAsUid:
+    value: 1000
+  runAsGid:
+    value: 1000
+  ports:
+    items:
+      serving-port:
+        value:
+          container: 8000
+          protocol: http
+          serviceType: ServingPort
+  pvcs:
+    items:
+      pvc:
+        value:
+          claimName: nim-cache
+          existingPvc: true
+          path: /opt/nim/.cache
+          readOnly: false
diff --git a/docs/run.ai/examples/basic-llama-pvc/templates/ngc-secret.yaml b/docs/run.ai/examples/basic-llama-pvc/templates/ngc-secret.yaml
new file mode 100644
index 0000000..53772fb
--- /dev/null
+++ b/docs/run.ai/examples/basic-llama-pvc/templates/ngc-secret.yaml
@@ -0,0 +1,8 @@
+apiVersion: v1
+kind: Secret
+type: Opaque
+metadata:
+  name: ngc-secret-pvc
+  namespace: {{ .Values.namespace}}
+data:
+  NGC_API_KEY: {{ .Values.ngcKey | b64enc }}
diff --git a/docs/run.ai/examples/basic-llama-pvc/templates/pvc.yaml b/docs/run.ai/examples/basic-llama-pvc/templates/pvc.yaml
new file mode 100644
index 0000000..07ef52c
--- /dev/null
+++ b/docs/run.ai/examples/basic-llama-pvc/templates/pvc.yaml
@@ -0,0 +1,14 @@
+kind: PersistentVolumeClaim
+apiVersion: v1
+metadata:
+  name: nim-cache
+  namespace: {{ .Values.namespace }}
+  annotations:
+    helm.sh/resource-policy: "keep"
+spec:
+  storageClassName: {{ .Values.storageClassName }}
+  accessModes:
+    - ReadWriteMany
+  resources:
+    requests:
+      storage: 32Gi
diff --git a/docs/run.ai/examples/basic-llama-pvc/values.yaml b/docs/run.ai/examples/basic-llama-pvc/values.yaml
new file mode 100644
index 0000000..69b36db
--- /dev/null
+++ b/docs/run.ai/examples/basic-llama-pvc/values.yaml
@@ -0,0 +1,8 @@
+
+# These can be edited here locally, but should be overridden like so:
+# helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY
+namespace: override-with-flag
+ngcKey: override-with-flag
+
+## optional to override
+storageClassName: standard-rwx
diff --git a/docs/run.ai/examples/basic-llama/.helmignore b/docs/run.ai/examples/basic-llama/.helmignore
new file mode 100644
index 0000000..0e8a0eb
--- /dev/null
+++ b/docs/run.ai/examples/basic-llama/.helmignore
@@ -0,0 +1,23 @@
+# Patterns to ignore when building packages.
+# This supports shell glob matching, relative path matching, and
+# negation (prefixed with !). Only one pattern per line.
+.DS_Store
+# Common VCS dirs
+.git/
+.gitignore
+.bzr/
+.bzrignore
+.hg/
+.hgignore
+.svn/
+# Common backup files
+*.swp
+*.bak
+*.tmp
+*.orig
+*~
+# Various IDEs
+.project
+.idea/
+*.tmproj
+.vscode/
diff --git a/docs/run.ai/examples/basic-llama/Chart.yaml b/docs/run.ai/examples/basic-llama/Chart.yaml
new file mode 100644
index 0000000..37c63bd
--- /dev/null
+++ b/docs/run.ai/examples/basic-llama/Chart.yaml
@@ -0,0 +1,24 @@
+apiVersion: v2
+name: basic-llama
+description: A Helm chart for Kubernetes
+
+# A chart can be either an 'application' or a 'library' chart.
+#
+# Application charts are a collection of templates that can be packaged into versioned archives
+# to be deployed.
+#
+# Library charts provide useful utilities or functions for the chart developer. They're included as
+# a dependency of application charts to inject those utilities and functions into the rendering
+# pipeline. Library charts do not define any templates and therefore cannot be deployed.
+type: application
+
+# This is the chart version. This version number should be incremented each time you make changes
+# to the chart and its templates, including the app version.
+# Versions are expected to follow Semantic Versioning (https://semver.org/)
+version: 0.1.0
+
+# This is the version number of the application being deployed. This version number should be
+# incremented each time you make changes to the application. Versions are not expected to
+# follow Semantic Versioning. They should reflect the version the application is using.
+# It is recommended to use it with quotes.
+appVersion: "1.0.0"
diff --git a/docs/run.ai/examples/basic-llama/templates/inferenceworkload.yaml b/docs/run.ai/examples/basic-llama/templates/inferenceworkload.yaml
new file mode 100644
index 0000000..acb65cd
--- /dev/null
+++ b/docs/run.ai/examples/basic-llama/templates/inferenceworkload.yaml
@@ -0,0 +1,27 @@
+apiVersion: run.ai/v2alpha1
+kind: InferenceWorkload
+metadata:
+  name: basic-llama
+  namespace: {{ .Values.namespace }}
+spec:
+  name:
+    value: basic-llama
+  environment:
+    items:
+      NGC_API_KEY:
+        value: SECRET:ngc-secret,NGC_API_KEY
+  gpu:
+    value: "1"
+  image:
+    value: "nvcr.io/nim/meta/llama-3.1-8b-instruct"
+  minScale:
+    value: 1
+  maxScale:
+    value: 2
+  ports:
+    items:
+      serving-port:
+        value:
+          container: 8000
+          protocol: http
+          serviceType: ServingPort
diff --git a/docs/run.ai/examples/basic-llama/templates/ngc-secret.yaml b/docs/run.ai/examples/basic-llama/templates/ngc-secret.yaml
new file mode 100644
index 0000000..78fb174
--- /dev/null
+++ b/docs/run.ai/examples/basic-llama/templates/ngc-secret.yaml
@@ -0,0 +1,8 @@
+apiVersion: v1
+kind: Secret
+type: Opaque
+metadata:
+  name: ngc-secret
+  namespace: {{ .Values.namespace}}
+data:
+  NGC_API_KEY: {{ .Values.ngcKey | b64enc }}
diff --git a/docs/run.ai/examples/basic-llama/values.yaml b/docs/run.ai/examples/basic-llama/values.yaml
new file mode 100644
index 0000000..e09c86e
--- /dev/null
+++ b/docs/run.ai/examples/basic-llama/values.yaml
@@ -0,0 +1,5 @@
+
+# These can be edited here locally, but should be overridden like so:
+# helm install --set namespace=$NAMESPACE --set ngcKey=$NGC_KEY
+namespace: override-with-flag
+ngcKey: override-with-flag
diff --git a/docs/run.ai/examples/query-llama.sh b/docs/run.ai/examples/query-llama.sh
new file mode 100755
index 0000000..09012c0
--- /dev/null
+++ b/docs/run.ai/examples/query-llama.sh
@@ -0,0 +1,29 @@
+#!/bin/bash
+
+if [[ -z $LHOST ]]; then
+    echo "please provide an LHOST env var"
+    exit 1
+fi
+
+Q="Write a song about pizza"
+MODEL=$(curl -s "http://${LHOST}/v1/models" | jq -r '.data[0]|.id')
+
+curl -s "http://${LHOST}/v1/chat/completions" \
+  -H "Accept: application/json" \
+  -H "Content-Type: application/json" \
+  -d '{
+    "messages": [
+      {
+        "content": "'"${Q}"'",
+        "role": "user"
+      }
+    ],
+    "model": "'"${MODEL}"'",
+    "max_tokens": 500,
+    "top_p": 0.8,
+    "temperature": 0.9,
+    "seed": '$RANDOM',
+    "stream": false,
+    "stop": ["hello\n"],
+    "frequency_penalty": 1.0
+}' | jq -r '.choices[0]|.message.content'
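
Step 2 of the new README asks the reader to copy the hostname out of the ksvc URL by hand before running `query-llama.sh`. A minimal sketch of automating that step follows. It assumes a hypothetical helper named `url_to_host`, which is not part of this change, and it uses only bash parameter expansion, so it needs no extra tools.

```shell
#!/bin/bash
# Hypothetical helper (an illustration, not part of this PR): reduce a URL
# such as the one printed by `kubectl get ksvc` to the bare hostname that
# query-llama.sh expects in LHOST.
url_to_host() {
    local url="$1"
    url="${url#*://}"   # drop a leading scheme such as "http://", if present
    url="${url%%/*}"    # drop everything from the first "/" onward, if present
    printf '%s\n' "$url"
}
```

With cluster access, the result could be wired in with something like `export LHOST="$(url_to_host "$(kubectl get ksvc basic-llama -o jsonpath='{.status.url}')")"`. This assumes the Knative Service reports its address in `.status.url`, which is the field behind the URL column in the `kubectl get ksvc` output.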