Commit db3277c

Commit message: remerged from internal repo
1 parent: 9377937
15 files changed: +147 −492 lines

GETTING_STARTED_CONSOLE_DEPLOY.md

Lines changed: 94 additions & 447 deletions
Large diffs are not rendered by default.

GETTING_STARTED_HELM_DEPLOY.md

Lines changed: 23 additions & 43 deletions
@@ -124,6 +124,7 @@ helm install lens oci-ai-incubations/lens -n lens --create-namespace \
   --set grafana.adminPassword="access password for grafana portal. User name is admin by default" \

 ```
+
 ## Verify for successful install

 Once the installation is complete you should see the following pods in the "lens" namespace. If you don't please uninstall and reinstall or check the helm install events/logs.
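Note (editor's sketch, not part of the commit): a minimal way to confirm the pods mentioned above, assuming the `lens` namespace used by the install command:

```bash
# List the chart's pods and wait until they all report Ready.
kubectl get pods -n lens
kubectl wait --for=condition=Ready pods --all -n lens --timeout=300s
```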
@@ -226,23 +227,17 @@ helm install lens ./helm -n lens \
   --set backend.image.tag=stable
 ```

-### Uninstall the control plane components
-```bash
-helm uninstall lens -n lens
-```
-
-
 ## Step 2: OCI GPU Data Plane Plugin installation on GPU Nodes

-1. **Navigate to Dashboards**: Go to the dashboard section
+1. **Navigate to Dashboards**: Go to the dashboard section of the OCI GPU Scanner Portal
 2. **Go to Tab - OCI GPU Scanner Install Script**:
    - You can use the script there and deploy the oci-scanner plugin on to your gpus nodes manually.
    - Embed them into a slurm script if you run a slurm cluster.
    - Use the kubernetes objects for the plugin under the `oci_scanner_plugin` folder for a Kubernetes cluster. Refer to [Readme](oci_scanner_plugin/README.md).
    - use the same scripts to be added as part of your new GPU compute deployments through cloud-init scripts.
 ---

-## Step 4: Explore Monitoring Dashboards
+## Step 3: Explore Monitoring Dashboards

 1. **Navigate to Dashboards**: Go to the dashboard section
 2. **View Available Dashboards**:
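Note (editor's sketch, not part of the commit): one way the plugin install could be embedded in a Slurm job, as the Step 2 bullets suggest; `install_scanner.sh` is a hypothetical placeholder for the script copied from the OCI GPU Scanner Install Script tab:

```bash
#!/bin/bash
#SBATCH --job-name=oci-scanner-plugin
#SBATCH --nodes=4              # one task per GPU node to onboard
#SBATCH --ntasks-per-node=1

# install_scanner.sh stands in for the script from the portal's
# "OCI GPU Scanner Install Script" tab; srun runs it once on every node.
srun bash ./install_scanner.sh
```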
@@ -255,46 +250,31 @@ helm uninstall lens -n lens
 6. **Access Additional Features**:
    - **Custom Queries**: Use Prometheus queries to create custom visualizations
    - **Alerting**: Set up alerts for critical GPU or cluster issues
-
 ---

-## Architecture
-
-The Helm chart deploys the following components:
-
-1. **Frontend (Portal)**
-   - React/Node.js application
-   - Served on port 3000
-   - Service for internal/external access
-
-2. **Backend (Control Plane)**
-   - Django application
-   - Served on port 5000 (container), 80 (service)
-   - External access via LoadBalancer service
-   - Connects to Postgres
-   - Configured with Prometheus Pushgateway and Grafana URLs
-
-3. **Postgres Database**
-   - Managed via StatefulSet/Deployment
-   - Persistent storage via PVC
-   - Service for backend connectivity
-
-4. **ConfigMaps and Secrets**
-   - All environment variables and sensitive data are managed via ConfigMaps and Kubernetes Secrets
-
 ## Cleanup

 You can remove all control plane resources in **one step**:

-1. **Destroy the Control Plane Components**
-   - Go to **Resource Manager → Stacks** in the OCI Console.
-   - Select your **OCI GPU Scanner stack**.
-   - Click **Destroy**, confirm, and wait until the job succeeds.
+### Uninstall the control plane components
+```bash
+helm uninstall lens -n lens
+```
+### Uninstall the data plane components if installed as OKE daemon set

-This will remove:
-- The OKE cluster and all nodes
-- The VCN and networking components
-- All OCI GPU Scanner application components
-- Associated storage and IAM policies (if created)
+```bash
+helm uninstall lens -n lens
+```
+### Uninstall the data plane components if it was installed as system services (per GPU node)
+
+```bash
+cd /home/ubuntu/$(hostname)/oci-lens-plugin/
+./uninstall
+cd ..
+rm -rf *
+cd ..
+rmdir $(hostname)
+
+```

-Once the stack is destroyed, your tenancy will be free of any OCI GPU Scanner-related resources.
+Once the stack is destroyed, your OKE cluster will be free of any OCI GPU Scanner-related resources.
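Note (editor's sketch, not part of the commit): the "Custom Queries" bullet above can be exercised directly against the Prometheus HTTP API; the metric name `DCGM_FI_DEV_GPU_UTIL` and the endpoint are assumptions, so substitute whatever the scanner actually exports:

```bash
# Ad-hoc PromQL query: average GPU utilization per host (assumed DCGM-style metric).
curl -s "http://<prometheus-host>:9090/api/v1/query" \
  --data-urlencode 'query=avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)'
```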

GETTING_STARTED_RM_DEPLOY.md

Lines changed: 1 addition & 1 deletion
@@ -1,4 +1,4 @@
-# Getting Started with OCI GPU Scanner Quickstart
+# Getting started with OCI GPU Scanner quickstart using resource manager

 **❗❗Important: The instructions below are for creating a new standalone deployment. To install OCI GPU Scanner on an existing OKE cluster, please refer to the [Install OCI GPU Scanner to an Existing OKE Cluster](GETTING_STARTED_HELM_DEPLOY.md)**

README.md

Lines changed: 29 additions & 1 deletion
@@ -86,6 +86,33 @@ eth0 presence check: Checks if the eth0 network interface is present

 Additional checks are performed based on GPU type (AMD or NVIDIA), such as XGMI, NVLINK, and fabric manager monitoring.

+## Architecture
+
+The Helm chart deploys the following components:
+
+1. **Frontend (Portal)**
+   - React/Node.js application
+   - Served on port 3000
+   - Service for internal/external access
+
+2. **Backend (Control Plane)**
+   - Django application
+   - Served on port 5000 (container), 80 (service)
+   - External access via LoadBalancer service
+   - Connects to Postgres
+   - Configured with Prometheus Push gateway and Grafana URLs
+
+3. **Postgres Database**
+   - Managed via StatefulSet/Deployment
+   - Persistent storage via PVC
+   - Service for backend connectivity
+
+4. **ConfigMaps and Secrets**
+   - All environment variables and sensitive data are managed via ConfigMaps and Kubernetes Secrets
+
+Sample deployment stamp.
+
+![deployment architecture](/media/scanner_architecture.png "architecture snapshot")

 ## Dashboards & Monitoring
 After deployment, you will have access to Grafana, Prometheus, and Portal endpoints for data interaction. See example screenshots below:
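Note (editor's sketch, not part of the commit): given the architecture added above (backend behind a LoadBalancer, portal on port 3000), reaching the endpoints might look like the following; the service name is a placeholder and depends on the chart release:

```bash
# Find the LoadBalancer address of the backend, then tunnel to the portal locally.
kubectl get svc -n lens
kubectl port-forward -n lens svc/<portal-service> 3000:3000
```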
@@ -140,7 +167,8 @@ The below list of features are being prioritized. If you would like a new featur
 ## Limitations

 1. Only Ubuntu Linux OS based GPU node monitoring is supported.
-2. Control plane components only work with x86 CPU nodes
+2. Control plane components only work with x86 CPU nodes.
+3. Active health checks do not run as low priority jobs hence running a active health check will disrupt any existing GPU workloads active on that node.

 ## Support & Contact

Binary image files changed:

- (unnamed image) 169 KB
- (unnamed image) 127 KB
- media/portaldeploy/confirm.png (134 KB)
- media/portaldeploy/deploy1.png (226 KB)
- media/portaldeploy/monitoring.png (511 KB)
- (unnamed image) 380 KB