--set grafana.adminPassword="<access password for the Grafana portal; the user name is admin by default>" \
```
## Verify successful install
Once the installation is complete, you should see the following pods in the `lens` namespace. If you don't, check the Helm install events/logs or uninstall and reinstall.
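A quick check, assuming the default `lens` namespace used during install:

```bash
# List the OCI GPU Scanner control plane pods
kubectl get pods -n lens
```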
## Step 2: OCI GPU Data Plane Plugin Installation on GPU Nodes
1. **Navigate to Dashboards**: Go to the dashboard section of the OCI GPU Scanner Portal
2. **Go to the "OCI GPU Scanner Install Script" tab**:
- You can use the script there to deploy the oci-scanner plugin onto your GPU nodes manually.
- Embed it into a Slurm script if you run a Slurm cluster.
- Use the Kubernetes objects for the plugin under the `oci_scanner_plugin` folder for a Kubernetes cluster. Refer to the [README](oci_scanner_plugin/README.md).
- Add the same scripts to your new GPU compute deployments through cloud-init, as in the sketch after this list.
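A minimal cloud-init sketch for the last option; the script URL below is a hypothetical placeholder for wherever you host the install script taken from the portal's install tab:

```bash
#!/bin/bash
# Hypothetical cloud-init user data: fetch and run the OCI GPU Scanner
# install script on first boot. Replace the URL with your actual script location.
curl -sSL "https://<your-script-host>/oci-scanner-install.sh" -o /tmp/oci-scanner-install.sh
chmod +x /tmp/oci-scanner-install.sh
/tmp/oci-scanner-install.sh
```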
---
## Step 3: Explore Monitoring Dashboards
1. **Navigate to Dashboards**: Go to the dashboard section of the OCI GPU Scanner Portal
2. **View Available Dashboards**:
6. **Access Additional Features**:
- **Custom Queries**: Use Prometheus queries to create custom visualizations (see the example after this list)
- **Alerting**: Set up alerts for critical GPU or cluster issues
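For example, a sketch of a custom query against the Prometheus HTTP API; the endpoint placeholder and the metric name `DCGM_FI_DEV_GPU_UTIL` assume standard DCGM exporter metrics are being scraped:

```bash
# Average GPU utilization per host via the Prometheus HTTP API
PROM_URL="http://<prometheus-endpoint>:9090"
curl -sG "${PROM_URL}/api/v1/query" \
  --data-urlencode 'query=avg by (Hostname) (DCGM_FI_DEV_GPU_UTIL)'
```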
---
## Architecture
The Helm chart deploys the following components:
1. **Frontend (Portal)**
- React/Node.js application
- Served on port 3000
- Service for internal/external access
2. **Backend (Control Plane)**
- Django application
- Served on port 5000 (container), 80 (service)
- External access via LoadBalancer service
- Connects to Postgres
- Configured with Prometheus Pushgateway and Grafana URLs
3. **Postgres Database**
- Managed via StatefulSet/Deployment
- Persistent storage via PVC
- Service for backend connectivity
4. **ConfigMaps and Secrets**
- All environment variables and sensitive data are managed via ConfigMaps and Kubernetes Secrets
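After install, you can inspect all of these components at once (a minimal check, assuming the `lens` namespace):

```bash
# List deployments, stateful sets, services, config maps, and secrets
kubectl get deploy,statefulset,svc,configmap,secret -n lens
```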
## Cleanup
You can remove all OCI GPU Scanner resources as follows:
### Uninstall the control plane components
```bash
helm uninstall lens -n lens
```
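To confirm the release is gone, list the remaining Helm releases in the namespace:

```bash
# Should no longer show the "lens" release
helm list -n lens
```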
### Uninstall the data plane components if installed as an OKE DaemonSet
```bash
# If the plugin was installed as its own Helm release, substitute that release name and namespace here
helm uninstall lens -n lens
```
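To verify the plugin DaemonSet is gone, you can search across namespaces; the name filter below is a guess, so adjust it to match your release:

```bash
# Look for any remaining scanner daemon sets across all namespaces
kubectl get daemonsets -A | grep -i scanner
```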
### Uninstall the data plane components if installed as system services (per GPU node)
```bash
# Run these commands on each GPU node where the plugin was installed
cd /home/ubuntu/$(hostname)/oci-lens-plugin/
./uninstall
# Remove the downloaded plugin files and the per-host directory
cd ..
rm -rf *
cd ..
rmdir $(hostname)
```
Once these steps are complete, your OKE cluster will be free of any OCI GPU Scanner-related resources.
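As a final check, assuming the `lens` namespace from the install:

```bash
# Expect "No resources found" once the uninstall is complete
kubectl get all -n lens
```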
# Getting started with OCI GPU Scanner quickstart using Resource Manager
**❗❗Important: The instructions below are for creating a new standalone deployment. To install OCI GPU Scanner on an existing OKE cluster, please refer to the [Install OCI GPU Scanner to an Existing OKE Cluster](GETTING_STARTED_HELM_DEPLOY.md)**
After deployment, you will have access to Grafana, Prometheus, and Portal endpoints for data interaction. See example screenshots below:
## Limitations
1. Only Ubuntu Linux-based GPU node monitoring is supported.
2. Control plane components only work with x86 CPU nodes.
3. Active health checks do not run as low-priority jobs; running an active health check will disrupt any GPU workloads active on that node.