Commit bf827e8

Merge pull request #231 from jeffbearer/feature/readiness-probe-29392102
Enhanced CSI driver readiness validation with comprehensive LNet health checks
2 parents 7cf6130 + efb26ae commit bf827e8

File tree (6 files changed: +330 −2 lines)

- deploy/csi-azurelustre-controller.yaml
- deploy/csi-azurelustre-node.yaml
- docs/csi-debug.md
- docs/install-csi-driver.md
- pkg/azurelustreplugin/Dockerfile
- pkg/azurelustreplugin/readinessProbe.sh

deploy/csi-azurelustre-controller.yaml

Lines changed: 0 additions & 1 deletion

@@ -14,7 +14,6 @@ spec:
      labels:
        app: csi-azurelustre-controller
    spec:
-     hostNetwork: true
      serviceAccountName: csi-azurelustre-controller-sa
      nodeSelector:
        kubernetes.io/os: linux # add "kubernetes.io/role: master" to run controller on master node

deploy/csi-azurelustre-node.yaml

Lines changed: 16 additions & 0 deletions

@@ -110,6 +110,22 @@ spec:
            initialDelaySeconds: 60
            timeoutSeconds: 10
            periodSeconds: 30
+          readinessProbe:
+            failureThreshold: 5
+            exec:
+              command:
+                - /app/readinessProbe.sh
+            initialDelaySeconds: 10
+            timeoutSeconds: 10
+            periodSeconds: 30
+          startupProbe:
+            failureThreshold: 120
+            exec:
+              command:
+                - /app/readinessProbe.sh
+            initialDelaySeconds: 10
+            timeoutSeconds: 5
+            periodSeconds: 5
          env:
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock

docs/csi-debug.md

Lines changed: 151 additions & 0 deletions

@@ -2,6 +2,157 @@

---

## Driver Readiness and Health Issues

### Enhanced LNet Validation Troubleshooting

**Symptoms:**

- CSI driver node pods show `2/3` ready status
- Readiness probe failing repeatedly
- Pods remain `Running` but never become ready, or fail their startup probes
- Mount operations fail with "driver not ready" errors
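
A quick way to confirm these symptoms is to list the node pods and check the `READY` column (the same label selector is used throughout this guide); a `2/3` count with `Running` status usually means the `azurelustre` container is failing its readiness probe:

```shell
kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
```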

#### Detailed Probe Verification Steps

If the exec-based readiness probe fails (exit code 1), use these detailed verification steps:

```shell
# Verify detailed probe configuration
kubectl describe -n kube-system pod -l app=csi-azurelustre-node
```

Look for exec-based probe configuration in the pod description:

- `Readiness: exec [/app/readinessProbe.sh]`
- `Startup: exec [/app/readinessProbe.sh]`

In the Events section, you may see initial startup probe failures during LNet initialization:

- `Warning Unhealthy ... Startup probe failed: Node pod detected - performing Lustre-specific readiness checks`

This is normal during the initialization phase. Once LNet is fully operational, the probes will succeed.

```shell
# Test the readiness probe script directly
kubectl exec -n kube-system <pod-name> -c azurelustre -- /app/readinessProbe.sh
```

Expected output when working correctly:

- `"Node pod detected - performing Lustre-specific readiness checks"`
- `"All Lustre readiness checks passed"`

```shell
# Check for enhanced validation messages
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
```

Look for CSI driver startup and readiness messages:

- `"vendor_version":"v0.4.0-readiness-http"` - Confirms feature branch deployment
- Standard CSI GRPC operation logs indicating successful driver initialization

```shell
# Check for detailed validation failure reasons
kubectl logs -n kube-system <pod-name> -c azurelustre | grep -E "(LNet validation failed|Failed to|not operational)"
```

Common issues and solutions:

- **"No valid NIDs"**: LNet networking not properly configured
- **"Self-ping test failed"**: Network connectivity issues
- **"Interfaces not operational"**: Network interfaces not in UP state
- **"Lustre module not loaded"**: Kernel module loading issues

**Test readiness probe directly:**

```sh
# Test the exec-based readiness probe script
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- /app/readinessProbe.sh
```

Expected responses:

- Exit code 0: Enhanced LNet validation passed
- Exit code 1: One or more validation checks failed (with descriptive error message)
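
To see the exit code the kubelet acts on, run the probe manually and echo `$?` immediately afterwards (`kubectl exec` propagates the script's exit status):

```sh
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- /app/readinessProbe.sh
echo "readinessProbe.sh exit code: $?"
```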

**Test HTTP health endpoints (optional manual testing):**

```sh
# Test enhanced readiness/liveness via HTTP endpoint
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s localhost:29763/healthz
```

HTTP responses:

- `/healthz`: `ok` (HTTP 200) or `not ready` (HTTP 503)
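
To capture the HTTP status code as well as the body, the same check can be run with curl's `-w` format option:

```sh
# Print only the HTTP status code returned by the health endpoint
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s -o /dev/null -w "%{http_code}\n" localhost:29763/healthz
```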

**Check enhanced validation logs:**

```sh
# Look for detailed LNet validation messages
kubectl logs -n kube-system <csi-azurelustre-node-pod> -c azurelustre | grep -E "(LNet validation|NIDs|self-ping|interfaces)"
```

Look for validation success messages:

- `"LNet validation passed: all checks successful"`
- `"Found NIDs: <network-identifiers>"`
- `"LNet self-ping to <nid> successful"`
- `"All LNet interfaces operational"`

**Common readiness failure patterns:**

1. **No valid NIDs found:**

   ```text
   LNet validation failed: no valid NIDs
   No valid non-loopback LNet NIDs found
   ```

   **Solution:** Check LNet configuration and network setup

2. **Self-ping test failed:**

   ```text
   LNet validation failed: self-ping test failed
   LNet self-ping to <nid> failed
   ```

   **Solution:** Verify network connectivity and LNet networking

3. **Interfaces not operational:**

   ```text
   LNet validation failed: interfaces not operational
   Found non-operational interface: status: down
   ```

   **Solution:** Check network interface status and configuration

4. **Module loading issues:**

   ```text
   Lustre module not loaded
   LNet kernel module is not loaded
   ```

   **Solution:** Check kernel module installation and loading

**Debug LNet configuration manually:**

```sh
# Check kernel modules
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lsmod | grep -E "(lnet|lustre)"

# Check LNet NIDs
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl list_nids

# Test LNet self-ping
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl ping <nid>

# Check interface status
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lnetctl net show --net tcp
```
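
When running the self-ping by hand, you can take the first non-loopback NID straight from `lctl list_nids` rather than typing it in (a small convenience sketch using the placeholder pod name from the commands above):

```sh
# Grab the first non-loopback NID and ping it
NID=$(kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl list_nids | grep -v "@lo" | head -1)
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl ping "$NID"
```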

**Check probe configuration:**

```sh
# Verify probe settings in deployment
kubectl describe -n kube-system pod <csi-azurelustre-node-pod> | grep -A 10 -E "(Liveness|Readiness|Startup)"
```

**Monitor readiness probe attempts:**

```sh
# Watch probe events in real-time
kubectl get events --field-selector involvedObject.name=<csi-azurelustre-node-pod> -n kube-system -w | grep -E "(Readiness|Liveness)"
```

---

## Volume Provisioning Issues

### Dynamic Provisioning (AMLFS Cluster Creation) - Public Preview

docs/install-csi-driver.md

Lines changed: 91 additions & 0 deletions

@@ -39,6 +39,54 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
csi-azurelustre-node-g6sfx 3/3 Running 0 30s
```

### Verifying CSI Driver Readiness for Lustre Operations

Before mounting Azure Lustre filesystems, it's important to verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes **enhanced LNet validation** that performs comprehensive readiness checks, verifying that the driver was able to:

- Load required kernel modules (lnet, lustre)
- Configure LNet networking with valid Network Identifiers (NIDs)
- Verify LNet self-ping functionality
- Validate all network interfaces are operational
- Complete all initialization steps
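
If you want to spot-check the LNet layer yourself, listing the configured NIDs from inside a node pod is a quick manual version of the probe's first check (placeholder pod name, as in the troubleshooting guide):

```shell
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl list_nids
```

Any non-loopback NID in the output (something other than `0@lo`) indicates that LNet networking has been configured.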

#### Readiness Validation

The CSI driver deployment includes automated probes for container health monitoring:

- **Liveness Probe**: `/healthz` (Port 29763) - HTTP endpoint for basic container health
- **Container Status**: Kubernetes readiness based on container startup and basic health checks

#### Verification Steps

1. **Check pod readiness status:**

   ```shell
   kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
   ```

   All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.

2. **Verify probe configuration:**

   ```shell
   kubectl describe -n kube-system pod -l app=csi-azurelustre-node
   ```

   Look for exec-based readiness and startup probe configuration in the pod description:

   - `Readiness: exec [/app/readinessProbe.sh]`
   - `Startup: exec [/app/readinessProbe.sh]`

   In the Events section, you may see initial startup probe failures during LNet initialization:

   - `Warning Unhealthy ... Startup probe failed: Node pod detected - performing Lustre-specific readiness checks`

   This is normal during the initialization phase. Once LNet is fully operational, the probes will succeed and no more failure events will appear.

3. **Monitor validation logs:**

   ```shell
   kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
   ```

   Look for CSI driver startup and successful GRPC operation logs indicating driver initialization is complete.

> **Note**: If you encounter readiness or initialization issues, see the [CSI Driver Troubleshooting Guide](csi-debug.md#enhanced-lnet-validation-troubleshooting) for detailed debugging steps.

**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.

## Default instructions for production release

### Install with kubectl (current production release)

@@ -73,3 +121,46 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
csi-azurelustre-node-drlq2 3/3 Running 0 30s
csi-azurelustre-node-g6sfx 3/3 Running 0 30s
```

### Verifying CSI Driver Readiness for Lustre Operations

Before mounting Azure Lustre filesystems, it is important to verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes enhanced LNet validation that performs comprehensive readiness checks, verifying that the driver was able to:

- Load required kernel modules (lnet, lustre)
- Configure LNet networking with valid Network Identifiers (NIDs)
- Verify LNet self-ping functionality
- Validate all network interfaces are operational
- Complete all initialization steps

#### Enhanced Readiness Validation

The CSI driver deployment includes automated **exec-based readiness probes** for accurate readiness detection:

- **Readiness & Startup Probes**: `/app/readinessProbe.sh` - Exec-based validation with comprehensive LNet checking
- **Liveness Probe**: `/healthz` (Port 29763) - HTTP endpoint for basic container health
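
For reference, the probe definitions added to the node DaemonSet (`deploy/csi-azurelustre-node.yaml`) in this change look like the following, trimmed here to the probe stanzas themselves:

```yaml
readinessProbe:
  failureThreshold: 5
  exec:
    command:
      - /app/readinessProbe.sh
  initialDelaySeconds: 10
  timeoutSeconds: 10
  periodSeconds: 30
startupProbe:
  failureThreshold: 120
  exec:
    command:
      - /app/readinessProbe.sh
  initialDelaySeconds: 10
  timeoutSeconds: 5
  periodSeconds: 5
```

With `failureThreshold: 120` and `periodSeconds: 5`, the startup probe gives LNet roughly 120 × 5 s = 10 minutes to come up before the kubelet restarts the container; once started, readiness is re-checked every 30 s and the pod is marked unready after 5 consecutive failures.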

#### Verification Steps

1. **Check pod readiness status:**

   ```shell
   kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
   ```

   All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.

2. **Verify probe configuration:**

   ```shell
   kubectl describe -n kube-system pod -l app=csi-azurelustre-node
   ```

   Look for exec-based readiness and startup probe configuration and check that no recent probe failures appear in the Events section.

3. **Monitor validation logs:**

   ```shell
   kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
   ```

   Look for CSI driver startup and successful GRPC operation logs indicating driver initialization is complete.

> **Note**: If you encounter readiness or initialization issues, see the [CSI Driver Troubleshooting Guide](csi-debug.md#enhanced-lnet-validation-troubleshooting) for detailed debugging steps.

**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.

pkg/azurelustreplugin/Dockerfile

Lines changed: 2 additions & 1 deletion

@@ -16,8 +16,9 @@ FROM ubuntu:22.04

COPY "./_output/azurelustreplugin" "/app/azurelustreplugin"
COPY "./pkg/azurelustreplugin/entrypoint.sh" "/app/entrypoint.sh"
+COPY "./pkg/azurelustreplugin/readinessProbe.sh" "/app/readinessProbe.sh"

-RUN chmod +x "/app/entrypoint.sh"
+RUN chmod +x "/app/entrypoint.sh" && chmod +x "/app/readinessProbe.sh"

RUN apt-get update && \
    apt-get upgrade -y && \
pkg/azurelustreplugin/readinessProbe.sh

Lines changed: 70 additions & 0 deletions

@@ -0,0 +1,70 @@
#!/bin/bash

# readinessProbe.sh - Health check script for Azure Lustre CSI driver
# This script performs direct LNet readiness validation

set -euo pipefail

# Check if this is a controller pod (no Lustre client installation required)
INSTALL_LUSTRE_CLIENT=${AZURELUSTRE_CSI_INSTALL_LUSTRE_CLIENT:-"yes"}

if [[ "${INSTALL_LUSTRE_CLIENT}" == "no" ]]; then
  echo "Controller pod detected - reporting ready (skipping Lustre checks)"
  exit 0
fi

echo "Node pod detected - performing Lustre-specific readiness checks"

# Check if CSI socket exists and is accessible
SOCKET_PATH=${CSI_ENDPOINT#unix://}

if [[ ! -S "${SOCKET_PATH}" ]]; then
  echo "CSI socket not found: ${SOCKET_PATH}"
  exit 1
fi

# Check if LNet is properly configured and operational
# This replicates the logic from CheckLustreReadiness()

# Check if LNet NIDs are valid and available
if ! lnetctl net show >/dev/null 2>&1; then
  echo "LNet not available or not configured"
  exit 1
fi

# Check if we have any NIDs configured
NID_COUNT=$(lnetctl net show 2>/dev/null | grep -c "nid:" || true)  # '|| true' so a zero count is handled below instead of tripping 'set -e'
if [[ "${NID_COUNT}" -eq 0 ]]; then
  echo "No LNet NIDs configured"
  exit 1
fi

# Check LNet self-ping functionality
if ! lnetctl ping --help >/dev/null 2>&1; then
  echo "LNet ping functionality not available"
  exit 1
fi

# Get the first available NID for self-ping test (exclude loopback)
FIRST_NID=$(lnetctl net show 2>/dev/null | grep "nid:" | grep -v "@lo" | head -1 | sed 's/.*nid: \([^ ]*\).*/\1/' || echo "")
if [[ -z "${FIRST_NID}" ]]; then
  echo "Unable to determine LNet NID for self-ping test"
  exit 1
fi

# Perform self-ping test with timeout
if ! timeout 10 lnetctl ping "${FIRST_NID}" >/dev/null 2>&1; then
  echo "LNet self-ping test failed for NID: ${FIRST_NID}"
  exit 1
fi

# Check if LNet interfaces are operational
# Verify we have at least one interface in 'up' state
UP_INTERFACES=$(lnetctl net show 2>/dev/null | grep -c "status: up" || true)  # '|| true' so a zero count is handled below instead of tripping 'set -e'
if [[ "${UP_INTERFACES}" -eq 0 ]]; then
  echo "No LNet interfaces in 'up' state"
  exit 1
fi

echo "All Lustre readiness checks passed"
exit 0
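
The controller/node branch at the top of the script keys off the `AZURELUSTRE_CSI_INSTALL_LUSTRE_CLIENT` environment variable. A pod that should skip the Lustre-specific checks would set it on the container, roughly like this (illustrative snippet; the controller manifest may wire it differently):

```yaml
env:
  - name: AZURELUSTRE_CSI_INSTALL_LUSTRE_CLIENT
    value: "no"
```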
