Commit bf827e8

Merge pull request #231 from jeffbearer/feature/readiness-probe-29392102
Enhanced CSI driver readiness validation with comprehensive LNet health checks
2 parents 7cf6130 + efb26ae commit bf827e8

File tree (6 files changed: +330 −2 lines)

- deploy/csi-azurelustre-controller.yaml
- deploy/csi-azurelustre-node.yaml
- docs/csi-debug.md
- docs/install-csi-driver.md
- pkg/azurelustreplugin/Dockerfile
- pkg/azurelustreplugin/readinessProbe.sh

deploy/csi-azurelustre-controller.yaml

Lines changed: 0 additions & 1 deletion

@@ -14,7 +14,6 @@ spec:
      labels:
        app: csi-azurelustre-controller
    spec:
-     hostNetwork: true
      serviceAccountName: csi-azurelustre-controller-sa
      nodeSelector:
        kubernetes.io/os: linux # add "kubernetes.io/role: master" to run controller on master node

deploy/csi-azurelustre-node.yaml

Lines changed: 16 additions & 0 deletions

@@ -110,6 +110,22 @@ spec:
            initialDelaySeconds: 60
            timeoutSeconds: 10
            periodSeconds: 30
+          readinessProbe:
+            failureThreshold: 5
+            exec:
+              command:
+                - /app/readinessProbe.sh
+            initialDelaySeconds: 10
+            timeoutSeconds: 10
+            periodSeconds: 30
+          startupProbe:
+            failureThreshold: 120
+            exec:
+              command:
+                - /app/readinessProbe.sh
+            initialDelaySeconds: 10
+            timeoutSeconds: 5
+            periodSeconds: 5
          env:
            - name: CSI_ENDPOINT
              value: unix:///csi/csi.sock

docs/csi-debug.md

Lines changed: 151 additions & 0 deletions

@@ -2,6 +2,157 @@

---

## Driver Readiness and Health Issues

### Enhanced LNet Validation Troubleshooting

**Symptoms:**

- CSI driver node pods show `2/3` ready status
- Readiness probe failing repeatedly
- Pods remain `Running` but never become ready, or fail their startup probes
- Mount operations fail with "driver not ready" errors
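
A quick way to confirm these symptoms is to list the node pods and check the `READY` column (the same label selector is used throughout this guide); a `2/3` count with `Running` status usually means the `azurelustre` container is failing its readiness probe:

```shell
kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
```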

#### Detailed Probe Verification Steps

If the exec-based readiness probe fails (exit code 1), use these detailed verification steps:

```shell
# Verify detailed probe configuration
kubectl describe -n kube-system pod -l app=csi-azurelustre-node
```

Look for exec-based probe configuration in the pod description:

- `Readiness: exec [/app/readinessProbe.sh]`
- `Startup: exec [/app/readinessProbe.sh]`

In the Events section, you may see initial startup probe failures during LNet initialization:

- `Warning Unhealthy ... Startup probe failed: Node pod detected - performing Lustre-specific readiness checks`

This is normal during the initialization phase. Once LNet is fully operational, the probes will succeed.

```shell
# Test the readiness probe script directly
kubectl exec -n kube-system <pod-name> -c azurelustre -- /app/readinessProbe.sh
```

Expected output when working correctly:

- `"Node pod detected - performing Lustre-specific readiness checks"`
- `"All Lustre readiness checks passed"`

```shell
# Check for enhanced validation messages
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
```

Look for CSI driver startup and readiness messages:

- `"vendor_version":"v0.4.0-readiness-http"` - Confirms feature branch deployment
- Standard CSI GRPC operation logs indicating successful driver initialization

```shell
# Check for detailed validation failure reasons
kubectl logs -n kube-system <pod-name> -c azurelustre | grep -E "(LNet validation failed|Failed to|not operational)"
```

Common issues and solutions:

- **"No valid NIDs"**: LNet networking not properly configured
- **"Self-ping test failed"**: Network connectivity issues
- **"Interfaces not operational"**: Network interfaces not in UP state
- **"Lustre module not loaded"**: Kernel module loading issues

**Test readiness probe directly:**

```sh
# Test the exec-based readiness probe script
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- /app/readinessProbe.sh
```

Expected responses:

- Exit code 0: Enhanced LNet validation passed
- Exit code 1: One or more validation checks failed (with descriptive error message)
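
To see the exit code the kubelet acts on, run the probe manually and echo `$?` immediately afterwards (`kubectl exec` propagates the script's exit status):

```sh
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- /app/readinessProbe.sh
echo "readinessProbe.sh exit code: $?"
```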

**Test HTTP health endpoints (optional manual testing):**

```sh
# Test enhanced readiness/liveness via HTTP endpoint
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s localhost:29763/healthz
```

HTTP responses:

- `/healthz`: `ok` (HTTP 200) or `not ready` (HTTP 503)
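
To capture the HTTP status code as well as the body, the same check can be run with curl's `-w` format option:

```sh
# Print only the HTTP status code returned by the health endpoint
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s -o /dev/null -w "%{http_code}\n" localhost:29763/healthz
```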

**Check enhanced validation logs:**

```sh
# Look for detailed LNet validation messages
kubectl logs -n kube-system <csi-azurelustre-node-pod> -c azurelustre | grep -E "(LNet validation|NIDs|self-ping|interfaces)"
```

Look for validation success messages:

- `"LNet validation passed: all checks successful"`
- `"Found NIDs: <network-identifiers>"`
- `"LNet self-ping to <nid> successful"`
- `"All LNet interfaces operational"`

**Common readiness failure patterns:**

1. **No valid NIDs found:**

   ```text
   LNet validation failed: no valid NIDs
   No valid non-loopback LNet NIDs found
   ```

   **Solution:** Check LNet configuration and network setup

2. **Self-ping test failed:**

   ```text
   LNet validation failed: self-ping test failed
   LNet self-ping to <nid> failed
   ```

   **Solution:** Verify network connectivity and LNet networking

3. **Interfaces not operational:**

   ```text
   LNet validation failed: interfaces not operational
   Found non-operational interface: status: down
   ```

   **Solution:** Check network interface status and configuration

4. **Module loading issues:**

   ```text
   Lustre module not loaded
   LNet kernel module is not loaded
   ```

   **Solution:** Check kernel module installation and loading

**Debug LNet configuration manually:**

```sh
# Check kernel modules
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lsmod | grep -E "(lnet|lustre)"

# Check LNet NIDs
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl list_nids

# Test LNet self-ping
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl ping <nid>

# Check interface status
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lnetctl net show --net tcp
```
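
When running the self-ping by hand, you can take the first non-loopback NID straight from `lctl list_nids` rather than typing it in (a small convenience sketch using the placeholder pod name from the commands above):

```sh
# Grab the first non-loopback NID and ping it
NID=$(kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl list_nids | grep -v "@lo" | head -1)
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl ping "$NID"
```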

**Check probe configuration:**

```sh
# Verify probe settings in deployment
kubectl describe -n kube-system pod <csi-azurelustre-node-pod> | grep -A 10 -E "(Liveness|Readiness|Startup)"
```

**Monitor readiness probe attempts:**

```sh
# Watch probe events in real-time
kubectl get events --field-selector involvedObject.name=<csi-azurelustre-node-pod> -n kube-system -w | grep -E "(Readiness|Liveness)"
```

---

## Volume Provisioning Issues

### Dynamic Provisioning (AMLFS Cluster Creation) - Public Preview

docs/install-csi-driver.md

Lines changed: 91 additions & 0 deletions

@@ -39,6 +39,54 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
csi-azurelustre-node-g6sfx 3/3 Running 0 30s
```

### Verifying CSI Driver Readiness for Lustre Operations

Before mounting Azure Lustre filesystems, it's important to verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes **enhanced LNet validation** that performs comprehensive readiness checks, verifying that the driver was able to:

- Load required kernel modules (lnet, lustre)
- Configure LNet networking with valid Network Identifiers (NIDs)
- Verify LNet self-ping functionality
- Validate all network interfaces are operational
- Complete all initialization steps
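
If you want to spot-check the LNet layer yourself, listing the configured NIDs from inside a node pod is a quick manual version of the probe's first check (placeholder pod name, as in the troubleshooting guide):

```shell
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl list_nids
```

Any non-loopback NID in the output (something other than `0@lo`) indicates that LNet networking has been configured.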

#### Readiness Validation

The CSI driver deployment includes automated probes for container health monitoring:

- **Liveness Probe**: `/healthz` (Port 29763) - HTTP endpoint for basic container health
- **Container Status**: Kubernetes readiness based on container startup and basic health checks

#### Verification Steps

1. **Check pod readiness status:**

   ```shell
   kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
   ```

   All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.

2. **Verify probe configuration:**

   ```shell
   kubectl describe -n kube-system pod -l app=csi-azurelustre-node
   ```

   Look for exec-based readiness and startup probe configuration in the pod description:

   - `Readiness: exec [/app/readinessProbe.sh]`
   - `Startup: exec [/app/readinessProbe.sh]`

   In the Events section, you may see initial startup probe failures during LNet initialization:

   - `Warning Unhealthy ... Startup probe failed: Node pod detected - performing Lustre-specific readiness checks`

   This is normal during the initialization phase. Once LNet is fully operational, the probes will succeed and no more failure events will appear.

3. **Monitor validation logs:**

   ```shell
   kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
   ```

   Look for CSI driver startup and successful GRPC operation logs indicating driver initialization is complete.

> **Note**: If you encounter readiness or initialization issues, see the [CSI Driver Troubleshooting Guide](csi-debug.md#enhanced-lnet-validation-troubleshooting) for detailed debugging steps.

**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.

## Default instructions for production release

### Install with kubectl (current production release)

@@ -73,3 +121,46 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cl
csi-azurelustre-node-drlq2 3/3 Running 0 30s
csi-azurelustre-node-g6sfx 3/3 Running 0 30s
```

### Verifying CSI Driver Readiness for Lustre Operations

Before mounting Azure Lustre filesystems, it is important to verify that the CSI driver nodes are fully initialized and ready for Lustre operations. The driver includes enhanced LNet validation that performs comprehensive readiness checks, verifying that the driver was able to:

- Load required kernel modules (lnet, lustre)
- Configure LNet networking with valid Network Identifiers (NIDs)
- Verify LNet self-ping functionality
- Validate all network interfaces are operational
- Complete all initialization steps

#### Enhanced Readiness Validation

The CSI driver deployment includes automated **exec-based readiness probes** for accurate readiness detection:

- **Readiness & Startup Probes**: `/app/readinessProbe.sh` - Exec-based validation with comprehensive LNet checking
- **Liveness Probe**: `/healthz` (Port 29763) - HTTP endpoint for basic container health
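
For reference, the probe definitions added to the node DaemonSet (`deploy/csi-azurelustre-node.yaml`) in this change look like the following, trimmed here to the probe stanzas themselves:

```yaml
readinessProbe:
  failureThreshold: 5
  exec:
    command:
      - /app/readinessProbe.sh
  initialDelaySeconds: 10
  timeoutSeconds: 10
  periodSeconds: 30
startupProbe:
  failureThreshold: 120
  exec:
    command:
      - /app/readinessProbe.sh
  initialDelaySeconds: 10
  timeoutSeconds: 5
  periodSeconds: 5
```

With `failureThreshold: 120` and `periodSeconds: 5`, the startup probe gives LNet roughly 120 × 5 s = 10 minutes to come up before the kubelet restarts the container; once started, readiness is re-checked every 30 s and the pod is marked unready after 5 consecutive failures.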

#### Verification Steps

1. **Check pod readiness status:**

   ```shell
   kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
   ```

   All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.

2. **Verify probe configuration:**

   ```shell
   kubectl describe -n kube-system pod -l app=csi-azurelustre-node
   ```

   Look for exec-based readiness and startup probe configuration and check that no recent probe failures appear in the Events section.

3. **Monitor validation logs:**

   ```shell
   kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
   ```

   Look for CSI driver startup and successful GRPC operation logs indicating driver initialization is complete.

> **Note**: If you encounter readiness or initialization issues, see the [CSI Driver Troubleshooting Guide](csi-debug.md#enhanced-lnet-validation-troubleshooting) for detailed debugging steps.

**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.

pkg/azurelustreplugin/Dockerfile

Lines changed: 2 additions & 1 deletion

@@ -16,8 +16,9 @@ FROM ubuntu:22.04

COPY "./_output/azurelustreplugin" "/app/azurelustreplugin"
COPY "./pkg/azurelustreplugin/entrypoint.sh" "/app/entrypoint.sh"
+COPY "./pkg/azurelustreplugin/readinessProbe.sh" "/app/readinessProbe.sh"

-RUN chmod +x "/app/entrypoint.sh"
+RUN chmod +x "/app/entrypoint.sh" && chmod +x "/app/readinessProbe.sh"

RUN apt-get update && \
    apt-get upgrade -y && \
pkg/azurelustreplugin/readinessProbe.sh

Lines changed: 70 additions & 0 deletions

@@ -0,0 +1,70 @@
#!/bin/bash

# readinessProbe.sh - Health check script for Azure Lustre CSI driver
# This script performs direct LNet readiness validation

set -euo pipefail

# Check if this is a controller pod (no Lustre client installation required)
INSTALL_LUSTRE_CLIENT=${AZURELUSTRE_CSI_INSTALL_LUSTRE_CLIENT:-"yes"}

if [[ "${INSTALL_LUSTRE_CLIENT}" == "no" ]]; then
  echo "Controller pod detected - reporting ready (skipping Lustre checks)"
  exit 0
fi

echo "Node pod detected - performing Lustre-specific readiness checks"

# Check if CSI socket exists and is accessible
SOCKET_PATH=${CSI_ENDPOINT#unix://}

if [[ ! -S "${SOCKET_PATH}" ]]; then
  echo "CSI socket not found: ${SOCKET_PATH}"
  exit 1
fi

# Check if LNet is properly configured and operational
# This replicates the logic from CheckLustreReadiness()

# Check if LNet NIDs are valid and available
if ! lnetctl net show >/dev/null 2>&1; then
  echo "LNet not available or not configured"
  exit 1
fi

# Check if we have any NIDs configured
NID_COUNT=$(lnetctl net show 2>/dev/null | grep -c "nid:" || true)  # '|| true' so a zero count is handled below instead of tripping 'set -e'
if [[ "${NID_COUNT}" -eq 0 ]]; then
  echo "No LNet NIDs configured"
  exit 1
fi

# Check LNet self-ping functionality
if ! lnetctl ping --help >/dev/null 2>&1; then
  echo "LNet ping functionality not available"
  exit 1
fi

# Get the first available NID for self-ping test (exclude loopback)
FIRST_NID=$(lnetctl net show 2>/dev/null | grep "nid:" | grep -v "@lo" | head -1 | sed 's/.*nid: \([^ ]*\).*/\1/' || echo "")
if [[ -z "${FIRST_NID}" ]]; then
  echo "Unable to determine LNet NID for self-ping test"
  exit 1
fi

# Perform self-ping test with timeout
if ! timeout 10 lnetctl ping "${FIRST_NID}" >/dev/null 2>&1; then
  echo "LNet self-ping test failed for NID: ${FIRST_NID}"
  exit 1
fi

# Check if LNet interfaces are operational
# Verify we have at least one interface in 'up' state
UP_INTERFACES=$(lnetctl net show 2>/dev/null | grep -c "status: up" || true)  # '|| true' so a zero count is handled below instead of tripping 'set -e'
if [[ "${UP_INTERFACES}" -eq 0 ]]; then
  echo "No LNet interfaces in 'up' state"
  exit 1
fi

echo "All Lustre readiness checks passed"
exit 0
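
The controller/node branch at the top of the script keys off the `AZURELUSTRE_CSI_INSTALL_LUSTRE_CLIENT` environment variable. A pod that should skip the Lustre-specific checks would set it on the container, roughly like this (illustrative snippet; the controller manifest may wire it differently):

```yaml
env:
  - name: AZURELUSTRE_CSI_INSTALL_LUSTRE_CLIENT
    value: "no"
```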
