1 change: 0 additions & 1 deletion deploy/csi-azurelustre-controller.yaml
@@ -14,7 +14,6 @@ spec:
labels:
app: csi-azurelustre-controller
spec:
hostNetwork: true
serviceAccountName: csi-azurelustre-controller-sa
nodeSelector:
kubernetes.io/os: linux # add "kubernetes.io/role: master" to run controller on master node
16 changes: 16 additions & 0 deletions deploy/csi-azurelustre-node.yaml
@@ -110,6 +110,22 @@ spec:
initialDelaySeconds: 60
timeoutSeconds: 10
periodSeconds: 30
readinessProbe:
failureThreshold: 5
exec:
command:
- /app/readinessProbe.sh
initialDelaySeconds: 10
timeoutSeconds: 10
periodSeconds: 30
startupProbe:
failureThreshold: 120
exec:
command:
- /app/readinessProbe.sh
initialDelaySeconds: 10
timeoutSeconds: 5
periodSeconds: 5
env:
- name: CSI_ENDPOINT
value: unix:///csi/csi.sock
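Note: with `failureThreshold: 120` and `periodSeconds: 5`, the startup probe allows up to 120 × 5 s = 600 s (10 minutes) for LNet initialization before the kubelet restarts the container; once startup succeeds, the readiness probe re-checks every 30 seconds.
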
151 changes: 151 additions & 0 deletions docs/csi-debug.md
@@ -2,6 +2,157 @@

---

## Driver Readiness and Health Issues

### Enhanced LNet Validation Troubleshooting

**Symptoms:**

- CSI driver node pods show `2/3` ready status
- Readiness probe failing repeatedly
- Pods remain in `Running` status but never become ready, or fail to complete startup
- Mount operations fail with "driver not ready" errors

#### Detailed Probe Verification Steps

If the exec-based readiness probe fails (exit code 1), use these detailed verification steps:

```shell
# Verify detailed probe configuration
kubectl describe -n kube-system pod -l app=csi-azurelustre-node
```
Look for exec-based probe configuration in the pod description:
- `Readiness: exec [/app/readinessProbe.sh]`
- `Startup: exec [/app/readinessProbe.sh]`

In the Events section, you may see initial startup probe failures during LNet initialization:
- `Warning Unhealthy ... Startup probe failed: Node pod detected - performing Lustre-specific readiness checks`

This is normal during the initialization phase. Once LNet is fully operational, the probes will succeed.
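
To list these probe events after the fact (a sketch; `reason=Unhealthy` is the standard event reason Kubernetes records for failed probes):

```shell
# List recent probe-failure events for the CSI node pods, oldest first
kubectl get events -n kube-system --field-selector reason=Unhealthy \
  --sort-by=.lastTimestamp | grep csi-azurelustre-node
```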

```shell
# Test the readiness probe script directly
kubectl exec -n kube-system <pod-name> -c azurelustre -- /app/readinessProbe.sh
```
Expected output when working correctly:
- `"Node pod detected - performing Lustre-specific readiness checks"`
- `"All Lustre readiness checks passed"`

```shell
# Check for enhanced validation messages
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
```
Look for CSI driver startup and readiness messages:
- `"vendor_version":"v0.4.0-readiness-http"` - Confirms feature branch deployment
- Standard CSI GRPC operation logs indicating successful driver initialization

```shell
# Check for detailed validation failure reasons
kubectl logs -n kube-system <pod-name> -c azurelustre | grep -E "(LNet validation failed|Failed to|not operational)"
```

Common issues and solutions:
- **"No valid NIDs"**: LNet networking not properly configured
- **"Self-ping test failed"**: Network connectivity issues
- **"Interfaces not operational"**: Network interfaces not in UP state
- **"Lustre module not loaded"**: Kernel module loading issues

**Test readiness probe directly:**

```sh
# Test the exec-based readiness probe script
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- /app/readinessProbe.sh
```

Expected responses:
- Exit code 0: Enhanced LNet validation passed
- Exit code 1: One or more validation checks failed (with descriptive error message)
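
To see the exit code itself, run the script and echo `$?` immediately afterwards (pod name is a placeholder):

```sh
# Run the probe script and print its exit code
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- /app/readinessProbe.sh
echo "probe exit code: $?"
```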

**Test HTTP health endpoints (optional manual testing):**

```sh
# Test enhanced readiness/liveness via HTTP endpoint
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- curl -s localhost:29763/healthz
```

HTTP responses:
- `/healthz`: `ok` (HTTP 200) or `not ready` (HTTP 503)
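
To confirm the HTTP status code as well as the response body, curl's `-w` option can print it (a minimal sketch using the port shown above):

```sh
# Print the response body followed by the HTTP status code
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- \
  curl -s -w '\nHTTP status: %{http_code}\n' localhost:29763/healthz
```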

**Check enhanced validation logs:**

```sh
# Look for detailed LNet validation messages
kubectl logs -n kube-system <csi-azurelustre-node-pod> -c azurelustre | grep -E "(LNet validation|NIDs|self-ping|interfaces)"
```

Look for validation success messages:
- `"LNet validation passed: all checks successful"`
- `"Found NIDs: <network-identifiers>"`
- `"LNet self-ping to <nid> successful"`
- `"All LNet interfaces operational"`

**Common readiness failure patterns:**

1. **No valid NIDs found:**
```text
LNet validation failed: no valid NIDs
No valid non-loopback LNet NIDs found
```
**Solution:** Check LNet configuration and network setup

2. **Self-ping test failed:**
```text
LNet validation failed: self-ping test failed
LNet self-ping to <nid> failed
```
**Solution:** Verify network connectivity and LNet networking

3. **Interfaces not operational:**
```text
LNet validation failed: interfaces not operational
Found non-operational interface: status: down
```
**Solution:** Check network interface status and configuration

4. **Module loading issues:**
```text
Lustre module not loaded
LNet kernel module is not loaded
```
**Solution:** Check kernel module installation and loading

**Debug LNet configuration manually:**

```sh
# Check kernel modules
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lsmod | grep -E "(lnet|lustre)"

# Check LNet NIDs
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl list_nids

# Test LNet self-ping
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lctl ping <nid>

# Check interface status
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lnetctl net show --net tcp
```
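
If these checks show that LNet is simply not configured, keep in mind that the driver normally loads the kernel modules and configures LNet itself during startup. For manual experimentation only, a typical bring-up looks roughly like the following sketch; the `tcp` network and `eth0` interface name are assumptions that must match your environment:

```sh
# Manually load LNet and add a tcp network on eth0 (assumed interface; for experimentation only)
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- modprobe lnet
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lnetctl lnet configure
kubectl exec -n kube-system <csi-azurelustre-node-pod> -c azurelustre -- lnetctl net add --net tcp --if eth0
```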

**Check probe configuration:**

```sh
# Verify probe settings in deployment
kubectl describe -n kube-system pod <csi-azurelustre-node-pod> | grep -A 10 -E "(Liveness|Readiness|Startup)"
```
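
Alternatively, the probe spec for the `azurelustre` container can be dumped directly with a JSONPath query (a sketch; the container name matches the `-c azurelustre` flag used above):

```sh
# Show the readiness probe definition for the azurelustre container
kubectl get pod <csi-azurelustre-node-pod> -n kube-system \
  -o jsonpath='{.spec.containers[?(@.name=="azurelustre")].readinessProbe}'
```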

**Monitor readiness probe attempts:**

```sh
# Watch probe events in real-time
kubectl get events --field-selector involvedObject.name=<csi-azurelustre-node-pod> -n kube-system -w | grep -E "(Readiness|Liveness|Startup)"
```

---

## Volume Provisioning Issues

### Dynamic Provisioning (AMLFS Cluster Creation) - Public Preview
91 changes: 91 additions & 0 deletions docs/install-csi-driver.md
@@ -39,6 +39,54 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cluster
csi-azurelustre-node-g6sfx 3/3 Running 0 30s
```

### Verifying CSI Driver Readiness for Lustre Operations

Before mounting Azure Lustre filesystems, it's important to verify that the CSI driver node pods are fully initialized and ready for Lustre operations. The driver includes **enhanced LNet validation** that performs comprehensive readiness checks, confirming that the driver has:

- Loaded the required kernel modules (lnet, lustre)
- Configured LNet networking with valid Network Identifiers (NIDs)
- Verified LNet self-ping functionality
- Validated that all network interfaces are operational
- Completed all initialization steps

#### Readiness Validation

The CSI driver deployment includes automated probes for container health monitoring:

- **Readiness & Startup Probes**: `/app/readinessProbe.sh` - Exec-based validation with comprehensive LNet checking
- **Liveness Probe**: `/healthz` (Port 29763) - HTTP endpoint for basic container health

#### Verification Steps

1. **Check pod readiness status:**
```shell
kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
```
All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.

2. **Verify probe configuration:**
```shell
kubectl describe -n kube-system pod -l app=csi-azurelustre-node
```
Look for exec-based readiness and startup probe configuration in the pod description:
- `Readiness: exec [/app/readinessProbe.sh]`
- `Startup: exec [/app/readinessProbe.sh]`

In the Events section, you may see initial startup probe failures during LNet initialization:
- `Warning Unhealthy ... Startup probe failed: Node pod detected - performing Lustre-specific readiness checks`

This is normal during the initialization phase. Once LNet is fully operational, the probes will succeed and no more failure events will appear.

3. **Monitor validation logs:**
```shell
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
```
Look for CSI driver startup and successful GRPC operation logs indicating driver initialization is complete.

> **Note**: If you encounter readiness or initialization issues, see the [CSI Driver Troubleshooting Guide](csi-debug.md#enhanced-lnet-validation-troubleshooting) for detailed debugging steps.

**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.
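
One way to block until every node pod reports `Ready` before proceeding (a minimal sketch; the 10-minute timeout is an example value sized to the driver's startup probe window):

```shell
# Wait for all CSI node pods to become Ready (timeout is an example value)
kubectl wait --for=condition=Ready pod -l app=csi-azurelustre-node -n kube-system --timeout=600s
```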

## Default instructions for production release

### Install with kubectl (current production release)
@@ -73,3 +121,46 @@ This document explains how to install Azure Lustre CSI driver on a kubernetes cluster
csi-azurelustre-node-drlq2 3/3 Running 0 30s
csi-azurelustre-node-g6sfx 3/3 Running 0 30s
```


### Verifying CSI Driver Readiness for Lustre Operations

Before mounting Azure Lustre filesystems, it is important to verify that the CSI driver node pods are fully initialized and ready for Lustre operations. The driver includes enhanced LNet validation that performs comprehensive readiness checks, confirming that the driver has:

- Loaded the required kernel modules (lnet, lustre)
- Configured LNet networking with valid Network Identifiers (NIDs)
- Verified LNet self-ping functionality
- Validated that all network interfaces are operational
- Completed all initialization steps

#### Enhanced Readiness Validation

The CSI driver deployment includes automated **exec-based readiness probes** for accurate readiness detection:

- **Readiness & Startup Probes**: `/app/readinessProbe.sh` - Exec-based validation with comprehensive LNet checking
- **Liveness Probe**: `/healthz` (Port 29763) - HTTP endpoint for basic container health

#### Verification Steps

1. **Check pod readiness status:**
```shell
kubectl get -n kube-system pod -l app=csi-azurelustre-node -o wide
```
All node pods should show `READY` status as `3/3` and `STATUS` as `Running`.

2. **Verify probe configuration:**
```shell
kubectl describe -n kube-system pod -l app=csi-azurelustre-node
```
Look for exec-based readiness and startup probe configuration and check that no recent probe failures appear in the Events section.

3. **Monitor validation logs:**
```shell
kubectl logs -n kube-system -l app=csi-azurelustre-node -c azurelustre --tail=20
```
Look for CSI driver startup and successful GRPC operation logs indicating driver initialization is complete.

> **Note**: If you encounter readiness or initialization issues, see the [CSI Driver Troubleshooting Guide](csi-debug.md#enhanced-lnet-validation-troubleshooting) for detailed debugging steps.

**Important**: The enhanced validation ensures the driver reports ready only when LNet is fully functional for Lustre operations. Wait for all CSI driver node pods to pass enhanced readiness checks before creating PersistentVolumes or mounting Lustre filesystems.
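
To spot-check readiness on every node pod at once, the probe script can be run in a loop (a sketch; `pod/<name>` references are accepted directly by `kubectl exec`):

```shell
# Run the readiness script on each CSI node pod and report failures
for pod in $(kubectl get pods -n kube-system -l app=csi-azurelustre-node -o name); do
  echo "== ${pod}"
  kubectl exec -n kube-system "${pod}" -c azurelustre -- /app/readinessProbe.sh || echo "NOT READY: ${pod}"
done
```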

3 changes: 2 additions & 1 deletion pkg/azurelustreplugin/Dockerfile
@@ -16,8 +16,9 @@ FROM ubuntu:22.04

COPY "./_output/azurelustreplugin" "/app/azurelustreplugin"
COPY "./pkg/azurelustreplugin/entrypoint.sh" "/app/entrypoint.sh"
COPY "./pkg/azurelustreplugin/readinessProbe.sh" "/app/readinessProbe.sh"

RUN chmod +x "/app/entrypoint.sh"
RUN chmod +x "/app/entrypoint.sh" && chmod +x "/app/readinessProbe.sh"

RUN apt-get update && \
apt-get upgrade -y && \
70 changes: 70 additions & 0 deletions pkg/azurelustreplugin/readinessProbe.sh
@@ -0,0 +1,70 @@
#!/bin/bash

# readinessProbe.sh - Health check script for Azure Lustre CSI driver
# This script performs direct LNet readiness validation

set -euo pipefail

# Check if this is a controller pod (no Lustre client installation required)
INSTALL_LUSTRE_CLIENT=${AZURELUSTRE_CSI_INSTALL_LUSTRE_CLIENT:-"yes"}

if [[ "${INSTALL_LUSTRE_CLIENT}" == "no" ]]; then
echo "Controller pod detected - reporting ready (skipping Lustre checks)"
exit 0
fi

echo "Node pod detected - performing Lustre-specific readiness checks"

# Check if CSI socket exists and is accessible
# CSI_ENDPOINT is provided by the deployment (e.g. unix:///csi/csi.sock); fail with a clear message if it is missing
if [[ -z "${CSI_ENDPOINT:-}" ]]; then
echo "CSI_ENDPOINT is not set"
exit 1
fi
SOCKET_PATH=${CSI_ENDPOINT#unix://}

if [[ ! -S "${SOCKET_PATH}" ]]; then
echo "CSI socket not found: ${SOCKET_PATH}"
exit 1
fi

# Check if LNet is properly configured and operational
# This replicates the logic from CheckLustreReadiness()

# Check if LNet NIDs are valid and available
if ! lnetctl net show >/dev/null 2>&1; then
echo "LNet not available or not configured"
exit 1
fi

# Check if we have any NIDs configured
# "|| true" keeps "set -euo pipefail" from aborting the script before the message below when grep finds no match
NID_COUNT=$(lnetctl net show 2>/dev/null | grep -c "nid:" || true)
if [[ "${NID_COUNT}" -eq 0 ]]; then
echo "No LNet NIDs configured"
exit 1
fi

# Check LNet self-ping functionality
if ! lnetctl ping --help >/dev/null 2>&1; then
echo "LNet ping functionality not available"
exit 1
fi

# Get the first available NID for self-ping test (exclude loopback)
FIRST_NID=$(lnetctl net show 2>/dev/null | grep "nid:" | grep -v "@lo" | head -1 | sed 's/.*nid: \([^ ]*\).*/\1/' || echo "")
if [[ -z "${FIRST_NID}" ]]; then
echo "Unable to determine LNet NID for self-ping test"
exit 1
fi

# Perform self-ping test with timeout
if ! timeout 10 lnetctl ping "${FIRST_NID}" >/dev/null 2>&1; then
echo "LNet self-ping test failed for NID: ${FIRST_NID}"
exit 1
fi

# Check if LNet interfaces are operational
# Verify we have at least one interface in 'up' state
# "|| true" keeps "set -euo pipefail" from aborting the script before the message below when grep finds no match
UP_INTERFACES=$(lnetctl net show 2>/dev/null | grep -c "status: up" || true)
if [[ "${UP_INTERFACES}" -eq 0 ]]; then
echo "No LNet interfaces in 'up' state"
exit 1
fi

echo "All Lustre readiness checks passed"
exit 0