Enhanced CSI driver readiness validation with comprehensive LNet health checks #231
Welcome @jeffbearer!
@dabradley, regarding your question about the registrar:
No Negative Impact Expected
Different Communication Paths:
- Registrar → Driver socket (basic gRPC connectivity)
- Registrar ensures driver registration with kubelet
- The driver socket remains available
- HTTP endpoints for direct health checks
This also needs to target |
Pull Request Test Coverage Report for Build 17654187098
Details
💛 - Coveralls
…th checks
- Add CSI-compliant external liveness probe sidecars to both controller and node deployments
- Implement comprehensive LNet validation including NIDs, self-ping, and interface checks
- Separate health endpoints: /healthz (readiness) and /livez (liveness) on dedicated ports
- Controller uses port 29762, Node uses port 29763 for consistent internal communication
- Enhanced validation functions: hasValidLNetNIDs(), lnetSelfPingWorks(), lnetInterfacesOperational()
- Early health server startup for immediate status availability
- Maintain CSI community standards while providing Lustre-specific health validation
Hybrid approach provides both:
- Standard CSI external liveness probe monitoring gRPC endpoints
- Enhanced HTTP health endpoints with comprehensive Lustre readiness validation
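The validation functions named in the commit message above could plausibly be composed as an ordered list of checks that stops at the first failure. The sketch below is only an illustration of that pattern: the `lnetCheck` type and `firstFailure` helper are hypothetical, and the stubbed results stand in for whatever the real `hasValidLNetNIDs()`, `lnetSelfPingWorks()`, and `lnetInterfacesOperational()` do in the driver.

```go
package main

import "fmt"

// lnetCheck is a hypothetical wrapper pairing a check name with its function.
// It is not part of the driver; it only illustrates ordered validation.
type lnetCheck struct {
	name string
	run  func() bool
}

// firstFailure runs the checks in order and returns the name of the first
// check that fails, or an empty string when every check passes.
func firstFailure(checks []lnetCheck) string {
	for _, c := range checks {
		if !c.run() {
			return c.name
		}
	}
	return ""
}

func main() {
	// Stubbed results; the real functions would inspect LNet state.
	checks := []lnetCheck{
		{"hasValidLNetNIDs", func() bool { return true }},
		{"lnetSelfPingWorks", func() bool { return false }},
		{"lnetInterfacesOperational", func() bool { return true }},
	}
	if failed := firstFailure(checks); failed != "" {
		fmt.Println("LNet not functional, failed check:", failed)
		return
	}
	fmt.Println("LNet functional")
}
```

Reporting the name of the failing check, rather than a bare boolean, makes it easier to log why readiness was refused.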
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: dabradley, jeffbearer
The full list of commands accepted by this bot can be found here.
The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
Implementation Summary for Azure DevOps Work Item 29392102
Work Item: "Allow CSI driver to report readiness for end user"
✅ Completed Implementation
This implementation lets users determine when the Azure Lustre CSI driver is fully initialized and ready for filesystem operations, addressing the core requirement that users should know when to start Lustre-related tasks. It uses a hybrid approach that combines CSI standards compliance with enhanced Lustre-specific validation.
1. Enhanced Readiness Validation (pkg/azurelustre/identityserver.go)
Core Implementation:
New Functions Added:
- CheckLustreReadiness(): Standalone function for readiness validation that can be called from health server
- hasValidLNetNIDs(): Validates LNet has valid non-loopback Network Identifiers
- lnetSelfPingWorks(): Tests actual LNet communication with self-ping validation
- lnetInterfacesOperational(): Verifies network interfaces are in UP state and operational
- isLNetFunctional(): Master function orchestrating all enhanced LNet validation tests
Validation Logic Flow:
- AZURELUSTRE_CSI_INSTALL_LUSTRE_CLIENT environment variable
- lnet kernel module loaded
- lustre kernel module loaded
2. HTTP Health Server (pkg/azurelustreplugin/main.go)
Early Health Server Startup:
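As a rough illustration of early startup, the health server can be launched in a goroutine before driver initialization begins, so probes receive an answer (503 until ready) instead of a connection error. This is a minimal sketch, not the code in `main.go`: the `ready` flag, `readinessResponse`, `newHealthMux`, and `initDriver` are hypothetical names, and the listen address reuses the node port from the commit message purely as an assumption.

```go
package main

import (
	"fmt"
	"net/http"
	"sync/atomic"
)

// ready is a hypothetical flag flipped once initialization completes.
// The real driver runs its Lustre validation logic for readiness instead.
var ready atomic.Bool

// readinessResponse maps a readiness state to the /healthz status and body.
func readinessResponse(isReady bool) (int, string) {
	if isReady {
		return http.StatusOK, "ok"
	}
	return http.StatusServiceUnavailable, "not ready"
}

// newHealthMux registers the readiness and liveness endpoints.
func newHealthMux() *http.ServeMux {
	mux := http.NewServeMux()
	mux.HandleFunc("/healthz", func(w http.ResponseWriter, _ *http.Request) {
		code, body := readinessResponse(ready.Load())
		w.WriteHeader(code)
		fmt.Fprint(w, body)
	})
	mux.HandleFunc("/livez", func(w http.ResponseWriter, _ *http.Request) {
		w.WriteHeader(http.StatusOK)
		fmt.Fprint(w, "alive")
	})
	return mux
}

// initDriver is a stand-in for the driver's real initialization work.
func initDriver() {}

func main() {
	// Start serving health endpoints before any initialization runs.
	// The port is an assumption taken from the commit message.
	go func() { _ = http.ListenAndServe("127.0.0.1:29763", newHealthMux()) }()

	initDriver()
	ready.Store(true)

	// A real driver would now block serving gRPC; this sketch just reports
	// the readiness response and exits.
	code, body := readinessResponse(ready.Load())
	fmt.Println(code, body)
}
```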
Health Endpoints:
- /healthz: Readiness endpoint using full Lustre validation logic
  - 200 OK with "ok" when driver ready
  - 503 Service Unavailable with "not ready" when not ready
- /livez: Simple liveness endpoint
  - 200 OK with "alive"
Implementation Benefits:
3. CSI Standards Compliance with External Liveness Probes
Added Standard CSI Liveness Probe Sidecars:
- mcr.microsoft.com/oss/kubernetes-csi/livenessprobe:v2.14.0 on port 29764
- mcr.microsoft.com/oss/kubernetes-csi/livenessprobe:v2.14.0 on port 29764
Hybrid Architecture Benefits:
4. Kubernetes Deployment Configuration
Controller Deployment (deploy/csi-azurelustre-controller.yaml):
Node DaemonSet (deploy/csi-azurelustre-node.yaml):
Kubernetes Health Probe Configuration:
- /healthz endpoint for user readiness
- /livez endpoint for container health
Custom Container Image - Testing Only:
- jebearercsi29392102v2.azurecr.io/azurelustre-csi:v0.4.0-readiness-http
5. Production Deployment and Testing
✅ Successfully Deployed to AKS:
✅ Validated Functionality:
🔧 Technical Architecture
Hybrid Health Monitoring System:
CSI Standard Compliance:
Enhanced Lustre Validation:
No Conflicts Architecture:
Port Allocation Strategy:
- (/healthz, /livez)
- (/healthz, /livez)
🧪 Testing Results
AKS Cluster Validation:
Deployment Status:
Health Endpoint Testing:
Log Validation:
🎯 Implementation Benefits
For End Users:
For Operations Teams:
For Platform Integration:
📋 Usage Instructions
Checking Driver Readiness:
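One way a caller could check readiness programmatically is to poll the `/healthz` endpoint until it returns 200 OK. The sketch below assumes the endpoint is reachable at a local address on the node port mentioned in the commit message; the URL, the `waitForReady` and `isReady` helpers, and the retry parameters are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"time"
)

// isReady interprets a /healthz status code: only 200 OK means ready.
func isReady(statusCode int) bool {
	return statusCode == http.StatusOK
}

// waitForReady polls the given URL up to attempts times, sleeping interval
// after each unsuccessful attempt, and returns nil once the driver is ready.
func waitForReady(url string, attempts int, interval time.Duration) error {
	for i := 0; i < attempts; i++ {
		resp, err := http.Get(url)
		if err == nil {
			resp.Body.Close()
			if isReady(resp.StatusCode) {
				return nil
			}
		}
		time.Sleep(interval)
	}
	return fmt.Errorf("driver not ready after %d attempts", attempts)
}

func main() {
	// Assumed address; adjust to wherever the health port is exposed.
	url := "http://127.0.0.1:29763/healthz"
	if err := waitForReady(url, 2, 200*time.Millisecond); err != nil {
		fmt.Println(err)
		return
	}
	fmt.Println("driver ready")
}
```

A 503 response and a connection error are treated the same way here: both mean "keep waiting".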
Understanding Readiness States:
✅ Acceptance Criteria Fulfilled
🚀 Production Readiness
The implementation is production-ready and has been successfully deployed in an AKS environment:
Key Success Metrics:
Deployment Image:
jebearercsi29392102v2.azurecr.io/azurelustre-csi:v0.4.0-readiness-http
📁 File Inventory
Core Implementation Files:
- pkg/azurelustre/identityserver.go: Enhanced CSI Probe() with comprehensive LNet validation
- pkg/azurelustreplugin/main.go: HTTP health server with early startup and dual endpoints
- deploy/csi-azurelustre-controller.yaml: Controller deployment with CSI liveness probe sidecar
- deploy/csi-azurelustre-node.yaml: Node DaemonSet with CSI liveness probe sidecar and enhanced validation
Container Images:
- jebearercsi29392102v2.azurecr.io/azurelustre-csi:v0.4.0-readiness-http
- mcr.microsoft.com/oss/kubernetes-csi/livenessprobe:v2.14.0
- mcr.microsoft.com/oss/kubernetes-csi/csi-provisioner:v5.1.0, csi-node-driver-registrar:v2.12.0
Implementation Highlights: