# Nomad task driver for the Cloud Hypervisor VMM

nomad-driver-ch is a task driver for HashiCorp Nomad that orchestrates Cloud Hypervisor virtual machines. It provides a modern, lightweight alternative to traditional hypervisor solutions while remaining fully compatible with Nomad's scheduling and resource management capabilities.

## ✨ Features

- 🏃 Lightweight Virtualization: Leverages Cloud Hypervisor for minimal-overhead VM orchestration
- 🔧 Dynamic Resource Management: CPU, memory, and disk allocation with Nomad's resource constraints
- 🌐 Advanced Networking: Bridge networking with static IP support and dynamic configuration
- ☁️ Cloud-Init Integration: Automatic VM provisioning with user data, SSH keys, and custom scripts
- 💾 Flexible Storage: Virtio-fs shared filesystems and disk image management with thin provisioning
- 🎮 VFIO Device Passthrough: GPU, NIC, and PCI device passthrough with allowlist-based security
- 🔒 Security Isolation: Secure VM boundaries with configurable seccomp filtering
- 📊 Resource Monitoring: Real-time VM statistics and health monitoring
- 🔄 Lifecycle Management: Complete VM lifecycle with start, stop, restart, and recovery capabilities
## 📑 Table of Contents

- Quick Start
- Installation
- Configuration
- Task Examples
- Networking
- Cloud-Init
- Storage
- Device Passthrough
- Monitoring
- Troubleshooting
- API Reference
- Development
- Contributing
## 🚀 Quick Start

⚠️ Cloud Hypervisor has no bootloader: you must specify kernel and initramfs parameters in your task configuration, even when using full disk images. The driver will fail if either is missing.

### Prerequisites
- Nomad v1.4.0 or later
- Cloud Hypervisor v48.0.0 or later
- Linux kernel with KVM support
- Bridge networking configured on host
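Before installing anything, it is worth confirming that the host actually exposes KVM. This is a simple preflight check, not part of the driver itself:

```bash
# Quick preflight: the driver cannot start VMs without /dev/kvm.
# Prints one diagnostic line either way.
if [ -e /dev/kvm ]; then
  echo "KVM device present"
else
  echo "KVM device missing - check BIOS virtualization settings and kvm modules"
fi
```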
```hcl
job "web-server" {
datacenters = ["dc1"]
type = "service"
group "web" {
task "nginx" {
driver = "ch"
config {
image = "/var/lib/images/alpine-nginx.img"
network_interface {
bridge {
name = "br0"
static_ip = "192.168.1.100"
}
}
}
resources {
cpu = 1000
memory = 512
}
}
}
}
```

## 📦 Installation
### 1. Install Dependencies
**Cloud Hypervisor:**
```bash
# Download and install Cloud Hypervisor v48+
wget https://github.com/cloud-hypervisor/cloud-hypervisor/releases/download/v48.0/cloud-hypervisor-static
sudo mv cloud-hypervisor-static /usr/local/bin/cloud-hypervisor
sudo chmod +x /usr/local/bin/cloud-hypervisor
# Install ch-remote for VM management
wget https://github.com/cloud-hypervisor/cloud-hypervisor/releases/download/v48.0/ch-remote-static
sudo mv ch-remote-static /usr/local/bin/ch-remote
sudo chmod +x /usr/local/bin/ch-remote
# Optional: ensure binaries are discoverable
export PATH="/usr/local/bin:$PATH"
```
⚠️ glibc requirements: The dynamically-linked release of Cloud Hypervisor requires glibc ≥ 2.34. If you are running on older distributions (Debian 11, Ubuntu 20.04, etc.) use the static binaries shown above or run inside a container/VM that ships a newer glibc.
**VirtioFS daemon:**

```bash
# Install virtiofsd for filesystem sharing
sudo apt-get install virtiofsd   # Ubuntu/Debian
# or
sudo yum install virtiofsd       # RHEL/CentOS
```

### 2. Configure Bridge Networking

```bash
# Create bridge interface
sudo ip link add br0 type bridge
sudo ip addr add 192.168.1.1/24 dev br0
sudo ip link set br0 up
# Configure bridge persistence (systemd-networkd)
cat > /etc/systemd/network/br0.netdev << EOF
[NetDev]
Name=br0
Kind=bridge
EOF
cat > /etc/systemd/network/br0.network << EOF
[Match]
Name=br0
[Network]
IPForward=yes
Address=192.168.1.1/24
EOF
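```bash
# Optional (an assumption about typical setups, not driver-managed behavior):
# the bridge above is host-only, so if VMs on 192.168.1.0/24 need outbound
# internet access, enable forwarding and NAT via the host uplink.
sudo sysctl -w net.ipv4.ip_forward=1
sudo iptables -t nat -A POSTROUTING -s 192.168.1.0/24 ! -o br0 -j MASQUERADE
```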
sudo systemctl restart systemd-networkd
```

### 3. Install the Driver Plugin

**Option A: Download Release**

```bash
# Download latest release
wget https://github.com/ccheshirecat/nomad-driver-ch/releases/latest/download/nomad-driver-ch
sudo mv nomad-driver-ch /opt/nomad/plugins/
sudo chmod +x /opt/nomad/plugins/nomad-driver-ch
```

**Option B: Build from Source**

```bash
git clone https://github.com/ccheshirecat/nomad-driver-ch.git
cd nomad-driver-ch
go build -o nomad-driver-ch .
sudo mv nomad-driver-ch /opt/nomad/plugins/
```

### 4. Configure the Nomad Client

**Client Configuration:**

```hcl
# /etc/nomad.d/client.hcl
client {
enabled = true
plugin "nomad-driver-ch" {
config {
# Cloud Hypervisor configuration
cloud_hypervisor {
bin = "/usr/bin/cloud-hypervisor"
remote_bin = "/usr/bin/ch-remote"
virtiofsd_bin = "/usr/libexec/virtiofsd"
default_kernel = "/boot/vmlinuz"
default_initramfs = "/boot/initramfs.img"
}
disable_alloc_mounts = false
# Network configuration
network {
bridge = "br0"
subnet_cidr = "192.168.1.0/24"
gateway = "192.168.1.1"
ip_pool_start = "192.168.1.100"
ip_pool_end = "192.168.1.200"
}
# Allowed image paths for security
image_paths = ["/var/lib/images", "/opt/vm-images"]
}
}
}
```

Restart Nomad and verify the driver is loaded:

```bash
sudo systemctl restart nomad
nomad node status -self | grep ch
```

For detailed installation instructions, see docs/INSTALLATION.md.
## ⚙️ Configuration

The driver configuration is specified in the Nomad client configuration file:

```hcl
plugin "nomad-driver-ch" {
config {
# Cloud Hypervisor binaries
cloud_hypervisor {
bin = "/usr/bin/cloud-hypervisor" # Cloud Hypervisor binary path
remote_bin = "/usr/bin/ch-remote" # ch-remote binary path
virtiofsd_bin = "/usr/libexec/virtiofsd" # virtiofsd binary path
default_kernel = "/boot/vmlinuz" # Default kernel for VMs
default_initramfs = "/boot/initramfs.img" # Default initramfs for VMs
firmware = "/usr/share/qemu/OVMF.fd" # UEFI firmware (optional)
seccomp = "true" # Enable seccomp filtering
log_file = "/var/log/cloud-hypervisor.log" # VM log file path
}
# Network configuration
network {
bridge = "br0" # Bridge interface name
subnet_cidr = "192.168.1.0/24" # Subnet for VMs
gateway = "192.168.1.1" # Gateway IP address
ip_pool_start = "192.168.1.100" # IP pool start range
ip_pool_end = "192.168.1.200" # IP pool end range
tap_prefix = "tap" # TAP interface prefix
}
# VFIO device passthrough (not yet implemented)
# vfio {
# allowlist = ["10de:*", "8086:0d26"] # PCI device allowlist
# iommu_address_width = 48 # IOMMU address width
# pci_segments = 1 # Number of PCI segments
# }
# Security and paths
data_dir = "/opt/nomad/data" # Nomad data directory
image_paths = [ # Allowed image paths
"/var/lib/images",
"/opt/vm-images",
"/mnt/shared-storage"
]
}
}
```

| Parameter | Type | Default | Description |
|---|---|---|---|
| `cloud_hypervisor.bin` | string | `/usr/bin/cloud-hypervisor` | Path to Cloud Hypervisor binary |
| `cloud_hypervisor.remote_bin` | string | `/usr/bin/ch-remote` | Path to ch-remote binary |
| `cloud_hypervisor.virtiofsd_bin` | string | `/usr/libexec/virtiofsd` | Path to virtiofsd binary |
| `cloud_hypervisor.default_kernel` | string | - | Default kernel path for VMs |
| `cloud_hypervisor.default_initramfs` | string | - | Default initramfs path for VMs |
| `cloud_hypervisor.firmware` | string | - | UEFI firmware path (optional) |
| `cloud_hypervisor.seccomp` | string | `"true"` | Enable seccomp filtering |
| `cloud_hypervisor.log_file` | string | - | VM log file path |
| `network.bridge` | string | `"br0"` | Bridge interface name |
| `network.subnet_cidr` | string | `"192.168.1.0/24"` | VM subnet CIDR |
| `network.gateway` | string | `"192.168.1.1"` | Network gateway |
| `network.ip_pool_start` | string | `"192.168.1.100"` | IP allocation pool start |
| `network.ip_pool_end` | string | `"192.168.1.200"` | IP allocation pool end |
| `network.tap_prefix` | string | `"tap"` | TAP interface name prefix |
| `vfio.allowlist` | []string | - | PCI device allowlist (not yet implemented) |
| `vfio.iommu_address_width` | number | - | IOMMU address width (not yet implemented) |
| `vfio.pci_segments` | number | - | Number of PCI segments (not yet implemented) |
| `data_dir` | string | - | Nomad data directory |
| `image_paths` | []string | - | Allowed VM image paths |
For complete configuration details, see docs/CONFIGURATION.md.
## 📋 Task Examples

### Basic VM

```hcl
job "basic-vm" {
datacenters = ["dc1"]
group "app" {
task "vm" {
driver = "ch"
config {
image = "/var/lib/images/ubuntu-22.04.img"
hostname = "app-server"
# REQUIRED: kernel and initramfs (Cloud Hypervisor has no bootloader)
kernel = "/boot/vmlinuz-5.15.0"
initramfs = "/boot/initramfs-5.15.0.img"
cmdline = "console=ttyS0 root=/dev/vda1"
}
resources {
cpu = 2000 # 2 CPU cores
memory = 2048 # 2GB RAM
}
# Optional: allow sandbox/CI environments without binaries
# skip_binary_validation = true
}
}
}
```

🧪 Running in CI or locally without KVM: set `skip_binary_validation = true` in the plugin config (or use the SDK helper when embedding the driver) so tests can run without Cloud Hypervisor binaries present. Production deployments should keep validation enabled to surface misconfiguration early.
### VM with Cloud-Init

```hcl
job "custom-vm" {
datacenters = ["dc1"]
group "web" {
task "nginx" {
driver = "ch"
config {
image = "/var/lib/images/alpine.img"
hostname = "nginx-server"
# Cloud-init user data
user_data = "/etc/cloud-init/nginx-setup.yml"
# Default user configuration
default_user_password = "secure123"
default_user_authorized_ssh_key = "ssh-rsa AAAAB3NzaC1yc2E..."
# Custom commands to run
cmds = [
"apk add --no-cache nginx",
"rc-service nginx start",
"rc-update add nginx default"
]
}
resources {
cpu = 1000
memory = 512
}
}
}
}
```

### Database VM with Static IP

```hcl
job "database" {
datacenters = ["dc1"]
group "db" {
task "postgres" {
driver = "ch"
config {
image = "/var/lib/images/postgres-14.img"
hostname = "postgres-primary"
# Enable thin copy for faster startup
use_thin_copy = true
# Network configuration with static IP
network_interface {
bridge {
name = "br0"
static_ip = "192.168.1.50"
gateway = "192.168.1.1"
netmask = "24"
dns = ["8.8.8.8", "1.1.1.1"]
}
}
# Custom timezone
timezone = "America/New_York"
}
resources {
cpu = 4000 # 4 CPU cores
memory = 8192 # 8GB RAM
}
# Mount shared storage
volume_mount {
volume = "postgres-data"
destination = "/var/lib/postgresql"
}
}
}
volume "postgres-data" {
type = "host"
source = "postgres-data"
read_only = false
}
}
```

### GPU Workload

```hcl
job "ml-workload" {
datacenters = ["dc1"]
group "gpu" {
task "training" {
driver = "ch"
config {
image = "/var/lib/images/cuda-ubuntu.img"
hostname = "ml-trainer"
# VFIO GPU passthrough (not yet implemented)
# vfio_devices = ["10de:2204"] # NVIDIA RTX 3080
}
resources {
cpu = 8000 # 8 CPU cores
memory = 16384 # 16GB RAM
device "nvidia/gpu" {
count = 1
}
}
}
}
}
```

For more examples, see docs/EXAMPLES.md.
## 🌐 Networking

The driver supports bridge networking with automatic IP allocation or static IP assignment:

```hcl
config {
network_interface {
bridge {
name = "br0"
# IP will be allocated from pool automatically
}
}
}
```

Static IP assignment:

```hcl
config {
network_interface {
bridge {
name = "br0"
static_ip = "192.168.1.100"
gateway = "192.168.1.1"
netmask = "24"
dns = ["8.8.8.8", "1.1.1.1"]
}
}
}
```

The driver uses a hierarchical configuration approach:
1. **Task-Level Configuration** (highest priority): `static_ip`, `gateway`, `netmask`, `dns` from the task config
2. **Driver-Level Configuration** (medium priority): IP pool allocation, default gateway, subnet settings
3. **DHCP Fallback** (lowest priority): used when no static configuration is provided
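The precedence rule amounts to "first non-empty value wins". A minimal illustrative sketch (`resolve_setting` is a hypothetical helper, not part of the driver):

```bash
# First non-empty value wins: task-level, then driver-level, then DHCP fallback.
resolve_setting() {
  for v in "$@"; do
    if [ -n "$v" ]; then echo "$v"; return; fi
  done
  echo "dhcp"
}

resolve_setting "192.168.1.50" "192.168.1.1"   # prints 192.168.1.50 (task-level wins)
resolve_setting "" "192.168.1.1"               # prints 192.168.1.1 (driver default wins)
resolve_setting "" ""                          # prints dhcp (fallback)
```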
### Port Mapping

Map VM service ports to host ports:

```hcl
config {
network_interface {
bridge {
name = "br0"
ports = ["web", "api"] # Reference port labels from network block
}
}
}
network {
port "web" {
static = 80
}
port "api" {
static = 8080
}
}
```

## ☁️ Cloud-Init

The driver integrates with cloud-init for automated VM provisioning and configuration.

### User Data File

```hcl
config {
user_data = "/etc/cloud-init/web-server.yml"
}
```

Example user data file (`/etc/cloud-init/web-server.yml`):

```yaml
#cloud-config
packages:
  - nginx
  - curl
  - htop

runcmd:
  - systemctl enable nginx
  - systemctl start nginx
  - ufw allow 80
  - ufw --force enable

write_files:
  - path: /var/www/html/index.html
    content: |
      <!DOCTYPE html>
      <html>
      <head><title>Hello from Nomad VM</title></head>
      <body><h1>VM deployed via Nomad Cloud Hypervisor driver!</h1></body>
      </html>
    permissions: '0644'
```

### Inline User Data

```hcl
config {
user_data = <<EOF
#cloud-config
package_update: true
packages:
- docker.io
runcmd:
- systemctl enable docker
- systemctl start docker
- docker run -d -p 80:80 nginx:alpine
EOF
}
```

### Default User Credentials

```hcl
config {
default_user_password = "secure-password"
default_user_authorized_ssh_key = "ssh-rsa AAAAB3NzaC1yc2EAAAADAQAB..."
}
```

### Boot Commands

```hcl
config {
# Commands run during boot process
cmds = [
"apt-get update",
"apt-get install -y docker.io",
"systemctl enable docker"
]
}
```

### Network Configuration

Cloud-init automatically generates network configuration based on:
- Static IP settings from task configuration
- Driver network configuration
- DHCP fallback for dynamic assignment
For static IP assignment, configure the IP in your task:

```hcl
network_interface {
bridge {
name = "br0"
static_ip = "192.168.1.100"
ports = ["http", "https"]
}
}
```

For DHCP assignment, omit the `static_ip` field:

```hcl
network_interface {
bridge {
name = "br0"
ports = ["http", "https"] # Port forwarding works with DHCP!
}
}
```

**DHCP Support:** The driver automatically discovers DHCP-assigned IP addresses by parsing dnsmasq lease files. This enables automatic port forwarding for DHCP-based VMs. The driver generates deterministic MAC addresses from task IDs to ensure consistent IP assignment.
**Requirements for DHCP:**

- dnsmasq DHCP server running on the host
- Lease file accessible at `/var/lib/misc/dnsmasq.leases`
- VM must receive a DHCP lease within the normal timeframe
**How it works:**
- Driver generates deterministic MAC address from task ID
- VM boots and gets DHCP lease with that MAC
- Driver parses dnsmasq lease file to find IP for that MAC
- Port forwarding rules are set up automatically using the discovered IP
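The driver's exact MAC-derivation scheme is internal; the sketch below only illustrates the general idea of a deterministic, locally-administered MAC plus a lease-file lookup (the dnsmasq lease format is `expiry MAC IP hostname client-id`):

```bash
# Illustrative sketch, not the driver's actual code.
task_id="example-task-1234"

# Hash the task ID and keep 10 hex chars; the 0x02 first octet marks the MAC
# as locally-administered unicast, so it cannot clash with real vendor OUIs.
suffix=$(printf '%s' "$task_id" | sha256sum | cut -c1-10)
mac=$(printf '02%s' "$suffix" | sed 's/../&:/g; s/:$//')
echo "$mac"   # same task ID always yields the same MAC

# Look up the IP dnsmasq leased to that MAC, if a lease file is present.
lease_file=/var/lib/misc/dnsmasq.leases
if [ -r "$lease_file" ]; then
  awk -v m="$mac" 'tolower($2) == tolower(m) { print $3 }' "$lease_file"
fi
```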
## 💾 Storage

### Supported Image Formats

- Raw (`.img`)
- QCOW2 (`.qcow2`)
- VHD (`.vhd`)
- VMDK (`.vmdk`)
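Most of these formats are distinguishable by magic bytes at the start of the file. A hypothetical `detect_format` helper (not the driver's code, which may instead delegate to a tool like `qemu-img info`) shows the idea:

```bash
# Hypothetical helper: classify a disk image by its leading magic bytes.
detect_format() {
  case "$(head -c3 "$1")" in
    QFI) echo qcow2 ;;   # QCOW2 files begin with "QFI\xfb"
    KDM) echo vmdk  ;;   # VMDK sparse extents begin with "KDMV"
    *)   echo raw   ;;   # raw images have no magic and are used as-is
  esac
}

printf 'QFI\373' > /tmp/probe.qcow2
detect_format /tmp/probe.qcow2   # → qcow2
```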
### Thin Provisioning

Enable thin copy for faster VM startup:

```hcl
config {
  image         = "/var/lib/images/base-ubuntu.img"
  use_thin_copy = true
}
```
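Conceptually, a thin copy is a copy-on-write clone of the base image rather than a full duplicate. The sketch below shows two common ways this is implemented; whether the driver uses qcow2 overlays, reflinks, or something else is not specified here:

```bash
# Illustrative only — not the driver's actual mechanism.

# 1) qcow2 overlay backed by the read-only base (requires qemu-img):
#    qemu-img create -f qcow2 -b base.img -F raw task-disk.qcow2

# 2) Filesystem reflink clone (XFS/btrfs); falls back to a full copy elsewhere:
base=/tmp/base.img
printf 'base image contents' > "$base"
cp --reflink=auto "$base" /tmp/task.img
cmp -s "$base" /tmp/task.img && echo "clone matches base"
```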

### Shared Filesystems (VirtioFS)

Mount host directories into VMs using VirtioFS:

```hcl
job "shared-storage" {
group "app" {
volume "shared-data" {
type = "host"
source = "app-data"
}
task "processor" {
driver = "ch"
config {
image = "/var/lib/images/data-processor.img"
}
volume_mount {
volume = "shared-data"
destination = "/app/data"
read_only = false
}
resources {
cpu = 2000
memory = 4096
}
}
}
}
```

## 🎮 Device Passthrough

### Driver Configuration

Configure VFIO passthrough at the driver level with device allowlisting for security:

```hcl
plugin "nomad-driver-ch" {
config {
vfio {
# Allowlist specific devices (vendor:device format)
allowlist = [
"10de:*", # All NVIDIA GPUs
"8086:0d26", # Intel specific device
"1002:67df" # AMD Radeon RX 480
]
iommu_address_width = 48 # Default: 48
pci_segments = 1 # Default: 1
}
}
}
```

**Security Note:** The allowlist prevents unauthorized device access. Use wildcards (`10de:*`) for device families or exact vendor:device IDs.
### Task Configuration

Specify PCI devices to pass through to your VM:

```hcl
job "ai-training" {
datacenters = ["dc1"]
constraint {
attribute = "${node.unique.name}"
value = "gpu-node-1"
}
group "training" {
task "model-training" {
driver = "ch"
config {
image = "/var/lib/images/cuda-pytorch.img"
# Pass through NVIDIA RTX 3080 (GPU + Audio controller)
vfio_devices = ["0000:01:00.0", "0000:01:00.1"]
}
resources {
cpu = 8000
memory = 32768
}
}
}
}
```

### Host Requirements

- IOMMU enabled in BIOS/UEFI
- `intel_iommu=on` or `amd_iommu=on` in kernel boot parameters
- vfio-pci kernel module loaded
- Devices bound to the vfio-pci driver (handled automatically by the driver)
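You can confirm the first two requirements with a quick sysfs check — IOMMU groups only appear once the IOMMU is enabled and initialized:

```bash
# Check whether the kernel has an active IOMMU. Prints one line either way.
if [ -d /sys/kernel/iommu_groups ] && [ -n "$(ls -A /sys/kernel/iommu_groups 2>/dev/null)" ]; then
  echo "IOMMU active: $(ls /sys/kernel/iommu_groups | wc -l) groups"
else
  echo "IOMMU not active - check BIOS and kernel parameters"
fi
```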
## 📊 Monitoring
### Resource Statistics
The driver provides real-time VM resource statistics:
```bash
# View allocation statistics
nomad alloc status <alloc-id>
# Monitor resource usage
nomad alloc logs -f <alloc-id> <task-name>
```

### Health Checks

Configure health checks for VM services:

```hcl
task "web-server" {
driver = "ch"
config {
image = "/var/lib/images/nginx.img"
network_interface {
bridge {
name = "br0"
static_ip = "192.168.1.100"
}
}
}
service {
name = "web"
port = "http"
check {
type = "http"
path = "/"
interval = "30s"
timeout = "5s"
address_mode = "alloc"
}
}
}
```

## 🔧 Troubleshooting

### VM Fails to Start

**Symptoms:**
- Task fails during startup
- Error: "Failed to parse disk image format"
**Solutions:**

```bash
# 1. Verify image format
qemu-img info /path/to/image.img
# 2. Check image paths configuration
nomad agent-info | grep -A 10 virt
# 3. Validate kernel/initramfs paths
ls -la /boot/vmlinuz* /boot/initramfs*
# 4. Test Cloud Hypervisor directly
cloud-hypervisor --kernel /boot/vmlinuz --disk path=/path/to/image.img
```

### Networking Issues

**Symptoms:**
- VM has no network access
- Cannot reach VM from host
**Solutions:**

```bash
# 1. Check bridge configuration
ip link show br0
brctl show br0
# 2. Verify TAP interface creation
ip link show | grep tap
# 3. Test bridge connectivity
ping 192.168.1.1 # Gateway IP
# 4. Check iptables rules
iptables -L -v -n
```

### Debug Logging

**Nomad Client:**

```hcl
log_level = "DEBUG"
enable_debug = true
```

**VM Inspection:**

```bash
# Check Cloud Hypervisor processes
ps aux | grep cloud-hypervisor
# Inspect VM via ch-remote
ch-remote --api-socket /path/to/api.sock info
# Monitor VM console output
tail -f /opt/nomad/data/alloc/<alloc-id>/<task>/serial.log
```

## 📖 API Reference

Complete HCL task configuration reference:

```hcl
config {
# Required: VM disk image path
image = "/path/to/vm-image.img"
# Optional: VM hostname
hostname = "my-vm-host"
# Optional: Operating system variant
os {
arch = "x86_64" # CPU architecture
machine = "q35" # Machine type
variant = "ubuntu20.04" # OS variant
}
# Optional: Cloud-init user data
user_data = "/path/to/user-data.yml" # File path
# OR
user_data = <<EOF # Inline YAML
#cloud-config
packages:
- nginx
EOF
# Optional: Timezone configuration
timezone = "America/New_York"
# Optional: Custom commands to run
cmds = [
"apt-get update",
"systemctl enable nginx"
]
# Optional: Default user configuration
default_user_authorized_ssh_key = "ssh-rsa AAAAB3..."
default_user_password = "secure-password"
# Optional: Storage configuration
use_thin_copy = true # Enable thin provisioning
# Optional: Cloud Hypervisor specific
kernel = "/boot/custom-kernel" # Custom kernel path
initramfs = "/boot/custom-initrd" # Custom initramfs path
cmdline = "console=ttyS0 quiet" # Kernel command line
# Optional: Network interface configuration
network_interface {
bridge {
name = "br0" # Bridge name (required)
ports = ["web", "api"] # Port labels to expose
static_ip = "192.168.1.100" # Static IP address
gateway = "192.168.1.1" # Custom gateway
netmask = "24" # Subnet mask (CIDR)
dns = ["8.8.8.8", "1.1.1.1"] # Custom DNS servers
}
}
# Optional: VFIO device passthrough (coming very soon!)
# vfio_devices = ["10de:2204"] # PCI device IDs
# Optional: USB device passthrough
usb_devices = ["046d:c52b"] # USB vendor:product IDs
}
```

Resource configuration:

```hcl
resources {
cpu = 2000 # CPU shares (1 core = 1000)
memory = 2048 # Memory in MB
# Optional: GPU devices
device "nvidia/gpu" {
count = 1
constraint {
attribute = "${device.attr.compute_capability}"
operator = ">="
value = "6.0"
}
}
}
```

## 🛠 Development

### Building from Source

**Prerequisites:**
- Go 1.19 or later
- Git
**Build Steps:**

```bash
# Clone repository
git clone https://github.com/ccheshirecat/nomad-driver-ch.git
cd nomad-driver-ch
# Install dependencies
go mod download
# Run tests
go test ./...
# Build binary
go build -o nomad-driver-ch .
# Install plugin
sudo cp nomad-driver-ch /opt/nomad/plugins/
```

### Testing

**Unit Tests:**

```bash
go test ./...
```

**Integration Tests:**

```bash
# Requires Cloud Hypervisor installation
sudo go test -v ./virt/... -run Integration
```

## 🤝 Contributing

We welcome contributions! Please see our Contributing Guide for details.
1. Fork the repository
2. Create a feature branch (`git checkout -b feature/amazing-feature`)
3. Write tests for your changes
4. Ensure all tests pass (`go test ./...`)
5. Run linting (`golangci-lint run`)
6. Commit with clear messages
7. Push to your fork
8. Create a Pull Request
### Guidelines

- **Code Style:** Follow Go conventions and use `gofmt`
- **Testing:** Maintain >80% test coverage
- **Documentation:** Update docs for user-facing changes
- **Compatibility:** Maintain backward compatibility
- **Security:** Never commit secrets or credentials
## 📄 License

This project is licensed under the Mozilla Public License 2.0 - see the LICENSE file for details.
## 🙏 Acknowledgments

- HashiCorp Nomad team for the excellent orchestration platform
- Intel Cloud Hypervisor team for the lightweight VMM
- Cloud-init project for VM initialization
- All contributors who help improve this driver
Made with ❤️ for the cloud-native community