Add runtime integration tests for Slurm plugin functionality #8
I approved the new workflow, but it failed.
@copilot Please continue working on this PR.
Fixed the runtime test failures in commit bfaf4ee. The issues were:
The workflow should now pass successfully.
@copilot Runtime tests now fail in a different way, continue working on this PR.
Fixed the compute node registration issue in commit 315f95d. The problem was a race condition where both containers were building the plugin simultaneously in the same directory on the shared volume. Now using separate build directories for each container.
@copilot New error, continue working on it. You can remove the check that looks for the exec plugin being loaded in the slurmd logs, as it will only be loaded by the controller. I believe you may also skip installing the plugin on the worker node.
@copilot Also rename the workflow jobs so that the GH UI just displays "Runtime Test / Slurm 23.11" |
Fixed in commit cf0144c. Removed plugin build/install from slurmd (only installed on slurmctld now) and removed the slurmd log check test as suggested.
@copilot Check the workflow logs; waiting for the worker node somehow times out.
Fixed in commit b4756ac. The issue was that each container had a different Munge key (created at build time in the Dockerfile), causing authentication failures. Now slurmctld creates the key in the shared volume on first run, and slurmd waits for it before starting. Also improved diagnostics to show container logs when tests fail.
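The "slurmd waits for the key" step can be sketched as a small polling helper. This is a minimal sketch, not the code from commit b4756ac: the function name `wait_for_file`, its default attempt count and delay, and the example key path are all assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): poll until a file exists and
# is non-empty, or give up after a fixed number of attempts.
# Arguments: path, max attempts (default 30), delay in seconds (default 1).
wait_for_file() {
    path=$1
    attempts=${2:-30}
    delay=${3:-1}
    i=0
    while [ "$i" -lt "$attempts" ]; do
        # -s is true only when the file exists and has a size greater than
        # zero, so a key file that was created but not yet filled is skipped.
        if [ -s "$path" ]; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "timed out waiting for $path" >&2
    return 1
}

# Example use in a compute-node entrypoint (the path is an assumption):
#   wait_for_file /shared/munge/munge.key 60 1 || exit 1
```

Because both containers read the same key from the shared volume, authentication no longer depends on what was baked into each image at build time.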
@copilot Worker node startup still timed out. Additionally, split the workflow step "Run runtime integration tests" into more separately named, collapsible sections, because the output is quite long and it's easy to lose the overview.
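One way to get collapsible sections inside a single step is the GitHub Actions `::group::` and `::endgroup::` workflow commands. The sketch below is an assumption about how that could look; the helper name `run_group` is hypothetical, and the PR may instead have split the work into separate workflow steps.

```shell
#!/bin/sh
# Hypothetical helper (not taken from the PR): run a command inside a
# collapsible log group and preserve the command's exit status.
# Arguments: group title, then the command and its arguments.
run_group() {
    title=$1
    shift
    status=0
    echo "::group::$title"
    # Capture the exit status without aborting, so the group is always closed.
    "$@" || status=$?
    echo "::endgroup::"
    return "$status"
}

# Example use (the script name comes from the PR, the title is made up):
#   run_group "Validate setup" ./validate-setup.sh
```

Outside GitHub Actions the two marker lines are simply printed as plain text, so the helper stays harmless in local runs.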
349e15b to 707c3d6
Pull Request Overview
This PR implements runtime integration tests for the Slurm-Singularity plugin by setting up a Docker-based test infrastructure and fixes an image loading issue by adding explicit image names to docker-compose.yml.
Key changes:
- Added Docker-based Slurm cluster setup with controller and compute nodes
- Created comprehensive integration tests validating plugin CLI options, daemon logs, and containerized job execution
- Added explicit image names to docker-compose.yml to fix buildx bake image loading
- Removed deprecated Vagrant-based development environment
Reviewed Changes
Copilot reviewed 19 out of 20 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| tests/runtime/Dockerfile | Container base image with Slurm, Munge, and build dependencies |
| tests/runtime/docker-compose.yml | Orchestrates slurmctld and slurmd services with explicit image names |
| tests/runtime/entrypoint-slurmctld.sh | Controller initialization: builds plugin, configures Munge, starts slurmctld |
| tests/runtime/entrypoint-slurmd.sh | Compute node initialization: waits for controller, starts slurmd |
| tests/runtime/test-integration.sh | Integration test suite validating plugin functionality and job execution |
| tests/runtime/run-tests.sh | Test orchestration using docker buildx bake with cache support |
| tests/runtime/slurm-common.conf | Base Slurm configuration shared across versions |
| tests/runtime/slurm-24.11.conf | Version-specific configuration for Slurm 24.11 |
| tests/runtime/setup-slurm-config.sh | Version detection and configuration selection logic |
| tests/runtime/plugstack.conf | Plugin loading configuration |
| tests/runtime/cgroup.conf | Cgroup settings for containerized environment |
| tests/runtime/validate-setup.sh | Pre-flight validation of Docker, compose, and scripts |
| tests/runtime/README.md | Documentation for runtime testing infrastructure |
| .github/workflows/runtime-tests.yml | CI workflow for automated runtime testing |
| README.md | Updated badges and removed Vagrant documentation |
| tests/README.md | Added runtime test documentation |
| Vagrantfile | Removed deprecated Vagrant setup |
| containers.sh | Removed deprecated container build script |
37e6875 to 8b2533c
Remove Vagrant-based development setup in favor of Docker Compose runtime tests which provide better CI integration and maintainability.
Implement comprehensive runtime testing infrastructure using Docker Compose with a three-service architecture optimized for CI environments.

Architecture:
- plugin-builder: Builds plugin once with custom bind mount configuration
- slurmctld: Slurm controller with Munge authentication
- slurmd: Compute node with privileged access for containers

Configuration:
- Slurm 24.11 with proctrack/linuxproc and task/none
- Cgroup plugin disabled for container compatibility
- Logging to stdout/stderr for visibility
- Version-aware configuration system

Test Suite:
- Plugin installation and CLI option verification
- Job submission and execution validation
- Containerized workload testing with Singularity
- Multi-argument command support verification
- Comprehensive error handling and retry logic

Add automated runtime testing that runs on every push and pull request:
- Tests Slurm 24.11 on Ubuntu 25.04 (plucky)
- Uses Docker Buildx with GitHub Actions caching
- Runs full integration test suite with live Slurm cluster
- Validates plugin installation, job execution, and containerized workloads

Updates documentation and adds Runtime Tests badge to README.
8b2533c to db72e07
Extract configuration values to the top of each script with environment variable overrides.

run-tests.sh configuration (passed to test-integration.sh):
- RETRY_TIMES (default: 15) - cluster readiness retry attempts
- RETRY_DELAY (default: 2) - delay between retries in seconds
- JOB_RETRY_DELAY (default: 1) - delay for job state checks
- JOB_MAX_WAIT (default: 120) - maximum job wait time in seconds
- JOB_POLL_INTERVAL (default: 3) - job polling interval in seconds
- LOG_TAIL_LINES (default: 100) - lines to show in failure logs

test-integration.sh configuration (container-specific defaults):
- PLUGIN_LIBEXEC_DIR (default: /usr/libexec)
- SLURM_SYSCONFDIR (default: /etc/slurm)
- SLURM_JOB_SPOOL (default: /var/spool/slurm-jobs)
- SLURM_LOG_DIR (default: /var/log/slurm)
- SLURM_PARTITION (default: debug)

All timing parameters from run-tests.sh are passed to test-integration.sh via docker exec -e flags for consistency.
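The override pattern described above can be sketched with shell parameter expansion. The variable names and defaults are the ones listed in the commit message; the `retry` helper, the container name, and the script path in the closing comment are assumptions for illustration, not the PR's actual code.

```shell
#!/bin/sh
# Timing defaults from the commit message. A value already present in the
# environment takes precedence over the default.
RETRY_TIMES="${RETRY_TIMES:-15}"
RETRY_DELAY="${RETRY_DELAY:-2}"
JOB_RETRY_DELAY="${JOB_RETRY_DELAY:-1}"
JOB_MAX_WAIT="${JOB_MAX_WAIT:-120}"
JOB_POLL_INTERVAL="${JOB_POLL_INTERVAL:-3}"
LOG_TAIL_LINES="${LOG_TAIL_LINES:-100}"

# Hypothetical helper (not taken from the PR): run a command until it
# succeeds, at most RETRY_TIMES times, sleeping RETRY_DELAY seconds between
# attempts. Returns 0 on the first success and 1 if every attempt fails.
retry() {
    attempt=1
    while :; do
        "$@" && return 0
        [ "$attempt" -ge "$RETRY_TIMES" ] && return 1
        attempt=$((attempt + 1))
        sleep "$RETRY_DELAY"
    done
}

# Forwarding the values into a container could look like this
# (container name and script path are assumptions):
#   docker exec -e RETRY_TIMES="$RETRY_TIMES" -e RETRY_DELAY="$RETRY_DELAY" \
#       slurmctld /path/to/test-integration.sh
```

A caller can then shorten a local run with, for example, `RETRY_TIMES=5 RETRY_DELAY=1 ./run-tests.sh` without editing either script.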
57c9684 to 71599d3
Fixed "No such image" error by adding explicit image names to docker-compose.yml.
Issue
Both Slurm 23.11 and 24.11 tests were failing with:
Root Cause
When using `docker buildx bake --load`, the images are built and loaded into Docker's image store, but without explicit `image:` fields in the docker-compose.yml, buildx doesn't know what to tag them as. When `docker compose up` runs with `--no-build`, it looks for images with specific names but can't find them.
Solution
Added explicit `image:` fields to both services in docker-compose.yml:
- slurmctld → runtime-slurmctld:latest
- slurmd → runtime-slurmd:latest

This tells buildx bake what to name the images when loading them, and tells docker compose what images to use when starting the containers.
Status:
Fixes #4