Copilot AI commented Oct 24, 2025

Fixed "No such image" error by adding explicit image names to docker-compose.yml.

Issue

Both Slurm 23.11 and 24.11 tests were failing with:

Error response from daemon: No such image: runtime-slurmctld:latest

Root Cause

When using docker buildx bake --load, the images are built and loaded into Docker's image store, but without explicit image: fields in the docker-compose.yml, buildx doesn't know what to tag them as. When docker compose up runs with --no-build, it looks for images with specific names but can't find them.

Solution

Added explicit image: fields to both services in docker-compose.yml:

  • slurmctld → runtime-slurmctld:latest
  • slurmd → runtime-slurmd:latest

This tells buildx bake what to name the images when loading them, and tells docker compose what images to use when starting the containers.
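As a rough illustration of the failure mode, the check below looks for both image names in the local image store before `docker compose up --no-build` would be run. It is only a sketch: the function name `missing_images` and the variable `REQUIRED_IMAGES` are made up for this example and are not part of the PR.

```shell
# Image names taken from the docker-compose.yml change described above.
REQUIRED_IMAGES="runtime-slurmctld:latest runtime-slurmd:latest"

# Print every required image that is absent from the local image store.
# Returns non-zero if at least one image is missing.
# (missing_images is a hypothetical helper, not part of the PR.)
missing_images() {
  missing=0
  for image in $REQUIRED_IMAGES; do
    if ! docker image inspect "$image" >/dev/null 2>&1; then
      echo "$image"
      missing=1
    fi
  done
  return "$missing"
}
```

A caller could run `missing_images` after `docker buildx bake --load` and stop early with a clear message instead of hitting "No such image" during container start-up.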

Status:

  • Slurm 24.11: Should now find and start images ✓
  • Slurm 23.11: Should now find and start images ✓
Original prompt

This section details the original issue you should resolve

<issue_title>Implement a CI check that verifies basic run-time functionality</issue_title>
<issue_description>* Start up a slurmd and a slurmctld with a basic config

  • Check that plugin got loaded without error
    • Presence of plugin cli options in output of sbatch --help
    • Check slurmd logs
  • Submit a containerized test job

If needed, GH actions appear to support nested virtualization since a few weeks.</issue_description>
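The "presence of plugin CLI options" check from the issue could be sketched roughly as follows. The helper name `help_mentions_option` and the option `--example-plugin-option` are placeholders invented for illustration; the real option names come from the plugin and are not given in this conversation.

```shell
# Succeeds if the option string given as $1 appears literally in the help
# text read from stdin. "-e" keeps grep from treating a pattern that starts
# with "--" as one of its own options.
help_mentions_option() {
  grep -qF -e "$1"
}

# Hypothetical usage inside the controller container:
#   sbatch --help 2>&1 | help_mentions_option "--example-plugin-option"
```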

<agent_instructions>Use a docker-compose based slurm cluster using the same ubuntu images already used in the CI.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

Fixes #4


Copilot AI changed the title from "[WIP] Implement CI check for basic run-time functionality" to "Add runtime integration tests for Slurm plugin functionality" Oct 24, 2025
Copilot AI requested a review from dennisklein October 24, 2025 12:08
Copilot finished work on behalf of dennisklein October 24, 2025 12:08
Member

@dennisklein dennisklein left a comment

I approved the new workflow, but it failed.

@dennisklein
Member

@copilot Please continue working on this PR.

Author

Copilot AI commented Oct 24, 2025

@copilot Please continue working on this PR.

Fixed the runtime test failures in commit bfaf4ee. The issues were:

  1. Plugin path was incorrect (/usr/lib instead of /usr/libexec)
  2. Docker volume conflicts when multiple containers tried to create directories in the same mount

The workflow should now pass successfully.

Copilot AI requested a review from dennisklein October 24, 2025 12:41
Copilot finished work on behalf of dennisklein October 24, 2025 12:41
@dennisklein
Member

@copilot Runtime tests now fail in a different way, continue working on this PR

Author

Copilot AI commented Oct 24, 2025

@copilot Runtime tests now fail in a different way, continue working on this PR

Fixed the compute node registration issue in commit 315f95d. The problem was a race condition where both containers were building the plugin simultaneously in the same directory on the shared volume. Now using separate build directories for each container.
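One way to picture the "separate build directories" fix is a helper that derives a per-container path under the shared volume. This is a sketch only: the function name `container_build_dir` and the `<base>/build-<name>` layout are assumptions, not the naming used in commit 315f95d.

```shell
# Derive a build directory unique to this container under a shared base path.
# The "<base>/build-<name>" layout is an assumption made for illustration.
# $1: base directory on the shared volume
# $2: container name (defaults to the container's hostname)
container_build_dir() {
  base="$1"
  name="${2:-$(hostname)}"
  printf '%s/build-%s\n' "$base" "$name"
}
```

Because each container resolves a different path, two simultaneous builds no longer write into the same directory.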

Copilot finished work on behalf of dennisklein October 24, 2025 13:48
@dennisklein
Member

@copilot New error, continue working on it. You can remove the check that looks for the exec plugin being loaded in the slurmd logs, as it will only be loaded by the controller. I believe you may also skip installing the plugin on the worker node.

@dennisklein
Member

@copilot Also rename the workflow jobs so that the GH UI just displays "Runtime Test / Slurm 23.11"

Author

Copilot AI commented Oct 24, 2025

@copilot new error, continue working on it. You can remove the check that looks for loading the exec plugin in the slurmd logs, as it will only be loaded by the controller. i believe you may also skip th...

Fixed in commit cf0144c. Removed plugin build/install from slurmd (only installed on slurmctld now) and removed the slurmd log check test as suggested.

Author

Copilot AI commented Oct 24, 2025

@copilot Also rename the workflow jobs so that the GH UI just displays "Runtime Test / Slurm 23.11"

Done in commit 7eb2998. The workflow jobs now display as "Runtime Test / Slurm 23.11" and "Runtime Test / Slurm 24.11" in the GitHub UI.

Copilot finished work on behalf of dennisklein October 24, 2025 14:13
@dennisklein
Member

@copilot check the workflow logs, somehow waiting for the worker node times out

Author

Copilot AI commented Oct 24, 2025

@copilot check the workflow logs, somehow waiting for the worker node times out

Fixed in commit b4756ac. The issue was that each container had a different Munge key (created at build time in the Dockerfile), causing authentication failures. Now slurmctld creates the key in the shared volume on first run, and slurmd waits for it before starting. Also improved diagnostics to show container logs when tests fail.
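The create-then-wait pattern described here might look roughly like the sketch below. The function names, the 1024-byte key size, and the 30-second default timeout are assumptions for illustration; the actual entrypoint scripts may differ, and Munge's ownership requirements for the key file are not handled.

```shell
# Controller side: create the key only if it does not exist yet. The key is
# written to a temporary file and renamed, so a waiting node never reads a
# half-written key. (Hypothetical helper; key size is an assumption.)
create_shared_key() {
  key_path="$1"
  [ -f "$key_path" ] && return 0
  tmp="$key_path.tmp.$$"
  dd if=/dev/urandom of="$tmp" bs=1024 count=1 2>/dev/null
  chmod 400 "$tmp"
  mv "$tmp" "$key_path"
}

# Compute-node side: poll once per second until the key exists and is
# non-empty, or give up after the timeout in seconds (default 30).
wait_for_key() {
  key_path="$1"
  timeout="${2:-30}"
  waited=0
  while [ ! -s "$key_path" ]; do
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 1
    waited=$((waited + 1))
  done
}
```

With both containers mounting the same volume, the controller would call `create_shared_key` before starting its daemons, and the compute node would call `wait_for_key` on the same path before starting munged and slurmd.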

Copilot finished work on behalf of dennisklein October 24, 2025 15:43
@dennisklein
Member

@copilot Worker node startup still timed out. Additionally, split the workflow step "Run runtime integration tests" into more separately named and collapsible sections, because the output is quite long and it's easy to lose the overview.
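Collapsible sections inside a single step can be produced with the GitHub Actions `::group::` and `::endgroup::` workflow commands. The wrapper below is a minimal sketch of that idea; the name `run_in_group` is hypothetical and not taken from the PR.

```shell
# Run a command inside a collapsible log group in GitHub Actions.
# $1 is the group title; the remaining arguments are the command to run.
# The command's exit status is preserved so failures still fail the step.
run_in_group() {
  title="$1"
  shift
  echo "::group::$title"
  status=0
  "$@" || status=$?
  echo "::endgroup::"
  return "$status"
}
```

For example, `run_in_group "Start cluster" docker compose up -d` would fold that command's output under a "Start cluster" heading in the job log.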

Copilot AI left a comment


Pull Request Overview

This PR implements runtime integration tests for the Slurm-Singularity plugin by setting up a Docker-based test infrastructure and fixes an image loading issue by adding explicit image names to docker-compose.yml.

Key changes:

  • Added Docker-based Slurm cluster setup with controller and compute nodes
  • Created comprehensive integration tests validating plugin CLI options, daemon logs, and containerized job execution
  • Added explicit image names to docker-compose.yml to fix buildx bake image loading
  • Removed deprecated Vagrant-based development environment

Reviewed Changes

Copilot reviewed 19 out of 20 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| tests/runtime/Dockerfile | Container base image with Slurm, Munge, and build dependencies |
| tests/runtime/docker-compose.yml | Orchestrates slurmctld and slurmd services with explicit image names |
| tests/runtime/entrypoint-slurmctld.sh | Controller initialization: builds plugin, configures Munge, starts slurmctld |
| tests/runtime/entrypoint-slurmd.sh | Compute node initialization: waits for controller, starts slurmd |
| tests/runtime/test-integration.sh | Integration test suite validating plugin functionality and job execution |
| tests/runtime/run-tests.sh | Test orchestration using docker buildx bake with cache support |
| tests/runtime/slurm-common.conf | Base Slurm configuration shared across versions |
| tests/runtime/slurm-24.11.conf | Version-specific configuration for Slurm 24.11 |
| tests/runtime/setup-slurm-config.sh | Version detection and configuration selection logic |
| tests/runtime/plugstack.conf | Plugin loading configuration |
| tests/runtime/cgroup.conf | Cgroup settings for containerized environment |
| tests/runtime/validate-setup.sh | Pre-flight validation of Docker, compose, and scripts |
| tests/runtime/README.md | Documentation for runtime testing infrastructure |
| .github/workflows/runtime-tests.yml | CI workflow for automated runtime testing |
| README.md | Updated badges and removed Vagrant documentation |
| tests/README.md | Added runtime test documentation |
| Vagrantfile | Removed deprecated Vagrant setup |
| containers.sh | Removed deprecated container build script |

Copilot AI requested a review from dennisklein October 29, 2025 15:29
Copilot finished work on behalf of dennisklein October 29, 2025 15:29
Copilot finished work on behalf of dennisklein October 29, 2025 15:32
@dennisklein dennisklein force-pushed the copilot/implement-ci-check-runtime-functionality branch 2 times, most recently from 37e6875 to 8b2533c November 3, 2025 17:18
Remove Vagrant-based development setup in favor of Docker Compose
runtime tests which provide better CI integration and maintainability.

Implement comprehensive runtime testing infrastructure using Docker
Compose with a three-service architecture optimized for CI environments.

Architecture:
- plugin-builder: Builds plugin once with custom bind mount configuration
- slurmctld: Slurm controller with Munge authentication
- slurmd: Compute node with privileged access for containers

Configuration:
- Slurm 24.11 with proctrack/linuxproc and task/none
- Cgroup plugin disabled for container compatibility
- Logging to stdout/stderr for visibility
- Version-aware configuration system

Test Suite:
- Plugin installation and CLI option verification
- Job submission and execution validation
- Containerized workload testing with Singularity
- Multi-argument command support verification
- Comprehensive error handling and retry logic

Add automated runtime testing that runs on every push and pull request:
- Tests Slurm 24.11 on Ubuntu 25.04 (plucky)
- Uses Docker Buildx with GitHub Actions caching
- Runs full integration test suite with live Slurm cluster
- Validates plugin installation, job execution, and containerized workloads

Updates documentation and adds Runtime Tests badge to README.
@dennisklein dennisklein force-pushed the copilot/implement-ci-check-runtime-functionality branch from 8b2533c to db72e07 November 3, 2025 17:32
Extract configuration values to the top of each script with environment
variable overrides.

run-tests.sh configuration (passed to test-integration.sh):
- RETRY_TIMES (default: 15) - cluster readiness retry attempts
- RETRY_DELAY (default: 2) - delay between retries in seconds
- JOB_RETRY_DELAY (default: 1) - delay for job state checks
- JOB_MAX_WAIT (default: 120) - maximum job wait time in seconds
- JOB_POLL_INTERVAL (default: 3) - job polling interval in seconds
- LOG_TAIL_LINES (default: 100) - lines to show in failure logs
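To make the role of RETRY_TIMES and RETRY_DELAY concrete, a retry loop driven by those two values could be sketched as below. The `retry` function is a hypothetical helper written for this example; the actual logic in run-tests.sh is not shown in this conversation.

```shell
# Defaults as listed above; both can be overridden from the environment.
RETRY_TIMES="${RETRY_TIMES:-15}"
RETRY_DELAY="${RETRY_DELAY:-2}"

# Run the given command until it succeeds, up to RETRY_TIMES attempts,
# sleeping RETRY_DELAY seconds between attempts. Returns 1 if every
# attempt fails. (Hypothetical helper, not taken from the PR.)
retry() {
  attempt=1
  while ! "$@"; do
    [ "$attempt" -ge "$RETRY_TIMES" ] && return 1
    attempt=$((attempt + 1))
    sleep "$RETRY_DELAY"
  done
}
```

With the defaults this waits at most about 28 seconds (14 sleeps of 2 seconds) before giving up on a cluster readiness check.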

test-integration.sh configuration (container-specific defaults):
- PLUGIN_LIBEXEC_DIR (default: /usr/libexec)
- SLURM_SYSCONFDIR (default: /etc/slurm)
- SLURM_JOB_SPOOL (default: /var/spool/slurm-jobs)
- SLURM_LOG_DIR (default: /var/log/slurm)
- SLURM_PARTITION (default: debug)

All timing parameters from run-tests.sh are passed to test-integration.sh
via docker exec -e flags for consistency.
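The pattern of environment-overridable defaults forwarded through `docker exec -e` can be sketched as follows. The variable names and defaults come from the commit message above; the helper `exec_env_flags`, the container name, and the script path in the usage comment are placeholders for illustration.

```shell
# Defaults from the commit message; each can be overridden from the environment.
RETRY_TIMES="${RETRY_TIMES:-15}"
RETRY_DELAY="${RETRY_DELAY:-2}"
JOB_RETRY_DELAY="${JOB_RETRY_DELAY:-1}"
JOB_MAX_WAIT="${JOB_MAX_WAIT:-120}"
JOB_POLL_INTERVAL="${JOB_POLL_INTERVAL:-3}"
LOG_TAIL_LINES="${LOG_TAIL_LINES:-100}"

# Print one "-e NAME=value" pair per line for every timing parameter.
# (exec_env_flags is a hypothetical helper, not part of the PR.)
exec_env_flags() {
  for name in RETRY_TIMES RETRY_DELAY JOB_RETRY_DELAY JOB_MAX_WAIT \
              JOB_POLL_INTERVAL LOG_TAIL_LINES; do
    eval "value=\${$name}"
    printf '%s %s=%s\n' "-e" "$name" "$value"
  done
}

# Hypothetical usage (container name and script path are placeholders):
#   docker exec $(exec_env_flags) <container> <path>/test-integration.sh
```

Leaving `$(exec_env_flags)` unquoted is deliberate in this sketch: the values are plain integers, so word splitting yields exactly the `-e NAME=value` arguments.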
@dennisklein dennisklein force-pushed the copilot/implement-ci-check-runtime-functionality branch from 57c9684 to 71599d3 November 3, 2025 17:58
@dennisklein dennisklein marked this pull request as ready for review November 3, 2025 17:58
@dennisklein dennisklein merged commit c752089 into main Nov 3, 2025
3 checks passed
@dennisklein dennisklein deleted the copilot/implement-ci-check-runtime-functionality branch November 3, 2025 18:01