
Conversation

ryan-williams (Member)

README demos section shows some improvements since #2:

  • Supports multiple runners/jobs per instance (in sequence or parallel)
  • Supports multiple archs and OSes
  • Uses GitHub runner job start/end hooks

Examples

demos#32

mamba/installs#14 (mamba#771)

oe/gpu#1098 (Ocean_Emulator#308)

CloudWatch console

EC2 instances console (showing many terminat{ed,ing} instances)

Also fix a few inconsistent `runner_initial_grace_period=120` defaults
- Add runners_per_instance input to action.yml and runner.yml
- Modify __main__.py to generate multiple tokens per instance
- Update StartAWS to handle grouped tokens for multiple runners

- Add runners_per_instance parameter to action.yml and runner.yml (default: 1)
- Generate multiple GitHub runner tokens upfront for multi-runner instances
- Update template to register and start multiple runners in separate directories
- Each runner gets its own directory (runner-0, runner-1, etc.)
- Update termination logic to handle all runners on instance
- Maintain backward compatibility with single runner mode
- Update output handling to provide array of all runner labels

This allows a single EC2 instance to host multiple GitHub runners,
enabling concurrent job execution without launching separate instances.
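
Conceptually, the per-runner setup in the template looks something like the sketch below (variable names, paths, and the backgrounded run.sh are illustrative assumptions, not the template's actual contents):

```bash
#!/usr/bin/env bash
# Sketch only: register and start N runners in separate directories on one instance.
# TOKENS, RUNNER_LABELS, REPO_URL, and RUNNERS_PER_INSTANCE are hypothetical placeholders.
set -euo pipefail

for i in $(seq 0 $((RUNNERS_PER_INSTANCE - 1))); do
  dir="$HOME/runner-$i"
  mkdir -p "$dir"
  tar -xzf /tmp/actions-runner.tar.gz -C "$dir"
  cd "$dir"
  # Each runner gets its own registration token, name, and label
  ./config.sh --unattended \
    --url "$REPO_URL" \
    --token "${TOKENS[$i]}" \
    --name "$(hostname)-$i" \
    --labels "${RUNNER_LABELS[$i]}"
  nohup ./run.sh > "$dir/runner.log" 2>&1 &   # leave each runner running in the background
done
```

Each ./config.sh call consumes one of the pre-generated tokens, which is why the action now requests runners_per_instance tokens per instance up front.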
The RUNNER_INITIAL_GRACE_PERIOD and RUNNER_GRACE_PERIOD values were
being correctly passed to the instance and set in the runner's .env
file, but the systemd service that runs the termination check didn't
have access to these environment variables.

Added Environment= directives to the systemd service to pass the
grace period values through, so they'll be available when the
termination check script runs.
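
The change is roughly equivalent to having the template emit something like this (unit name, script path, and variable names are illustrative):

```bash
# Sketch: the generated unit gains Environment= lines so the termination check
# script sees the same grace periods that were written to the runner's .env file.
cat > /etc/systemd/system/runner-termination-check.service <<EOF
[Unit]
Description=Runner termination check

[Service]
Environment=RUNNER_INITIAL_GRACE_PERIOD=${RUNNER_INITIAL_GRACE_PERIOD}
Environment=RUNNER_GRACE_PERIOD=${RUNNER_GRACE_PERIOD}
ExecStart=/usr/local/bin/termination-check.sh
EOF
systemctl daemon-reload
```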
The termination check script was failing to deregister runners because
$homedir was inside a single-quoted heredoc ('EOFT'), preventing
variable substitution. This caused the script to look for /runner-*
instead of /home/ubuntu/runner-*.

Fixed by:
- Changing heredoc delimiter from 'EOFT' to EOFT (allows substitution)
- Escaping shell variables with \ to preserve them for runtime
- Ensuring $homedir is substituted at template generation time

This ensures runners are properly deregistered before instance termination,
preventing orphaned runners in GitHub.
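
In miniature, the quoting change looks like this (file name and loop body are illustrative):

```bash
# Before: the quoted delimiter ('EOFT') suppresses all substitution, so the
# generated script contains a literal "$homedir" that expands to nothing at
# runtime and globs /runner-* instead of /home/ubuntu/runner-*.
cat > termination-check.sh <<'EOFT'
for d in $homedir/runner-*; do
  echo "checking $d"
done
EOFT

# After: an unquoted delimiter lets $homedir expand now, at template-generation
# time, while variables needed at runtime are escaped with \$ so they survive.
cat > termination-check.sh <<EOFT
for d in $homedir/runner-*; do
  echo "checking \$d"
done
EOFT
```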
Added a 'debug' input parameter that enables verbose output (set -x) in
the runner setup script. This helps with troubleshooting without making
verbose output the default.

Changes:
- Added 'debug' input to action.yml and runner.yml workflow
- Pass debug parameter through Python code to template
- Conditionally enable 'set -x' only when debug is true
- Keep runner-debug.log output regardless (useful for post-mortem)

The debug mode can be enabled by setting debug: true in the workflow
that calls ec2-gha.
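
In the setup script this amounts to something like the following (log path and variable name are assumptions):

```bash
# Sketch: always capture a setup log for post-mortems, but only enable shell
# tracing when the workflow passed debug: true through to the template.
exec > >(tee -a /var/log/runner-debug.log) 2>&1
if [ "${DEBUG:-false}" = "true" ]; then
  set -x
fi
```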
Added checks for critical system dependencies and made the script more
portable across different Linux distributions:

- Check for Python3 availability (critical dependency)
- Support both dpkg (Debian/Ubuntu) and rpm (RHEL/CentOS/Amazon Linux)
  for CloudWatch agent installation
- Support both curl and wget for downloading files
- Fail gracefully with clear error messages when critical tools are missing

This makes ec2-gha more flexible for use with different AMIs beyond
just Ubuntu, while maintaining backward compatibility.
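
The pattern is roughly the following (error messages, package URLs, and variable names are placeholders):

```bash
# Sketch of the portability checks described above.
command -v python3 >/dev/null 2>&1 || { echo "ERROR: python3 is required" >&2; exit 1; }

# Use whichever downloader is available
if command -v curl >/dev/null 2>&1; then
  fetch() { curl -fsSL -o "$2" "$1"; }
elif command -v wget >/dev/null 2>&1; then
  fetch() { wget -qO "$2" "$1"; }
else
  echo "ERROR: neither curl nor wget found" >&2; exit 1
fi

# Install the CloudWatch agent with the native package tool
if command -v dpkg >/dev/null 2>&1; then
  fetch "$CLOUDWATCH_DEB_URL" /tmp/amazon-cloudwatch-agent.deb
  dpkg -i /tmp/amazon-cloudwatch-agent.deb
elif command -v rpm >/dev/null 2>&1; then
  fetch "$CLOUDWATCH_RPM_URL" /tmp/amazon-cloudwatch-agent.rpm
  rpm -U /tmp/amazon-cloudwatch-agent.rpm
else
  echo "WARNING: no supported package manager; skipping CloudWatch agent" >&2
fi
```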
Major improvements to runner configuration:

1. Parallelization (when Python3 available):
   - Moved runner setup logic to a shell function 'configure_runner'
   - Python now uses ThreadPoolExecutor to configure up to 4 runners in parallel
   - Significantly reduces setup time for multiple runners

2. Python3-free fallback:
   - Single runner can now be configured without Python3
   - Uses shell-based JSON parsing (basic sed extraction)
   - Limited to single runner due to JSON parsing complexity in pure shell
   - Allows ec2-gha to work on minimal AMIs without Python3

3. Better separation of concerns:
   - Shell function handles all runner setup logic
   - Python only handles JSON parsing and parallelization
   - Makes the code more maintainable and testable

The default path uses Python3 for reliability and performance, but the
fallback ensures basic functionality on minimal AMIs.
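
For the Python3-free fallback, the "basic sed extraction" is along these lines (the response shape matches GitHub's registration-token endpoint, but the exact command in the template may differ):

```bash
# Sketch: extract a single "token" field from a JSON response without python3.
# Good enough for the single-runner case; multi-runner config stays on the Python path.
response='{"token":"ABC123DEF","expires_at":"2025-09-15T05:00:00Z"}'
token=$(printf '%s' "$response" | sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
echo "$token"   # -> ABC123DEF
```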
ryan-williams marked this pull request as ready for review September 15, 2025 04:56

Each instance gets a unique runner label and can execute jobs independently. This is useful for:
- Matrix builds that need isolated environments
- Parallel testing across different configurations

Member:

All the instances are identical, right? So this doesn't add anything in terms of isolation or configuration variance over doing a matrix with an instance_count of 1? (Just faster?)

gh = GitHubInstance(token=token, repo=repo)

# Pass runners_per_instance to StartAWS
params["runners_per_instance"] = runners_per_instance

Member:

AFAICT we're popping this out of this dict just above and then putting it back in here.

grouped_tokens = []
for i in range(0, total_runners, runners_per_instance):
    grouped_tokens.append(all_tokens[i:i+runners_per_instance])
params["grouped_runner_tokens"] = grouped_tokens

Member:

Ok, if I understand correctly, with runners_per_instance == 1:

  • We pass count = instance_count, not grouped_runner_tokens
  • gha_runner's __post_init__ makes count tokens and stores them in cloud_params['gh_runner_tokens']
  • In StartAWS.create_instances we then read the number of instances from len(gh_runner_tokens).

And with runners_per_instance > 1:

  • We pass count = instance_count and grouped_runner_tokens
  • gha_runner's __post_init__ makes count tokens and stores them in cloud_params['gh_runner_tokens']
  • In StartAWS.create_instances we see that grouped_runner_tokens is populated so we ignore gh_runner_tokens (and so also count) and get the number of instances from the length of grouped_runner_tokens instead.

This seems a little confusing and has a few minor knock-on effects (e.g. create_instances checks for gh_runner_tokens but not grouped_runner_tokens; the former is documented in init but the latter is not; count is meaningless if you pass grouped_runner_tokens). Could we have a single mechanism here? Perhaps we pass runners_per_instance to StartAWS and have it populate runner_tokens (always with groups?) in __post_init__, where it currently does it for one runner?

Comment on lines +471 to +489
# Check UserData size before calling AWS
user_data_size = len(params.get("UserData", ""))
if user_data_size > 16384:
    raise ValueError(
        f"UserData exceeds AWS limit: {user_data_size} bytes (limit: 16384 bytes, "
        f"over by: {user_data_size - 16384} bytes). "
        f"Template needs to be reduced by at least {user_data_size - 16384} bytes."
    )

try:
    result = ec2.run_instances(**params)
except Exception as e:
    if "User data is limited to 16384 bytes" in str(e):
        # This shouldn't happen if our check above works, but just in case
        raise ValueError(
            f"UserData exceeds AWS limit: {user_data_size} bytes (limit: 16384 bytes, "
            f"over by: {user_data_size - 16384} bytes)"
        ) from e
    raise

Member:

🐑 this seems like a lot; maybe remove the pre- and post-checks and let the exception get raised by run_instances? Or perhaps catch + re-raise with add_note about reducing template size if UserData is too large?

# Check job files and update timestamps for active runners
# This creates a heartbeat mechanism to detect stuck/failed job completion
for job_file in $J/*.job; do
  [ -f "$job_file" ] || continue

Member:

How could this trigger? Are there non-normal files in this directory?

done

# Ensure activity file exists and get its timestamp
[ ! -f "$A" ] && touch "$A"

Member:

🐑 Is this the same as [ -f "$A" ] || touch "$A"? If so, can we use the same style as above? (Or use an if statement for both?)
