
Conversation

ryan-williams (Member)

README demos section shows some improvements since #2:

  • Supports multiple runners/jobs per instance (in sequence or parallel)
  • Supports multiple archs and OSes
  • Uses GitHub runner job start/end hooks

Examples

demos#32

mamba/installs#14 (mamba#771)

oe/gpu#1098 (Ocean_Emulator#308)

CloudWatch console

EC2 instances console (showing many terminat{ed,ing} instances)

Also fix a few inconsistent `runner_initial_grace_period=120` defaults
- Add runners_per_instance input to action.yml and runner.yml
- Modify __main__.py to generate multiple tokens per instance
- Update StartAWS to handle grouped tokens for multiple runners

- Add runners_per_instance parameter to action.yml and runner.yml (default: 1)
- Generate multiple GitHub runner tokens upfront for multi-runner instances
- Update template to register and start multiple runners in separate directories
- Each runner gets its own directory (runner-0, runner-1, etc.)
- Update termination logic to handle all runners on instance
- Maintain backward compatibility with single runner mode
- Update output handling to provide array of all runner labels

This allows a single EC2 instance to host multiple GitHub runners,
enabling concurrent job execution without launching separate instances.
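
Conceptually, the per-runner setup in the template looks something like the sketch below (variable names, paths, and the backgrounded run.sh are illustrative assumptions, not the template's actual contents):

```bash
#!/usr/bin/env bash
# Sketch only: register and start N runners in separate directories on one instance.
# TOKENS, RUNNER_LABELS, REPO_URL, and RUNNERS_PER_INSTANCE are hypothetical placeholders.
set -euo pipefail

for i in $(seq 0 $((RUNNERS_PER_INSTANCE - 1))); do
  dir="$HOME/runner-$i"
  mkdir -p "$dir"
  tar -xzf /tmp/actions-runner.tar.gz -C "$dir"
  cd "$dir"
  # Each runner gets its own registration token, name, and label
  ./config.sh --unattended \
    --url "$REPO_URL" \
    --token "${TOKENS[$i]}" \
    --name "$(hostname)-$i" \
    --labels "${RUNNER_LABELS[$i]}"
  nohup ./run.sh > "$dir/runner.log" 2>&1 &   # leave each runner running in the background
done
```

Each ./config.sh call consumes one of the pre-generated tokens, which is why the action now requests runners_per_instance tokens per instance up front.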
The RUNNER_INITIAL_GRACE_PERIOD and RUNNER_GRACE_PERIOD values were
being correctly passed to the instance and set in the runner's .env
file, but the systemd service that runs the termination check didn't
have access to these environment variables.

Added Environment= directives to the systemd service to pass the
grace period values through, so they'll be available when the
termination check script runs.
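
The change is roughly equivalent to having the template emit something like this (unit name, script path, and variable names are illustrative):

```bash
# Sketch: the generated unit gains Environment= lines so the termination check
# script sees the same grace periods that were written to the runner's .env file.
cat > /etc/systemd/system/runner-termination-check.service <<EOF
[Unit]
Description=Runner termination check

[Service]
Environment=RUNNER_INITIAL_GRACE_PERIOD=${RUNNER_INITIAL_GRACE_PERIOD}
Environment=RUNNER_GRACE_PERIOD=${RUNNER_GRACE_PERIOD}
ExecStart=/usr/local/bin/termination-check.sh
EOF
systemctl daemon-reload
```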
The termination check script was failing to deregister runners because
$homedir was inside a single-quoted heredoc ('EOFT'), preventing
variable substitution. This caused the script to look for /runner-*
instead of /home/ubuntu/runner-*.

Fixed by:
- Changing heredoc delimiter from 'EOFT' to EOFT (allows substitution)
- Escaping shell variables with \ to preserve them for runtime
- Ensuring $homedir is substituted at template generation time

This ensures runners are properly deregistered before instance termination,
preventing orphaned runners in GitHub.
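
In miniature, the quoting change looks like this (file name and loop body are illustrative):

```bash
# Before: the quoted delimiter ('EOFT') suppresses all substitution, so the
# generated script contains a literal "$homedir" that expands to nothing at
# runtime and globs /runner-* instead of /home/ubuntu/runner-*.
cat > termination-check.sh <<'EOFT'
for d in $homedir/runner-*; do
  echo "checking $d"
done
EOFT

# After: an unquoted delimiter lets $homedir expand now, at template-generation
# time, while variables needed at runtime are escaped with \$ so they survive.
cat > termination-check.sh <<EOFT
for d in $homedir/runner-*; do
  echo "checking \$d"
done
EOFT
```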
Added a 'debug' input parameter that enables verbose output (set -x) in
the runner setup script. This helps with troubleshooting without making
verbose output the default.

Changes:
- Added 'debug' input to action.yml and runner.yml workflow
- Pass debug parameter through Python code to template
- Conditionally enable 'set -x' only when debug is true
- Keep runner-debug.log output regardless (useful for post-mortem)

The debug mode can be enabled by setting debug: true in the workflow
that calls ec2-gha.
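
In the setup script this amounts to something like the following (log path and variable name are assumptions):

```bash
# Sketch: always capture a setup log for post-mortems, but only enable shell
# tracing when the workflow passed debug: true through to the template.
exec > >(tee -a /var/log/runner-debug.log) 2>&1
if [ "${DEBUG:-false}" = "true" ]; then
  set -x
fi
```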
Added checks for critical system dependencies and made the script more
portable across different Linux distributions:

- Check for Python3 availability (critical dependency)
- Support both dpkg (Debian/Ubuntu) and rpm (RHEL/CentOS/Amazon Linux)
  for CloudWatch agent installation
- Support both curl and wget for downloading files
- Fail gracefully with clear error messages when critical tools are missing

This makes ec2-gha more flexible for use with different AMIs beyond
just Ubuntu, while maintaining backward compatibility.
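
The pattern is roughly the following (error messages, package URLs, and variable names are placeholders):

```bash
# Sketch of the portability checks described above.
command -v python3 >/dev/null 2>&1 || { echo "ERROR: python3 is required" >&2; exit 1; }

# Use whichever downloader is available
if command -v curl >/dev/null 2>&1; then
  fetch() { curl -fsSL -o "$2" "$1"; }
elif command -v wget >/dev/null 2>&1; then
  fetch() { wget -qO "$2" "$1"; }
else
  echo "ERROR: neither curl nor wget found" >&2; exit 1
fi

# Install the CloudWatch agent with the native package tool
if command -v dpkg >/dev/null 2>&1; then
  fetch "$CLOUDWATCH_DEB_URL" /tmp/amazon-cloudwatch-agent.deb
  dpkg -i /tmp/amazon-cloudwatch-agent.deb
elif command -v rpm >/dev/null 2>&1; then
  fetch "$CLOUDWATCH_RPM_URL" /tmp/amazon-cloudwatch-agent.rpm
  rpm -U /tmp/amazon-cloudwatch-agent.rpm
else
  echo "WARNING: no supported package manager; skipping CloudWatch agent" >&2
fi
```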
Major improvements to runner configuration:

1. Parallelization (when Python3 available):
   - Moved runner setup logic to a shell function 'configure_runner'
   - Python now uses ThreadPoolExecutor to configure up to 4 runners in parallel
   - Significantly reduces setup time for multiple runners

2. Python3-free fallback:
   - Single runner can now be configured without Python3
   - Uses shell-based JSON parsing (basic sed extraction)
   - Limited to single runner due to JSON parsing complexity in pure shell
   - Allows ec2-gha to work on minimal AMIs without Python3

3. Better separation of concerns:
   - Shell function handles all runner setup logic
   - Python only handles JSON parsing and parallelization
   - Makes the code more maintainable and testable

The default path uses Python3 for reliability and performance, but the
fallback ensures basic functionality on minimal AMIs.
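
For the Python3-free fallback, the "basic sed extraction" is along these lines (the response shape matches GitHub's registration-token endpoint, but the exact command in the template may differ):

```bash
# Sketch: extract a single "token" field from a JSON response without python3.
# Good enough for the single-runner case; multi-runner config stays on the Python path.
response='{"token":"ABC123DEF","expires_at":"2025-09-15T05:00:00Z"}'
token=$(printf '%s' "$response" | sed -n 's/.*"token"[[:space:]]*:[[:space:]]*"\([^"]*\)".*/\1/p')
echo "$token"   # -> ABC123DEF
```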
ryan-williams marked this pull request as ready for review September 15, 2025 04:56

Each instance gets a unique runner label and can execute jobs independently. This is useful for:
- Matrix builds that need isolated environments
- Parallel testing across different configurations

Member:

All the instances are identical, right? So this doesn't add anything in terms of isolation or configuration variance over doing a matrix with an instance_count of 1? (Just faster?)

gh = GitHubInstance(token=token, repo=repo)

# Pass runners_per_instance to StartAWS
params["runners_per_instance"] = runners_per_instance

Member:

AFAICT we're popping this out of this dict just above and then putting it back in here.

grouped_tokens = []
for i in range(0, total_runners, runners_per_instance):
    grouped_tokens.append(all_tokens[i:i+runners_per_instance])
params["grouped_runner_tokens"] = grouped_tokens

Member:

Ok, if I understand correctly, with runners_per_instance == 1:

  • We pass count = instance_count, not grouped_runner_tokens
  • gha_runner's __post_init__ makes count tokens and stores them in cloud_params['gh_runner_tokens']
  • In StartAWS.create_instances we then read the number of instances from len(gh_runner_tokens).

And with runners_per_instance > 1:

  • We pass count = instance_count and grouped_runner_tokens
  • gha_runner's __post_init__ makes count tokens and stores them in cloud_params['gh_runner_tokens']
  • In StartAWS.create_instances we see that grouped_runner_tokens is populated so we ignore gh_runner_tokens (and so also count) and get the number of instances from the length of grouped_runner_tokens instead.

This seems a little confusing and has a few minor knock-on effects (e.g. create_instances checks for gh_runner_tokens but not grouped_runner_tokens; the former is documented in init but the latter is not; count is meaningless if you pass grouped_runner_tokens). Could we have a single mechanism here? Perhaps we pass runners_per_instance to StartAWS and have it populate runner_tokens (always with groups?) in __post_init__, where it currently does it for one runner?

Comment on lines +471 to +489
# Check UserData size before calling AWS
user_data_size = len(params.get("UserData", ""))
if user_data_size > 16384:
    raise ValueError(
        f"UserData exceeds AWS limit: {user_data_size} bytes (limit: 16384 bytes, "
        f"over by: {user_data_size - 16384} bytes). "
        f"Template needs to be reduced by at least {user_data_size - 16384} bytes."
    )

try:
    result = ec2.run_instances(**params)
except Exception as e:
    if "User data is limited to 16384 bytes" in str(e):
        # This shouldn't happen if our check above works, but just in case
        raise ValueError(
            f"UserData exceeds AWS limit: {user_data_size} bytes (limit: 16384 bytes, "
            f"over by: {user_data_size - 16384} bytes)"
        ) from e
    raise

Member:

🐑 this seems like a lot; maybe remove the pre- and post-checks and let the exception get raised by run_instances? Or perhaps catch + re-raise with add_note about reducing template size if UserData is too large?

# Check job files and update timestamps for active runners
# This creates a heartbeat mechanism to detect stuck/failed job completion
for job_file in $J/*.job; do
  [ -f "$job_file" ] || continue

Member:

How could this trigger? Are there non-normal files in this directory?

done

# Ensure activity file exists and get its timestamp
[ ! -f "$A" ] && touch "$A"

Member:

🐑 Is this the same as [ -f "$A" ] || touch "$A"? If so, can we use the same style as above? (Or use an if statement for both?)
