Multi-runner support (on one instance), multi-{OS,arch} demos #3
Also fix a few inconsistent `runner_initial_grace_period=120` defaults
- Add runners_per_instance input to action.yml and runner.yml
- Modify __main__.py to generate multiple tokens per instance
- Update StartAWS to handle grouped tokens for multiple runners
- Add runners_per_instance parameter to action.yml and runner.yml (default: 1)
- Generate multiple GitHub runner tokens upfront for multi-runner instances
- Update template to register and start multiple runners in separate directories
- Each runner gets its own directory (runner-0, runner-1, etc.)
- Update termination logic to handle all runners on instance
- Maintain backward compatibility with single runner mode
- Update output handling to provide array of all runner labels
This allows a single EC2 instance to host multiple GitHub runners, enabling concurrent job execution without launching separate instances.
The RUNNER_INITIAL_GRACE_PERIOD and RUNNER_GRACE_PERIOD values were being correctly passed to the instance and set in the runner's .env file, but the systemd service that runs the termination check didn't have access to these environment variables. Added Environment= directives to the systemd service to pass the grace period values through, so they'll be available when the termination check script runs.
The termination check script was failing to deregister runners because $homedir was inside a single-quoted heredoc ('EOFT'), preventing variable substitution. This caused the script to look for /runner-* instead of /home/ubuntu/runner-*.
Fixed by:
- Changing heredoc delimiter from 'EOFT' to EOFT (allows substitution)
- Escaping shell variables with \ to preserve them for runtime
- Ensuring $homedir is substituted at template generation time
This ensures runners are properly deregistered before instance termination, preventing orphaned runners in GitHub.
Added a 'debug' input parameter that enables verbose output (set -x) in the runner setup script. This helps with troubleshooting without always having verbose output.
Changes:
- Added 'debug' input to action.yml and runner.yml workflow
- Pass debug parameter through Python code to template
- Conditionally enable 'set -x' only when debug is true
- Keep runner-debug.log output regardless (useful for post-mortem)
The debug mode can be enabled by setting debug: true in the workflow that calls ec2-gha.
Added checks for critical system dependencies and made the script more portable across different Linux distributions:
- Check for Python3 availability (critical dependency)
- Support both dpkg (Debian/Ubuntu) and rpm (RHEL/CentOS/Amazon Linux) for CloudWatch agent installation
- Support both curl and wget for downloading files
- Fail gracefully with clear error messages when critical tools are missing
This makes ec2-gha more flexible for use with different AMIs beyond just Ubuntu, while maintaining backward compatibility.
Major improvements to runner configuration:
1. Parallelization (when Python3 available):
   - Moved runner setup logic to a shell function 'configure_runner'
   - Python now uses ThreadPoolExecutor to configure up to 4 runners in parallel
   - Significantly reduces setup time for multiple runners
2. Python3-free fallback:
   - Single runner can now be configured without Python3
   - Uses shell-based JSON parsing (basic sed extraction)
   - Limited to single runner due to JSON parsing complexity in pure shell
   - Allows ec2-gha to work on minimal AMIs without Python3
3. Better separation of concerns:
   - Shell function handles all runner setup logic
   - Python only handles JSON parsing and parallelization
   - Makes the code more maintainable and testable
The default path uses Python3 for reliability and performance, but the fallback ensures basic functionality on minimal AMIs.
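A minimal sketch of the parallelization described in this commit message, assuming the tokens arrive as a JSON array and that each runner is set up by a callable taking an index and a token. The names `configure_all` and `fake_configure` are hypothetical and do not come from the PR; in the real template the callable would presumably shell out to `configure_runner`.

```python
import json
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 4  # "up to 4 runners in parallel", per the commit message


def configure_all(tokens_json, configure):
    """Parse a JSON array of tokens and configure each runner concurrently.

    `configure` is a hypothetical callable (index, token) -> result.
    Results are returned in runner-index order.
    """
    tokens = json.loads(tokens_json)
    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        futures = [pool.submit(configure, i, tok) for i, tok in enumerate(tokens)]
        # .result() re-raises any exception from the worker thread
        return [f.result() for f in futures]


def fake_configure(index, token):
    """Illustrative stand-in: just names the runner directory."""
    return f"runner-{index}"


if __name__ == "__main__":
    print(configure_all('["tok-a", "tok-b", "tok-c"]', fake_configure))
```

Collecting results from the futures in submission order keeps the output aligned with the runner-0, runner-1, ... directory naming, regardless of which runner finishes first.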
some shared-functions factoring
… N minutes before shutdown`
…GB for testing, e.g. +2)`
…k-full.yml` example
Each instance gets a unique runner label and can execute jobs independently. This is useful for:
- Matrix builds that need isolated environments
- Parallel testing across different configurations
All the instances are identical, right? So this doesn't add anything in terms of isolation or configuration variance over doing a matrix with an instance_count of 1? (Just faster?)
gh = GitHubInstance(token=token, repo=repo)

# Pass runners_per_instance to StartAWS
params["runners_per_instance"] = runners_per_instance
AFAICT we're popping this out of this dict just above and then putting it back in here.
grouped_tokens = []
for i in range(0, total_runners, runners_per_instance):
    grouped_tokens.append(all_tokens[i:i+runners_per_instance])
params["grouped_runner_tokens"] = grouped_tokens
Ok, if I understand correctly, with `runners_per_instance == 1`:
- We pass count = `instance_count`, not `grouped_runner_tokens`
- gha_runner's `__post_init__` makes `count` tokens and stores them in `cloud_params['gh_runner_tokens']`
- In StartAWS.create_instances we then read the number of instances from `len(gh_runner_tokens)`.
And with `runners_per_instance > 1`:
- We pass count = `instance_count` and `grouped_runner_tokens`
- gha_runner's `__post_init__` makes `count` tokens and stores them in `cloud_params['gh_runner_tokens']`
- In StartAWS.create_instances we see that `grouped_runner_tokens` is populated, so we ignore `gh_runner_tokens` (and so also `count`) and get the number of instances from the length of `grouped_runner_tokens` instead.
This seems a little confusing and has a few minor knock-on effects (e.g. create_instances checks for `gh_runner_tokens` but not `grouped_runner_tokens`; the former is documented in init but the latter is not; count is meaningless if you pass `grouped_runner_tokens`). Could we have a single mechanism here? Perhaps we pass runners_per_instance to StartAWS and have it populate runner_tokens (always with groups?) in `__post_init__`, where it currently does it for one runner?
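To make the suggestion concrete, here is a rough sketch of what a single mechanism could look like. It assumes tokens can be produced by a zero-argument callable; `StartAWSSketch` and `make_token` are hypothetical names for illustration and are not the actual StartAWS class or gha_runner API.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class StartAWSSketch:
    """Hypothetical stand-in for StartAWS; not the project's real class."""

    make_token: Callable[[], str]  # assumed token factory (illustrative)
    instance_count: int = 1
    runners_per_instance: int = 1
    # Always a list of groups: one inner list of tokens per instance.
    runner_tokens: List[List[str]] = field(init=False)

    def __post_init__(self):
        # One code path for both cases: a single runner is a group of size 1.
        self.runner_tokens = [
            [self.make_token() for _ in range(self.runners_per_instance)]
            for _ in range(self.instance_count)
        ]

    def instances_to_launch(self) -> int:
        # The instance count is always derived from the number of groups.
        return len(self.runner_tokens)
```

With this shape, `runners_per_instance == 1` simply yields groups of length one, so create_instances would only ever need to check a single attribute.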
# Check UserData size before calling AWS
user_data_size = len(params.get("UserData", ""))
if user_data_size > 16384:
    raise ValueError(
        f"UserData exceeds AWS limit: {user_data_size} bytes (limit: 16384 bytes, "
        f"over by: {user_data_size - 16384} bytes). "
        f"Template needs to be reduced by at least {user_data_size - 16384} bytes."
    )

try:
    result = ec2.run_instances(**params)
except Exception as e:
    if "User data is limited to 16384 bytes" in str(e):
        # This shouldn't happen if our check above works, but just in case
        raise ValueError(
            f"UserData exceeds AWS limit: {user_data_size} bytes (limit: 16384 bytes, "
            f"over by: {user_data_size - 16384} bytes)"
        ) from e
    raise
🐑 this seems like a lot; maybe remove this pre- and post- check and let the exception get raised by run_instances? Or perhaps catch + re-raise with `add_note` about reducing template size if userdata is too large?
# Check job files and update timestamps for active runners
# This creates a heartbeat mechanism to detect stuck/failed job completion
for job_file in $J/*.job; do
  [ -f "$job_file" ] || continue
How could this trigger? Are there non-normal files in this directory?
done

# Ensure activity file exists and get its timestamp
[ ! -f "$A" ] && touch "$A"
🐑 Is this the same as `[ -f "$A" ] || touch "$A"`? If so, can we use the same style as above? (Or use an if statement for both?)
README demos section shows some improvements since #2:
Examples
demos#32
mamba#771
Ocean_Emulator#308
CloudWatch console
EC2 instances console (showing many terminat{ed,ing} instances)