66 commits
870b888
Bump default: `runner_grace_period=60`
ryan-williams Aug 13, 2025
87141fe
Add `runner_poll_interval`, update README, templ fix
ryan-williams Aug 13, 2025
601ef7d
Improve Tags
ryan-williams Aug 13, 2025
0b7eed6
Add CLAUDE.md
ryan-williams Aug 13, 2025
d071dd0
`instance_name`
ryan-williams Aug 13, 2025
d8ffe17
`scripts/instance-runtime.py`
ryan-williams Aug 14, 2025
fc11173
v2 + timer fix
ryan-williams Aug 18, 2025
ed964f5
"merge" v2
ryan-williams Aug 19, 2025
38aa700
Implement multiple runners per instance support
ryan-williams Aug 19, 2025
ad60f7c
Fix runner grace period environment variables in systemd service
ryan-williams Aug 19, 2025
641aaca
Fix runner deregistration by properly substituting homedir path
ryan-williams Aug 19, 2025
0fee210
Add optional debug mode for troubleshooting
ryan-williams Aug 19, 2025
ee915ce
Make system dependencies more robust and cross-platform
ryan-williams Aug 19, 2025
ed21f50
Parallelize runner setup and add Python3-free fallback
ryan-williams Aug 19, 2025
f7101c6
demo-archs: {Ubuntu,Debian} x {AMD,ARM}
ryan-williams Aug 19, 2025
7a3a563
typing lints, `environ`
ryan-williams Aug 20, 2025
ea6bb7d
sleep 10m in debug mode only
ryan-williams Aug 20, 2025
1f87795
CLAUDE.md
ryan-williams Aug 20, 2025
b80f959
`demo-dbg-minimal.yml`
ryan-williams Aug 20, 2025
4189ba8
CW: `/tmp/runner-*-config.log`
ryan-williams Aug 20, 2025
a096a5b
debug mode shutdown
ryan-williams Aug 20, 2025
05e1e73
`./bin/installdependencies.sh` for AL2023 / Debian 13
ryan-williams Aug 20, 2025
5ba8ef1
`shared-functions.sh.templ`
ryan-williams Aug 20, 2025
41c79f6
runner-common fix
ryan-williams Aug 20, 2025
7db6296
gha-runner@rw/dns: print PublicDnsName to GHA UI
ryan-williams Aug 20, 2025
f76b72c
azl deps fix
ryan-williams Aug 20, 2025
5b81d71
compress templ
ryan-williams Aug 20, 2025
4dc62f0
err handling
ryan-williams Aug 20, 2025
b9e5743
allow some runners to proceed, even if others fail to init
ryan-williams Aug 20, 2025
147526b
heredoc fixes
ryan-williams Aug 20, 2025
7d04a6e
mv `shared-functions.sh{.templ,}`
ryan-williams Aug 20, 2025
9104e97
demo-archs fixes
ryan-williams Aug 20, 2025
12f1088
improve runner labels, factor `template_vars`
ryan-williams Aug 20, 2025
b23db5f
`demo-cpu-sweep.yml`
ryan-williams Aug 20, 2025
cf175b5
`name` param
ryan-williams Aug 20, 2025
f5a7f81
demos refactor
ryan-williams Aug 20, 2025
b99370e
cpu-sweep names
ryan-williams Aug 20, 2025
c03a8c6
gpu-sweep conda fix
ryan-williams Aug 20, 2025
dbc6f25
sweeps
ryan-williams Aug 21, 2025
de4a8a5
`instance_name`s, `$run`
ryan-williams Aug 21, 2025
3fa0203
docs, $run, default $name
ryan-williams Aug 21, 2025
1fb4726
`outputs.matrix`, rm `runners,mapping`
ryan-williams Aug 21, 2025
15cb127
rename: `outputs.{matrix,mtx}`
ryan-williams Aug 21, 2025
34cc783
`Open-Athena/gha-runner@v1`
ryan-williams Aug 21, 2025
dd43541
skip `installdependencies.sh` if deps are present
ryan-williams Aug 21, 2025
6193a32
cpu-sweep name tweak
ryan-williams Aug 21, 2025
83b77b4
`requires-python = ">= 3.10"`
ryan-williams Sep 7, 2025
404c6f9
add uv.lock
ryan-williams Sep 7, 2025
c3f3b2e
lint, comment update
ryan-williams Sep 15, 2025
1597611
rm vestigial `gpu-benchmark.py`
ryan-williams Sep 15, 2025
7401d0f
`MAX_INSTANCE_LIFETIME = "120" # 2 hours (in minutes)`
ryan-williams Sep 17, 2025
0b464c6
add trailing newlines
ryan-williams Sep 17, 2025
01dd723
`Check if any runners are actually running`
ryan-williams Sep 17, 2025
6059133
more robust `MAX_LIFETIME_MINUTES` shutdown logic
ryan-williams Sep 17, 2025
854396e
Add placeholder workflow for demo-disk-full
ryan-williams Sep 18, 2025
fba0891
tweak cpu-sweep Debian job names
ryan-williams Sep 18, 2025
56c5ca7
`CLAUDE.md`: clearer `update-snapshots` comment
ryan-williams Sep 18, 2025
7a6d11e
DL userdata payload at runtime
ryan-williams Sep 18, 2025
cc64d49
`Debug mode: false=off, true/trace=set -x only, number=set -x + sleep…
ryan-williams Sep 18, 2025
d04bbe7
`ec2_root_device_size: Root disk size in GB (0=AMI default, +N=AMI+N …
ryan-williams Sep 18, 2025
0792e03
"heartbeat" job files, detect stale jobs / out-of-disk, add `test-dis…
ryan-williams Sep 18, 2025
1fe02b0
more robust `debug_sleep_and_shutdown`
ryan-williams Sep 18, 2025
7425302
`terminate_instance` when no runners registered successfully
ryan-williams Sep 24, 2025
d4c1714
CW init fix, fail on CW init failure
ryan-williams Sep 24, 2025
bb72f09
reformat / de-obfuscate some vars
ryan-williams Sep 24, 2025
0dde623
CW amd fix
ryan-williams Sep 24, 2025
78 changes: 0 additions & 78 deletions .github/test-scripts/gpu-benchmark.py

This file was deleted.

105 changes: 105 additions & 0 deletions .github/workflows/README.md
@@ -0,0 +1,105 @@
# `ec2-gha` Demos
This directory contains the reusable ec2-gha workflow and a set of demo workflows that exercise its features.

For documentation about the main workflow, [`runner.yml`](runner.yml), see [the main README](../../README.md).

<!-- toc -->
- [`demos` – run all demo workflows](#demos)
- [Core demos](#core)
  - [`dbg-minimal` – configurable debugging instance](#dbg-minimal)
  - [`gpu-minimal` – `nvidia-smi` "hello world"](#gpu-minimal)
  - [`cpu-sweep` – OS/architecture matrix](#cpu-sweep)
  - [`gpu-sweep` – GPU instance types with PyTorch](#gpu-sweep)
- [Parallelization](#parallel)
  - [`instances-mtx` – multiple instances for parallel jobs](#instances-mtx)
  - [`runners-mtx` – multiple runners on single instance](#runners-mtx)
  - [`jobs-split` – different job types on separate instances](#jobs-split)
- [Stress testing](#stress-tests)
  - [`test-disk-full` – disk-full scenario testing](#test-disk-full)
- [Real-world example: Mamba installation testing](#mamba)
<!-- /toc -->

## [`demos`](demos.yml) – run all demo workflows <a id="demos"></a>
Runs every demo workflow. Useful as a regression test: it both demonstrates and verifies the features listed below.

[![](../../img/demos%2325%201.png)][demos#25]

## Core demos <a id="core"></a>

### [`dbg-minimal`](demo-dbg-minimal.yml) – configurable debugging instance <a id="dbg-minimal"></a>
- `workflow_dispatch` with customizable parameters (instance type, AMI, timeouts)
- Also callable via `workflow_call` (used by `cpu-sweep`)
- Extended debug mode for troubleshooting
- **Instance type:** `t3.large` (default), configurable
- **Use case:** Interactive debugging and testing

### [`gpu-minimal`](demo-gpu-minimal.yml) – `nvidia-smi` "hello world" <a id="gpu-minimal"></a>
- **Instance type:** `g4dn.xlarge`

### [`cpu-sweep`](demo-cpu-sweep.yml) – OS/architecture matrix <a id="cpu-sweep"></a>
- Tests 12 combinations across operating systems and architectures
- **OS:** Ubuntu 22.04/24.04, Debian 11/12, AL2, AL2023
- **Architectures:** x86 (`t3.*`) and ARM (`t4g.*`)
- Calls `dbg-minimal` for each combination
- **Use case:** Cross-platform compatibility testing
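
The 12 combinations are the product of 6 operating systems and 2 architectures. As a rough illustration only (the workflow itself lists each entry by hand in its `matrix.include`), the Python sketch below enumerates the same grid. The instance sizes mirror the ones in `demo-cpu-sweep.yml`; the helper name `sweep_matrix` is hypothetical, and AMI IDs are left out.

```python
from itertools import product

# OS -> instance size used for that OS in demo-cpu-sweep.yml (AMI IDs omitted).
OS_SIZES = {
    "Ubuntu 22.04": "medium",
    "Ubuntu 24.04": "medium",
    "Debian 11": "large",
    "Debian 12": "large",
    "AL2": "small",
    "AL2023": "small",
}
# Architecture -> EC2 instance family.
ARCH_FAMILIES = {"x86": "t3", "ARM": "t4g"}


def sweep_matrix() -> list[dict[str, str]]:
    """Return one entry per OS/architecture combination (hypothetical helper)."""
    return [
        {
            "os": os_name,
            "arch": arch,
            "instance_type": f"{ARCH_FAMILIES[arch]}.{OS_SIZES[os_name]}",
            # Mirrors the job name `${{ matrix.os }} ${{ matrix.arch }}`.
            "name": f"{os_name} {arch}",
        }
        for os_name, arch in product(OS_SIZES, ARCH_FAMILIES)
    ]


if __name__ == "__main__":
    for entry in sweep_matrix():
        print(entry["name"], entry["instance_type"])
```

Each entry corresponds to one call to `dbg-minimal`.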

### [`gpu-sweep`](demo-gpu-sweep.yml) – GPU instance types with PyTorch <a id="gpu-sweep"></a>
- Tests different GPU instance families
- **Instance types:** `g4dn.xlarge`, `g5.xlarge`, `g6.xlarge`, `g5g.xlarge` (ARM64 + GPU)
- Uses Deep Learning OSS PyTorch 2.5.1 AMIs
- Activates conda environment and runs PyTorch CUDA tests
- **Use case:** GPU compatibility and performance testing

## Parallelization <a id="parallel"></a>

### [`instances-mtx`](demo-instances-mtx.yml) – multiple instances for parallel jobs <a id="instances-mtx"></a>
- Creates configurable number of instances (default: 3)
- Uses matrix strategy to run jobs in parallel
- Each job runs on its own EC2 instance
- **Instance type:** `t3.medium`
- **Use case:** Parallel test execution, distributed builds

### [`runners-mtx`](demo-runners-mtx.yml) – multiple runners on single instance <a id="runners-mtx"></a>
- Configurable runners per instance (default: 3)
- All runners share the same instance resources
- Demonstrates resource-efficient parallel execution
- **Instance type:** `t3.xlarge` (larger instance for multiple runners)
- **Use case:** Shared environment testing, resource optimization

### [`jobs-split`](demo-jobs-split.yml) – different job types on separate instances <a id="jobs-split"></a>
- Launches 2 instances
- Build job runs on first instance
- Test job runs on second instance
- Demonstrates targeted job placement
- **Instance type:** `t3.medium`
- **Use case:** Pipeline with dedicated instances per stage

## Stress testing <a id="stress-tests"></a>

### [`test-disk-full`](test-disk-full.yml) – disk-full scenario testing <a id="test-disk-full"></a>
- Tests runner behavior when disk space is exhausted
- **Configurable parameters:**
  - `disk_size`: Root disk size (`0`=AMI default, `+N`=AMI+N GB, e.g., `+2`)
  - `fill_strategy`: How to fill disk (`gradual`, `immediate`, or `during-tests`)
  - `debug`: Debug mode (`false`, `true`/`trace`, or number for trace+sleep)
  - `max_instance_lifetime`: Maximum lifetime before forced shutdown (default: 15 minutes)
- **Features tested:**
  - Heartbeat mechanism for detecting stuck jobs
  - Stale job file detection and cleanup
  - Worker/Listener process monitoring
  - Robust shutdown with multiple fallback methods
- **Instance type:** `t3.medium` (default)
- **Use case:** Verifying robustness in resource-constrained environments
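
The `disk_size` and `debug` value conventions above can be read as small parsing rules. The Python sketch below is one possible interpretation, not ec2-gha's actual code: the names `parse_debug`, `DebugMode`, and `resolve_disk_size` are hypothetical. It also assumes that a plain `N` means an absolute size in GB and that an empty `debug` value means "off"; the unit of the numeric `debug` value is not stated here, so it is passed through unchanged.

```python
from typing import NamedTuple, Optional


class DebugMode(NamedTuple):
    trace: bool           # whether shell tracing (`set -x`) would be enabled
    sleep: Optional[int]  # numeric debug value, if given (unit not specified here)


def parse_debug(value: str) -> DebugMode:
    """Interpret a `debug` input: `false`, `true`/`trace`, or a number."""
    v = value.strip().lower()
    if v in ("", "false"):  # assumption: empty string also means "off"
        return DebugMode(trace=False, sleep=None)
    if v in ("true", "trace"):
        return DebugMode(trace=True, sleep=None)
    if v.isdigit():
        return DebugMode(trace=True, sleep=int(v))
    raise ValueError(f"unrecognized debug value: {value!r}")


def resolve_disk_size(spec: str, ami_size_gb: int) -> int:
    """Resolve a `disk_size` spec to a root volume size in GB."""
    s = spec.strip()
    if s.startswith("+"):
        # `+N`: the AMI's default size plus N GB.
        return ami_size_gb + int(s[1:])
    size = int(s)
    # `0`: keep the AMI default. Assumption: any other N is an absolute size.
    return ami_size_gb if size == 0 else size
```

For example, with an 8 GB AMI, `resolve_disk_size("+2", 8)` gives a 10 GB root volume, which leaves little headroom for a disk-full test.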

## Real-world example: [Mamba installation testing](https://github.com/Open-Athena/mamba/blob/gha/.github/workflows/install.yaml) <a id="mamba"></a>
- Tests different versions of `mamba_ssm` package on GPU instances
- **Customizes `instance_name`**: `"$repo/$name==${{ inputs.mamba_version }} (#$run)"`
  - Results in descriptive names like `"mamba/install==2.2.5 (#123)"`
  - Makes it easy to identify which version is being tested on each instance
- Uses pre-installed PyTorch from DLAMI conda environment
- **Use case:** Package compatibility testing across versions
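
To make the `instance_name` expansion concrete, here is a minimal Python sketch of `$var`-style substitution using the standard library's `string.Template`. It only illustrates the idea and may not match how ec2-gha renders names: `render_instance_name` is a hypothetical helper, and the literal `2.2.5` stands in for `${{ inputs.mamba_version }}`, which GitHub Actions expands before the template is seen.

```python
from string import Template


def render_instance_name(template: str, values: dict[str, object]) -> str:
    """Expand `$var` placeholders; unknown placeholders are left as-is."""
    return Template(template).safe_substitute(values)


if __name__ == "__main__":
    name = render_instance_name(
        "$repo/$name==2.2.5 (#$run)",
        {"repo": "mamba", "name": "install", "run": 123},
    )
    print(name)  # mamba/install==2.2.5 (#123)
```

`safe_substitute` is used rather than `substitute` so that a template mentioning a variable that is not supplied does not raise an error.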

[![](../../img/mamba%2312.png)][mamba#12]

[mamba#12]: https://github.com/Open-Athena/mamba/actions/runs/16972369660/
[demos#25]: https://github.com/Open-Athena/ec2-gha/actions/runs/17004697889
70 changes: 0 additions & 70 deletions .github/workflows/demo-archs.yml

This file was deleted.

44 changes: 44 additions & 0 deletions .github/workflows/demo-cpu-sweep.yml
@@ -0,0 +1,44 @@
name: Demo – OS/Architecture sweep (CPU nodes)
on:
  workflow_dispatch:
    inputs:
      sleep:
        description: "Sleep duration in seconds for workload simulation"
        required: false
        type: string
        default: "10"
  workflow_call:  # Can be called from other workflows
    inputs:
      sleep:
        required: false
        type: string
        default: "10"
permissions:
  id-token: write  # Required for AWS OIDC authentication
  contents: read   # Required for actions/checkout
jobs:
  os-arch-matrix:
    uses: ./.github/workflows/demo-dbg-minimal.yml
    strategy:
      fail-fast: false
      matrix:
        include:
          - { "os": Ubuntu 22.04, "arch": x86, "ami": ami-021589336d307b577, "instance_type": t3.medium  }
          - { "os": Ubuntu 22.04, "arch": ARM, "ami": ami-06daf9c2d2cf1cb37, "instance_type": t4g.medium }
          - { "os": Ubuntu 24.04, "arch": x86, "ami": ami-0ca5a2f40c2601df6, "instance_type": t3.medium  }
          - { "os": Ubuntu 24.04, "arch": ARM, "ami": ami-0aa307ed50ca3e58f, "instance_type": t4g.medium }
          - { "os": Debian 11   , "arch": x86, "ami": ami-0e6612f57082e7ea4, "instance_type": t3.large   }
          - { "os": Debian 11   , "arch": ARM, "ami": ami-0c3f5b0b87f042da8, "instance_type": t4g.large  }
          - { "os": Debian 12   , "arch": x86, "ami": ami-05b50089e01b13194, "instance_type": t3.large   }
          - { "os": Debian 12   , "arch": ARM, "ami": ami-0505441d7e1514742, "instance_type": t4g.large  }
          - { "os": AL2         , "arch": x86, "ami": ami-0e2c86481225d3c51, "instance_type": t3.small   }
          - { "os": AL2         , "arch": ARM, "ami": ami-08333c9352b93f31e, "instance_type": t4g.small  }
          - { "os": AL2023      , "arch": x86, "ami": ami-00ca32bbc84273381, "instance_type": t3.small   }
          - { "os": AL2023      , "arch": ARM, "ami": ami-0aa7db6294d00216f, "instance_type": t4g.small  }
    name: ${{ matrix.os }} ${{ matrix.arch }}
    with:
      type: ${{ matrix.instance_type }}
      ami: ${{ matrix.ami }}
      sleep: ${{ inputs.sleep }}
      instance_name: 'cpu-sweep#$run ${{ matrix.os }} ${{ matrix.arch }}'
    secrets: inherit