Open-Athena · ryan-williams · Aug 13, 2025 · Aug 13, 2025 · Aug 13, 2025 · Aug 13, 2025
diff --git a/.github/test-scripts/gpu-benchmark.py b/.github/test-scripts/gpu-benchmark.py
diff --git a/.github/workflows/README.md b/.github/workflows/README.md
@@ -0,0 +1,105 @@
+# `ec2-gha` Demos
+This directory contains the reusable workflow and demo workflows for ec2-gha, demonstrating various capabilities.
+
+For documentation about the main workflow, [`runner.yml`](runner.yml), see [the main README](../../README.md).
+
+<!-- toc -->
+- [`demos` – run all demo workflows](#demos)
+- [Core demos](#core)
+    - [`dbg-minimal` – configurable debugging instance](#dbg-minimal)
+    - [`gpu-minimal` – `nvidia-smi` "hello world"](#gpu-minimal)
+    - [`cpu-sweep` – OS/architecture matrix](#cpu-sweep)
+    - [`gpu-sweep` – GPU instance types with PyTorch](#gpu-sweep)
+- [Parallelization](#parallel)
+    - [`instances-mtx` – multiple instances for parallel jobs](#instances-mtx)
+    - [`runners-mtx` – multiple runners on single instance](#runners-mtx)
+    - [`jobs-split` – different job types on separate instances](#jobs-split)
+- [Stress testing](#stress-tests)
+    - [`test-disk-full` – disk-full scenario testing](#test-disk-full)
+- [Real-world example: Mamba installation testing](#mamba)
+<!-- /toc -->
+
+## [`demos`](demos.yml) – run all demo workflows <a id="demos"></a>
+Useful regression test, demonstrates and verifies features.
+
+[![](../../img/demos%2325%201.png)][demos#25]
+
+## Core demos <a id="core"></a>
+
+### [`dbg-minimal`](demo-dbg-minimal.yml) – configurable debugging instance <a id="dbg-minimal"></a>
+- `workflow_dispatch` with customizable parameters (instance type, AMI, timeouts)
+- Also callable via `workflow_call` (used by `cpu-sweep`)
+- Extended debug mode for troubleshooting
+- **Instance type:** `t3.large` (default), configurable
+- **Use case:** Interactive debugging and testing
+
+### [`gpu-minimal`](demo-gpu-minimal.yml) – `nvidia-smi` "hello world" <a id="gpu-minimal"></a>
+- **Instance type:** `g4dn.xlarge`
+
+### [`cpu-sweep`](demo-cpu-sweep.yml) – OS/architecture matrix <a id="cpu-sweep"></a>
+- Tests 12 combinations across operating systems and architectures
+- **OS:** Ubuntu 22.04/24.04, Debian 11/12, AL2, AL2023
+- **Architectures:** x86 (`t3.*`) and ARM (`t4g.*`)
+- Calls `dbg-minimal` for each combination
+- **Use case:** Cross-platform compatibility testing
+
+### [`gpu-sweep`](demo-gpu-sweep.yml) – GPU instance types with PyTorch <a id="gpu-sweep"></a>
+- Tests different GPU instance families
+- **Instance types:** `g4dn.xlarge`, `g5.xlarge`, `g6.xlarge`, `g5g.xlarge` (ARM64 + GPU)
+- Uses Deep Learning OSS PyTorch 2.5.1 AMIs
+- Activates conda environment and runs PyTorch CUDA tests
+- **Use case:** GPU compatibility and performance testing
+
+## Parallelization <a id="parallel"></a>
+
+### [`instances-mtx`](demo-instances-mtx.yml) – multiple instances for parallel jobs <a id="instances-mtx"></a>
+- Creates configurable number of instances (default: 3)
+- Uses matrix strategy to run jobs in parallel
+- Each job runs on its own EC2 instance
+- **Instance type:** `t3.medium`
+- **Use case:** Parallel test execution, distributed builds
+
+### [`runners-mtx`](demo-runners-mtx.yml) – multiple runners on single instance <a id="runners-mtx"></a>
+- Configurable runners per instance (default: 3)
+- All runners share the same instance resources
+- Demonstrates resource-efficient parallel execution
+- **Instance type:** `t3.xlarge` (larger instance for multiple runners)
+- **Use case:** Shared environment testing, resource optimization
+
+### [`jobs-split`](demo-jobs-split.yml) – different job types on separate instances <a id="jobs-split"></a>
+- Launches 2 instances
+- Build job runs on first instance
+- Test job runs on second instance
+- Demonstrates targeted job placement
+- **Instance type:** `t3.medium`
+- **Use case:** Pipeline with dedicated instances per stage
+
+## Stress testing <a id="stress-tests"></a>
+
+### [`test-disk-full`](test-disk-full.yml) – disk-full scenario testing <a id="test-disk-full"></a>
+- Tests runner behavior when disk space is exhausted
+- **Configurable parameters:**
+  - `disk_size`: Root disk size (`0`=AMI default, `+N`=AMI+N GB, e.g., `+2`)
+  - `fill_strategy`: How to fill disk (`gradual`, `immediate`, or `during-tests`)
+  - `debug`: Debug mode (`false`, `true`/`trace`, or number for trace+sleep)
+  - `max_instance_lifetime`: Maximum lifetime before forced shutdown (default: 15 minutes)
+- **Features tested:**
+  - Heartbeat mechanism for detecting stuck jobs
+  - Stale job file detection and cleanup
+  - Worker/Listener process monitoring
+  - Robust shutdown with multiple fallback methods
+- **Instance type:** `t3.medium` (default)
+- **Use case:** Verifying robustness in resource-constrained environments
+
+## Real-world example: [Mamba installation testing](https://github.com/Open-Athena/mamba/blob/gha/.github/workflows/install.yaml) <a id="mamba"></a>
+- Tests different versions of `mamba_ssm` package on GPU instances
+- **Customizes `instance_name`**: `"$repo/$name==${{ inputs.mamba_version }} (#$run)"`
+  - Results in descriptive names like `"mamba/install==2.2.5 (#123)"`
+  - Makes it easy to identify which version is being tested on each instance
+- Uses pre-installed PyTorch from DLAMI conda environment
+- **Use case:** Package compatibility testing across versions
+
+[![](../../img/mamba%2312.png)][mamba#12]
+
+[mamba#12]: https://github.com/Open-Athena/mamba/actions/runs/16972369660/
+[demos#25]: https://github.com/Open-Athena/ec2-gha/actions/runs/17004697889
diff --git a/.github/workflows/demo-archs.yml b/.github/workflows/demo-archs.yml
diff --git a/.github/workflows/demo-cpu-sweep.yml b/.github/workflows/demo-cpu-sweep.yml
@@ -0,0 +1,44 @@
+name: Demo – OS/Architecture sweep (CPU nodes)
+on:
+  workflow_dispatch:
+    inputs:
+      sleep:
+        description: "Sleep duration in seconds for workload simulation"
+        required: false
+        type: string
+        default: "10"
+  workflow_call:   # Can be called from other workflows
+    inputs:
+      sleep:
+        required: false
+        type: string
+        default: "10"
+permissions:
+  id-token: write  # Required for AWS OIDC authentication
+  contents: read   # Required for actions/checkout
+jobs:
+  os-arch-matrix:
+    uses: ./.github/workflows/demo-dbg-minimal.yml
+    strategy:
+      fail-fast: false
+      matrix:
+        include:
+          - { "os": Ubuntu 22.04, "arch": x86, "ami": ami-021589336d307b577, "instance_type":  t3.medium }
+          - { "os": Ubuntu 22.04, "arch": ARM, "ami": ami-06daf9c2d2cf1cb37, "instance_type": t4g.medium }
+          - { "os": Ubuntu 24.04, "arch": x86, "ami": ami-0ca5a2f40c2601df6, "instance_type":  t3.medium }
+          - { "os": Ubuntu 24.04, "arch": ARM, "ami": ami-0aa307ed50ca3e58f, "instance_type": t4g.medium }
+          - { "os": Debian 11   , "arch": x86, "ami": ami-0e6612f57082e7ea4, "instance_type":   t3.large }
+          - { "os": Debian 11   , "arch": ARM, "ami": ami-0c3f5b0b87f042da8, "instance_type":  t4g.large }
+          - { "os": Debian 12   , "arch": x86, "ami": ami-05b50089e01b13194, "instance_type":   t3.large }
+          - { "os": Debian 12   , "arch": ARM, "ami": ami-0505441d7e1514742, "instance_type":  t4g.large }
+          - { "os":    AL2      , "arch": x86, "ami": ami-0e2c86481225d3c51, "instance_type":   t3.small }
+          - { "os":    AL2      , "arch": ARM, "ami": ami-08333c9352b93f31e, "instance_type":  t4g.small }
+          - { "os": AL2023      , "arch": x86, "ami": ami-00ca32bbc84273381, "instance_type":   t3.small }
+          - { "os": AL2023      , "arch": ARM, "ami": ami-0aa7db6294d00216f, "instance_type":  t4g.small }
+    name: ${{ matrix.os }} ${{ matrix.arch }}
+    with:
+      type: ${{ matrix.instance_type }}
+      ami: ${{ matrix.ami }}
+      sleep: ${{ inputs.sleep }}
+      instance_name: 'cpu-sweep#$run ${{ matrix.os }} ${{ matrix.arch }}'
+    secrets: inherit