
Implement ORB AWS EC2 Worker Adapter #525

Open
magniloquency wants to merge 88 commits into finos:main from magniloquency:orb

Conversation


@magniloquency (Contributor) commented Jan 22, 2026

This pull request implements the ORB AWS EC2 Worker Adapter (orb_aws_ec2), enabling Scaler to dynamically scale worker instances on AWS using the ORB Python SDK (replacing the original CLI-based approach).

Key Changes

  • ORB SDK integration: Replaced ORBHelper (subprocess wrapper around the orb CLI) with direct ORBClient SDK usage. Config is built in-memory from ORBAWSEC2WorkerAdapterConfig fields — no file copying, no temp dirs for templates or user data.
  • Removed orb_config_path: No longer needed; the provider config is constructed programmatically via _build_app_config().
  • Async polling: _poll_for_instance_id() is now fully async (asyncio.sleep instead of time.sleep + run_in_executor).
  • AMI Building: Introduced ami/ directory with Packer configuration (opengris-scaler.pkr.hcl) and a build script (build.sh) to create AMIs pre-configured with opengris-scaler.
  • Configuration: Added ORBAWSEC2WorkerAdapterConfig for detailed adapter settings. Removed redundant top-level event_loop and worker_io_threads fields — these are now inherited from the standard worker_config.
  • Multi-worker support: Each EC2 instance runs cpu_count - 1 workers, where cpu_count is determined by the machine type configured by the user.
  • Unified entry points: The ORB AWS EC2 worker manager is fully integrated into scaler_worker_manager (as the orb_aws_ec2 subcommand) and the scaler all-in-one launcher (type = "orb_aws_ec2" in [[worker_manager]]). The dedicated scaler_worker_manager_orb entry point has been removed.
  • Scheduler Fix: The WorkerAdapterController now waits for a pending command to complete before sending a new one, preventing duplicate StartWorkerGroup commands during the long ORB polling period.
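The async-polling change in the list above can be sketched as follows. This is an illustrative stand-in, not the adapter's real API: the function name, the `describe_instance` callback, and the defaults are assumptions; the point is replacing `time.sleep` + `run_in_executor` with a plain `await asyncio.sleep`.

```python
import asyncio

# Hypothetical sketch of a fully-async polling loop; `describe_instance`
# stands in for whatever coroutine returns the instance id (or None while
# the instance is still booting).
async def poll_for_instance_id(describe_instance, poll_interval=5.0, max_attempts=60):
    """Await the instance id without blocking the event loop."""
    for _ in range(max_attempts):
        instance_id = await describe_instance()
        if instance_id is not None:
            return instance_id
        # Non-blocking pause, unlike time.sleep — other adapter tasks keep running.
        await asyncio.sleep(poll_interval)
    raise TimeoutError("instance id never became available")
```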

Bug Fixes

  • TooManyWorkers suppression during EC2 boot: The suppression logic was inverted, causing the scheduler to resume spamming StartWorkers immediately after receiving TooManyWorkers. Replaced the Set-based approach with a baseline Dict that records the managed worker count at the time TooManyWorkers was received — suppression is now held until at least one booting instance registers.
  • Zero-worker default on single-core machines: DEFAULT_MAX_TASK_CONCURRENCY was cpu_count() - 1, which evaluates to 0 on single-core machines. Removed the subtraction so at least one worker is started by default.
  • ORB strategy defaults workaround: When ORBClient is initialised with app_config=, it skips _load_strategy_defaults(), causing RunInstances to be absent from supported APIs. Fixed by including provider_defaults.aws.handlers explicitly in _build_app_config().
  • Lazy-import orb: Moved the from orb import ORBClient import inside _run() to defer it until the adapter is actually used, fixing CI test failures when orb is not installed.
  • User data updated: EC2 instances now launch workers via scaler_worker_manager baremetal_native --mode fixed, replacing the deprecated scaler_cluster command.
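The single-core default fix is small enough to sketch directly. The function name below is hypothetical (in scaler the value is a module-level constant); the behavioural change is dropping the `- 1`.

```python
import os

# Illustrative sketch of the fix: the old default was os.cpu_count() - 1,
# which is 0 on a single-core machine, so no worker would start.
def default_max_task_concurrency() -> int:
    # os.cpu_count() can return None on exotic platforms; fall back to 1.
    return os.cpu_count() or 1
```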

Dependencies

  • Added orb-py and boto3 to the orb_aws_ec2 and all extras (optional dependency groups) in pyproject.toml

Usage

Workers launched by the ORB AWS EC2 adapter are EC2 instances that connect back to the scheduler over the network. The scheduler address, object storage address, and object storage server must all be externally reachable from those instances (e.g. a private VPC IP or public IP).

```toml
# stack.toml

[scheduler]
scheduler_address = "tcp://0.0.0.0:8516"
object_storage_address = "tcp://127.0.0.1:8517"

[object_storage_server]
object_storage_address = "tcp://0.0.0.0:8517"

[[worker_manager]]
type = "orb_aws_ec2"
scheduler_address = "tcp://<scheduler_external_ip>:8516"
object_storage_address = "tcp://<oss_external_ip>:8517"
image_id = "ami-0528819f94f4f5fa5"
instance_type = "t3.medium"
aws_region = "us-east-1"
```

```shell
scaler stack.toml
# or
scaler_worker_manager orb_aws_ec2 tcp://<scheduler_external_ip>:8516 \
    --object-storage-address tcp://<oss_external_ip>:8517 \
    --image-id ami-0528819f94f4f5fa5 \
    --instance-type t3.medium \
    --aws-region us-east-1
```

@magniloquency force-pushed the orb branch 13 times, most recently from 1339a57 to 1ebb8d9 on January 29, 2026
@magniloquency force-pushed the orb branch 9 times, most recently from 0182b36 to 2c2573a on February 10, 2026
- Include submit_tasks.py in examples readme and documentation.
- Implement skip_examples.txt for top-level examples in CI.
- Add submit_tasks.py to skip_examples.txt as it requires a running scheduler.

magniloquency and others added 6 commits March 16, 2026 21:51
- Use main's new worker_managers/ docs structure (PR finos#611)
- Move worker_manager_adapter/orb.rst to worker_managers/orb.rst
- Add ORB entry to worker_managers/index.rst
- Accept deletion of reorganized files (examples.rst, worker_manager_adapter/index.rst, common_parameters.rst)
Resolved conflicts:
- pyproject.toml: keep unified scaler_worker_manager + scaler entry points from main, retain scaler_worker_manager_orb from orb branch
- README.md: use main's unified CLI command naming in TOML section table, add orb_worker_adapter row
- tests/config/test_config_class.py: use bytes literal (b""") from main for mock_open read_data
- docs/source/tutorials/configuration.rst: accept deletion from main

- Register ORBWorkerAdapterConfig with _tag = "orb" for discriminator-based
  TOML parsing in the scaler all-in-one launcher
- Add orb subcommand to scaler_worker_manager dispatcher
- Add ORBWorkerAdapterConfig to WorkerManagerUnion in scaler.py
- Remove redundant top-level event_loop and worker_io_threads fields from
  ORBWorkerAdapterConfig in favour of the existing worker_config equivalents
- Update docs (commands.rst, orb.rst) and README to reflect the unified entry point
- Add tests for orb subcommand parsing, TOML config, and _run_worker_manager dispatch
The orb worker manager is now accessible via the unified
scaler_worker_manager orb subcommand, making the dedicated
entry point redundant.

When ORBClient is initialised with app_config=, its _ensure_raw_config()
merges only default_config.json (which has provider_defaults: {}) with the
caller-supplied dict, skipping the _load_strategy_defaults() call that
normally loads aws_defaults.json.  As a result get_effective_handlers()
returns {} and RunInstances is absent from supported_apis, causing:

  ApplicationError: Provider does not support API 'RunInstances'. Supported APIs: []

Fix by including provider_defaults.aws.handlers explicitly in
_build_app_config() so the RunInstances handler definition is always
present regardless of how ORB loads its config.
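The workaround can be sketched as a plain dict builder. All key names below mirror the commit message, but the real config schema is defined by the ORB SDK, so treat this as an assumption-laden illustration rather than the adapter's actual `_build_app_config()`.

```python
# Hedged sketch: merge the AWS handler defaults into the app config
# explicitly, so the RunInstances handler is always declared even though
# passing app_config= makes ORB skip _load_strategy_defaults().
def build_app_config(region: str, aws_handler_defaults: dict) -> dict:
    return {
        "provider": "aws",
        "region": region,
        # Included explicitly; with app_config= ORB would otherwise leave
        # provider_defaults empty and report "Supported APIs: []".
        "provider_defaults": {"aws": {"handlers": dict(aws_handler_defaults)}},
    }
```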
When the ORB adapter is at capacity it returns TooManyWorkers, but the
scheduler's worker count (based on received heartbeats) may still be
below max_task_concurrency because newly-created instances haven't sent
their first heartbeat yet. This caused the scheduler to re-request a
worker on every heartbeat, spamming the log.

Fix: track sources that have returned TooManyWorkers and suppress new
StartWorkers requests for that source until the scheduler's own worker
count drops below max_task_concurrency (indicating a worker left and
the ORB adapter has freed up capacity).

Also fix a latent bug in all three scaling policies where the capacity
check `len(managed) >= max_task_concurrency` is always True when
max_task_concurrency == -1 (unlimited), blocking all scaling.
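The corrected capacity check from the last paragraph, as a minimal sketch (the function name is illustrative): with the old comparison, `len(managed) >= -1` was always true, so `-1` meaning "unlimited" blocked all scaling.

```python
# Sketch of the fixed capacity check: treat max_task_concurrency == -1
# as unlimited instead of letting the >= comparison always succeed.
def at_capacity(managed_worker_count: int, max_task_concurrency: int) -> bool:
    if max_task_concurrency == -1:  # unlimited: never at capacity
        return False
    return managed_worker_count >= max_task_concurrency
```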
The module-level `from orb import ORBClient as orb` caused CI tests to
fail when patching ORBWorkerAdapter, because importing the module
triggered the import of `orb` which is not installed in CI. Moving the
import inside `_run()` defers it until the adapter is actually used.
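A minimal sketch of the lazy-import pattern (the class name is illustrative; `orb`/`ORBClient` are as named in the commit): keeping the import out of module scope means merely importing, or patching, the adapter module never requires `orb` to be installed.

```python
# Hypothetical sketch: the orb import lives inside _run(), so CI can
# import and patch this module without orb installed; ImportError only
# surfaces if the adapter is actually run.
class ORBAWSEC2WorkerAdapterSketch:
    def _run(self):
        from orb import ORBClient  # deferred dependency import
        ...  # adapter logic would use ORBClient here
```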
Replace the deprecated scaler_cluster command with scaler_worker_manager
baremetal_native, passing --mode fixed and --worker-manager-id sourced
from ec2-metadata.

Fix incorrect version.txt path in build.sh (was two levels up, should be three), and add the newly built AMI ami-0b76605999d8f5d2b for scaler 1.26.4 / Python 3.13 to the ORB docs table.

DEFAULT_MAX_TASK_CONCURRENCY was cpu_count() - 1, which evaluates to 0
on single-core machines. Remove the subtraction so at least one worker
is started by default.

The _at_capacity_sources clearing condition was inverted: it cleared
suppression when managed_worker_ids < max_task_concurrency, which is
exactly the case during EC2 boot (0 workers, instance not yet registered).
This caused the scheduler to resume spamming StartWorkers on the very
next heartbeat after receiving TooManyWorkers.

Replace the Set-based approach with a baseline Dict that records the
managed worker count at the time TooManyWorkers was received. Suppression
is now held until the scheduler's view of workers grows beyond that
baseline, i.e. at least one booting instance has sent its first heartbeat.
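The baseline-Dict scheme can be sketched as follows; the class and method names are illustrative, not the scheduler's real identifiers. The key property is that suppression survives while the managed count sits at (or below) the recorded baseline, and lifts only once it grows past it.

```python
# Illustrative sketch of baseline-Dict suppression: record the managed
# worker count when TooManyWorkers arrives, and hold suppression until
# the count rises above that baseline (a booting instance registered).
class TooManyWorkersSuppression:
    def __init__(self) -> None:
        self._baseline: dict = {}  # source -> managed count at TooManyWorkers

    def on_too_many_workers(self, source: str, managed_count: int) -> None:
        self._baseline[source] = managed_count

    def should_suppress(self, source: str, managed_count: int) -> bool:
        baseline = self._baseline.get(source)
        if baseline is None:
            return False
        if managed_count > baseline:  # first heartbeat from a new instance
            del self._baseline[source]
            return False
        return True
```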
@magniloquency marked this pull request as ready for review March 27, 2026
@magniloquency requested review from gxuu and rafa-be March 27, 2026
# Conflicts:
#	src/scaler/scheduler/controllers/policies/simple_policy/scaling/fixed_elastic.py
@magniloquency changed the title from "Implement ORB Worker Adapter" to "Implement ORB AWS EC2 Worker Adapter" on Mar 27, 2026
Renames all identifiers, file names, directories, config tags, CLI
subcommands, docs, README, and tests from `orb` / `ORBWorkerAdapter` to
`orb_aws_ec2` / `ORBAWSEC2WorkerAdapter` to make clear this adapter is
specifically for AWS EC2 via the ORB SDK.
@sharpener6 enabled auto-merge (squash) March 27, 2026