
Implement ORB AWS EC2 Worker Adapter #525

Open
magniloquency wants to merge 88 commits into finos:main from magniloquency:orb

Conversation


@magniloquency (Contributor) commented Jan 22, 2026

This pull request implements the ORB AWS EC2 Worker Adapter (orb_aws_ec2), enabling Scaler to dynamically scale worker instances on AWS using the ORB Python SDK (replacing the original CLI-based approach).

Key Changes

  • ORB SDK integration: Replaced ORBHelper (subprocess wrapper around the orb CLI) with direct ORBClient SDK usage. Config is built in-memory from ORBAWSEC2WorkerAdapterConfig fields — no file copying, no temp dirs for templates or user data.
  • Removed orb_config_path: No longer needed; the provider config is constructed programmatically via _build_app_config().
  • Async polling: _poll_for_instance_id() is now fully async (asyncio.sleep instead of time.sleep + run_in_executor).
  • AMI Building: Introduced ami/ directory with Packer configuration (opengris-scaler.pkr.hcl) and a build script (build.sh) to create AMIs pre-configured with opengris-scaler.
  • Configuration: Added ORBAWSEC2WorkerAdapterConfig for detailed adapter settings. Removed redundant top-level event_loop and worker_io_threads fields — these are now inherited from the standard worker_config.
  • Multi-worker support: Each EC2 instance runs cpu_count - 1 workers, where cpu_count is determined by the machine type configured by the user.
  • Unified entry points: The ORB AWS EC2 worker manager is fully integrated into scaler_worker_manager (as the orb_aws_ec2 subcommand) and the scaler all-in-one launcher (type = "orb_aws_ec2" in [[worker_manager]]). The dedicated scaler_worker_manager_orb entry point has been removed.
  • Scheduler Fix: The WorkerAdapterController now waits for a pending command to complete before sending a new one, preventing duplicate StartWorkerGroup commands during the long ORB polling period.
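The async-polling change in the list above can be sketched as follows. This is an illustrative stand-in, not the adapter's real API: the function name, the `describe_instance` callback, and the defaults are assumptions; the point is replacing `time.sleep` + `run_in_executor` with a plain `await asyncio.sleep`.

```python
import asyncio

# Hypothetical sketch of a fully-async polling loop; `describe_instance`
# stands in for whatever coroutine returns the instance id (or None while
# the instance is still booting).
async def poll_for_instance_id(describe_instance, poll_interval=5.0, max_attempts=60):
    """Await the instance id without blocking the event loop."""
    for _ in range(max_attempts):
        instance_id = await describe_instance()
        if instance_id is not None:
            return instance_id
        # Non-blocking pause, unlike time.sleep — other adapter tasks keep running.
        await asyncio.sleep(poll_interval)
    raise TimeoutError("instance id never became available")
```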

Bug Fixes

  • TooManyWorkers suppression during EC2 boot: The suppression logic was inverted, causing the scheduler to resume spamming StartWorkers immediately after receiving TooManyWorkers. Replaced the Set-based approach with a baseline Dict that records the managed worker count at the time TooManyWorkers was received — suppression is now held until at least one booting instance registers.
  • Zero-worker default on single-core machines: DEFAULT_MAX_TASK_CONCURRENCY was cpu_count() - 1, which evaluates to 0 on single-core machines. Removed the subtraction so at least one worker is started by default.
  • ORB strategy defaults workaround: When ORBClient is initialised with app_config=, it skips _load_strategy_defaults(), causing RunInstances to be absent from supported APIs. Fixed by including provider_defaults.aws.handlers explicitly in _build_app_config().
  • Lazy-import orb: Moved the from orb import ORBClient import inside _run() to defer it until the adapter is actually used, fixing CI test failures when orb is not installed.
  • User data updated: EC2 instances now launch workers via scaler_worker_manager baremetal_native --mode fixed, replacing the deprecated scaler_cluster command.
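The single-core default fix is small enough to sketch directly. The function name below is hypothetical (in scaler the value is a module-level constant); the behavioural change is dropping the `- 1`.

```python
import os

# Illustrative sketch of the fix: the old default was os.cpu_count() - 1,
# which is 0 on a single-core machine, so no worker would start.
def default_max_task_concurrency() -> int:
    # os.cpu_count() can return None on exotic platforms; fall back to 1.
    return os.cpu_count() or 1
```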

Dependencies

  • Added orb-py and boto3 to the orb_aws_ec2 and all extras (optional dependency groups) in pyproject.toml

Usage

Workers launched by the ORB AWS EC2 adapter are EC2 instances that connect back to the scheduler over the network. The scheduler address, object storage address, and object storage server must all be externally reachable from those instances (e.g. a private VPC IP or public IP).

```toml
# stack.toml

[scheduler]
scheduler_address = "tcp://0.0.0.0:8516"
object_storage_address = "tcp://127.0.0.1:8517"

[object_storage_server]
object_storage_address = "tcp://0.0.0.0:8517"

[[worker_manager]]
type = "orb_aws_ec2"
scheduler_address = "tcp://<scheduler_external_ip>:8516"
object_storage_address = "tcp://<oss_external_ip>:8517"
image_id = "ami-0528819f94f4f5fa5"
instance_type = "t3.medium"
aws_region = "us-east-1"
```

```shell
scaler stack.toml
# or
scaler_worker_manager orb_aws_ec2 tcp://<scheduler_external_ip>:8516 \
    --object-storage-address tcp://<oss_external_ip>:8517 \
    --image-id ami-0528819f94f4f5fa5 \
    --instance-type t3.medium \
    --aws-region us-east-1
```

@magniloquency force-pushed the orb branch 13 times, most recently from 1339a57 to 1ebb8d9 on January 29, 2026
@magniloquency force-pushed the orb branch 9 times, most recently from 0182b36 to 2c2573a on February 10, 2026
- Include submit_tasks.py in examples readme and documentation.
- Implement skip_examples.txt for top-level examples in CI.
- Add submit_tasks.py to skip_examples.txt as it requires a running scheduler.

magniloquency and others added 6 commits March 16, 2026 21:51
- Use main's new worker_managers/ docs structure (PR finos#611)
- Move worker_manager_adapter/orb.rst to worker_managers/orb.rst
- Add ORB entry to worker_managers/index.rst
- Accept deletion of reorganized files (examples.rst, worker_manager_adapter/index.rst, common_parameters.rst)
Resolved conflicts:
- pyproject.toml: keep unified scaler_worker_manager + scaler entry points from main, retain scaler_worker_manager_orb from orb branch
- README.md: use main's unified CLI command naming in TOML section table, add orb_worker_adapter row
- tests/config/test_config_class.py: use bytes literal (b""") from main for mock_open read_data
- docs/source/tutorials/configuration.rst: accept deletion from main

- Register ORBWorkerAdapterConfig with _tag = "orb" for discriminator-based
  TOML parsing in the scaler all-in-one launcher
- Add orb subcommand to scaler_worker_manager dispatcher
- Add ORBWorkerAdapterConfig to WorkerManagerUnion in scaler.py
- Remove redundant top-level event_loop and worker_io_threads fields from
  ORBWorkerAdapterConfig in favour of the existing worker_config equivalents
- Update docs (commands.rst, orb.rst) and README to reflect the unified entry point
- Add tests for orb subcommand parsing, TOML config, and _run_worker_manager dispatch
The orb worker manager is now accessible via the unified
scaler_worker_manager orb subcommand, making the dedicated
entry point redundant.

When ORBClient is initialised with app_config=, its _ensure_raw_config()
merges only default_config.json (which has provider_defaults: {}) with the
caller-supplied dict, skipping the _load_strategy_defaults() call that
normally loads aws_defaults.json.  As a result get_effective_handlers()
returns {} and RunInstances is absent from supported_apis, causing:

  ApplicationError: Provider does not support API 'RunInstances'. Supported APIs: []

Fix by including provider_defaults.aws.handlers explicitly in
_build_app_config() so the RunInstances handler definition is always
present regardless of how ORB loads its config.
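The workaround can be sketched as a plain dict builder. All key names below mirror the commit message, but the real config schema is defined by the ORB SDK, so treat this as an assumption-laden illustration rather than the adapter's actual `_build_app_config()`.

```python
# Hedged sketch: merge the AWS handler defaults into the app config
# explicitly, so the RunInstances handler is always declared even though
# passing app_config= makes ORB skip _load_strategy_defaults().
def build_app_config(region: str, aws_handler_defaults: dict) -> dict:
    return {
        "provider": "aws",
        "region": region,
        # Included explicitly; with app_config= ORB would otherwise leave
        # provider_defaults empty and report "Supported APIs: []".
        "provider_defaults": {"aws": {"handlers": dict(aws_handler_defaults)}},
    }
```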
When the ORB adapter is at capacity it returns TooManyWorkers, but the
scheduler's worker count (based on received heartbeats) may still be
below max_task_concurrency because newly-created instances haven't sent
their first heartbeat yet. This caused the scheduler to re-request a
worker on every heartbeat, spamming the log.

Fix: track sources that have returned TooManyWorkers and suppress new
StartWorkers requests for that source until the scheduler's own worker
count drops below max_task_concurrency (indicating a worker left and
the ORB adapter has freed up capacity).

Also fix a latent bug in all three scaling policies where the capacity
check `len(managed) >= max_task_concurrency` is always True when
max_task_concurrency == -1 (unlimited), blocking all scaling.
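The corrected capacity check from the last paragraph, as a minimal sketch (the function name is illustrative): with the old comparison, `len(managed) >= -1` was always true, so `-1` meaning "unlimited" blocked all scaling.

```python
# Sketch of the fixed capacity check: treat max_task_concurrency == -1
# as unlimited instead of letting the >= comparison always succeed.
def at_capacity(managed_worker_count: int, max_task_concurrency: int) -> bool:
    if max_task_concurrency == -1:  # unlimited: never at capacity
        return False
    return managed_worker_count >= max_task_concurrency
```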
The module-level `from orb import ORBClient as orb` caused CI tests to
fail when patching ORBWorkerAdapter, because importing the module
triggered the import of `orb` which is not installed in CI. Moving the
import inside `_run()` defers it until the adapter is actually used.
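A minimal sketch of the lazy-import pattern (the class name is illustrative; `orb`/`ORBClient` are as named in the commit): keeping the import out of module scope means merely importing, or patching, the adapter module never requires `orb` to be installed.

```python
# Hypothetical sketch: the orb import lives inside _run(), so CI can
# import and patch this module without orb installed; ImportError only
# surfaces if the adapter is actually run.
class ORBAWSEC2WorkerAdapterSketch:
    def _run(self):
        from orb import ORBClient  # deferred dependency import
        ...  # adapter logic would use ORBClient here
```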
Replace the deprecated scaler_cluster command with scaler_worker_manager
baremetal_native, passing --mode fixed and --worker-manager-id sourced
from ec2-metadata.

Fix incorrect version.txt path in build.sh (was two levels up, should be three), and add the newly built AMI ami-0b76605999d8f5d2b for scaler 1.26.4 / Python 3.13 to the ORB docs table.

DEFAULT_MAX_TASK_CONCURRENCY was cpu_count() - 1, which evaluates to 0
on single-core machines. Remove the subtraction so at least one worker
is started by default.

The _at_capacity_sources clearing condition was inverted: it cleared
suppression when managed_worker_ids < max_task_concurrency, which is
exactly the case during EC2 boot (0 workers, instance not yet registered).
This caused the scheduler to resume spamming StartWorkers on the very
next heartbeat after receiving TooManyWorkers.

Replace the Set-based approach with a baseline Dict that records the
managed worker count at the time TooManyWorkers was received. Suppression
is now held until the scheduler's view of workers grows beyond that
baseline, i.e. at least one booting instance has sent its first heartbeat.
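The baseline-Dict scheme can be sketched as follows; the class and method names are illustrative, not the scheduler's real identifiers. The key property is that suppression survives while the managed count sits at (or below) the recorded baseline, and lifts only once it grows past it.

```python
# Illustrative sketch of baseline-Dict suppression: record the managed
# worker count when TooManyWorkers arrives, and hold suppression until
# the count rises above that baseline (a booting instance registered).
class TooManyWorkersSuppression:
    def __init__(self) -> None:
        self._baseline: dict = {}  # source -> managed count at TooManyWorkers

    def on_too_many_workers(self, source: str, managed_count: int) -> None:
        self._baseline[source] = managed_count

    def should_suppress(self, source: str, managed_count: int) -> bool:
        baseline = self._baseline.get(source)
        if baseline is None:
            return False
        if managed_count > baseline:  # first heartbeat from a new instance
            del self._baseline[source]
            return False
        return True
```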
@magniloquency marked this pull request as ready for review March 27, 2026
@magniloquency requested review from gxuu and rafa-be March 27, 2026
# Conflicts:
#	src/scaler/scheduler/controllers/policies/simple_policy/scaling/fixed_elastic.py
@magniloquency changed the title from "Implement ORB Worker Adapter" to "Implement ORB AWS EC2 Worker Adapter" on Mar 27, 2026
Renames all identifiers, file names, directories, config tags, CLI
subcommands, docs, README, and tests from `orb` / `ORBWorkerAdapter` to
`orb_aws_ec2` / `ORBAWSEC2WorkerAdapter` to make clear this adapter is
specifically for AWS EC2 via the ORB SDK.
@sharpener6 enabled auto-merge (squash) March 27, 2026