Skip to content

Merge native adapters#21

Open
magniloquency wants to merge 16 commits intoremove-cluster-classfrom
merge-native-adapters
Open

Merge native adapters#21
magniloquency wants to merge 16 commits intoremove-cluster-classfrom
merge-native-adapters

Conversation

@magniloquency
Copy link
Copy Markdown
Owner

No description provided.

magniloquency and others added 16 commits March 4, 2026 22:16
Adds a `NativeWorkerAdapterMode` enum (DYNAMIC / FIXED) to
`NativeWorkerAdapter`, replacing the separate `FixedNativeWorkerAdapter`.

Key changes:
- `NativeWorkerAdapterConfig`: new `mode` field, `-n`/`--num-of-workers`
  alias, and FIXED-mode validation (`max_workers >= 0`)
- `NativeWorkerAdapter`: unified `run()` entry point with lazy ZMQ setup
  (`_setup_zmq()`); FIXED mode pre-spawns workers via
  `_spawn_initial_workers()`, rejects dynamic scale commands, and
  self-exits via `_monitor_workers()` when all workers have died
- `cluster.py` entry point: now constructs `NativeWorkerAdapter` directly
  in FIXED mode instead of importing the deleted fixed-native entry point
- `combo.py`: replaces `FixedNativeWorkerAdapter` with
  `NativeWorkerAdapter(mode=FIXED)` launched as a subprocess
- Tests: updated to use the subprocess pattern (`Process(target=adapter.run)`)
- Docs: `native.rst` extended to cover `--mode`; `fixed_native.rst` deleted;
  index, configuration, and README updated accordingly
- Deleted: `fixed_native.py`, `fixed_native_worker_adapter.py`,
  `worker_manager_baremetal_fixed_native.py`,
  `run_worker_manager_baremetal_fixed_native.py`, `fixed_native.rst`
- Replace stale Cluster/ClusterConfig imports with NativeWorkerAdapter/NativeWorkerAdapterConfig
- Access adapter config via cluster._worker_adapter instead of removed cluster._cluster
- Launch GPU adapter as a subprocess (Process(target=adapter.run)) consistent with current API
- Add explicit cleanup of gpu_adapter_process before cluster.shutdown()
- Extend isort/black/pflake8/mypy checks to include examples/ directory
- Document examples/task_capabilities.py in the Files Modified table
Adds amendment-1.md documenting three changes to the merge plan:
simplified FIXED mode run path, config/TOML section rename from
NativeWorkerAdapterConfig to NativeWorkerManagerConfig, and new
--worker-type parameter. Implements all three changes across source,
tests, examples, and docs.
Fixed mode has no event loop or scheduler connector; workers connect
directly to the scheduler themselves.
Resolved conflicts by applying PR finos#583 intent (merge FixedNativeWorkerAdapter
into NativeWorkerAdapter) on top of main's renames:

- NativeWorkerAdapter → NativeWorkerManager
- worker_adapter_config → worker_manager_config (WorkerManagerConfig)
- Import path: native_worker_adapter.py → native_worker_manager.py
- combo.py uses NativeWorkerManager with NativeWorkerManagerMode.FIXED and
  multiprocessing.Process pattern; stored as _worker_manager
- Deleted fixed_native files (fixed_native.py, fixed_native_worker_manager.py,
  worker_manager_baremetal_fixed_native.py, fixed_native.rst) as intended by finos#583
- Updated cluster.py, all affected tests and examples accordingly
- Fixed mypy/pflake8: missing multiprocessing import in test_client.py,
  stale NativeWorkerAdapter reference in cluster.py
- Signal handling and register_event_loop in _run_fixed() to prevent
  orphaned worker subprocesses on shutdown
- Initialize _ident to None in __init__ so it exists before _setup_zmq()
- Type hint on configure_parser parser parameter
- Start scheduler before worker manager process in combo.py
- Clarifying comment on WorkerGroupInfo.task_arn in ecs.py
Merged from remove-cluster-class. When the main task completes,
background tasks created via loop.create_task() (the
create_async_loop_routine coroutines) may still be running as orphans.
Previously, loop.close() was called immediately, causing those tasks to
receive RuntimeError: Event loop is closed.

Now all pending tasks are cancelled and awaited before the cleanup
callback and loop.close(), following the standard asyncio shutdown
pattern.
* Remove Cluster class, replace with FixedNativeWorkerAdapter

Cluster was a multiprocessing.Process subclass that wrapped
FixedNativeWorkerAdapter in a subprocess with an asyncio event loop
solely to handle signals. This intermediate process layer is removed;
workers are now direct children of SchedulerClusterCombo.

- Delete Cluster, ClusterConfig, and the cluster entry point module
- Redirect scaler_cluster and run_cluster.py to the fixed native
  worker adapter entry point (with --num-of-workers alias for compat)
- Add SIGINT/SIGTERM handling to the fixed native adapter entry point
- Update SchedulerClusterCombo, tests, and examples to use
  FixedNativeWorkerAdapter directly
- Update ECS adapter to use scaler_cluster with updated parameters
  (--max-workers replaces --num-of-workers/--worker-names; worker IDs
  are no longer pre-announced since workers self-assign UUIDs)

* bump version

* Add -n as backward compat alias for scaler_cluster

* scaler_cluster uses [cluster] TOML section for backward compat

Refactor fixed native entry point to accept a configurable section
name, so scaler_cluster reads [cluster] while
scaler_worker_adapter_fixed_native continues to read
[fixed_native_worker_adapter].

* Update docs to remove stale Cluster class references

- Remove ClusterProcess/Worker startup console output from quickstart (no longer produced by FixedNativeWorkerAdapter)
- Update worker_adapters/index: Fixed Native is used by SchedulerClusterCombo, not Cluster
- Replace --num-of-workers/num_of_workers with canonical --max-workers/max_workers in configuration examples

* Fix stale imports in run_* scripts

Entry points were renamed from worker_adapter_* to worker_manager_*,
and run_cluster.py now imports from the cluster entry point directly.

* Fix regression in task_capabilities.py introduced by 9668545

9668545 accidentally reverted the task_capabilities.py changes from
b4ebe01, re-introducing Cluster/ClusterConfig which this branch removes.
Restore the FixedNativeWorkerAdapter-based version from b4ebe01.

* fix: resolve mypy error and apply black/isort formatting

Update combo.py to use renamed FixedNativeWorkerManager/FixedNativeWorkerManagerConfig/WorkerManagerConfig
(replacing stale FixedNativeWorkerAdapter/FixedNativeWorkerAdapterConfig/WorkerAdapterConfig names).
Apply black formatting to three files.

* fix: update tests to use renamed FixedNativeWorkerManager classes

Update 5 test files to reflect the renaming of FixedNativeWorkerAdapter
to FixedNativeWorkerManager and FixedNativeWorkerAdapterConfig to
FixedNativeWorkerManagerConfig, including updated import paths and
keyword argument names.

* fix: update example to use renamed FixedNativeWorkerManager classes

* fix: update tests to use renamed WorkerManagerConfig

Replace references to the removed worker_adapter module and WorkerAdapterConfig
with the renamed worker_manager module and WorkerManagerConfig.

* fix: rename adapter variables and identifiers to manager

* fix: update test_death_timeout import path from worker_manager_manager to worker_manager_adapter

* fix: update task_capabilities example to use WorkerManagerConfig

Replace non-existent WorkerAdapterConfig from scaler.config.common.worker_adapter
with WorkerManagerConfig from scaler.config.common.worker_manager.

* fix: cancel pending tasks before closing event loop in run_task_forever

When the main task completes, background tasks created via
loop.create_task() (the create_async_loop_routine coroutines) may still
be running as orphans. Previously, loop.close() was called immediately,
causing those tasks to receive RuntimeError: Event loop is closed when
their pending awaits tried to interact with the loop.

Now all pending tasks are cancelled and awaited before the cleanup
callback and loop.close(), following the standard asyncio shutdown
pattern.

---------

Co-authored-by: sharpener6 <1sc2l4qi@duck.com>
Resolve trivial conflicts from merging origin/main. All conflicts were
between the HEAD NativeWorkerManager/NativeWorkerManagerConfig approach
and the intermediate FixedNativeWorkerManager added during the
remove-cluster-class merge. Kept HEAD versions throughout.
Signed-off-by: magniloquency <197707854+magniloquency@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant