6 changes: 6 additions & 0 deletions .github/actions/run-test/action.yml
@@ -55,7 +55,13 @@ runs:
run: |
uv pip install --system -r examples/applications/requirements_applications.txt
uv pip install --system -r examples/ray_compat/requirements.txt
readarray -t skip_examples < examples/skip_examples.txt
for example in "./examples"/*.py; do
filename=$(basename "$example")
if [[ " ${skip_examples[*]} " =~ [[:space:]]$filename[[:space:]] ]]; then
echo "Skipping $example"
continue
fi
echo "Running $example"
python $example
done
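
The skip-list membership check above can be exercised on its own; here is a minimal standalone sketch (the skip-list contents and example filenames below are made up for illustration):

```shell
#!/usr/bin/env bash
set -euo pipefail

# Stand-in for examples/skip_examples.txt: one filename per line.
printf '%s\n' "needs_gpu.py" "needs_network.py" > /tmp/skip_examples.txt
readarray -t skip_examples < /tmp/skip_examples.txt

for example in "./examples/simple_client.py" "./examples/needs_gpu.py"; do
    filename=$(basename "$example")
    # Pad with spaces so only whole filenames match, not substrings.
    if [[ " ${skip_examples[*]} " =~ [[:space:]]${filename}[[:space:]] ]]; then
        echo "Skipping $example"
        continue
    fi
    echo "Running $example"
done
# Output:
#   Running ./examples/simple_client.py
#   Skipping ./examples/needs_gpu.py
```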
3 changes: 3 additions & 0 deletions .gitignore
@@ -36,6 +36,9 @@ CMakeFiles/
src/scaler/protocol/capnp/*.c++
src/scaler/protocol/capnp/*.h

orb/logs/
orb/metrics/

# AWS HPC test-generated files
.scaler_aws_batch_config.json
.scaler_aws_hpc.env
65 changes: 41 additions & 24 deletions README.md
@@ -118,7 +118,7 @@ This will start a scheduler with 4 workers on port `2345`.
### Setting up a computing cluster from the CLI

The object storage server, scheduler and workers can also be started from the command line with
`scaler_scheduler` and `scaler_cluster`.
`scaler_scheduler` and `scaler_worker_manager`.

First, start the scheduler, and make it connect to the object storage server:

@@ -132,28 +132,22 @@ $ scaler_scheduler "tcp://127.0.0.1:2345"
...
```

Finally, start a set of workers (a.k.a. a Scaler *cluster*) that connects to the previously started scheduler:
Finally, start a set of workers that connects to the previously started scheduler:

```bash
$ scaler_cluster -n 4 tcp://127.0.0.1:2345
[INFO]2023-03-19 12:19:19-0400: logging to ('/dev/stdout',)
[INFO]2023-03-19 12:19:19-0400: ClusterProcess: starting 4 workers, heartbeat_interval_seconds=2, object_retention_seconds=3600
[INFO]2023-03-19 12:19:19-0400: Worker[0] started
[INFO]2023-03-19 12:19:19-0400: Worker[1] started
[INFO]2023-03-19 12:19:19-0400: Worker[2] started
[INFO]2023-03-19 12:19:19-0400: Worker[3] started
$ scaler_worker_manager native --mode fixed --max-task-concurrency 4 tcp://127.0.0.1:2345
...
```

Multiple Scaler clusters can be connected to the same scheduler, providing distributed computation over multiple
Multiple worker managers can be connected to the same scheduler, providing distributed computation over multiple
servers.

`-h` lists the available options for the object storage server, scheduler and the cluster executables:
`-h` lists the available options for the object storage server, scheduler and the worker manager executables:

```bash
$ scaler_object_storage_server -h
$ scaler_scheduler -h
$ scaler_cluster -h
$ scaler_worker_manager native --help
```

### Submitting Python tasks using the Scaler client
@@ -243,12 +237,14 @@ The following table maps each Scaler command to its corresponding section name i
| Command | TOML Section Name |
|--------------------------------------|---------------------------------|
| `scaler_scheduler` | `[scheduler]` |
| `scaler_cluster` | `[cluster]` |
| `scaler_object_storage_server` | `[object_storage_server]` |
| `scaler_ui` | `[webui]` |
| `scaler_top` | `[top]` |
| `scaler_worker_manager_baremetal_native` | `[native_worker_manager]` |
| `scaler_worker_manager_symphony` | `[symphony_worker_manager]` |
| `scaler_worker_manager native` | `[native_worker_manager]` |
| `scaler_worker_manager symphony` | `[symphony_worker_manager]` |
| `scaler_worker_manager ecs` | `[ecs_worker_manager]` |
| `scaler_worker_manager hpc` | `[aws_hpc_worker_manager]` |
| `scaler_worker_manager orb` | `[orb_worker_adapter]` |

### Practical Scenarios & Examples

@@ -269,8 +265,9 @@ logging_paths = ["/dev/stdout", "/var/log/scaler/scheduler.log"]
policy_engine_type = "simple"
policy_content = "allocate=even_load; scaling=no"

[cluster]
num_of_workers = 8
[native_worker_manager]
mode = "fixed"
max_task_concurrency = 8
per_worker_capabilities = "linux,cpu=8"
task_timeout_seconds = 600

@@ -285,7 +282,7 @@ With this single file, starting your entire stack is simple and consistent:
```bash
scaler_object_storage_server tcp://127.0.0.1:6379 --config example_config.toml &
scaler_scheduler tcp://127.0.0.1:6378 --config example_config.toml &
scaler_cluster tcp://127.0.0.1:6378 --config example_config.toml &
scaler_worker_manager native tcp://127.0.0.1:6378 --config example_config.toml &
scaler_ui tcp://127.0.0.1:6380 --config example_config.toml &
```

@@ -295,12 +292,12 @@ You can override any value from the TOML file by providing it as a command-line
example_config.toml file but test the cluster with 12 workers instead of 8:

```bash
# The --num-of-workers flag will take precedence over the [cluster] section
scaler_cluster tcp://127.0.0.1:6378 --config example_config.toml --num-of-workers 12
# The --max-task-concurrency flag will take precedence over the [native_worker_manager] section
scaler_worker_manager native tcp://127.0.0.1:6378 --config example_config.toml --max-task-concurrency 12
```

The cluster will start with 12 workers, but all other settings (like `task_timeout_seconds`) will still be loaded from the
`[cluster]` section of example_config.toml.
`[native_worker_manager]` section of example_config.toml.

## Nested computations

@@ -351,7 +348,7 @@ When starting a cluster of workers, you can define the capabilities available on
capabilities these provide.

```bash
$ scaler_cluster -n 4 --per-worker-capabilities "gpu,linux" tcp://127.0.0.1:2345
$ scaler_worker_manager native --mode fixed --max-task-concurrency 4 --per-worker-capabilities "gpu,linux" tcp://127.0.0.1:2345
```

### Specifying Capability Requirements for Tasks
@@ -380,7 +377,7 @@ might be added in the future.
A Scaler scheduler can interface with IBM Spectrum Symphony to provide distributed computing across Symphony clusters.

```bash
$ scaler_worker_manager_symphony tcp://127.0.0.1:2345 --service-name ScalerService --base-concurrency 4
$ scaler_worker_manager symphony tcp://127.0.0.1:2345 --service-name ScalerService --base-concurrency 4
```

This will start a Scaler worker that connects to the Scaler scheduler at `tcp://127.0.0.1:2345` and uses the Symphony
@@ -465,6 +462,26 @@ where `deepest_nesting_level` is the deepest nesting level a task has in your wo
workload that has
a base task that calls a nested task that calls another nested task, then the deepest nesting level is 2.
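
For intuition, the nesting structure described above can be mimicked with plain Python functions (the task names here are made up, and no Scaler API is involved):

```python
def leaf_task(x):
    # Nesting level 2: reached via two levels of task-in-task calls.
    return x * 2


def nested_task(x):
    # Nesting level 1: called by the base task, itself calls another task.
    return leaf_task(x) + 1


def base_task(x):
    # Nesting level 0: the task the client submits directly.
    return nested_task(x)


# base_task -> nested_task -> leaf_task, so deepest_nesting_level == 2.
print(base_task(10))  # 21
```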

## ORB (AWS EC2) integration

A Scaler scheduler can interface with ORB (Open Resource Broker) to dynamically provision and manage workers on AWS EC2 instances.

```bash
$ scaler_worker_manager orb tcp://127.0.0.1:2345 --image-id ami-0528819f94f4f5fa5
```

This will start an ORB worker adapter that connects to the Scaler scheduler at `tcp://127.0.0.1:2345`. The scheduler can then request new workers from this adapter, which will be launched as EC2 instances.

### Configuration

The ORB adapter requires `orb-py` and `boto3` to be installed. You can install them with:

```bash
$ pip install "opengris-scaler[orb]"
```

For more details on configuring ORB, including AWS credentials and instance templates, please refer to the [ORB Worker Adapter documentation](https://finos.github.io/opengris-scaler/tutorials/worker_manager_adapter/orb.html).

## Worker Manager usage

> **Note**: This feature is experimental and may change in future releases.
@@ -480,7 +497,7 @@ specification [here](https://github.com/finos/opengris).
Start a Native Worker Manager and connect it to the scheduler:

```bash
$ scaler_worker_manager_baremetal_native tcp://127.0.0.1:2345
$ scaler_worker_manager native tcp://127.0.0.1:2345
```

To check that the Worker Manager is working, you can bring up `scaler_top` to see workers spawning and terminating as
3 changes: 3 additions & 0 deletions docs/source/index.rst
@@ -29,8 +29,11 @@ Content
tutorials/scaling
tutorials/worker_manager_adapter/index
tutorials/worker_manager_adapter/native
tutorials/worker_manager_adapter/orb
tutorials/worker_manager_adapter/aws_hpc/index
tutorials/worker_manager_adapter/common_parameters
tutorials/worker_manager_adapter/worker_manager
tutorials/worker_manager_adapter/aio
tutorials/compatibility/ray
tutorials/configuration
tutorials/examples
2 changes: 1 addition & 1 deletion docs/source/tutorials/compatibility/ray.rst
@@ -6,7 +6,7 @@ Ray Compatibility Layer
Scaler is a lightweight distributed computation engine similar to Ray. Scaler supports many of the same concepts as Ray including
remote functions (known as tasks in Scaler), futures, cluster object storage, labels (known as capabilities in Scaler), and it comes with comparable monitoring tools.

Unlike Ray, Scaler supports both local clusters and also easily integrates with multiple cloud providers out of the box, including AWS EC2 and IBM Symphony,
Unlike Ray, Scaler supports both local clusters and also easily integrates with multiple cloud providers out of the box, including ORB (AWS EC2) and IBM Symphony,
with more providers planned for the future. You can view our `roadmap on GitHub <https://github.com/finos/opengris-scaler/discussions/333>`_
for details on upcoming cloud integrations.

43 changes: 22 additions & 21 deletions docs/source/tutorials/configuration.rst
@@ -78,14 +78,14 @@ For the list of available settings, use the CLI command:

.. code:: bash

scaler_cluster -h
scaler_worker_manager native --help

**Preload Hook**

Workers can execute an optional initialization function before processing tasks using the ``--preload`` option. This enables workers to:

* Set up environments on demand
* Preload data, libraries, or models
* Preload data, libraries, or models
* Initialize connections or state

The preload specification follows the format ``module.path:function(args, kwargs)`` where:
@@ -97,10 +97,10 @@ The preload specification follows the format ``module.path:function(args, kwargs
.. code:: bash

# Simple function call with no arguments
scaler_cluster tcp://127.0.0.1:8516 --preload "mypackage.init:setup"
scaler_worker_manager native tcp://127.0.0.1:8516 --preload "mypackage.init:setup"

# Function call with arguments
scaler_cluster tcp://127.0.0.1:8516 --preload "mypackage.init:configure('production', debug=False)"
scaler_worker_manager native tcp://127.0.0.1:8516 --preload "mypackage.init:configure('production', debug=False)"

The preload function is executed once per processor during initialization, before any tasks are processed. If the preload function fails, the error is logged and the processor will terminate.

@@ -127,7 +127,7 @@ This can be set using the CLI:

.. code:: bash

scaler_cluster -n 10 tcp://127.0.0.1:8516 -dts 300
scaler_worker_manager native --mode fixed --max-task-concurrency 10 tcp://127.0.0.1:8516 -dts 300

Or through the programmatic API:

@@ -185,22 +185,22 @@ The following table maps each Scaler command to its corresponding section name i
- TOML Section Name
* - ``scaler_scheduler``
- ``[scheduler]``
* - ``scaler_cluster``
- ``[cluster]``
* - ``scaler_object_storage_server``
- ``[object_storage_server]``
* - ``scaler_ui``
- ``[webui]``
* - ``scaler_top``
- ``[top]``
* - ``scaler_worker_manager_baremetal_native``
* - ``scaler_worker_manager native``
- ``[native_worker_manager]``
* - ``scaler_worker_manager_symphony``
* - ``scaler_worker_manager symphony``
- ``[symphony_worker_manager]``
* - ``scaler_worker_manager_aws_raw_ecs``
* - ``scaler_worker_manager ecs``
- ``[ecs_worker_manager]``
* - ``python -m scaler.entry_points.worker_manager_aws_hpc_batch``
* - ``scaler_worker_manager hpc``
- ``[aws_hpc_worker_manager]``
* - ``scaler_worker_manager orb``
- ``[orb_worker_adapter]``


Practical Scenarios & Examples
@@ -224,7 +224,8 @@ Here is an example of a single ``example_config.toml`` file that configures mult
policy_engine_type = "simple"
policy_content = "allocate=even_load; scaling=no"

[cluster]
[native_worker_manager]
mode = "fixed"
max_task_concurrency = 8
per_worker_capabilities = "linux,cpu=8"
task_timeout_seconds = 600
@@ -241,7 +242,7 @@ With this single file, starting your entire stack is simple and consistent:

scaler_object_storage_server tcp://127.0.0.1:6379 --config example_config.toml &
scaler_scheduler tcp://127.0.0.1:6378 --config example_config.toml &
scaler_cluster tcp://127.0.0.1:6378 --config example_config.toml &
scaler_worker_manager native tcp://127.0.0.1:6378 --config example_config.toml &
scaler_ui tcp://127.0.0.1:6380 --config example_config.toml &


@@ -251,10 +252,10 @@ You can override any value from the TOML file by providing it as a command-line

.. code-block:: bash

# The --max-task-concurrency flag will take precedence over the [cluster] section
scaler_cluster tcp://127.0.0.1:6378 --config example_config.toml --max-task-concurrency 12
# The --max-task-concurrency flag will take precedence over the [native_worker_manager] section
scaler_worker_manager native tcp://127.0.0.1:6378 --config example_config.toml --max-task-concurrency 12

The cluster will start with **12 workers**, but all other settings (like ``task_timeout_seconds``) will still be loaded from the ``[cluster]`` section of ``example_config.toml``.
The cluster will start with **12 workers**, but all other settings (like ``task_timeout_seconds``) will still be loaded from the ``[native_worker_manager]`` section of ``example_config.toml``.


**Scenario 3: Waterfall Scaling Configuration**
@@ -276,18 +277,18 @@ To use the ``waterfall_v1`` policy engine for priority-based scaling across mult
2, ECS, 50
"""

[native_worker_adapter]
[native_worker_manager]
max_task_concurrency = 8

[ecs_worker_adapter]
[ecs_worker_manager]
max_task_concurrency = 50

Then start the scheduler and worker adapters:

.. code-block:: bash

scaler_scheduler tcp://127.0.0.1:8516 --config waterfall_config.toml &
scaler_worker_adapter_native tcp://127.0.0.1:8516 --config waterfall_config.toml &
scaler_worker_adapter_ecs tcp://127.0.0.1:8516 --config waterfall_config.toml &
scaler_worker_manager native tcp://127.0.0.1:8516 --config waterfall_config.toml &
scaler_worker_manager ecs tcp://127.0.0.1:8516 --config waterfall_config.toml &

Local ``NAT`` workers will scale up first. When they reach capacity, ``ECS`` workers will begin scaling. On scale-down, ECS workers drain before local workers.
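
The fill-then-spill behaviour described above can be sketched as a tiny allocation routine (an illustration of the waterfall idea only, not the scheduler's actual policy engine):

```python
def waterfall_allocate(pending_tasks, tiers):
    """Fill each tier to capacity in priority order before spilling to the next."""
    allocation = {}
    remaining = pending_tasks
    for name, capacity in tiers:  # tiers listed highest priority first
        take = min(remaining, capacity)
        allocation[name] = take
        remaining -= take
    return allocation, remaining


# Capacities mirror the example config: NAT handles 8 tasks, ECS up to 50.
allocation, overflow = waterfall_allocate(30, [("NAT", 8), ("ECS", 50)])
print(allocation)  # {'NAT': 8, 'ECS': 22}
print(overflow)    # 0
```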
8 changes: 8 additions & 0 deletions docs/source/tutorials/examples.rst
@@ -15,6 +15,14 @@ Shows how to send a basic task to scheduler
.. literalinclude:: ../../../examples/simple_client.py
:language: python

Submit Tasks
~~~~~~~~~~~~

Shows various ways to submit tasks (submit, map, starmap)

.. literalinclude:: ../../../examples/submit_tasks.py
:language: python

Client Mapping Tasks
~~~~~~~~~~~~~~~~~~~~

2 changes: 1 addition & 1 deletion docs/source/tutorials/quickstart.rst
@@ -137,7 +137,7 @@ Here we use localhost addresses for demonstration, however the scheduler and wor

.. code:: bash

scaler_cluster -n 10 tcp://127.0.0.1:8516
scaler_worker_manager native --mode fixed --max-task-concurrency 10 tcp://127.0.0.1:8516


From here, connect the Python Client and begin submitting tasks: