Skip to content
Closed
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
66 commits
Select commit Hold shift + click to select a range
d62c72a
Starting spec work, migrate zephyr to no longer use generators.
rjpower Jan 7, 2026
3de7ee7
Add actor support to Fray RPC
rjpower Jan 8, 2026
90ca81a
Fix actor RPC test failures
rjpower Jan 8, 2026
f85d27f
Fix protobuf duplicate symbol conflict in connect-python
rjpower Jan 8, 2026
698d974
Resolve code review issues, flip worker/controller RPC direction.
rjpower Jan 9, 2026
c47b4a0
Remove docs.
rjpower Jan 9, 2026
ae6b97a
Initial design + Clauding.
rjpower Jan 10, 2026
3c01e0b
Adding detailed design and examples.
rjpower Jan 10, 2026
c09c4ed
Cleaning up design.
rjpower Jan 10, 2026
e6ddf5d
Add fluster library with protobuf RPC support.
rjpower Jan 11, 2026
270f83f
Merge protos, cleanup ideas.
rjpower Jan 11, 2026
d61f579
Bundling.
rjpower Jan 11, 2026
f4407f8
Add ImageBuilder for Docker image building with layered caching
rjpower Jan 11, 2026
68aefd9
Add DockerRuntime for container execution with resource limits
rjpower Jan 11, 2026
64df916
Remove log streaming and use proto ResourceSpec in DockerRuntime
rjpower Jan 11, 2026
f77c076
Add PortAllocator for managing ephemeral ports in worker jobs
rjpower Jan 12, 2026
59b3ad2
Fix unused allocator fixture parameters in tests
rjpower Jan 12, 2026
10fb8d9
Add JobManager for orchestrating worker job lifecycle
rjpower Jan 12, 2026
3925eff
Fix type annotations and unused imports in JobManager
rjpower Jan 12, 2026
e9850d8
Fix log implementation to use disk-based storage
rjpower Jan 12, 2026
b4c7677
Implement WorkerService RPC interface and fix BundleCache race condition
rjpower Jan 12, 2026
1b135f0
Add status_message field to JobStatus for build phase progress
rjpower Jan 12, 2026
6b963fc
Add FRAY_PORT_MAPPING environment variable for port communication
rjpower Jan 12, 2026
61110a7
Add HTTP server with Connect RPC and web dashboard for Fluster Worker
rjpower Jan 12, 2026
f283790
Add CLI entry point for Fluster worker
rjpower Jan 12, 2026
72a3aa8
Rename worker types module to avoid circular import conflict with bui…
rjpower Jan 12, 2026
0c59fb2
Refactor worker: rename server → dashboard, add state machine, improv…
rjpower Jan 12, 2026
dfb9138
Remove unused venv archive/extract functionality from VenvCache
rjpower Jan 12, 2026
77e7986
Add end-to-end test, clean up dashboard.
rjpower Jan 12, 2026
2397b2c
WIP.
rjpower Jan 12, 2026
9340210
Clean up worker.
rjpower Jan 12, 2026
7ab59eb
WIP.
rjpower Jan 13, 2026
6efa63a
Use explicit global BuildKit cache ID for UV dependency caching
rjpower Jan 13, 2026
4d56df0
Cleanup proto generation.
rjpower Jan 13, 2026
9ae142d
Cleanup.
rjpower Jan 13, 2026
abc19fb
Add controller v0 proto messages and implementation plan
rjpower Jan 13, 2026
9972ef1
Mark Stage 1 as completed in implementation plan
rjpower Jan 13, 2026
a9bccac
[fluster] Implement Stage 2: Controller Core Data Structures
rjpower Jan 13, 2026
ab1ed8a
[fluster] Implement Stage 3: Worker Registry
rjpower Jan 13, 2026
ee66ca5
[fluster] Implement Stage 4: Job Scheduler Thread
rjpower Jan 13, 2026
6e9e23c
[fluster] Implement Stage 5: Worker Heartbeat Monitor
rjpower Jan 13, 2026
e087df2
[fluster] Implement Stage 6: Job Failure and Retry Logic
rjpower Jan 13, 2026
a82d513
[fluster] Implement Stage 7: Controller Service Implementation
rjpower Jan 13, 2026
3c9c5f5
[fluster] Implement Stage 8: Integration Tests
rjpower Jan 13, 2026
a326201
[fluster] Implement Stage 9: HTTP Dashboard
rjpower Jan 13, 2026
f197278
[fluster] Improve dashboard and service implementation
rjpower Jan 13, 2026
a07d23d
[fluster] Fix cluster example and improve job status display
rjpower Jan 13, 2026
6e6ca6f
Move docs around.
rjpower Jan 13, 2026
3ba8604
Remove fray experiments.
rjpower Jan 13, 2026
c75a977
[fluster] Add resource-aware job scheduling
rjpower Jan 14, 2026
062166d
Working on actor system.
rjpower Jan 14, 2026
0a686e0
[fluster] Add actor system stages 1-4: server, client, resolver, endp…
rjpower Jan 14, 2026
c6c348d
[fluster] Convert actor system to use Connect RPC interfaces
rjpower Jan 14, 2026
9d5b512
[fluster] Add ActorPool with round-robin and broadcast support (Stage 5)
rjpower Jan 14, 2026
d5f5c0f
[fluster] Implement Stage 6 - GcsResolver
rjpower Jan 14, 2026
6a6318e
[fluster] Add GcsResolver for GCP VM metadata-based discovery (Stage 6)
rjpower Jan 14, 2026
6b580e3
[fluster] Add introspection RPCs - ListMethods, ListActors (Stage 7)
rjpower Jan 14, 2026
5edf63e
[fluster] Implement Stage 8 - Actor System Integration Examples
rjpower Jan 14, 2026
8acca80
WIP -- cluster controller resolver.
rjpower Jan 14, 2026
553ea59
Add helper controller and worker types.
rjpower Jan 14, 2026
19565f4
Fix integration tests and improve testing boundaries.
rjpower Jan 14, 2026
89825bd
Milestone: actor + metadata integration working for the full end-to-e…
rjpower Jan 14, 2026
6ca698b
Worker pool implementation, clean up protos.
rjpower Jan 14, 2026
037bbc5
Complete proto message migration to nested Controller/Worker structure
rjpower Jan 14, 2026
f476c16
Add basic docs and fix up worker pool.
rjpower Jan 14, 2026
199e02b
Merge JobManager into Worker class
rjpower Jan 14, 2026
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions .gitignore
Original file line number Diff line number Diff line change
Expand Up @@ -223,3 +223,4 @@ gha-creds-*.json
*.jsonl
**/*.jsonl
scr/*
.weaver/
11 changes: 10 additions & 1 deletion AGENTS.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,7 +23,7 @@
- Assume Python >=3.11.
- Always use `uv run` for Python entry points. If that fails, try `.venv/bin/python` directly.
- Run `uv run python infra/pre-commit.py --all-files` before sending changes; formatting and linting are enforced with `ruff`.
- Keep type hints passing under `uv run mypy`; configuration lives in `pyproject.toml`.
- Keep type hints passing under `uv run pyrefly`; configuration lives in `pyproject.toml`.

### Communication & Commits

Expand Down Expand Up @@ -88,6 +88,15 @@ You don't generate comments that merely restate the code, e.g.
- Run the appropriate tests for your changes (for example, `uv run pytest` under the relevant directory); consult subproject guides for preferred markers.
- Use pytest features like fixtures and parameterization to avoid duplication and write clean code.

PREFER:

- Integration style tests which exercise behavior and test the output

DO NOT:

- Create tests which validate obvious features: if a type exists, a constant has a value, etc.


## Environment

- Prefer to use `uv` when possible. If you can't (for instance, due to sandbox restrictions) you can use `.venv/bin/python`
Expand Down
3 changes: 3 additions & 0 deletions infra/pre-commit.py
Original file line number Diff line number Diff line change
Expand Up @@ -46,6 +46,9 @@
".git/**",
".github/**",
"tests/snapshots/**",
# grpc generated files
"**/*_connect.py",
"**/*_pb2.py",
"**/*.gz",
"**/*.pb",
"**/*.index",
Expand Down
46 changes: 46 additions & 0 deletions lib/fluster/AGENTS.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
# Agent Tips

* Use the connect/RPC abstractions to implement and perform RPC calls. DO NOT use httpx or raw HTTP.
* Use scripts/generate-protos.py to regenerate files after changing the `.proto` files.
* Prefer shallow, functional interfaces which return control to the user, vs callbacks or inheritance.

e.g.

class Scheduler:
def add_job()
def add_worker():
def compute_schedule() -> ScheduledJobs:

is preferable to:

class Scheduler:
def __init__(self, job_creator: JobCreator):
self.job_creator = job_creator
def run(self):
... self.job_creator.create_job()

It's acceptable to have a top-level class which implements the main loop of
course, but prefer to keep other interfaces shallow and functional whenever
possible.

* Tests should evaluate _behavior_, not implementation. Don't test things that are trivially caught by the type checker. Explicitly that means:

- No tests for "constant = constant"
- No tests for "method exists"
- No tests for "create an object(x, y, z) and attributes are x, y, z"

These tests have negative value - they make our code more brittle. Test
_behavior_ instead. You can use mocks as needed to isolate environments (e.g.
mock around a remote API). Prefer "fakes" -- e.g. create a real database but
with fake data -- when reasonable.

## Protocols and Testing

Non-trivial public classes should define a protocol which represents their
_important_ interface characteristics. Use this protocol in type hints for
when the class is used instead of the concrete class.

Test to this protocol, not the concrete class: the protocol should describe the
interesting behavior of the class, but not betray the implementation details.

(You may of course _instantiate_ the concrete class for testing.)
130 changes: 130 additions & 0 deletions lib/fluster/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,130 @@
# Fluster

Fluster is a distributed job orchestration and RPC framework designed to replace Ray with simpler, more focused primitives. It provides job lifecycle management, actor-based RPC communication, and task dispatch capabilities for distributed Python workloads.

## Architecture Overview

Fluster consists of four main components:

| Component | Description |
|-----------|-------------|
| **Controller** | Central coordinator managing job scheduling, worker registration, and service discovery |
| **Worker** | Execution agent that runs jobs in isolated containers with resource management |
| **Actor System** | RPC framework enabling Python object method invocation across processes |
| **WorkerPool** | High-level task dispatch abstraction for stateless parallel workloads |

```
┌─────────────────────────────────────────────────────────────────┐
│ Controller │
│ │
│ Job Scheduling │ Worker Registry │ Endpoint Registry│
└─────────────────────────────────────────────────────────────────┘
│ │ ▲
│ dispatch │ health │ register
▼ ▼ │
┌─────────────────┐ ┌─────────────────┐ ┌─────────────────┐
│ Worker │ │ Worker │ │ ActorServer │
│ │ │ │ │ (in job) │
│ runs jobs in │ │ runs jobs in │ │ │
│ containers │ │ containers │ │ │
└─────────────────┘ └─────────────────┘ └─────────────────┘
```

## Directory Structure

```
src/fluster/
├── actor/ # Actor RPC system
│ ├── client.py # Actor method invocation
│ ├── pool.py # Multi-endpoint management
│ ├── resolver.py # Endpoint discovery
│ ├── server.py # Actor hosting
│ └── types.py # Core types
├── cluster/ # Cluster orchestration
│ ├── controller/ # Controller service
│ ├── worker/ # Worker service
│ ├── client.py # Client interface
│ └── types.py # Shared types
├── proto/ # Protocol definitions
├── worker_pool.py # Task dispatch
└── *_pb2.py, *_connect.py # Generated RPC code
```

## Component Documentation

- [Controller Overview](docs/controller.md) - Job scheduling and coordination
- [Worker Overview](docs/worker.md) - Job execution and container management
- [Actor System Overview](docs/actor.md) - RPC and service discovery

## Quick Start

### Submitting a Job

```python
from fluster.cluster import RpcClusterClient, Entrypoint, create_environment
from fluster.cluster_pb2 import ResourceSpec

def my_task():
print("Hello from fluster!")

client = RpcClusterClient("http://controller:8080")
job_id = client.submit(
name="my-job",
entrypoint=Entrypoint(callable=my_task),
resources=ResourceSpec(cpu=1, memory="2GB"),
environment=create_environment(),
)
client.wait(job_id)
```

### Running an Actor Server

```python
from fluster.actor import ActorServer, ActorContext

class InferenceActor:
def predict(self, ctx: ActorContext, data: list) -> list:
return [x * 2 for x in data]

server = ActorServer(controller_address="http://controller:8080")
server.register("inference", InferenceActor())
server.serve()
```

### Calling Actors

```python
from fluster.actor import ActorPool, ClusterResolver

resolver = ClusterResolver("http://controller:8080")
pool: ActorPool = resolver.lookup("inference")
pool.wait_for_size(1)

result = pool.call().predict([1, 2, 3])
```

### Using WorkerPool for Task Dispatch

```python
from fluster.worker_pool import WorkerPool, WorkerPoolConfig
from fluster.cluster import RpcClusterClient

client = RpcClusterClient("http://controller:8080")
config = WorkerPoolConfig(num_workers=10, resources=ResourceSpec(cpu=2))
pool = WorkerPool(client, config)

futures = [pool.submit(process_shard, shard) for shard in shards]
results = [f.result() for f in futures]
pool.shutdown()
```

## Design Principles

1. **Shallow interfaces**: Components expose minimal APIs with clear responsibilities
2. **Explicit over implicit**: No magic discovery or hidden state synchronization
3. **Stateless workers**: Task retry and load balancing work because workers maintain no shared state
4. **Arbitrary callables**: Jobs and actor methods accept any picklable Python callable

## Related Documentation

- [Fray-Zero Design](docs/fray-zero.md) - Original design document and rationale
10 changes: 10 additions & 0 deletions lib/fluster/buf.gen.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
version: v2
managed:
enabled: true
plugins:
- remote: buf.build/protocolbuffers/python
out: src/fluster
- remote: buf.build/protocolbuffers/pyi
out: src/fluster
- remote: buf.build/connectrpc/python
out: src/fluster
10 changes: 10 additions & 0 deletions lib/fluster/buf.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
version: v2
modules:
- path: src/fluster/proto
- path: src/fluster/actor/proto
lint:
use:
- STANDARD
breaking:
use:
- FILE
Loading