Implement Distributed Execution with Message Queues #223

@Adithyakp86

Title: [Feature] Implement Distributed Execution with Message Queues

Description

Currently, the AI Council Orchestrator runs primarily as a monolithic execution pipeline. To scale this for high-throughput, enterprise-level workloads, we need to transition the model execution and task processing into a Distributed Execution architecture.

This will involve breaking down the execution layer into worker nodes that can pull sub-tasks from a message broker, process them in parallel across different servers, and return the results for final synthesis.
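
The pull-based flow described above can be sketched in a single process, with an `asyncio.Queue` standing in for the message broker. This is only an illustration under stated assumptions: `run_model`, `worker`, `execute_distributed`, and `synthesize` are hypothetical names, not part of the existing codebase, and a real deployment would replace the queue with a broker client.

```python
import asyncio


async def run_model(sub_task: str) -> str:
    """Hypothetical placeholder for a real model call."""
    await asyncio.sleep(0.01)  # simulate inference latency
    return f"result:{sub_task}"


async def worker(tasks: asyncio.Queue, results: dict) -> None:
    """Pull sub-tasks from the queue until cancelled."""
    while True:
        task_id, payload = await tasks.get()
        try:
            results[task_id] = await run_model(payload)
        finally:
            # Always acknowledge, so tasks.join() cannot hang.
            tasks.task_done()


async def execute_distributed(sub_tasks: list[str], num_workers: int = 3) -> list[str]:
    tasks: asyncio.Queue = asyncio.Queue()  # stand-in for the broker
    results: dict[int, str] = {}
    workers = [asyncio.create_task(worker(tasks, results)) for _ in range(num_workers)]

    for task_id, payload in enumerate(sub_tasks):
        tasks.put_nowait((task_id, payload))

    await tasks.join()  # wait until every sub-task is acknowledged
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)

    # Return results in submission order, ready for synthesis.
    return [results[i] for i in range(len(sub_tasks))]


def synthesize(results: list[str]) -> str:
    """Hypothetical final synthesis step: here, a simple join."""
    return " | ".join(results)
```

Usage: `synthesize(asyncio.run(execute_distributed(["a", "b", "c"])))`. Keying results by task ID keeps the output order stable even though workers finish in arbitrary order.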

Objectives & Acceptance Criteria

  • Message Queue Design: Design and implement a distributed architecture integrating a robust message broker (e.g., RabbitMQ, Apache Kafka, or Redis Pub/Sub).
  • Worker Nodes: Implement dedicated worker nodes responsible for parallel model execution that listen to specific task queues.
  • Load Balancing: Add intelligent load balancing to distribute heavy inference tasks evenly across available workers.
  • Distributed Resilience: Implement distributed circuit breakers to prevent systemic failures when specific AI APIs experience downtime or high latency.
  • Distributed Tracing: Add distributed tracing (e.g., OpenTelemetry) to track requests as they move through the broker and worker nodes.
  • Deployment Configs: Create boilerplate deployment configurations (Docker Compose, basic Kubernetes manifests) for standing up the distributed system.
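
As one illustration of the resilience objective, a circuit breaker can be sketched as a small state machine. The version below is per-process only; a distributed variant would need to share the failure count and open timestamp across workers (for example through the broker or a shared store), which is not shown. The class name, thresholds, and injectable `clock` are assumptions made for this sketch.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock  # injectable so tests can control time
        self._failures = 0
        self._opened_at = None

    @property
    def state(self) -> str:
        if self._opened_at is None:
            return "closed"
        if self._clock() - self._opened_at >= self.reset_timeout:
            return "half-open"  # allow one trial call through
        return "open"

    def call(self, func, *args, **kwargs):
        if self.state == "open":
            raise CircuitOpenError("circuit is open; skipping call")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._failures += 1
            # Open on reaching the threshold, or re-open after a failed trial call.
            if self._failures >= self.failure_threshold or self._opened_at is not None:
                self._opened_at = self._clock()
            raise
        # Any success closes the circuit and resets the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

A worker might wrap each outbound AI API call in `breaker.call(...)`, keeping one breaker per upstream provider so that one failing API does not block the others.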

Implementation Hints

  • Start by creating an architecture proposal diagram.
  • Ensure the OrchestrationLayer can seamlessly publish tasks to the broker and asynchronously wait for the synthesis results.
  • OpenTelemetry is highly recommended for tracing the lifecycle of a distributed orchestrator request.
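
One common way to publish a task and asynchronously wait for its result is to tag each message with a correlation ID and resolve a matching future when the reply arrives. The sketch below assumes an in-memory broker built from two `asyncio.Queue` objects; `InMemoryBroker`, `OrchestrationClient`, and `echo_worker` are hypothetical names. The `trace_id` field is a plain string standing in for the trace context that a tracing library such as OpenTelemetry would propagate in message headers.

```python
import asyncio
import uuid


class InMemoryBroker:
    """Stand-in for a real broker: one task queue, one result queue."""

    def __init__(self):
        self.tasks = asyncio.Queue()
        self.results = asyncio.Queue()


class OrchestrationClient:
    def __init__(self, broker: InMemoryBroker):
        self._broker = broker
        self._pending: dict[str, asyncio.Future] = {}

    async def submit(self, payload: str, trace_id: str, timeout: float = 5.0) -> dict:
        correlation_id = uuid.uuid4().hex
        future = asyncio.get_running_loop().create_future()
        self._pending[correlation_id] = future
        await self._broker.tasks.put(
            {"correlation_id": correlation_id, "trace_id": trace_id, "payload": payload}
        )
        try:
            return await asyncio.wait_for(future, timeout)
        finally:
            self._pending.pop(correlation_id, None)

    async def consume_results(self) -> None:
        """Route each reply to the future waiting on its correlation ID."""
        while True:
            msg = await self._broker.results.get()
            future = self._pending.get(msg["correlation_id"])
            if future is not None and not future.done():
                future.set_result(msg)


async def echo_worker(broker: InMemoryBroker) -> None:
    """Hypothetical worker: upper-cases the payload and echoes the IDs back."""
    while True:
        msg = await broker.tasks.get()
        await broker.results.put({**msg, "result": msg["payload"].upper()})


async def demo() -> list:
    broker = InMemoryBroker()
    client = OrchestrationClient(broker)
    background = [
        asyncio.create_task(client.consume_results()),
        asyncio.create_task(echo_worker(broker)),
    ]
    try:
        return await asyncio.gather(
            client.submit("alpha", trace_id="trace-1"),
            client.submit("beta", trace_id="trace-1"),
        )
    finally:
        for task in background:
            task.cancel()
        await asyncio.gather(*background, return_exceptions=True)
```

Running `asyncio.run(demo())` returns one reply per submitted task, each carrying the original `trace_id`. The timeout in `submit` bounds how long the orchestrator waits if a worker never replies.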

Skills Required

  • Python (Asyncio)
  • Distributed Systems Architecture
  • Message Brokers (RabbitMQ/Kafka)
  • Docker/Microservices

Constraints (Important for OSCG)

  • This is a significant architectural expansion. Please provide a system design document in the comments for maintainer approval before starting development.
  • Ensure that the single-node (current) execution method remains available as a configuration option for local development.
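
Keeping single-node execution available could be done by hiding both modes behind a shared interface and choosing one from configuration. The sketch below is an assumption-laden illustration: the `Executor` protocol, both executor classes, and the `AI_COUNCIL_EXECUTION_MODE` and `AI_COUNCIL_BROKER_URL` settings are hypothetical and do not exist in the project today.

```python
import os
from typing import Mapping, Optional, Protocol


class Executor(Protocol):
    def execute(self, sub_tasks: list[str]) -> list[str]: ...


class SingleNodeExecutor:
    """Runs every sub-task in the current process (today's behaviour)."""

    def execute(self, sub_tasks: list[str]) -> list[str]:
        return [f"local:{task}" for task in sub_tasks]


class DistributedExecutor:
    """Placeholder for the broker-backed executor proposed in this issue."""

    def __init__(self, broker_url: str):
        self.broker_url = broker_url

    def execute(self, sub_tasks: list[str]) -> list[str]:
        raise NotImplementedError("broker integration is not implemented in this sketch")


def build_executor(env: Optional[Mapping[str, str]] = None) -> Executor:
    """Select an executor from configuration; defaults to single-node."""
    env = os.environ if env is None else env
    mode = env.get("AI_COUNCIL_EXECUTION_MODE", "single_node")  # hypothetical setting
    if mode == "single_node":
        return SingleNodeExecutor()
    if mode == "distributed":
        broker_url = env.get("AI_COUNCIL_BROKER_URL", "amqp://localhost")  # hypothetical
        return DistributedExecutor(broker_url)
    raise ValueError(f"unknown execution mode: {mode!r}")
```

Defaulting to `single_node` means local development needs no broker and no extra configuration.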

Difficulty: 🔴 Hard / Advanced
Estimated Time: 25-35 hours
Labels: enhancement, hard, OSCG, distributed-systems, architecture

Please assign this issue to me under OSCG.
