The system that manages and improves itself.
Ouroboros is an agent-first software engineering system.
Input: natural language task | Output: merged pull request
Six specialized agents — planner, implementer, validator, reviewer, cleaner, post-mortem — plan, write, test, review, merge, and learn from failures autonomously inside a constrained architecture.
Ouroboros is an agent-first software factory: it turns a natural language task into a merged, tested, reviewed pull request. The agents collaborate through typed Pydantic contracts — no text parsing, no regex, no string matching at any boundary.
The system is self-referential: agents can be tasked to improve the agent infrastructure itself — better prompts, tighter lint rules, new tools — all flowing through the same PR review process.
```mermaid
flowchart LR
    Task["'Fix the off-by-one error\nin utils/counter.py'"] --> Plan
    subgraph OUROBOROS
        Plan --> Implement --> Validate --> OpenPR["Open PR"] --> Review
        Validate -- "retry (max 5)" --> Implement
        Review -- "approved" --> Merge
        Review -- "not approved" --> Implement
        Review -- "human feedback" --> Feedback["Feedback Loop"]
        Feedback --> Implement
        Validate -- "escalate" --> PostMortem["Post-Mortem\n→ self-improvement issue"]
    end
    Merge --> Done["Merged PR #42"]
```
Task: "Fix the off-by-one error in utils/counter.py"

- Planner decomposes the task into typed execution steps
- Implementer writes the patch, returns `FileChange[]`
- Validator runs pytest + ruff + arch_lint — all pass
- PR opened via `gh pr create`
- Reviewer agent inspects the diff, approves
- PR merged via `gh pr merge --squash`

Total cost: $0.0087 | Iterations: 2 | Time: ~30s
Status: Testing
| Done | Upcoming |
|---|---|
| Core workflow (plan → implement → validate → review → merge) | Live agent integration tests with Vertex AI |
| Architecture linting with AGENT_REMEDIATION | Larger repo benchmarks |
| 10 Golden Principles with machine-checkable lint | Screenshot diff tool for UI validation |
| Repository index (189 symbols, 47 files) | Prompt tuning from Logfire trace data |
| Per-node token tracking and cost metrics | End-to-end Ralph Loop on real tasks |
| 284 tests passing (no GCP credentials required) | |
| Parallel sandbox execution via isolated git worktrees | |
| GitHub issue comment trigger (`/run-task`) | |
| PR feedback loop — agents address human review comments | |
| Struggle-driven self-improvement (post-mortem → auto-fix) | |
| Per-worktree app booting with isolated observability | |
| CLI interface (`ouroboros run`, `feedback`, `gc`, `status`) | |
- Why This Project Exists
- Why Ouroboros?
- Architecture Overview
- The Ralph Loop
- Agent Workers
- Typed Output Models
- Tool System
- Guard Rails
- Cost Awareness
- PR Feedback Loop
- Struggle-Driven Self-Improvement
- Entropy Management & GC
- Repository Index
- Context Builder
- Lint Framework
- Observability
- Infrastructure & Sandboxing
- Per-Worktree App Booting
- CLI
- Test Suite
- Core Beliefs
- Tech Stack
- Repository Structure
- Getting Started
- Configuration
- CI/CD Pipelines
Recent research suggests software engineering is shifting from writing code to designing environments where agents write code. The bottleneck moves from implementation speed to infrastructure quality — how constrained, observable, and self-correcting the agent environment is.
Ouroboros explores what that environment looks like in practice:
- Strict architectural constraints — layered imports enforced by AST-based linting, not convention
- Typed agent contracts — every handoff is a Pydantic model, not a string to parse
- Deterministic validation — test/lint routing is a pure function, not an LLM guess
- Automated entropy management — daily GC workflow prevents codebase drift before it compounds
- Cost as a first-class signal — every run tracks tokens, dollars, and per-node breakdowns
The name is intentional: the system can be tasked to improve itself, and those improvements flow through the same constrained pipeline as any other change.
Traditional software development is a loop: plan → write → test → review → merge → repeat. Ouroboros encodes this loop as a state machine where AI agents execute each step, with typed contracts at every boundary and hard limits to prevent runaway execution.
Key design constraints:

- **No text parsing, ever.** Every agent output is a typed Pydantic model. No regex, no JSON extraction, no "parse the LLM response." If a handoff can fail silently, it will — so every handoff is a type.
- **Guards are hard limits, not suggestions.** `MAX_IMPLEMENT_ITERATIONS = 5` is a constant, not a config value. An agent that loops forever is worse than one that escalates to a human.
- **Token budgets are first-class.** The context builder enforces a token budget before agents see anything. Agents that read the whole repo are agents that fail on large repos.
- **Entropy is tracked daily.** Ten machine-checkable Golden Principles (GP-001 to GP-010) are enforced by linters and a daily garbage collection workflow that opens atomic cleanup PRs.
- **Self-improvement is the point.** Agents can write better agent workers, tighter lint rules, and new tools — all flowing through the same PR review process as any other change.
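The first constraint is easiest to see with a tiny sketch. The model below is hypothetical (the real contracts are listed later), but it shows the failure mode typed boundaries eliminate: a malformed handoff raises immediately instead of being "parsed around".

```python
from typing import Literal

from pydantic import BaseModel, ValidationError

# Hypothetical minimal handoff model, in the style of the real contracts.
class StepResult(BaseModel):
    status: Literal["ok", "failed"]
    detail: str

def receive(raw: dict) -> StepResult:
    # The boundary either yields a valid typed object or raises loudly.
    # There is no silent "parse the LLM response" failure mode.
    return StepResult.model_validate(raw)

result = receive({"status": "ok", "detail": "patch applied"})
assert result.status == "ok"

try:
    receive({"status": "maybe", "detail": "?"})
except ValidationError:
    pass  # malformed handoffs fail fast instead of propagating garbage
```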
Ouroboros uses a strict layered architecture enforced by AST-based linting. Each layer can only import from layers below it:
```mermaid
flowchart TD
    workflows["WORKFLOWS\nralph_loop · feedback_loop · entropy_gc"]
    workers["WORKERS\nplanner · implementer · reviewer · validator · cleaner · post_mortem"]
    tools["TOOLS\nfs · shell · git · browser · observability · benchmark"]
    core["CORE\nguards · state · context_builder · config · paths"]
    models["MODELS\nPlanOutput · ImplementOutput · ReviewOutput · ValidationOutput\nCostSummary · HarnessImprovementOutput · ReproductionResult"]
    workflows --> workers --> tools --> core --> models
    style workflows fill:#4a9eff,color:#fff
    style workers fill:#7c5cbf,color:#fff
    style tools fill:#2e8b57,color:#fff
    style core fill:#d4a017,color:#fff
    style models fill:#c0392b,color:#fff
```
Enforced invariants:
- Workers cannot cross-import each other (shared logic goes to `core/` or `models/`)
- Tools cannot import workers (tools are stateless; workers orchestrate them)
- Models cannot import anything above them (pure types, zero side effects)

Violations are caught by `lint/arch_lint.py` with actionable AGENT_REMEDIATION messages so agents can self-fix.
The Ralph Loop (`agents/workflows/ralph_loop.py`) is the main workflow. It takes a task string and produces a merged PR:
```mermaid
flowchart TD
    START(["START"]) --> plan_node
    plan_node["plan_node\nPlannerAgent → PlanOutput"]
    plan_node --> route_plan{{"bug-fix task?"}}
    route_plan -- "yes" --> reproduce_node["reproduce_node\nRun pytest, capture traceback"]
    route_plan -- "no" --> implement_node
    reproduce_node --> implement_node
    implement_node["implement_node\nImplementerAgent → ImplementOutput\nwrites FileChange[] to disk"]
    implement_node --> validate_node
    validate_node["validate_node\npytest + ruff + arch_lint\n→ ValidationOutput"]
    validate_node --> route_validate{{"next_action?"}}
    route_validate -- "retry (max 5)" --> implement_node
    route_validate -- "escalate" --> human_checkpoint["human_checkpoint\nEscalated to human"]
    route_validate -- "proceed" --> perf_validate_node
    perf_validate_node["perf_validate_node\nBenchmark comparison"]
    perf_validate_node --> ui_validate_node
    ui_validate_node["ui_validate_node\nOptional: Playwright screenshots"]
    ui_validate_node --> open_pr_node
    open_pr_node["open_pr_node\ngit commit + gh pr create"]
    open_pr_node --> review_loop_node
    review_loop_node["review_loop_node\nReviewerAgent → ReviewOutput"]
    review_loop_node --> route_review{{"approved?"}}
    route_review -- "no (max 3)" --> implement_node
    route_review -- "yes" --> merge_node
    merge_node["merge_node\ngh pr merge --squash"]
    merge_node --> DONE(["DONE"])
    human_checkpoint --> post_mortem_node["post_mortem_node\nAnalyze failure → create\nharness-improvement issue"]
    post_mortem_node --> DONE
    style plan_node fill:#4a9eff,color:#fff
    style reproduce_node fill:#9b59b6,color:#fff
    style implement_node fill:#7c5cbf,color:#fff
    style validate_node fill:#2e8b57,color:#fff
    style perf_validate_node fill:#1abc9c,color:#fff
    style ui_validate_node fill:#17a2b8,color:#fff
    style open_pr_node fill:#d4a017,color:#fff
    style review_loop_node fill:#e67e22,color:#fff
    style merge_node fill:#27ae60,color:#fff
    style human_checkpoint fill:#c0392b,color:#fff
    style post_mortem_node fill:#e74c3c,color:#fff
```
Conditional routing is driven entirely by typed model fields — no string matching:
- `ValidationOutput.next_action`: `"proceed" | "retry" | "escalate"`
- `ReviewOutput.approved`: `true` → merge, `false` → address feedback
Entry point:
```python
result = await run_ralph_loop("Fix the off-by-one error in utils/counter.py")
# result.status == "done"
# result.pr_url == "https://github.com/org/repo/pull/42"
# result.estimated_cost_usd == 0.012
```

Six specialized workers, each returning typed Pydantic models:
| Worker | Input | Output | Uses LLM? |
|---|---|---|---|
| Planner | Task + TaskContext | PlanOutput (steps, risk, domains) | Yes |
| Implementer | Task + Plan + prior failures | ImplementOutput (FileChange[], commit msg) | Yes |
| Reviewer | PR diff + task context | ReviewOutput (approved, comments, blocking issues) | Yes |
| Validator | (runs tools directly) | ValidationOutput (test/lint results, next_action) | No |
| Cleaner | Scan report + domains | CleanupOutput (violations, quality scores, PR recs) | Yes |
| Post-Mortem | Task + error_log + iteration count | HarnessImprovementOutput (failure category, root cause, suggested fix) | Yes |
The Validator is deliberately deterministic — it runs pytest and lint tools, then calls a pure function (`determine_next_action()`) to decide the next step. No LLM call, no ambiguity.
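A plausible sketch of that pure function follows; the real signature in the validator worker may differ, and the iteration threshold here simply mirrors the `MAX_IMPLEMENT_ITERATIONS = 5` guard described below.

```python
from typing import Literal

# Illustrative sketch of determine_next_action(); the actual implementation
# lives in the validator worker and may take richer inputs.
def determine_next_action(
    overall_pass: bool,
    iteration_count: int,
    max_iterations: int = 5,
) -> Literal["proceed", "retry", "escalate"]:
    if overall_pass:
        return "proceed"          # tests and lint are green
    if iteration_count < max_iterations:
        return "retry"            # give the implementer another attempt
    return "escalate"             # out of budget: hand off to a human

assert determine_next_action(True, 1) == "proceed"
assert determine_next_action(False, 2) == "retry"
assert determine_next_action(False, 5) == "escalate"
```

Because this is a pure function, the retry/escalate behavior is testable without any LLM in the loop.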
Each LLM-based worker:
- Loads a system prompt from `agents/prompts/*.txt`
- Uses `get_model()` (Gemini 3.0 Flash via Vertex AI)
- Returns `(TypedOutput, TokenUsage)` for cost tracking
- Has `retries=3` for transient failures
Every agent-to-agent handoff is a Pydantic model. Here are the key types:
```python
# Planning
class ExecutionStep(BaseModel):
    description: str
    files_affected: list[str]
    tool: Literal["fs", "shell", "git", "browser", "observability", "index"]
    expected_output: str

class PlanOutput(BaseModel):
    task_summary: str
    steps: list[ExecutionStep]
    risk_level: Literal["low", "medium", "high"]
    requires_human_review: bool
    requires_browser_validation: bool
    affected_domains: list[str]

# Implementation
class FileChange(BaseModel):
    path: str
    operation: Literal["create", "modify", "delete"]
    content: str | None
    diff_summary: str

class ImplementOutput(BaseModel):
    files_changed: list[FileChange]
    commit_message: str
    test_commands: list[str]

# Validation (drives routing)
class ValidationOutput(BaseModel):
    tests: TestResult
    lint: LintResult
    arch_lint: LintResult
    overall_pass: bool
    next_action: Literal["proceed", "retry", "escalate"]  # ← routing signal
    failure_summary: str

# Review (drives merge decision)
class ReviewOutput(BaseModel):
    approved: bool  # ← merge gate
    comments: list[ReviewComment]
    blocking_issues: list[str]
    summary: str
    arch_violations: list[str]
```

The `next_action` and `approved` fields are what drive LangGraph's conditional edges. Pure type routing — no string parsing.
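The routing itself reduces to a function from a typed field to a node name. The sketch below shows the idea standalone so it runs without the framework; in the real graph a function like this would be passed to LangGraph's conditional-edge mechanism, and the node names are taken from the Ralph Loop diagram above.

```python
from typing import Literal

from pydantic import BaseModel

# Simplified stand-in for the real ValidationOutput contract.
class ValidationOutput(BaseModel):
    overall_pass: bool
    next_action: Literal["proceed", "retry", "escalate"]

def route_validate(state: dict) -> str:
    # Reads a typed field; never inspects raw LLM text.
    out: ValidationOutput = state["validation"]
    return {
        "proceed": "perf_validate_node",
        "retry": "implement_node",
        "escalate": "human_checkpoint",
    }[out.next_action]

ok = ValidationOutput(overall_pass=True, next_action="proceed")
assert route_validate({"validation": ok}) == "perf_validate_node"
```

A typo in `next_action` cannot reach the router: Pydantic rejects it at model construction, one boundary earlier.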
All agent capabilities are registered in a `ToolRegistry` singleton. The planner queries `REGISTRY.all_tools()` before creating a plan, ensuring it can only reference tools that actually exist.
| Category | Tool | Description |
|---|---|---|
| fs | `read_file(path)` | Read a file from the repo |
| | `write_file(path, content)` | Write content, create parent dirs |
| | `list_dir(path)` | List directory contents |
| | `search_repo(query, pattern)` | Ripgrep search across repo |
| | `search_symbol(name)` | O(1) lookup in repo index |
| | `reindex(paths)` | Update symbol index for changed files |
| shell | `run_tests(path)` | Run pytest, return structured TestResult |
| | `run_lint(path)` | Run ruff + arch_lint + golden_lint |
| | `run_build()` | Build the application |
| | `run_command(cmd)` | Run arbitrary shell command |
| git | `git_status()` | Branch, changed, staged, untracked files |
| | `commit(message, files)` | Stage specific files and commit |
| | `open_pr(title, body)` | Create PR via gh CLI |
| | `get_pr_diff(pr_number)` | Fetch PR diff |
| | `get_pr_comments(pr_number)` | Fetch review comments |
| | `merge_pr(pr_number, strategy)` | Merge PR (squash/merge) |
| | `get_pr_metadata(pr_number)` | Fetch PR branch, title, body, labels |
| | `reply_to_pr_comment(pr_number, comment_id, body)` | Reply to a review comment |
| | `add_pr_label(pr_number, label)` | Add a label to a PR |
| | `push_to_remote(branch)` | Push branch to origin |
| | `create_issue(title, body, labels)` | Create a GitHub issue |
| browser | `take_screenshot(url)` | Playwright screenshot (base64 PNG) |
| | `snapshot_dom(url)` | Capture accessibility tree |
| | `drive_ui_flow(url, steps)` | Execute UI action sequence |
| observability | `query_logs(logql)` | Query VictoriaLogs (LogQL syntax) |
| | `query_metrics(promql)` | Query VictoriaMetrics (PromQL syntax) |
Every tool returns a typed Pydantic model — TestResult, CommitResult, PRResult, ScreenshotResult, etc. No raw strings.
```python
# Tool capability metadata — used by planner to understand what's available
class ToolCapability(BaseModel):
    name: str
    description: str
    input_schema: dict
    output_type: str
    category: Literal["fs", "shell", "git", "browser", "observability", "index"]
    requires_sandbox: bool
```

Hard limits enforced at the entry of every LangGraph node via `pre_node_guard()`. These are constants, not config — intentionally not tunable at runtime:
| Guard | Value | Scope | On Breach |
|---|---|---|---|
| `MAX_IMPLEMENT_ITERATIONS` | 5 | implement → validate loops | escalate |
| `MAX_REVIEW_ITERATIONS` | 3 | review → fix loops | escalate |
| `MAX_TOOL_CALLS_PER_NODE` | 50 | tools per LangGraph node | abort |
| `MAX_TOTAL_TOOL_CALLS` | 200 | tools across entire run | abort |
| `MAX_COST_USD_PER_RUN` | $2.00 | cost ceiling per workflow | escalate |
```python
def check_guards(state: RalphState) -> GuardResult:
    """Checked at every node entry. Returns allowed/escalate/abort."""
    if state.iteration_count >= MAX_IMPLEMENT_ITERATIONS:
        return GuardResult(allowed=False, action="escalate")
    if state.estimated_cost_usd >= state.cost_budget_usd:
        return GuardResult(allowed=False, action="escalate")
    ...
```

When an agent can't solve a problem within bounds, it escalates to a human rather than burning tokens indefinitely.
Every workflow run tracks token usage and cost, producing a `RunMetrics` report:
```python
class TokenUsage(BaseModel):
    tokens_in: int
    tokens_out: int

    def cost_usd(self) -> float:
        """Gemini 3.0 Flash: $0.25/1M input, $1.50/1M output"""
        return (self.tokens_in * 0.25 + self.tokens_out * 1.50) / 1_000_000

class RunMetrics(BaseModel):
    cost: CostSummary
    per_node_costs: dict[str, CostSummary]  # Cost per LangGraph node
    highest_cost_node: str                  # Where most tokens were spent
```

Example run breakdown:
| Node | Input Tokens | Output Tokens | Cost USD |
|---|---|---|---|
| plan_node | 2,100 | 800 | $0.0017 |
| implement_node | 4,500 | 2,200 | $0.0044 |
| validate_node | 0 | 0 | $0.0000 |
| review_node | 3,800 | 1,100 | $0.0026 |
| TOTAL | 10,400 | 4,100 | $0.0087 |
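The TOTAL row can be re-derived from the pricing formula above, which is a useful sanity check when tuning prompts:

```python
# Re-deriving the table's total from the stated Gemini 3.0 Flash pricing:
# $0.25 per 1M input tokens, $1.50 per 1M output tokens.
def cost_usd(tokens_in: int, tokens_out: int) -> float:
    return (tokens_in * 0.25 + tokens_out * 1.50) / 1_000_000

total = cost_usd(10_400, 4_100)          # 2600/1M + 6150/1M dollars
assert abs(total - 0.00875) < 1e-9       # ≈ $0.0087, the TOTAL row above
```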
Cost data flows to Logfire, building a dataset of cost-per-PR-by-task-type for regression tracking.
When a human reviewer requests changes on an agent PR, the feedback loop workflow (`agents/workflows/feedback_loop.py`) autonomously addresses the comments:
```mermaid
flowchart TD
    START(["PR Review: changes_requested"]) --> gather
    gather["gather_feedback_node\nCollect review comments"]
    gather --> implement
    implement["implement_feedback_node\nPlanner + Implementer address comments"]
    implement --> validate
    validate["validate_feedback_node\npytest + ruff + arch_lint"]
    validate --> route{{"next_action?"}}
    route -- "retry (max 5)" --> implement
    route -- "proceed" --> push["commit_push_node\nCommit + push to PR branch"]
    route -- "escalate" --> END_ESC(["ESCALATED"])
    push --> reply["reply_node\nReply to each review comment"]
    reply --> DONE(["DONE"])
    style gather fill:#4a9eff,color:#fff
    style implement fill:#7c5cbf,color:#fff
    style validate fill:#2e8b57,color:#fff
    style push fill:#d4a017,color:#fff
    style reply fill:#e67e22,color:#fff
```
Triggers:
- `pull_request_review` event with `changes_requested` action
- `/feedback` comment on an agent PR
Safety: Max 3 feedback iterations per PR (tracked via `feedback-iteration-N` labels). After 3 rounds, the agent escalates to a human.
```bash
# CLI usage
ouroboros feedback 42   # Address feedback on PR #42
```

When the Ralph Loop escalates to a human (guard limits hit, repeated validation failures), the post-mortem agent analyzes the failure and creates a harness-improvement GitHub issue with a concrete fix suggestion:
```mermaid
flowchart LR
    Failure["Agent escalates\n(hit guard limit)"] --> PostMortem["Post-Mortem Agent\nAnalyze failure"]
    PostMortem --> Issue["GitHub Issue\nlabel: harness-improvement"]
    Issue --> Agent["Agent picks up issue\n(harness-fix.yml)"]
    Agent --> PR["PR to fix\nthe harness itself"]
    style PostMortem fill:#e74c3c,color:#fff
    style Issue fill:#d4a017,color:#fff
    style Agent fill:#7c5cbf,color:#fff
```
The post-mortem agent categorizes failures into:
- `missing_tool` — agent needed a capability that doesn't exist
- `bad_prompt` — system prompt led to incorrect behavior
- `insufficient_context` — context builder didn't provide enough info
- `guard_limit` — legitimate task exceeded hard limits
- `validation_loop` — stuck in implement/validate cycle
- `external_dependency` — external service failure
Each category maps to a priority and a suggested fix targeting specific files in the harness.
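As an illustration of that mapping, a table-driven triage function might look like the sketch below. The priority values here are hypothetical; the real assignments live in the post-mortem worker.

```python
# Hypothetical category → priority table; the real values may differ.
PRIORITY = {
    "missing_tool": "high",            # blocks entire classes of tasks
    "bad_prompt": "high",              # fixable in agents/prompts/*.txt
    "insufficient_context": "medium",  # context builder tuning
    "guard_limit": "medium",           # may warrant a task-splitting strategy
    "validation_loop": "medium",       # implement/validate thrash
    "external_dependency": "low",      # usually transient, retry later
}

def triage(category: str) -> str:
    # Unknown categories get a safe default rather than crashing triage.
    return PRIORITY.get(category, "medium")

assert triage("missing_tool") == "high"
assert triage("external_dependency") == "low"
```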
Entropy is tracked as a first-class concern through ten Golden Principles — machine-checkable rules enforced by `lint/golden_lint.py` and a daily GC workflow:
| Principle | Rule | Severity | Auto-fixable |
|---|---|---|---|
| GP-001 | No duplicate utility functions across packages | error | Yes |
| GP-002 | No file exceeds 500 lines | warning | No |
| GP-003 | No hand-rolled helpers duplicating shared packages | warning | No |
| GP-004 | All external data validated at boundary (Pydantic) | error | No |
| GP-005 | No `print()` outside `scripts/` — use structured logging | info | Yes |
| GP-006 | Schema types follow `*Output`/`*Result`/`*Schema` naming | info | No |
| GP-007 | No dead imports | info | Yes |
| GP-008 | All docs reference real code that still exists | warning | No |
| GP-009 | Active exec-plans updated within 7 days | warning | No |
| GP-010 | `QUALITY_SCORE.md` regenerated within 24 hours | info | Yes |
The entropy GC workflow (`agents/workflows/entropy_gc.py`) runs daily via GitHub Actions:
```mermaid
flowchart TD
    scan["entropy_scan_node\nRun all linters, collect violations"]
    scan --> analyze["analyze_violations_node\nCleanerAgent clusters violations\nby principle + domain"]
    analyze --> prs["open_cleanup_prs_node\nOne atomic PR per violation cluster"]
    prs --> score["update_quality_score_node\nWrite docs/QUALITY_SCORE.md"]
    style scan fill:#c0392b,color:#fff
    style analyze fill:#7c5cbf,color:#fff
    style prs fill:#d4a017,color:#fff
    style score fill:#27ae60,color:#fff
```
Each cleanup PR is:
- Atomic — one principle violation cluster per PR
- Auto-mergeable — if tests pass, no human required
- Tiny — <1 minute review time
The repo index (`repo_index/`) provides O(1) symbol lookup so agents don't need to read every file:
```python
# Instead of reading 50 files to find a class:
@tool
def search_symbol(name: str) -> SymbolLocation | None:
    """Look up a symbol by name. Returns file + line."""
    # O(1) lookup in symbols.json
```

Generated files:

- `symbols.json` — Symbol name to file + line + kind (class, function, constant)
- `file_map.json` — File path to domain, layer, imports, exports
```json
// symbols.json (189 symbols indexed across 47 files)
{
  "ValidationOutput": {"file": "agents/models/validator.py", "line": 42, "kind": "class"},
  "run_planner": {"file": "agents/workers/planner.py", "line": 18, "kind": "async_function"}
}

// file_map.json
{
  "agents/workers/planner.py": {
    "domain": "agents",
    "layer": "workers",
    "imports": ["agents.models.planner", "agents.core.config"],
    "exports": ["run_planner"]
  }
}
```

The index is rebuilt automatically on every merge to main via CI, and agents can call `reindex()` after writing files.
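A symbols index of this shape can be produced with the stdlib `ast` module alone. This is a minimal sketch, not the real `repo_index/build_index.py`, which records more metadata (constants, the file map, import/export edges):

```python
import ast

# Sketch: extract class/function/async-function symbols from one source file.
def index_source(path: str, source: str) -> dict[str, dict]:
    symbols = {}
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            kind = {
                ast.ClassDef: "class",
                ast.FunctionDef: "function",
                ast.AsyncFunctionDef: "async_function",
            }[type(node)]
            symbols[node.name] = {"file": path, "line": node.lineno, "kind": kind}
    return symbols

src = "class ValidationOutput: ...\nasync def run_planner(): ...\n"
idx = index_source("agents/models/validator.py", src)
assert idx["ValidationOutput"]["kind"] == "class"
assert idx["run_planner"]["kind"] == "async_function"
```

Merging the per-file dicts and dumping them with `json.dump` yields a `symbols.json`-style file; lookup is then a plain dict access, which is the O(1) behavior `search_symbol()` relies on.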
Agents never receive raw file dumps. The context builder (`agents/core/context_builder.py`) produces a token-budgeted context package:
```mermaid
flowchart TD
    Task["Task: 'Fix the login endpoint validation'"]
    Task --> build["build_context(task)"]
    build --> repo["Query repo index\nfor relevant files"]
    build --> arch["Load arch rules\nfor touched layers"]
    build --> tools["Query tool registry\nfor available capabilities"]
    repo --> ctx
    arch --> ctx
    tools --> ctx
    ctx["TaskContext\nrelevant_files, relevant_docs\narch_rules, active_plans\navailable_tools, token_budget"]
    style build fill:#4a9eff,color:#fff
    style ctx fill:#27ae60,color:#fff
```
The context builder is the gatekeeper for token spend. Without it, agents read 50 files and burn context on noise. With it, they receive exactly what they need within budget.
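The core of budget enforcement can be sketched as a greedy cutoff over relevance-ranked files. This is an illustration only; the real `build_context()` is more sophisticated, and the chars-per-token heuristic below is an assumption.

```python
# Simplified budget enforcement: add files in relevance order until the
# token budget is exhausted. Names and heuristic are illustrative.
def fit_to_budget(files: list[tuple[str, str]], budget_tokens: int) -> list[str]:
    """files: (path, content) pairs, most relevant first."""
    chosen, spent = [], 0
    for path, content in files:
        tokens = len(content) // 4  # rough chars-per-token heuristic
        if spent + tokens > budget_tokens:
            break                   # budget hit: lower-relevance files are dropped
        chosen.append(path)
        spent += tokens
    return chosen

files = [("a.py", "x" * 4000), ("b.py", "y" * 4000), ("c.py", "z" * 4000)]
assert fit_to_budget(files, 2500) == ["a.py", "b.py"]  # c.py would overflow
```

The important property is that the cutoff is applied before the agent sees anything, so context size is bounded regardless of repo size.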
Four complementary linters work together to enforce code quality:
AST-based layer dependency checker. Every violation includes an actionable remediation message:
```text
ARCH-VIOLATION: agents/workers/planner.py imports from agents/workers/reviewer.py
RULE: Workers cannot cross-import. Extract shared logic to agents/core/.
REMEDIATION: Move shared type X to agents/models/shared.py and import from there.
DOCS: See ARCHITECTURE.md#worker-isolation
```
Enforces the 10 Golden Principles. Detection methods:

- GP-001: AST body comparison (`ast.unparse()`) across all functions
- GP-002: Line counting
- GP-003: Pattern matching (`while` + `sleep()` = hand-rolled retry)
- GP-004: `json.loads()` without `model_validate()` call
- GP-005: `print()` call detection outside allowed directories
- GP-006: BaseModel subclass suffix validation in `agents/models/`
- GP-007: Delegates to `ruff check --select F401`
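The GP-001 technique is worth seeing concretely: two functions count as duplicates when their unparsed AST bodies are identical, regardless of the function name. A simplified sketch, assuming exact-body matching (the real linter may normalize further):

```python
import ast
from collections import defaultdict

# Sketch of GP-001: group functions by the unparsed source of their bodies.
def duplicate_functions(sources: dict[str, str]) -> list[tuple[str, ...]]:
    bodies = defaultdict(list)
    for path, src in sources.items():
        for node in ast.walk(ast.parse(src)):
            if isinstance(node, ast.FunctionDef):
                # Normalize the body back to source text via the AST.
                key = ast.unparse(ast.Module(body=node.body, type_ignores=[]))
                bodies[key].append(f"{path}:{node.name}")
    return [tuple(group) for group in bodies.values() if len(group) > 1]

dupes = duplicate_functions({
    "a.py": "def slug(s):\n    return s.lower()\n",
    "b.py": "def slugify(s):\n    return s.lower()\n",
})
assert dupes == [("a.py:slug", "b.py:slugify")]
```

Note this exact-match variant misses duplicates that differ only in parameter names; catching those would require renaming-insensitive normalization.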
Cross-references backtick paths in .md files against the repo index, ensuring documentation references real code that still exists.
Centralized rule definitions with AGENT_REMEDIATION fields:
```python
@dataclass
class LintRule:
    id: str                 # "ARCH-001", "GP-005"
    name: str               # "worker-cross-import"
    description: str        # Human-readable
    severity: str           # "error", "warning", "info"
    agent_remediation: str  # Agent reads this to self-fix
    docs_link: str          # Reference to docs
    auto_fixable: bool      # Can agent fix without human?
```

Two layers of observability — one for the agent system, one for the applications agents build:
```mermaid
flowchart LR
    subgraph agent_obs["Agent Observability"]
        PydanticAI --> Logfire["Logfire\n(auto-instrumented)"]
        Logfire --> traces["Model calls · Tool calls\nNode transitions · RunMetrics"]
    end
    subgraph app_obs["App Observability"]
        App --> Vector
        Vector --> VLogs["VictoriaLogs\nLogQL @ :9428"]
        Vector --> VMetrics["VictoriaMetrics\nPromQL @ :8428"]
        VLogs --> Grafana[":3000"]
        VMetrics --> Grafana
    end
    VLogs -. "query_logs()" .-> Agents["Agent Tools"]
    VMetrics -. "query_metrics()" .-> Agents
    style Logfire fill:#ff6b35,color:#fff
    style Grafana fill:#f46800,color:#fff
```
Agents can query the observability stack to diagnose issues — the same way humans do:
```python
# Agent querying logs to diagnose an error
logs = await query_logs('{service="api"} |= "error"', duration="1h")

# Agent checking request latency
metrics = await query_metrics('rate(http_requests_total[5m])', duration="1h")
```

```yaml
# docker-compose.yml
services:
  vector:           # Log/metric aggregation (port 8686)
  victoria-logs:    # LogQL log storage (port 9428, 7-day retention)
  victoria-metrics: # PromQL metric storage (port 8428, 7-day retention)
  grafana:          # Dashboard (port 3000)
```

Each agent worktree gets an isolated Docker environment:
```bash
# Spin up isolated env for a task
scripts/worktree_up.sh feature-login 8100
# Creates git worktree at ../ouroboros-feature-login
# Starts sandbox containers on port 8100+

# Tear down when done
scripts/worktree_down.sh feature-login
```

Worktrees get isolated Docker networks (`ouroboros-{name}`), unique port allocations, and separate Vector instances forwarding to the main observability stack.
Each isolated worktree can boot its own application and observability stack with automatically allocated ports. Port offsets are deterministically derived from the worktree name (`sha256(name) % 900 + 100`) to prevent collisions between parallel agent runs.
```bash
# Automatic port allocation from worktree name
scripts/worktree_up.sh feature-login
# App:             http://localhost:8247
# VictoriaLogs:    http://localhost:9547
# VictoriaMetrics: http://localhost:8647

# CLI with --with-app flag for full Docker lifecycle
ouroboros run --worktree --with-app "Fix the login form"
```

The `--with-app` flag in the CLI manages the full Docker lifecycle:
- Creates git worktree with unique branch
- Computes deterministic port offset
- Sets `APP_URL`, `VICTORIA_LOGS_URL`, `VICTORIA_METRICS_URL` env vars
- Starts Docker stack (`docker-compose.yml` + `docker-compose.worktree.yml`)
- Runs the Ralph Loop with full observability
- Tears down containers and worktree on completion
The observability tools (`query_logs()`, `query_metrics()`) read endpoint URLs from environment variables at call time (not import time), so each worktree's agent queries its own isolated stack.
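The stated offset derivation can be checked in a few lines. The hexdigest-to-int interpretation of "sha256(name)" below is an assumption; the text only specifies `sha256(name) % 900 + 100`.

```python
import hashlib

# Derive a stable per-worktree offset in [100, 999], added to each base port.
# Interpreting sha256(name) as the hex digest converted to an integer is an
# assumption of this sketch.
def port_offset(name: str) -> int:
    digest = hashlib.sha256(name.encode()).hexdigest()
    return int(digest, 16) % 900 + 100

offset = port_offset("feature-login")
assert 100 <= offset <= 999
assert port_offset("feature-login") == offset  # deterministic: same name, same ports
```

Because the offset depends only on the name, two parallel runs collide on ports only if they share a worktree name, which the unique-branch scheme prevents.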
The `ouroboros` CLI (`agents/cli.py`) provides the primary interface for running agent workflows:
```bash
# Run the Ralph Loop on a task
ouroboros run "Add a /health endpoint that returns 200"

# Run in an isolated worktree with app booting
ouroboros run --worktree --with-app "Fix the login form"

# Address human review feedback on a PR
ouroboros feedback 42

# Run entropy GC (daily cleanup)
ouroboros gc

# Update quality scores only (no PRs)
ouroboros gc --scores-only

# List active agent worktrees
ouroboros status
```

| Command | Description |
|---|---|
| `ouroboros run <task>` | Run the Ralph Loop — plan, implement, validate, review, merge |
| `ouroboros run --worktree <task>` | Run in an isolated git worktree |
| `ouroboros run --worktree --with-app <task>` | Run with full Docker app + observability |
| `ouroboros feedback <pr-number>` | Address review comments on an agent PR |
| `ouroboros gc` | Run entropy GC — scan, cleanup, open PRs |
| `ouroboros gc --scores-only` | Update quality scores without opening PRs |
| `ouroboros status` | List active ouroboros worktrees |
64 tests organized into two categories, all runnable without GCP credentials or `pydantic_ai` installed:
Lint tests (`tests/lint/`):

| Test File | Tests | Coverage |
|---|---|---|
| `test_arch_lint.py` | 5 | Worker cross-import, tool→worker import, clean files, remediation messages |
| `test_golden_lint.py` | 12 | GP-001 through GP-006 (duplicates, file size, hand-rolled retry, validation, print, naming) |
Agent eval tests (`tests/agent_eval/`):

| Test File | Tests | Coverage |
|---|---|---|
| `test_guards.py` | 6 | All guard types: iteration limits, tool budget, cost ceiling |
| `test_validator_logic.py` | 6 | `determine_next_action()` routing: proceed, retry, escalate |
| `test_bug_fix.py` | 5 | Model contracts for bug-fix workflow (PlanOutput, ImplementOutput, ValidationOutput) |
| `test_feature_gen.py` | 5 | Model contracts for feature generation (plan, implement, review) |
| `test_entropy_gc.py` | 8 | Entropy violation models, cleanup output, quality scoring, clustering |
| `test_workflow_routing.py` | 8 | Reproduce node routing, feedback loop guards, post-mortem triggering |
| `test_post_mortem.py` | 5 | Failure categorization, harness improvement output, issue creation |
| `test_feedback_loop.py` | 4 | Feedback state transitions, comment reply logic, iteration tracking |
Design principle: Deterministic logic (guards, validator routing, model contracts) is tested without an LLM. Probabilistic behavior (actual agent runs) uses mocks in CI and requires GCP credentials for integration testing.
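A test in this style exercises the guard logic as a pure function, with no LLM, credentials, or mocks involved. The `guard_allows` helper below is an illustrative stand-in for the real guard check, not its actual signature:

```python
# Illustrative deterministic test, in the style of tests/agent_eval/test_guards.py.
MAX_IMPLEMENT_ITERATIONS = 5

def guard_allows(iteration_count: int, cost_usd: float, budget_usd: float = 2.00) -> bool:
    # Both limits are hard constants: cross either one and the run must stop.
    return iteration_count < MAX_IMPLEMENT_ITERATIONS and cost_usd < budget_usd

def test_iteration_limit():
    assert guard_allows(4, 0.10)       # one attempt left
    assert not guard_allows(5, 0.10)   # limit reached: escalate

def test_cost_ceiling():
    assert guard_allows(1, 1.99)
    assert not guard_allows(1, 2.00)   # budget exhausted

test_iteration_limit()
test_cost_ceiling()
```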
```bash
# Run all tests
uv run pytest tests/ -v

# Run only lint tests
uv run pytest tests/lint/ -v

# Run only agent eval tests
uv run pytest tests/agent_eval/ -v
```

Ten foundational principles that guide every design decision (from `docs/design-docs/core-beliefs.md`):
| # | Belief | Implication |
|---|---|---|
| 1 | Structure over text | Pydantic models at every boundary |
| 2 | Planner is not omniscient | Must query REGISTRY.all_tools() before planning |
| 3 | Token budget is first-class | build_context() enforces limits |
| 4 | Guards are not suggestions | Hard constants, not runtime config |
| 5 | Repo index is the map | search_symbol() over read_file() |
| 6 | Entropy accumulates | Daily GC workflow prevents compounding |
| 7 | Every run has a cost | CostSummary tracks regression |
| 8 | Self-referential loop is the feature | Agents improve agents, through PR review |
| 9 | Observability is a tool | Agents query query_logs() / query_metrics() |
| 10 | Small, atomic, reversible | 10 small PRs > 1 large PR |
| Layer | Tool | Why |
|---|---|---|
| Language Model | Gemini 3.0 Flash via Vertex AI | Production-grade rate limits, IAM auth, regional isolation |
| Agent Framework | PydanticAI | Typed structured outputs, native Logfire tracing |
| Orchestration | LangGraph | Explicit state machine, conditional routing, human escalation |
| Tracing | Logfire | First-class PydanticAI instrumentation, OTel native |
| Language | Python 3.12+ | |
| Package Manager | uv | Fast, lockfile-based dependency resolution |
| Linter/Formatter | ruff | Covers isort + flake8 + pyupgrade + more |
| Tests | pytest + pytest-asyncio | Async-native test runner |
| Git Automation | gh CLI | Programmatic PR create/review/merge |
| Browser | Playwright | DOM snapshots + screenshots for UI validation |
| Log Storage | VictoriaLogs | LogQL-compatible, queryable by agents |
| Metric Storage | VictoriaMetrics | PromQL-compatible, queryable by agents |
| Log Routing | Vector | Routes app logs/metrics to storage |
| Dashboards | Grafana | Human-facing visualization |
| CI | GitHub Actions | Lint + tests on every PR, entropy GC daily |
| Build | hatchling | PEP 517 build backend |
ouroboros/
├── AGENTS.md # Agent entry point (read first)
├── ARCHITECTURE.md # Layer dependency rules
├── README.md # This file
│
├── agents/
│ ├── core/
│ │ ├── config.py # Vertex AI + Gemini model init
│ │ ├── state.py # RalphState + FeedbackState TypedDicts
│ │ ├── guards.py # Hard iteration/cost limits
│ │ ├── context_builder.py # build_context() → TaskContext
│ │ ├── paths.py # repo_root() utility
│ │ └── instrumentation.py # Logfire setup
│ │
│ ├── models/ # Pure Pydantic output types
│ │ ├── planner.py # PlanOutput, ExecutionStep
│ │ ├── implementer.py # ImplementOutput, FileChange
│ │ ├── reviewer.py # ReviewOutput, ReviewComment
│ │ ├── validator.py # ValidationOutput, TestResult, LintResult
│ │ ├── cleaner.py # CleanupOutput, EntropyViolation
│ │ ├── cost.py # TokenUsage, CostSummary, RunMetrics
│ │ ├── registry.py # ToolCapability, ToolRegistry
│ │ ├── post_mortem.py # HarnessImprovementOutput, FailureCategory
│ │ └── reproducer.py # ReproductionResult, ErrorContext
│ │
│ ├── workers/ # PydanticAI agent implementations
│ │ ├── planner.py # run_planner() → (PlanOutput, TokenUsage)
│ │ ├── implementer.py # run_implementer() → (ImplementOutput, TokenUsage)
│ │ ├── reviewer.py # run_reviewer() → (ReviewOutput, TokenUsage)
│ │ ├── validator.py # run_validator() → ValidationOutput (no LLM)
│ │ ├── cleaner.py # run_cleaner() → (CleanupOutput, TokenUsage)
│ │ └── post_mortem.py # run_post_mortem() → (HarnessImprovementOutput, TokenUsage)
│ │
│ ├── tools/ # @tool functions + registry
│ │ ├── registry.py # REGISTRY singleton, all tools registered
│ │ ├── fs.py # read_file, write_file, search_symbol
│ │ ├── shell.py # run_tests, run_lint, run_build
│ │ ├── git.py # git_status, commit, open_pr, merge_pr, get_pr_metadata, create_issue
│ │ ├── browser.py # take_screenshot, snapshot_dom
│ │ ├── observability.py # query_logs, query_metrics
│ │ └── benchmark.py # run_benchmark, compare_benchmarks
│ │
│ ├── workflows/ # LangGraph state machines
│ │ ├── ralph_loop.py # Main PR lifecycle workflow
│ │ ├── feedback_loop.py # PR feedback loop workflow
│ │ ├── post_mortem.py # Struggle-driven self-improvement node
│ │ ├── reviewer_loop.py # Agent-to-agent review
│ │ └── entropy_gc.py # Daily entropy scan + cleanup PRs
│ │
│ ├── prompts/ # System prompt .txt files
│ │ ├── planner.txt
│ │ ├── implementer.txt
│ │ ├── reviewer.txt
│ │ ├── cleaner.txt
│ │ └── post_mortem.txt
│ │
│ └── cli.py # ouroboros CLI (run, feedback, gc, status)
│
├── lint/
│ ├── arch_lint.py # AST-based layer dependency checker
│ ├── golden_lint.py # GP-001 through GP-010 enforcement
│ ├── doc_lint.py # Stale doc reference detection
│ ├── rules.py # Named rules with AGENT_REMEDIATION
│ └── run_lint.py # CLI runner
│
├── repo_index/
│ ├── build_index.py # Generates symbols.json + file_map.json
│ ├── symbols.json # Symbol → file + line + kind
│ └── file_map.json # File → domain, layer, imports, exports
│
├── tests/
│ ├── lint/ # Linter unit tests
│ │ ├── test_arch_lint.py
│ │ └── test_golden_lint.py
│ └── agent_eval/ # Agent behavior tests
│ ├── test_guards.py
│ ├── test_validator_logic.py
│ ├── test_bug_fix.py
│ ├── test_feature_gen.py
│ ├── test_entropy_gc.py
│ ├── test_workflow_routing.py
│ ├── test_post_mortem.py
│ └── test_feedback_loop.py
│
├── harness/
│ ├── observability/
│ │ └── docker-compose.yml # VictoriaLogs + VictoriaMetrics + Grafana
│ └── sandbox/
│ ├── docker-compose.yml # Per-worktree app isolation
│ └── docker-compose.worktree.yml # Per-worktree observability override
│
├── scripts/
│ ├── worktree_up.sh # Spin up isolated worktree env
│ └── worktree_down.sh # Tear down worktree + containers
│
├── docs/
│ ├── DESIGN.md # System design decisions
│ ├── GOLDEN_PRINCIPLES.md # GP-001 through GP-010
│ ├── QUALITY_SCORE.md # Auto-updated domain quality grades
│ ├── PLANS.md # How to read/write exec plans
│ └── design-docs/
│ └── core-beliefs.md # 10 foundational principles
│
├── .github/workflows/
│ ├── ci.yml # Lint + tests on every PR
│ ├── entropy_gc.yml # Daily entropy scan (6am UTC)
│ ├── issue-trigger.yml # /run-task comment → agent run
│ ├── run-task.yml # Reusable workflow for task execution
│ ├── pr-feedback.yml # PR review → feedback loop
│ └── harness-fix.yml # harness-improvement → auto-fix
│
└── pyproject.toml # uv + ruff + pytest config
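The `models/` layer in the tree above holds the pure Pydantic contracts agents exchange. A minimal sketch of what `FileChange` and `ImplementOutput` could look like — the class names come from the tree, but the field names are illustrative assumptions, not the actual definitions:

```python
# Hypothetical sketch of the typed contracts in agents/models/implementer.py.
# Class names come from the tree above; field names are assumptions.
from pydantic import BaseModel, Field


class FileChange(BaseModel):
    """One file edit proposed by the implementer agent."""
    path: str = Field(description="Repo-relative file path")
    new_content: str = Field(description="Full post-edit file content")


class ImplementOutput(BaseModel):
    """Structured implementer result -- no text parsing at the boundary."""
    changes: list[FileChange]
    summary: str


# Because the boundary is a Pydantic model, downstream agents validate
# structured data instead of regex-parsing the implementer's output:
out = ImplementOutput(
    changes=[FileChange(path="utils/counter.py", new_content="count = 0\n")],
    summary="Fix off-by-one error",
)
print(out.changes[0].path)  # utils/counter.py
```

Validation failures surface as `pydantic.ValidationError` at the agent boundary rather than as silent string-matching bugs downstream.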
- Python 3.12+
- uv (package manager)
- gh (GitHub CLI, for PR operations)
- Docker + Docker Compose (for observability stack)
- Google Cloud project with Vertex AI enabled (for real agent runs)
# Clone the repository
git clone https://github.com/Tanush1912/ouroboros.git
cd ouroboros
# Install all dependencies
uv sync --all-extras
# Build the repo index
uv run python repo_index/build_index.py
# Run tests to verify everything works
uv run pytest tests/ -v

# Set required environment variables
export GCP_PROJECT="your-gcp-project-id"
export GCP_LOCATION="us-central1" # optional, default
export LOGFIRE_TOKEN="your-logfire-token" # optional, for tracing
# Run a task through the Ralph Loop (CLI)
ouroboros run "Add a /health endpoint that returns 200"
# Run in an isolated worktree
ouroboros run --worktree "Fix the login form validation"
# Run with full app + observability stack
ouroboros run --worktree --with-app "Fix the login form"
# Address review feedback on a PR
ouroboros feedback 42
# Run entropy GC
ouroboros gc
# List active agent worktrees
ouroboros status

Or programmatically:
uv run python -c "
import asyncio
from agents.workflows.ralph_loop import run_ralph_loop
result = asyncio.run(run_ralph_loop('Add a /health endpoint that returns 200'))
print(f'Status: {result[\"status\"]}')
print(f'PR: {result[\"pr_url\"]}')
print(f'Cost: \${result[\"estimated_cost_usd\"]:.4f}')
"

# Start the monitoring stack
cd harness/observability
docker compose up -d
# Grafana at http://localhost:3000 (admin/admin)
# VictoriaLogs API at http://localhost:9428
# VictoriaMetrics API at http://localhost:8428

# Run all linters
uv run python lint/run_lint.py .
# Architecture lint only
uv run python lint/run_lint.py --arch-only .
# Golden principles lint only
uv run python lint/run_lint.py --golden-only .
# Ruff
uv run ruff check .
uv run ruff format --check .

| Variable | Required | Default | Description |
|---|---|---|---|
| `GCP_PROJECT` | Yes (for agent runs) | — | Google Cloud project ID |
| `GCP_LOCATION` | No | `us-central1` | Vertex AI region |
| `OUROBOROS_MODEL` | No | `gemini-3.0-flash-preview` | Model name |
| `LOGFIRE_TOKEN` | No | — | Logfire API token for tracing |
| `GITHUB_TOKEN` | No | — | GitHub CLI auth (for PR operations) |
| `VICTORIA_LOGS_URL` | No | `http://localhost:9428` | VictoriaLogs endpoint |
| `VICTORIA_METRICS_URL` | No | `http://localhost:8428` | VictoriaMetrics endpoint |
| `APP_URL` | No | — | Application URL for browser validation |
| `WORKTREE_NAME` | No | — | Current worktree name (set by CLI/scripts) |
| `APP_PORT` | No | `8000` | Application port (offset for worktrees) |
| `VICTORIA_LOGS_PORT` | No | `9428` | VictoriaLogs port (offset for worktrees) |
| `VICTORIA_METRICS_PORT` | No | `8428` | VictoriaMetrics port (offset for worktrees) |
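The optional variables in the table each carry a default, so configuration reduces to plain `os.environ` lookups. A hedged sketch of how a settings helper might resolve them — only the variable names and defaults come from the table; the function itself is hypothetical:

```python
# Hypothetical settings helper. Variable names and defaults are taken from
# the environment-variable table above; the function name is an assumption.
import os


def resolve_settings() -> dict[str, str]:
    """Resolve optional Ouroboros settings, falling back to documented defaults."""
    return {
        "GCP_LOCATION": os.environ.get("GCP_LOCATION", "us-central1"),
        "OUROBOROS_MODEL": os.environ.get("OUROBOROS_MODEL", "gemini-3.0-flash-preview"),
        "VICTORIA_LOGS_URL": os.environ.get("VICTORIA_LOGS_URL", "http://localhost:9428"),
        "VICTORIA_METRICS_URL": os.environ.get("VICTORIA_METRICS_URL", "http://localhost:8428"),
        "APP_PORT": os.environ.get("APP_PORT", "8000"),
    }


print(resolve_settings()["GCP_LOCATION"])  # us-central1 unless overridden
```

Required variables such as `GCP_PROJECT` have no fallback and should fail fast when missing.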
| File | Purpose |
|---|---|
| `pyproject.toml` | Dependencies, ruff config, pytest config |
| `agents/prompts/*.txt` | System prompts (edit to tune agent behavior) |
| `docs/GOLDEN_PRINCIPLES.md` | Quality rules (add new principles here) |
| `lint/rules.py` | Rule definitions with remediation messages |
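`lint/rules.py` pairs each named rule with a remediation message, so a failing check tells the agent how to fix the violation rather than just that one exists. A hypothetical sketch of the shape such a rule definition might take — the field names and example rule are assumptions, not the actual contents of the file:

```python
# Hypothetical sketch of a named lint rule carrying a machine-actionable
# remediation hint, as described for lint/rules.py. Fields are assumed.
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    agent_remediation: str  # an instruction an agent can act on directly


NO_CROSS_LAYER_IMPORT = Rule(
    rule_id="ARCH-001",
    description="models/ must not import from workers/",
    agent_remediation="Move shared types into agents/models/ and import them from there.",
)

print(NO_CROSS_LAYER_IMPORT.rule_id)  # ARCH-001
```

Emitting the remediation string alongside the violation is what lets a failed lint run feed straight back into the implementer loop without human translation.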
flowchart TD
subgraph ci["Every PR — ci.yml"]
lint["Lint Job\nruff + arch_lint + golden_lint"]
test["Test Job\n17 lint + 47 agent eval"]
end
subgraph merge["On Merge to Main"]
idx["Rebuild symbols.json\n+ file_map.json"]
end
subgraph gc["Daily @ 6am UTC — entropy_gc.yml"]
entropy["Entropy scan + cleanup PRs\n+ update QUALITY_SCORE.md"]
end
subgraph triggers["Event-Driven Triggers"]
issue["issue-trigger.yml\n/run-task comment"]
feedback["pr-feedback.yml\nchanges_requested review"]
harness["harness-fix.yml\nharness-improvement label"]
end
style ci fill:#4a9eff,color:#fff
style merge fill:#27ae60,color:#fff
style gc fill:#e67e22,color:#fff
style triggers fill:#7c5cbf,color:#fff
Ouroboros — the serpent that eats its own tail.
A system that builds, reviews, and improves itself.