Every LLM has blind spots. A single model gives you one perspective with no way to know what it missed, oversimplified, or got wrong. Self-critique helps, but models are biased toward agreeing with themselves.
duh uses a 4-phase protocol that forces genuine disagreement between models. The key insight: a challenged and revised answer is consistently stronger than a direct answer from any single model.
The strongest available model (selected by output cost as a capability proxy) generates an initial answer.
The proposer gets a system prompt that encourages thorough, specific answers with concrete examples and numbers. No hedging.
```
IDLE --> PROPOSE
```
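Selection by output cost can be sketched in a few lines. This is a minimal illustration of the "output cost as capability proxy" idea; the model names and per-million-token prices are invented, not duh's actual registry.

```python
# Hypothetical model registry; "output_cost" is price per million output tokens.
models = [
    {"name": "gpt-4o-mini", "output_cost": 0.60},
    {"name": "claude-sonnet", "output_cost": 15.00},
    {"name": "gpt-4o", "output_cost": 10.00},
]

def strongest(models):
    """Pick the model with the highest output cost (capability proxy)."""
    return max(models, key=lambda m: m["output_cost"])

print(strongest(models)["name"])  # -> claude-sonnet
```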
Multiple challenger models (default: 2) receive the proposal with explicit instructions to disagree:
- They must find at least one substantive disagreement (not a nitpick)
- They must not start with praise ("This is a good answer...")
- They must identify something wrong, oversimplified, or missing
- When the proposal recommends a particular approach, they must argue for an alternative
Challengers are selected to maximize diversity -- models from different providers are preferred over same-model self-critique. Challenges run in parallel for speed.
```
PROPOSE --> CHALLENGE
```
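The diversity preference can be expressed as a simple sort: challengers from a different provider than the proposer rank first. This is an illustrative sketch, not duh's actual selection code; the model/provider records are hypothetical.

```python
def pick_challengers(models, proposer, n=2):
    """Prefer challengers whose provider differs from the proposer's;
    fall back to same-provider models only when needed."""
    others = [m for m in models if m["name"] != proposer["name"]]
    # False (different provider) sorts before True (same provider).
    others.sort(key=lambda m: m["provider"] == proposer["provider"])
    return others[:n]

models = [
    {"name": "a", "provider": "openai"},
    {"name": "b", "provider": "anthropic"},
    {"name": "c", "provider": "openai"},
]
print([m["name"] for m in pick_challengers(models, models[0])])  # -> ['b', 'c']
```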
!!! note "Sycophancy detection"
    duh scans the opening ~200 characters of each challenge for praise markers like "great answer", "I largely agree", or "no significant flaws". Sycophantic challenges are flagged and excluded from confidence calculations.
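A minimal sketch of that scan, using the marker phrases mentioned above (the exact marker list and window size in duh may differ):

```python
PRAISE_MARKERS = ("great answer", "i largely agree", "no significant flaws")

def is_sycophantic(challenge: str, window: int = 200) -> bool:
    """Flag a challenge whose opening ~200 characters contain a praise marker."""
    opening = challenge[:window].lower()
    return any(marker in opening for marker in PRAISE_MARKERS)

print(is_sycophantic("Great answer! I largely agree with the framing."))  # -> True
print(is_sycophantic("The complexity claim is wrong: the loop is O(n^2)."))  # -> False
```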
The original proposer receives all challenges and produces an improved answer that:
- Addresses each valid challenge directly
- Maintains correct points with stronger justification
- Incorporates new perspectives where they improve the answer
- Pushes back on wrong challenges with explanations
The revision prompt instructs the model not to mention the debate process -- just give the best possible answer.
```
CHALLENGE --> REVISE
```
A pure extraction step (no model call):
- Decision = the revision text
- Confidence = computed from challenge quality (0.5 to 1.0). More genuine (non-sycophantic) challenges = higher confidence, because the revision was more rigorously tested.
- Dissent = preserved text from genuine challenges, representing minority viewpoints that may be valuable even after revision
```
REVISE --> COMMIT
```
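One plausible mapping from challenge quality to the stated 0.5-1.0 range scales with the share of genuine challenges. This formula is illustrative only; duh's exact weighting may differ.

```python
def confidence(genuine: int, total: int) -> float:
    """Map the share of genuine (non-sycophantic) challenges into [0.5, 1.0].
    No challenges at all means minimum confidence: the answer was untested."""
    if total == 0:
        return 0.5
    return 0.5 + 0.5 * (genuine / total)

print(confidence(2, 2))  # -> 1.0   (every challenge was genuine)
print(confidence(1, 2))  # -> 0.75
print(confidence(0, 2))  # -> 0.5   (all challenges were sycophantic)
```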
After COMMIT, duh compares the current round's challenges against the previous round's challenges using Jaccard word-overlap similarity:
- For each current challenge, find the maximum similarity to any previous challenge
- Average these maximum similarities
- If the average >= 0.7 (configurable threshold), challenges have converged
Convergence means challengers are raising the same issues across rounds. Further iteration is unlikely to improve the answer, so duh stops early.
Round 1 never converges (nothing to compare against).
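The convergence check described above can be sketched directly: Jaccard similarity over word sets, max-matched per current challenge, then averaged against the threshold. Function names here are illustrative, not duh's internals.

```python
def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two challenge texts."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa and not wb:
        return 1.0
    return len(wa & wb) / len(wa | wb)

def converged(current, previous, threshold=0.7):
    """For each current challenge, take the max similarity to any previous
    challenge; converged when the average of those maxima >= threshold."""
    if not previous or not current:
        return False  # round 1 never converges: nothing to compare against
    scores = [max(jaccard(c, p) for p in previous) for c in current]
    return sum(scores) / len(scores) >= threshold

print(converged(["the latency estimate is too low"],
                ["the latency estimate is too low"]))  # -> True
```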
If challenges haven't converged and rounds remain, the state machine cycles back:
```
COMMIT --> PROPOSE (new round, with previous context)
```
The next round's proposer receives the previous decision and its challenges, so it can build on what was already debated.
When rounds are exhausted or convergence is detected:
```
COMMIT --> COMPLETE
```
The voting protocol is an alternative to the full consensus debate. Instead of iterative propose-challenge-revise rounds, all models answer independently in parallel and a meta-judge aggregates the results. Voting is the better fit for:
- Judgment questions -- subjective evaluations, comparisons, opinions
- Speed-sensitive queries -- parallel fan-out is faster than sequential rounds
- High model count -- more models means more diverse perspectives to aggregate
Use `--protocol voting` or set `protocol = "voting"` in config. Use `--protocol auto` to let duh classify the question and route automatically.
- Fan-out: The question is sent to all configured models in parallel
- Collection: Each model's answer is collected as a `VoteResult`
- Meta-judge selection: The strongest model (highest output cost) is selected as judge
- Aggregation: The judge picks or synthesizes the best answer
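The fan-out and collection steps can be sketched with `asyncio`. The `ask` coroutine is a stand-in for a real provider call; `VoteResult` mirrors the name used above, but its fields here are assumptions.

```python
import asyncio
from dataclasses import dataclass

@dataclass
class VoteResult:
    model: str
    answer: str

async def ask(model: str, question: str) -> VoteResult:
    """Stand-in for a real provider call (hypothetical)."""
    await asyncio.sleep(0)
    return VoteResult(model, f"{model}'s answer to: {question}")

async def fan_out(models, question):
    """Query every configured model in parallel; gather preserves order."""
    return await asyncio.gather(*(ask(m, question) for m in models))

results = asyncio.run(fan_out(["m1", "m2"], "q"))
print([r.model for r in results])  # -> ['m1', 'm2']
```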
| Strategy | Behavior |
|---|---|
| `majority` (default) | Meta-judge reads all answers and picks the best one, improving it if possible |
| `weighted` | Meta-judge synthesizes all answers, weighting by model capability (output cost as proxy) |
Configure the strategy in `[voting]`:

```toml
[voting]
aggregation = "weighted"
```

With `--protocol auto`, duh uses the cheapest model to classify the question:
- Reasoning (logic, math, code, step-by-step) -- routes to consensus
- Judgment (opinions, evaluations, comparisons) -- routes to voting
For complex questions that span multiple domains, duh can decompose the question into a directed acyclic graph (DAG) of subtasks. Decomposition helps with:
- Multi-part questions -- "Design a complete CI/CD pipeline" has research, tooling, and architecture components
- Questions with dependencies -- Some parts must be answered before others
- Broad-scope queries -- Better to solve focused subproblems and merge results
Use `--decompose` or set `decompose = true` in config.
- DECOMPOSE phase: The cheapest model breaks the question into 2-7 subtasks with dependency relationships, returned as JSON
- DAG validation: Labels are checked for uniqueness, dependencies are resolved, and the graph is verified acyclic (Kahn's algorithm)
- Scheduling: Subtasks are scheduled using `TopologicalSorter` -- independent subtasks run in parallel, dependent subtasks wait for their prerequisites
- Per-subtask consensus: Each subtask runs the full consensus protocol independently
- Synthesis: A meta-model merges all subtask results into a single coherent answer
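The scheduling step maps directly onto the standard library's `graphlib.TopologicalSorter`. The subtask graph below is hypothetical (shaped like what a DECOMPOSE response might parse to); `prepare()` also performs the acyclicity check, raising `CycleError` on a cycle.

```python
from graphlib import TopologicalSorter

# Hypothetical subtask graph: label -> set of prerequisite labels.
deps = {
    "research": set(),
    "tooling": {"research"},
    "architecture": {"research"},
    "synthesis": {"tooling", "architecture"},
}

ts = TopologicalSorter(deps)
ts.prepare()                        # raises CycleError if the graph has a cycle
batches = []
while ts.is_active():
    ready = sorted(ts.get_ready())  # everything in a batch can run in parallel
    batches.append(ready)
    ts.done(*ready)

print(batches)  # -> [['research'], ['architecture', 'tooling'], ['synthesis']]
```

Each batch would run its subtasks' consensus protocols concurrently; dependent subtasks only become ready once their prerequisites call `done()`.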
| Strategy | Behavior |
|---|---|
| `merge` | Combine all subtask answers into one comprehensive response |
| `prioritize` | Weight subtask answers by their confidence scores |
If decomposition produces only one subtask (the question is already focused enough), duh skips synthesis and runs normal consensus directly. This avoids unnecessary overhead.
Models can use tools during the PROPOSE and CHALLENGE phases to access external information and capabilities.
| Tool | Description | Config key |
|---|---|---|
| Web search | Search the web using DuckDuckGo (or custom backend) | tools.web_search |
| Code execution | Run Python code in a sandboxed environment | tools.code_execution |
| File read | Read local files for context | Always available when tools enabled |
When tools are enabled, each tool-capable phase works as follows:

- The model receives tool definitions alongside the consensus prompt
- If the model requests a tool call, duh executes it and returns the result
- The model incorporates tool results into its response
- Tool calls are logged and displayed in the TOOLS panel after the decision
Enable tools globally in config:
```toml
[tools]
enabled = true

[tools.web_search]
backend = "duckduckgo"
max_results = 5

[tools.code_execution]
enabled = true
timeout = 30
```

Or per-query via CLI:
```bash
duh ask --tools "What is the current price of Bitcoin?"
```

!!! note "Tool call loop"
    The tool-augmented send loop runs up to `tools.max_rounds` iterations (default: 5) per phase, allowing models to make multiple sequential tool calls if needed.
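The loop's shape can be sketched as below. `model_send` and `execute_tool` are hypothetical stand-ins for duh's provider and tool plumbing, as is the message/reply dict format.

```python
def send_with_tools(model_send, execute_tool, prompt, max_rounds=5):
    """Resolve tool calls until the model returns a final answer
    or the per-phase round cap is reached."""
    messages = [{"role": "user", "content": prompt}]
    for _ in range(max_rounds):
        reply = model_send(messages)
        if "tool_call" not in reply:
            return reply["content"]            # final answer, no tool needed
        result = execute_tool(reply["tool_call"])
        messages.append({"role": "tool", "content": result})
    return model_send(messages)["content"]     # cap reached: force an answer

# Tiny demo: a fake model that requests one tool call, then answers.
calls = {"n": 0}
def fake_send(messages):
    calls["n"] += 1
    if calls["n"] == 1:
        return {"tool_call": {"name": "web_search", "args": {"q": "btc"}}}
    return {"content": "answer"}

final = send_with_tools(fake_send, lambda call: "search result", "q")
print(final)  # -> answer
```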
```
IDLE --> DECOMPOSE --> PROPOSE --> CHALLENGE --> REVISE --> COMMIT --> COMPLETE
             |                                                |
             |                                                +--> PROPOSE (next round)
             |                                                |
             +--> (subtask scheduling + synthesis)            +--> FAILED (on error)
```
The DECOMPOSE state is optional -- it is entered only when `--decompose` is used. The voting protocol bypasses the state machine entirely (parallel fan-out + aggregation).
Any non-terminal state can transition to FAILED on errors. COMPLETE and FAILED are terminal states.
Guard conditions enforce valid transitions:
- Can't PROPOSE without a non-empty question
- Can't CHALLENGE without a proposal
- Can't REVISE without challenges
- Can't COMMIT without a revision
- Can't start a new round if already converged or max rounds reached
- Can't COMPLETE if not converged and rounds remain
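The per-state guards above can be summarized as a lookup table. This is an illustrative sketch of the idea, not duh's actual state-machine API; the context-dict keys are assumptions.

```python
def guard_ok(target: str, ctx: dict) -> bool:
    """Return True if the transition into `target` is allowed,
    mirroring the guard conditions listed above."""
    guards = {
        "PROPOSE":   bool(ctx.get("question", "").strip()),
        "CHALLENGE": ctx.get("proposal") is not None,
        "REVISE":    bool(ctx.get("challenges")),
        "COMMIT":    ctx.get("revision") is not None,
    }
    return guards.get(target, True)

print(guard_ok("CHALLENGE", {"question": "q"}))    # -> False (no proposal yet)
print(guard_ok("REVISE", {"challenges": ["c1"]}))  # -> True
```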
- Providers and Models -- How models are selected
- Cost Management -- Token tracking and limits