fix(arbitration): resolve issue #174 first-response bias #225
savanigit wants to merge 1 commit into shrixtacy:master from
Conversation
🎉 Thanks for creating a PR! We'll review it as soon as possible.
📝 Pre-Review Checklist:
💬 Review Process:
Thank you for your contribution! 🙌🏼
📝 Walkthrough
The pull request consolidates conflict-resolution logic in the arbitration layer by introducing a new helper method that selects the best response from candidates based on quality and confidence scores. Existing contradiction-resolution methods are refactored to use this shared selection logic, and the unknown-conflict-type handler is updated to use score-based selection instead of always choosing the first response.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
✅ Issue Link Status
Great job @savanigit! This PR is linked to the following issue(s):
📋 Reminder:
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/test_arbitration_layer.py (1)
Lines 163-208: Make this regression test isolate confidence.
model-b is better on confidence and on the extra quality inputs, so this test still passes if the resolver ranks by the composite quality tuple instead of actual confidence. Make the higher-confidence response worse on risk/assumptions/content length, or add a second case that does, so the test protects the intended behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_arbitration_layer.py` around lines 163 - 208, The test test_resolve_confidence_conflict_prefers_highest_confidence_not_first must isolate confidence so the resolver can't win by a composite quality tuple; modify the high_confidence AgentResponse (or add a second test case) so it keeps a higher SelfAssessment.confidence_score (e.g. 0.92) but has worse other quality fields (e.g. higher risk_level, non-empty assumptions, higher estimated_cost or token_usage, longer/poorer content) compared to low_confidence, then call arbitration_layer.resolve_contradiction(conflict, [...]) and assert chosen_response_id equals the high-confidence response id; this ensures resolve_contradiction is choosing by confidence alone and not by the aggregated quality tuple.
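A minimal sketch of the suggested hardening, using hypothetical stand-ins for the project's `AgentResponse` and `SelfAssessment` models (the field names and values here are illustrative, not the real schema): the high-confidence response is made worse on every non-confidence quality signal, so only a confidence-first resolver picks it.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-ins for the project's models; the real classes
# live in ai_council and carry more fields.
@dataclass
class SelfAssessment:
    confidence_score: float

@dataclass
class AgentResponse:
    model_used: str
    content: str
    risk_level: str
    assumptions: List[str]
    self_assessment: SelfAssessment

def pick_by_confidence(responses):
    # Confidence-first ranking, the behavior the review says the resolver should have.
    return max(responses, key=lambda r: r.self_assessment.confidence_score)

# Higher confidence, but deliberately worse on every other quality signal.
high_confidence = AgentResponse(
    model_used="model-a",
    content="short",                 # poor content length
    risk_level="high",               # worse risk
    assumptions=["a1", "a2", "a3"],  # many assumptions
    self_assessment=SelfAssessment(0.92),
)
# Lower confidence, but better on the composite quality signals.
low_confidence = AgentResponse(
    model_used="model-b",
    content="x" * 1000,
    risk_level="low",
    assumptions=[],
    self_assessment=SelfAssessment(0.50),
)

winner = pick_by_confidence([low_confidence, high_confidence])
assert winner.model_used == "model-a"  # holds only if confidence is the primary key
```

A resolver that ranks by a composite quality tuple would pick `model-b` here, which is exactly the failure mode the hardened test is meant to catch.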
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 13432c6e-3289-43ff-8269-ede050098a3a
📒 Files selected for processing (2)
ai_council/arbitration/layer.py
tests/test_arbitration_layer.py
```python
def _select_best_response_for_conflict(
    self,
    conflict: Conflict,
    responses: Optional[List[AgentResponse]] = None,
) -> Optional[AgentResponse]:
    """Select best response for a conflict using confidence + quality scoring.

    Selection order:
    1. Responses matching conflict IDs exactly (`subtask_model` format)
    2. Best-effort matching by model ID token from conflict IDs
    3. Fallback to all successful responses
    """
    if not responses:
        return None

    successful = [r for r in responses if r.success]
    if not successful:
        return None

    response_map = {
        f"{r.subtask_id}_{r.model_used}": r
        for r in successful
    }

    # 1) Exact conflict-id matches
    candidates = [response_map[rid] for rid in conflict.response_ids if rid in response_map]

    # 2) Best-effort model-id matching when IDs are malformed/partial
    if not candidates:
        model_tokens = {rid.split("_", 1)[-1] for rid in conflict.response_ids if "_" in rid}
        if model_tokens:
            candidates = [r for r in successful if r.model_used in model_tokens]

    # 3) Final fallback to all successful responses
    if not candidates:
        candidates = successful

    return max(
        candidates,
        key=lambda r: (
            self._calculate_quality_score(r),
            r.self_assessment.confidence_score if r.self_assessment else 0.0,
        ),
    )
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Verify the current selector is quality-first, not confidence-first.
sed -n '247,290p' ai_council/arbitration/layer.py

python - <<'PY'
def quality(conf, risk_score, content_len, assumptions):
    return conf * 0.4 + risk_score * 0.3 + min(content_len / 1000.0, 1.0) * 0.2 + max(0, 1.0 - assumptions * 0.1) * 0.1

# Higher confidence but worse non-confidence quality signals.
a = ("high_conf", quality(0.95, 0.4, 5, 5), 0.95)
# Lower confidence but much better non-confidence quality signals.
b = ("low_conf", quality(0.50, 1.0, 1000, 0), 0.50)
print("current_key_winner =", max([a, b], key=lambda x: (x[1], x[2]))[0])
print("confidence_only_winner =", max([a, b], key=lambda x: x[2])[0])
PY
```
Repository: shrixtacy/Ai-Council
Length of output: 1694

🏁 Script executed:
```bash
# Find all usages of _select_best_response_for_conflict
rg "_select_best_response_for_conflict" ai_council/arbitration/layer.py -A 2 -B 2
```
Length of output: 1668

🏁 Script executed:
```bash
# Look for the _resolve_confidence_conflict method and its implementation
rg "_resolve_confidence_conflict" ai_council/arbitration/layer.py -A 10 -B 2
```
Length of output: 1548

🏁 Script executed:
```bash
# Get context around lines 421-434 to see the claimed call site
sed -n '410,445p' ai_council/arbitration/layer.py
```
Length of output: 1879

🏁 Script executed:
```bash
# Find _calculate_quality_score implementation
rg "_calculate_quality_score" ai_council/arbitration/layer.py -A 15 | head -40
```
Length of output: 1734

🏁 Script executed:
```bash
# Get full _calculate_quality_score implementation
rg "_calculate_quality_score" ai_council/arbitration/layer.py -A 20 | grep -A 20 "def _calculate_quality_score"
```
Length of output: 1145
Fix the ranking in _resolve_confidence_conflict to prioritize confidence over quality.
The helper _select_best_response_for_conflict ranks responses with (quality_score, confidence_score), making quality the primary decision axis. This causes _resolve_confidence_conflict to select a lower-confidence response if it has better risk/length/assumption metrics, contradicting its documented behavior of "choosing the most confident response". The returned reasoning ("selecting response with highest confidence score") is also inaccurate.
Example: a response with 50% confidence but better risk/length scores beats one with 95% confidence whose non-confidence metrics are worse.
Update the helper to accept a ranking parameter (or create a separate confidence-first variant) so _resolve_confidence_conflict can use confidence_score as the primary key instead of quality_score.
Example fix
```diff
 def _select_best_response_for_conflict(
     self,
     conflict: Conflict,
     responses: Optional[List[AgentResponse]] = None,
+    ranking_key: str = "quality",
 ) -> Optional[AgentResponse]:
     # ... candidate selection logic ...
-    return max(
-        candidates,
-        key=lambda r: (
-            self._calculate_quality_score(r),
-            r.self_assessment.confidence_score if r.self_assessment else 0.0,
-        ),
-    )
+    if ranking_key == "confidence":
+        return max(
+            candidates,
+            key=lambda r: r.self_assessment.confidence_score if r.self_assessment else 0.0,
+        )
+
+    return max(
+        candidates,
+        key=lambda r: (
+            self._calculate_quality_score(r),
+            r.self_assessment.confidence_score if r.self_assessment else 0.0,
+        ),
+    )
```
Then call it from `_resolve_confidence_conflict(...)` with `ranking_key="confidence"`.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def _select_best_response_for_conflict(
    self,
    conflict: Conflict,
    responses: Optional[List[AgentResponse]] = None,
    ranking_key: str = "quality",
) -> Optional[AgentResponse]:
    """Select best response for a conflict using confidence + quality scoring.

    Selection order:
    1. Responses matching conflict IDs exactly (`subtask_model` format)
    2. Best-effort matching by model ID token from conflict IDs
    3. Fallback to all successful responses
    """
    if not responses:
        return None

    successful = [r for r in responses if r.success]
    if not successful:
        return None

    response_map = {
        f"{r.subtask_id}_{r.model_used}": r
        for r in successful
    }

    # 1) Exact conflict-id matches
    candidates = [response_map[rid] for rid in conflict.response_ids if rid in response_map]

    # 2) Best-effort model-id matching when IDs are malformed/partial
    if not candidates:
        model_tokens = {rid.split("_", 1)[-1] for rid in conflict.response_ids if "_" in rid}
        if model_tokens:
            candidates = [r for r in successful if r.model_used in model_tokens]

    # 3) Final fallback to all successful responses
    if not candidates:
        candidates = successful

    if ranking_key == "confidence":
        return max(
            candidates,
            key=lambda r: r.self_assessment.confidence_score if r.self_assessment else 0.0,
        )

    return max(
        candidates,
        key=lambda r: (
            self._calculate_quality_score(r),
            r.self_assessment.confidence_score if r.self_assessment else 0.0,
        ),
    )
```
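As a self-contained illustration of the behavioral difference (the toy `quality` numbers below stand in for `_calculate_quality_score`, whose real weighting is not reproduced here):

```python
from dataclasses import dataclass

@dataclass
class Resp:
    name: str
    confidence: float
    quality: float  # stand-in for self._calculate_quality_score(r)

# Confident answer with a weaker composite quality score.
a = Resp("high_conf", confidence=0.95, quality=0.55)
# Hedged answer with a stronger composite quality score.
b = Resp("low_conf", confidence=0.50, quality=0.80)

# Current key: quality is the primary axis, confidence only breaks ties.
quality_first = max([a, b], key=lambda r: (r.quality, r.confidence))
# Suggested key for confidence conflicts: confidence is primary.
confidence_first = max([a, b], key=lambda r: (r.confidence, r.quality))

print(quality_first.name)     # low_conf
print(confidence_first.name)  # high_conf
```

The two rankings disagree on the same pair of candidates, which is why a `ranking_key` parameter (or a confidence-first wrapper) is needed rather than a single shared key.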
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ai_council/arbitration/layer.py` around lines 247 - 290, The helper
_select_best_response_for_conflict currently ranks by (quality_score,
confidence_score) which makes quality primary; change it to accept a new
optional parameter (e.g., ranking="quality"|"confidence") and when
ranking=="confidence" use max(..., key=lambda r:
(r.self_assessment.confidence_score if r.self_assessment else 0.0,
self._calculate_quality_score(r))) so confidence becomes the primary key; update
_resolve_confidence_conflict to call
_select_best_response_for_conflict(conflict, responses, ranking="confidence")
(or use a separate _select_best_response_by_confidence wrapper) and keep
existing behavior for other callers, preserving use of Conflict, AgentResponse,
_calculate_quality_score and self_assessment.confidence_score.
@savanigit Check the CodeRabbit recommendation and fix it if needed
Pull Request
Description
Brief description of the changes in this PR.
Type of Change
Related Issues
Fixes #(issue number)
Changes Made
Testing
Testing Details
Describe the tests you ran to verify your changes:
# Example test code or commands
Documentation
Code Quality
Summary by CodeRabbit
Improvements
Code Quality