fix(arbitration): resolve issue #174 first-response bias #225
savanigit wants to merge 1 commit into shrixtacy:master from
Conversation
🎉 Thanks for creating a PR! We'll review it as soon as possible.
📝 Pre-Review Checklist:
💬 Review Process:
Thank you for your contribution! 🙌🏼
📝 Walkthrough
The pull request consolidates conflict-resolution logic in the arbitration layer by introducing a new helper method that selects the best response from candidates based on quality and confidence scores. Existing contradiction-resolution methods are refactored to use this shared selection logic, and the unknown-conflict-type handler is updated to use score-based selection instead of always choosing the first response.
Changes
Estimated code review effort: 🎯 3 (Moderate) | ⏱️ ~22 minutes
Possibly related PRs
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 1 | ❌ 2
❌ Failed checks (2 warnings)
✅ Passed checks (1 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings.
✨ Finishing Touches: 🧪 Generate unit tests (beta)
✅ Issue Link Status
Great job @savanigit! This PR is linked to the following issue(s):
📋 Reminder:
Actionable comments posted: 1
🧹 Nitpick comments (1)
tests/test_arbitration_layer.py (1)
Lines 163-208: Make this regression test isolate confidence.
model-b is better on confidence and on the extra quality inputs, so this test still passes if the resolver ranks by the composite quality tuple instead of actual confidence. Make the higher-confidence response worse on risk/assumptions/content length, or add a second case that does, so the test protects the intended behavior.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/test_arbitration_layer.py` around lines 163 - 208, The test test_resolve_confidence_conflict_prefers_highest_confidence_not_first must isolate confidence so the resolver can't win by a composite quality tuple; modify the high_confidence AgentResponse (or add a second test case) so it keeps a higher SelfAssessment.confidence_score (e.g. 0.92) but has worse other quality fields (e.g. higher risk_level, non-empty assumptions, higher estimated_cost or token_usage, longer/poorer content) compared to low_confidence, then call arbitration_layer.resolve_contradiction(conflict, [...]) and assert chosen_response_id equals the high-confidence response id; this ensures resolve_contradiction is choosing by confidence alone and not by the aggregated quality tuple.
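A minimal sketch of the suggested hardening, using hypothetical stand-ins for the project's `AgentResponse` and `SelfAssessment` models (the field names and values here are illustrative, not the real schema): the high-confidence response is made worse on every non-confidence quality signal, so only a confidence-first resolver picks it.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical stand-ins for the project's models; the real classes
# live in ai_council and carry more fields.
@dataclass
class SelfAssessment:
    confidence_score: float

@dataclass
class AgentResponse:
    model_used: str
    content: str
    risk_level: str
    assumptions: List[str]
    self_assessment: SelfAssessment

def pick_by_confidence(responses):
    # Confidence-first ranking, the behavior the review says the resolver should have.
    return max(responses, key=lambda r: r.self_assessment.confidence_score)

# Higher confidence, but deliberately worse on every other quality signal.
high_confidence = AgentResponse(
    model_used="model-a",
    content="short",                 # poor content length
    risk_level="high",               # worse risk
    assumptions=["a1", "a2", "a3"],  # many assumptions
    self_assessment=SelfAssessment(0.92),
)
# Lower confidence, but better on the composite quality signals.
low_confidence = AgentResponse(
    model_used="model-b",
    content="x" * 1000,
    risk_level="low",
    assumptions=[],
    self_assessment=SelfAssessment(0.50),
)

winner = pick_by_confidence([low_confidence, high_confidence])
assert winner.model_used == "model-a"  # holds only if confidence is the primary key
```

A resolver that ranks by a composite quality tuple would pick `model-b` here, which is exactly the failure mode the hardened test is meant to catch.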
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 13432c6e-3289-43ff-8269-ede050098a3a
📒 Files selected for processing (2)
ai_council/arbitration/layer.py
tests/test_arbitration_layer.py
```python
def _select_best_response_for_conflict(
    self,
    conflict: Conflict,
    responses: Optional[List[AgentResponse]] = None,
) -> Optional[AgentResponse]:
    """Select best response for a conflict using confidence + quality scoring.

    Selection order:
    1. Responses matching conflict IDs exactly (`subtask_model` format)
    2. Best-effort matching by model ID token from conflict IDs
    3. Fallback to all successful responses
    """
    if not responses:
        return None

    successful = [r for r in responses if r.success]
    if not successful:
        return None

    response_map = {
        f"{r.subtask_id}_{r.model_used}": r
        for r in successful
    }

    # 1) Exact conflict-id matches
    candidates = [response_map[rid] for rid in conflict.response_ids if rid in response_map]

    # 2) Best-effort model-id matching when IDs are malformed/partial
    if not candidates:
        model_tokens = {rid.split("_", 1)[-1] for rid in conflict.response_ids if "_" in rid}
        if model_tokens:
            candidates = [r for r in successful if r.model_used in model_tokens]

    # 3) Final fallback to all successful responses
    if not candidates:
        candidates = successful

    return max(
        candidates,
        key=lambda r: (
            self._calculate_quality_score(r),
            r.self_assessment.confidence_score if r.self_assessment else 0.0,
        ),
    )
```
🧩 Analysis chain
🏁 Script executed:
```bash
#!/bin/bash
# Verify the current selector is quality-first, not confidence-first.
sed -n '247,290p' ai_council/arbitration/layer.py

python - <<'PY'
def quality(conf, risk_score, content_len, assumptions):
    return conf * 0.4 + risk_score * 0.3 + min(content_len / 1000.0, 1.0) * 0.2 + max(0, 1.0 - assumptions * 0.1) * 0.1

# Higher confidence but worse non-confidence quality signals.
a = ("high_conf", quality(0.95, 0.4, 5, 5), 0.95)
# Lower confidence but much better non-confidence quality signals.
b = ("low_conf", quality(0.50, 1.0, 1000, 0), 0.50)
print("current_key_winner =", max([a, b], key=lambda x: (x[1], x[2]))[0])
print("confidence_only_winner =", max([a, b], key=lambda x: x[2])[0])
PY
```
Repository: shrixtacy/Ai-Council
Length of output: 1694

🏁 Script executed:
```bash
# Find all usages of _select_best_response_for_conflict
rg "_select_best_response_for_conflict" ai_council/arbitration/layer.py -A 2 -B 2
```
Length of output: 1668

🏁 Script executed:
```bash
# Look for the _resolve_confidence_conflict method and its implementation
rg "_resolve_confidence_conflict" ai_council/arbitration/layer.py -A 10 -B 2
```
Length of output: 1548

🏁 Script executed:
```bash
# Get context around lines 421-434 to see the claimed call site
sed -n '410,445p' ai_council/arbitration/layer.py
```
Length of output: 1879

🏁 Script executed:
```bash
# Find _calculate_quality_score implementation
rg "_calculate_quality_score" ai_council/arbitration/layer.py -A 15 | head -40
```
Length of output: 1734

🏁 Script executed:
```bash
# Get full _calculate_quality_score implementation
rg "_calculate_quality_score" ai_council/arbitration/layer.py -A 20 | grep -A 20 "def _calculate_quality_score"
```
Length of output: 1145
Fix the ranking in _resolve_confidence_conflict to prioritize confidence over quality.
The helper _select_best_response_for_conflict ranks responses with (quality_score, confidence_score), making quality the primary decision axis. This causes _resolve_confidence_conflict to select a lower-confidence response if it has better risk/length/assumption metrics, contradicting its documented behavior of "choosing the most confident response". The returned reasoning ("selecting response with highest confidence score") is also inaccurate.
Example: a response with 50% confidence but better risk/length scores beats one with 95% confidence whose non-confidence metrics are worse.
Update the helper to accept a ranking parameter (or create a separate confidence-first variant) so _resolve_confidence_conflict can use confidence_score as the primary key instead of quality_score.
Example fix
```diff
 def _select_best_response_for_conflict(
     self,
     conflict: Conflict,
     responses: Optional[List[AgentResponse]] = None,
+    ranking_key: str = "quality",
 ) -> Optional[AgentResponse]:
     # ... candidate selection logic ...
-    return max(
-        candidates,
-        key=lambda r: (
-            self._calculate_quality_score(r),
-            r.self_assessment.confidence_score if r.self_assessment else 0.0,
-        ),
-    )
+    if ranking_key == "confidence":
+        return max(
+            candidates,
+            key=lambda r: r.self_assessment.confidence_score if r.self_assessment else 0.0,
+        )
+
+    return max(
+        candidates,
+        key=lambda r: (
+            self._calculate_quality_score(r),
+            r.self_assessment.confidence_score if r.self_assessment else 0.0,
+        ),
+    )
```
Then call it from `_resolve_confidence_conflict(...)` with `ranking_key="confidence"`.
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
def _select_best_response_for_conflict(
    self,
    conflict: Conflict,
    responses: Optional[List[AgentResponse]] = None,
    ranking_key: str = "quality",
) -> Optional[AgentResponse]:
    """Select best response for a conflict using confidence + quality scoring.

    Selection order:
    1. Responses matching conflict IDs exactly (`subtask_model` format)
    2. Best-effort matching by model ID token from conflict IDs
    3. Fallback to all successful responses
    """
    if not responses:
        return None

    successful = [r for r in responses if r.success]
    if not successful:
        return None

    response_map = {
        f"{r.subtask_id}_{r.model_used}": r
        for r in successful
    }

    # 1) Exact conflict-id matches
    candidates = [response_map[rid] for rid in conflict.response_ids if rid in response_map]

    # 2) Best-effort model-id matching when IDs are malformed/partial
    if not candidates:
        model_tokens = {rid.split("_", 1)[-1] for rid in conflict.response_ids if "_" in rid}
        if model_tokens:
            candidates = [r for r in successful if r.model_used in model_tokens]

    # 3) Final fallback to all successful responses
    if not candidates:
        candidates = successful

    if ranking_key == "confidence":
        return max(
            candidates,
            key=lambda r: r.self_assessment.confidence_score if r.self_assessment else 0.0,
        )

    return max(
        candidates,
        key=lambda r: (
            self._calculate_quality_score(r),
            r.self_assessment.confidence_score if r.self_assessment else 0.0,
        ),
    )
```
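As a self-contained illustration of the behavioral difference (the toy `quality` numbers below stand in for `_calculate_quality_score`, whose real weighting is not reproduced here):

```python
from dataclasses import dataclass

@dataclass
class Resp:
    name: str
    confidence: float
    quality: float  # stand-in for self._calculate_quality_score(r)

# Confident answer with a weaker composite quality score.
a = Resp("high_conf", confidence=0.95, quality=0.55)
# Hedged answer with a stronger composite quality score.
b = Resp("low_conf", confidence=0.50, quality=0.80)

# Current key: quality is the primary axis, confidence only breaks ties.
quality_first = max([a, b], key=lambda r: (r.quality, r.confidence))
# Suggested key for confidence conflicts: confidence is primary.
confidence_first = max([a, b], key=lambda r: (r.confidence, r.quality))

print(quality_first.name)     # low_conf
print(confidence_first.name)  # high_conf
```

The two rankings disagree on the same pair of candidates, which is why a `ranking_key` parameter (or a confidence-first wrapper) is needed rather than a single shared key.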
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@ai_council/arbitration/layer.py` around lines 247 - 290, The helper
_select_best_response_for_conflict currently ranks by (quality_score,
confidence_score) which makes quality primary; change it to accept a new
optional parameter (e.g., ranking="quality"|"confidence") and when
ranking=="confidence" use max(..., key=lambda r:
(r.self_assessment.confidence_score if r.self_assessment else 0.0,
self._calculate_quality_score(r))) so confidence becomes the primary key; update
_resolve_confidence_conflict to call
_select_best_response_for_conflict(conflict, responses, ranking="confidence")
(or use a separate _select_best_response_by_confidence wrapper) and keep
existing behavior for other callers, preserving use of Conflict, AgentResponse,
_calculate_quality_score and self_assessment.confidence_score.
@savanigit Check the CodeRabbit recommendation and fix it if needed
Pull Request
Description
Brief description of the changes in this PR.
Type of Change
Related Issues
Fixes #(issue number)
Changes Made
Testing
Testing Details
Describe the tests you ran to verify your changes:
# Example test code or commands
Documentation
Code Quality
Summary by CodeRabbit
Improvements
Code Quality