bledden commented on Jan 6, 2026

Summary

Adds calculate_tournament_rankings() as an alternative ranking method alongside the existing mean-based aggregation.

Motivation

The current calculate_aggregate_rankings() averages position numbers, which has limitations:

  • Vulnerable to outlier rankings: a single model ranking Response E first can noticeably shift E's mean score
  • A move from position 1 to 2 is weighted the same as a move from position 4 to 5
  • Lacks a principled theoretical basis (averaging ordinal positions treats ranks as interval data)

Tournament-style pairwise comparison is more robust:

  • Converts rankings to head-to-head matchups
  • Majority vote determines each matchup winner
  • Final score = win percentage across all matchups
  • Based on Condorcet voting theory

Algorithm

For rankings like:

Ranker 1: A > B > C
Ranker 2: A > C > B
Ranker 3: B > A > C

  1. Extract pairwise preferences from each ranking
  2. For each pair (e.g. A vs B), count votes: A wins 2, B wins 1 → A takes the matchup
  3. Score each response: A: 2 matchup wins (100%), B: 1 (50%), C: 0 (0%)
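The steps above can be sketched in Python. The function name matches the PR, but the signature, the list-of-orderings input format, and the tie/edge-case handling below are assumptions, not the actual `backend/council.py` implementation:

```python
from itertools import combinations

def calculate_tournament_rankings(rankings):
    """Score candidates by pairwise win percentage (sketch).

    rankings: list of orderings from best to worst, e.g.
    [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]].
    """
    candidates = sorted({c for ranking in rankings for c in ranking})
    wins = {c: 0.0 for c in candidates}
    matchups_per_candidate = len(candidates) - 1

    for a, b in combinations(candidates, 2):
        # A ranker prefers `a` if it appears earlier in their ordering.
        a_votes = sum(1 for r in rankings if r.index(a) < r.index(b))
        b_votes = len(rankings) - a_votes
        if a_votes > b_votes:
            wins[a] += 1.0
        elif b_votes > a_votes:
            wins[b] += 1.0
        else:                      # tied matchup: 0.5 points each
            wins[a] += 0.5
            wins[b] += 0.5

    if matchups_per_candidate == 0:  # single-candidate edge case
        return {c: 0.0 for c in candidates}
    return {c: w / matchups_per_candidate for c, w in wins.items()}
```

On the three-ranker example above this yields A: 1.0, B: 0.5, C: 0.0, matching the worked calculation.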

Changes

  • Add calculate_tournament_rankings() function in backend/council.py
  • Update run_full_council() to include tournament_rankings in metadata
  • Both methods now available: aggregate_rankings (mean) and tournament_rankings (pairwise)
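As a rough illustration of the two methods side by side, the metadata might look like this for the three-ranker example above (the dict shape and surrounding keys are hypothetical; only the two ranking field names come from the PR):

```python
# Hypothetical metadata fragment; only the two ranking keys are from the PR.
metadata = {
    "aggregate_rankings": {"A": 1.33, "B": 2.0, "C": 2.67},  # mean position (lower = better)
    "tournament_rankings": {"A": 1.0, "B": 0.5, "C": 0.0},   # pairwise win rate (higher = better)
}
```

Note the two scales run in opposite directions: mean position rewards low numbers, win rate rewards high ones.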

Validation

Tested with 7 unit test scenarios:

  • ✅ Unanimous rankings
  • ✅ Split decisions (2:1 votes)
  • ✅ Tie handling (0.5 points each)
  • ✅ Single ranker edge case
  • ✅ Empty rankings edge case
  • ✅ Cyclic preferences (A>B, B>C, C>A)
  • ✅ Outlier robustness comparison

End-to-end test with 5 models ranking 5 responses confirms tournament ranking is more robust to outliers.
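The cyclic-preferences case from the test list can be illustrated with a minimal standalone check (the three orderings below form an assumed Condorcet cycle, not the actual test fixture):

```python
from itertools import combinations

# Cyclic preferences: A beats B, B beats C, C beats A (a Condorcet cycle).
rankings = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
wins = {c: 0.0 for c in "ABC"}
for a, b in combinations("ABC", 2):
    a_votes = sum(1 for r in rankings if r.index(a) < r.index(b))
    if a_votes * 2 > len(rankings):
        wins[a] += 1
    elif a_votes * 2 < len(rankings):
        wins[b] += 1
    else:
        wins[a] += 0.5
        wins[b] += 0.5

# Every candidate wins exactly one of its two matchups, so all three end up
# at a 50% win rate instead of an arbitrary total order.
assert all(w == 1.0 for w in wins.values())
```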

Test plan

  • Verify tournament_rankings appears in metadata
  • Verify ranking order matches expected pairwise winners
  • Verify ties are handled correctly (0.5 points each)
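The tie rule in the last item can be sketched standalone (hypothetical two-ranker data):

```python
# Two rankers with opposite preferences produce a 1-1 split on the A-vs-B matchup.
rankings = [["A", "B"], ["B", "A"]]
a_votes = sum(1 for r in rankings if r.index("A") < r.index("B"))
b_votes = len(rankings) - a_votes
assert a_votes == b_votes == 1

# Per the PR's tie rule, each side is credited 0.5 points for the matchup.
score = {"A": 0.5, "B": 0.5} if a_votes == b_votes else {}
assert score["A"] == score["B"] == 0.5
```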

🤖 Generated with Claude Code

Adds calculate_tournament_rankings() as an alternative to simple mean ranking.

Algorithm:
- Convert ordinal rankings to pairwise matchups
- For each pair of models, majority vote determines winner
- Ties awarded 0.5 points to each
- Final score = wins / total_matchups

Benefits over mean ranking:
- More robust to outlier rankings
- Theoretically principled (Condorcet-style)
- Handles cyclic preferences gracefully

Both ranking methods now included in metadata:
- aggregate_rankings: mean position (existing)
- tournament_rankings: pairwise win percentage (new)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
bledden force-pushed the feature-tournament-ranking branch from 3aaa3e8 to b1bbb9a on January 6, 2026 at 02:26
Documents the tournament-style pairwise comparison algorithm with:
- Explanation of why it's more robust than mean averaging
- Concrete example showing self-promotion bias scenario
- Tables comparing mean vs tournament results
- Outlier robustness validation (mean degrades 1.0→1.5, tournament stays 100%)
- Summary of validation test coverage
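The outlier-robustness numbers in the commit message can be reproduced with a small sketch (the exact test setup is an assumption: three rankers put A first, one outlier puts A last):

```python
# Assumed outlier scenario: three rankers rank A first, one ranks it last.
rankings = [["A", "B", "C"]] * 3 + [["B", "C", "A"]]

# Mean position of A degrades from 1.0 to 1.5 because of the single outlier.
mean_a = sum(r.index("A") + 1 for r in rankings) / len(rankings)
assert mean_a == 1.5

# Tournament view: A still wins both of its matchups 3-1 by majority vote,
# so its pairwise win rate stays at 100%.
a_wins = sum(
    1 for other in ("B", "C")
    if sum(1 for r in rankings if r.index("A") < r.index(other)) * 2 > len(rankings)
)
assert a_wins == 2
```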

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
