bledden commented on Jan 6, 2026

Summary

Adds calculate_tournament_rankings() as an alternative ranking method alongside the existing mean-based aggregation.

Motivation

The current calculate_aggregate_rankings() averages position numbers, which has limitations:

  • Vulnerable to outlier rankings: a single model ranking Response E first can noticeably shift E's mean score
  • A move from position 1 to 2 is weighted the same as a move from position 4 to 5
  • Lacks a principled theoretical basis (averaging ordinal positions treats ranks as interval data)

Tournament-style pairwise comparison is more robust:

  • Converts rankings to head-to-head matchups
  • Majority vote determines each matchup winner
  • Final score = win percentage across all matchups
  • Based on Condorcet voting theory

Algorithm

For rankings like:

Ranker 1: A > B > C
Ranker 2: A > C > B
Ranker 3: B > A > C

  1. Extract pairwise preferences from each ranking
  2. For each pair (e.g. A vs B), count votes: A wins 2, B wins 1 → A takes the matchup
  3. Score each response: A: 2 matchup wins (100%), B: 1 (50%), C: 0 (0%)
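The steps above can be sketched in Python. The function name matches the PR, but the signature, the list-of-orderings input format, and the tie/edge-case handling below are assumptions, not the actual `backend/council.py` implementation:

```python
from itertools import combinations

def calculate_tournament_rankings(rankings):
    """Score candidates by pairwise win percentage (sketch).

    rankings: list of orderings from best to worst, e.g.
    [["A", "B", "C"], ["A", "C", "B"], ["B", "A", "C"]].
    """
    candidates = sorted({c for ranking in rankings for c in ranking})
    wins = {c: 0.0 for c in candidates}
    matchups_per_candidate = len(candidates) - 1

    for a, b in combinations(candidates, 2):
        # A ranker prefers `a` if it appears earlier in their ordering.
        a_votes = sum(1 for r in rankings if r.index(a) < r.index(b))
        b_votes = len(rankings) - a_votes
        if a_votes > b_votes:
            wins[a] += 1.0
        elif b_votes > a_votes:
            wins[b] += 1.0
        else:                      # tied matchup: 0.5 points each
            wins[a] += 0.5
            wins[b] += 0.5

    if matchups_per_candidate == 0:  # single-candidate edge case
        return {c: 0.0 for c in candidates}
    return {c: w / matchups_per_candidate for c, w in wins.items()}
```

On the three-ranker example above this yields A: 1.0, B: 0.5, C: 0.0, matching the worked calculation.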

Changes

  • Add calculate_tournament_rankings() function in backend/council.py
  • Update run_full_council() to include tournament_rankings in metadata
  • Both methods now available: aggregate_rankings (mean) and tournament_rankings (pairwise)
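As a rough illustration of the two methods side by side, the metadata might look like this for the three-ranker example above (the dict shape and surrounding keys are hypothetical; only the two ranking field names come from the PR):

```python
# Hypothetical metadata fragment; only the two ranking keys are from the PR.
metadata = {
    "aggregate_rankings": {"A": 1.33, "B": 2.0, "C": 2.67},  # mean position (lower = better)
    "tournament_rankings": {"A": 1.0, "B": 0.5, "C": 0.0},   # pairwise win rate (higher = better)
}
```

Note the two scales run in opposite directions: mean position rewards low numbers, win rate rewards high ones.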

Validation

Tested with 7 unit test scenarios:

  • ✅ Unanimous rankings
  • ✅ Split decisions (2:1 votes)
  • ✅ Tie handling (0.5 points each)
  • ✅ Single ranker edge case
  • ✅ Empty rankings edge case
  • ✅ Cyclic preferences (A>B, B>C, C>A)
  • ✅ Outlier robustness comparison

End-to-end test with 5 models ranking 5 responses confirms tournament ranking is more robust to outliers.
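The cyclic-preferences case from the test list can be illustrated with a minimal standalone check (the three orderings below form an assumed Condorcet cycle, not the actual test fixture):

```python
from itertools import combinations

# Cyclic preferences: A beats B, B beats C, C beats A (a Condorcet cycle).
rankings = [["A", "B", "C"], ["B", "C", "A"], ["C", "A", "B"]]
wins = {c: 0.0 for c in "ABC"}
for a, b in combinations("ABC", 2):
    a_votes = sum(1 for r in rankings if r.index(a) < r.index(b))
    if a_votes * 2 > len(rankings):
        wins[a] += 1
    elif a_votes * 2 < len(rankings):
        wins[b] += 1
    else:
        wins[a] += 0.5
        wins[b] += 0.5

# Every candidate wins exactly one of its two matchups, so all three end up
# at a 50% win rate instead of an arbitrary total order.
assert all(w == 1.0 for w in wins.values())
```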

Test plan

  • Verify tournament_rankings appears in metadata
  • Verify ranking order matches expected pairwise winners
  • Verify ties are handled correctly (0.5 points each)
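The tie rule in the last item can be sketched standalone (hypothetical two-ranker data):

```python
# Two rankers with opposite preferences produce a 1-1 split on the A-vs-B matchup.
rankings = [["A", "B"], ["B", "A"]]
a_votes = sum(1 for r in rankings if r.index("A") < r.index("B"))
b_votes = len(rankings) - a_votes
assert a_votes == b_votes == 1

# Per the PR's tie rule, each side is credited 0.5 points for the matchup.
score = {"A": 0.5, "B": 0.5} if a_votes == b_votes else {}
assert score["A"] == score["B"] == 0.5
```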

🤖 Generated with Claude Code

Adds calculate_tournament_rankings() as an alternative to simple mean ranking.

Algorithm:
- Convert ordinal rankings to pairwise matchups
- For each pair of models, majority vote determines winner
- Ties awarded 0.5 points to each
- Final score = wins / total_matchups

Benefits over mean ranking:
- More robust to outlier rankings
- Theoretically principled (Condorcet-style)
- Handles cyclic preferences gracefully

Both ranking methods now included in metadata:
- aggregate_rankings: mean position (existing)
- tournament_rankings: pairwise win percentage (new)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
bledden force-pushed the feature-tournament-ranking branch from 3aaa3e8 to b1bbb9a on January 6, 2026 at 02:26
Documents the tournament-style pairwise comparison algorithm with:
- Explanation of why it's more robust than mean averaging
- Concrete example showing self-promotion bias scenario
- Tables comparing mean vs tournament results
- Outlier robustness validation (mean degrades 1.0→1.5, tournament stays 100%)
- Summary of validation test coverage
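The outlier-robustness numbers in the commit message can be reproduced with a small sketch (the exact test setup is an assumption: three rankers put A first, one outlier puts A last):

```python
# Assumed outlier scenario: three rankers rank A first, one ranks it last.
rankings = [["A", "B", "C"]] * 3 + [["B", "C", "A"]]

# Mean position of A degrades from 1.0 to 1.5 because of the single outlier.
mean_a = sum(r.index("A") + 1 for r in rankings) / len(rankings)
assert mean_a == 1.5

# Tournament view: A still wins both of its matchups 3-1 by majority vote,
# so its pairwise win rate stays at 100%.
a_wins = sum(
    1 for other in ("B", "C")
    if sum(1 for r in rankings if r.index("A") < r.index(other)) * 2 > len(rankings)
)
assert a_wins == 2
```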

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <[email protected]>
