Complete benchmark suite: model matrix and parameter sensitivity across analysis operations #285

## Context

This issue is the follow-up expansion to #273.

#273 is intentionally scoped as a minimal proving harness so the current prompt/runtime optimization slices can be measured quickly and reproducibly. Once that baseline exists, Pituitary still needs a broader benchmark suite that can guide longer-term model and parameter choices.

## Goal

Expand the minimal harness into a full benchmark suite for model-quality and latency evaluation across Pituitary's analysis operations.

## Scope

### 1. Larger golden dataset

Expand the benchmark corpus to cover each analysis type with at least 5 cases:

- `compare-specs`
- `check-doc-drift`
- `check-compliance`
- `analyze-impact-severity`

Each case should continue to include human-reviewed expected findings, expected evidence anchors, and confidence / severity expectations where relevant.
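A minimal sketch of what a golden case record and a coverage check could look like. The field names (`case_id`, `expected_evidence_anchors`, and so on) are hypothetical and are not taken from the #273 harness; only the four operation names and the 5-case minimum come from this issue.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass, field

# The four analysis operations listed in this issue.
OPERATIONS = (
    "compare-specs",
    "check-doc-drift",
    "check-compliance",
    "analyze-impact-severity",
)
MIN_CASES_PER_OPERATION = 5


@dataclass
class GoldenCase:
    # Hypothetical schema: every field name here is an assumption.
    case_id: str
    operation: str
    expected_findings: list[str]
    expected_evidence_anchors: list[str] = field(default_factory=list)
    expected_severity: str | None = None    # only where relevant
    expected_confidence: str | None = None  # only where relevant


def coverage_gaps(cases: list[GoldenCase]) -> dict[str, int]:
    """Return {operation: number of cases still missing} for under-covered operations."""
    counts = Counter(case.operation for case in cases)
    return {
        op: MIN_CASES_PER_OPERATION - counts[op]
        for op in OPERATIONS
        if counts[op] < MIN_CASES_PER_OPERATION
    }
```

An empty result from `coverage_gaps` would mean the dataset meets the per-operation minimum, which makes it usable as a pre-flight check before a benchmark run.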

### 2. Model matrix

Run the suite across a real model matrix covering at least 3 distinct model sizes overall, for example:

- multiple Qwen sizes
- at least one alternative family where relevant

The output should make latency/quality tradeoffs comparable across candidates.
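One possible way to make the tradeoff comparable is to report which candidates are not dominated on both latency and quality. The sketch below assumes each candidate has already been reduced to a single latency number and a single quality score; the `CandidateResult` type and its field names are hypothetical, and the choice of summary metrics is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CandidateResult:
    # Hypothetical summary row; field names are assumptions.
    model: str            # placeholder label, not a real model name
    p50_latency_s: float  # lower is better
    quality: float        # higher is better, e.g. an F1-style score in [0, 1]


def pareto_frontier(results: list[CandidateResult]) -> list[CandidateResult]:
    """Keep candidates that no other candidate matches or beats on both axes."""
    frontier = []
    for r in results:
        dominated = any(
            o.p50_latency_s <= r.p50_latency_s
            and o.quality >= r.quality
            and (o.p50_latency_s < r.p50_latency_s or o.quality > r.quality)
            for o in results
        )
        if not dominated:
            frontier.append(r)
    # Sort fastest first so the table reads as "pay more latency, get more quality".
    return sorted(frontier, key=lambda r: r.p50_latency_s)
```

A candidate that is both slower and lower quality than another drops out, which shortens the list that moves on to the parameter sweep.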

### 3. Parameter sensitivity

For the strongest candidates, sweep important parameters such as:

- `max_tokens`
- section limit
- positional vs relevance-based section selection

This is where the interaction between #271 and #272 should be quantified rather than guessed.
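A sketch of a full-grid sweep, assuming the three parameters above are exposed as simple configuration values. The key names `section_limit` and `section_selection` and all of the numeric values are illustrative placeholders, not recommended settings. A full grid, rather than varying one parameter at a time, is what allows interaction effects to be measured.

```python
from itertools import product

# Illustrative grid only: key names and values are assumptions.
SWEEP = {
    "max_tokens": [256, 512, 1024],
    "section_limit": [4, 8],
    "section_selection": ["positional", "relevance"],
}


def sweep_configs(grid):
    """Yield one config dict per combination of parameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))
```

With this grid, `sweep_configs(SWEEP)` yields 3 × 2 × 2 = 12 configurations per model, so restricting the sweep to the strongest candidates keeps total run time bounded.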

### 4. Reporting

Produce:

- a machine-readable results artifact
- a markdown summary table
- a recommended local-default configuration based on the results
- a reusable regression baseline for future runtime changes
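As a rough illustration, both the machine-readable artifact and the markdown table can be rendered from the same list of result rows. The row keys and the `schema_version` field are hypothetical; the issue does not define the artifact layout.

```python
import json


def to_json_artifact(rows, schema_version=1):
    """Serialize result rows deterministically so baseline diffs stay small."""
    payload = {"schema_version": schema_version, "results": rows}
    return json.dumps(payload, indent=2, sort_keys=True)


def to_markdown_table(rows, columns):
    """Render the chosen columns of each row as a markdown table."""
    header = "| " + " | ".join(columns) + " |"
    divider = "| " + " | ".join("---" for _ in columns) + " |"
    body = [
        "| " + " | ".join(str(row[col]) for col in columns) + " |"
        for row in rows
    ]
    return "\n".join([header, divider, *body])
```

Sorting keys in the JSON output keeps the artifact stable across runs, which helps if the same file doubles as the regression baseline.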

## Acceptance

- benchmark dataset covers all 4 analysis operations with at least 5 cases each
- at least 3 model sizes are compared
- parameter sweeps include the response cap (`max_tokens`) and section-selection settings
- output includes latency, JSON validity, precision/recall-style quality signals, and evidence quality
- the suite is good enough to support model-selection and configuration recommendations rather than one-off local experiments
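A minimal sketch of two of the signals named above, assuming findings can be reduced to comparable string identifiers. How findings are actually matched (exact ID, fuzzy text match, reviewer judgment) is not specified in this issue, so exact set matching here is an assumption.

```python
import json


def is_valid_json(raw):
    """True if the raw model response parses as JSON."""
    try:
        json.loads(raw)
    except (TypeError, ValueError):  # json.JSONDecodeError subclasses ValueError
        return False
    return True


def precision_recall(expected, actual):
    """Set-based precision and recall over finding identifiers.

    Convention assumed here: an empty denominator scores 0.0.
    """
    expected_set, actual_set = set(expected), set(actual)
    hits = len(expected_set & actual_set)
    precision = hits / len(actual_set) if actual_set else 0.0
    recall = hits / len(expected_set) if expected_set else 0.0
    return precision, recall
```

For example, if a case expects findings `a`, `b`, `c` and the model reports `a`, `b`, `d`, `e`, two of the four reported findings are correct (precision 0.5) and two of the three expected findings were found (recall 2/3).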

