Context
This issue is the follow-up expansion to #273.
#273 is intentionally scoped as a minimal proving harness so the current prompt/runtime optimization slices can be measured quickly and reproducibly. Once that baseline exists, Pituitary still needs a broader benchmark suite that can guide longer-term model and parameter choices.
Goal
Expand the minimal harness into a full benchmark suite for model-quality and latency evaluation across Pituitary's analysis operations.
Scope
1. Larger golden dataset
Expand the benchmark corpus to cover each analysis type with at least 5 cases:
- compare-specs
- check-doc-drift
- check-compliance
- analyze-impact-severity
Each case should continue to include human-reviewed expected findings, expected evidence anchors, and confidence / severity expectations where relevant.
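One way a golden case and the per-operation coverage rule could be represented is sketched below. The four operation names and the minimum of 5 cases come from this issue; the `GoldenCase` class, its field names, and `coverage_gaps` are hypothetical illustrations, not Pituitary's actual schema.

```python
from __future__ import annotations

from dataclasses import dataclass

# Operation names are taken from this issue.
OPERATIONS = (
    "compare-specs",
    "check-doc-drift",
    "check-compliance",
    "analyze-impact-severity",
)
MIN_CASES_PER_OPERATION = 5


@dataclass(frozen=True)
class GoldenCase:
    """Hypothetical shape of one human-reviewed benchmark case."""

    case_id: str
    operation: str
    expected_findings: tuple[str, ...]
    expected_evidence_anchors: tuple[str, ...]
    expected_severity: str | None = None    # only where relevant
    expected_confidence: str | None = None  # only where relevant


def coverage_gaps(cases: list[GoldenCase]) -> dict[str, int]:
    """Return {operation: number of cases still missing} for under-covered operations."""
    counts = {op: 0 for op in OPERATIONS}
    for case in cases:
        if case.operation not in counts:
            raise ValueError(f"unknown operation: {case.operation!r}")
        counts[case.operation] += 1
    return {
        op: MIN_CASES_PER_OPERATION - n
        for op, n in counts.items()
        if n < MIN_CASES_PER_OPERATION
    }
```

An empty result from `coverage_gaps` would mean the dataset meets the "at least 5 cases per analysis type" bar.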
2. Model matrix
Run the suite across a real model matrix, for example:
- multiple Qwen sizes
- at least one alternative family where relevant
- at least 3 distinct model sizes overall
The output should make latency/quality tradeoffs comparable across candidates.
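The matrix constraints above could be checked before a run with something like the following sketch. `ModelCandidate`, `matrix_problems`, and the size labels are hypothetical; treating "at least one alternative family" as a hard minimum of two families is an assumption, since the issue says "where relevant".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ModelCandidate:
    """Hypothetical matrix entry; labels are placeholders, not real model names."""

    family: str
    size_label: str


def matrix_problems(candidates, min_sizes=3, min_families=2):
    """Return human-readable reasons the matrix falls short (empty list if none)."""
    problems = []
    sizes = {c.size_label for c in candidates}
    families = {c.family for c in candidates}
    if len(sizes) < min_sizes:
        problems.append(
            f"only {len(sizes)} distinct model sizes, need at least {min_sizes}"
        )
    if len(families) < min_families:
        problems.append(
            f"only {len(families)} model families, need at least {min_families}"
        )
    return problems
```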
3. Parameter sensitivity
For the strongest candidates, sweep important parameters such as:
- max_tokens
- section limit
- positional vs relevance-based section selection
This is where the interaction between #271 and #272 should be quantified rather than guessed.
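A full-grid sweep over these parameters might be enumerated as in the sketch below. The three axes are the ones named in this issue; the concrete values and the `SWEEP_AXES` / `sweep_configs` names are placeholders chosen for illustration, not recommended settings.

```python
from itertools import product

# Axis names follow this issue; the values are placeholder assumptions.
SWEEP_AXES = {
    "max_tokens": (256, 512, 1024),
    "section_limit": (4, 8),
    "section_selection": ("positional", "relevance"),
}


def sweep_configs(axes):
    """Expand the axes into one config dict per combination (full grid)."""
    names = list(axes)
    return [
        dict(zip(names, combo))
        for combo in product(*(axes[name] for name in names))
    ]
```

Running every combination, rather than varying one parameter at a time, is what lets the interaction between the response cap and section selection show up in the results.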
4. Reporting
Produce:
- a machine-readable results artifact
- a markdown summary table
- a recommended local-default configuration based on the results
- a reusable regression baseline for future runtime changes
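As a rough sketch, the machine-readable artifact and the markdown summary could both be derived from the same list of result rows. The column names, the `{"results": [...]}` envelope, and both function names are assumptions made for illustration.

```python
import json

# Hypothetical column set; the real suite may report different metrics.
COLUMNS = ("model", "p50_latency_ms", "json_valid_rate", "precision", "recall")


def to_json_artifact(rows):
    """Serialize result rows into a stable, diff-friendly JSON document."""
    return json.dumps({"results": rows}, indent=2, sort_keys=True)


def to_markdown_table(rows):
    """Render the same rows as a markdown summary table."""
    lines = [
        "| " + " | ".join(COLUMNS) + " |",
        "| " + " | ".join("---" for _ in COLUMNS) + " |",
    ]
    for row in rows:
        lines.append("| " + " | ".join(str(row[col]) for col in COLUMNS) + " |")
    return "\n".join(lines)
```

Keeping the JSON output sorted and stable would also let a stored artifact serve as the regression baseline for later runtime changes.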
Acceptance
- benchmark dataset covers all 4 analysis operations with at least 5 cases each
- at least 3 model sizes are compared
- parameter sweeps include the response cap (max_tokens) and section-selection settings
- output includes latency, JSON validity, precision/recall-style quality signals, and evidence quality
- the suite is good enough to support model-selection and configuration recommendations rather than one-off local experiments
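For the JSON-validity and precision/recall-style signals, a minimal scoring sketch could look like the following. It assumes findings can be compared as exact-match identifiers, which is a simplification; the function names and the handling of empty sets are illustrative choices, not the suite's defined metrics.

```python
import json


def is_valid_json(raw_output):
    """True if the raw model output parses as JSON."""
    try:
        json.loads(raw_output)
    except (TypeError, ValueError):
        return False
    return True


def precision_recall(expected, predicted):
    """Set-based precision and recall over finding identifiers.

    Assumed conventions: recall is 1.0 when nothing was expected, and
    precision is 1.0 when nothing was predicted and nothing was expected.
    """
    expected, predicted = set(expected), set(predicted)
    true_positives = len(expected & predicted)
    if predicted:
        precision = true_positives / len(predicted)
    else:
        precision = 1.0 if not expected else 0.0
    recall = true_positives / len(expected) if expected else 1.0
    return precision, recall
```

Evidence quality could be scored the same way by passing expected and predicted evidence anchors instead of findings.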