Complete benchmark suite: model matrix and parameter sensitivity across analysis operations #285

@autholykos

Context

This issue is the follow-up expansion to #273.

#273 is intentionally scoped as a minimal proving harness so the current prompt/runtime optimization slices can be measured quickly and reproducibly. Once that baseline exists, Pituitary still needs a broader benchmark suite that can guide longer-term model and parameter choices.

Goal

Expand the minimal harness into a full benchmark suite for model-quality and latency evaluation across Pituitary's analysis operations.

Scope

1. Larger golden dataset

Expand the benchmark corpus to cover each analysis type with at least 5 cases:

  • compare-specs
  • check-doc-drift
  • check-compliance
  • analyze-impact-severity

Each case should continue to include human-reviewed expected findings, expected evidence anchors, and confidence / severity expectations where relevant.
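
One possible shape for a golden case, with a check for the per-operation minimum, is sketched below. Only the four operation names and the minimum of 5 come from this issue; the `GoldenCase` fields and the `coverage_gaps` helper are hypothetical and do not reflect the harness from #273.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, Optional, Tuple

# Operation names are taken from this issue; the schema below is an assumption.
OPERATIONS = (
    "compare-specs",
    "check-doc-drift",
    "check-compliance",
    "analyze-impact-severity",
)
MIN_CASES_PER_OPERATION = 5


@dataclass(frozen=True)
class GoldenCase:
    """Hypothetical record for one human-reviewed benchmark case."""
    case_id: str
    operation: str
    expected_findings: Tuple[str, ...]
    expected_evidence_anchors: Tuple[str, ...]
    expected_severity: Optional[str] = None    # only where relevant
    expected_confidence: Optional[str] = None  # only where relevant


def coverage_gaps(cases: Iterable[GoldenCase]) -> Dict[str, int]:
    """Return how many cases each operation still needs to reach the minimum."""
    counts = Counter(case.operation for case in cases)
    return {
        op: MIN_CASES_PER_OPERATION - counts[op]
        for op in OPERATIONS
        if counts[op] < MIN_CASES_PER_OPERATION
    }
```

An empty result from `coverage_gaps` would mean the corpus meets the per-operation minimum.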

2. Model matrix

Run the suite across a real model matrix with at least 3 distinct model sizes overall, for example:

  • multiple Qwen sizes
  • at least one alternative family where relevant

The output should make latency/quality tradeoffs comparable across candidates.
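
One way to make the tradeoff comparable is to keep only candidates that no other candidate beats on both latency and quality (a Pareto front). The sketch below assumes a single latency number and a single quality score per candidate; `CandidateResult`, its field names, and the choice of F1 as the quality score are illustrative assumptions, not part of the existing harness.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass(frozen=True)
class CandidateResult:
    """Hypothetical per-candidate summary; field names are assumptions."""
    model: str             # any label identifying the candidate
    p50_latency_s: float   # lower is better
    f1: float              # higher is better


def pareto_front(results: Sequence[CandidateResult]) -> List[CandidateResult]:
    """Return candidates not dominated on (latency, quality), fastest first."""
    front = []
    for r in results:
        dominated = any(
            o.p50_latency_s <= r.p50_latency_s
            and o.f1 >= r.f1
            and (o.p50_latency_s < r.p50_latency_s or o.f1 > r.f1)
            for o in results
        )
        if not dominated:
            front.append(r)
    return sorted(front, key=lambda r: r.p50_latency_s)
```

Candidates left off the front are both slower and no better than some alternative, so they can be dropped from the recommendation shortlist.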

3. Parameter sensitivity

For the strongest candidates, sweep important parameters such as:

  • max_tokens
  • section limit
  • positional vs relevance-based section selection

This is where the interaction between #271 and #272 should be quantified rather than guessed.
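
Quantifying an interaction requires running parameter combinations together rather than varying one parameter at a time. A minimal sketch of a full-grid sweep follows; the key names and every numeric value in `SWEEP` are placeholders chosen for illustration, not recommended settings.

```python
from itertools import product
from typing import Dict, Iterator, List

# Placeholder grid: parameter names follow the list above, values are assumptions.
SWEEP: Dict[str, List] = {
    "max_tokens": [256, 512, 1024],
    "section_limit": [3, 5, 8],
    "section_selection": ["positional", "relevance"],
}


def sweep_configs(grid: Dict[str, List]) -> Iterator[Dict[str, object]]:
    """Yield one config dict per combination in the grid (Cartesian product)."""
    keys = list(grid)
    for values in product(*(grid[k] for k in keys)):
        yield dict(zip(keys, values))
```

With this placeholder grid, `sweep_configs(SWEEP)` yields 3 × 3 × 2 = 18 configurations per model, so limiting the sweep to the strongest candidates keeps the total run count manageable.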

4. Reporting

Produce:

  • a machine-readable results artifact
  • a markdown summary table
  • a recommended local-default configuration based on the results
  • a reusable regression baseline for future runtime changes
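
The first two outputs can be derived from the same list of result rows, which keeps the table and the artifact consistent. The sketch below assumes each row is a flat dict; the column names in `COLUMNS` and the `schema_version` field are hypothetical.

```python
import json
from typing import Dict, List

# Hypothetical column set; adjust to whatever the harness actually records.
COLUMNS = (
    "model",
    "max_tokens",
    "section_selection",
    "p50_latency_s",
    "json_valid_rate",
    "precision",
    "recall",
)


def to_json_artifact(rows: List[Dict[str, object]]) -> str:
    """Serialize rows as a stable, diff-friendly JSON document."""
    return json.dumps({"schema_version": 1, "results": rows}, indent=2, sort_keys=True)


def to_markdown_table(rows: List[Dict[str, object]]) -> str:
    """Render rows as a markdown table; missing values become empty cells."""
    header = "| " + " | ".join(COLUMNS) + " |"
    divider = "| " + " | ".join("---" for _ in COLUMNS) + " |"
    body = [
        "| " + " | ".join(str(row.get(col, "")) for col in COLUMNS) + " |"
        for row in rows
    ]
    return "\n".join([header, divider] + body)
```

Sorting keys in the JSON output keeps diffs small, which matters if the artifact is checked in as the regression baseline.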

Acceptance

  • benchmark dataset covers all 4 analysis operations with at least 5 cases each
  • at least 3 model sizes are compared
  • parameter sweeps include the response cap (max_tokens) and section-selection settings (section limit, positional vs relevance-based selection)
  • output includes latency, JSON validity, precision/recall-style quality signals, and evidence quality
  • the suite is good enough to support model-selection and configuration recommendations rather than one-off local experiments
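
Two of the signals above can be computed per case as sketched here. This assumes findings can be compared as exact identifiers, which is a simplification; real findings may need fuzzy or human-reviewed matching. Both function names are hypothetical.

```python
import json
from typing import Iterable, Tuple


def is_valid_json(raw: object) -> bool:
    """True if the model's raw output parses as JSON."""
    try:
        json.loads(raw)  # type: ignore[arg-type]
    except (TypeError, ValueError):
        # json.JSONDecodeError is a ValueError; TypeError covers non-string input.
        return False
    return True


def precision_recall(expected: Iterable[str], reported: Iterable[str]) -> Tuple[float, float]:
    """Set-based precision and recall over finding identifiers."""
    expected_set, reported_set = set(expected), set(reported)
    if not expected_set and not reported_set:
        # Assumed convention: nothing expected and nothing reported is a perfect score.
        return 1.0, 1.0
    true_positives = len(expected_set & reported_set)
    precision = true_positives / len(reported_set) if reported_set else 0.0
    recall = true_positives / len(expected_set) if expected_set else 0.0
    return precision, recall
```

The same set-overlap approach could be applied to evidence anchors to obtain a rough evidence-quality score.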
