10 changes: 10 additions & 0 deletions .gitignore
@@ -14,6 +14,7 @@ dist/
downloads/
eggs/
.eggs/
generation/
lib/
lib64/
parts/
@@ -152,6 +153,12 @@ output/
# local artifacts
local/

# cache artifacts
cache/

# notebook cache
docs/notebooks/cache/

# mypy
.mypy_cache/
.dmypy.json
@@ -178,3 +185,6 @@ cython_debug/

# PyPI configuration file
.pypirc

# GitHub instructions
.github/instructions/
4 changes: 4 additions & 0 deletions .semversioner/next-release/minor-20251219235819458131.json
@@ -0,0 +1,4 @@
{
"type": "minor",
"description": "Add assertion generation for data-local and data-global questions with optional validation"
}
2 changes: 1 addition & 1 deletion README.md
@@ -34,7 +34,7 @@ flowchart LR
BenchmarkQED is a suite of tools designed for automated benchmarking of retrieval-augmented generation (RAG) systems. It provides components for query generation, evaluation, and dataset preparation to facilitate reproducible testing at scale.

- **AutoQ:** Generates four classes of synthetic queries with variable data scope, ranging from <i>local queries</i> (answered using a small number of text regions) to <i>global queries</i> (requiring reasoning over large portions or the entirety of a dataset).
- **AutoE:** Evaluates RAG answers by comparing them side-by-side on key metrics—relevance, comprehensiveness, diversity, and empowerment—using the LLM-as-a-Judge approach. When ground truth is available, AutoE can also assess correctness, completeness, and other custom metrics.
- **AutoE:** Evaluates RAG answers by comparing them side-by-side on key metrics—relevance, comprehensiveness, diversity, and empowerment—using the LLM-as-a-Judge approach. When ground truth is available, AutoE can also assess correctness, completeness, and other custom metrics. Additionally, AutoE supports assertion-based scoring using either manually-authored assertions or those generated by AutoQ.
- **AutoD:** Provides data utilities for sampling and summarizing datasets, ensuring consistent inputs for query synthesis.

In addition to the tools, we also release two datasets to support the development and evaluation of RAG systems:
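The updated AutoE bullet above mentions assertion-based scoring, using assertions that are either hand-written or generated by AutoQ. As a rough illustration only — the record shape, field names, and `assertion_score` helper below are invented for this sketch and are not the project's actual schema or API — the idea is to pair each generated question with checkable assertions and reduce per-assertion judge verdicts to a score per answer:

```python
from dataclasses import dataclass, field


@dataclass
class AssertionExample:
    """Hypothetical record pairing a generated question with checkable assertions."""

    question: str
    scope: str  # e.g. "data-local" or "data-global"
    assertions: list[str] = field(default_factory=list)


def assertion_score(verdicts: list[bool]) -> float:
    """Fraction of assertions a judge marked as satisfied by a candidate answer."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


example = AssertionExample(
    question="What themes recur across the podcast transcripts?",
    scope="data-global",
    assertions=[
        "The answer identifies more than one recurring theme.",
        "Each theme is grounded in specific episodes or speakers.",
    ],
)

# Verdicts would normally come from an LLM-as-a-Judge pass over a RAG answer;
# they are hard-coded here purely to show the scoring arithmetic.
print(assertion_score([True, False]))  # 0.5
```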
17 changes: 17 additions & 0 deletions benchmark_qed/autoe/__init__.py
@@ -1,2 +1,19 @@
# Copyright (c) 2025 Microsoft Corporation.
"""Relative measure module for evaluating the performance of models."""

from benchmark_qed.autoe.visualization import (
get_available_question_sets,
get_available_rag_methods,
plot_assertion_accuracy_by_rag_method,
plot_assertion_score_distribution,
prepare_assertion_summary_data,
)

__all__ = [
"get_available_question_sets",
"get_available_rag_methods",
# Assertion-based visualizations
"plot_assertion_accuracy_by_rag_method",
"plot_assertion_score_distribution",
"prepare_assertion_summary_data",
]
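The new exports suggest a workflow of summarizing per-assertion results and then plotting accuracy per RAG method and question set. The snippet below is an independent pandas/matplotlib sketch of that kind of summary; it is not a demonstration of the `benchmark_qed.autoe.visualization` API, whose signatures do not appear in this diff, and the column names are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy per-assertion results; the column names are illustrative only.
results = pd.DataFrame({
    "rag_method": ["vector", "vector", "graph", "graph"],
    "question_set": ["data_local", "data_global", "data_local", "data_global"],
    "assertion_passed": [True, False, True, True],
})

# Mean pass rate per method and question set, analogous to an
# "assertion accuracy by RAG method" summary.
summary = (
    results.groupby(["rag_method", "question_set"])["assertion_passed"]
    .mean()
    .unstack("question_set")
)

summary.plot.bar(ylabel="assertion accuracy", rot=0)
plt.tight_layout()
plt.show()
```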