10 changes: 10 additions & 0 deletions .gitignore
@@ -14,6 +14,7 @@ dist/
downloads/
eggs/
.eggs/
generation/
lib/
lib64/
parts/
@@ -152,6 +153,12 @@ output/
# local artifacts
local/

# cache artifacts
cache/

# notebook cache
docs/notebooks/cache/

# mypy
.mypy_cache/
.dmypy.json
@@ -178,3 +185,6 @@ cython_debug/

# PyPI configuration file
.pypirc

# GitHub instructions
.github/instructions/
4 changes: 4 additions & 0 deletions .semversioner/next-release/minor-20251219235819458131.json
@@ -0,0 +1,4 @@
{
"type": "minor",
"description": "Add assertion generation for data-local and data-global questions with optional validation"
}
2 changes: 1 addition & 1 deletion README.md
@@ -34,7 +34,7 @@ flowchart LR
BenchmarkQED is a suite of tools designed for automated benchmarking of retrieval-augmented generation (RAG) systems. It provides components for query generation, evaluation, and dataset preparation to facilitate reproducible testing at scale.

- **AutoQ:** Generates four classes of synthetic queries with variable data scope, ranging from <i>local queries</i> (answered using a small number of text regions) to <i>global queries</i> (requiring reasoning over large portions or the entirety of a dataset).
- **AutoE:** Evaluates RAG answers by comparing them side-by-side on key metrics—relevance, comprehensiveness, diversity, and empowerment—using the LLM-as-a-Judge approach. When ground truth is available, AutoE can also assess correctness, completeness, and other custom metrics.
- **AutoE:** Evaluates RAG answers by comparing them side-by-side on key metrics—relevance, comprehensiveness, diversity, and empowerment—using the LLM-as-a-Judge approach. When ground truth is available, AutoE can also assess correctness, completeness, and other custom metrics. Additionally, AutoE supports assertion-based scoring using either manually-authored assertions or those generated by AutoQ.
- **AutoD:** Provides data utilities for sampling and summarizing datasets, ensuring consistent inputs for query synthesis.

In addition to the tools, we also release two datasets to support the development and evaluation of RAG systems:
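The updated AutoE bullet above mentions assertion-based scoring, using assertions that are either hand-written or generated by AutoQ. As a rough illustration only — the record shape, field names, and `assertion_score` helper below are invented for this sketch and are not the project's actual schema or API — the idea is to pair each generated question with checkable assertions and reduce per-assertion judge verdicts to a score per answer:

```python
from dataclasses import dataclass, field


@dataclass
class AssertionExample:
    """Hypothetical record pairing a generated question with checkable assertions."""

    question: str
    scope: str  # e.g. "data-local" or "data-global"
    assertions: list[str] = field(default_factory=list)


def assertion_score(verdicts: list[bool]) -> float:
    """Fraction of assertions a judge marked as satisfied by a candidate answer."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


example = AssertionExample(
    question="What themes recur across the podcast transcripts?",
    scope="data-global",
    assertions=[
        "The answer identifies more than one recurring theme.",
        "Each theme is grounded in specific episodes or speakers.",
    ],
)

# Verdicts would normally come from an LLM-as-a-Judge pass over a RAG answer;
# they are hard-coded here purely to show the scoring arithmetic.
print(assertion_score([True, False]))  # 0.5
```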
17 changes: 17 additions & 0 deletions benchmark_qed/autoe/__init__.py
@@ -1,2 +1,19 @@
# Copyright (c) 2025 Microsoft Corporation.
"""Relative measure module for evaluating the performance of models."""

from benchmark_qed.autoe.visualization import (
get_available_question_sets,
get_available_rag_methods,
plot_assertion_accuracy_by_rag_method,
plot_assertion_score_distribution,
prepare_assertion_summary_data,
)

__all__ = [
"get_available_question_sets",
"get_available_rag_methods",
# Assertion-based visualizations
"plot_assertion_accuracy_by_rag_method",
"plot_assertion_score_distribution",
"prepare_assertion_summary_data",
]
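The new exports suggest a workflow of summarizing per-assertion results and then plotting accuracy per RAG method and question set. The snippet below is an independent pandas/matplotlib sketch of that kind of summary; it is not a demonstration of the `benchmark_qed.autoe.visualization` API, whose signatures do not appear in this diff, and the column names are illustrative assumptions.

```python
import matplotlib.pyplot as plt
import pandas as pd

# Toy per-assertion results; the column names are illustrative only.
results = pd.DataFrame({
    "rag_method": ["vector", "vector", "graph", "graph"],
    "question_set": ["data_local", "data_global", "data_local", "data_global"],
    "assertion_passed": [True, False, True, True],
})

# Mean pass rate per method and question set, analogous to an
# "assertion accuracy by RAG method" summary.
summary = (
    results.groupby(["rag_method", "question_set"])["assertion_passed"]
    .mean()
    .unstack("question_set")
)

summary.plot.bar(ylabel="assertion accuracy", rot=0)
plt.tight_layout()
plt.show()
```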