diff --git a/README.md b/README.md
index 53c49c8..fce0033 100644
--- a/README.md
+++ b/README.md
@@ -6,7 +6,7 @@ TODO:
- Add blog post final url
<-->
-👉 [Benchmark-QED Docs](https://microsoft.github.io/benchmark-qed/)
+👉 [BenchmarkQED Docs](https://microsoft.github.io/benchmark-qed/)
diff --git a/benchmark_qed/llm/provider/openai.py b/benchmark_qed/llm/provider/openai.py
index 3c94ec6..139db9f 100644
--- a/benchmark_qed/llm/provider/openai.py
+++ b/benchmark_qed/llm/provider/openai.py
@@ -10,7 +10,7 @@
from benchmark_qed.config.llm_config import AuthType, LLMConfig
from benchmark_qed.llm.type.base import BaseModelOutput, BaseModelResponse, Usage
-REASONING_MODELS = ["o3-mini", "o1-mini", "o1", "o1-pro"]
+REASONING_MODELS = ["o3", "o4-mini", "o3-mini", "o1-mini", "o1", "o1-pro"]
class BaseOpenAIChat:
diff --git a/docs/cli/autoe.md b/docs/cli/autoe.md
index 6bc949c..90a8986 100644
--- a/docs/cli/autoe.md
+++ b/docs/cli/autoe.md
@@ -1,15 +1,20 @@
## Pairwise Scoring Configuration
-This document describes the configuration schema for scoring a set of conditions using a language model. It includes definitions for conditions, evaluation criteria, and model configuration. For more information about how to configure the LLM check: [LLM Configuration](llm_config.md)
+This section describes the configuration schema for performing relative comparisons of RAG methods using the LLM-as-a-Judge approach. It includes definitions for conditions, evaluation criteria, and model configuration. For more information about how to configure the LLM, please refer to: [LLM Configuration](llm_config.md)
-To generate a template configuration file you can run:
+To create a template configuration file, run:
```sh
-benchmark_qed config init autoe_pairwise local/autoe_pairwise/settings.yaml
+benchmark-qed config init autoe_pairwise local/pairwise_test/settings.yaml
```
-See more about the config init command: [Config Init CLI](config_init.md)
+To perform pairwise scoring with your configuration file, use:
+```sh
+benchmark-qed autoe pairwise-scores local/pairwise_test/settings.yaml local/pairwise_test/output
+```
+
+For information about the `config init` command, refer to: [Config Init CLI](config_init.md)
---
@@ -51,30 +56,31 @@ Top-level configuration for scoring a set of conditions.
### YAML Example
-Below is an example of how this configuration might be represented in a YAML file. The API key is referenced using an environment variable.
-
-Save the following yaml file as autoe_pairwise_settings.yaml and use with the command:
-
-```sh
-benchmark_qed autoe pairwise-scores autoe_pairwise_settings.yaml local/output_test
-```
-
-To run autoe with our [generated answers](https://github.com/microsoft/benchmark-qed/docs/example_notebooks/example_answers). See the CLI Reference section for more options.
-
+Below is an example showing how this configuration might be represented in a YAML file. The API key is referenced using an environment variable.
```yaml
base:
name: vector_rag
- answer_base_path: example_answers/vector_rag
+ answer_base_path: input/vector_rag
+
others:
- name: lazygraphrag
- answer_base_path: example_answers/lazygraphrag
+ answer_base_path: input/lazygraphrag
- name: graphrag_global
- answer_base_path: example_answers/graphrag_global
+ answer_base_path: input/graphrag_global
+
question_sets:
- activity_global
- activity_local
+
+# Optional: Custom Evaluation Criteria
+# You may define your own list of evaluation criteria here. If this section is omitted, the default criteria will be used.
+# criteria:
+# - name: "criteria name"
+# description: "criteria description"
+
trials: 4
+
llm_config:
auth_type: api_key
model: gpt-4.1
@@ -93,16 +99,21 @@ OPENAI_API_KEY=your-secret-api-key-here
## Reference-Based Scoring Configuration
-This document describes the configuration schema for evaluating generated answers against a reference set using a language model. It includes definitions for reference and generated conditions, scoring criteria, and model configuration. For more information about how to configure the LLM check: [LLM Configuration](llm_config.md)
+This section explains how to configure reference-based scoring, where generated answers are evaluated against a reference set using the LLM-as-a-Judge approach. It covers the definitions for reference and generated conditions, scoring criteria, and model configuration. For details on LLM configuration, see: [LLM Configuration](llm_config.md)
-To generate a template configuration file you can run:
+To create a template configuration file, run:
```sh
-benchmark_qed config init autoe_reference local/autoe_reference/settings.yaml
+benchmark-qed config init autoe_reference local/reference_test/settings.yaml
```
-See more about the config init command: [Config Init CLI](config_init.md)
+To perform reference-based scoring with your configuration file, run:
+
+```sh
+benchmark-qed autoe reference-scores local/reference_test/settings.yaml local/reference_test/output
+```
+For information about the `config init` command, see: [Config Init CLI](config_init.md)
---
@@ -147,27 +158,25 @@ Top-level configuration for scoring generated answers against a reference.
Below is an example of how this configuration might be represented in a YAML file. The API key is referenced using an environment variable.
-Save the following yaml file as autoe_reference_settings.yaml and use with the command:
-
-```sh
-benchmark_qed autoe reference-scores autoe_reference_settings.yaml local/output_test
-```
-
-To run autoe with our [generated answers](https://github.com/microsoft/benchmark-qed/docs/example_notebooks/example_answers). See the CLI Reference section for more options.
-
-
```yaml
reference:
name: lazygraphrag
- answer_base_path: example_answers/lazygraphrag/activity_global.json
+ answer_base_path: input/lazygraphrag/activity_global.json
generated:
- name: vector_rag
- answer_base_path: example_answers/vector_rag/activity_global.json
+ answer_base_path: input/vector_rag/activity_global.json
+# Scoring scale
score_min: 1
score_max: 10
+# Optional: Custom Evaluation Criteria
+# You may define your own list of evaluation criteria here. If this section is omitted, the default criteria will be used.
+# criteria:
+# - name: "criteria name"
+# description: "criteria description"
+
trials: 4
llm_config:
@@ -191,7 +200,7 @@ OPENAI_API_KEY=your-secret-api-key-here
## CLI Reference
-This page documents the command-line interface of the benchmark-qed autoe package.
+This section documents the command-line interface of BenchmarkQED's AutoE package.
::: mkdocs-typer2
:module: benchmark_qed.autoe.cli
diff --git a/docs/cli/autoq.md b/docs/cli/autoq.md
index 5e26427..65f76f9 100644
--- a/docs/cli/autoq.md
+++ b/docs/cli/autoq.md
@@ -1,14 +1,20 @@
## Question Generation Configuration
-This document describes the configuration schema for the question generation process, including input data, sampling, encoding, and model settings. For more information about how to configure the LLM check: [LLM Configuration](llm_config.md)
+This section provides an overview of the configuration schema for the question generation process, covering input data, sampling, encoding, and model settings. For details on configuring the LLM, see: [LLM Configuration](llm_config.md).
-To generate a template configuration file you can run:
+To create a template configuration file, run:
```sh
-benchmark_qed config init autoq local/autoq/settings.yaml
+benchmark-qed config init autoq local/autoq_test/settings.yaml
```
-See more about the config init command: [Config Init CLI](config_init.md)
+To generate synthetic queries using your configuration file, run:
+
+```sh
+benchmark-qed autoq local/autoq_test/settings.yaml local/autoq_test/output
+```
+
+For more information about the `config init` command, see: [Config Init CLI](config_init.md)
---
@@ -92,21 +98,13 @@ Top-level configuration for the entire question generation process.
Here is an example of how this configuration might look in a YAML file.
-Save the following yaml file as autoq_settings.yaml and use with the command:
-
-```sh
-benchmark_qed autoq autoq_settings.yaml local/output_test
-```
-
-To run autoq with our AP news dataset. See the CLI Reference section for more options.
-
```yaml
## Input Configuration
input:
- dataset_path: datasets/AP_news/raw_data/
+ dataset_path: ./input
input_type: json
- text_column: body_nitf
- metadata_columns: [headline, firstcreated]
+ text_column: body_nitf # The column in the dataset that contains the text to be processed. Modify this for your dataset.
+ metadata_columns: [headline, firstcreated] # Additional metadata columns to include in the input. Modify this for your dataset.
file_encoding: utf-8-sig
## Encoder configuration
@@ -133,7 +131,7 @@ embedding_model:
api_key: ${OPENAI_API_KEY}
llm_provider: openai.embedding
-## Question Generation Configuration
+## Question Generation Sample Configuration
data_local:
num_questions: 10
oversample_factor: 2.0
@@ -163,7 +161,7 @@ OPENAI_API_KEY=your-secret-api-key-here
## CLI Reference
-This page documents the command-line interface of the benchmark-qed autoq package.
+This section documents the command-line interface of BenchmarkQED's AutoQ package.
::: mkdocs-typer2
:module: benchmark_qed.autoq.cli
diff --git a/docs/index.md b/docs/index.md
index f06ec75..115999a 100644
--- a/docs/index.md
+++ b/docs/index.md
@@ -82,9 +82,9 @@ For detailed instructions on configuring and running AutoQ from the command line
To learn more about the query synthesis process and using AutoQ programmatically, refer to the [AutoQ Notebook Example](notebooks/autoq.ipynb).
### AutoE
-The AutoE component automates the evaluation of RAG methods using the LLM-as-a-Judge approach. AutoE evaluates RAG-generated answers over a set of queries, which can be generated from AutoQ or from other sources. For each query, AutoE presents an LLM with pairs of answers (along with the query and target metric) in a counterbalanced order, and the model judges whether the first answer wins, loses, or ties with the second. Aggregating these judgments across multiple queries and trials yields **win rates** for each method. By default, AutoE compares RAG answers using four quality metrics: relevance, comprehensiveness, diversity, and empowerment, while also supporting user-defined metrics.
+The AutoE component automates the evaluation of RAG methods using the LLM-as-a-Judge approach. AutoE evaluates RAG-generated answers over a set of queries, which can be generated from AutoQ or from other sources. For each query, AutoE presents an LLM with pairs of answers (along with the query and target metric) in a counterbalanced order, and the model judges whether the first answer wins, loses, or ties with the second. Aggregating these judgments across multiple queries and trials yields **win rates** for each method. By default, AutoE compares RAG answers using [four quality metrics](https://github.com/microsoft/benchmark-qed/blob/799b78b6716a8f24fcd354b89a37b429ba1e587a/benchmark_qed/config/model/score.py#L28): relevance, comprehensiveness, diversity, and empowerment. Users can also define and configure custom evaluation metrics as needed.
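+
+As a rough illustration of how these judgments roll up into **win rates**, the sketch below tallies pairwise verdicts for a single metric. It is illustrative only, not the AutoE implementation; the function name and record fields are hypothetical, and counting ties as half a win is just one common convention:
+
+```python
+# Hypothetical sketch, not the AutoE implementation: tally counterbalanced
+# pairwise verdicts into per-method win rates for one metric.
+from collections import defaultdict
+
+
+def win_rates(verdicts: list[dict]) -> dict[str, float]:
+    wins: defaultdict[str, float] = defaultdict(float)
+    totals: defaultdict[str, int] = defaultdict(int)
+    for v in verdicts:  # e.g. {"first": "lazygraphrag", "second": "vector_rag", "result": "win"}
+        first, second = v["first"], v["second"]
+        totals[first] += 1
+        totals[second] += 1
+        if v["result"] == "win":
+            wins[first] += 1.0
+        elif v["result"] == "loss":
+            wins[second] += 1.0
+        else:  # tie: split the credit between both methods
+            wins[first] += 0.5
+            wins[second] += 0.5
+    return {method: wins[method] / totals[method] for method in totals}
+```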
-When reference answers (such as ground truth or "gold standard" responses) are available, AutoE can evaluate RAG-generated answers against these references using metrics like correctness, completeness, or other user-defined criteria on a customizable scoring scale.
+When reference answers (such as ground truth or "gold standard" responses) are available, AutoE can evaluate RAG-generated answers against these references using [default metrics](https://github.com/microsoft/benchmark-qed/blob/799b78b6716a8f24fcd354b89a37b429ba1e587a/benchmark_qed/config/model/score.py#L50) such as correctness and completeness, or other user-defined criteria, on a customizable scoring scale.
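+
+For example, the scoring scale and any custom criteria are specified in the reference-based scoring settings file; the excerpt below mirrors the schema shown in the AutoE CLI configuration documentation:
+
+```yaml
+# Scoring scale used by the LLM judge
+score_min: 1
+score_max: 10
+# Optional: custom evaluation criteria; the default criteria are used if omitted
+# criteria:
+#   - name: "criteria name"
+#     description: "criteria description"
+```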
> **Choosing the Right LLM Judge**
>