# Evaluating the RAG answer quality

Follow these steps to evaluate the quality of the answers generated by the RAG flow.

## Deploy a GPT-4 model

1. Run this command to tell `azd` to deploy a GPT-4 model for evaluation:

    ```shell
    azd env set DEPLOY_EVAL_MODEL true
    ```

2. Set the capacity to the highest possible value to ensure that the evaluation runs quickly:

    ```shell
    azd env set AZURE_OPENAI_EVAL_DEPLOYMENT_CAPACITY 100
    ```

    By default, that will provision a gpt-4 model, version turbo-2024-04-09. To change those settings, set the `AZURE_OPENAI_EVAL_DEPLOYMENT` and `AZURE_OPENAI_EVAL_DEPLOYMENT_VERSION` environment variables.

3. Then run the following command to provision the model:

    ```shell
    azd provision
    ```
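
If you want to confirm that the values were recorded in your azd environment, a quick check with `azd env get-values` should work. This is a minimal sketch; the exact quoting of the output can vary between azd versions:

```shell
# List the azd environment values and filter for the evaluation settings set above.
azd env get-values | grep -E 'DEPLOY_EVAL_MODEL|AZURE_OPENAI_EVAL'
```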

## Set up the evaluation environment

Install all the dependencies for the evaluation scripts by running the following commands:

```shell
pip install -r requirements-dev.txt
pip install -r evals/requirements.txt
```

## Generate ground truth data

Modify the prompt in `evals/generate.txt` to match your database table and RAG scenario.

Generate ground truth data by running the following command:

```shell
python evals/generate_ground_truth_data.py
```

Review the generated data after running that script, removing any question/answer pairs that don't seem like realistic user input.
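
For reference, ground truth data for this kind of evaluation is typically a JSON Lines file where each line pairs a question with its expected ("truth") answer. The sketch below is illustrative only; the output file name, location, and field names depend on the generation script, so check its output before relying on this shape:

```jsonl
{"question": "Which items in the catalog are suitable for cold weather?", "truth": "The catalog lists two cold-weather items: ..."}
{"question": "What is the cheapest option in the electronics category?", "truth": "The cheapest electronics item is ..."}
```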

## Run bulk evaluation

Review the configuration in `evals/eval_config.json` to ensure that everything is correctly set up. You may want to adjust the metrics used. See the ai-rag-chat-evaluator README for more information on the available metrics.
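
As an illustration only (the keys in this repo's `evals/eval_config.json` may differ), a config for this style of evaluation typically points at the ground truth file, a results directory, the target endpoint to query, and the requested metrics:

```json
{
    "testdata_path": "evals/ground_truth.jsonl",
    "results_dir": "evals/results/experiment<TIMESTAMP>",
    "requested_metrics": ["gpt_groundedness", "gpt_relevance", "answer_length", "latency"],
    "target_url": "http://127.0.0.1:8000/chat",
    "target_parameters": {}
}
```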

By default, the evaluation script evaluates every question in the ground truth data. Run the evaluation with the following command:

```shell
python evals/evaluate.py
```

## Review the evaluation results

The evaluation script writes a summary of the evaluation results to the `evals/results` directory.

You can see a summary of results across all evaluation runs by running the following command:

```shell
python -m evaltools summary evals/results
```

Compare answers across runs by running the following command:

```shell
python -m evaltools diff evals/results/baseline/
```

## Run bulk evaluation on a PR

To run the evaluation on the changes in a PR, add a `/evaluate` comment to the PR. This triggers the evaluation workflow, which runs the evaluation on the PR changes and posts the results back to the PR.
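
For context, a comment-triggered workflow of this kind is usually defined under `.github/workflows`. The following is only a minimal sketch of the trigger pattern, not the repository's actual workflow, which may differ:

```yaml
# Illustrative sketch of a comment-triggered evaluation workflow.
name: Evaluate
on:
  issue_comment:
    types: [created]

jobs:
  evaluate:
    # Run only when the comment is on a pull request and contains /evaluate.
    if: github.event.issue.pull_request && contains(github.event.comment.body, '/evaluate')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... provision resources, run evals/evaluate.py, and post the results as a PR comment.
```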