Follow these steps to evaluate the quality of the answers generated by the RAG flow.
- Deploy a GPT-4 model
- Set up the evaluation environment
- Generate ground truth data
- Run bulk evaluation
- Review the evaluation results
- Run bulk evaluation on a PR
## Deploy a GPT-4 model

- Run this command to tell `azd` to deploy a GPT-4 model for evaluation:

    ```shell
    azd env set DEPLOY_EVAL_MODEL true
    ```

- Set the capacity to the highest possible value to ensure that the evaluation runs quickly:

    ```shell
    azd env set AZURE_OPENAI_EVAL_DEPLOYMENT_CAPACITY 100
    ```

    By default, that will provision a `gpt-4` model, version `turbo-2024-04-09`. To change those settings, set the `AZURE_OPENAI_EVAL_DEPLOYMENT` and `AZURE_OPENAI_EVAL_DEPLOYMENT_VERSION` environment variables (see the sketch after this list).

- Then, run the following command to provision the model:

    ```shell
    azd provision
    ```
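For example, if you want to evaluate with a different model than the default, you could set those overrides before provisioning. This is only a sketch: the deployment name and version below are illustrative values, so substitute a model and version that are actually available in your Azure OpenAI region.

```shell
# Illustrative only: override the evaluation model deployment settings,
# then re-run provisioning so the change takes effect.
azd env set AZURE_OPENAI_EVAL_DEPLOYMENT gpt-4o
azd env set AZURE_OPENAI_EVAL_DEPLOYMENT_VERSION 2024-05-13
azd provision
```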
## Set up the evaluation environment

Install all the dependencies for the evaluation script by running the following commands:

```shell
pip install -r requirements-dev.txt
pip install -r evals/requirements.txt
```
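These installs assume you are working in whatever Python environment you already use for this project. If you would rather keep the evaluation dependencies isolated, one minimal sketch is to create a virtual environment first (the `.venv` name is just a convention, not something this project requires) and then run the `pip install` commands above inside it:

```shell
# Optional: create and activate an isolated environment, then run the
# pip install commands above inside it.
python -m venv .venv
source .venv/bin/activate   # On Windows: .venv\Scripts\activate
```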
## Generate ground truth data

Modify the prompt in `evals/generate.txt` to match your database table and RAG scenario.

Generate ground truth data by running the following command:

```shell
python evals/generate_ground_truth_data.py
```

Review the generated data after running that script, removing any question/answer pairs that don't seem like realistic user input.
## Run bulk evaluation

Review the configuration in `evals/eval_config.json` to ensure that everything is correctly set up. You may want to adjust the metrics used; see the ai-rag-chat-evaluator README for more information on the available metrics.
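After editing the config, one quick sanity check, assuming you have Python on your path (which the evaluation scripts already require), is to run the file through the standard library's JSON tool. It pretty-prints the config and fails with an error if you have left behind a trailing comma or other syntax mistake:

```shell
# Pretty-print the config; exits with an error if the JSON is invalid.
python -m json.tool evals/eval_config.json
```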
By default, the evaluation script will evaluate every question in the ground truth data. Run the evaluation script with the following command:

```shell
python evals/evaluate.py
```

The evaluation script writes a summary of the evaluation results to the `evals/results` directory.
## Review the evaluation results

You can see a summary of results across all evaluation runs by running the following command:

```shell
python -m evaltools summary evals/results
```

Compare answers across runs by running the following command:

```shell
python -m evaltools diff evals/results/baseline/
```
## Run bulk evaluation on a PR

To run the evaluation on the changes in a PR, add a `/evaluate` comment to the PR. This triggers the evaluation workflow, which runs the evaluation on the PR changes and posts the results to the PR.