# Evaluating the RAG answer quality

Follow these steps to evaluate the quality of the answers generated by the RAG flow.

## Deploy a GPT-4 model

1. Run this command to tell `azd` to deploy a GPT-4 model for evaluation:

    ```shell
    azd env set DEPLOY_EVAL_MODEL true
    ```

2. Set the capacity to the highest possible value to ensure that the evaluation runs quickly:

    ```shell
    azd env set AZURE_OPENAI_EVAL_DEPLOYMENT_CAPACITY 100
    ```

    By default, that will provision a gpt-4 model, version turbo-2024-04-09. To change those settings, set the `AZURE_OPENAI_EVAL_DEPLOYMENT` and `AZURE_OPENAI_EVAL_DEPLOYMENT_VERSION` environment variables.

3. Then run the following command to provision the model:

    ```shell
    azd provision
    ```
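
If you want to confirm that the values were recorded in your azd environment, a quick check with `azd env get-values` should work. This is a minimal sketch; the exact quoting of the output can vary between azd versions:

```shell
# List the azd environment values and filter for the evaluation settings set above.
azd env get-values | grep -E 'DEPLOY_EVAL_MODEL|AZURE_OPENAI_EVAL'
```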

## Set up the evaluation environment

Install all the dependencies for the evaluation scripts by running the following commands:

```shell
pip install -r requirements-dev.txt
pip install -r evals/requirements.txt
```

## Generate ground truth data

Modify the prompt in `evals/generate.txt` to match your database table and RAG scenario.

Generate ground truth data by running the following command:

```shell
python evals/generate_ground_truth_data.py
```

Review the generated data after running that script, removing any question/answer pairs that don't seem like realistic user input.
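
For reference, ground truth data for this kind of evaluation is typically a JSON Lines file where each line pairs a question with its expected ("truth") answer. The sketch below is illustrative only; the output file name, location, and field names depend on the generation script, so check its output before relying on this shape:

```jsonl
{"question": "Which items in the catalog are suitable for cold weather?", "truth": "The catalog lists two cold-weather items: ..."}
{"question": "What is the cheapest option in the electronics category?", "truth": "The cheapest electronics item is ..."}
```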

## Run bulk evaluation

Review the configuration in `evals/eval_config.json` to ensure that everything is correctly set up. You may want to adjust the metrics used. See the ai-rag-chat-evaluator README for more information on the available metrics.
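
As an illustration only (the keys in this repo's `evals/eval_config.json` may differ), a config for this style of evaluation typically points at the ground truth file, a results directory, the target endpoint to query, and the requested metrics:

```json
{
    "testdata_path": "evals/ground_truth.jsonl",
    "results_dir": "evals/results/experiment<TIMESTAMP>",
    "requested_metrics": ["gpt_groundedness", "gpt_relevance", "answer_length", "latency"],
    "target_url": "http://127.0.0.1:8000/chat",
    "target_parameters": {}
}
```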

By default, the evaluation script evaluates every question in the ground truth data. Run the evaluation with the following command:

```shell
python evals/evaluate.py
```

## Review the evaluation results

The evaluation script writes a summary of the evaluation results to the `evals/results` directory.

You can see a summary of results across all evaluation runs by running the following command:

```shell
python -m evaltools summary evals/results
```

Compare answers across runs by running the following command:

```shell
python -m evaltools diff evals/results/baseline/
```

## Run bulk evaluation on a PR

To run the evaluation on the changes in a PR, add a `/evaluate` comment to the PR. This triggers the evaluation workflow, which runs the evaluation on the PR changes and posts the results back to the PR.
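
For context, a comment-triggered workflow of this kind is usually defined under `.github/workflows`. The following is only a minimal sketch of the trigger pattern, not the repository's actual workflow, which may differ:

```yaml
# Illustrative sketch of a comment-triggered evaluation workflow.
name: Evaluate
on:
  issue_comment:
    types: [created]

jobs:
  evaluate:
    # Run only when the comment is on a pull request and contains /evaluate.
    if: github.event.issue.pull_request && contains(github.event.comment.body, '/evaluate')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # ... provision resources, run evals/evaluate.py, and post the results as a PR comment.
```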