Update GPT based evaluators to force output to be a single integer (#3550)

# Description

Please add an informative description that covers the changes made by
the pull request and link all relevant issues.
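
For context on the title: the GPT-based evaluators turn the model's reply into a numeric score, and replies that cannot be parsed surface as NaN (see the CHANGELOG entry and the new regression test below). The following is a minimal sketch of that failure mode, not the promptflow implementation itself; the `parse_score` helper and its digit-regex parsing are assumptions made for illustration.

```python
import math
import re


def parse_score(llm_output: str) -> float:
    """Parse an evaluator reply into a numeric score, falling back to NaN.

    Hypothetical helper for illustration: if the model answers with prose
    that contains no digit, parsing fails and the metric becomes NaN.
    """
    match = re.search(r"\d", llm_output or "")
    return float(match.group()) if match else math.nan


# A reply constrained to a single integer parses cleanly...
print(parse_score("4"))  # 4.0
# ...while a free-form reply may not, which is what showed up as NaN scores.
print(parse_score("The answer looks coherent to me."))  # nan
```

Constraining the system prompt to demand a bare integer, as the diffs below do, makes the first case the common one.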

# All Promptflow Contribution checklist:
- [x] **The pull request does not introduce [breaking changes].**
- [x] **CHANGELOG is updated for new features, bug fixes or other
significant changes.**
- [x] **I have read the [contribution guidelines](../CONTRIBUTING.md).**
- [x] **I confirm that all new dependencies are compatible with the MIT
license.**
- [ ] **Create an issue and link to the pull request to get dedicated
review from promptflow team. Learn more: [suggested
workflow](../CONTRIBUTING.md#suggested-workflow).**

## General Guidelines and Best Practices
- [x] Title of the pull request is clear and informative.
- [x] There are a small number of commits, each of which has an
informative message. This means that previously merged commits do not
appear in the history of the PR. For more information on cleaning up the
commits in your PR, [see this
page](https://github.com/Azure/azure-powershell/blob/master/documentation/development-docs/cleaning-up-commits.md).

### Testing Guidelines
- [ ] Pull request includes test coverage for the included changes.

---------

Co-authored-by: Ankit Singhal <[email protected]>
luigiw and singankit authored Jul 20, 2024
1 parent de5aa0f commit f1e350d
Showing 24 changed files with 3,051 additions and 306 deletions.
2 changes: 2 additions & 0 deletions src/promptflow-evals/CHANGELOG.md
@@ -12,6 +12,8 @@
- Converted built-in evaluators to async-based implementation, leveraging async batch run for performance improvement.
- Parity between evals and Simulator on signature, passing credentials.
- The `AdversarialSimulator` responds with `category` of harm in the response.
- Reduced chances of NaNs in GPT based evaluators.


## v0.3.1 (2022-07-09)
- This release contains minor bug fixes and improvements.
Coherence evaluator prompt:
@@ -25,7 +25,7 @@ inputs:

---
system:
-You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
+You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.

user:
Coherence of an answer is measured by how well all the sentences fit together and sound naturally as a whole. Consider the overall quality of the answer when evaluating coherence. Given the question and answer, score the coherence of answer between one to five stars using the following rating scale:
Fluency evaluator prompt:
@@ -25,7 +25,7 @@ inputs:

---
system:
-You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
+You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.
user:
Fluency measures the quality of individual sentences in the answer, and whether they are well-written and grammatically correct. Consider the quality of individual sentences when evaluating fluency. Given the question and answer, score the fluency of the answer between one to five stars using the following rating scale:
One star: the answer completely lacks fluency
Groundedness evaluator prompt:
@@ -25,7 +25,7 @@ inputs:

---
system:
-You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
+You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.
user:
You will be presented with a CONTEXT and an ANSWER about that CONTEXT. You need to decide whether the ANSWER is entailed by the CONTEXT by choosing one of the following rating:
1. 5: The ANSWER follows logically from the information contained in the CONTEXT.
Relevance evaluator prompt:
@@ -27,7 +27,7 @@ inputs:

---
system:
-You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
+You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.
user:
Relevance measures how well the answer addresses the main aspects of the question, based on the context. Consider whether all and only the important aspects are contained in the answer when evaluating relevance. Given the context and question, score the relevance of the answer between one to five stars using the following rating scale:
One star: the answer completely lacks relevance
Similarity evaluator prompt:
@@ -27,7 +27,7 @@ inputs:

---
system:
-You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric.
+You are an AI assistant. You will be given the definition of an evaluation metric for assessing the quality of an answer in a question-answering task. Your job is to compute an accurate evaluation score using the provided evaluation metric. You should return a single integer value between 1 to 5 representing the evaluation metric. You will include no other text or information.
user:
Equivalence, as a metric, measures the similarity between the predicted answer and the correct answer. If the information and content in the predicted answer is similar or equivalent to the correct answer, then the value of the Equivalence metric should be high, else it should be low. Given the question, correct answer, and predicted answer, determine the value of Equivalence metric using the following rating scale:
One star: the predicted answer is not at all similar to the correct answer
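A hedged usage sketch of one of the updated evaluators follows, mirroring the call pattern used in the evaluator tests further down. The `AzureOpenAIModelConfiguration` import path and the placeholder endpoint, key, and deployment values are assumptions for this example; substitute whatever model configuration your promptflow-evals setup already uses.

```python
# Sketch only: endpoint, key and deployment are placeholders, and the
# AzureOpenAIModelConfiguration import path is an assumption for this example.
from promptflow.core import AzureOpenAIModelConfiguration
from promptflow.evals.evaluators import CoherenceEvaluator

model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<api-key>",
    azure_deployment="<gpt-deployment>",
)

coherence = CoherenceEvaluator(model_config)
result = coherence(
    question="What can you tell me about the moon?",
    answer="The moon is Earth's only natural satellite.",
)

# With the prompt change in this commit, the model is told to reply with a
# single integer from 1 to 5, so the parsed score should rarely be NaN.
print(result["gpt_coherence"])
```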
Evaluator tests:
@@ -1,3 +1,4 @@
import numpy as np
import pytest

from promptflow.evals.evaluators import (
@@ -121,6 +122,17 @@ def test_composite_evaluator_qa(self, model_config, parallel):
assert score["gpt_similarity"] > 0.0
assert score["f1_score"] > 0.0

def test_qa_evaluator_for_nans(self, model_config):
qa_eval = QAEvaluator(model_config)
# Test Q/A below would cause NaNs in the evaluation metrics before the fix.
score = qa_eval(question="This's the color?", answer="Black", ground_truth="gray", context="gray")

assert score["gpt_groundedness"] is not np.nan
assert score["gpt_relevance"] is not np.nan
assert score["gpt_coherence"] is not np.nan
assert score["gpt_fluency"] is not np.nan
assert score["gpt_similarity"] is not np.nan

@pytest.mark.azuretest
def test_composite_evaluator_content_safety(self, project_scope, azure_cred):
safety_eval = ContentSafetyEvaluator(project_scope, parallel=False, credential=azure_cred)