Add draft Readme.MD #2891

Merged: 8 commits, Apr 19, 2024
src/promptflow-evals/README.md (23 additions & 0 deletions)
# Prompt flow evaluators

[![Python package](https://img.shields.io/pypi/v/promptflow-evals)](https://pypi.org/project/promptflow-evals/)
[![License: MIT](https://img.shields.io/github/license/microsoft/promptflow)](https://github.com/microsoft/promptflow/blob/main/LICENSE)

## Introduction
Evaluators are prebuilt promptflow pipelines designed to measure the quality of the outputs of large language models.
The package includes the following evaluators (a short usage sketch follows the list):
- F1 score evaluator. Computes the F1 score from the actual and predicted answers.
- Chat evaluator. An ensemble of other evaluators. It accepts a list of dialog turns, each containing a question, an answer, and optionally a context, and applies the individual evaluators: it calculates the coherence and fluency of the answers, and, if all the data points include a context, the RAG-based metrics are computed by the groundedness and relevance evaluators.
- Coherence evaluator. Calculates the coherence of an answer, measured by how well all of its sentences fit together and sound natural as a whole.
- Fluency evaluator. Measures the quality of the individual sentences in the answer, i.e. whether they are well written and grammatically correct.
- Groundedness evaluator. Applied when a context is provided. It returns an integer score between 1 and 5 measuring how logically the answer follows from the context, where 1 means the answer is a false statement and 5 means it is a true statement.
- QA evaluator. An ensemble of evaluators that calculates the groundedness, relevance, coherence, fluency, and similarity of a question and answer, as well as their F1 score.
- Relevance evaluator. Measures how well the answer addresses the main aspects of the question, based on the context. It returns an integer from 1 to 5, where 1 means the answer completely lacks relevance and 5 means the relevance is perfect.
- Similarity evaluator. Measures the similarity between the predicted answer and the correct answer on a scale from 1 to 5, where 1 means no similarity and 5 means absolute similarity.
- Content safety evaluators. Score the answer received from the model based on the presence of inappropriate content. The package includes three content safety evaluators:
  * Self-harm evaluator
  * Hate/unfairness evaluator
  * Sexual evaluator
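
As a quick illustration of how these evaluators are invoked, here is a minimal sketch that runs the F1 score evaluator and the relevance evaluator. It assumes the evaluators are importable from `promptflow.evals.evaluators` and that model-based evaluators accept an Azure OpenAI model configuration; the endpoint, key, and deployment values are placeholders, and exact class and parameter names may differ between package versions, so treat this as illustrative rather than authoritative.

```python
# Minimal sketch, assuming the evaluators live in promptflow.evals.evaluators
# and that model-based evaluators take an Azure OpenAI model configuration.
from promptflow.core import AzureOpenAIModelConfiguration
from promptflow.evals.evaluators import F1ScoreEvaluator, RelevanceEvaluator

# Code-based evaluator: runs locally, comparing the predicted answer
# against the ground truth; no model is needed.
f1_evaluator = F1ScoreEvaluator()
f1_result = f1_evaluator(
    answer="The capital of Japan is Tokyo.",
    ground_truth="Tokyo is Japan's capital.",
)
print(f1_result)  # e.g. {'f1_score': ...}

# Model-based evaluator: scores relevance on a 1-5 scale using an LLM judge.
# The endpoint, key, and deployment values below are placeholders.
model_config = AzureOpenAIModelConfiguration(
    azure_endpoint="https://<your-endpoint>.openai.azure.com",
    api_key="<your-api-key>",
    azure_deployment="<your-deployment>",
)
relevance_evaluator = RelevanceEvaluator(model_config)
relevance_result = relevance_evaluator(
    question="What is the capital of Japan?",
    answer="The capital of Japan is Tokyo.",
    context="Tokyo is the capital and most populous city of Japan.",
)
print(relevance_result)  # e.g. {'gpt_relevance': 5.0}
```

The split shown above reflects the two kinds of evaluators in the list: code-based ones such as the F1 score evaluator run locally, while the model-based ones (coherence, fluency, groundedness, relevance, similarity, and the ensembles built from them) use an LLM as the judge and therefore need a model configuration.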