Rename metrics (explodinggradients#48)
Rename factuality to faithfulness to convey the idea correctly and in favor of an incoming feature that measures factual consistency.
shahules786 authored Jul 6, 2023
1 parent 9ad17b1 commit 8e29433
Showing 11 changed files with 43 additions and 43 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -74,14 +74,14 @@ dataset: Dataset

results = evaluate(dataset)
# {'ragas_score': 0.860, 'context_relavency': 0.817,
# 'factuality': 0.892, 'answer_relevancy': 0.874}
# 'faithfulness': 0.892, 'answer_relevancy': 0.874}
```
If you want a more in-depth explanation of core components, check out our [quick-start notebook](./examples/quickstart.ipynb)
## :luggage: Metrics

Ragas measures your pipeline's performance against two dimensions
1. **Factuality**: measures the factual consistency of the generated answer against the given context.
2. **Relevancy**: measures how relevant retrieved contexts and the generated answer are to the question.
1. **Faithfulness**: measures the information consistency of the generated answer against the given context. Any claims made in the answer that cannot be deduced from the context are penalized.
2. **Relevancy**: measures how relevant retrieved contexts and the generated answer are to the question. The presence of extra or redundant information is penalized.

Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.

@@ -103,7 +103,7 @@ If you want to get more involved with Ragas, check out our [discord server](http
## :raising_hand_man: FAQ
1. Why harmonic mean?

Harmonic mean penalizes extreme values. For example, if your generated answer is fully factually consistent with the context (factuality = 1) but is not relevant to the question (relevancy = 0), a simple average would give you a score of 0.5, but the harmonic mean would give you 0.0.
Harmonic mean penalizes extreme values. For example, if your generated answer is fully factually consistent with the context (faithfulness = 1) but is not relevant to the question (relevancy = 0), a simple average would give you a score of 0.5, but the harmonic mean would give you 0.0.
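
To make the arithmetic concrete, here is a tiny plain-Python sketch (illustration only, not ragas' internal implementation):

```python
# Plain-Python sketch of how ragas_score combines the two dimension scores.
def harmonic_mean(a: float, b: float) -> float:
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

print((1.0 + 0.0) / 2)          # 0.5 -- a simple average hides the irrelevant answer
print(harmonic_mean(1.0, 0.0))  # 0.0 -- the harmonic mean drags the score to zero
```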



2 changes: 1 addition & 1 deletion docs/assets/bar-graph.svg
6 changes: 3 additions & 3 deletions docs/metrics.md
@@ -1,15 +1,15 @@
# Metrics

1. `factuality` : measures the factual consistency of the generated answer against the given context. This is done using a multi-step paradigm: statements are first created from the generated answer, and each statement is then verified against the context. The score is scaled to the (0,1) range; higher is better.
1. `faithfulness` : measures the factual consistency of the generated answer against the given context. This is done using a multi-step paradigm: statements are first created from the generated answer, and each statement is then verified against the context (see the sketch after this list). The score is scaled to the (0,1) range; higher is better.
```python
from ragas.metrics import factuality
from ragas.metrics import faithfulness
# Dataset({
# features: ['question','contexts','answer'],
# num_rows: 25
# })
dataset: Dataset

results = evaluate(dataset, metrics=[factuality])
results = evaluate(dataset, metrics=[faithfulness])
```
2. `answer_relevancy`: measures how relevant the generated answer is to the prompt. This is quantified using the conditional likelihood of an LLM generating the question given the answer, and is implemented using a custom model. Values are in the (0,1) range; higher is better.
```python
from ragas.metrics import answer_relevancy

# used the same way as faithfulness above
results = evaluate(dataset, metrics=[answer_relevancy])
```
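
For intuition, here is a rough sketch of the multi-step paradigm behind `faithfulness` described in item 1: statements are created from the answer, each statement is checked against the context, and the score is the supported fraction. The helper names and their naive implementations below are stand-ins for the LLM-driven steps ragas actually uses, so treat this as an illustration of the scoring logic only.

```python
from typing import List


def extract_statements(answer: str) -> List[str]:
    # Naive stand-in: split the answer into sentences.
    # ragas instead prompts an LLM to produce atomic statements.
    return [s.strip() for s in answer.split(".") if s.strip()]


def is_supported(statement: str, context: str) -> bool:
    # Naive stand-in: crude word-overlap check.
    # ragas instead asks an LLM whether the statement can be deduced from the context.
    words = set(statement.lower().split())
    overlap = words & set(context.lower().split())
    return len(overlap) / max(len(words), 1) > 0.5


def faithfulness_score(answer: str, context: str) -> float:
    statements = extract_statements(answer)
    if not statements:
        return 0.0
    supported = sum(is_supported(s, context) for s in statements)
    return supported / len(statements)  # in the (0,1) range; higher is better


print(faithfulness_score(
    "Paris is the capital of France. It has 90 million residents.",
    "Paris is the capital and largest city of France.",
))  # 0.5 -- one of the two statements is supported by the context
```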
16 changes: 8 additions & 8 deletions examples/quickstart.ipynb
@@ -122,7 +122,7 @@
"\n",
"Ragas measures your pipeline's performance against two dimensions\n",
"\n",
"1. Factuality: measures the factual consistency of the generated answer against the given context.\n",
"1. Faithfulness: measures the factual consistency of the generated answer against the given context.\n",
"2. Relevancy: measures how relevant retrieved contexts and the generated answer are to the question.\n",
"\n",
"Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.\n",
@@ -137,7 +137,7 @@
"metadata": {},
"outputs": [],
"source": [
"from ragas.metrics import context_relevancy, answer_relevancy, factuality"
"from ragas.metrics import context_relevancy, answer_relevancy, faithfulness"
]
},
{
@@ -149,9 +149,9 @@
"\n",
"1. context_relevancy - a measure of how relevent the retrieved context is to the question. Conveys quality of the retrieval pipeline.\n",
"2. answer_relevancy - a measure of how relevent the answer is to the question\n",
"3. factuality - the factual consistancy of the answer to the context base on the question.\n",
"3. faithfulness - the factual consistancy of the answer to the context base on the question.\n",
"\n",
"**Note:** *`factuality` using OpenAI's API to compute the score. If you using this metric make sure you set the environment key `OPENAI_API_KEY` with your API key.*\n",
"**Note:** *`faithfulness` using OpenAI's API to compute the score. If you using this metric make sure you set the environment key `OPENAI_API_KEY` with your API key.*\n",
"\n",
"**Note:** *`context_relevancy` and `answer_relevancy` use very small LLMs to compute the score. It will run on CPU but having a GPU is recommended.*\n",
"\n",
@@ -188,7 +188,7 @@
{
"data": {
"text/plain": [
"{'ragas_score': 0.860, 'context_relavency': 0.817, 'factuality': 0.892, 'answer_relevancy': 0.874}"
"{'ragas_score': 0.860, 'context_relavency': 0.817, 'faithfulness': 0.892, 'answer_relevancy': 0.874}"
]
},
"execution_count": 8,
@@ -200,7 +200,7 @@
"from ragas import evaluate\n",
"\n",
"result = evaluate(\n",
" fiqa_eval[\"baseline\"], metrics=[context_relevancy, factuality, answer_relevancy]\n",
" fiqa_eval[\"baseline\"], metrics=[context_relevancy, faithfulness, answer_relevancy]\n",
")\n",
"\n",
"result"
@@ -248,7 +248,7 @@
" <th>answer</th>\n",
" <th>contexts</th>\n",
" <th>context_relavency</th>\n",
" <th>factuality</th>\n",
" <th>faithfulness</th>\n",
" <th>answer_relevancy</th>\n",
" </tr>\n",
" </thead>\n",
@@ -336,7 +336,7 @@
"3 [Set up a meeting with the bank that handles y... 0.781 \n",
"4 [The time horizon for your 401K/IRA is essenti... 0.737 \n",
"\n",
" factuality answer_relevancy \n",
" faithfulness answer_relevancy \n",
"0 1.0 0.922 \n",
"1 1.0 0.923 \n",
"2 1.0 0.824 \n",
