Rename metrics (explodinggradients#48)
Rename factuality to faithfulness to convey the idea correctly and in favor of an incoming feature that measures factual consistency.
shahules786 authored Jul 6, 2023
1 parent 9ad17b1 commit 8e29433
Showing 11 changed files with 43 additions and 43 deletions.
8 changes: 4 additions & 4 deletions README.md
@@ -74,14 +74,14 @@ dataset: Dataset

results = evaluate(dataset)
# {'ragas_score': 0.860, 'context_relavency': 0.817,
# 'factuality': 0.892, 'answer_relevancy': 0.874}
# 'faithfulness': 0.892, 'answer_relevancy': 0.874}
```
If you want a more in-depth explanation of core components, check out our [quick-start notebook](./examples/quickstart.ipynb)
## :luggage: Metrics

Ragas measures your pipeline's performance against two dimensions
1. **Factuality**: measures the factual consistency of the generated answer against the given context.
2. **Relevancy**: measures how relevant retrieved contexts and the generated answer are to the question.
1. **Faithfulness**: measures the information consistency of the generated answer against the given context. Any claims made in the answer that cannot be deduced from the context are penalized.
2. **Relevancy**: measures how relevant retrieved contexts and the generated answer are to the question. The presence of extra or redundant information is penalized.

Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.

@@ -103,7 +103,7 @@ If you want to get more involved with Ragas, check out our [discord server](http
## :raising_hand_man: FAQ
1. Why harmonic mean?

Harmonic mean penalizes extreme values. For example, if your generated answer is fully factually consistent with the context (factuality = 1) but is not relevant to the question (relevancy = 0), a simple average would give you a score of 0.5, but the harmonic mean would give you 0.0.
Harmonic mean penalizes extreme values. For example, if your generated answer is fully factually consistent with the context (faithfulness = 1) but is not relevant to the question (relevancy = 0), a simple average would give you a score of 0.5, but the harmonic mean would give you 0.0.
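
To make the arithmetic concrete, here is a tiny plain-Python sketch (illustration only, not ragas' internal implementation):

```python
# Plain-Python sketch of how ragas_score combines the two dimension scores.
def harmonic_mean(a: float, b: float) -> float:
    return 2 * a * b / (a + b) if (a + b) > 0 else 0.0

print((1.0 + 0.0) / 2)          # 0.5 -- a simple average hides the irrelevant answer
print(harmonic_mean(1.0, 0.0))  # 0.0 -- the harmonic mean drags the score to zero
```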



2 changes: 1 addition & 1 deletion docs/assets/bar-graph.svg
6 changes: 3 additions & 3 deletions docs/metrics.md
@@ -1,15 +1,15 @@
# Metrics

1. `factuality` : measures the factual consistency of the generated answer against the given context. This is done using a multi-step paradigm: statements are first created from the generated answer, and each statement is then verified against the context. The score is scaled to the (0,1) range; higher is better.
1. `faithfulness` : measures the factual consistency of the generated answer against the given context. This is done using a multi-step paradigm: statements are first created from the generated answer, and each statement is then verified against the context (see the sketch after this list). The score is scaled to the (0,1) range; higher is better.
```python
from ragas.metrics import factuality
from ragas.metrics import faithfulness
# Dataset({
# features: ['question','contexts','answer'],
# num_rows: 25
# })
dataset: Dataset

results = evaluate(dataset, metrics=[factuality])
results = evaluate(dataset, metrics=[faithfulness])
```
2. `answer_relevancy`: measures how relevant the generated answer is to the prompt. This is quantified using the conditional likelihood of an LLM generating the question given the answer, and is implemented using a custom model. Values are in the (0,1) range; higher is better.
```python
from ragas.metrics import answer_relevancy

# used the same way as faithfulness above
results = evaluate(dataset, metrics=[answer_relevancy])
```
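
For intuition, here is a rough sketch of the multi-step paradigm behind `faithfulness` described in item 1: statements are created from the answer, each statement is checked against the context, and the score is the supported fraction. The helper names and their naive implementations below are stand-ins for the LLM-driven steps ragas actually uses, so treat this as an illustration of the scoring logic only.

```python
from typing import List


def extract_statements(answer: str) -> List[str]:
    # Naive stand-in: split the answer into sentences.
    # ragas instead prompts an LLM to produce atomic statements.
    return [s.strip() for s in answer.split(".") if s.strip()]


def is_supported(statement: str, context: str) -> bool:
    # Naive stand-in: crude word-overlap check.
    # ragas instead asks an LLM whether the statement can be deduced from the context.
    words = set(statement.lower().split())
    overlap = words & set(context.lower().split())
    return len(overlap) / max(len(words), 1) > 0.5


def faithfulness_score(answer: str, context: str) -> float:
    statements = extract_statements(answer)
    if not statements:
        return 0.0
    supported = sum(is_supported(s, context) for s in statements)
    return supported / len(statements)  # in the (0,1) range; higher is better


print(faithfulness_score(
    "Paris is the capital of France. It has 90 million residents.",
    "Paris is the capital and largest city of France.",
))  # 0.5 -- one of the two statements is supported by the context
```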
16 changes: 8 additions & 8 deletions examples/quickstart.ipynb
@@ -122,7 +122,7 @@
"\n",
"Ragas measures your pipeline's performance against two dimensions\n",
"\n",
"1. Factuality: measures the factual consistency of the generated answer against the given context.\n",
"1. Faithfulness: measures the factual consistency of the generated answer against the given context.\n",
"2. Relevancy: measures how relevant retrieved contexts and the generated answer are to the question.\n",
"\n",
"Through repeated experiments, we have found that the quality of a RAG pipeline is highly dependent on these two dimensions. The final `ragas_score` is the harmonic mean of these two factors.\n",
@@ -137,7 +137,7 @@
"metadata": {},
"outputs": [],
"source": [
"from ragas.metrics import context_relevancy, answer_relevancy, factuality"
"from ragas.metrics import context_relevancy, answer_relevancy, faithfulness"
]
},
{
@@ -149,9 +149,9 @@
"\n",
"1. context_relevancy - a measure of how relevent the retrieved context is to the question. Conveys quality of the retrieval pipeline.\n",
"2. answer_relevancy - a measure of how relevent the answer is to the question\n",
"3. factuality - the factual consistancy of the answer to the context base on the question.\n",
"3. faithfulness - the factual consistancy of the answer to the context base on the question.\n",
"\n",
"**Note:** *`factuality` using OpenAI's API to compute the score. If you using this metric make sure you set the environment key `OPENAI_API_KEY` with your API key.*\n",
"**Note:** *`faithfulness` using OpenAI's API to compute the score. If you using this metric make sure you set the environment key `OPENAI_API_KEY` with your API key.*\n",
"\n",
"**Note:** *`context_relevancy` and `answer_relevancy` use very small LLMs to compute the score. It will run on CPU but having a GPU is recommended.*\n",
"\n",
@@ -188,7 +188,7 @@
{
"data": {
"text/plain": [
"{'ragas_score': 0.860, 'context_relavency': 0.817, 'factuality': 0.892, 'answer_relevancy': 0.874}"
"{'ragas_score': 0.860, 'context_relavency': 0.817, 'faithfulness': 0.892, 'answer_relevancy': 0.874}"
]
},
"execution_count": 8,
@@ -200,7 +200,7 @@
"from ragas import evaluate\n",
"\n",
"result = evaluate(\n",
" fiqa_eval[\"baseline\"], metrics=[context_relevancy, factuality, answer_relevancy]\n",
" fiqa_eval[\"baseline\"], metrics=[context_relevancy, faithfulness, answer_relevancy]\n",
")\n",
"\n",
"result"
@@ -248,7 +248,7 @@
" <th>answer</th>\n",
" <th>contexts</th>\n",
" <th>context_relavency</th>\n",
" <th>factuality</th>\n",
" <th>faithfulness</th>\n",
" <th>answer_relevancy</th>\n",
" </tr>\n",
" </thead>\n",
@@ -336,7 +336,7 @@
"3 [Set up a meeting with the bank that handles y... 0.781 \n",
"4 [The time horizon for your 401K/IRA is essenti... 0.737 \n",
"\n",
" factuality answer_relevancy \n",
" faithfulness answer_relevancy \n",
"0 1.0 0.922 \n",
"1 1.0 0.923 \n",
"2 1.0 0.824 \n",
