From 744e474dfe57b1992463fc267994baabde0134eb Mon Sep 17 00:00:00 2001
From: aravindputrevu <aravind.putrevu@gmail.com>
Date: Sat, 17 Feb 2024 01:30:22 +0100
Subject: [PATCH 1/6] Changes for Detecting Issues in a Text Dataset with
 Datalab

---
 notebooks/en/_toctree.yml                 |   2 +
 notebooks/en/index.md                     |   1 +
 notebooks/en/issues_in_text_dataset.ipynb | 571 ++++++++++++++++++++++
 3 files changed, 574 insertions(+)
 create mode 100644 notebooks/en/issues_in_text_dataset.ipynb
diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml
index 09825b25..0b6d609f 100644
--- a/notebooks/en/_toctree.yml
+++ b/notebooks/en/_toctree.yml
@@ -12,3 +12,5 @@
     title: Advanced RAG on HuggingFace documentation using LangChain
   - local: rag_evaluation
     title: RAG Evaluation
+  - local: issues_in_text_dataset
+    title: Detecting Issues in a Text Dataset with Datalab
diff --git a/notebooks/en/index.md b/notebooks/en/index.md
index b9b2a530..029a4da9 100644
--- a/notebooks/en/index.md
+++ b/notebooks/en/index.md
@@ -12,6 +12,7 @@ Check out the recently added notebooks:
 - [Fine-tuning a Code LLM on Custom Code on a single GPU](fine_tuning_code_llm_on_single_gpu)
 - [RAG Evaluation Using Synthetic data and LLM-As-A-Judge](rag_evaluation)
 - [Advanced RAG on HuggingFace documentation using LangChain](advanced_rag)
+- [Detecting Issues in a Text Dataset with Datalab](issues_in_text_dataset)
 
 You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook).
 
diff --git a/notebooks/en/issues_in_text_dataset.ipynb b/notebooks/en/issues_in_text_dataset.ipynb
new file mode 100644
index 00000000..36f19d14
--- /dev/null
+++ b/notebooks/en/issues_in_text_dataset.ipynb
@@ -0,0 +1,571 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "# Detecting Issues in a Text Dataset with Datalab\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "In this 5-minute quickstart tutorial, we use Datalab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests which are classified into 10 categories based on their intent (you can run this same code on any text classification dataset). Cleanlab automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), or otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!\n",
+    "\n",
+    "**Overview of what we'll do in this tutorial:**\n",
+    "\n",
+    "- Use a pretrained transformer model to extract the text embeddings from the customer service requests\n",
+    "\n",
+    "- Train a simple Logistic Regression model on the text embeddings to compute out-of-sample predicted probabilities\n",
+    "\n",
+    "- Run cleanlab's `Datalab` audit with these predictions and embeddings in order to identify problems like: label issues, outliers, and near duplicates in the dataset."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<div class=\"alert alert-info\">\n",
+    "Quickstart\n",
+    "<br/>\n",
+    "    \n",
+    "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.\n",
+    "\n",
+    "<div  class=markdown markdown=\"1\" style=\"background:white;margin:16px\">  \n",
+    "    \n",
+    "```ipython3 \n",
+    "from cleanlab import Datalab\n",
+    "\n",
+    "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n",
+    "lab.find_issues(pred_probs=your_pred_probs, features=your_features)\n",
+    "\n",
+    "lab.report()\n",
+    "lab.get_issues()\n",
+    "```\n",
+    "    \n",
+    "</div>\n",
+    "</div>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 1. Install required dependencies\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "You can use `pip` to install all packages required for this tutorial as follows:\n",
+    "\n",
+    "```ipython3\n",
+    "!pip install scikit-learn sentence-transformers\n",
+    "!pip install \"cleanlab[datalab]\"\n",
+    "# Make sure to install the version corresponding to this tutorial\n",
+    "# E.g. if viewing master branch documentation:\n",
+    "#     !pip install git+https://github.com/cleanlab/cleanlab.git\n",
+    "```"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "nbsphinx": "hidden"
+   },
+   "outputs": [],
+   "source": [
+    "# Package installation (hidden on docs.cleanlab.ai).\n",
+    "# If running on Colab, you may want to use a GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)\n",
+    "# Package versions we used: scikit-learn==1.2.0 sentence-transformers==2.2.2\n",
+    "\n",
+    "dependencies = [\"cleanlab\", \"sentence_transformers\", \"datasets\"]\n",
+    "\n",
+    "# Suppress outputs that may appear if tensorflow happens to be improperly installed:\n",
+    "import os \n",
+    "\n",
+    "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"  # disable parallelism to avoid deadlocks with huggingface\n",
+    "\n",
+    "if \"google.colab\" in str(get_ipython()):  # Check if it's running in Google Colab\n",
+    "    %pip install cleanlab  # for colab\n",
+    "    cmd = ' '.join([dep for dep in dependencies if dep != \"cleanlab\"])\n",
+    "    %pip install $cmd\n",
+    "else:\n",
+    "    dependencies_test = [dependency.split('>')[0] if '>' in dependency \n",
+    "                         else dependency.split('<')[0] if '<' in dependency \n",
+    "                         else dependency.split('=')[0] for dependency in dependencies]\n",
+    "    missing_dependencies = []\n",
+    "    for dependency in dependencies_test:\n",
+    "        try:\n",
+    "            __import__(dependency)\n",
+    "        except ImportError:\n",
+    "            missing_dependencies.append(dependency)\n",
+    "\n",
+    "    if len(missing_dependencies) > 0:\n",
+    "        print(\"Missing required dependencies:\")\n",
+    "        print(*missing_dependencies, sep=\", \")\n",
+    "        print(\"\\nPlease install them before running the rest of this notebook.\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import re \n",
+    "import string \n",
+    "import pandas as pd \n",
+    "from sklearn.metrics import accuracy_score, log_loss \n",
+    "from sklearn.model_selection import cross_val_predict \n",
+    "from sklearn.linear_model import LogisticRegression\n",
+    "from sentence_transformers import SentenceTransformer\n",
+    "\n",
+    "from cleanlab import Datalab"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "nbsphinx": "hidden"
+   },
+   "outputs": [],
+   "source": [
+    "# This cell is hidden from docs.cleanlab.ai \n",
+    "\n",
+    "import random \n",
+    "import numpy as np \n",
+    "\n",
+    "pd.set_option(\"display.max_colwidth\", None) \n",
+    "\n",
+    "SEED = 123456  # for reproducibility\n",
+    "np.random.seed(SEED)\n",
+    "random.seed(SEED)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Load and format the text dataset\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data = pd.read_csv(\"https://s.cleanlab.ai/banking-intent-classification.csv\")\n",
+    "data.head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "raw_texts, labels = data[\"text\"].values, data[\"label\"].values\n",
+    "num_classes = len(set(labels))\n",
+    "\n",
+    "print(f\"This dataset has {num_classes} classes.\")\n",
+    "print(f\"Classes: {set(labels)}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's view the i-th example in the dataset:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "i = 1  # change this to view other examples from the dataset\n",
+    "print(f\"Example Label: {labels[i]}\")\n",
+    "print(f\"Example Text: {raw_texts[i]}\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The data is stored as two numpy arrays:\n",
+    "\n",
+    "1. `raw_texts` stores the customer service request utterances in text format\n",
+    "2. `labels` stores the intent categories (labels) for each example"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<div class=\"alert alert-info\">\n",
+    "Bringing Your Own Data (BYOD)?\n",
+    "\n",
+    "You can easily replace the above with your own text dataset, and continue with the rest of the tutorial.\n",
+    "\n",
+    "</div>"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Next we convert the text strings into vectors better suited as inputs for our ML models. \n",
+    "\n",
+    "We will use numeric representations from a pretrained Transformer model as embeddings of our text. The [Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) library offers simple methods to compute these embeddings for text data. Here, we load the pretrained `electra-small-discriminator` model, and then run our data through the network to extract a vector embedding of each example."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "transformer = SentenceTransformer('google/electra-small-discriminator')\n",
+    "text_embeddings = transformer.encode(raw_texts)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Our subsequent ML model will directly operate on elements of `text_embeddings` in order to classify the customer service requests."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Define a classification model and compute out-of-sample predicted probabilities"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However, this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and train only the output layer, which does not require GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted embeddings.\n",
+    "\n",
+    "To identify label issues, cleanlab requires a probabilistic prediction from your model for each datapoint. However, these predictions will be _overfit_ (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to be used only with **out-of-sample** predicted class probabilities, i.e. on datapoints held out from the model during training.\n",
+    "\n",
+    "Here we obtain out-of-sample predicted class probabilities for every example in our dataset using a Logistic Regression model with cross-validation.\n",
+    "Make sure that the columns of your `pred_probs` are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "model = LogisticRegression(max_iter=400)\n",
+    "\n",
+    "pred_probs = cross_val_predict(model, text_embeddings, labels, method=\"predict_proba\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Use cleanlab to find issues in your dataset"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Given feature embeddings and the (out-of-sample) predicted class probabilities obtained from any model you have, cleanlab can quickly help you identify low-quality examples in your dataset.\n",
+    "\n",
+    "Here, we use cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we’ll simply wrap the training features and noisy labels in a dictionary. "
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_dict = {\"texts\": raw_texts, \"labels\": labels}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "All that is needed to audit your data is to call `find_issues()`. We pass in the predicted probabilities and the feature embeddings obtained above, but you do not necessarily need to provide all of this information depending on which types of issues you are interested in. The more inputs you provide, the more types of issues `Datalab` can detect in your data. Using a better model to produce these inputs will help cleanlab estimate issues more accurately."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "lab = Datalab(data_dict, label_name=\"labels\")\n",
+    "lab.find_issues(pred_probs=pred_probs, features=text_embeddings)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "After the audit is complete, review the findings using the `report` method:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "lab.report()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Label issues\n",
+    "\n",
+    "The report indicates that cleanlab identified many label issues in our dataset. We can see which examples are flagged as likely mislabeled and the label quality score for each example using the `get_issues` method, specifying `label` as an argument to focus on label issues in the data."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {
+    "scrolled": true
+   },
+   "outputs": [],
+   "source": [
+    "label_issues = lab.get_issues(\"label\")\n",
+    "label_issues.head() "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This method returns a dataframe containing a label quality score for each example. These numeric scores lie between 0 and 1, where lower scores indicate examples more likely to be mislabeled. The dataframe also contains a boolean column specifying whether or not each example is identified to have a label issue (indicating it is likely mislabeled)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We can get the subset of examples flagged with label issues, and also sort by label quality score to find the indices of the 5 most likely mislabeled examples in our dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "identified_label_issues = label_issues[label_issues[\"is_label_issue\"]]\n",
+    "lowest_quality_labels = label_issues[\"label_score\"].argsort()[:5].to_numpy()\n",
+    "\n",
+    "print(\n",
+    "    f\"cleanlab found {len(identified_label_issues)} potential label errors in the dataset.\\n\"\n",
+    "    f\"Here are indices of the top 5 most likely errors: \\n {lowest_quality_labels}\"\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Let's review some of the most likely label errors. \n",
+    "\n",
+    "Here we display the top 5 examples identified as the most likely label errors in the dataset, together with their given (original) label and a suggested alternative label from cleanlab.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data_with_suggested_labels = pd.DataFrame(\n",
+    "    {\"text\": raw_texts, \"given_label\": labels, \"suggested_label\": label_issues[\"predicted_label\"]}\n",
+    ")\n",
+    "data_with_suggested_labels.iloc[lowest_quality_labels]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "scrolled": true
+   },
+   "source": [
+    "These are very clear label errors that cleanlab has identified in this data! Note that the `given_label` does not correctly reflect the intent of these requests; whoever produced this dataset made many mistakes that are important to address before modeling the data."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Outlier issues\n",
+    "\n",
+    "According to the report, our dataset contains some outliers.\n",
+    "We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via `get_issues`. We sort the resulting DataFrame by cleanlab's outlier quality score to see the most severe outliers in our dataset."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "outlier_issues = lab.get_issues(\"outlier\")\n",
+    "outlier_issues.sort_values(\"outlier_score\").head()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "lowest_quality_outliers = outlier_issues[\"outlier_score\"].argsort()[:5]\n",
+    "\n",
+    "data.iloc[lowest_quality_outliers]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We see that cleanlab has identified entries in this dataset that do not appear to be proper customer requests. Outliers in this dataset appear to be out-of-scope customer requests and other nonsensical text that is unsuitable for intent classification. Carefully consider whether such outliers may detrimentally affect your data modeling, and consider removing them from the dataset if so."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Near-duplicate issues\n",
+    "\n",
+    "According to the report, our dataset contains some sets of nearly duplicated examples.\n",
+    "We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues`. We sort the resulting DataFrame by cleanlab's near-duplicate quality score to see the text examples in our dataset that are most nearly duplicated."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "duplicate_issues = lab.get_issues(\"near_duplicate\")\n",
+    "duplicate_issues.sort_values(\"near_duplicate_score\").head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "The results above show which examples cleanlab considers nearly duplicated (rows where `is_near_duplicate_issue == True`). Here, we see that examples 160 and 148 are nearly duplicated, as are examples 546 and 514.\n",
+    "\n",
+    "Let's view these examples to see how similar they are."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data.iloc[[160, 148]]"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "data.iloc[[546, 514]]"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "We see that these two sets of requests are indeed very similar to one another! Including near duplicates in a dataset may have unintended effects on models, so be wary about splitting them across training/test sets. Learn more about handling near duplicates in a dataset from [the FAQ](https://docs.cleanlab.ai/stable/tutorials/faq.html#How-to-handle-near-duplicate-data-identified-by-cleanlab?)."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Non-IID issues (data drift)\n",
+    "According to the report, our dataset does not appear to be Independent and Identically Distributed (IID). The overall non-IID score for the dataset (displayed below) corresponds to the `p-value` of a statistical test for whether the ordering of samples in the dataset appears related to the similarity between their feature values. A low `p-value` strongly suggests that the dataset violates the IID assumption, which is a key assumption required for conclusions (models) produced from the dataset to generalize to a larger population."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "p_value = lab.get_info('non_iid')['p-value']\n",
+    "p_value"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Here, our dataset was flagged as non-IID because the rows happened to be sorted by class label in the original data. This may be benign if we remember to shuffle rows before model training and data splitting. But if you don't know why your data was flagged as non-IID, then you should be worried about potential data drift or unexpected interactions between data points (their values may not be statistically independent). Think carefully about what future test data may look like (and whether your data is representative of the population you care about). You should not shuffle your data before the non-IID test runs, since doing so will invalidate its conclusions."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "As demonstrated above, cleanlab can automatically shortlist the most likely issues in your dataset to help you better curate your dataset for subsequent modeling. With this shortlist, you can decide whether to fix these label issues or remove nonsensical or duplicated examples from your dataset to obtain a higher-quality dataset for training your next ML model. cleanlab's issue detection can be run with outputs from *any* type of model you initially trained.\n"
+   ]
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "collapsed_sections": [],
+   "name": "Text x TensorFlow",
+   "provenance": []
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}

From 6404c7358238d89be62d9328424adf2bb7183426 Mon Sep 17 00:00:00 2001
From: aravindputrevu <aravind.putrevu@gmail.com>
Date: Tue, 27 Feb 2024 23:46:59 +0100
Subject: [PATCH 2/6] Fixed the review comments

---
 notebooks/en/issues_in_text_dataset.ipynb | 4202 ++++++++++++++++++---
 1 file changed, 3633 insertions(+), 569 deletions(-)

diff --git a/notebooks/en/issues_in_text_dataset.ipynb b/notebooks/en/issues_in_text_dataset.ipynb
index 36f19d14..2c1cda57 100644
--- a/notebooks/en/issues_in_text_dataset.ipynb
+++ b/notebooks/en/issues_in_text_dataset.ipynb
@@ -1,571 +1,3635 @@
 {
- "cells": [
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "# Detecting Issues in a Text Dataset with Datalab\n"
-   ]
+  "cells": [
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "pw6cvzTocw4G"
+      },
+      "source": [
+        "# Detecting Issues in a Text Dataset with Datalab\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "0yPBE0Xccw4J"
+      },
+      "source": [
+        "Authored by: [@aravindputrevu](https://huggingface.co/aravindputrevu)\n",
+        "\n",
+        "\n",
+        "In this 5-minute quickstart tutorial, we use Datalab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests which are classified into 10 categories based on their intent (you can run this same code on any text classification dataset). [Cleanlab](https://github.com/cleanlab/cleanlab) automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), or otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!\n",
+        "\n",
+        "**Overview of what we'll do in this tutorial:**\n",
+        "\n",
+        "- Use a pretrained transformer model to extract the text embeddings from the customer service requests\n",
+        "\n",
+        "- Train a simple Logistic Regression model on the text embeddings to compute out-of-sample predicted probabilities\n",
+        "\n",
+        "- Run cleanlab's `Datalab` audit with these predictions and embeddings in order to identify problems like: label issues, outliers, and near duplicates in the dataset."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "o__pRLFYcw4K"
+      },
+      "source": [
+        "<div class=\"alert alert-info\">\n",
+        "Quickstart\n",
+        "<br/>\n",
+        "    \n",
+        "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.\n",
+        "\n",
+        "**Note:** If running on Colab, you may want to use a GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)\n",
+        "\n",
+        "<div  class=markdown markdown=\"1\" style=\"background:white;margin:16px\">  \n",
+        "    \n",
+        "```ipython3\n",
+        "from cleanlab import Datalab\n",
+        "\n",
+        "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n",
+        "lab.find_issues(pred_probs=your_pred_probs, features=your_features)\n",
+        "\n",
+        "lab.report()\n",
+        "lab.get_issues()\n",
+        "```\n",
+        "    \n",
+        "</div>\n",
+        "</div>"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "dp4lpApmcw4K"
+      },
+      "source": [
+        "## 1. Install required dependencies\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "DjoWBgGAcw4K"
+      },
+      "source": [
+        "You can use `pip` to install all packages required for this tutorial as follows:\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
+        "!pip install -U scikit-learn sentence-transformers datasets\n",
+        "!pip install -U \"cleanlab[datalab]\""
+      ],
+      "metadata": {
+        "id": "fRsBIj3L_RUb",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 1000
+        },
+        "outputId": "2b22c97c-2373-4740-d394-7486277aa694"
+      },
+      "execution_count": 41,
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.2.2)\n",
+            "Collecting scikit-learn\n",
+            "  Downloading scikit_learn-1.4.1.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)\n",
+            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.1/12.1 MB\u001b[0m \u001b[31m38.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
+            "\u001b[?25hRequirement already satisfied: sentence-transformers in /usr/local/lib/python3.10/dist-packages (2.4.0)\n",
+            "Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.17.1)\n",
+            "Requirement already satisfied: numpy<2.0,>=1.19.5 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.25.2)\n",
+            "Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.11.4)\n",
+            "Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.3.2)\n",
+            "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.3.0)\n",
+            "Requirement already satisfied: transformers<5.0.0,>=4.32.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (4.37.2)\n",
+            "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (4.66.2)\n",
+            "Requirement already satisfied: torch>=1.11.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (2.1.0+cu121)\n",
+            "Requirement already satisfied: huggingface-hub>=0.15.1 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (0.20.3)\n",
+            "Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (9.4.0)\n",
+            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.13.1)\n",
+            "Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (14.0.2)\n",
+            "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets) (0.6)\n",
+            "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.8)\n",
+            "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (1.5.3)\n",
+            "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.31.0)\n",
+            "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.4.1)\n",
+            "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.16)\n",
+            "Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2023.6.0)\n",
+            "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.9.3)\n",
+            "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (23.2)\n",
+            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.1)\n",
+            "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n",
+            "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (23.2.0)\n",
+            "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.4.1)\n",
+            "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.5)\n",
+            "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.4)\n",
+            "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n",
+            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence-transformers) (4.9.0)\n",
+            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.3.2)\n",
+            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.6)\n",
+            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2.0.7)\n",
+            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2024.2.2)\n",
+            "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (1.12)\n",
+            "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (3.2.1)\n",
+            "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (3.1.3)\n",
+            "Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (2.1.0)\n",
+            "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (2023.12.25)\n",
+            "Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (0.15.2)\n",
+            "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (0.4.2)\n",
+            "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n",
+            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2023.4)\n",
+            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->datasets) (1.16.0)\n",
+            "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.11.0->sentence-transformers) (2.1.5)\n",
+            "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.11.0->sentence-transformers) (1.3.0)\n",
+            "Installing collected packages: scikit-learn\n",
+            "  Attempting uninstall: scikit-learn\n",
+            "    Found existing installation: scikit-learn 1.2.2\n",
+            "    Uninstalling scikit-learn-1.2.2:\n",
+            "      Successfully uninstalled scikit-learn-1.2.2\n",
+            "Successfully installed scikit-learn-1.4.1.post1\n"
+          ]
+        },
+        {
+          "output_type": "display_data",
+          "data": {
+            "application/vnd.colab-display-data+json": {
+              "pip_warning": {
+                "packages": [
+                  "sklearn"
+                ]
+              },
+              "id": "207dfdbd8b714496a56fb33ee0f11a84"
+            }
+          },
+          "metadata": {}
+        },
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Requirement already satisfied: cleanlab[datalab] in /usr/local/lib/python3.10/dist-packages (2.6.0)\n",
+            "Requirement already satisfied: numpy>=1.22.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (1.25.2)\n",
+            "Requirement already satisfied: scikit-learn>=1.1 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (1.4.1.post1)\n",
+            "Requirement already satisfied: tqdm>=4.53.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (4.66.2)\n",
+            "Requirement already satisfied: pandas>=1.4.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (1.5.3)\n",
+            "Requirement already satisfied: termcolor>=2.4.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (2.4.0)\n",
+            "Requirement already satisfied: datasets>=2.7.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (2.17.1)\n",
+            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.13.1)\n",
+            "Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (14.0.2)\n",
+            "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.6)\n",
+            "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.3.8)\n",
+            "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (2.31.0)\n",
+            "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.4.1)\n",
+            "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.70.16)\n",
+            "Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (2023.6.0)\n",
+            "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.9.3)\n",
+            "Requirement already satisfied: huggingface-hub>=0.19.4 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.20.3)\n",
+            "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (23.2)\n",
+            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (6.0.1)\n",
+            "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.4.0->cleanlab[datalab]) (2.8.2)\n",
+            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.4.0->cleanlab[datalab]) (2023.4)\n",
+            "Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.1->cleanlab[datalab]) (1.11.4)\n",
+            "Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.1->cleanlab[datalab]) (1.3.2)\n",
+            "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.1->cleanlab[datalab]) (3.3.0)\n",
+            "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.3.1)\n",
+            "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (23.2.0)\n",
+            "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.4.1)\n",
+            "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (6.0.5)\n",
+            "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.9.4)\n",
+            "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (4.0.3)\n",
+            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.19.4->datasets>=2.7.0->cleanlab[datalab]) (4.9.0)\n",
+            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas>=1.4.0->cleanlab[datalab]) (1.16.0)\n",
+            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (3.3.2)\n",
+            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (3.6)\n",
+            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (2.0.7)\n",
+            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (2024.2.2)\n"
+          ]
+        }
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 22,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:13.467211Z",
+          "iopub.status.busy": "2024-02-16T06:26:13.466877Z",
+          "iopub.status.idle": "2024-02-16T06:26:13.470222Z",
+          "shell.execute_reply": "2024-02-16T06:26:13.469761Z"
+        },
+        "id": "zgezWF-2cw4L"
+      },
+      "outputs": [],
+      "source": [
+        "import re\n",
+        "import string\n",
+        "import pandas as pd\n",
+        "from sklearn.metrics import accuracy_score, log_loss\n",
+        "from sklearn.model_selection import cross_val_predict\n",
+        "from sklearn.linear_model import LogisticRegression\n",
+        "from sentence_transformers import SentenceTransformer\n",
+        "\n",
+        "from cleanlab import Datalab"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 23,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:13.472374Z",
+          "iopub.status.busy": "2024-02-16T06:26:13.471951Z",
+          "iopub.status.idle": "2024-02-16T06:26:13.475065Z",
+          "shell.execute_reply": "2024-02-16T06:26:13.474625Z"
+        },
+        "nbsphinx": "hidden",
+        "id": "mO3pnA1ncw4L"
+      },
+      "outputs": [],
+      "source": [
+        "import random\n",
+        "import numpy as np\n",
+        "\n",
+        "pd.set_option(\"display.max_colwidth\", None)\n",
+        "\n",
+        "SEED = 123456  # for reproducibility\n",
+        "np.random.seed(SEED)\n",
+        "random.seed(SEED)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "yj_5JcO1cw4L"
+      },
+      "source": [
+        "## 2. Load and format the text dataset\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 24,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:13.476949Z",
+          "iopub.status.busy": "2024-02-16T06:26:13.476773Z",
+          "iopub.status.idle": "2024-02-16T06:26:13.502278Z",
+          "shell.execute_reply": "2024-02-16T06:26:13.501755Z"
+        },
+        "id": "HztO4qU9cw4L",
+        "outputId": "c6ff9e95-6326-413e-a72f-6f3c05af1055",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 206
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "                                                           text  label\n",
+              "0                                I am still waiting on my card?     11\n",
+              "1  What can I do if my card still hasn't arrived after 2 weeks?     11\n",
+              "2    I have been waiting over a week. Is the card still coming?     11\n",
+              "3   Can I track my card while it is in the process of delivery?     11\n",
+              "4        How do I know if I will get my card, or if it is lost?     11"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-bb25ddea-7d53-4ee3-bdc9-92ba5f185022\" class=\"colab-df-container\">\n",
+              "    <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>text</th>\n",
+              "      <th>label</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>I am still waiting on my card?</td>\n",
+              "      <td>11</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>What can I do if my card still hasn't arrived after 2 weeks?</td>\n",
+              "      <td>11</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>I have been waiting over a week. Is the card still coming?</td>\n",
+              "      <td>11</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>Can I track my card while it is in the process of delivery?</td>\n",
+              "      <td>11</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>How do I know if I will get my card, or if it is lost?</td>\n",
+              "      <td>11</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "  </div>\n"
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "dataframe",
+              "variable_name": "data",
+              "summary": "{\n  \"name\": \"data\",\n  \"rows\": 1000,\n  \"fields\": [\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 1000,\n        \"samples\": [\n          \"I made an international purchase, but the exchange rate was wrong\",\n          \"I would like to know why a withdraw I made for some cash shows up as pending.\",\n          \"I tried to get cash out of the ATM but it is taking too long\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 12,\n        \"min\": 11,\n        \"max\": 46,\n        \"num_unique_values\": 7,\n        \"samples\": [\n          11,\n          13,\n          46\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
+            }
+          },
+          "metadata": {},
+          "execution_count": 24
+        }
+      ],
+      "source": [
+        "from datasets import load_dataset\n",
+        "\n",
+        "dataset = load_dataset(\"PolyAI/banking77\", split=\"train\")\n",
+        "data = pd.DataFrame(dataset[:1000])\n",
+        "data.head()"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 25,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:13.504463Z",
+          "iopub.status.busy": "2024-02-16T06:26:13.504049Z",
+          "iopub.status.idle": "2024-02-16T06:26:13.508243Z",
+          "shell.execute_reply": "2024-02-16T06:26:13.507706Z"
+        },
+        "id": "Ujp0luqRcw4M",
+        "outputId": "b438fed5-aa75-450d-dc84-0b3398960487",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "This dataset has 7 classes.\n",
+            "Classes: {32, 34, 36, 11, 13, 46, 17}\n"
+          ]
+        }
+      ],
+      "source": [
+        "raw_texts, labels = data[\"text\"].values, data[\"label\"].values\n",
+        "num_classes = len(set(labels))\n",
+        "\n",
+        "print(f\"This dataset has {num_classes} classes.\")\n",
+        "print(f\"Classes: {set(labels)}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "PVza57cecw4M"
+      },
+      "source": [
+        "Let's view the i-th example in the dataset:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 26,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:13.510435Z",
+          "iopub.status.busy": "2024-02-16T06:26:13.510163Z",
+          "iopub.status.idle": "2024-02-16T06:26:13.513358Z",
+          "shell.execute_reply": "2024-02-16T06:26:13.512906Z"
+        },
+        "id": "lXHi90Kecw4M",
+        "outputId": "af8a9b19-986f-44fe-c564-dd83e400309e",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Example Label: 11\n",
+            "Example Text: What can I do if my card still hasn't arrived after 2 weeks?\n"
+          ]
+        }
+      ],
+      "source": [
+        "i = 1  # change this to view other examples from the dataset\n",
+        "print(f\"Example Label: {labels[i]}\")\n",
+        "print(f\"Example Text: {raw_texts[i]}\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "JH7UU9Wscw4M"
+      },
+      "source": [
+        "The data is stored as two numpy arrays:\n",
+        "\n",
+        "1. `raw_texts` stores the customer service request utterances in text format\n",
+        "2. `labels` stores the intent category (label) for each example"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "T0d80apCcw4M"
+      },
+      "source": [
+        "<div class=\"alert alert-info\">\n",
+        "Bringing Your Own Data (BYOD)?\n",
+        "\n",
+        "You can easily replace the above with your own text dataset, and continue with the rest of the tutorial.\n",
+        "\n",
+        "</div>"
+      ]
+    },
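+    {
+      "cell_type": "markdown",
+      "metadata": {},
+      "source": [
+        "As a rough sketch of what bringing your own data might look like, the snippet below builds the same two arrays from a small in-memory DataFrame. The example rows, the integer label values, and the variable name `my_data` are made up for illustration; in practice you would likely load a file with something like `pd.read_csv(...)` and may need to rename your columns to `text` and `label`.\n",
+        "\n",
+        "```python\n",
+        "import pandas as pd\n",
+        "\n",
+        "# Hypothetical stand-in for your own dataset (not part of the original tutorial).\n",
+        "my_data = pd.DataFrame(\n",
+        "    {\n",
+        "        \"text\": [\"Where is my card?\", \"Why was I charged twice?\"],\n",
+        "        \"label\": [0, 1],  # made-up integer class ids\n",
+        "    }\n",
+        ")\n",
+        "\n",
+        "# Produce the two numpy arrays that the rest of the tutorial expects.\n",
+        "raw_texts, labels = my_data[\"text\"].values, my_data[\"label\"].values\n",
+        "```\n",
+        "\n",
+        "Any source works as long as you end up with one text string and one label per example."
+      ]
+    },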
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "YLDeD09Ncw4M"
+      },
+      "source": [
+        "Next, we convert the text strings into vectors better suited as inputs for our ML models.\n",
+        "\n",
+        "We will use numeric representations from a pretrained Transformer model as embeddings of our text. The [Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) library offers simple methods to compute these embeddings for text data. Here, we load the pretrained `electra-small-discriminator` model and run our data through the network to extract a vector embedding of each example. Because this checkpoint is not a native Sentence Transformers model, the library wraps it with mean pooling, which is what the warning below reports."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 27,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:13.515306Z",
+          "iopub.status.busy": "2024-02-16T06:26:13.515126Z",
+          "iopub.status.idle": "2024-02-16T06:26:18.244024Z",
+          "shell.execute_reply": "2024-02-16T06:26:18.243354Z"
+        },
+        "id": "DbDb6Ni6cw4M",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        },
+        "outputId": "b3ff5ca8-afc6-4e0b-b2be-ba5dd7c0841b"
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stderr",
+          "text": [
+            "WARNING:sentence_transformers.SentenceTransformer:No sentence-transformers model found with name google/electra-small-discriminator. Creating a new one with MEAN pooling.\n"
+          ]
+        }
+      ],
+      "source": [
+        "transformer = SentenceTransformer('google/electra-small-discriminator')\n",
+        "text_embeddings = transformer.encode(raw_texts)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "Moz0KJvzcw4M"
+      },
+      "source": [
+        "Our subsequent ML model will operate directly on the rows of `text_embeddings` to classify the customer service requests."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "4FK2Q72gcw4M"
+      },
+      "source": [
+        "## 3. Define a classification model and compute out-of-sample predicted probabilities"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "yaicOGrhcw4N"
+      },
+      "source": [
+        "A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However, this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and train only the output layer, which does not require GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted embeddings.\n",
+        "\n",
+        "To identify label issues, cleanlab requires a probabilistic prediction from your model for each datapoint. However, these predictions will be _overfit_ (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to be used only with **out-of-sample** predicted class probabilities, i.e. on datapoints held out from the model during training.\n",
+        "\n",
+        "Here we obtain out-of-sample predicted class probabilities for every example in our dataset using a Logistic Regression model with cross-validation.\n",
+        "Make sure that the columns of your `pred_probs` are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 28,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:18.247142Z",
+          "iopub.status.busy": "2024-02-16T06:26:18.246652Z",
+          "iopub.status.idle": "2024-02-16T06:26:19.133641Z",
+          "shell.execute_reply": "2024-02-16T06:26:19.132953Z"
+        },
+        "scrolled": true,
+        "id": "tiIqp1arcw4N"
+      },
+      "outputs": [],
+      "source": [
+        "model = LogisticRegression(max_iter=400)\n",
+        "\n",
+        "pred_probs = cross_val_predict(model, text_embeddings, labels, method=\"predict_proba\")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "9s0pcMk1cw4N"
+      },
+      "source": [
+        "## 4. Use cleanlab to find issues in your dataset"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qa8ltsx9cw4N"
+      },
+      "source": [
+        "Given feature embeddings and the (out-of-sample) predicted class probabilities obtained from any model you have, cleanlab can quickly help you identify low-quality examples in your dataset.\n",
+        "\n",
+        "Here, we use cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we’ll simply wrap the training features and noisy labels in a dictionary."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 29,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:19.136722Z",
+          "iopub.status.busy": "2024-02-16T06:26:19.136482Z",
+          "iopub.status.idle": "2024-02-16T06:26:19.139419Z",
+          "shell.execute_reply": "2024-02-16T06:26:19.138870Z"
+        },
+        "id": "UNj4rWW2cw4N"
+      },
+      "outputs": [],
+      "source": [
+        "data_dict = {\"texts\": raw_texts, \"labels\": labels}"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "IpNmBc_Lcw4N"
+      },
+      "source": [
+        "All that is needed to audit your data is a call to `find_issues()`. We pass in the predicted probabilities and the feature embeddings obtained above, but you do not necessarily need to provide all of this information, depending on which types of issues you are interested in. The more inputs you provide, the more types of issues `Datalab` can detect in your data. Using a better model to produce these inputs will help cleanlab estimate issues more accurately."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 30,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:19.141893Z",
+          "iopub.status.busy": "2024-02-16T06:26:19.141673Z",
+          "iopub.status.idle": "2024-02-16T06:26:20.809087Z",
+          "shell.execute_reply": "2024-02-16T06:26:20.808461Z"
+        },
+        "scrolled": true,
+        "id": "R0xuUDRWcw4N",
+        "outputId": "6e8541c2-0e28-4907-c41a-d097212fe8a4",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Finding null issues ...\n",
+            "Finding label issues ...\n",
+            "Finding outlier issues ...\n",
+            "Fitting OOD estimator based on provided features ...\n",
+            "Finding near_duplicate issues ...\n",
+            "Finding non_iid issues ...\n",
+            "Finding class_imbalance issues ...\n",
+            "Finding underperforming_group issues ...\n",
+            "\n",
+            "Audit complete. 62 issues found in the dataset.\n"
+          ]
+        }
+      ],
+      "source": [
+        "lab = Datalab(data_dict, label_name=\"labels\")\n",
+        "lab.find_issues(pred_probs=pred_probs, features=text_embeddings)"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The output looks like this:\n",
+        "\n",
+        "```bash\n",
+        "Finding null issues ...\n",
+        "Finding label issues ...\n",
+        "Finding outlier issues ...\n",
+        "Fitting OOD estimator based on provided features ...\n",
+        "Finding near_duplicate issues ...\n",
+        "Finding non_iid issues ...\n",
+        "Finding class_imbalance issues ...\n",
+        "Finding underperforming_group issues ...\n",
+        "\n",
+        "Audit complete. 62 issues found in the dataset.\n",
+        "```"
+      ],
+      "metadata": {
+        "id": "d6Iqy0vGq7w9"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "4aitesJccw4N"
+      },
+      "source": [
+        "After the audit is complete, review the findings using the `report` method:"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 31,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:20.813057Z",
+          "iopub.status.busy": "2024-02-16T06:26:20.811515Z",
+          "iopub.status.idle": "2024-02-16T06:26:20.838760Z",
+          "shell.execute_reply": "2024-02-16T06:26:20.838088Z"
+        },
+        "scrolled": true,
+        "id": "ALXu32nzcw4N",
+        "outputId": "733d2ed4-5bcd-49e6-93a7-285f3d66278c",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "Here is a summary of the different kinds of issues found in the data:\n",
+            "\n",
+            "    issue_type  num_issues\n",
+            "       outlier          37\n",
+            "near_duplicate          14\n",
+            "         label          10\n",
+            "       non_iid           1\n",
+            "\n",
+            "Dataset Information: num_examples: 1000, num_classes: 7\n",
+            "\n",
+            "\n",
+            "---------------------- outlier issues ----------------------\n",
+            "\n",
+            "About this issue:\n",
+            "\tExamples that are very different from the rest of the dataset \n",
+            "    (i.e. potentially out-of-distribution or rare/anomalous instances).\n",
+            "    \n",
+            "\n",
+            "Number of examples with this issue: 37\n",
+            "Overall dataset quality in terms of this issue: 0.3671\n",
+            "\n",
+            "Examples representing most severe instances of this issue:\n",
+            "     is_outlier_issue  outlier_score\n",
+            "791              True       0.024866\n",
+            "601              True       0.031162\n",
+            "863              True       0.060738\n",
+            "355              True       0.064199\n",
+            "157              True       0.065075\n",
+            "\n",
+            "\n",
+            "------------------ near_duplicate issues -------------------\n",
+            "\n",
+            "About this issue:\n",
+            "\tA (near) duplicate issue refers to two or more examples in\n",
+            "    a dataset that are extremely similar to each other, relative\n",
+            "    to the rest of the dataset.  The examples flagged with this issue\n",
+            "    may be exactly duplicated, or lie atypically close together when\n",
+            "    represented as vectors (i.e. feature embeddings).\n",
+            "    \n",
+            "\n",
+            "Number of examples with this issue: 14\n",
+            "Overall dataset quality in terms of this issue: 0.5961\n",
+            "\n",
+            "Examples representing most severe instances of this issue:\n",
+            "     is_near_duplicate_issue  near_duplicate_score near_duplicate_sets  distance_to_nearest_neighbor\n",
+            "459                     True              0.009544               [429]                      0.000566\n",
+            "429                     True              0.009544               [459]                      0.000566\n",
+            "501                     True              0.046044          [412, 517]                      0.002781\n",
+            "412                     True              0.046044               [501]                      0.002781\n",
+            "698                     True              0.054626               [607]                      0.003314\n",
+            "\n",
+            "\n",
+            "----------------------- label issues -----------------------\n",
+            "\n",
+            "About this issue:\n",
+            "\tExamples whose given label is estimated to be potentially incorrect\n",
+            "    (e.g. due to annotation error) are flagged as having label issues.\n",
+            "    \n",
+            "\n",
+            "Number of examples with this issue: 10\n",
+            "Overall dataset quality in terms of this issue: 0.9930\n",
+            "\n",
+            "Examples representing most severe instances of this issue:\n",
+            "     is_label_issue  label_score  given_label  predicted_label\n",
+            "379           False     0.025486           32               11\n",
+            "100           False     0.032102           11               36\n",
+            "300           False     0.037742           32               46\n",
+            "485            True     0.057666           17               34\n",
+            "159            True     0.059408           13               11\n",
+            "\n",
+            "\n",
+            "---------------------- non_iid issues ----------------------\n",
+            "\n",
+            "About this issue:\n",
+            "\tWhether the dataset exhibits statistically significant\n",
+            "    violations of the IID assumption like:\n",
+            "    changepoints or shift, drift, autocorrelation, etc.\n",
+            "    The specific violation considered is whether the\n",
+            "    examples are ordered such that almost adjacent examples\n",
+            "    tend to have more similar feature values.\n",
+            "    \n",
+            "\n",
+            "Number of examples with this issue: 1\n",
+            "Overall dataset quality in terms of this issue: 0.0000\n",
+            "\n",
+            "Examples representing most severe instances of this issue:\n",
+            "     is_non_iid_issue  non_iid_score\n",
+            "988              True       0.563774\n",
+            "975             False       0.570179\n",
+            "997             False       0.571891\n",
+            "967             False       0.572357\n",
+            "956             False       0.577413\n",
+            "\n",
+            "Additional Information: \n",
+            "p-value: 0.0\n"
+          ]
+        }
+      ],
+      "source": [
+        "lab.report()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The output of `lab.report()` looks like this:\n",
+        "\n",
+        "```bash\n",
+        "Here is a summary of the different kinds of issues found in the data:\n",
+        "\n",
+        "    issue_type  num_issues\n",
+        "       outlier          37\n",
+        "near_duplicate          14\n",
+        "         label          10\n",
+        "       non_iid           1\n",
+        "\n",
+        "Dataset Information: num_examples: 1000, num_classes: 7\n",
+        "\n",
+        "\n",
+        "---------------------- outlier issues ----------------------\n",
+        "\n",
+        "About this issue:\n",
+        "\tExamples that are very different from the rest of the dataset\n",
+        "    (i.e. potentially out-of-distribution or rare/anomalous instances).\n",
+        "    \n",
+        "\n",
+        "Number of examples with this issue: 37\n",
+        "Overall dataset quality in terms of this issue: 0.3671\n",
+        "\n",
+        "Examples representing most severe instances of this issue:\n",
+        "     is_outlier_issue  outlier_score\n",
+        "791              True       0.024866\n",
+        "601              True       0.031162\n",
+        "863              True       0.060738\n",
+        "355              True       0.064199\n",
+        "157              True       0.065075\n",
+        "\n",
+        "\n",
+        "------------------ near_duplicate issues -------------------\n",
+        "\n",
+        "About this issue:\n",
+        "\tA (near) duplicate issue refers to two or more examples in\n",
+        "    a dataset that are extremely similar to each other, relative\n",
+        "    to the rest of the dataset.  The examples flagged with this issue\n",
+        "    may be exactly duplicated, or lie atypically close together when\n",
+        "    represented as vectors (i.e. feature embeddings).\n",
+        "    \n",
+        "\n",
+        "Number of examples with this issue: 14\n",
+        "Overall dataset quality in terms of this issue: 0.5961\n",
+        "\n",
+        "Examples representing most severe instances of this issue:\n",
+        "     is_near_duplicate_issue  near_duplicate_score near_duplicate_sets  distance_to_nearest_neighbor\n",
+        "459                     True              0.009544               [429]                      0.000566\n",
+        "429                     True              0.009544               [459]                      0.000566\n",
+        "501                     True              0.046044          [412, 517]                      0.002781\n",
+        "412                     True              0.046044               [501]                      0.002781\n",
+        "698                     True              0.054626               [607]                      0.003314\n",
+        "\n",
+        "\n",
+        "----------------------- label issues -----------------------\n",
+        "\n",
+        "About this issue:\n",
+        "\tExamples whose given label is estimated to be potentially incorrect\n",
+        "    (e.g. due to annotation error) are flagged as having label issues.\n",
+        "    \n",
+        "\n",
+        "Number of examples with this issue: 10\n",
+        "Overall dataset quality in terms of this issue: 0.9930\n",
+        "\n",
+        "Examples representing most severe instances of this issue:\n",
+        "     is_label_issue  label_score  given_label  predicted_label\n",
+        "379           False     0.025486           32               11\n",
+        "100           False     0.032102           11               36\n",
+        "300           False     0.037742           32               46\n",
+        "485            True     0.057666           17               34\n",
+        "159            True     0.059408           13               11\n",
+        "\n",
+        "\n",
+        "---------------------- non_iid issues ----------------------\n",
+        "\n",
+        "About this issue:\n",
+        "\tWhether the dataset exhibits statistically significant\n",
+        "    violations of the IID assumption like:\n",
+        "    changepoints or shift, drift, autocorrelation, etc.\n",
+        "    The specific violation considered is whether the\n",
+        "    examples are ordered such that almost adjacent examples\n",
+        "    tend to have more similar feature values.\n",
+        "    \n",
+        "\n",
+        "Number of examples with this issue: 1\n",
+        "Overall dataset quality in terms of this issue: 0.0000\n",
+        "\n",
+        "Examples representing most severe instances of this issue:\n",
+        "     is_non_iid_issue  non_iid_score\n",
+        "988              True       0.563774\n",
+        "975             False       0.570179\n",
+        "997             False       0.571891\n",
+        "967             False       0.572357\n",
+        "956             False       0.577413\n",
+        "\n",
+        "Additional Information:\n",
+        "p-value: 0.0\n",
+        "```"
+      ],
+      "metadata": {
+        "id": "XI03VkWHrixv"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sAuLE6Macw4N"
+      },
+      "source": [
+        "### Label issues\n",
+        "\n",
+        "The report indicates that cleanlab flagged 10 examples in our dataset as having label issues. We can see which examples are flagged as likely mislabeled, along with the label quality score for each example, using the `get_issues` method, specifying `label` as an argument to focus on label issues in the data."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 32,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:20.843083Z",
+          "iopub.status.busy": "2024-02-16T06:26:20.842045Z",
+          "iopub.status.idle": "2024-02-16T06:26:20.852505Z",
+          "shell.execute_reply": "2024-02-16T06:26:20.852016Z"
+        },
+        "scrolled": true,
+        "id": "6gATaXWscw4N",
+        "outputId": "0d0e70c5-1548-4fe6-b67e-668c8dfedf0e",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 206
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "   is_label_issue  label_score  given_label  predicted_label\n",
+              "0           False     0.903926           11               11\n",
+              "1           False     0.860544           11               11\n",
+              "2           False     0.658309           11               11\n",
+              "3           False     0.697085           11               11\n",
+              "4           False     0.434934           11               11"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-5a26d9e1-9d7e-4327-9c38-a2fa12e59f28\" class=\"colab-df-container\">\n",
+              "    <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>is_label_issue</th>\n",
+              "      <th>label_score</th>\n",
+              "      <th>given_label</th>\n",
+              "      <th>predicted_label</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>0</th>\n",
+              "      <td>False</td>\n",
+              "      <td>0.903926</td>\n",
+              "      <td>11</td>\n",
+              "      <td>11</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>1</th>\n",
+              "      <td>False</td>\n",
+              "      <td>0.860544</td>\n",
+              "      <td>11</td>\n",
+              "      <td>11</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>2</th>\n",
+              "      <td>False</td>\n",
+              "      <td>0.658309</td>\n",
+              "      <td>11</td>\n",
+              "      <td>11</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>3</th>\n",
+              "      <td>False</td>\n",
+              "      <td>0.697085</td>\n",
+              "      <td>11</td>\n",
+              "      <td>11</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>4</th>\n",
+              "      <td>False</td>\n",
+              "      <td>0.434934</td>\n",
+              "      <td>11</td>\n",
+              "      <td>11</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "  </div>\n"
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "dataframe",
+              "variable_name": "label_issues",
+              "summary": "{\n  \"name\": \"label_issues\",\n  \"rows\": 1000,\n  \"fields\": [\n    {\n      \"column\": \"is_label_issue\",\n      \"properties\": {\n        \"dtype\": \"boolean\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          true,\n          false\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"label_score\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.2150390046430028,\n        \"min\": 0.025486333476725527,\n        \"max\": 0.999751760644687,\n        \"num_unique_values\": 1000,\n        \"samples\": [\n          0.98954913626076,\n          0.44264330724848383\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"given_label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 12,\n        \"min\": 11,\n        \"max\": 46,\n        \"num_unique_values\": 7,\n        \"samples\": [\n          11,\n          13\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"predicted_label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 12,\n        \"min\": 11,\n        \"max\": 46,\n        \"num_unique_values\": 7,\n        \"samples\": [\n          11,\n          13\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
+            }
+          },
+          "metadata": {},
+          "execution_count": 32
+        }
+      ],
+      "source": [
+        "label_issues = lab.get_issues(\"label\")\n",
+        "label_issues.head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "|   | is_label_issue | label_score | given_label | predicted_label |\n",
+        "|---|----------------|-------------|-------------|-----------------|\n",
+        "| 0 | False          | 0.903926    | 11          | 11              |\n",
+        "| 1 | False          | 0.860544    | 11          | 11              |\n",
+        "| 2 | False          | 0.658309    | 11          | 11              |\n",
+        "| 3 | False          | 0.697085    | 11          | 11              |\n",
+        "| 4 | False          | 0.434934    | 11          | 11              |\n"
+      ],
+      "metadata": {
+        "id": "eBLFyMMcs5NT"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "-tYlhmKYcw4N"
+      },
+      "source": [
+        "This method returns a DataFrame containing a label quality score for each example. These scores lie between 0 and 1, and lower scores indicate examples that are more likely to be mislabeled. The DataFrame also contains a boolean column, `is_label_issue`, specifying whether each example has been flagged as likely mislabeled."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "XcD-oCLlcw4N"
+      },
+      "source": [
+        "We can get the subset of examples flagged with label issues, and also sort by label quality score to find the indices of the 5 most likely mislabeled examples in our dataset."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 33,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:20.854743Z",
+          "iopub.status.busy": "2024-02-16T06:26:20.854394Z",
+          "iopub.status.idle": "2024-02-16T06:26:20.858961Z",
+          "shell.execute_reply": "2024-02-16T06:26:20.858409Z"
+        },
+        "id": "QtloV-NBcw4N",
+        "outputId": "86c32e99-7dc8-470c-b102-f0f5acc13855",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "stream",
+          "name": "stdout",
+          "text": [
+            "cleanlab found 10 potential label errors in the dataset.\n",
+            "Here are indices of the top 5 most likely errors: \n",
+            " [379 100 300 485 159]\n"
+          ]
+        }
+      ],
+      "source": [
+        "identified_label_issues = label_issues[label_issues[\"is_label_issue\"]]\n",
+        "lowest_quality_labels = label_issues[\"label_score\"].argsort()[:5].to_numpy()\n",
+        "\n",
+        "print(\n",
+        "    f\"cleanlab found {len(identified_label_issues)} potential label errors in the dataset.\\n\"\n",
+        "    f\"Here are indices of the top 5 most likely errors: \\n {lowest_quality_labels}\"\n",
+        ")"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The output of the above cell looks like this:\n",
+        "\n",
+        "```text\n",
+        "cleanlab found 10 potential label errors in the dataset.\n",
+        "Here are indices of the top 5 most likely errors:\n",
+        " [379 100 300 485 159]\n",
+        "```"
+      ],
+      "metadata": {
+        "id": "QyW7qUNKXOz5"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "8J49bTeocw4N"
+      },
+      "source": [
+        "Let's review some of the most likely label errors.\n",
+        "\n",
+        "Here we display the top 5 examples identified as the most likely label errors in the dataset, together with their given (original) label and a suggested alternative label from cleanlab.\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 18,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:20.861048Z",
+          "iopub.status.busy": "2024-02-16T06:26:20.860742Z",
+          "iopub.status.idle": "2024-02-16T06:26:20.867443Z",
+          "shell.execute_reply": "2024-02-16T06:26:20.866904Z"
+        },
+        "id": "c-niFVJvcw4N",
+        "outputId": "5bbc5217-3581-4e2e-8b56-7a1fc77cc427",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 276
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "                                                                                                          text  \\\n",
+              "379  Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from?   \n",
+              "100                                                                        can you share card tracking number?   \n",
+              "300                                                   If I need to cash foreign transfers, how does that work?   \n",
+              "485                                          Was I charged more than I should of been for a currency exchange?   \n",
+              "159                                                                Is there any way to see my card in the app?   \n",
+              "\n",
+              "     given_label  suggested_label  \n",
+              "379           32               11  \n",
+              "100           11               36  \n",
+              "300           32               46  \n",
+              "485           17               34  \n",
+              "159           13               11  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-b3e89f50-5593-4907-a4ca-3b41c4a5e72c\" class=\"colab-df-container\">\n",
+              "    <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>text</th>\n",
+              "      <th>given_label</th>\n",
+              "      <th>suggested_label</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>379</th>\n",
+              "      <td>Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from?</td>\n",
+              "      <td>32</td>\n",
+              "      <td>11</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>100</th>\n",
+              "      <td>can you share card tracking number?</td>\n",
+              "      <td>11</td>\n",
+              "      <td>36</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>300</th>\n",
+              "      <td>If I need to cash foreign transfers, how does that work?</td>\n",
+              "      <td>32</td>\n",
+              "      <td>46</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>485</th>\n",
+              "      <td>Was I charged more than I should of been for a currency exchange?</td>\n",
+              "      <td>17</td>\n",
+              "      <td>34</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>159</th>\n",
+              "      <td>Is there any way to see my card in the app?</td>\n",
+              "      <td>13</td>\n",
+              "      <td>11</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "  </div>\n"
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "dataframe",
+              "summary": "{\n  \"name\": \"data_with_suggested_labels\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"can you share card tracking number?\",\n          \"Is there any way to see my card in the app?\",\n          \"If I need to cash foreign transfers, how does that work?\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"given_label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 10,\n        \"min\": 11,\n        \"max\": 32,\n        \"num_unique_values\": 4,\n        \"samples\": [\n          11,\n          13,\n          32\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"suggested_label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 15,\n        \"min\": 11,\n        \"max\": 46,\n        \"num_unique_values\": 4,\n        \"samples\": [\n          36,\n          34,\n          11\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
+            }
+          },
+          "metadata": {},
+          "execution_count": 18
+        }
+      ],
+      "source": [
+        "data_with_suggested_labels = pd.DataFrame(\n",
+        "    {\"text\": raw_texts, \"given_label\": labels, \"suggested_label\": label_issues[\"predicted_label\"]}\n",
+        ")\n",
+        "data_with_suggested_labels.iloc[lowest_quality_labels]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The output of the above command looks like this:\n",
+        "\n",
+        "|     | text                                                                                                      | given_label | suggested_label |\n",
+        "|-----|-----------------------------------------------------------------------------------------------------------|-------------|-----------------|\n",
+        "| 379 | Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from? | 32          | 11              |\n",
+        "| 100 | can you share card tracking number?                                                                       | 11          | 36              |\n",
+        "| 300 | If I need to cash foreign transfers, how does that work?                                                  | 32          | 46              |\n",
+        "| 485 | Was I charged more than I should of been for a currency exchange?                                         | 17          | 34              |\n",
+        "| 159 | Is there any way to see my card in the app?                                                               | 13          | 11              |\n"
+      ],
+      "metadata": {
+        "id": "g2dvMySPtkbL"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "scrolled": true,
+        "id": "eH8ltGj0cw4O"
+      },
+      "source": [
+        "These are very clear label errors that cleanlab has identified in this data! Note that the `given_label` does not correctly reflect the intent of these requests. Whoever produced this dataset made many mistakes that are important to address before modeling the data."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ULFeD3bzcw4O"
+      },
+      "source": [
+        "### Outlier issues\n",
+        "\n",
+        "According to the report, our dataset contains some outliers.\n",
+        "We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via `get_issues`. We sort the resulting DataFrame by cleanlab's outlier quality score to see the most severe outliers in our dataset."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 20,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:20.869718Z",
+          "iopub.status.busy": "2024-02-16T06:26:20.869251Z",
+          "iopub.status.idle": "2024-02-16T06:26:20.876386Z",
+          "shell.execute_reply": "2024-02-16T06:26:20.875851Z"
+        },
+        "id": "jBLuqUXBcw4O",
+        "outputId": "d5d2dbc6-c708-4750-e3ea-6dcd5c24a64d",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 206
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     is_outlier_issue  outlier_score\n",
+              "791              True       0.024866\n",
+              "601              True       0.031162\n",
+              "863              True       0.060738\n",
+              "355              True       0.064199\n",
+              "157              True       0.065075"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-b302658a-5cf7-4fcc-a69d-150ef156f2ce\" class=\"colab-df-container\">\n",
+              "    <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>is_outlier_issue</th>\n",
+              "      <th>outlier_score</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>791</th>\n",
+              "      <td>True</td>\n",
+              "      <td>0.024866</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>601</th>\n",
+              "      <td>True</td>\n",
+              "      <td>0.031162</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>863</th>\n",
+              "      <td>True</td>\n",
+              "      <td>0.060738</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>355</th>\n",
+              "      <td>True</td>\n",
+              "      <td>0.064199</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>157</th>\n",
+              "      <td>True</td>\n",
+              "      <td>0.065075</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "    <div class=\"colab-df-buttons\">\n",
+              "\n",
+              "  <div class=\"colab-df-container\">\n",
+              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-b302658a-5cf7-4fcc-a69d-150ef156f2ce')\"\n",
+              "            title=\"Convert this dataframe to an interactive table.\"\n",
+              "            style=\"display:none;\">\n",
+              "\n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
+              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
+              "  </svg>\n",
+              "    </button>\n",
+              "\n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-buttons div {\n",
+              "      margin-bottom: 4px;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "    <script>\n",
+              "      const buttonEl =\n",
+              "        document.querySelector('#df-b302658a-5cf7-4fcc-a69d-150ef156f2ce button.colab-df-convert');\n",
+              "      buttonEl.style.display =\n",
+              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "      async function convertToInteractive(key) {\n",
+              "        const element = document.querySelector('#df-b302658a-5cf7-4fcc-a69d-150ef156f2ce');\n",
+              "        const dataTable =\n",
+              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                    [key], {});\n",
+              "        if (!dataTable) return;\n",
+              "\n",
+              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "          + ' to learn more about interactive tables.';\n",
+              "        element.innerHTML = '';\n",
+              "        dataTable['output_type'] = 'display_data';\n",
+              "        await google.colab.output.renderOutput(dataTable, element);\n",
+              "        const docLink = document.createElement('div');\n",
+              "        docLink.innerHTML = docLinkHtml;\n",
+              "        element.appendChild(docLink);\n",
+              "      }\n",
+              "    </script>\n",
+              "  </div>\n",
+              "    </div>\n",
+              "  </div>\n"
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "dataframe",
+              "summary": "{\n  \"name\": \"outlier_issues\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"is_outlier_issue\",\n      \"properties\": {\n        \"dtype\": \"boolean\",\n        \"num_unique_values\": 1,\n        \"samples\": [\n          true\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"outlier_score\",\n      \"properties\": {\n        \"dtype\": \"float32\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          0.03116183541715145\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
+            }
+          },
+          "metadata": {},
+          "execution_count": 20
+        }
+      ],
+      "source": [
+        "outlier_issues = lab.get_issues(\"outlier\")\n",
+        "outlier_issues.sort_values(\"outlier_score\").head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The output should look similar to the following:\n",
+        "\n",
+        "| is_outlier_issue | outlier_score |\n",
+        "|------------------|---------------|\n",
+        "| True             | 0.024866      |\n",
+        "| True             | 0.031162      |\n",
+        "| True             | 0.060738      |\n",
+        "| True             | 0.064199      |\n",
+        "| True             | 0.065075      |"
+      ],
+      "metadata": {
+        "id": "F7Z2VJQAujui"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 34,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:20.878435Z",
+          "iopub.status.busy": "2024-02-16T06:26:20.878117Z",
+          "iopub.status.idle": "2024-02-16T06:26:20.884073Z",
+          "shell.execute_reply": "2024-02-16T06:26:20.883533Z"
+        },
+        "id": "Kjn-muLGcw4O",
+        "outputId": "a5ae0a32-cac4-442d-89fc-8f7f64da9dfc",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 246
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "                                            text  label\n",
+              "791                  withdrawal pending meaning?     46\n",
+              "601                    $1 charge in transaction.     34\n",
+              "863              My atm withdraw is stillpending     46\n",
+              "355          explain the interbank exchange rate     32\n",
+              "157  lost card found, want to put it back in app     13"
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-30ae3a5c-a5ac-48e2-be07-1bc96f12dd24\" class=\"colab-df-container\">\n",
+              "    <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>text</th>\n",
+              "      <th>label</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>791</th>\n",
+              "      <td>withdrawal pending meaning?</td>\n",
+              "      <td>46</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>601</th>\n",
+              "      <td>$1 charge in transaction.</td>\n",
+              "      <td>34</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>863</th>\n",
+              "      <td>My atm withdraw is stillpending</td>\n",
+              "      <td>46</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>355</th>\n",
+              "      <td>explain the interbank exchange rate</td>\n",
+              "      <td>32</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>157</th>\n",
+              "      <td>lost card found, want to put it back in app</td>\n",
+              "      <td>13</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "  </div>\n"
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "dataframe",
+              "summary": "{\n  \"name\": \"data\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 5,\n        \"samples\": [\n          \"$1 charge in transaction.\",\n          \"lost card found, want to put it back in app\",\n          \"My atm withdraw is stillpending\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 13,\n        \"min\": 13,\n        \"max\": 46,\n        \"num_unique_values\": 4,\n        \"samples\": [\n          34,\n          13,\n          46\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
+            }
+          },
+          "metadata": {},
+          "execution_count": 34
+        }
+      ],
+      "source": [
+        "# Positions of the five examples with the lowest outlier scores (the most severe outliers)\n",
+        "lowest_quality_outliers = outlier_issues[\"outlier_score\"].argsort()[:5]\n",
+        "\n",
+        "data.iloc[lowest_quality_outliers]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "The five examples with the lowest outlier scores should look similar to the following:\n",
+        "\n",
+        "|index|text|label|\n",
+        "|---|---|---|\n",
+        "|791|withdrawal pending meaning?|46|\n",
+        "|601|$1 charge in transaction\\.|34|\n",
+        "|863|My atm withdraw is stillpending|46|\n",
+        "|355|explain the interbank exchange rate|32|\n",
+        "|157|lost card found, want to put it back in app|13|\n"
+      ],
+      "metadata": {
+        "id": "kuZMsLPZYARL"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "sBal-KDrcw4R"
+      },
+      "source": [
+        "We see that cleanlab has identified entries in this dataset that do not appear to be proper customer requests. Outliers in this dataset appear to be out-of-scope customer requests and other nonsensical text that is unsuitable for intent classification. Carefully consider whether such outliers may harm your data modeling, and consider removing them from the dataset if so."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "ch71b_0qcw4S"
+      },
+      "source": [
+        "### Near-duplicate issues\n",
+        "\n",
+        "According to the report, our dataset contains some sets of nearly duplicated examples.\n",
+        "Using `get_issues`, we can see which examples are (nearly) duplicated, along with a numeric quality score that quantifies how dissimilar each example is from its nearest neighbor in the dataset. We sort the resulting DataFrame by cleanlab's near-duplicate quality score to see the text examples in our dataset that are most nearly duplicated."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 35,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:20.886079Z",
+          "iopub.status.busy": "2024-02-16T06:26:20.885805Z",
+          "iopub.status.idle": "2024-02-16T06:26:20.894466Z",
+          "shell.execute_reply": "2024-02-16T06:26:20.893919Z"
+        },
+        "id": "TbI49Rdccw4S",
+        "outputId": "1978cdb5-02c2-4f82-e7d5-553ad1b6dca9",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 226
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "     is_near_duplicate_issue  near_duplicate_score near_duplicate_sets  \\\n",
+              "459                     True              0.009544               [429]   \n",
+              "429                     True              0.009544               [459]   \n",
+              "501                     True              0.046044          [412, 517]   \n",
+              "412                     True              0.046044               [501]   \n",
+              "698                     True              0.054626               [607]   \n",
+              "\n",
+              "     distance_to_nearest_neighbor  \n",
+              "459                      0.000566  \n",
+              "429                      0.000566  \n",
+              "501                      0.002781  \n",
+              "412                      0.002781  \n",
+              "698                      0.003314  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-d081a3cf-0a30-4c25-9122-25edb4f5cb8d\" class=\"colab-df-container\">\n",
+              "    <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>is_near_duplicate_issue</th>\n",
+              "      <th>near_duplicate_score</th>\n",
+              "      <th>near_duplicate_sets</th>\n",
+              "      <th>distance_to_nearest_neighbor</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>459</th>\n",
+              "      <td>True</td>\n",
+              "      <td>0.009544</td>\n",
+              "      <td>[429]</td>\n",
+              "      <td>0.000566</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>429</th>\n",
+              "      <td>True</td>\n",
+              "      <td>0.009544</td>\n",
+              "      <td>[459]</td>\n",
+              "      <td>0.000566</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>501</th>\n",
+              "      <td>True</td>\n",
+              "      <td>0.046044</td>\n",
+              "      <td>[412, 517]</td>\n",
+              "      <td>0.002781</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>412</th>\n",
+              "      <td>True</td>\n",
+              "      <td>0.046044</td>\n",
+              "      <td>[501]</td>\n",
+              "      <td>0.002781</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>698</th>\n",
+              "      <td>True</td>\n",
+              "      <td>0.054626</td>\n",
+              "      <td>[607]</td>\n",
+              "      <td>0.003314</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "    <div class=\"colab-df-buttons\">\n",
+              "\n",
+              "  <div class=\"colab-df-container\">\n",
+              "    <button class=\"colab-df-convert\" onclick=\"convertToInteractive('df-d081a3cf-0a30-4c25-9122-25edb4f5cb8d')\"\n",
+              "            title=\"Convert this dataframe to an interactive table.\"\n",
+              "            style=\"display:none;\">\n",
+              "\n",
+              "  <svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\" viewBox=\"0 -960 960 960\">\n",
+              "    <path d=\"M120-120v-720h720v720H120Zm60-500h600v-160H180v160Zm220 220h160v-160H400v160Zm0 220h160v-160H400v160ZM180-400h160v-160H180v160Zm440 0h160v-160H620v160ZM180-180h160v-160H180v160Zm440 0h160v-160H620v160Z\"/>\n",
+              "  </svg>\n",
+              "    </button>\n",
+              "\n",
+              "  <style>\n",
+              "    .colab-df-container {\n",
+              "      display:flex;\n",
+              "      gap: 12px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert {\n",
+              "      background-color: #E8F0FE;\n",
+              "      border: none;\n",
+              "      border-radius: 50%;\n",
+              "      cursor: pointer;\n",
+              "      display: none;\n",
+              "      fill: #1967D2;\n",
+              "      height: 32px;\n",
+              "      padding: 0 0 0 0;\n",
+              "      width: 32px;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-convert:hover {\n",
+              "      background-color: #E2EBFA;\n",
+              "      box-shadow: 0px 1px 2px rgba(60, 64, 67, 0.3), 0px 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "      fill: #174EA6;\n",
+              "    }\n",
+              "\n",
+              "    .colab-df-buttons div {\n",
+              "      margin-bottom: 4px;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert {\n",
+              "      background-color: #3B4455;\n",
+              "      fill: #D2E3FC;\n",
+              "    }\n",
+              "\n",
+              "    [theme=dark] .colab-df-convert:hover {\n",
+              "      background-color: #434B5C;\n",
+              "      box-shadow: 0px 1px 3px 1px rgba(0, 0, 0, 0.15);\n",
+              "      filter: drop-shadow(0px 1px 2px rgba(0, 0, 0, 0.3));\n",
+              "      fill: #FFFFFF;\n",
+              "    }\n",
+              "  </style>\n",
+              "\n",
+              "    <script>\n",
+              "      const buttonEl =\n",
+              "        document.querySelector('#df-d081a3cf-0a30-4c25-9122-25edb4f5cb8d button.colab-df-convert');\n",
+              "      buttonEl.style.display =\n",
+              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "\n",
+              "      async function convertToInteractive(key) {\n",
+              "        const element = document.querySelector('#df-d081a3cf-0a30-4c25-9122-25edb4f5cb8d');\n",
+              "        const dataTable =\n",
+              "          await google.colab.kernel.invokeFunction('convertToInteractive',\n",
+              "                                                    [key], {});\n",
+              "        if (!dataTable) return;\n",
+              "\n",
+              "        const docLinkHtml = 'Like what you see? Visit the ' +\n",
+              "          '<a target=\"_blank\" href=https://colab.research.google.com/notebooks/data_table.ipynb>data table notebook</a>'\n",
+              "          + ' to learn more about interactive tables.';\n",
+              "        element.innerHTML = '';\n",
+              "        dataTable['output_type'] = 'display_data';\n",
+              "        await google.colab.output.renderOutput(dataTable, element);\n",
+              "        const docLink = document.createElement('div');\n",
+              "        docLink.innerHTML = docLinkHtml;\n",
+              "        element.appendChild(docLink);\n",
+              "      }\n",
+              "    </script>\n",
+              "  </div>\n",
+              "    </div>\n",
+              "  </div>\n"
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "dataframe",
+              "summary": "{\n  \"name\": \"duplicate_issues\",\n  \"rows\": 5,\n  \"fields\": [\n    {\n      \"column\": \"is_near_duplicate_issue\",\n      \"properties\": {\n        \"dtype\": \"boolean\",\n        \"num_unique_values\": 1,\n        \"samples\": [\n          true\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"near_duplicate_score\",\n      \"properties\": {\n        \"dtype\": \"float32\",\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.00954437255859375\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"near_duplicate_sets\",\n      \"properties\": {\n        \"dtype\": \"object\",\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"distance_to_nearest_neighbor\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0.0013286758192588926,\n        \"min\": 0.0005658268928527832,\n        \"max\": 0.0033143162727355957,\n        \"num_unique_values\": 3,\n        \"samples\": [\n          0.0005658268928527832\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
+            }
+          },
+          "metadata": {},
+          "execution_count": 35
+        }
+      ],
+      "source": [
+        "duplicate_issues = lab.get_issues(\"near_duplicate\")\n",
+        "duplicate_issues.sort_values(\"near_duplicate_score\").head()"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "EawP0y1Lcw4S"
+      },
+      "source": [
+        "The results above show which examples cleanlab flags as near duplicates (rows where `is_near_duplicate_issue == True`). Here, examples 459 and 429 are near duplicates of each other, as are examples 501 and 412.\n",
+        "\n",
+        "Let's view these examples to see how similar they are."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 38,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:20.896501Z",
+          "iopub.status.busy": "2024-02-16T06:26:20.896175Z",
+          "iopub.status.idle": "2024-02-16T06:26:20.901983Z",
+          "shell.execute_reply": "2024-02-16T06:26:20.901420Z"
+        },
+        "id": "0TEW5igFcw4S",
+        "outputId": "86343985-26bb-44ce-f27b-610357f43030",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 182
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "                                                                            text  \\\n",
+              "459    I purchased something abroad and the incorrect exchange rate was applied.   \n",
+              "429  I purchased something overseas and the incorrect exchange rate was applied.   \n",
+              "\n",
+              "     label  \n",
+              "459     17  \n",
+              "429     17  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-262b7208-d7d7-4fba-9d6c-7bda298822bc\" class=\"colab-df-container\">\n",
+              "    <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>text</th>\n",
+              "      <th>label</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>459</th>\n",
+              "      <td>I purchased something abroad and the incorrect exchange rate was applied.</td>\n",
+              "      <td>17</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>429</th>\n",
+              "      <td>I purchased something overseas and the incorrect exchange rate was applied.</td>\n",
+              "      <td>17</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "  </div>\n"
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "dataframe",
+              "summary": "{\n  \"name\": \"data\",\n  \"rows\": 2,\n  \"fields\": [\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"I purchased something overseas and the incorrect exchange rate was applied.\",\n          \"I purchased something abroad and the incorrect exchange rate was applied.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0,\n        \"min\": 17,\n        \"max\": 17,\n        \"num_unique_values\": 1,\n        \"samples\": [\n          17\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
+            }
+          },
+          "metadata": {},
+          "execution_count": 38
+        }
+      ],
+      "source": [
+        "data.iloc[[459, 429]]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Sample output:\n",
+        "\n",
+        "|index|text|label|\n",
+        "|---|---|---|\n",
+        "|459|I purchased something abroad and the incorrect exchange rate was applied.|17|\n",
+        "|429|I purchased something overseas and the incorrect exchange rate was applied.|17|"
+      ],
+      "metadata": {
+        "id": "DoAyD-FZpsSm"
+      }
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 39,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:20.904159Z",
+          "iopub.status.busy": "2024-02-16T06:26:20.903821Z",
+          "iopub.status.idle": "2024-02-16T06:26:20.909681Z",
+          "shell.execute_reply": "2024-02-16T06:26:20.909160Z"
+        },
+        "id": "VnbIBYaHcw4S",
+        "outputId": "8b00bb96-0d9d-43f6-b85f-c41e437d41b5",
+        "colab": {
+          "base_uri": "https://localhost:8080/",
+          "height": 198
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "                                                                                                  text  \\\n",
+              "501  The exchange rate you are using is really bad.This can't be the official interbank exchange rate.   \n",
+              "412         The exchange rate you are using is bad.This can't be the official interbank exchange rate.   \n",
+              "\n",
+              "     label  \n",
+              "501     17  \n",
+              "412     17  "
+            ],
+            "text/html": [
+              "\n",
+              "  <div id=\"df-85c08b5b-da74-4092-97e6-f8908702eeaf\" class=\"colab-df-container\">\n",
+              "    <div>\n",
+              "<style scoped>\n",
+              "    .dataframe tbody tr th:only-of-type {\n",
+              "        vertical-align: middle;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe tbody tr th {\n",
+              "        vertical-align: top;\n",
+              "    }\n",
+              "\n",
+              "    .dataframe thead th {\n",
+              "        text-align: right;\n",
+              "    }\n",
+              "</style>\n",
+              "<table border=\"1\" class=\"dataframe\">\n",
+              "  <thead>\n",
+              "    <tr style=\"text-align: right;\">\n",
+              "      <th></th>\n",
+              "      <th>text</th>\n",
+              "      <th>label</th>\n",
+              "    </tr>\n",
+              "  </thead>\n",
+              "  <tbody>\n",
+              "    <tr>\n",
+              "      <th>501</th>\n",
+              "      <td>The exchange rate you are using is really bad.This can't be the official interbank exchange rate.</td>\n",
+              "      <td>17</td>\n",
+              "    </tr>\n",
+              "    <tr>\n",
+              "      <th>412</th>\n",
+              "      <td>The exchange rate you are using is bad.This can't be the official interbank exchange rate.</td>\n",
+              "      <td>17</td>\n",
+              "    </tr>\n",
+              "  </tbody>\n",
+              "</table>\n",
+              "</div>\n",
+              "    <div class=\"colab-df-buttons\">\n",
+              "\n",
+              "\n",
+              "<div id=\"df-74bfcb88-0e60-4aec-a441-90ce1d28370b\">\n",
+              "  <button class=\"colab-df-quickchart\" onclick=\"quickchart('df-74bfcb88-0e60-4aec-a441-90ce1d28370b')\"\n",
+              "            title=\"Suggest charts\"\n",
+              "            style=\"display:none;\">\n",
+              "\n",
+              "<svg xmlns=\"http://www.w3.org/2000/svg\" height=\"24px\"viewBox=\"0 0 24 24\"\n",
+              "     width=\"24px\">\n",
+              "    <g>\n",
+              "        <path d=\"M19 3H5c-1.1 0-2 .9-2 2v14c0 1.1.9 2 2 2h14c1.1 0 2-.9 2-2V5c0-1.1-.9-2-2-2zM9 17H7v-7h2v7zm4 0h-2V7h2v10zm4 0h-2v-4h2v4z\"/>\n",
+              "    </g>\n",
+              "</svg>\n",
+              "  </button>\n",
+              "\n",
+              "<style>\n",
+              "  .colab-df-quickchart {\n",
+              "      --bg-color: #E8F0FE;\n",
+              "      --fill-color: #1967D2;\n",
+              "      --hover-bg-color: #E2EBFA;\n",
+              "      --hover-fill-color: #174EA6;\n",
+              "      --disabled-fill-color: #AAA;\n",
+              "      --disabled-bg-color: #DDD;\n",
+              "  }\n",
+              "\n",
+              "  [theme=dark] .colab-df-quickchart {\n",
+              "      --bg-color: #3B4455;\n",
+              "      --fill-color: #D2E3FC;\n",
+              "      --hover-bg-color: #434B5C;\n",
+              "      --hover-fill-color: #FFFFFF;\n",
+              "      --disabled-bg-color: #3B4455;\n",
+              "      --disabled-fill-color: #666;\n",
+              "  }\n",
+              "\n",
+              "  .colab-df-quickchart {\n",
+              "    background-color: var(--bg-color);\n",
+              "    border: none;\n",
+              "    border-radius: 50%;\n",
+              "    cursor: pointer;\n",
+              "    display: none;\n",
+              "    fill: var(--fill-color);\n",
+              "    height: 32px;\n",
+              "    padding: 0;\n",
+              "    width: 32px;\n",
+              "  }\n",
+              "\n",
+              "  .colab-df-quickchart:hover {\n",
+              "    background-color: var(--hover-bg-color);\n",
+              "    box-shadow: 0 1px 2px rgba(60, 64, 67, 0.3), 0 1px 3px 1px rgba(60, 64, 67, 0.15);\n",
+              "    fill: var(--button-hover-fill-color);\n",
+              "  }\n",
+              "\n",
+              "  .colab-df-quickchart-complete:disabled,\n",
+              "  .colab-df-quickchart-complete:disabled:hover {\n",
+              "    background-color: var(--disabled-bg-color);\n",
+              "    fill: var(--disabled-fill-color);\n",
+              "    box-shadow: none;\n",
+              "  }\n",
+              "\n",
+              "  .colab-df-spinner {\n",
+              "    border: 2px solid var(--fill-color);\n",
+              "    border-color: transparent;\n",
+              "    border-bottom-color: var(--fill-color);\n",
+              "    animation:\n",
+              "      spin 1s steps(1) infinite;\n",
+              "  }\n",
+              "\n",
+              "  @keyframes spin {\n",
+              "    0% {\n",
+              "      border-color: transparent;\n",
+              "      border-bottom-color: var(--fill-color);\n",
+              "      border-left-color: var(--fill-color);\n",
+              "    }\n",
+              "    20% {\n",
+              "      border-color: transparent;\n",
+              "      border-left-color: var(--fill-color);\n",
+              "      border-top-color: var(--fill-color);\n",
+              "    }\n",
+              "    30% {\n",
+              "      border-color: transparent;\n",
+              "      border-left-color: var(--fill-color);\n",
+              "      border-top-color: var(--fill-color);\n",
+              "      border-right-color: var(--fill-color);\n",
+              "    }\n",
+              "    40% {\n",
+              "      border-color: transparent;\n",
+              "      border-right-color: var(--fill-color);\n",
+              "      border-top-color: var(--fill-color);\n",
+              "    }\n",
+              "    60% {\n",
+              "      border-color: transparent;\n",
+              "      border-right-color: var(--fill-color);\n",
+              "    }\n",
+              "    80% {\n",
+              "      border-color: transparent;\n",
+              "      border-right-color: var(--fill-color);\n",
+              "      border-bottom-color: var(--fill-color);\n",
+              "    }\n",
+              "    90% {\n",
+              "      border-color: transparent;\n",
+              "      border-bottom-color: var(--fill-color);\n",
+              "    }\n",
+              "  }\n",
+              "</style>\n",
+              "\n",
+              "  <script>\n",
+              "    async function quickchart(key) {\n",
+              "      const quickchartButtonEl =\n",
+              "        document.querySelector('#' + key + ' button');\n",
+              "      quickchartButtonEl.disabled = true;  // To prevent multiple clicks.\n",
+              "      quickchartButtonEl.classList.add('colab-df-spinner');\n",
+              "      try {\n",
+              "        const charts = await google.colab.kernel.invokeFunction(\n",
+              "            'suggestCharts', [key], {});\n",
+              "      } catch (error) {\n",
+              "        console.error('Error during call to suggestCharts:', error);\n",
+              "      }\n",
+              "      quickchartButtonEl.classList.remove('colab-df-spinner');\n",
+              "      quickchartButtonEl.classList.add('colab-df-quickchart-complete');\n",
+              "    }\n",
+              "    (() => {\n",
+              "      let quickchartButtonEl =\n",
+              "        document.querySelector('#df-74bfcb88-0e60-4aec-a441-90ce1d28370b button');\n",
+              "      quickchartButtonEl.style.display =\n",
+              "        google.colab.kernel.accessAllowed ? 'block' : 'none';\n",
+              "    })();\n",
+              "  </script>\n",
+              "</div>\n",
+              "    </div>\n",
+              "  </div>\n"
+            ],
+            "application/vnd.google.colaboratory.intrinsic+json": {
+              "type": "dataframe",
+              "summary": "{\n  \"name\": \"data\",\n  \"rows\": 2,\n  \"fields\": [\n    {\n      \"column\": \"text\",\n      \"properties\": {\n        \"dtype\": \"string\",\n        \"num_unique_values\": 2,\n        \"samples\": [\n          \"The exchange rate you are using is bad.This can't be the official interbank exchange rate.\",\n          \"The exchange rate you are using is really bad.This can't be the official interbank exchange rate.\"\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    },\n    {\n      \"column\": \"label\",\n      \"properties\": {\n        \"dtype\": \"number\",\n        \"std\": 0,\n        \"min\": 17,\n        \"max\": 17,\n        \"num_unique_values\": 1,\n        \"samples\": [\n          17\n        ],\n        \"semantic_type\": \"\",\n        \"description\": \"\"\n      }\n    }\n  ]\n}"
+            }
+          },
+          "metadata": {},
+          "execution_count": 39
+        }
+      ],
+      "source": [
+        "data.iloc[[501, 412]]"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "source": [
+        "Sample output:\n",
+        "\n",
+        "|index|text|label|\n",
+        "|---|---|---|\n",
+        "|501|The exchange rate you are using is really bad\\.This can't be the official interbank exchange rate\\.|17|\n",
+        "|412|The exchange rate you are using is bad\\.This can't be the official interbank exchange rate\\.|17|"
+      ],
+      "metadata": {
+        "id": "Y4QD35-dqeGg"
+      }
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "UG8xfTa5cw4S"
+      },
+      "source": [
+        "We see that these two sets of requests are indeed very similar to one another! Including near duplicates in a dataset may have unintended effects on models, so be wary of splitting them across training/test sets. Learn more about handling near duplicates in a dataset from [the FAQ](https://docs.cleanlab.ai/stable/tutorials/faq.html#How-to-handle-near-duplicate-data-identified-by-cleanlab?)."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "iefctl3rcw4S"
+      },
+      "source": [
+        "### Non-IID issues (data drift)\n",
+        "According to the report, our dataset does not appear to be Independent and Identically Distributed (IID). The overall non-IID score for the dataset (displayed below) corresponds to the `p-value` of a statistical test for whether the ordering of samples in the dataset appears related to the similarity between their feature values. A low `p-value` strongly suggests that the dataset violates the IID assumption, which is a key assumption required for conclusions (models) produced from the dataset to generalize to a larger population."
+      ]
+    },
+    {
+      "cell_type": "code",
+      "execution_count": 40,
+      "metadata": {
+        "execution": {
+          "iopub.execute_input": "2024-02-16T06:26:20.911817Z",
+          "iopub.status.busy": "2024-02-16T06:26:20.911434Z",
+          "iopub.status.idle": "2024-02-16T06:26:20.915049Z",
+          "shell.execute_reply": "2024-02-16T06:26:20.914501Z"
+        },
+        "id": "oEMWOQQPcw4S",
+        "outputId": "18eca4cd-2451-4850-960c-0bf1e35d9729",
+        "colab": {
+          "base_uri": "https://localhost:8080/"
+        }
+      },
+      "outputs": [
+        {
+          "output_type": "execute_result",
+          "data": {
+            "text/plain": [
+              "0.0"
+            ]
+          },
+          "metadata": {},
+          "execution_count": 40
+        }
+      ],
+      "source": [
+        "p_value = lab.get_info('non_iid')['p-value']\n",
+        "p_value"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "c6swPCnncw4S"
+      },
+      "source": [
+        "Here, our dataset was flagged as non-IID because the rows happened to be sorted by class label in the original data. This may be benign if we remember to shuffle rows before model training and data splitting. But if you don't know why your data was flagged as non-IID, then you should be worried about potential data drift or unexpected interactions between data points (their values may not be statistically independent). Think carefully about what future test data may look like (and whether your data is representative of the population you care about). Do not shuffle your data before the non-IID test runs, because doing so invalidates its conclusions."
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "uCoKXqBrcw4S"
+      },
+      "source": [
+        "As demonstrated above, cleanlab can automatically shortlist the most likely issues in your dataset to help you better curate your dataset for subsequent modeling. With this shortlist, you can decide whether to fix these label issues or remove nonsensical or duplicated examples from your dataset to obtain a higher-quality dataset for training your next ML model. cleanlab's issue detection can be run with outputs from *any* type of model you initially trained.\n"
+      ]
+    },
+    {
+      "cell_type": "markdown",
+      "metadata": {
+        "id": "qnncoRWUcw4S"
+      },
+      "source": [
+        "### Easy Mode\n",
+        "\n",
+        "Cleanlab is most effective when you run this code with a good ML model. Try to produce the best ML model you can for your data (instead of the basic model from this tutorial). If you don't know the best ML model for your data, try [Cleanlab Studio](https://cleanlab.ai/blog/data-centric-ai/), which will automatically produce one for you. Cleanlab Studio is an easy-to-use, no-code platform for data-centric AI that automatically detects data issues (more types of issues than this cleanlab package), helps you quickly correct these data issues, confidently labels large subsets of an unlabeled dataset, and provides other smart metadata about each of your data points, all powered by a system that automatically trains/deploys the best ML model for your data. [Try it for free!](https://cleanlab.ai/signup/)"
+      ]
+    }
+  ],
+  "metadata": {
+    "colab": {
+      "provenance": []
+    },
+    "kernelspec": {
+      "display_name": "Python 3 (ipykernel)",
+      "language": "python",
+      "name": "python3"
+    },
+    "language_info": {
+      "codemirror_mode": {
+        "name": "ipython",
+        "version": 3
+      },
+      "file_extension": ".py",
+      "mimetype": "text/x-python",
+      "name": "python",
+      "nbconvert_exporter": "python",
+      "pygments_lexer": "ipython3",
+      "version": "3.11.8"
+    }
   },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "In this 5-minute quickstart tutorial, we use Datalab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests which are classified into 10 categories based on their intent (you can run this same code on any text classification dataset). Cleanlab automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), or otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!\n",
-    "\n",
-    "**Overview of what we'll do in this tutorial:**\n",
-    "\n",
-    "- Use a pretrained transformer model to extract the text embeddings from the customer service requests\n",
-    "\n",
-    "- Train a simple Logistic Regression model on the text embeddings to compute out-of-sample predicted probabilities\n",
-    "\n",
-    "- Run cleanlab's `Datalab` audit with these predictions and embeddings in order to identify problems like: label issues, outliers, and near duplicates in the dataset."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "<div class=\"alert alert-info\">\n",
-    "Quickstart\n",
-    "<br/>\n",
-    "    \n",
-    "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.\n",
-    "\n",
-    "<div  class=markdown markdown=\"1\" style=\"background:white;margin:16px\">  \n",
-    "    \n",
-    "```ipython3 \n",
-    "from cleanlab import Datalab\n",
-    "\n",
-    "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n",
-    "lab.find_issues(pred_probs=your_pred_probs, features=your_features)\n",
-    "\n",
-    "lab.report()\n",
-    "lab.get_issues()\n",
-    "```\n",
-    "    \n",
-    "</div>\n",
-    "</div>"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 1. Install required dependencies\n"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "You can use `pip` to install all packages required for this tutorial as follows:\n",
-    "\n",
-    "```ipython3\n",
-    "!pip install sklearn sentence-transformers\n",
-    "!pip install \"cleanlab[datalab]\"\n",
-    "# Make sure to install the version corresponding to this tutorial\n",
-    "# E.g. if viewing master branch documentation:\n",
-    "#     !pip install git+https://github.com/cleanlab/cleanlab.git\n",
-    "```"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "nbsphinx": "hidden"
-   },
-   "outputs": [],
-   "source": [
-    "# Package installation (hidden on docs.cleanlab.ai).\n",
-    "# If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)\n",
-    "# Package versions we used:scikit-learn==1.2.0 sentence-transformers==2.2.2\n",
-    "\n",
-    "dependencies = [\"cleanlab\", \"sentence_transformers\", \"datasets\"]\n",
-    "\n",
-    "# Supress outputs that may appear if tensorflow happens to be improperly installed: \n",
-    "import os \n",
-    "\n",
-    "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\"  # disable parallelism to avoid deadlocks with huggingface\n",
-    "\n",
-    "if \"google.colab\" in str(get_ipython()):  # Check if it's running in Google Colab\n",
-    "    %pip install cleanlab  # for colab\n",
-    "    cmd = ' '.join([dep for dep in dependencies if dep != \"cleanlab\"])\n",
-    "    %pip install $cmd\n",
-    "else:\n",
-    "    dependencies_test = [dependency.split('>')[0] if '>' in dependency \n",
-    "                         else dependency.split('<')[0] if '<' in dependency \n",
-    "                         else dependency.split('=')[0] for dependency in dependencies]\n",
-    "    missing_dependencies = []\n",
-    "    for dependency in dependencies_test:\n",
-    "        try:\n",
-    "            __import__(dependency)\n",
-    "        except ImportError:\n",
-    "            missing_dependencies.append(dependency)\n",
-    "\n",
-    "    if len(missing_dependencies) > 0:\n",
-    "        print(\"Missing required dependencies:\")\n",
-    "        print(*missing_dependencies, sep=\", \")\n",
-    "        print(\"\\nPlease install them before running the rest of this notebook.\")"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "import re \n",
-    "import string \n",
-    "import pandas as pd \n",
-    "from sklearn.metrics import accuracy_score, log_loss \n",
-    "from sklearn.model_selection import cross_val_predict \n",
-    "from sklearn.linear_model import LogisticRegression\n",
-    "from sentence_transformers import SentenceTransformer\n",
-    "\n",
-    "from cleanlab import Datalab"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "nbsphinx": "hidden"
-   },
-   "outputs": [],
-   "source": [
-    "# This cell is hidden from docs.cleanlab.ai \n",
-    "\n",
-    "import random \n",
-    "import numpy as np \n",
-    "\n",
-    "pd.set_option(\"display.max_colwidth\", None) \n",
-    "\n",
-    "SEED = 123456  # for reproducibility\n",
-    "np.random.seed(SEED)\n",
-    "random.seed(SEED)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 2. Load and format the text dataset\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "data = pd.read_csv(\"https://s.cleanlab.ai/banking-intent-classification.csv\")\n",
-    "data.head()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "raw_texts, labels = data[\"text\"].values, data[\"label\"].values\n",
-    "num_classes = len(set(labels))\n",
-    "\n",
-    "print(f\"This dataset has {num_classes} classes.\")\n",
-    "print(f\"Classes: {set(labels)}\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Let's view the i-th example in the dataset:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "i = 1  # change this to view other examples from the dataset\n",
-    "print(f\"Example Label: {labels[i]}\")\n",
-    "print(f\"Example Text: {raw_texts[i]}\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The data is stored as two numpy arrays:\n",
-    "\n",
-    "1. `raw_texts` stores the customer service requests utterances in text format\n",
-    "2. `labels` stores the intent categories (labels) for each example"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "<div class=\"alert alert-info\">\n",
-    "Bringing Your Own Data (BYOD)?\n",
-    "\n",
-    "You can easily replace the above with your own text dataset, and continue with the rest of the tutorial.\n",
-    "\n",
-    "</div>"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Next we convert the text strings into vectors better suited as inputs for our ML models. \n",
-    "\n",
-    "We will use numeric representations from a pretrained Transformer model as embeddings of our text. The [Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) library offers simple methods to compute these embeddings for text data. Here, we load the pretrained `electra-small-discriminator` model, and then run our data through network to extract a vector embedding of each example."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "transformer = SentenceTransformer('google/electra-small-discriminator')\n",
-    "text_embeddings = transformer.encode(raw_texts)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Our subsequent ML model will directly operate on elements of `text_embeddings` in order to classify the customer service requests."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 3. Define a classification model and compute out-of-sample predicted probabilities"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and only train the output layer without having to rely on GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted embeddings.\n",
-    "\n",
-    "To identify label issues, cleanlab requires a probabilistic prediction from your model for each datapoint. However these predictions will be _overfit_ (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to only be used with **out-of-sample** predicted class probabilities, i.e. on datapoints held-out from the model during the training.\n",
-    "\n",
-    "Here we obtain out-of-sample predicted class probabilities for every example in our dataset using a Logistic Regression model with cross-validation.\n",
-    "Make sure that the columns of your `pred_probs` are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "model = LogisticRegression(max_iter=400)\n",
-    "\n",
-    "pred_probs = cross_val_predict(model, text_embeddings, labels, method=\"predict_proba\")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "## 4. Use cleanlab to find issues in your dataset"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Given feature embeddings and the (out-of-sample) predicted class probabilities obtained from any model you have, cleanlab can quickly help you identify low-quality examples in your dataset.\n",
-    "\n",
-    "Here, we use cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we’ll simply wrap the training features and noisy labels in a dictionary. "
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "data_dict = {\"texts\": raw_texts, \"labels\": labels}"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "All that is need to audit your data is to call `find_issues()`. We pass in the predicted probabilities and the feature embeddings obtained above, but you do not necessarily need to provide all of this information depending on which types of issues you are interested in. The more inputs you provide, the more types of issues `Datalab` can detect in your data. Using a better model to produce these inputs will ensure cleanlab more accurately estimates issues."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "lab = Datalab(data_dict, label_name=\"labels\")\n",
-    "lab.find_issues(pred_probs=pred_probs, features=text_embeddings)"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "After the audit is complete, review the findings using the `report` method:"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "lab.report()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Label issues\n",
-    "\n",
-    "The report indicates that cleanlab identified many label issues in our dataset. We can see which examples are flagged as likely mislabeled and the label quality score for each example using the `get_issues` method, specifying `label` as an argument to focus on label issues in the data."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {
-    "scrolled": true
-   },
-   "outputs": [],
-   "source": [
-    "label_issues = lab.get_issues(\"label\")\n",
-    "label_issues.head() "
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "This method returns a dataframe containing a label quality score for each example. These numeric scores lie between 0 and 1, where lower scores indicate examples more likely to be mislabeled. The dataframe also contains a boolean column specifying whether or not each example is identified to have a label issue (indicating it is likely mislabeled)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We can get the subset of examples flagged with label issues, and also sort by label quality score to find the indices of the 5 most likely mislabeled examples in our dataset."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "identified_label_issues = label_issues[label_issues[\"is_label_issue\"] == True]\n",
-    "lowest_quality_labels = label_issues[\"label_score\"].argsort()[:5].to_numpy()\n",
-    "\n",
-    "print(\n",
-    "    f\"cleanlab found {len(identified_label_issues)} potential label errors in the dataset.\\n\"\n",
-    "    f\"Here are indices of the top 5 most likely errors: \\n {lowest_quality_labels}\"\n",
-    ")"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Let's review some of the most likely label errors. \n",
-    "\n",
-    "Here we display the top 5 examples identified as the most likely label errors in the dataset, together with their given (original) label and a suggested alternative label from cleanlab.\n"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "data_with_suggested_labels = pd.DataFrame(\n",
-    "    {\"text\": raw_texts, \"given_label\": labels, \"suggested_label\": label_issues[\"predicted_label\"]}\n",
-    ")\n",
-    "data_with_suggested_labels.iloc[lowest_quality_labels]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {
-    "scrolled": true
-   },
-   "source": [
-    "These are very clear label errors that cleanlab has identified in this data! Note that the `given_label` does not correctly reflect the intent of these requests, whoever produced this dataset made many mistakes that are important to address before modeling the data."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Outlier issues\n",
-    "\n",
-    "According to the report, our dataset contains some outliers.\n",
-    "We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via `get_issues`. We sort the resulting DataFrame by cleanlab's outlier quality score to see the most severe outliers in our dataset."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "outlier_issues = lab.get_issues(\"outlier\")\n",
-    "outlier_issues.sort_values(\"outlier_score\").head()"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "lowest_quality_outliers = outlier_issues[\"outlier_score\"].argsort()[:5]\n",
-    "\n",
-    "data.iloc[lowest_quality_outliers]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We see that cleanlab has identified entries in this dataset that do not appear to be proper customer requests. Outliers in this dataset appear to be out-of-scope customer requests and other nonsensical text which does not make sense for intent classification. Carefully consider whether such outliers may detrimentally affect your data modeling, and consider removing them from the dataset if so."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Near-duplicate issues\n",
-    "\n",
-    "According to the report, our dataset contains some sets of nearly duplicated examples.\n",
-    "We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues`. We sort the resulting DataFrame by cleanlab's near-duplicate quality score to see the text examples in our dataset that are most nearly duplicated."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "duplicate_issues = lab.get_issues(\"near_duplicate\")\n",
-    "duplicate_issues.sort_values(\"near_duplicate_score\").head()"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "The results above show which examples cleanlab considers nearly duplicated (rows where `is_near_duplicate_issue == True`). Here, we see that example 160 and 148 are nearly duplicated, as are example 546 and 514.\n",
-    "\n",
-    "Let's view these examples to see how similar they are."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "data.iloc[[160, 148]]"
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "data.iloc[[546, 514]]"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "We see that these two sets of request are indeed very similar to one another! Including near duplicates in a dataset may have unintended effects on models, and be wary about splitting them across training/test sets. Learn more about handling near duplicates in a dataset from [the FAQ](https://docs.cleanlab.ai/stable/tutorials/faq.html#How-to-handle-near-duplicate-data-identified-by-cleanlab?)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### Non-IID issues (data drift)\n",
-    "According to the report, our dataset does not appear to be Independent and Identically Distributed (IID).  The overall non-iid score for the dataset (displayed below) corresponds to the `p-value` of a statistical test for whether the ordering of samples in the dataset appears related to the similarity between their feature values.  A low `p-value` strongly suggests that the dataset violates the IID assumption, which is a key assumption required for conclusions (models) produced from the dataset to generalize to a larger population."
-   ]
-  },
-  {
-   "cell_type": "code",
-   "execution_count": null,
-   "metadata": {},
-   "outputs": [],
-   "source": [
-    "p_value = lab.get_info('non_iid')['p-value']\n",
-    "p_value"
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "Here, our dataset was flagged as non-IID because the rows happened to be sorted by class label in the original data. This may be benign if we remember to shuffle rows before model training and data splitting. But if you don't know why your data was flagged as non-IID, then you should be worried about potential data drift or unexpected interactions between data points (their values may not be statistically independent). Think carefully about what future test data may look like (and whether your data is representative of the population you care about). You should not shuffle your data before the non-IID test runs (will invalidate its conclusions)."
-   ]
-  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "As demonstrated above, cleanlab can automatically shortlist the most likely issues in your dataset to help you better curate your dataset for subsequent modeling. With this shortlist, you can decide whether to fix these label issues or remove nonsensical or duplicated examples from your dataset to obtain a higher-quality dataset for training your next ML model. cleanlab's issue detection can be run with outputs from *any* type of model you initially trained.\n"
-   ]
-  }
- ],
- "metadata": {
-  "colab": {
-   "collapsed_sections": [],
-   "name": "Text x TensorFlow",
-   "provenance": []
-  },
-  "kernelspec": {
-   "display_name": "Python 3 (ipykernel)",
-   "language": "python",
-   "name": "python3"
-  },
-  "language_info": {
-   "codemirror_mode": {
-    "name": "ipython",
-    "version": 3
-   },
-   "file_extension": ".py",
-   "mimetype": "text/x-python",
-   "name": "python",
-   "nbconvert_exporter": "python",
-   "pygments_lexer": "ipython3",
-   "version": "3.11.7"
-  }
- },
- "nbformat": 4,
- "nbformat_minor": 4
-}
+  "nbformat": 4,
+  "nbformat_minor": 0
+}
\ No newline at end of file

From d112dcd63a6fbaf4e0f99f602fda8ae31a590a32 Mon Sep 17 00:00:00 2001
From: aravindputrevu <aravind.putrevu@gmail.com>
Date: Sat, 9 Mar 2024 02:38:31 +0530
Subject: [PATCH 3/6] Addressed some more review comments in the notebook

---
 notebooks/en/issues_in_text_dataset.ipynb | 389 ++++------------------
 1 file changed, 57 insertions(+), 332 deletions(-)
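
Note for reviewers: most of this patch clears stored cell outputs, resets
`execution_count` to null, and drops the Colab `colab`/`outputId` cell
metadata. A minimal sketch of how that cleanup could be reproduced with only
the standard library is shown below. The function name `strip_outputs` and the
sample notebook dict are hypothetical, and this is an assumption about the
workflow, not necessarily how the change was actually produced.

```python
import copy
import json


def strip_outputs(notebook: dict) -> dict:
    """Return a copy of a notebook dict with code-cell outputs cleared."""
    cleaned = copy.deepcopy(notebook)
    for cell in cleaned.get("cells", []):
        if cell.get("cell_type") != "code":
            continue  # markdown cells carry no outputs
        cell["outputs"] = []
        cell["execution_count"] = None
        # Colab records output bookkeeping in the cell metadata.
        metadata = cell.get("metadata", {})
        metadata.pop("colab", None)
        metadata.pop("outputId", None)
    return cleaned


# Hypothetical two-cell notebook used only to illustrate the function.
example = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Title\n"]},
        {
            "cell_type": "code",
            "execution_count": 41,
            "metadata": {"id": "fRsBIj3L_RUb", "outputId": "abc", "colab": {}},
            "outputs": [{"output_type": "stream", "name": "stdout", "text": ["hi\n"]}],
            "source": ["print('hi')"],
        },
    ],
    "nbformat": 4,
    "nbformat_minor": 0,
}

stripped = strip_outputs(example)
print(json.dumps(stripped["cells"][1], indent=2))
```

The input dict is left untouched, so the function can be applied to a notebook
loaded with `json.load` and the result written back with `json.dump`.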

diff --git a/notebooks/en/issues_in_text_dataset.ipynb b/notebooks/en/issues_in_text_dataset.ipynb
index 2c1cda57..ef02982d 100644
--- a/notebooks/en/issues_in_text_dataset.ipynb
+++ b/notebooks/en/issues_in_text_dataset.ipynb
@@ -6,7 +6,7 @@
         "id": "pw6cvzTocw4G"
       },
       "source": [
-        "# Detecting Issues in a Text Dataset with Datalab\n"
+        "# Detecting Issues in a Text Dataset with Cleanlab\n"
       ]
     },
     {
@@ -15,10 +15,10 @@
         "id": "0yPBE0Xccw4J"
       },
       "source": [
-        "Authored by: [@aravindputrevu](https://huggingface.co/aravindputrevu)\n",
+        "Authored by: [Aravind Putrevu](https://huggingface.co/aravindputrevu)\n",
         "\n",
         "\n",
-        "In this 5-minute quickstart tutorial, we use Datalab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests which are classified into 10 categories based on their intent (you can run this same code on any text classification dataset). [Cleanlab](https://github.com/cleanlab/cleanlab) automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), or otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!\n",
+        "In this 5-minute quickstart tutorial, we use Cleanlab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests which are classified into 10 categories based on their intent (you can run this same code on any text classification dataset). [Cleanlab](https://github.com/cleanlab/cleanlab) automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), or otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!\n",
         "\n",
         "**Overview of what we'll do in this tutorial:**\n",
         "\n",
@@ -26,7 +26,7 @@
         "\n",
         "- Train a simple Logistic Regression model on the text embeddings to compute out-of-sample predicted probabilities\n",
         "\n",
-        "- Run cleanlab's `Datalab` audit with these predictions and embeddings in order to identify problems like: label issues, outliers, and near duplicates in the dataset."
+        "- Run Cleanlab's `Datalab` audit with these predictions and embeddings in order to identify problems like: label issues, outliers, and near duplicates in the dataset."
       ]
     },
     {
@@ -35,29 +35,31 @@
         "id": "o__pRLFYcw4K"
       },
       "source": [
-        "<div class=\"alert alert-info\">\n",
-        "Quickstart\n",
-        "<br/>\n",
-        "    \n",
-        "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.\n",
         "\n",
-        "**Note:** If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)\n",
+        "## Quickstart\n",
         "\n",
-        "<div  class=markdown markdown=\"1\" style=\"background:white;margin:16px\">  \n",
         "    \n",
-        "```ipython3\n",
+        "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.\n",
+        "\n",
+        "**Note:** If running on Colab, you may want to use a GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU).\n"
+      ]
+    },
+    {
+      "cell_type": "code",
+      "source": [
         "from cleanlab import Datalab\n",
         "\n",
         "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n",
         "lab.find_issues(pred_probs=your_pred_probs, features=your_features)\n",
         "\n",
         "lab.report()\n",
-        "lab.get_issues()\n",
-        "```\n",
-        "    \n",
-        "</div>\n",
-        "</div>"
-      ]
+        "lab.get_issues()\n"
+      ],
+      "metadata": {
+        "id": "qaZA0cFs1fW4"
+      },
+      "execution_count": null,
+      "outputs": []
     },
     {
       "cell_type": "markdown",
@@ -65,7 +67,7 @@
         "id": "dp4lpApmcw4K"
       },
       "source": [
-        "## 1. Install required dependencies\n"
+        "## Install required dependencies\n"
       ]
     },
     {
@@ -84,138 +86,14 @@
         "!pip install -U \"cleanlab[datalab]\""
       ],
       "metadata": {
-        "id": "fRsBIj3L_RUb",
-        "colab": {
-          "base_uri": "https://localhost:8080/",
-          "height": 1000
-        },
-        "outputId": "2b22c97c-2373-4740-d394-7486277aa694"
+        "id": "fRsBIj3L_RUb"
       },
-      "execution_count": 41,
-      "outputs": [
-        {
-          "output_type": "stream",
-          "name": "stdout",
-          "text": [
-            "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.2.2)\n",
-            "Collecting scikit-learn\n",
-            "  Downloading scikit_learn-1.4.1.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)\n",
-            "\u001b[2K     \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.1/12.1 MB\u001b[0m \u001b[31m38.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n",
-            "\u001b[?25hRequirement already satisfied: sentence-transformers in /usr/local/lib/python3.10/dist-packages (2.4.0)\n",
-            "Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.17.1)\n",
-            "Requirement already satisfied: numpy<2.0,>=1.19.5 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.25.2)\n",
-            "Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.11.4)\n",
-            "Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.3.2)\n",
-            "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.3.0)\n",
-            "Requirement already satisfied: transformers<5.0.0,>=4.32.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (4.37.2)\n",
-            "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (4.66.2)\n",
-            "Requirement already satisfied: torch>=1.11.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (2.1.0+cu121)\n",
-            "Requirement already satisfied: huggingface-hub>=0.15.1 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (0.20.3)\n",
-            "Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (9.4.0)\n",
-            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.13.1)\n",
-            "Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (14.0.2)\n",
-            "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets) (0.6)\n",
-            "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.8)\n",
-            "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (1.5.3)\n",
-            "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.31.0)\n",
-            "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.4.1)\n",
-            "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.16)\n",
-            "Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2023.6.0)\n",
-            "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.9.3)\n",
-            "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (23.2)\n",
-            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.1)\n",
-            "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n",
-            "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (23.2.0)\n",
-            "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.4.1)\n",
-            "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.5)\n",
-            "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.4)\n",
-            "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n",
-            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence-transformers) (4.9.0)\n",
-            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.3.2)\n",
-            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.6)\n",
-            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2.0.7)\n",
-            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2024.2.2)\n",
-            "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (1.12)\n",
-            "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (3.2.1)\n",
-            "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (3.1.3)\n",
-            "Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (2.1.0)\n",
-            "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (2023.12.25)\n",
-            "Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (0.15.2)\n",
-            "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (0.4.2)\n",
-            "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n",
-            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2023.4)\n",
-            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->datasets) (1.16.0)\n",
-            "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.11.0->sentence-transformers) (2.1.5)\n",
-            "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.11.0->sentence-transformers) (1.3.0)\n",
-            "Installing collected packages: scikit-learn\n",
-            "  Attempting uninstall: scikit-learn\n",
-            "    Found existing installation: scikit-learn 1.2.2\n",
-            "    Uninstalling scikit-learn-1.2.2:\n",
-            "      Successfully uninstalled scikit-learn-1.2.2\n",
-            "Successfully installed scikit-learn-1.4.1.post1\n"
-          ]
-        },
-        {
-          "output_type": "display_data",
-          "data": {
-            "application/vnd.colab-display-data+json": {
-              "pip_warning": {
-                "packages": [
-                  "sklearn"
-                ]
-              },
-              "id": "207dfdbd8b714496a56fb33ee0f11a84"
-            }
-          },
-          "metadata": {}
-        },
-        {
-          "output_type": "stream",
-          "name": "stdout",
-          "text": [
-            "Requirement already satisfied: cleanlab[datalab] in /usr/local/lib/python3.10/dist-packages (2.6.0)\n",
-            "Requirement already satisfied: numpy>=1.22.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (1.25.2)\n",
-            "Requirement already satisfied: scikit-learn>=1.1 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (1.4.1.post1)\n",
-            "Requirement already satisfied: tqdm>=4.53.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (4.66.2)\n",
-            "Requirement already satisfied: pandas>=1.4.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (1.5.3)\n",
-            "Requirement already satisfied: termcolor>=2.4.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (2.4.0)\n",
-            "Requirement already satisfied: datasets>=2.7.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (2.17.1)\n",
-            "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.13.1)\n",
-            "Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (14.0.2)\n",
-            "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.6)\n",
-            "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.3.8)\n",
-            "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (2.31.0)\n",
-            "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.4.1)\n",
-            "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.70.16)\n",
-            "Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (2023.6.0)\n",
-            "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.9.3)\n",
-            "Requirement already satisfied: huggingface-hub>=0.19.4 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.20.3)\n",
-            "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (23.2)\n",
-            "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (6.0.1)\n",
-            "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.4.0->cleanlab[datalab]) (2.8.2)\n",
-            "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.4.0->cleanlab[datalab]) (2023.4)\n",
-            "Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.1->cleanlab[datalab]) (1.11.4)\n",
-            "Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.1->cleanlab[datalab]) (1.3.2)\n",
-            "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.1->cleanlab[datalab]) (3.3.0)\n",
-            "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.3.1)\n",
-            "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (23.2.0)\n",
-            "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.4.1)\n",
-            "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (6.0.5)\n",
-            "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.9.4)\n",
-            "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (4.0.3)\n",
-            "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.19.4->datasets>=2.7.0->cleanlab[datalab]) (4.9.0)\n",
-            "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas>=1.4.0->cleanlab[datalab]) (1.16.0)\n",
-            "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (3.3.2)\n",
-            "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (3.6)\n",
-            "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (2.0.7)\n",
-            "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (2024.2.2)\n"
-          ]
-        }
-      ]
+      "execution_count": null,
+      "outputs": []
     },
     {
       "cell_type": "code",
-      "execution_count": 22,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:13.467211Z",
@@ -240,7 +118,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 23,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:13.472374Z",
@@ -269,12 +147,12 @@
         "id": "yj_5JcO1cw4L"
       },
       "source": [
-        "## 2. Load and format the text dataset\n"
+        "## Load and format the text dataset\n"
       ]
     },
     {
       "cell_type": "code",
-      "execution_count": 24,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:13.476949Z",
@@ -584,7 +462,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 25,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:13.504463Z",
@@ -627,7 +505,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 26,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:13.510435Z",
@@ -696,7 +574,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 27,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:13.515306Z",
@@ -704,21 +582,9 @@
           "iopub.status.idle": "2024-02-16T06:26:18.244024Z",
           "shell.execute_reply": "2024-02-16T06:26:18.243354Z"
         },
-        "id": "DbDb6Ni6cw4M",
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        },
-        "outputId": "b3ff5ca8-afc6-4e0b-b2be-ba5dd7c0841b"
+        "id": "DbDb6Ni6cw4M"
       },
-      "outputs": [
-        {
-          "output_type": "stream",
-          "name": "stderr",
-          "text": [
-            "WARNING:sentence_transformers.SentenceTransformer:No sentence-transformers model found with name google/electra-small-discriminator. Creating a new one with MEAN pooling.\n"
-          ]
-        }
-      ],
+      "outputs": [],
       "source": [
         "transformer = SentenceTransformer('google/electra-small-discriminator')\n",
         "text_embeddings = transformer.encode(raw_texts)"
@@ -739,7 +605,7 @@
         "id": "4FK2Q72gcw4M"
       },
       "source": [
-        "## 3. Define a classification model and compute out-of-sample predicted probabilities"
+        "## Define a classification model and compute out-of-sample predicted probabilities"
       ]
     },
     {
@@ -758,7 +624,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 28,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:18.247142Z",
@@ -782,7 +648,7 @@
         "id": "9s0pcMk1cw4N"
       },
       "source": [
-        "## 4. Use cleanlab to find issues in your dataset"
+        "## Use Cleanlab to find issues in your dataset"
       ]
     },
     {
@@ -793,12 +659,12 @@
       "source": [
         "Given feature embeddings and the (out-of-sample) predicted class probabilities obtained from any model you have, cleanlab can quickly help you identify low-quality examples in your dataset.\n",
         "\n",
-        "Here, we use cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we’ll simply wrap the training features and noisy labels in a dictionary."
+        "Here, we use Cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we’ll simply wrap the training features and noisy labels in a dictionary."
       ]
     },
     {
       "cell_type": "code",
-      "execution_count": 29,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:19.136722Z",
@@ -824,7 +690,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 30,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:19.141893Z",
@@ -833,30 +699,9 @@
           "shell.execute_reply": "2024-02-16T06:26:20.808461Z"
         },
         "scrolled": true,
-        "id": "R0xuUDRWcw4N",
-        "outputId": "6e8541c2-0e28-4907-c41a-d097212fe8a4",
-        "colab": {
-          "base_uri": "https://localhost:8080/"
-        }
+        "id": "R0xuUDRWcw4N"
       },
-      "outputs": [
-        {
-          "output_type": "stream",
-          "name": "stdout",
-          "text": [
-            "Finding null issues ...\n",
-            "Finding label issues ...\n",
-            "Finding outlier issues ...\n",
-            "Fitting OOD estimator based on provided features ...\n",
-            "Finding near_duplicate issues ...\n",
-            "Finding non_iid issues ...\n",
-            "Finding class_imbalance issues ...\n",
-            "Finding underperforming_group issues ...\n",
-            "\n",
-            "Audit complete. 62 issues found in the dataset.\n"
-          ]
-        }
-      ],
+      "outputs": [],
       "source": [
         "lab = Datalab(data_dict, label_name=\"labels\")\n",
         "lab.find_issues(pred_probs=pred_probs, features=text_embeddings)"
@@ -895,7 +740,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 31,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:20.813057Z",
@@ -1017,113 +862,6 @@
         "lab.report()"
       ]
     },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "The output for the `lab.report()` would look like below:\n",
-        "\n",
-        "```bash\n",
-        "Here is a summary of the different kinds of issues found in the data:\n",
-        "\n",
-        "    issue_type  num_issues\n",
-        "       outlier          37\n",
-        "near_duplicate          14\n",
-        "         label          10\n",
-        "       non_iid           1\n",
-        "\n",
-        "Dataset Information: num_examples: 1000, num_classes: 7\n",
-        "\n",
-        "\n",
-        "---------------------- outlier issues ----------------------\n",
-        "\n",
-        "About this issue:\n",
-        "\tExamples that are very different from the rest of the dataset\n",
-        "    (i.e. potentially out-of-distribution or rare/anomalous instances).\n",
-        "    \n",
-        "\n",
-        "Number of examples with this issue: 37\n",
-        "Overall dataset quality in terms of this issue: 0.3671\n",
-        "\n",
-        "Examples representing most severe instances of this issue:\n",
-        "     is_outlier_issue  outlier_score\n",
-        "791              True       0.024866\n",
-        "601              True       0.031162\n",
-        "863              True       0.060738\n",
-        "355              True       0.064199\n",
-        "157              True       0.065075\n",
-        "\n",
-        "\n",
-        "------------------ near_duplicate issues -------------------\n",
-        "\n",
-        "About this issue:\n",
-        "\tA (near) duplicate issue refers to two or more examples in\n",
-        "    a dataset that are extremely similar to each other, relative\n",
-        "    to the rest of the dataset.  The examples flagged with this issue\n",
-        "    may be exactly duplicated, or lie atypically close together when\n",
-        "    represented as vectors (i.e. feature embeddings).\n",
-        "    \n",
-        "\n",
-        "Number of examples with this issue: 14\n",
-        "Overall dataset quality in terms of this issue: 0.5961\n",
-        "\n",
-        "Examples representing most severe instances of this issue:\n",
-        "     is_near_duplicate_issue  near_duplicate_score near_duplicate_sets  distance_to_nearest_neighbor\n",
-        "459                     True              0.009544               [429]                      0.000566\n",
-        "429                     True              0.009544               [459]                      0.000566\n",
-        "501                     True              0.046044          [412, 517]                      0.002781\n",
-        "412                     True              0.046044               [501]                      0.002781\n",
-        "698                     True              0.054626               [607]                      0.003314\n",
-        "\n",
-        "\n",
-        "----------------------- label issues -----------------------\n",
-        "\n",
-        "About this issue:\n",
-        "\tExamples whose given label is estimated to be potentially incorrect\n",
-        "    (e.g. due to annotation error) are flagged as having label issues.\n",
-        "    \n",
-        "\n",
-        "Number of examples with this issue: 10\n",
-        "Overall dataset quality in terms of this issue: 0.9930\n",
-        "\n",
-        "Examples representing most severe instances of this issue:\n",
-        "     is_label_issue  label_score  given_label  predicted_label\n",
-        "379           False     0.025486           32               11\n",
-        "100           False     0.032102           11               36\n",
-        "300           False     0.037742           32               46\n",
-        "485            True     0.057666           17               34\n",
-        "159            True     0.059408           13               11\n",
-        "\n",
-        "\n",
-        "---------------------- non_iid issues ----------------------\n",
-        "\n",
-        "About this issue:\n",
-        "\tWhether the dataset exhibits statistically significant\n",
-        "    violations of the IID assumption like:\n",
-        "    changepoints or shift, drift, autocorrelation, etc.\n",
-        "    The specific violation considered is whether the\n",
-        "    examples are ordered such that almost adjacent examples\n",
-        "    tend to have more similar feature values.\n",
-        "    \n",
-        "\n",
-        "Number of examples with this issue: 1\n",
-        "Overall dataset quality in terms of this issue: 0.0000\n",
-        "\n",
-        "Examples representing most severe instances of this issue:\n",
-        "     is_non_iid_issue  non_iid_score\n",
-        "988              True       0.563774\n",
-        "975             False       0.570179\n",
-        "997             False       0.571891\n",
-        "967             False       0.572357\n",
-        "956             False       0.577413\n",
-        "\n",
-        "Additional Information:\n",
-        "p-value: 0.0\n",
-        "```"
-      ],
-      "metadata": {
-        "id": "XI03VkWHrixv"
-      }
-    },
     {
       "cell_type": "markdown",
       "metadata": {
@@ -1137,7 +875,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 32,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:20.843083Z",
@@ -1490,7 +1228,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 33,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:20.854743Z",
@@ -1525,22 +1263,6 @@
         ")"
       ]
     },
-    {
-      "cell_type": "markdown",
-      "source": [
-        "The output for the above cell would look like below:\n",
-        "\n",
-        "```bash\n",
-        "cleanlab found 10 potential label errors in the dataset.\n",
-        "Here are indices of the top 5 most likely errors:\n",
-        " [379 100 300 485 159]\n",
-        "\n",
-        "```"
-      ],
-      "metadata": {
-        "id": "QyW7qUNKXOz5"
-      }
-    },
     {
       "cell_type": "markdown",
       "metadata": {
@@ -1554,7 +1276,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 18,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:20.861048Z",
@@ -1914,7 +1636,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 20,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:20.869718Z",
@@ -2237,7 +1959,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 34,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:20.878435Z",
@@ -2582,7 +2304,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 35,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:20.886079Z",
@@ -2918,7 +2640,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 38,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:20.896501Z",
@@ -3223,7 +2945,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 39,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:20.904159Z",
@@ -3547,7 +3269,7 @@
     },
     {
       "cell_type": "code",
-      "execution_count": 40,
+      "execution_count": null,
       "metadata": {
         "execution": {
           "iopub.execute_input": "2024-02-16T06:26:20.911817Z",
@@ -3602,15 +3324,18 @@
         "id": "qnncoRWUcw4S"
       },
       "source": [
-        "### Easy Mode\n",
+        "### Cleanlab Open-Source Project\n",
+        "\n",
+        "[Cleanlab](https://github.com/cleanlab/cleanlab) is an open-source data-centric AI package designed to address data quality issues in messy, real-world data.\n",
         "\n",
-        "Cleanlab is most effective when you run this code with a good ML model. Try to produce the best ML model you can for your data (instead of the basic model from this tutorial). If you don't know the best ML model for your data, try [Cleanlab Studio](https://cleanlab.ai/blog/data-centric-ai/) which will automatically produce one for you. Super easy to use, [Cleanlab Studio](https://cleanlab.ai/blog/data-centric-ai/) is no-code platform for data-centric AI that automatically: detects data issues (more types of issues than this cleanlab package), helps you quickly correct these data issues, confidently labels large subsets of an unlabeled dataset, and provides other smart metadata about each of your data points -- all powered by a system that automatically trains/deploys the best ML model for your data. [Try it for free!](https://cleanlab.ai/signup/)"
+        "If you find it useful, consider giving the Cleanlab GitHub repository a star. We also welcome [contributions](https://github.com/cleanlab/cleanlab/issues?q=is:issue+is:open+label:%22good+first+issue%22) to the project."
       ]
     }
   ],
   "metadata": {
     "colab": {
-      "provenance": []
+      "provenance": [],
+      "toc_visible": true
     },
     "kernelspec": {
       "display_name": "Python 3 (ipykernel)",

From 89cd83a9cf58df372df59635093aed5e2e5eca1a Mon Sep 17 00:00:00 2001
From: aravindputrevu <aravind.putrevu@gmail.com>
Date: Mon, 11 Mar 2024 14:11:37 +0530
Subject: [PATCH 4/6] Changes to the TOC tree

---
 notebooks/en/_toctree.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml
index 8c699a1b..6df0a009 100644
--- a/notebooks/en/_toctree.yml
+++ b/notebooks/en/_toctree.yml
@@ -1,5 +1,7 @@
 - title: Open-Source AI Cookbook
   sections:
+  - local: issues_in_text_dataset
+    title: Detecting Issues in a Text Dataset with Cleanlab
   - local: index
     title: Open-Source AI Cookbook
   - local: stable_diffusion_interpolation
@@ -20,7 +22,5 @@
     title: Advanced RAG on HuggingFace documentation using LangChain
   - local: rag_evaluation
     title: RAG Evaluation
-  - local: issues_in_text_dataset
-    title: Detecting Issues in a Text Dataset with Datalab
   - local: prompt_tuning_peft
     title: Prompt tuning with PEFT

From 06e8b38cce09e1fd7c9d3556dd421904f7a57335 Mon Sep 17 00:00:00 2001
From: Maria Khalusova <kafooster@gmail.com>
Date: Mon, 11 Mar 2024 11:45:10 -0400
Subject: [PATCH 5/6] Moved the recipe after the index page

---
 notebooks/en/_toctree.yml | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml
index 6df0a009..4ccb23d4 100644
--- a/notebooks/en/_toctree.yml
+++ b/notebooks/en/_toctree.yml
@@ -1,9 +1,9 @@
 - title: Open-Source AI Cookbook
   sections:
-  - local: issues_in_text_dataset
-    title: Detecting Issues in a Text Dataset with Cleanlab
   - local: index
     title: Open-Source AI Cookbook
+  - local: issues_in_text_dataset
+    title: Detecting Issues in a Text Dataset with Cleanlab
   - local: stable_diffusion_interpolation
     title: Stable Diffusion Interpolation
   - local: rag_with_hugging_face_gemma_mongodb

From 1ced1f1b695464204cf2daaa6fdce55b0d83e7c0 Mon Sep 17 00:00:00 2001
From: Maria Khalusova <kafooster@gmail.com>
Date: Mon, 11 Mar 2024 12:28:23 -0400
Subject: [PATCH 6/6] Fixes missing columns in tables

---
 notebooks/en/issues_in_text_dataset.ipynb | 34 +++++++++++------------
 1 file changed, 17 insertions(+), 17 deletions(-)

diff --git a/notebooks/en/issues_in_text_dataset.ipynb b/notebooks/en/issues_in_text_dataset.ipynb
index ef02982d..e568de20 100644
--- a/notebooks/en/issues_in_text_dataset.ipynb
+++ b/notebooks/en/issues_in_text_dataset.ipynb
@@ -1196,8 +1196,8 @@
     {
       "cell_type": "markdown",
       "source": [
-        "| is_label_issue | label_score | given_label | predicted_label |\n",
-        "|----------------|-------------|-------------|-----------------|\n",
+        "| | is_label_issue | label_score | given_label | predicted_label |\n",
+        "|----------------|-------------|-------------|-----------------|-----------------|\n",
         "| 0              | False       | 0.903926    | 11              | 11 |\n",
         "| 1              | False       | 0.860544    | 11              | 11 |\n",
         "| 2              | False       | 0.658309    | 11              | 11 |\n",
@@ -1600,13 +1600,13 @@
       "source": [
         "  The output to the above command would like below:\n",
         "  \n",
-        "  | text | given_label                                                                                               | suggested_label |\n",
-        "|------|-----------------------------------------------------------------------------------------------------------|-----------------|\n",
-        "| 379  | Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from? | 32              |\n",
-        "| 100  | can you share card tracking number?                                                                       | 11              |\n",
-        "| 300  | If I need to cash foreign transfers, how does that work?                                                  | 32              |\n",
-        "| 485  | Was I charged more than I should of been for a currency exchange?                                         | 17              |\n",
-        "| 159  | Is there any way to see my card in the app?                                                               | 13              |\n"
+        "|      | text                                                                                                      | given_label    | suggested_label |\n",
+        "|------|-----------------------------------------------------------------------------------------------------------|----------------|-----------------|\n",
+        "| 379  | Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from? | 32             | 11              |\n",
+        "| 100  | can you share card tracking number?                                                                       | 11             | 36              |\n",
+        "| 300  | If I need to cash foreign transfers, how does that work?                                                  | 32             | 46              |\n",
+        "| 485  | Was I charged more than I should of been for a currency exchange?                                         | 17             | 34              |\n",
+        "| 159  | Is there any way to see my card in the app?                                                               | 13             | 11              |\n"
       ],
       "metadata": {
         "id": "g2dvMySPtkbL"
@@ -1945,13 +1945,13 @@
       "source": [
         "Output would look like below:\n",
         "\n",
-        "| is_outlier_issue | outlier_score |\n",
-        "|------------------|---------------|\n",
-        "| True             | 0.024866      |\n",
-        "| True             | 0.031162      |\n",
-        "| True             | 0.060738      |\n",
-        "| True             | 0.064199      |\n",
-        "| True             | 0.065075      |"
+        "|   | is_outlier_issue | outlier_score |\n",
+        "|-----|------------------|---------------|\n",
+        "| 791 | True             | 0.024866      |\n",
+        "| 601 | True             | 0.031162      |\n",
+        "| 863 | True             | 0.060738      |\n",
+        "| 355 | True             | 0.064199      |\n",
+        "| 157 | True             | 0.065075      |"
       ],
       "metadata": {
         "id": "F7Z2VJQAujui"
@@ -3357,4 +3357,4 @@
   },
   "nbformat": 4,
   "nbformat_minor": 0
-}
\ No newline at end of file
+}

	text	label
0	I am still waiting on my card?	11
1	What can I do if my card still hasn't arrived after 2 weeks?	11
2	I have been waiting over a week. Is the card still coming?	11
3	Can I track my card while it is in the process of delivery?	11
4	How do I know if I will get my card, or if it is lost?	11
	is_label_issue	label_score	given_label	predicted_label
0	False	0.903926	11	11
1	False	0.860544	11	11
2	False	0.658309	11	11
3	False	0.697085	11	11
4	False	0.434934	11	11
	text	given_label	suggested_label
379	Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from?	32	11
100	can you share card tracking number?	11	36
300	If I need to cash foreign transfers, how does that work?	32	46
485	Was I charged more than I should of been for a currency exchange?	17	34
159	Is there any way to see my card in the app?	13	11
	is_outlier_issue	outlier_score
791	True	0.024866
601	True	0.031162
863	True	0.060738
355	True	0.064199
157	True	0.065075
	text	label
791	withdrawal pending meaning?	46
601	$1 charge in transaction.	34
863	My atm withdraw is stillpending	46
355	explain the interbank exchange rate	32
157	lost card found, want to put it back in app	13
	is_near_duplicate_issue	near_duplicate_score	near_duplicate_sets	distance_to_nearest_neighbor
459	True	0.009544	[429]	0.000566
429	True	0.009544	[459]	0.000566
501	True	0.046044	[412, 517]	0.002781
412	True	0.046044	[501]	0.002781
698	True	0.054626	[607]	0.003314
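
The `near_duplicate_sets` column in the table above links each flagged row to its neighbors, so the same group can appear under several indices. A minimal sketch of how those links could be merged into disjoint groups follows. It uses only the standard library and the five rows shown above; the helper name `group_near_duplicates` is hypothetical and is not part of cleanlab.

```python
from collections import defaultdict

# Rows copied from the near-duplicate table above: index -> near_duplicate_sets.
near_duplicate_sets = {
    459: [429],
    429: [459],
    501: [412, 517],
    412: [501],
    698: [607],
}


def group_near_duplicates(neighbors):
    """Merge pairwise near-duplicate links into disjoint groups.

    Hypothetical helper for illustration: treats the links as an undirected
    graph and returns its connected components as sorted lists.
    """
    # Build an undirected adjacency map so every link works in both directions.
    adjacency = defaultdict(set)
    for idx, others in neighbors.items():
        for other in others:
            adjacency[idx].add(other)
            adjacency[other].add(idx)

    groups, seen = [], set()
    for start in sorted(adjacency):
        if start in seen:
            continue
        # Depth-first walk collecting every index reachable from `start`.
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in component:
                continue
            component.add(node)
            stack.extend(adjacency[node] - component)
        seen |= component
        groups.append(sorted(component))
    return groups


groups = group_near_duplicates(near_duplicate_sets)
print(groups)
```

Each resulting group can then be reviewed once, for example by keeping a single representative example per group, rather than inspecting every flagged row separately.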
	text	label
459	I purchased something abroad and the incorrect exchange rate was applied.	17
429	I purchased something overseas and the incorrect exchange rate was applied.	17
	text	label
501	The exchange rate you are using is really bad.This can't be the official interbank exchange rate.	17
412	The exchange rate you are using is bad.This can't be the official interbank exchange rate.	17