From 744e474dfe57b1992463fc267994baabde0134eb Mon Sep 17 00:00:00 2001 From: aravindputrevu Date: Sat, 17 Feb 2024 01:30:22 +0100 Subject: [PATCH 1/6] Changes for Detecting Issues in a Text Dataset with Datalab --- notebooks/en/_toctree.yml | 2 + notebooks/en/index.md | 1 + notebooks/en/issues_in_text_dataset.ipynb | 571 ++++++++++++++++++++++ 3 files changed, 574 insertions(+) create mode 100644 notebooks/en/issues_in_text_dataset.ipynb diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 09825b25..0b6d609f 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -12,3 +12,5 @@ title: Advanced RAG on HuggingFace documentation using LangChain - local: rag_evaluation title: RAG Evaluation + - local: issues_in_text_dataset + title: Detecting Issues in a Text Dataset with Datalab diff --git a/notebooks/en/index.md b/notebooks/en/index.md index b9b2a530..029a4da9 100644 --- a/notebooks/en/index.md +++ b/notebooks/en/index.md @@ -12,6 +12,7 @@ Check out the recently added notebooks: - [Fine-tuning a Code LLM on Custom Code on a single GPU](fine_tuning_code_llm_on_single_gpu) - [RAG Evaluation Using Synthetic data and LLM-As-A-Judge](rag_evaluation) - [Advanced RAG on HuggingFace documentation using LangChain](advanced_rag) +- [Detecting Issues in a Text Dataset with Datalab](issues_in_text_dataset) You can also check out the notebooks in the cookbook's [GitHub repo](https://github.com/huggingface/cookbook). 
diff --git a/notebooks/en/issues_in_text_dataset.ipynb b/notebooks/en/issues_in_text_dataset.ipynb new file mode 100644 index 00000000..36f19d14 --- /dev/null +++ b/notebooks/en/issues_in_text_dataset.ipynb @@ -0,0 +1,571 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Detecting Issues in a Text Dataset with Datalab\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "In this 5-minute quickstart tutorial, we use Datalab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests which are classified into 10 categories based on their intent (you can run this same code on any text classification dataset). Cleanlab automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), or otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!\n", + "\n", + "**Overview of what we'll do in this tutorial:**\n", + "\n", + "- Use a pretrained transformer model to extract the text embeddings from the customer service requests\n", + "\n", + "- Train a simple Logistic Regression model on the text embeddings to compute out-of-sample predicted probabilities\n", + "\n", + "- Run cleanlab's `Datalab` audit with these predictions and embeddings in order to identify problems like: label issues, outliers, and near duplicates in the dataset." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "Quickstart\n", + "
\n", + " \n", + "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.\n", + "\n", + "
\n", + " \n", + "```ipython3 \n", + "from cleanlab import Datalab\n", + "\n", + "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n", + "lab.find_issues(pred_probs=your_pred_probs, features=your_features)\n", + "\n", + "lab.report()\n", + "lab.get_issues()\n", + "```\n", + " \n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Install required dependencies\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "You can use `pip` to install all packages required for this tutorial as follows:\n", + "\n", + "```ipython3\n", + "!pip install sklearn sentence-transformers\n", + "!pip install \"cleanlab[datalab]\"\n", + "# Make sure to install the version corresponding to this tutorial\n", + "# E.g. if viewing master branch documentation:\n", + "# !pip install git+https://github.com/cleanlab/cleanlab.git\n", + "```" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "nbsphinx": "hidden" + }, + "outputs": [], + "source": [ + "# Package installation (hidden on docs.cleanlab.ai).\n", + "# If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)\n", + "# Package versions we used:scikit-learn==1.2.0 sentence-transformers==2.2.2\n", + "\n", + "dependencies = [\"cleanlab\", \"sentence_transformers\", \"datasets\"]\n", + "\n", + "# Supress outputs that may appear if tensorflow happens to be improperly installed: \n", + "import os \n", + "\n", + "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\" # disable parallelism to avoid deadlocks with huggingface\n", + "\n", + "if \"google.colab\" in str(get_ipython()): # Check if it's running in Google Colab\n", + " %pip install cleanlab # for colab\n", + " cmd = ' '.join([dep for dep in dependencies if dep != \"cleanlab\"])\n", + " %pip install $cmd\n", + "else:\n", + " dependencies_test = [dependency.split('>')[0] if '>' in dependency \n", + " else dependency.split('<')[0] if '<' in dependency \n", + " else dependency.split('=')[0] for dependency in dependencies]\n", + " missing_dependencies = []\n", + " for dependency in dependencies_test:\n", + " try:\n", + " __import__(dependency)\n", + " except ImportError:\n", + " missing_dependencies.append(dependency)\n", + "\n", + " 
if len(missing_dependencies) > 0:\n", + " print(\"Missing required dependencies:\")\n", + " print(*missing_dependencies, sep=\", \")\n", + " print(\"\\nPlease install them before running the rest of this notebook.\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "import re \n", + "import string \n", + "import pandas as pd \n", + "from sklearn.metrics import accuracy_score, log_loss \n", + "from sklearn.model_selection import cross_val_predict \n", + "from sklearn.linear_model import LogisticRegression\n", + "from sentence_transformers import SentenceTransformer\n", + "\n", + "from cleanlab import Datalab" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "nbsphinx": "hidden" + }, + "outputs": [], + "source": [ + "# This cell is hidden from docs.cleanlab.ai \n", + "\n", + "import random \n", + "import numpy as np \n", + "\n", + "pd.set_option(\"display.max_colwidth\", None) \n", + "\n", + "SEED = 123456 # for reproducibility\n", + "np.random.seed(SEED)\n", + "random.seed(SEED)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. 
Load and format the text dataset\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data = pd.read_csv(\"https://s.cleanlab.ai/banking-intent-classification.csv\")\n", + "data.head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "raw_texts, labels = data[\"text\"].values, data[\"label\"].values\n", + "num_classes = len(set(labels))\n", + "\n", + "print(f\"This dataset has {num_classes} classes.\")\n", + "print(f\"Classes: {set(labels)}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's view the i-th example in the dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "i = 1 # change this to view other examples from the dataset\n", + "print(f\"Example Label: {labels[i]}\")\n", + "print(f\"Example Text: {raw_texts[i]}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The data is stored as two numpy arrays:\n", + "\n", + "1. `raw_texts` stores the customer service requests utterances in text format\n", + "2. `labels` stores the intent categories (labels) for each example" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "
\n", + "Bringing Your Own Data (BYOD)?\n", + "\n", + "You can easily replace the above with your own text dataset, and continue with the rest of the tutorial.\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Next we convert the text strings into vectors better suited as inputs for our ML models. \n", + "\n", + "We will use numeric representations from a pretrained Transformer model as embeddings of our text. The [Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) library offers simple methods to compute these embeddings for text data. Here, we load the pretrained `electra-small-discriminator` model, and then run our data through network to extract a vector embedding of each example." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "transformer = SentenceTransformer('google/electra-small-discriminator')\n", + "text_embeddings = transformer.encode(raw_texts)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Our subsequent ML model will directly operate on elements of `text_embeddings` in order to classify the customer service requests." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Define a classification model and compute out-of-sample predicted probabilities" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and only train the output layer without having to rely on GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted embeddings.\n", + "\n", + "To identify label issues, cleanlab requires a probabilistic prediction from your model for each datapoint. However these predictions will be _overfit_ (and thus unreliable) for datapoints the model was previously trained on. 
cleanlab is intended to only be used with **out-of-sample** predicted class probabilities, i.e. on datapoints held out from the model during training.\n", + "\n", + "Here we obtain out-of-sample predicted class probabilities for every example in our dataset using a Logistic Regression model with cross-validation.\n", + "Make sure that the columns of your `pred_probs` are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "model = LogisticRegression(max_iter=400)\n", + "\n", + "pred_probs = cross_val_predict(model, text_embeddings, labels, method=\"predict_proba\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Use cleanlab to find issues in your dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Given feature embeddings and the (out-of-sample) predicted class probabilities obtained from any model you have, cleanlab can quickly help you identify low-quality examples in your dataset.\n", + "\n", + "Here, we use cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we’ll simply wrap the training features and noisy labels in a dictionary. " + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_dict = {\"texts\": raw_texts, \"labels\": labels}" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "All that is needed to audit your data is to call `find_issues()`. We pass in the predicted probabilities and the feature embeddings obtained above, but you do not necessarily need to provide all of this information depending on which types of issues you are interested in. The more inputs you provide, the more types of issues `Datalab` can detect in your data. 
Using a better model to produce these inputs will ensure cleanlab more accurately estimates issues." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "lab = Datalab(data_dict, label_name=\"labels\")\n", + "lab.find_issues(pred_probs=pred_probs, features=text_embeddings)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "After the audit is complete, review the findings using the `report` method:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "lab.report()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Label issues\n", + "\n", + "The report indicates that cleanlab identified many label issues in our dataset. We can see which examples are flagged as likely mislabeled and the label quality score for each example using the `get_issues` method, specifying `label` as an argument to focus on label issues in the data." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "scrolled": true + }, + "outputs": [], + "source": [ + "label_issues = lab.get_issues(\"label\")\n", + "label_issues.head() " + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "This method returns a dataframe containing a label quality score for each example. These numeric scores lie between 0 and 1, where lower scores indicate examples more likely to be mislabeled. The dataframe also contains a boolean column specifying whether or not each example is identified to have a label issue (indicating it is likely mislabeled)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We can get the subset of examples flagged with label issues, and also sort by label quality score to find the indices of the 5 most likely mislabeled examples in our dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "identified_label_issues = label_issues[label_issues[\"is_label_issue\"] == True]\n", + "lowest_quality_labels = label_issues[\"label_score\"].argsort()[:5].to_numpy()\n", + "\n", + "print(\n", + " f\"cleanlab found {len(identified_label_issues)} potential label errors in the dataset.\\n\"\n", + " f\"Here are indices of the top 5 most likely errors: \\n {lowest_quality_labels}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Let's review some of the most likely label errors. \n", + "\n", + "Here we display the top 5 examples identified as the most likely label errors in the dataset, together with their given (original) label and a suggested alternative label from cleanlab.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data_with_suggested_labels = pd.DataFrame(\n", + " {\"text\": raw_texts, \"given_label\": labels, \"suggested_label\": label_issues[\"predicted_label\"]}\n", + ")\n", + "data_with_suggested_labels.iloc[lowest_quality_labels]" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "scrolled": true + }, + "source": [ + "These are very clear label errors that cleanlab has identified in this data! Note that the `given_label` does not correctly reflect the intent of these requests; whoever produced this dataset made many mistakes that are important to address before modeling the data." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Outlier issues\n", + "\n", + "According to the report, our dataset contains some outliers.\n", + "We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via `get_issues`. We sort the resulting DataFrame by cleanlab's outlier quality score to see the most severe outliers in our dataset." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "outlier_issues = lab.get_issues(\"outlier\")\n", + "outlier_issues.sort_values(\"outlier_score\").head()" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "lowest_quality_outliers = outlier_issues[\"outlier_score\"].argsort()[:5]\n", + "\n", + "data.iloc[lowest_quality_outliers]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see that cleanlab has identified entries in this dataset that do not appear to be proper customer requests. Outliers in this dataset appear to be out-of-scope customer requests and other nonsensical text which does not make sense for intent classification. Carefully consider whether such outliers may detrimentally affect your data modeling, and consider removing them from the dataset if so." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Near-duplicate issues\n", + "\n", + "According to the report, our dataset contains some sets of nearly duplicated examples.\n", + "We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues`. We sort the resulting DataFrame by cleanlab's near-duplicate quality score to see the text examples in our dataset that are most nearly duplicated." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "duplicate_issues = lab.get_issues(\"near_duplicate\")\n", + "duplicate_issues.sort_values(\"near_duplicate_score\").head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "The results above show which examples cleanlab considers nearly duplicated (rows where `is_near_duplicate_issue == True`). 
Here, we see that examples 160 and 148 are nearly duplicated, as are examples 546 and 514.\n", + "\n", + "Let's view these examples to see how similar they are." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.iloc[[160, 148]]" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "data.iloc[[546, 514]]" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "We see that these two sets of requests are indeed very similar to one another! Including near duplicates in a dataset may have unintended effects on models, so be wary about splitting them across training/test sets. Learn more about handling near duplicates in a dataset from [the FAQ](https://docs.cleanlab.ai/stable/tutorials/faq.html#How-to-handle-near-duplicate-data-identified-by-cleanlab?)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Non-IID issues (data drift)\n", + "According to the report, our dataset does not appear to be Independent and Identically Distributed (IID). The overall non-IID score for the dataset (displayed below) corresponds to the `p-value` of a statistical test for whether the ordering of samples in the dataset appears related to the similarity between their feature values. A low `p-value` strongly suggests that the dataset violates the IID assumption, which is a key assumption required for conclusions (models) produced from the dataset to generalize to a larger population." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "p_value = lab.get_info('non_iid')['p-value']\n", + "p_value" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Here, our dataset was flagged as non-IID because the rows happened to be sorted by class label in the original data. 
This may be benign if we remember to shuffle rows before model training and data splitting. But if you don't know why your data was flagged as non-IID, then you should be worried about potential data drift or unexpected interactions between data points (their values may not be statistically independent). Think carefully about what future test data may look like (and whether your data is representative of the population you care about). You should not shuffle your data before the non-IID test runs (doing so would invalidate its conclusions)." + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "As demonstrated above, cleanlab can automatically shortlist the most likely issues in your dataset to help you better curate your dataset for subsequent modeling. With this shortlist, you can decide whether to fix these label issues or remove nonsensical or duplicated examples from your dataset to obtain a higher-quality dataset for training your next ML model. cleanlab's issue detection can be run with outputs from *any* type of model you initially trained.\n" + ] + } + ], + "metadata": { + "colab": { + "collapsed_sections": [], + "name": "Text x TensorFlow", + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.7" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +} From 6404c7358238d89be62d9328424adf2bb7183426 Mon Sep 17 00:00:00 2001 From: aravindputrevu Date: Tue, 27 Feb 2024 23:46:59 +0100 Subject: [PATCH 2/6] Fixed the review comments --- notebooks/en/issues_in_text_dataset.ipynb | 4202 ++++++++++++++++++--- 1 file changed, 3633 insertions(+), 569 deletions(-) diff --git a/notebooks/en/issues_in_text_dataset.ipynb 
b/notebooks/en/issues_in_text_dataset.ipynb index 36f19d14..2c1cda57 100644 --- a/notebooks/en/issues_in_text_dataset.ipynb +++ b/notebooks/en/issues_in_text_dataset.ipynb @@ -1,571 +1,3635 @@ { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Detecting Issues in a Text Dataset with Datalab\n" - ] + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "pw6cvzTocw4G" + }, + "source": [ + "# Detecting Issues in a Text Dataset with Datalab\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0yPBE0Xccw4J" + }, + "source": [ + "Authored by: [@aravindputrevu](https://huggingface.co/aravindputrevu)\n", + "\n", + "\n", + "In this 5-minute quickstart tutorial, we use Datalab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests which are classified into 10 categories based on their intent (you can run this same code on any text classification dataset). [Cleanlab](https://github.com/cleanlab/cleanlab) automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), or otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!\n", + "\n", + "**Overview of what we'll do in this tutorial:**\n", + "\n", + "- Use a pretrained transformer model to extract the text embeddings from the customer service requests\n", + "\n", + "- Train a simple Logistic Regression model on the text embeddings to compute out-of-sample predicted probabilities\n", + "\n", + "- Run cleanlab's `Datalab` audit with these predictions and embeddings in order to identify problems like: label issues, outliers, and near duplicates in the dataset." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "o__pRLFYcw4K" + }, + "source": [ + "
\n", + "Quickstart\n", + "
\n", + " \n", + "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.\n", + "\n", + "**Note:** If running on Colab, you may want to use a GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)\n", + "\n", + "
\n", + " \n", + "```ipython3\n", + "from cleanlab import Datalab\n", + "\n", + "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n", + "lab.find_issues(pred_probs=your_pred_probs, features=your_features)\n", + "\n", + "lab.report()\n", + "lab.get_issues()\n", + "```\n", + " \n", + "
\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "dp4lpApmcw4K" + }, + "source": [ + "## 1. Install required dependencies\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "DjoWBgGAcw4K" + }, + "source": [ + "You can use `pip` to install all packages required for this tutorial as follows:\n" + ] + }, + { + "cell_type": "code", + "source": [ + "!pip install -U scikit-learn sentence-transformers datasets\n", + "!pip install -U \"cleanlab[datalab]\"" + ], + "metadata": { + "id": "fRsBIj3L_RUb", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 1000 + }, + "outputId": "2b22c97c-2373-4740-d394-7486277aa694" + }, + "execution_count": 41, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.2.2)\n", + "Collecting scikit-learn\n", + " Downloading scikit_learn-1.4.1.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)\n", + "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.1/12.1 MB\u001b[0m \u001b[31m38.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", + "\u001b[?25hRequirement already satisfied: sentence-transformers in /usr/local/lib/python3.10/dist-packages (2.4.0)\n", + "Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.17.1)\n", + "Requirement already satisfied: numpy<2.0,>=1.19.5 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.25.2)\n", + "Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.11.4)\n", + "Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.3.2)\n", + "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.3.0)\n", + "Requirement already satisfied: transformers<5.0.0,>=4.32.0 in 
/usr/local/lib/python3.10/dist-packages (from sentence-transformers) (4.37.2)\n", + "Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (4.66.2)\n", + "Requirement already satisfied: torch>=1.11.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (2.1.0+cu121)\n", + "Requirement already satisfied: huggingface-hub>=0.15.1 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (0.20.3)\n", + "Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (9.4.0)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.13.1)\n", + "Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (14.0.2)\n", + "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets) (0.6)\n", + "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.8)\n", + "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (1.5.3)\n", + "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.31.0)\n", + "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.4.1)\n", + "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.16)\n", + "Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2023.6.0)\n", + "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.9.3)\n", + "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (23.2)\n", + "Requirement already satisfied: pyyaml>=5.1 in 
/usr/local/lib/python3.10/dist-packages (from datasets) (6.0.1)\n", + "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n", + "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (23.2.0)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.4.1)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.5)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.4)\n", + "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence-transformers) (4.9.0)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.3.2)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.6)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2.0.7)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2024.2.2)\n", + "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (1.12)\n", + "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (3.2.1)\n", + "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from 
torch>=1.11.0->sentence-transformers) (3.1.3)\n", + "Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (2.1.0)\n", + "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (2023.12.25)\n", + "Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (0.15.2)\n", + "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (0.4.2)\n", + "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n", + "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2023.4)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->datasets) (1.16.0)\n", + "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.11.0->sentence-transformers) (2.1.5)\n", + "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.11.0->sentence-transformers) (1.3.0)\n", + "Installing collected packages: scikit-learn\n", + " Attempting uninstall: scikit-learn\n", + " Found existing installation: scikit-learn 1.2.2\n", + " Uninstalling scikit-learn-1.2.2:\n", + " Successfully uninstalled scikit-learn-1.2.2\n", + "Successfully installed scikit-learn-1.4.1.post1\n" + ] + }, + { + "output_type": "display_data", + "data": { + "application/vnd.colab-display-data+json": { + "pip_warning": { + "packages": [ + "sklearn" + ] + }, + "id": "207dfdbd8b714496a56fb33ee0f11a84" + } + }, + "metadata": {} + }, + { + "output_type": "stream", + "name": 
"stdout", + "text": [ + "Requirement already satisfied: cleanlab[datalab] in /usr/local/lib/python3.10/dist-packages (2.6.0)\n", + "Requirement already satisfied: numpy>=1.22.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (1.25.2)\n", + "Requirement already satisfied: scikit-learn>=1.1 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (1.4.1.post1)\n", + "Requirement already satisfied: tqdm>=4.53.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (4.66.2)\n", + "Requirement already satisfied: pandas>=1.4.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (1.5.3)\n", + "Requirement already satisfied: termcolor>=2.4.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (2.4.0)\n", + "Requirement already satisfied: datasets>=2.7.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (2.17.1)\n", + "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.13.1)\n", + "Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (14.0.2)\n", + "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.6)\n", + "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.3.8)\n", + "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (2.31.0)\n", + "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.4.1)\n", + "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.70.16)\n", + "Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 
in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (2023.6.0)\n", + "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.9.3)\n", + "Requirement already satisfied: huggingface-hub>=0.19.4 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.20.3)\n", + "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (23.2)\n", + "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (6.0.1)\n", + "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.4.0->cleanlab[datalab]) (2.8.2)\n", + "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.4.0->cleanlab[datalab]) (2023.4)\n", + "Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.1->cleanlab[datalab]) (1.11.4)\n", + "Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.1->cleanlab[datalab]) (1.3.2)\n", + "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.1->cleanlab[datalab]) (3.3.0)\n", + "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.3.1)\n", + "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (23.2.0)\n", + "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.4.1)\n", + "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from 
aiohttp->datasets>=2.7.0->cleanlab[datalab]) (6.0.5)\n", + "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.9.4)\n", + "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (4.0.3)\n", + "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.19.4->datasets>=2.7.0->cleanlab[datalab]) (4.9.0)\n", + "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas>=1.4.0->cleanlab[datalab]) (1.16.0)\n", + "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (3.3.2)\n", + "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (3.6)\n", + "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (2.0.7)\n", + "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (2024.2.2)\n" + ] + } + ] + }, + { + "cell_type": "code", + "execution_count": 22, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:13.467211Z", + "iopub.status.busy": "2024-02-16T06:26:13.466877Z", + "iopub.status.idle": "2024-02-16T06:26:13.470222Z", + "shell.execute_reply": "2024-02-16T06:26:13.469761Z" + }, + "id": "zgezWF-2cw4L" + }, + "outputs": [], + "source": [ + "import re\n", + "import string\n", + "import pandas as pd\n", + "from sklearn.metrics import accuracy_score, log_loss\n", + "from sklearn.model_selection import cross_val_predict\n", + "from sklearn.linear_model 
import LogisticRegression\n", + "from sentence_transformers import SentenceTransformer\n", + "\n", + "from cleanlab import Datalab" + ] + }, + { + "cell_type": "code", + "execution_count": 23, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:13.472374Z", + "iopub.status.busy": "2024-02-16T06:26:13.471951Z", + "iopub.status.idle": "2024-02-16T06:26:13.475065Z", + "shell.execute_reply": "2024-02-16T06:26:13.474625Z" + }, + "nbsphinx": "hidden", + "id": "mO3pnA1ncw4L" + }, + "outputs": [], + "source": [ + "import random\n", + "import numpy as np\n", + "\n", + "pd.set_option(\"display.max_colwidth\", None)\n", + "\n", + "SEED = 123456 # for reproducibility\n", + "np.random.seed(SEED)\n", + "random.seed(SEED)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yj_5JcO1cw4L" + }, + "source": [ + "## 2. Load and format the text dataset\n" + ] + }, + { + "cell_type": "code", + "execution_count": 24, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:13.476949Z", + "iopub.status.busy": "2024-02-16T06:26:13.476773Z", + "iopub.status.idle": "2024-02-16T06:26:13.502278Z", + "shell.execute_reply": "2024-02-16T06:26:13.501755Z" + }, + "id": "HztO4qU9cw4L", + "outputId": "c6ff9e95-6326-413e-a72f-6f3c05af1055", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " text label\n", + "0 I am still waiting on my card? 11\n", + "1 What can I do if my card still hasn't arrived after 2 weeks? 11\n", + "2 I have been waiting over a week. Is the card still coming? 11\n", + "3 Can I track my card while it is in the process of delivery? 11\n", + "4 How do I know if I will get my card, or if it is lost? 11" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textlabel
0I am still waiting on my card?11
1What can I do if my card still hasn't arrived after 2 weeks?11
2I have been waiting over a week. Is the card still coming?11
3Can I track my card while it is in the process of delivery?11
4How do I know if I will get my card, or if it is lost?11
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "data", + "summary": "{\n \"name\": \"data\",\n \"rows\": 1000,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 1000,\n \"samples\": [\n \"I made an international purchase, but the exchange rate was wrong\",\n \"I would like to know why a withdraw I made for some cash shows up as pending.\",\n \"I tried to get cash out of the ATM but it is taking too long\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 12,\n \"min\": 11,\n \"max\": 46,\n \"num_unique_values\": 7,\n \"samples\": [\n 11,\n 13,\n 46\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 24 + } + ], + "source": [ + "from datasets import load_dataset\n", + "\n", + "dataset = load_dataset(\"PolyAI/banking77\", split=\"train\")\n", + "data = pd.DataFrame(dataset[:1000])\n", + "data.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 25, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:13.504463Z", + "iopub.status.busy": "2024-02-16T06:26:13.504049Z", + "iopub.status.idle": "2024-02-16T06:26:13.508243Z", + "shell.execute_reply": "2024-02-16T06:26:13.507706Z" + }, + "id": "Ujp0luqRcw4M", + "outputId": "b438fed5-aa75-450d-dc84-0b3398960487", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "This dataset has 7 classes.\n", + "Classes: {32, 34, 36, 11, 13, 46, 17}\n" + ] + } + ], + "source": [ + "raw_texts, labels = data[\"text\"].values, data[\"label\"].values\n", + "num_classes = len(set(labels))\n", + "\n", + "print(f\"This dataset has {num_classes} classes.\")\n", + "print(f\"Classes: {set(labels)}\")" + ] + }, + { + "cell_type": 
"markdown", + "metadata": { + "id": "PVza57cecw4M" + }, + "source": [ + "Let's view the i-th example in the dataset:" + ] + }, + { + "cell_type": "code", + "execution_count": 26, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:13.510435Z", + "iopub.status.busy": "2024-02-16T06:26:13.510163Z", + "iopub.status.idle": "2024-02-16T06:26:13.513358Z", + "shell.execute_reply": "2024-02-16T06:26:13.512906Z" + }, + "id": "lXHi90Kecw4M", + "outputId": "af8a9b19-986f-44fe-c564-dd83e400309e", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Example Label: 11\n", + "Example Text: What can I do if my card still hasn't arrived after 2 weeks?\n" + ] + } + ], + "source": [ + "i = 1 # change this to view other examples from the dataset\n", + "print(f\"Example Label: {labels[i]}\")\n", + "print(f\"Example Text: {raw_texts[i]}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "JH7UU9Wscw4M" + }, + "source": [ + "The data is stored as two numpy arrays:\n", + "\n", + "1. `raw_texts` stores the customer service request utterances in text format\n", + "2. `labels` stores the intent category (label) for each example" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "T0d80apCcw4M" + }, + "source": [ + "
\n", + "Bringing Your Own Data (BYOD)?\n", + "\n", + "You can easily replace the above with your own text dataset, and continue with the rest of the tutorial.\n", + "\n", + "
" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YLDeD09Ncw4M" + }, + "source": [ + "Next we convert the text strings into vectors better suited as inputs for our ML models.\n", + "\n", + "We will use numeric representations from a pretrained Transformer model as embeddings of our text. The [Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) library offers simple methods to compute these embeddings for text data. Here, we load the pretrained `electra-small-discriminator` model, and then run our data through the network to extract a vector embedding of each example." + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:13.515306Z", + "iopub.status.busy": "2024-02-16T06:26:13.515126Z", + "iopub.status.idle": "2024-02-16T06:26:18.244024Z", + "shell.execute_reply": "2024-02-16T06:26:18.243354Z" + }, + "id": "DbDb6Ni6cw4M", + "colab": { + "base_uri": "https://localhost:8080/" + }, + "outputId": "b3ff5ca8-afc6-4e0b-b2be-ba5dd7c0841b" + }, + "outputs": [ + { + "output_type": "stream", + "name": "stderr", + "text": [ + "WARNING:sentence_transformers.SentenceTransformer:No sentence-transformers model found with name google/electra-small-discriminator. Creating a new one with MEAN pooling.\n" + ] + } + ], + "source": [ + "transformer = SentenceTransformer('google/electra-small-discriminator')\n", + "text_embeddings = transformer.encode(raw_texts)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Moz0KJvzcw4M" + }, + "source": [ + "Our subsequent ML model will directly operate on elements of `text_embeddings` in order to classify the customer service requests." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4FK2Q72gcw4M" + }, + "source": [ + "## 3. 
Define a classification model and compute out-of-sample predicted probabilities" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yaicOGrhcw4N" + }, + "source": [ + "A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and only train the output layer without having to rely on GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted embeddings.\n", + "\n", + "To identify label issues, cleanlab requires a probabilistic prediction from your model for each datapoint. However these predictions will be _overfit_ (and thus unreliable) for datapoints the model was previously trained on. cleanlab is intended to only be used with **out-of-sample** predicted class probabilities, i.e. on datapoints held-out from the model during the training.\n", + "\n", + "Here we obtain out-of-sample predicted class probabilities for every example in our dataset using a Logistic Regression model with cross-validation.\n", + "Make sure that the columns of your `pred_probs` are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:18.247142Z", + "iopub.status.busy": "2024-02-16T06:26:18.246652Z", + "iopub.status.idle": "2024-02-16T06:26:19.133641Z", + "shell.execute_reply": "2024-02-16T06:26:19.132953Z" + }, + "scrolled": true, + "id": "tiIqp1arcw4N" + }, + "outputs": [], + "source": [ + "model = LogisticRegression(max_iter=400)\n", + "\n", + "pred_probs = cross_val_predict(model, text_embeddings, labels, method=\"predict_proba\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "9s0pcMk1cw4N" + }, + "source": [ + "## 4. Use cleanlab to find issues in your dataset" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qa8ltsx9cw4N" + }, + "source": [ + "Given feature embeddings and the (out-of-sample) predicted class probabilities obtained from any model you have, cleanlab can quickly help you identify low-quality examples in your dataset.\n", + "\n", + "Here, we use cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we’ll simply wrap the raw texts and noisy labels in a dictionary." + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:19.136722Z", + "iopub.status.busy": "2024-02-16T06:26:19.136482Z", + "iopub.status.idle": "2024-02-16T06:26:19.139419Z", + "shell.execute_reply": "2024-02-16T06:26:19.138870Z" + }, + "id": "UNj4rWW2cw4N" + }, + "outputs": [], + "source": [ + "data_dict = {\"texts\": raw_texts, \"labels\": labels}" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "IpNmBc_Lcw4N" + }, + "source": [ + "All that is needed to audit your data is to call `find_issues()`. We pass in the predicted probabilities and the feature embeddings obtained above, but you do not necessarily need to provide all of this information depending on which types of issues you are interested in. 
The more inputs you provide, the more types of issues `Datalab` can detect in your data. Using a better model to produce these inputs generally helps cleanlab estimate issues more accurately." + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:19.141893Z", + "iopub.status.busy": "2024-02-16T06:26:19.141673Z", + "iopub.status.idle": "2024-02-16T06:26:20.809087Z", + "shell.execute_reply": "2024-02-16T06:26:20.808461Z" + }, + "scrolled": true, + "id": "R0xuUDRWcw4N", + "outputId": "6e8541c2-0e28-4907-c41a-d097212fe8a4", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Finding null issues ...\n", + "Finding label issues ...\n", + "Finding outlier issues ...\n", + "Fitting OOD estimator based on provided features ...\n", + "Finding near_duplicate issues ...\n", + "Finding non_iid issues ...\n", + "Finding class_imbalance issues ...\n", + "Finding underperforming_group issues ...\n", + "\n", + "Audit complete. 62 issues found in the dataset.\n" + ] + } + ], + "source": [ + "lab = Datalab(data_dict, label_name=\"labels\")\n", + "lab.find_issues(pred_probs=pred_probs, features=text_embeddings)" + ] + }, + { + "cell_type": "markdown", + "source": [ + "The output should look like this:\n", + "\n", + "```bash\n", + "Finding null issues ...\n", + "Finding label issues ...\n", + "Finding outlier issues ...\n", + "Fitting OOD estimator based on provided features ...\n", + "Finding near_duplicate issues ...\n", + "Finding non_iid issues ...\n", + "Finding class_imbalance issues ...\n", + "Finding underperforming_group issues ...\n", + "\n", + "Audit complete. 
62 issues found in the dataset.\n", + "```" + ], + "metadata": { + "id": "d6Iqy0vGq7w9" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4aitesJccw4N" + }, + "source": [ + "After the audit is complete, review the findings using the `report` method:" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.813057Z", + "iopub.status.busy": "2024-02-16T06:26:20.811515Z", + "iopub.status.idle": "2024-02-16T06:26:20.838760Z", + "shell.execute_reply": "2024-02-16T06:26:20.838088Z" + }, + "scrolled": true, + "id": "ALXu32nzcw4N", + "outputId": "733d2ed4-5bcd-49e6-93a7-285f3d66278c", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "Here is a summary of the different kinds of issues found in the data:\n", + "\n", + " issue_type num_issues\n", + " outlier 37\n", + "near_duplicate 14\n", + " label 10\n", + " non_iid 1\n", + "\n", + "Dataset Information: num_examples: 1000, num_classes: 7\n", + "\n", + "\n", + "---------------------- outlier issues ----------------------\n", + "\n", + "About this issue:\n", + "\tExamples that are very different from the rest of the dataset \n", + " (i.e. 
potentially out-of-distribution or rare/anomalous instances).\n", + " \n", + "\n", + "Number of examples with this issue: 37\n", + "Overall dataset quality in terms of this issue: 0.3671\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_outlier_issue outlier_score\n", + "791 True 0.024866\n", + "601 True 0.031162\n", + "863 True 0.060738\n", + "355 True 0.064199\n", + "157 True 0.065075\n", + "\n", + "\n", + "------------------ near_duplicate issues -------------------\n", + "\n", + "About this issue:\n", + "\tA (near) duplicate issue refers to two or more examples in\n", + " a dataset that are extremely similar to each other, relative\n", + " to the rest of the dataset. The examples flagged with this issue\n", + " may be exactly duplicated, or lie atypically close together when\n", + " represented as vectors (i.e. feature embeddings).\n", + " \n", + "\n", + "Number of examples with this issue: 14\n", + "Overall dataset quality in terms of this issue: 0.5961\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor\n", + "459 True 0.009544 [429] 0.000566\n", + "429 True 0.009544 [459] 0.000566\n", + "501 True 0.046044 [412, 517] 0.002781\n", + "412 True 0.046044 [501] 0.002781\n", + "698 True 0.054626 [607] 0.003314\n", + "\n", + "\n", + "----------------------- label issues -----------------------\n", + "\n", + "About this issue:\n", + "\tExamples whose given label is estimated to be potentially incorrect\n", + " (e.g. 
due to annotation error) are flagged as having label issues.\n", + " \n", + "\n", + "Number of examples with this issue: 10\n", + "Overall dataset quality in terms of this issue: 0.9930\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_label_issue label_score given_label predicted_label\n", + "379 False 0.025486 32 11\n", + "100 False 0.032102 11 36\n", + "300 False 0.037742 32 46\n", + "485 True 0.057666 17 34\n", + "159 True 0.059408 13 11\n", + "\n", + "\n", + "---------------------- non_iid issues ----------------------\n", + "\n", + "About this issue:\n", + "\tWhether the dataset exhibits statistically significant\n", + " violations of the IID assumption like:\n", + " changepoints or shift, drift, autocorrelation, etc.\n", + " The specific violation considered is whether the\n", + " examples are ordered such that almost adjacent examples\n", + " tend to have more similar feature values.\n", + " \n", + "\n", + "Number of examples with this issue: 1\n", + "Overall dataset quality in terms of this issue: 0.0000\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_non_iid_issue non_iid_score\n", + "988 True 0.563774\n", + "975 False 0.570179\n", + "997 False 0.571891\n", + "967 False 0.572357\n", + "956 False 0.577413\n", + "\n", + "Additional Information: \n", + "p-value: 0.0\n" + ] + } + ], + "source": [ + "lab.report()" + ] + }, + { + "cell_type": "markdown", + "source": [ + "The output of `lab.report()` should look like this:\n", + "\n", + "```bash\n", + "Here is a summary of the different kinds of issues found in the data:\n", + "\n", + " issue_type num_issues\n", + " outlier 37\n", + "near_duplicate 14\n", + " label 10\n", + " non_iid 1\n", + "\n", + "Dataset Information: num_examples: 1000, num_classes: 7\n", + "\n", + "\n", + "---------------------- outlier issues ----------------------\n", + "\n", + "About this issue:\n", + "\tExamples that are very different from the rest of the 
dataset\n", + " (i.e. potentially out-of-distribution or rare/anomalous instances).\n", + " \n", + "\n", + "Number of examples with this issue: 37\n", + "Overall dataset quality in terms of this issue: 0.3671\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_outlier_issue outlier_score\n", + "791 True 0.024866\n", + "601 True 0.031162\n", + "863 True 0.060738\n", + "355 True 0.064199\n", + "157 True 0.065075\n", + "\n", + "\n", + "------------------ near_duplicate issues -------------------\n", + "\n", + "About this issue:\n", + "\tA (near) duplicate issue refers to two or more examples in\n", + " a dataset that are extremely similar to each other, relative\n", + " to the rest of the dataset. The examples flagged with this issue\n", + " may be exactly duplicated, or lie atypically close together when\n", + " represented as vectors (i.e. feature embeddings).\n", + " \n", + "\n", + "Number of examples with this issue: 14\n", + "Overall dataset quality in terms of this issue: 0.5961\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor\n", + "459 True 0.009544 [429] 0.000566\n", + "429 True 0.009544 [459] 0.000566\n", + "501 True 0.046044 [412, 517] 0.002781\n", + "412 True 0.046044 [501] 0.002781\n", + "698 True 0.054626 [607] 0.003314\n", + "\n", + "\n", + "----------------------- label issues -----------------------\n", + "\n", + "About this issue:\n", + "\tExamples whose given label is estimated to be potentially incorrect\n", + " (e.g. 
due to annotation error) are flagged as having label issues.\n", + " \n", + "\n", + "Number of examples with this issue: 10\n", + "Overall dataset quality in terms of this issue: 0.9930\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_label_issue label_score given_label predicted_label\n", + "379 False 0.025486 32 11\n", + "100 False 0.032102 11 36\n", + "300 False 0.037742 32 46\n", + "485 True 0.057666 17 34\n", + "159 True 0.059408 13 11\n", + "\n", + "\n", + "---------------------- non_iid issues ----------------------\n", + "\n", + "About this issue:\n", + "\tWhether the dataset exhibits statistically significant\n", + " violations of the IID assumption like:\n", + " changepoints or shift, drift, autocorrelation, etc.\n", + " The specific violation considered is whether the\n", + " examples are ordered such that almost adjacent examples\n", + " tend to have more similar feature values.\n", + " \n", + "\n", + "Number of examples with this issue: 1\n", + "Overall dataset quality in terms of this issue: 0.0000\n", + "\n", + "Examples representing most severe instances of this issue:\n", + " is_non_iid_issue non_iid_score\n", + "988 True 0.563774\n", + "975 False 0.570179\n", + "997 False 0.571891\n", + "967 False 0.572357\n", + "956 False 0.577413\n", + "\n", + "Additional Information:\n", + "p-value: 0.0\n", + "```" + ], + "metadata": { + "id": "XI03VkWHrixv" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sAuLE6Macw4N" + }, + "source": [ + "### Label issues\n", + "\n", + "The report indicates that cleanlab identified 10 potential label issues in our dataset. We can see which examples are flagged as likely mislabeled and the label quality score for each example using the `get_issues` method, specifying `label` as an argument to focus on label issues in the data." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.843083Z", + "iopub.status.busy": "2024-02-16T06:26:20.842045Z", + "iopub.status.idle": "2024-02-16T06:26:20.852505Z", + "shell.execute_reply": "2024-02-16T06:26:20.852016Z" + }, + "scrolled": true, + "id": "6gATaXWscw4N", + "outputId": "0d0e70c5-1548-4fe6-b67e-668c8dfedf0e", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " is_label_issue label_score given_label predicted_label\n", + "0 False 0.903926 11 11\n", + "1 False 0.860544 11 11\n", + "2 False 0.658309 11 11\n", + "3 False 0.697085 11 11\n", + "4 False 0.434934 11 11" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
is_label_issuelabel_scoregiven_labelpredicted_label
0False0.9039261111
1False0.8605441111
2False0.6583091111
3False0.6970851111
4False0.4349341111
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "variable_name": "label_issues", + "summary": "{\n \"name\": \"label_issues\",\n \"rows\": 1000,\n \"fields\": [\n {\n \"column\": \"is_label_issue\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 2,\n \"samples\": [\n true,\n false\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label_score\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.2150390046430028,\n \"min\": 0.025486333476725527,\n \"max\": 0.999751760644687,\n \"num_unique_values\": 1000,\n \"samples\": [\n 0.98954913626076,\n 0.44264330724848383\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"given_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 12,\n \"min\": 11,\n \"max\": 46,\n \"num_unique_values\": 7,\n \"samples\": [\n 11,\n 13\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"predicted_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 12,\n \"min\": 11,\n \"max\": 46,\n \"num_unique_values\": 7,\n \"samples\": [\n 11,\n 13\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 32 + } + ], + "source": [ + "label_issues = lab.get_issues(\"label\")\n", + "label_issues.head()" + ] + }, + { + "cell_type": "markdown", + "source": [ + "| | is_label_issue | label_score | given_label | predicted_label |\n", + "|---|----------------|-------------|-------------|-----------------|\n", + "| 0 | False | 0.903926 | 11 | 11 |\n", + "| 1 | False | 0.860544 | 11 | 11 |\n", + "| 2 | False | 0.658309 | 11 | 11 |\n", + "| 3 | False | 0.697085 | 11 | 11 |\n", + "| 4 | False | 0.434934 | 11 | 11 |\n" + ], + "metadata": { + "id": "eBLFyMMcs5NT" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-tYlhmKYcw4N" + }, + "source": [ + "This method returns a dataframe containing a 
label 
quality score for each example. These numeric scores lie between 0 and 1, where lower scores indicate examples more likely to be mislabeled. The dataframe also contains a boolean column specifying whether each example is flagged as having a label issue (indicating it is likely mislabeled)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "XcD-oCLlcw4N" + }, + "source": [ + "We can get the subset of examples flagged with label issues, and also sort by label quality score to find the indices of the 5 most likely mislabeled examples in our dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.854743Z", + "iopub.status.busy": "2024-02-16T06:26:20.854394Z", + "iopub.status.idle": "2024-02-16T06:26:20.858961Z", + "shell.execute_reply": "2024-02-16T06:26:20.858409Z" + }, + "id": "QtloV-NBcw4N", + "outputId": "86c32e99-7dc8-470c-b102-f0f5acc13855", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "stream", + "name": "stdout", + "text": [ + "cleanlab found 10 potential label errors in the dataset.\n", + "Here are indices of the top 5 most likely errors: \n", + " [379 100 300 485 159]\n" + ] + } + ], + "source": [ + "identified_label_issues = label_issues[label_issues[\"is_label_issue\"]]\n", + "lowest_quality_labels = label_issues[\"label_score\"].argsort()[:5].to_numpy()\n", + "\n", + "print(\n", + " f\"cleanlab found {len(identified_label_issues)} potential label errors in the dataset.\\n\"\n", + " f\"Here are indices of the top 5 most likely errors: \\n {lowest_quality_labels}\"\n", + ")" + ] + }, + { + "cell_type": "markdown", + "source": [ + "The output of the cell above should look like this:\n", + "\n", + "```bash\n", + "cleanlab found 10 potential label errors in the dataset.\n", + "Here are indices of the top 5 most likely errors:\n", + " [379 100 300 485 159]\n", + "\n", + "```" + ], + 
"metadata": { + "id": "QyW7qUNKXOz5" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8J49bTeocw4N" + }, + "source": [ + "Let's review some of the most likely label errors.\n", + "\n", + "Here we display the top 5 examples identified as the most likely label errors in the dataset, together with their given (original) label and a suggested alternative label from cleanlab.\n" + ] + }, + { + "cell_type": "code", + "execution_count": 18, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.861048Z", + "iopub.status.busy": "2024-02-16T06:26:20.860742Z", + "iopub.status.idle": "2024-02-16T06:26:20.867443Z", + "shell.execute_reply": "2024-02-16T06:26:20.866904Z" + }, + "id": "c-niFVJvcw4N", + "outputId": "5bbc5217-3581-4e2e-8b56-7a1fc77cc427", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 276 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " text \\\n", + "379 Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from? \n", + "100 can you share card tracking number? \n", + "300 If I need to cash foreign transfers, how does that work? \n", + "485 Was I charged more than I should of been for a currency exchange? \n", + "159 Is there any way to see my card in the app? \n", + "\n", + " given_label suggested_label \n", + "379 32 11 \n", + "100 11 36 \n", + "300 32 46 \n", + "485 17 34 \n", + "159 13 11 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textgiven_labelsuggested_label
379Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from?3211
100can you share card tracking number?1136
300If I need to cash foreign transfers, how does that work?3246
485Was I charged more than I should of been for a currency exchange?1734
159Is there any way to see my card in the app?1311
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"data_with_suggested_labels\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"can you share card tracking number?\",\n \"Is there any way to see my card in the app?\",\n \"If I need to cash foreign transfers, how does that work?\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"given_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 10,\n \"min\": 11,\n \"max\": 32,\n \"num_unique_values\": 4,\n \"samples\": [\n 11,\n 13,\n 32\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"suggested_label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 15,\n \"min\": 11,\n \"max\": 46,\n \"num_unique_values\": 4,\n \"samples\": [\n 36,\n 34,\n 11\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 18 + } + ], + "source": [ + "data_with_suggested_labels = pd.DataFrame(\n", + " {\"text\": raw_texts, \"given_label\": labels, \"suggested_label\": label_issues[\"predicted_label\"]}\n", + ")\n", + "data_with_suggested_labels.iloc[lowest_quality_labels]" + ] + }, + { + "cell_type": "markdown", + "source": [ + "The output of the above command looks like this:\n", + "\n", + "| | text | given_label | suggested_label |\n", + "|-----|-----------------------------------------------------------------------------------------------------------|-------------|-----------------|\n", + "| 379 | Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from? | 32 | 11 |\n", + "| 100 | can you share card tracking number? | 11 | 36 |\n", + "| 300 | If I need to cash foreign transfers, how does that work? 
| 32 | 46 |\n", + "| 485 | Was I charged more than I should of been for a currency exchange? 
| 17 | 34 |\n", + "| 159 | Is there any way to see my card in the app? | 13 | 11 |\n" + ], + "metadata": { + "id": "g2dvMySPtkbL" + } + }, + { + "cell_type": "markdown", + "metadata": { + "scrolled": true, + "id": "eH8ltGj0cw4O" + }, + "source": [ + "These are very clear label errors that cleanlab has identified in this data! Note that the `given_label` does not correctly reflect the intent of these requests; whoever produced this dataset made many mistakes that are important to address before modeling the data." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ULFeD3bzcw4O" + }, + "source": [ + "### Outlier issues\n", + "\n", + "According to the report, our dataset contains some outliers.\n", + "We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via `get_issues`. We sort the resulting DataFrame by cleanlab's outlier quality score to see the most severe outliers in our dataset." + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.869718Z", + "iopub.status.busy": "2024-02-16T06:26:20.869251Z", + "iopub.status.idle": "2024-02-16T06:26:20.876386Z", + "shell.execute_reply": "2024-02-16T06:26:20.875851Z" + }, + "id": "jBLuqUXBcw4O", + "outputId": "d5d2dbc6-c708-4750-e3ea-6dcd5c24a64d", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 206 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " is_outlier_issue outlier_score\n", + "791 True 0.024866\n", + "601 True 0.031162\n", + "863 True 0.060738\n", + "355 True 0.064199\n", + "157 True 0.065075" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
is_outlier_issueoutlier_score
791True0.024866
601True0.031162
863True0.060738
355True0.064199
157True0.065075
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"outlier_issues\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"is_outlier_issue\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 1,\n \"samples\": [\n true\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"outlier_score\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 5,\n \"samples\": [\n 0.03116183541715145\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 20 + } + ], + "source": [ + "outlier_issues = lab.get_issues(\"outlier\")\n", + "outlier_issues.sort_values(\"outlier_score\").head()" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Output would look like below:\n", + "\n", + "| is_outlier_issue | outlier_score |\n", + "|------------------|---------------|\n", + "| True | 0.024866 |\n", + "| True | 0.031162 |\n", + "| True | 0.060738 |\n", + "| True | 0.064199 |\n", + "| True | 0.065075 |" + ], + "metadata": { + "id": "F7Z2VJQAujui" + } + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.878435Z", + "iopub.status.busy": "2024-02-16T06:26:20.878117Z", + "iopub.status.idle": "2024-02-16T06:26:20.884073Z", + "shell.execute_reply": "2024-02-16T06:26:20.883533Z" + }, + "id": "Kjn-muLGcw4O", + "outputId": "a5ae0a32-cac4-442d-89fc-8f7f64da9dfc", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 246 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " text label\n", + "791 withdrawal pending meaning? 46\n", + "601 $1 charge in transaction. 34\n", + "863 My atm withdraw is stillpending 46\n", + "355 explain the interbank exchange rate 32\n", + "157 lost card found, want to put it back in app 13" + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textlabel
791withdrawal pending meaning?46
601$1 charge in transaction.34
863My atm withdraw is stillpending46
355explain the interbank exchange rate32
157lost card found, want to put it back in app13
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"data\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 5,\n \"samples\": [\n \"$1 charge in transaction.\",\n \"lost card found, want to put it back in app\",\n \"My atm withdraw is stillpending\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 13,\n \"min\": 13,\n \"max\": 46,\n \"num_unique_values\": 4,\n \"samples\": [\n 34,\n 13,\n 46\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 34 + } + ], + "source": [ + "lowest_quality_outliers = outlier_issues[\"outlier_score\"].argsort()[:5]\n", + "\n", + "data.iloc[lowest_quality_outliers]" + ] + }, + { + "cell_type": "markdown", + "source": [ + "A sample output for the lowest quality outliers would look like below:\n", + "\n", + "|index|text|label|\n", + "|---|---|---|\n", + "|791|withdrawal pending meaning?|46|\n", + "|601|$1 charge in transaction\\.|34|\n", + "|863|My atm withdraw is stillpending|46|\n", + "|355|explain the interbank exchange rate|32|\n", + "|157|lost card found, want to put it back in app|13|\n" + ], + "metadata": { + "id": "kuZMsLPZYARL" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sBal-KDrcw4R" + }, + "source": [ + "We see that cleanlab has identified entries in this dataset that do not appear to be proper customer requests. Outliers in this dataset appear to be out-of-scope customer requests and other nonsensical text which does not make sense for intent classification. Carefully consider whether such outliers may detrimentally affect your data modeling, and consider removing them from the dataset if so." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "ch71b_0qcw4S" + }, + "source": [ + "### Near-duplicate issues\n", + "\n", + "According to the report, our dataset contains some sets of nearly duplicated examples.\n", + "We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues`. We sort the resulting DataFrame by cleanlab's near-duplicate quality score to see the text examples in our dataset that are most nearly duplicated." + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.886079Z", + "iopub.status.busy": "2024-02-16T06:26:20.885805Z", + "iopub.status.idle": "2024-02-16T06:26:20.894466Z", + "shell.execute_reply": "2024-02-16T06:26:20.893919Z" + }, + "id": "TbI49Rdccw4S", + "outputId": "1978cdb5-02c2-4f82-e7d5-553ad1b6dca9", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 226 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " is_near_duplicate_issue near_duplicate_score near_duplicate_sets \\\n", + "459 True 0.009544 [429] \n", + "429 True 0.009544 [459] \n", + "501 True 0.046044 [412, 517] \n", + "412 True 0.046044 [501] \n", + "698 True 0.054626 [607] \n", + "\n", + " distance_to_nearest_neighbor \n", + "459 0.000566 \n", + "429 0.000566 \n", + "501 0.002781 \n", + "412 0.002781 \n", + "698 0.003314 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
is_near_duplicate_issuenear_duplicate_scorenear_duplicate_setsdistance_to_nearest_neighbor
459True0.009544[429]0.000566
429True0.009544[459]0.000566
501True0.046044[412, 517]0.002781
412True0.046044[501]0.002781
698True0.054626[607]0.003314
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"duplicate_issues\",\n \"rows\": 5,\n \"fields\": [\n {\n \"column\": \"is_near_duplicate_issue\",\n \"properties\": {\n \"dtype\": \"boolean\",\n \"num_unique_values\": 1,\n \"samples\": [\n true\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"near_duplicate_score\",\n \"properties\": {\n \"dtype\": \"float32\",\n \"num_unique_values\": 3,\n \"samples\": [\n 0.00954437255859375\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"near_duplicate_sets\",\n \"properties\": {\n \"dtype\": \"object\",\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"distance_to_nearest_neighbor\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0.0013286758192588926,\n \"min\": 0.0005658268928527832,\n \"max\": 0.0033143162727355957,\n \"num_unique_values\": 3,\n \"samples\": [\n 0.0005658268928527832\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 35 + } + ], + "source": [ + "duplicate_issues = lab.get_issues(\"near_duplicate\")\n", + "duplicate_issues.sort_values(\"near_duplicate_score\").head()" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EawP0y1Lcw4S" + }, + "source": [ + "The results above show which examples cleanlab considers nearly duplicated (rows where `is_near_duplicate_issue == True`). Here, we see that examples 459 and 429 are nearly duplicated, as are examples 501 and 412.\n", + "\n", + "Let's view these examples to see how similar they are." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.896501Z", + "iopub.status.busy": "2024-02-16T06:26:20.896175Z", + "iopub.status.idle": "2024-02-16T06:26:20.901983Z", + "shell.execute_reply": "2024-02-16T06:26:20.901420Z" + }, + "id": "0TEW5igFcw4S", + "outputId": "86343985-26bb-44ce-f27b-610357f43030", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 182 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " text \\\n", + "459 I purchased something abroad and the incorrect exchange rate was applied. \n", + "429 I purchased something overseas and the incorrect exchange rate was applied. \n", + "\n", + " label \n", + "459 17 \n", + "429 17 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textlabel
459I purchased something abroad and the incorrect exchange rate was applied.17
429I purchased something overseas and the incorrect exchange rate was applied.17
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"data\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"I purchased something overseas and the incorrect exchange rate was applied.\",\n \"I purchased something abroad and the incorrect exchange rate was applied.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 17,\n \"max\": 17,\n \"num_unique_values\": 1,\n \"samples\": [\n 17\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 38 + } + ], + "source": [ + "data.iloc[[459, 429]]" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Sample output:\n", + "\n", + "|index|text|label|\n", + "|---|---|---|\n", + "|459|I purchased something abroad and the incorrect exchange rate was applied\\.|17|\n", + "|429|I purchased something overseas and the incorrect exchange rate was applied\\.|17|" + ], + "metadata": { + "id": "DoAyD-FZpsSm" + } + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.904159Z", + "iopub.status.busy": "2024-02-16T06:26:20.903821Z", + "iopub.status.idle": "2024-02-16T06:26:20.909681Z", + "shell.execute_reply": "2024-02-16T06:26:20.909160Z" + }, + "id": "VnbIBYaHcw4S", + "outputId": "8b00bb96-0d9d-43f6-b85f-c41e437d41b5", + "colab": { + "base_uri": "https://localhost:8080/", + "height": 198 + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + " text \\\n", + "501 The exchange rate you are using is really bad.This can't be the official interbank exchange rate. \n", + "412 The exchange rate you are using is bad.This can't be the official interbank exchange rate. 
\n", + "\n", + " label \n", + "501 17 \n", + "412 17 " + ], + "text/html": [ + "\n", + "
\n", + "
\n", + "\n", + "\n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + " \n", + "
textlabel
501The exchange rate you are using is really bad.This can't be the official interbank exchange rate.17
412The exchange rate you are using is bad.This can't be the official interbank exchange rate.17
\n", + "
\n", + "
\n", + "\n", + "
\n", + " \n", + "\n", + " \n", + "\n", + " \n", + "
\n", + "\n", + "\n", + "
\n", + " \n", + "\n", + "\n", + "\n", + " \n", + "
\n", + "
\n", + "
\n" + ], + "application/vnd.google.colaboratory.intrinsic+json": { + "type": "dataframe", + "summary": "{\n \"name\": \"data\",\n \"rows\": 2,\n \"fields\": [\n {\n \"column\": \"text\",\n \"properties\": {\n \"dtype\": \"string\",\n \"num_unique_values\": 2,\n \"samples\": [\n \"The exchange rate you are using is bad.This can't be the official interbank exchange rate.\",\n \"The exchange rate you are using is really bad.This can't be the official interbank exchange rate.\"\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n },\n {\n \"column\": \"label\",\n \"properties\": {\n \"dtype\": \"number\",\n \"std\": 0,\n \"min\": 17,\n \"max\": 17,\n \"num_unique_values\": 1,\n \"samples\": [\n 17\n ],\n \"semantic_type\": \"\",\n \"description\": \"\"\n }\n }\n ]\n}" + } + }, + "metadata": {}, + "execution_count": 39 + } + ], + "source": [ + "data.iloc[[501, 412]]" + ] + }, + { + "cell_type": "markdown", + "source": [ + "Sample output:\n", + "\n", + "|index|text|label|\n", + "|---|---|---|\n", + "|501|The exchange rate you are using is really bad\\.This can't be the official interbank exchange rate\\.|17|\n", + "|412|The exchange rate you are using is bad\\.This can't be the official interbank exchange rate\\.|17|" + ], + "metadata": { + "id": "Y4QD35-dqeGg" + } + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UG8xfTa5cw4S" + }, + "source": [ + "We see that these two sets of requests are indeed very similar to one another! Including near duplicates in a dataset may have unintended effects on models, so be wary of splitting them across training/test sets. Learn more about handling near duplicates in a dataset from [the FAQ](../faq.html#How-to-handle-near-duplicate-data-identified-by-cleanlab?)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "iefctl3rcw4S" + }, + "source": [ + "### Non-IID issues (data drift)\n", + "According to the report, our dataset does not appear to be Independent and Identically Distributed (IID). 
The overall non-IID score for the dataset (displayed below) corresponds to the `p-value` of a statistical test for whether the ordering of samples in the dataset appears related to the similarity between their feature values. A low `p-value` strongly suggests that the dataset violates the IID assumption, which is a key assumption required for conclusions (models) produced from the dataset to generalize to a larger population." + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": { + "execution": { + "iopub.execute_input": "2024-02-16T06:26:20.911817Z", + "iopub.status.busy": "2024-02-16T06:26:20.911434Z", + "iopub.status.idle": "2024-02-16T06:26:20.915049Z", + "shell.execute_reply": "2024-02-16T06:26:20.914501Z" + }, + "id": "oEMWOQQPcw4S", + "outputId": "18eca4cd-2451-4850-960c-0bf1e35d9729", + "colab": { + "base_uri": "https://localhost:8080/" + } + }, + "outputs": [ + { + "output_type": "execute_result", + "data": { + "text/plain": [ + "0.0" + ] + }, + "metadata": {}, + "execution_count": 40 + } + ], + "source": [ + "p_value = lab.get_info('non_iid')['p-value']\n", + "p_value" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "c6swPCnncw4S" + }, + "source": [ + "Here, our dataset was flagged as non-IID because the rows happened to be sorted by class label in the original data. This may be benign if we remember to shuffle rows before model training and data splitting. But if you don't know why your data was flagged as non-IID, then you should be worried about potential data drift or unexpected interactions between data points (their values may not be statistically independent). Think carefully about what future test data may look like (and whether your data is representative of the population you care about). You should not shuffle your data before the non-IID test runs, since doing so would invalidate its conclusions." 
+ ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "uCoKXqBrcw4S" + }, + "source": [ + "As demonstrated above, cleanlab can automatically shortlist the most likely issues in your dataset to help you better curate your dataset for subsequent modeling. With this shortlist, you can decide whether to fix these label issues or remove nonsensical or duplicated examples from your dataset to obtain a higher-quality dataset for training your next ML model. cleanlab's issue detection can be run with outputs from *any* type of model you initially trained.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "qnncoRWUcw4S" + }, + "source": [ + "### Easy Mode\n", + "\n", + "Cleanlab is most effective when you run this code with a good ML model. Try to produce the best ML model you can for your data (instead of the basic model from this tutorial). If you don't know the best ML model for your data, try [Cleanlab Studio](https://cleanlab.ai/blog/data-centric-ai/) which will automatically produce one for you. Super easy to use, [Cleanlab Studio](https://cleanlab.ai/blog/data-centric-ai/) is a no-code platform for data-centric AI that automatically: detects data issues (more types of issues than this cleanlab package), helps you quickly correct these data issues, confidently labels large subsets of an unlabeled dataset, and provides other smart metadata about each of your data points -- all powered by a system that automatically trains/deploys the best ML model for your data. 
[Try it for free!](https://cleanlab.ai/signup/)" + ] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.8" + } }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "In this 5-minute quickstart tutorial, we use Datalab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests which are classified into 10 categories based on their intent (you can run this same code on any text classification dataset). Cleanlab automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), or otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!\n", - "\n", - "**Overview of what we'll do in this tutorial:**\n", - "\n", - "- Use a pretrained transformer model to extract the text embeddings from the customer service requests\n", - "\n", - "- Train a simple Logistic Regression model on the text embeddings to compute out-of-sample predicted probabilities\n", - "\n", - "- Run cleanlab's `Datalab` audit with these predictions and embeddings in order to identify problems like: label issues, outliers, and near duplicates in the dataset." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - "Quickstart\n", - "
\n", - " \n", - "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.\n", - "\n", - "
\n", - " \n", - "```ipython3 \n", - "from cleanlab import Datalab\n", - "\n", - "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n", - "lab.find_issues(pred_probs=your_pred_probs, features=your_features)\n", - "\n", - "lab.report()\n", - "lab.get_issues()\n", - "```\n", - " \n", - "
\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 1. Install required dependencies\n" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "You can use `pip` to install all packages required for this tutorial as follows:\n", - "\n", - "```ipython3\n", - "!pip install sklearn sentence-transformers\n", - "!pip install \"cleanlab[datalab]\"\n", - "# Make sure to install the version corresponding to this tutorial\n", - "# E.g. if viewing master branch documentation:\n", - "# !pip install git+https://github.com/cleanlab/cleanlab.git\n", - "```" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "nbsphinx": "hidden" - }, - "outputs": [], - "source": [ - "# Package installation (hidden on docs.cleanlab.ai).\n", - "# If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)\n", - "# Package versions we used:scikit-learn==1.2.0 sentence-transformers==2.2.2\n", - "\n", - "dependencies = [\"cleanlab\", \"sentence_transformers\", \"datasets\"]\n", - "\n", - "# Supress outputs that may appear if tensorflow happens to be improperly installed: \n", - "import os \n", - "\n", - "os.environ[\"TOKENIZERS_PARALLELISM\"] = \"false\" # disable parallelism to avoid deadlocks with huggingface\n", - "\n", - "if \"google.colab\" in str(get_ipython()): # Check if it's running in Google Colab\n", - " %pip install cleanlab # for colab\n", - " cmd = ' '.join([dep for dep in dependencies if dep != \"cleanlab\"])\n", - " %pip install $cmd\n", - "else:\n", - " dependencies_test = [dependency.split('>')[0] if '>' in dependency \n", - " else dependency.split('<')[0] if '<' in dependency \n", - " else dependency.split('=')[0] for dependency in dependencies]\n", - " missing_dependencies = []\n", - " for dependency in dependencies_test:\n", - " try:\n", - " __import__(dependency)\n", - " except ImportError:\n", - " missing_dependencies.append(dependency)\n", - "\n", - " 
if len(missing_dependencies) > 0:\n", - " print(\"Missing required dependencies:\")\n", - " print(*missing_dependencies, sep=\", \")\n", - " print(\"\\nPlease install them before running the rest of this notebook.\")" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "import re \n", - "import string \n", - "import pandas as pd \n", - "from sklearn.metrics import accuracy_score, log_loss \n", - "from sklearn.model_selection import cross_val_predict \n", - "from sklearn.linear_model import LogisticRegression\n", - "from sentence_transformers import SentenceTransformer\n", - "\n", - "from cleanlab import Datalab" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "nbsphinx": "hidden" - }, - "outputs": [], - "source": [ - "# This cell is hidden from docs.cleanlab.ai \n", - "\n", - "import random \n", - "import numpy as np \n", - "\n", - "pd.set_option(\"display.max_colwidth\", None) \n", - "\n", - "SEED = 123456 # for reproducibility\n", - "np.random.seed(SEED)\n", - "random.seed(SEED)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 2. 
Load and format the text dataset\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data = pd.read_csv(\"https://s.cleanlab.ai/banking-intent-classification.csv\")\n", - "data.head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "raw_texts, labels = data[\"text\"].values, data[\"label\"].values\n", - "num_classes = len(set(labels))\n", - "\n", - "print(f\"This dataset has {num_classes} classes.\")\n", - "print(f\"Classes: {set(labels)}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's view the i-th example in the dataset:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "i = 1 # change this to view other examples from the dataset\n", - "print(f\"Example Label: {labels[i]}\")\n", - "print(f\"Example Text: {raw_texts[i]}\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The data is stored as two numpy arrays:\n", - "\n", - "1. `raw_texts` stores the customer service requests utterances in text format\n", - "2. `labels` stores the intent categories (labels) for each example" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "
\n", - "Bringing Your Own Data (BYOD)?\n", - "\n", - "You can easily replace the above with your own text dataset, and continue with the rest of the tutorial.\n", - "\n", - "
" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Next we convert the text strings into vectors better suited as inputs for our ML models. \n", - "\n", - "We will use numeric representations from a pretrained Transformer model as embeddings of our text. The [Sentence Transformers](https://huggingface.co/docs/hub/sentence-transformers) library offers simple methods to compute these embeddings for text data. Here, we load the pretrained `electra-small-discriminator` model, and then run our data through network to extract a vector embedding of each example." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "transformer = SentenceTransformer('google/electra-small-discriminator')\n", - "text_embeddings = transformer.encode(raw_texts)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Our subsequent ML model will directly operate on elements of `text_embeddings` in order to classify the customer service requests." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 3. Define a classification model and compute out-of-sample predicted probabilities" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "A typical way to leverage pretrained networks for a particular classification task is to add a linear output layer and fine-tune the network parameters on the new data. However this can be computationally intensive. Alternatively, we can freeze the pretrained weights of the network and only train the output layer without having to rely on GPU(s). Here we do this conveniently by fitting a scikit-learn linear model on top of the extracted embeddings.\n", - "\n", - "To identify label issues, cleanlab requires a probabilistic prediction from your model for each datapoint. However these predictions will be _overfit_ (and thus unreliable) for datapoints the model was previously trained on. 
cleanlab is intended to only be used with **out-of-sample** predicted class probabilities, i.e. on datapoints held-out from the model during the training.\n", - "\n", - "Here we obtain out-of-sample predicted class probabilities for every example in our dataset using a Logistic Regression model with cross-validation.\n", - "Make sure that the columns of your `pred_probs` are properly ordered with respect to the ordering of classes, which for Datalab is: lexicographically sorted by class name." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "model = LogisticRegression(max_iter=400)\n", - "\n", - "pred_probs = cross_val_predict(model, text_embeddings, labels, method=\"predict_proba\")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## 4. Use cleanlab to find issues in your dataset" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Given feature embeddings and the (out-of-sample) predicted class probabilities obtained from any model you have, cleanlab can quickly help you identify low-quality examples in your dataset.\n", - "\n", - "Here, we use cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we’ll simply wrap the training features and noisy labels in a dictionary. " - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_dict = {\"texts\": raw_texts, \"labels\": labels}" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "All that is need to audit your data is to call `find_issues()`. We pass in the predicted probabilities and the feature embeddings obtained above, but you do not necessarily need to provide all of this information depending on which types of issues you are interested in. The more inputs you provide, the more types of issues `Datalab` can detect in your data. 
Using a better model to produce these inputs will ensure cleanlab more accurately estimates issues." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "lab = Datalab(data_dict, label_name=\"labels\")\n", - "lab.find_issues(pred_probs=pred_probs, features=text_embeddings)" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "After the audit is complete, review the findings using the `report` method:" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "lab.report()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Label issues\n", - "\n", - "The report indicates that cleanlab identified many label issues in our dataset. We can see which examples are flagged as likely mislabeled and the label quality score for each example using the `get_issues` method, specifying `label` as an argument to focus on label issues in the data." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": { - "scrolled": true - }, - "outputs": [], - "source": [ - "label_issues = lab.get_issues(\"label\")\n", - "label_issues.head() " - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "This method returns a dataframe containing a label quality score for each example. These numeric scores lie between 0 and 1, where lower scores indicate examples more likely to be mislabeled. The dataframe also contains a boolean column specifying whether or not each example is identified to have a label issue (indicating it is likely mislabeled)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We can get the subset of examples flagged with label issues, and also sort by label quality score to find the indices of the 5 most likely mislabeled examples in our dataset." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "identified_label_issues = label_issues[label_issues[\"is_label_issue\"] == True]\n", - "lowest_quality_labels = label_issues[\"label_score\"].argsort()[:5].to_numpy()\n", - "\n", - "print(\n", - " f\"cleanlab found {len(identified_label_issues)} potential label errors in the dataset.\\n\"\n", - " f\"Here are indices of the top 5 most likely errors: \\n {lowest_quality_labels}\"\n", - ")" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Let's review some of the most likely label errors. \n", - "\n", - "Here we display the top 5 examples identified as the most likely label errors in the dataset, together with their given (original) label and a suggested alternative label from cleanlab.\n" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data_with_suggested_labels = pd.DataFrame(\n", - " {\"text\": raw_texts, \"given_label\": labels, \"suggested_label\": label_issues[\"predicted_label\"]}\n", - ")\n", - "data_with_suggested_labels.iloc[lowest_quality_labels]" - ] - }, - { - "cell_type": "markdown", - "metadata": { - "scrolled": true - }, - "source": [ - "These are very clear label errors that cleanlab has identified in this data! Note that the `given_label` does not correctly reflect the intent of these requests, whoever produced this dataset made many mistakes that are important to address before modeling the data." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Outlier issues\n", - "\n", - "According to the report, our dataset contains some outliers.\n", - "We can see which examples are outliers (and a numeric quality score quantifying how typical each example appears to be) via `get_issues`. We sort the resulting DataFrame by cleanlab's outlier quality score to see the most severe outliers in our dataset." 
- ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "outlier_issues = lab.get_issues(\"outlier\")\n", - "outlier_issues.sort_values(\"outlier_score\").head()" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "lowest_quality_outliers = outlier_issues[\"outlier_score\"].argsort()[:5]\n", - "\n", - "data.iloc[lowest_quality_outliers]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We see that cleanlab has identified entries in this dataset that do not appear to be proper customer requests. Outliers in this dataset appear to be out-of-scope customer requests and other nonsensical text which does not make sense for intent classification. Carefully consider whether such outliers may detrimentally affect your data modeling, and consider removing them from the dataset if so." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Near-duplicate issues\n", - "\n", - "According to the report, our dataset contains some sets of nearly duplicated examples.\n", - "We can see which examples are (nearly) duplicated (and a numeric quality score quantifying how dissimilar each example is from its nearest neighbor in the dataset) via `get_issues`. We sort the resulting DataFrame by cleanlab's near-duplicate quality score to see the text examples in our dataset that are most nearly duplicated." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "duplicate_issues = lab.get_issues(\"near_duplicate\")\n", - "duplicate_issues.sort_values(\"near_duplicate_score\").head()" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "The results above show which examples cleanlab considers nearly duplicated (rows where `is_near_duplicate_issue == True`). 
Here, we see that example 160 and 148 are nearly duplicated, as are example 546 and 514.\n", - "\n", - "Let's view these examples to see how similar they are." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data.iloc[[160, 148]]" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "data.iloc[[546, 514]]" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "We see that these two sets of request are indeed very similar to one another! Including near duplicates in a dataset may have unintended effects on models, and be wary about splitting them across training/test sets. Learn more about handling near duplicates in a dataset from [the FAQ](https://docs.cleanlab.ai/stable/tutorials/faq.html#How-to-handle-near-duplicate-data-identified-by-cleanlab?)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Non-IID issues (data drift)\n", - "According to the report, our dataset does not appear to be Independent and Identically Distributed (IID). The overall non-iid score for the dataset (displayed below) corresponds to the `p-value` of a statistical test for whether the ordering of samples in the dataset appears related to the similarity between their feature values. A low `p-value` strongly suggests that the dataset violates the IID assumption, which is a key assumption required for conclusions (models) produced from the dataset to generalize to a larger population." - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "p_value = lab.get_info('non_iid')['p-value']\n", - "p_value" - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Here, our dataset was flagged as non-IID because the rows happened to be sorted by class label in the original data. 
This may be benign if we remember to shuffle rows before model training and data splitting. But if you don't know why your data was flagged as non-IID, then you should be worried about potential data drift or unexpected interactions between data points (their values may not be statistically independent). Think carefully about what future test data may look like (and whether your data is representative of the population you care about). You should not shuffle your data before the non-IID test runs (will invalidate its conclusions)." - ] - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "As demonstrated above, cleanlab can automatically shortlist the most likely issues in your dataset to help you better curate your dataset for subsequent modeling. With this shortlist, you can decide whether to fix these label issues or remove nonsensical or duplicated examples from your dataset to obtain a higher-quality dataset for training your next ML model. cleanlab's issue detection can be run with outputs from *any* type of model you initially trained.\n" - ] - } - ], - "metadata": { - "colab": { - "collapsed_sections": [], - "name": "Text x TensorFlow", - "provenance": [] - }, - "kernelspec": { - "display_name": "Python 3 (ipykernel)", - "language": "python", - "name": "python3" - }, - "language_info": { - "codemirror_mode": { - "name": "ipython", - "version": 3 - }, - "file_extension": ".py", - "mimetype": "text/x-python", - "name": "python", - "nbconvert_exporter": "python", - "pygments_lexer": "ipython3", - "version": "3.11.7" - } - }, - "nbformat": 4, - "nbformat_minor": 4 -} + "nbformat": 4, + "nbformat_minor": 0 +} \ No newline at end of file From d112dcd63a6fbaf4e0f99f602fda8ae31a590a32 Mon Sep 17 00:00:00 2001 From: aravindputrevu Date: Sat, 9 Mar 2024 02:38:31 +0530 Subject: [PATCH 3/6] Addressed some more review comments in the notebook --- notebooks/en/issues_in_text_dataset.ipynb | 389 ++++------------------ 1 file changed, 57 insertions(+), 332 
deletions(-) diff --git a/notebooks/en/issues_in_text_dataset.ipynb b/notebooks/en/issues_in_text_dataset.ipynb index 2c1cda57..ef02982d 100644 --- a/notebooks/en/issues_in_text_dataset.ipynb +++ b/notebooks/en/issues_in_text_dataset.ipynb @@ -6,7 +6,7 @@ "id": "pw6cvzTocw4G" }, "source": [ - "# Detecting Issues in a Text Dataset with Datalab\n" + "# Detecting Issues in a Text Dataset with Cleanlab\n" ] }, { @@ -15,10 +15,10 @@ "id": "0yPBE0Xccw4J" }, "source": [ - "Authored by: [@aravindputrevu](https://huggingface.co/aravindputrevu)\n", + "Authored by: [Aravind Putrevu](https://huggingface.co/aravindputrevu)\n", "\n", "\n", - "In this 5-minute quickstart tutorial, we use Datalab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests which are classified into 10 categories based on their intent (you can run this same code on any text classification dataset). [Cleanlab](https://github.com/cleanlab/cleanlab) automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), or otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!\n", + "In this 5-minute quickstart tutorial, we use Cleanlab to detect various issues in an intent classification dataset composed of (text) customer service requests at an online bank. We consider a subset of the [Banking77-OOS Dataset](https://arxiv.org/abs/2106.04564) containing 1,000 customer service requests which are classified into 10 categories based on their intent (you can run this same code on any text classification dataset). 
[Cleanlab](https://github.com/cleanlab/cleanlab) automatically identifies bad examples in our dataset, including mislabeled data, out-of-scope examples (outliers), or otherwise ambiguous examples. Consider filtering or correcting such bad examples before you dive deep into modeling your data!\n", "\n", "**Overview of what we'll do in this tutorial:**\n", "\n", @@ -26,7 +26,7 @@ "\n", "- Train a simple Logistic Regression model on the text embeddings to compute out-of-sample predicted probabilities\n", "\n", - "- Run cleanlab's `Datalab` audit with these predictions and embeddings in order to identify problems like: label issues, outliers, and near duplicates in the dataset." + "- Run Cleanlab's `Datalab` audit with these predictions and embeddings in order to identify problems like: label issues, outliers, and near duplicates in the dataset." ] }, { @@ -35,29 +35,31 @@ "id": "o__pRLFYcw4K" }, "source": [ - "
\n", - "Quickstart\n", - "
\n", - " \n", - "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.\n", "\n", - "**Note:** If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)\n", + "## Quickstart\n", "\n", - "
\n", " \n", - "```ipython3\n", + "Already have (out-of-sample) `pred_probs` from a model trained on an existing set of labels? Maybe you have some numeric `features` as well? Run the code below to find any potential label errors in your dataset.\n", + "\n", + "**Note:** If running on Colab, you may want to use a GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)\n" + ] + }, + { + "cell_type": "code", + "source": [ "from cleanlab import Datalab\n", "\n", "lab = Datalab(data=your_dataset, label_name=\"column_name_of_labels\")\n", "lab.find_issues(pred_probs=your_pred_probs, features=your_features)\n", "\n", "lab.report()\n", - "lab.get_issues()\n", - "```\n", - " \n", - "
\n", - "
" - ] + "lab.get_issues()\n" + ], + "metadata": { + "id": "qaZA0cFs1fW4" + }, + "execution_count": null, + "outputs": [] }, { "cell_type": "markdown", @@ -65,7 +67,7 @@ "id": "dp4lpApmcw4K" }, "source": [ - "## 1. Install required dependencies\n" + "## Install required dependencies\n" ] }, { @@ -84,138 +86,14 @@ "!pip install -U \"cleanlab[datalab]\"" ], "metadata": { - "id": "fRsBIj3L_RUb", - "colab": { - "base_uri": "https://localhost:8080/", - "height": 1000 - }, - "outputId": "2b22c97c-2373-4740-d394-7486277aa694" + "id": "fRsBIj3L_RUb" }, - "execution_count": 41, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Requirement already satisfied: scikit-learn in /usr/local/lib/python3.10/dist-packages (1.2.2)\n", - "Collecting scikit-learn\n", - " Downloading scikit_learn-1.4.1.post1-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (12.1 MB)\n", - "\u001b[2K \u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m12.1/12.1 MB\u001b[0m \u001b[31m38.9 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m\n", - "\u001b[?25hRequirement already satisfied: sentence-transformers in /usr/local/lib/python3.10/dist-packages (2.4.0)\n", - "Requirement already satisfied: datasets in /usr/local/lib/python3.10/dist-packages (2.17.1)\n", - "Requirement already satisfied: numpy<2.0,>=1.19.5 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.25.2)\n", - "Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.11.4)\n", - "Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (1.3.2)\n", - "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn) (3.3.0)\n", - "Requirement already satisfied: transformers<5.0.0,>=4.32.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (4.37.2)\n", - "Requirement already satisfied: tqdm in 
/usr/local/lib/python3.10/dist-packages (from sentence-transformers) (4.66.2)\n", - "Requirement already satisfied: torch>=1.11.0 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (2.1.0+cu121)\n", - "Requirement already satisfied: huggingface-hub>=0.15.1 in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (0.20.3)\n", - "Requirement already satisfied: Pillow in /usr/local/lib/python3.10/dist-packages (from sentence-transformers) (9.4.0)\n", - "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets) (3.13.1)\n", - "Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (14.0.2)\n", - "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets) (0.6)\n", - "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (0.3.8)\n", - "Requirement already satisfied: pandas in /usr/local/lib/python3.10/dist-packages (from datasets) (1.5.3)\n", - "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2.31.0)\n", - "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets) (3.4.1)\n", - "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets) (0.70.16)\n", - "Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets) (2023.6.0)\n", - "Requirement already satisfied: aiohttp in /usr/local/lib/python3.10/dist-packages (from datasets) (3.9.3)\n", - "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets) (23.2)\n", - "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets) (6.0.1)\n", - "Requirement already satisfied: aiosignal>=1.1.2 in 
/usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.3.1)\n", - "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (23.2.0)\n", - "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.4.1)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (6.0.5)\n", - "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (1.9.4)\n", - "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets) (4.0.3)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.15.1->sentence-transformers) (4.9.0)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.3.2)\n", - "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (3.6)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2.0.7)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets) (2024.2.2)\n", - "Requirement already satisfied: sympy in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (1.12)\n", - "Requirement already satisfied: networkx in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (3.2.1)\n", - "Requirement already satisfied: jinja2 in /usr/local/lib/python3.10/dist-packages (from torch>=1.11.0->sentence-transformers) (3.1.3)\n", - "Requirement already satisfied: triton==2.1.0 in /usr/local/lib/python3.10/dist-packages (from 
torch>=1.11.0->sentence-transformers) (2.1.0)\n", - "Requirement already satisfied: regex!=2019.12.17 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (2023.12.25)\n", - "Requirement already satisfied: tokenizers<0.19,>=0.14 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (0.15.2)\n", - "Requirement already satisfied: safetensors>=0.4.1 in /usr/local/lib/python3.10/dist-packages (from transformers<5.0.0,>=4.32.0->sentence-transformers) (0.4.2)\n", - "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2.8.2)\n", - "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas->datasets) (2023.4)\n", - "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas->datasets) (1.16.0)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/lib/python3.10/dist-packages (from jinja2->torch>=1.11.0->sentence-transformers) (2.1.5)\n", - "Requirement already satisfied: mpmath>=0.19 in /usr/local/lib/python3.10/dist-packages (from sympy->torch>=1.11.0->sentence-transformers) (1.3.0)\n", - "Installing collected packages: scikit-learn\n", - " Attempting uninstall: scikit-learn\n", - " Found existing installation: scikit-learn 1.2.2\n", - " Uninstalling scikit-learn-1.2.2:\n", - " Successfully uninstalled scikit-learn-1.2.2\n", - "Successfully installed scikit-learn-1.4.1.post1\n" - ] - }, - { - "output_type": "display_data", - "data": { - "application/vnd.colab-display-data+json": { - "pip_warning": { - "packages": [ - "sklearn" - ] - }, - "id": "207dfdbd8b714496a56fb33ee0f11a84" - } - }, - "metadata": {} - }, - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Requirement already satisfied: cleanlab[datalab] in /usr/local/lib/python3.10/dist-packages (2.6.0)\n", - "Requirement 
already satisfied: numpy>=1.22.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (1.25.2)\n", - "Requirement already satisfied: scikit-learn>=1.1 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (1.4.1.post1)\n", - "Requirement already satisfied: tqdm>=4.53.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (4.66.2)\n", - "Requirement already satisfied: pandas>=1.4.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (1.5.3)\n", - "Requirement already satisfied: termcolor>=2.4.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (2.4.0)\n", - "Requirement already satisfied: datasets>=2.7.0 in /usr/local/lib/python3.10/dist-packages (from cleanlab[datalab]) (2.17.1)\n", - "Requirement already satisfied: filelock in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.13.1)\n", - "Requirement already satisfied: pyarrow>=12.0.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (14.0.2)\n", - "Requirement already satisfied: pyarrow-hotfix in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.6)\n", - "Requirement already satisfied: dill<0.3.9,>=0.3.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.3.8)\n", - "Requirement already satisfied: requests>=2.19.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (2.31.0)\n", - "Requirement already satisfied: xxhash in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.4.1)\n", - "Requirement already satisfied: multiprocess in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.70.16)\n", - "Requirement already satisfied: fsspec[http]<=2023.10.0,>=2023.1.0 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (2023.6.0)\n", - "Requirement already satisfied: aiohttp in 
/usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (3.9.3)\n", - "Requirement already satisfied: huggingface-hub>=0.19.4 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (0.20.3)\n", - "Requirement already satisfied: packaging in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (23.2)\n", - "Requirement already satisfied: pyyaml>=5.1 in /usr/local/lib/python3.10/dist-packages (from datasets>=2.7.0->cleanlab[datalab]) (6.0.1)\n", - "Requirement already satisfied: python-dateutil>=2.8.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.4.0->cleanlab[datalab]) (2.8.2)\n", - "Requirement already satisfied: pytz>=2020.1 in /usr/local/lib/python3.10/dist-packages (from pandas>=1.4.0->cleanlab[datalab]) (2023.4)\n", - "Requirement already satisfied: scipy>=1.6.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.1->cleanlab[datalab]) (1.11.4)\n", - "Requirement already satisfied: joblib>=1.2.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.1->cleanlab[datalab]) (1.3.2)\n", - "Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=1.1->cleanlab[datalab]) (3.3.0)\n", - "Requirement already satisfied: aiosignal>=1.1.2 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.3.1)\n", - "Requirement already satisfied: attrs>=17.3.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (23.2.0)\n", - "Requirement already satisfied: frozenlist>=1.1.1 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.4.1)\n", - "Requirement already satisfied: multidict<7.0,>=4.5 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (6.0.5)\n", - "Requirement already satisfied: yarl<2.0,>=1.0 in /usr/local/lib/python3.10/dist-packages 
(from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (1.9.4)\n", - "Requirement already satisfied: async-timeout<5.0,>=4.0 in /usr/local/lib/python3.10/dist-packages (from aiohttp->datasets>=2.7.0->cleanlab[datalab]) (4.0.3)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/lib/python3.10/dist-packages (from huggingface-hub>=0.19.4->datasets>=2.7.0->cleanlab[datalab]) (4.9.0)\n", - "Requirement already satisfied: six>=1.5 in /usr/local/lib/python3.10/dist-packages (from python-dateutil>=2.8.1->pandas>=1.4.0->cleanlab[datalab]) (1.16.0)\n", - "Requirement already satisfied: charset-normalizer<4,>=2 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (3.3.2)\n", - "Requirement already satisfied: idna<4,>=2.5 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (3.6)\n", - "Requirement already satisfied: urllib3<3,>=1.21.1 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (2.0.7)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/lib/python3.10/dist-packages (from requests>=2.19.0->datasets>=2.7.0->cleanlab[datalab]) (2024.2.2)\n" - ] - } - ] + "execution_count": null, + "outputs": [] }, { "cell_type": "code", - "execution_count": 22, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:13.467211Z", @@ -240,7 +118,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:13.472374Z", @@ -269,12 +147,12 @@ "id": "yj_5JcO1cw4L" }, "source": [ - "## 2. 
Load and format the text dataset\n" + "## Load and format the text dataset\n" ] }, { "cell_type": "code", - "execution_count": 24, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:13.476949Z", @@ -584,7 +462,7 @@ }, { "cell_type": "code", - "execution_count": 25, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:13.504463Z", @@ -627,7 +505,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:13.510435Z", @@ -696,7 +574,7 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:13.515306Z", @@ -704,21 +582,9 @@ "iopub.status.idle": "2024-02-16T06:26:18.244024Z", "shell.execute_reply": "2024-02-16T06:26:18.243354Z" }, - "id": "DbDb6Ni6cw4M", - "colab": { - "base_uri": "https://localhost:8080/" - }, - "outputId": "b3ff5ca8-afc6-4e0b-b2be-ba5dd7c0841b" + "id": "DbDb6Ni6cw4M" }, - "outputs": [ - { - "output_type": "stream", - "name": "stderr", - "text": [ - "WARNING:sentence_transformers.SentenceTransformer:No sentence-transformers model found with name google/electra-small-discriminator. Creating a new one with MEAN pooling.\n" - ] - } - ], + "outputs": [], "source": [ "transformer = SentenceTransformer('google/electra-small-discriminator')\n", "text_embeddings = transformer.encode(raw_texts)" @@ -739,7 +605,7 @@ "id": "4FK2Q72gcw4M" }, "source": [ - "## 3. Define a classification model and compute out-of-sample predicted probabilities" + "## Define a classification model and compute out-of-sample predicted probabilities" ] }, { @@ -758,7 +624,7 @@ }, { "cell_type": "code", - "execution_count": 28, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:18.247142Z", @@ -782,7 +648,7 @@ "id": "9s0pcMk1cw4N" }, "source": [ - "## 4. 
Use cleanlab to find issues in your dataset" + "## Use Cleanlab to find issues in your dataset" ] }, { @@ -793,12 +659,12 @@ "source": [ "Given feature embeddings and the (out-of-sample) predicted class probabilities obtained from any model you have, cleanlab can quickly help you identify low-quality examples in your dataset.\n", "\n", - "Here, we use cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we’ll simply wrap the training features and noisy labels in a dictionary." + "Here, we use Cleanlab's `Datalab` to find issues in our data. Datalab offers several ways of loading the data; we’ll simply wrap the training features and noisy labels in a dictionary." ] }, { "cell_type": "code", - "execution_count": 29, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:19.136722Z", @@ -824,7 +690,7 @@ }, { "cell_type": "code", - "execution_count": 30, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:19.141893Z", @@ -833,30 +699,9 @@ "shell.execute_reply": "2024-02-16T06:26:20.808461Z" }, "scrolled": true, - "id": "R0xuUDRWcw4N", - "outputId": "6e8541c2-0e28-4907-c41a-d097212fe8a4", - "colab": { - "base_uri": "https://localhost:8080/" - } + "id": "R0xuUDRWcw4N" }, - "outputs": [ - { - "output_type": "stream", - "name": "stdout", - "text": [ - "Finding null issues ...\n", - "Finding label issues ...\n", - "Finding outlier issues ...\n", - "Fitting OOD estimator based on provided features ...\n", - "Finding near_duplicate issues ...\n", - "Finding non_iid issues ...\n", - "Finding class_imbalance issues ...\n", - "Finding underperforming_group issues ...\n", - "\n", - "Audit complete. 
62 issues found in the dataset.\n" - ] - } - ], + "outputs": [], "source": [ "lab = Datalab(data_dict, label_name=\"labels\")\n", "lab.find_issues(pred_probs=pred_probs, features=text_embeddings)" @@ -895,7 +740,7 @@ }, { "cell_type": "code", - "execution_count": 31, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:20.813057Z", @@ -1017,113 +862,6 @@ "lab.report()" ] }, - { - "cell_type": "markdown", - "source": [ - "The output for the `lab.report()` would look like below:\n", - "\n", - "```bash\n", - "Here is a summary of the different kinds of issues found in the data:\n", - "\n", - " issue_type num_issues\n", - " outlier 37\n", - "near_duplicate 14\n", - " label 10\n", - " non_iid 1\n", - "\n", - "Dataset Information: num_examples: 1000, num_classes: 7\n", - "\n", - "\n", - "---------------------- outlier issues ----------------------\n", - "\n", - "About this issue:\n", - "\tExamples that are very different from the rest of the dataset\n", - " (i.e. potentially out-of-distribution or rare/anomalous instances).\n", - " \n", - "\n", - "Number of examples with this issue: 37\n", - "Overall dataset quality in terms of this issue: 0.3671\n", - "\n", - "Examples representing most severe instances of this issue:\n", - " is_outlier_issue outlier_score\n", - "791 True 0.024866\n", - "601 True 0.031162\n", - "863 True 0.060738\n", - "355 True 0.064199\n", - "157 True 0.065075\n", - "\n", - "\n", - "------------------ near_duplicate issues -------------------\n", - "\n", - "About this issue:\n", - "\tA (near) duplicate issue refers to two or more examples in\n", - " a dataset that are extremely similar to each other, relative\n", - " to the rest of the dataset. The examples flagged with this issue\n", - " may be exactly duplicated, or lie atypically close together when\n", - " represented as vectors (i.e. 
feature embeddings).\n", - " \n", - "\n", - "Number of examples with this issue: 14\n", - "Overall dataset quality in terms of this issue: 0.5961\n", - "\n", - "Examples representing most severe instances of this issue:\n", - " is_near_duplicate_issue near_duplicate_score near_duplicate_sets distance_to_nearest_neighbor\n", - "459 True 0.009544 [429] 0.000566\n", - "429 True 0.009544 [459] 0.000566\n", - "501 True 0.046044 [412, 517] 0.002781\n", - "412 True 0.046044 [501] 0.002781\n", - "698 True 0.054626 [607] 0.003314\n", - "\n", - "\n", - "----------------------- label issues -----------------------\n", - "\n", - "About this issue:\n", - "\tExamples whose given label is estimated to be potentially incorrect\n", - " (e.g. due to annotation error) are flagged as having label issues.\n", - " \n", - "\n", - "Number of examples with this issue: 10\n", - "Overall dataset quality in terms of this issue: 0.9930\n", - "\n", - "Examples representing most severe instances of this issue:\n", - " is_label_issue label_score given_label predicted_label\n", - "379 False 0.025486 32 11\n", - "100 False 0.032102 11 36\n", - "300 False 0.037742 32 46\n", - "485 True 0.057666 17 34\n", - "159 True 0.059408 13 11\n", - "\n", - "\n", - "---------------------- non_iid issues ----------------------\n", - "\n", - "About this issue:\n", - "\tWhether the dataset exhibits statistically significant\n", - " violations of the IID assumption like:\n", - " changepoints or shift, drift, autocorrelation, etc.\n", - " The specific violation considered is whether the\n", - " examples are ordered such that almost adjacent examples\n", - " tend to have more similar feature values.\n", - " \n", - "\n", - "Number of examples with this issue: 1\n", - "Overall dataset quality in terms of this issue: 0.0000\n", - "\n", - "Examples representing most severe instances of this issue:\n", - " is_non_iid_issue non_iid_score\n", - "988 True 0.563774\n", - "975 False 0.570179\n", - "997 False 0.571891\n", - "967 
False 0.572357\n", - "956 False 0.577413\n", - "\n", - "Additional Information:\n", - "p-value: 0.0\n", - "```" - ], - "metadata": { - "id": "XI03VkWHrixv" - } - }, { "cell_type": "markdown", "metadata": { @@ -1137,7 +875,7 @@ }, { "cell_type": "code", - "execution_count": 32, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:20.843083Z", @@ -1490,7 +1228,7 @@ }, { "cell_type": "code", - "execution_count": 33, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:20.854743Z", @@ -1525,22 +1263,6 @@ ")" ] }, - { - "cell_type": "markdown", - "source": [ - "The output for the above cell would look like below:\n", - "\n", - "```bash\n", - "cleanlab found 10 potential label errors in the dataset.\n", - "Here are indices of the top 5 most likely errors:\n", - " [379 100 300 485 159]\n", - "\n", - "```" - ], - "metadata": { - "id": "QyW7qUNKXOz5" - } - }, { "cell_type": "markdown", "metadata": { @@ -1554,7 +1276,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:20.861048Z", @@ -1914,7 +1636,7 @@ }, { "cell_type": "code", - "execution_count": 20, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:20.869718Z", @@ -2237,7 +1959,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:20.878435Z", @@ -2582,7 +2304,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:20.886079Z", @@ -2918,7 +2640,7 @@ }, { "cell_type": "code", - "execution_count": 38, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:20.896501Z", @@ -3223,7 +2945,7 @@ }, { "cell_type": "code", - "execution_count": 39, + 
"execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:20.904159Z", @@ -3547,7 +3269,7 @@ }, { "cell_type": "code", - "execution_count": 40, + "execution_count": null, "metadata": { "execution": { "iopub.execute_input": "2024-02-16T06:26:20.911817Z", @@ -3602,15 +3324,18 @@ "id": "qnncoRWUcw4S" }, "source": [ - "### Easy Mode\n", + "### Cleanlab Open-Source Project\n", + "\n", + "[Cleanlab](https://github.com/cleanlab/cleanlab) is a standard data-centric AI package designed to address data quality issues for messy, real-world data.\n", "\n", - "Cleanlab is most effective when you run this code with a good ML model. Try to produce the best ML model you can for your data (instead of the basic model from this tutorial). If you don't know the best ML model for your data, try [Cleanlab Studio](https://cleanlab.ai/blog/data-centric-ai/) which will automatically produce one for you. Super easy to use, [Cleanlab Studio](https://cleanlab.ai/blog/data-centric-ai/) is no-code platform for data-centric AI that automatically: detects data issues (more types of issues than this cleanlab package), helps you quickly correct these data issues, confidently labels large subsets of an unlabeled dataset, and provides other smart metadata about each of your data points -- all powered by a system that automatically trains/deploys the best ML model for your data. [Try it for free!](https://cleanlab.ai/signup/)" + "Consider giving the Cleanlab GitHub repository a star. We also welcome [contributions](https://github.com/cleanlab/cleanlab/issues?q=is:issue+is:open+label:%22good+first+issue%22) to the project." 
] } ], "metadata": { "colab": { - "provenance": [] + "provenance": [], + "toc_visible": true }, "kernelspec": { "display_name": "Python 3 (ipykernel)", From 89cd83a9cf58df372df59635093aed5e2e5eca1a Mon Sep 17 00:00:00 2001 From: aravindputrevu Date: Mon, 11 Mar 2024 14:11:37 +0530 Subject: [PATCH 4/6] Changes to the TOC tree --- notebooks/en/_toctree.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 8c699a1b..6df0a009 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -1,5 +1,7 @@ - title: Open-Source AI Cookbook sections: + - local: issues_in_text_dataset + title: Detecting Issues in a Text Dataset with Cleanlab - local: index title: Open-Source AI Cookbook - local: stable_diffusion_interpolation @@ -20,7 +22,5 @@ title: Advanced RAG on HuggingFace documentation using LangChain - local: rag_evaluation title: RAG Evaluation - - local: issues_in_text_dataset - title: Detecting Issues in a Text Dataset with Datalab - local: prompt_tuning_peft title: Prompt tuning with PEFT From 06e8b38cce09e1fd7c9d3556dd421904f7a57335 Mon Sep 17 00:00:00 2001 From: Maria Khalusova Date: Mon, 11 Mar 2024 11:45:10 -0400 Subject: [PATCH 5/6] Moved the recipe after the index page --- notebooks/en/_toctree.yml | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml index 6df0a009..4ccb23d4 100644 --- a/notebooks/en/_toctree.yml +++ b/notebooks/en/_toctree.yml @@ -1,9 +1,9 @@ - title: Open-Source AI Cookbook sections: - - local: issues_in_text_dataset - title: Detecting Issues in a Text Dataset with Cleanlab - local: index title: Open-Source AI Cookbook + - local: issues_in_text_dataset + title: Detecting Issues in a Text Dataset with Cleanlab - local: stable_diffusion_interpolation title: Stable Diffusion Interpolation - local: rag_with_hugging_face_gemma_mongodb From 1ced1f1b695464204cf2daaa6fdce55b0d83e7c0 Mon Sep 
17 00:00:00 2001 From: Maria Khalusova Date: Mon, 11 Mar 2024 12:28:23 -0400 Subject: [PATCH 6/6] Fixes missing columns in tables --- notebooks/en/issues_in_text_dataset.ipynb | 34 +++++++++++------------ 1 file changed, 17 insertions(+), 17 deletions(-) diff --git a/notebooks/en/issues_in_text_dataset.ipynb b/notebooks/en/issues_in_text_dataset.ipynb index ef02982d..e568de20 100644 --- a/notebooks/en/issues_in_text_dataset.ipynb +++ b/notebooks/en/issues_in_text_dataset.ipynb @@ -1196,8 +1196,8 @@ { "cell_type": "markdown", "source": [ - "| is_label_issue | label_score | given_label | predicted_label |\n", - "|----------------|-------------|-------------|-----------------|\n", + "| | is_label_issue | label_score | given_label | predicted_label |\n", + "|----------------|-------------|-------------|-----------------|-----------------|\n", "| 0 | False | 0.903926 | 11 | 11 |\n", "| 1 | False | 0.860544 | 11 | 11 |\n", "| 2 | False | 0.658309 | 11 | 11 |\n", @@ -1600,13 +1600,13 @@ "source": [ " The output to the above command would like below:\n", " \n", - " | text | given_label | suggested_label |\n", - "|------|-----------------------------------------------------------------------------------------------------------|-----------------|\n", - "| 379 | Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from? | 32 |\n", - "| 100 | can you share card tracking number? | 11 |\n", - "| 300 | If I need to cash foreign transfers, how does that work? | 32 |\n", - "| 485 | Was I charged more than I should of been for a currency exchange? | 17 |\n", - "| 159 | Is there any way to see my card in the app? 
| 13 |\n" + "| | text | given_label | suggested_label |\n", + "|------|-----------------------------------------------------------------------------------------------------------|----------------|-----------------|\n", + "| 379 | Is there a specific source that the exchange rate for the transfer I'm planning on making is pulled from? | 32 | 11 |\n", + "| 100 | can you share card tracking number? | 11 | 36 |\n", + "| 300 | If I need to cash foreign transfers, how does that work? | 32 | 46 |\n", + "| 485 | Was I charged more than I should of been for a currency exchange? | 17 | 34 |\n", + "| 159 | Is there any way to see my card in the app? | 13 | 11 |\n" ], "metadata": { "id": "g2dvMySPtkbL" @@ -1945,13 +1945,13 @@ "source": [ "Output would look like below:\n", "\n", - "| is_outlier_issue | outlier_score |\n", - "|------------------|---------------|\n", - "| True | 0.024866 |\n", - "| True | 0.031162 |\n", - "| True | 0.060738 |\n", - "| True | 0.064199 |\n", - "| True | 0.065075 |" + "| | is_outlier_issue | outlier_score |\n", + "|---| ----------------|---------------|\n", + "| 791 | True | 0.024866 |\n", + "| 601 | True | 0.031162 |\n", + "| 863 | True | 0.060738 |\n", + "| 355 | True | 0.064199 |\n", + "| 157 | True | 0.065075 |" ], "metadata": { "id": "F7Z2VJQAujui" @@ -3357,4 +3357,4 @@ }, "nbformat": 4, "nbformat_minor": 0 -} \ No newline at end of file +}