Detecting Issues in a Text Dataset with Datalab #30

aravindputrevu · 2024-02-17T00:44:27Z

What does this PR do?

This notebook shows how to detect issues in a text dataset with a data-centric AI approach, using the open-source package Cleanlab and its Datalab object.

Outcomes of the notebook:

- Compute out-of-sample predicted probabilities for a sample dataset using cross-validation.
- Use Datalab to identify issues such as noisy labels, outliers, (near) duplicates, and other types of problems.
- View the issue summaries and other information about the sample dataset.
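
For readers skimming this PR, here is a minimal sketch of what the first two outcomes could look like. The toy texts, the labels, the TF-IDF features and the logistic-regression model are assumptions made for illustration and are not taken from the notebook. `find_issues_with_datalab` is a hypothetical helper name, and the toy data is far too small to give meaningful results.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Toy data invented for illustration; the notebook uses a real text dataset.
texts = [
    "my card has not arrived yet",
    "when will my new card be delivered",
    "I am still waiting for my card",
    "how long does card delivery take",
    "my card was supposed to arrive last week",
    "track the delivery of my card",
    "I want to change my PIN",
    "how do I reset my PIN",
    "can I choose a new PIN",
    "where can I update my PIN",
    "I forgot my PIN and need a new one",
    "is it possible to change the PIN online",
]
labels = np.array([0] * 6 + [1] * 6)

# Assumed feature extractor and classifier; any model with predict_proba works.
features = TfidfVectorizer().fit_transform(texts)
model = LogisticRegression(max_iter=1000)

# Out-of-sample: each row is predicted by a model that did not train on it.
pred_probs = cross_val_predict(
    model, features, labels, cv=3, method="predict_proba"
)


def find_issues_with_datalab(texts, labels, pred_probs):
    """Hypothetical helper: audit the dataset with cleanlab's Datalab."""
    from cleanlab import Datalab  # needs: pip install "cleanlab[datalab]"

    data = {"text": list(texts), "label": np.asarray(labels).tolist()}
    lab = Datalab(data=data, label_name="label")
    lab.find_issues(pred_probs=pred_probs)
    return lab
```

With cleanlab installed, `lab = find_issues_with_datalab(texts, labels, pred_probs)` followed by `lab.report()` or `lab.get_issues()` would show the summaries. Some checks, such as near-duplicate detection, additionally rely on numerical features (for example text embeddings) passed through the `features` argument of `find_issues`.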

@MKhalusova, I would appreciate your review.

review-notebook-app · 2024-02-17T00:44:32Z

Check out this pull request on ReviewNB to see visual diffs and provide feedback on Jupyter Notebooks.

MKhalusova · 2024-02-19T18:56:18Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,571 @@
+{


Can you add a link to cleanlab's GitHub repo?

MKhalusova · 2024-02-19T18:56:19Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,571 @@
+{


"Make sure to install the version corresponding to this tutorial"
Which versions are these? Perhaps we can recommend installing the latest ones (add the -U flag to pip install to get the newest versions). We can probably also remove the comment.

MKhalusova · 2024-02-19T18:56:19Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,571 @@
+{


Can we just leave a pip install?

aravindputrevu (reply) · 2024-02-27

Yes, removing this block.

MKhalusova · 2024-02-19T18:56:19Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,571 @@
+{


Line #1. # Package installation (hidden on docs.cleanlab.ai).
This comment can be removed.

MKhalusova · 2024-02-19T18:56:19Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,571 @@
+{


Line #2. # If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)
You can move this note to the introduction.

MKhalusova · 2024-02-19T18:56:19Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,571 @@
+{


Add the output

MKhalusova · 2024-02-19T18:56:19Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,571 @@
+{


Add the output

MKhalusova · 2024-02-19T18:56:19Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,571 @@
+{


Add the output

MKhalusova · 2024-02-19T18:56:19Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,571 @@
+{


Add the output here as well.

MKhalusova · 2024-02-19T18:56:19Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,571 @@
+{


It would be cool to quickly show how you would update the dataset based on the report (e.g. remove all of the "bad" examples, or add a column indicating which ones are good to use and which ones are not). I imagine one would want to run such a cleanup on a schedule and somehow integrate the results.

aravindputrevu (reply) · 2024-02-27

Well, Cleanlab is the package that helps identify these issues. One can simply delete the near duplicates or outliers from the dataframe and export the result as CSV. Cleanlab calls this a Cleanset (short for cleaned dataset). I'd want users to inspect the flagged data points and decide for themselves which ones to delete.

The goal of the project is to showcase the problems within a dataset, so in my opinion it would be a bit difficult to integrate the package as a workflow (although it could be done with a GitHub Action or similar).

Hence the last paragraph: Cleanlab Studio provides the necessary UI and a long-term solution for maintaining datasets.
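
To make the two options from this exchange concrete, here is a minimal pandas sketch of either adding a good-to-use column or dropping flagged rows and exporting CSV. The data frame contents and the flag values are made up for illustration. The boolean column names follow the per-example issue columns that Datalab reports, but in practice the `issues` frame would come from the Datalab results rather than being written by hand.

```python
import pandas as pd

# Toy dataset and hand-written flags, invented for illustration only.
df = pd.DataFrame(
    {
        "text": [
            "my card has not arrived",
            "how do I reset my PIN",
            "my card has not arrived!",
            "asdf qwerty zxcv",
            "where is my refund",
        ],
        "label": ["card_arrival", "card_arrival", "card_arrival", "change_pin", "refund"],
    }
)
issues = pd.DataFrame(
    {
        "is_label_issue": [False, True, False, False, False],
        "is_outlier_issue": [False, False, False, True, False],
        "is_near_duplicate_issue": [False, False, True, False, False],
    }
)

# A row is flagged if any of the selected issue types applies to it.
issue_columns = ["is_label_issue", "is_outlier_issue", "is_near_duplicate_issue"]
flagged = issues[issue_columns].any(axis=1)

# Option 1: keep every row and record whether it is considered good to use.
df_flagged = df.assign(good_to_use=~flagged)

# Option 2: drop the flagged rows and export the remainder as CSV text.
df_clean = df[~flagged].reset_index(drop=True)
csv_text = df_clean.to_csv(index=False)
```

Option 1 keeps the decision reversible, which fits the point above that users should review flagged examples before deleting anything. Passing a file path to `to_csv` would write the cleaned data to disk instead of returning a string.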

MKhalusova · 2024-02-19T18:59:07Z

notebooks/en/_toctree.yml

@@ -12,3 +12,5 @@
    title: Advanced RAG on HuggingFace documentation using LangChain
  - local: rag_evaluation
    title: RAG Evaluation
+ - local: issues_in_text_dataset


Feel free to move this to the top, right after the index page

Also it looks like there's a space missing here that breaks the CI/CD check. Make sure it's aligned with other entries

MKhalusova · 2024-02-19T18:59:20Z

notebooks/en/index.md

@@ -12,6 +12,7 @@ Check out the recently added notebooks:
 - [Fine-tuning a Code LLM on Custom Code on a single GPU](fine_tuning_code_llm_on_single_gpu)
 - [RAG Evaluation Using Synthetic data and LLM-As-A-Judge](rag_evaluation)
 - [Advanced RAG on HuggingFace documentation using LangChain](advanced_rag)
+- [Detecting Issues in a Text Dataset with Datalab](issues_in_text_dataset)


feel free to add it to the top of the list

MKhalusova · 2024-02-19T19:00:41Z

Awesome tutorial, @aravindputrevu!
I left some comments. My main feedback:

- Let's include the informative outputs (where you print things out, where you show the reports, etc.).
- Consider adding an alternative way of loading data: the same dataset is available on the Hugging Face Hub, and this can be a great option for larger datasets.
- At the end of the tutorial, it would be cool to show how to integrate the results back into the dataset.

Also, please add yourself as an author right after the main title, like this: "Authored by: Your Name". Feel free to link either your Hugging Face profile or your GitHub profile; it's up to you.

aravindputrevu · 2024-02-20T15:14:25Z

@MKhalusova Thanks for the review, I will be working on the comments.

aravindputrevu · 2024-02-27T22:48:37Z

@MKhalusova I have addressed the review comments and responded to the other questions. Please let me know what you think.

MKhalusova · 2024-02-28T16:23:30Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,3635 @@
+{


Please use the actual name and not the account handle for the author, for consistency with other notebooks, i.e.
[FirstName LastName](link_to_HF_profile)

MKhalusova · 2024-02-28T16:23:30Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,3635 @@
+{


The doc-builder that we use to publish notebooks seems to have issues with the <div> tags in this markdown. Please reformat to remove them and leave only the markdown formatting.

MKhalusova · 2024-02-28T16:23:30Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,3635 @@
+{


Let's remove these outputs as they take a lot of space and are not super informative.

MKhalusova · 2024-02-28T16:23:30Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,3635 @@
+{


Feel free to remove this output as well.

MKhalusova · 2024-02-28T16:23:31Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,3635 @@
+{


Since you added the output of this cell, you can remove this.

MKhalusova · 2024-02-28T16:23:31Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,3635 @@
+{


There is no need to duplicate the output in the markdown cell, it will be shown in the rendered notebook. Please remove the markdown copy.

MKhalusova · 2024-02-28T16:23:31Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,3635 @@
+{


At the moment there are some issues displaying pandas dataframe outputs, so you can actually leave this markdown version of the output

MKhalusova · 2024-02-28T16:23:31Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,3635 @@
+{


Please remove the output duplicated in markdown, only leave the actual cell output

MKhalusova · 2024-02-28T16:23:31Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,3635 @@
+{


Same here. While for the rest of the outputs I encourage you to remove the duplication, you can leave this for pandas dataframes.

MKhalusova · 2024-02-28T16:23:31Z

notebooks/en/issues_in_text_dataset.ipynb

@@ -0,0 +1,3635 @@
+{


This reads like a sales pitch, which is not aligned with the goals of the Open Source AI cookbook. Please remove.

MKhalusova · 2024-02-28T16:24:00Z

A few finishing touches, and the notebook will be good to merge!

aravindputrevu · 2024-03-08T21:11:57Z

@MKhalusova I made the requested changes and corrected a few other items. Please review.

HuggingFaceDocBuilderDev · 2024-03-11T15:11:32Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

MKhalusova · 2024-03-11T16:54:23Z

I fixed some missing columns, and we can merge now. Will share the new recipe tomorrow!

Changes for Detecting Issues in a Text Dataset with Datalab

744e474

MKhalusova reviewed Feb 19, 2024

Fixed the review comments

6404c73

Merge branch 'main' into main

2011732

MKhalusova reviewed Feb 28, 2024

Addressed some more review comments in the notebook

d112dcd

Changes to the TOC tree

89cd83a

MKhalusova added 2 commits March 11, 2024 11:45

Moved the recipe after the index page

06e8b38

Fixes missing columns in tables

1ced1f1

MKhalusova approved these changes Mar 11, 2024

MKhalusova merged commit bce2a75 into huggingface:main Mar 11, 2024
1 check passed
