Detecting Issues in a Text Dataset with Datalab #30

Merged: 7 commits, Mar 11, 2024

Contributor

@aravindputrevu aravindputrevu commented Feb 17, 2024

What does this PR do?

This notebook shows how to detect issues in a text dataset with data-centric AI techniques, using the open-source package Cleanlab and its Datalab object.

Outcomes of the notebook:

  • Compute out-of-sample predicted probabilities for a sample dataset using cross-validation.

  • Use Datalab to identify issues such as noisy labels, outliers, (near) duplicates, and other types of problems

  • View the issue summaries and other information about our sample dataset
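
To make the first outcome concrete, here is a minimal standard-library sketch of what "out-of-sample predicted probabilities" means: each example is scored by a model that was trained only on the other folds. The tiny Naive Bayes classifier, the toy texts, and every function name are illustrative assumptions, not the notebook's code; the notebook presumably relies on a library routine for this step.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    """Fit a tiny multinomial Naive Bayes model on whitespace tokens."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return class_counts, word_counts, vocab


def predict_proba(model, text, classes):
    """Return P(class | text) for each class, in the order of `classes`."""
    class_counts, word_counts, vocab = model
    total = sum(class_counts.values())
    log_scores = []
    for c in classes:
        # Laplace smoothing keeps unseen words from zeroing out a class.
        log_p = math.log((class_counts[c] + 1) / (total + len(classes)))
        denom = sum(word_counts[c].values()) + len(vocab) + 1
        for token in text.lower().split():
            log_p += math.log((word_counts[c][token] + 1) / denom)
        log_scores.append(log_p)
    # Normalize the log scores into probabilities that sum to 1.
    top = max(log_scores)
    exp_scores = [math.exp(s - top) for s in log_scores]
    norm = sum(exp_scores)
    return [s / norm for s in exp_scores]


def out_of_sample_pred_probs(texts, labels, n_folds=3):
    """Score every example with a model that never saw it during training."""
    classes = sorted(set(labels))
    pred_probs = [None] * len(texts)
    for fold in range(n_folds):
        held_out = [i for i in range(len(texts)) if i % n_folds == fold]
        train = [i for i in range(len(texts)) if i % n_folds != fold]
        model = train_naive_bayes(
            [texts[i] for i in train], [labels[i] for i in train]
        )
        for i in held_out:
            pred_probs[i] = predict_proba(model, texts[i], classes)
    return classes, pred_probs


# Toy data, purely illustrative.
texts = [
    "card was declined",
    "card payment declined",
    "reset my password",
    "forgot my password",
    "card declined again",
    "password reset link",
]
labels = ["card", "card", "password", "password", "card", "password"]

classes, pred_probs = out_of_sample_pred_probs(texts, labels, n_folds=3)
```

In cleanlab, a matrix like `pred_probs` (one row per example, one column per class) is what `Datalab.find_issues` accepts through its `pred_probs` argument when looking for likely label errors.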

@MKhalusova appreciate your review.

@@ -0,0 +1,571 @@
{
Contributor

@MKhalusova MKhalusova Feb 19, 2024

Can you add a link to cleanlab's GitHub repo?


Contributor Author

Done!

@@ -0,0 +1,571 @@
{
Contributor

@MKhalusova MKhalusova Feb 19, 2024

Make sure to install the version corresponding to this tutorial

What are these versions? Perhaps we can recommend installing the latest (add the -U flag to get the newest versions). We can also probably remove the comment.


Contributor Author

Done!

@@ -0,0 +1,571 @@
{
Contributor

@MKhalusova MKhalusova Feb 19, 2024

Can we just leave a pip install?


Contributor Author

Yes, removing this block.

@@ -0,0 +1,571 @@
{
Contributor

@MKhalusova MKhalusova Feb 19, 2024

Line #1.    # Package installation (hidden on docs.cleanlab.ai).

This comment can be removed.


@@ -0,0 +1,571 @@
{
Contributor

@MKhalusova MKhalusova Feb 19, 2024

Line #2.    # If running on Colab, may want to use GPU (select: Runtime > Change runtime type > Hardware accelerator > GPU)

You can add this to the introduction.


@@ -0,0 +1,571 @@
{
Contributor

@MKhalusova MKhalusova Feb 19, 2024

Add the output


@@ -0,0 +1,571 @@
{
Contributor

@MKhalusova MKhalusova Feb 19, 2024

Add the output


@@ -0,0 +1,571 @@
{
Contributor

@MKhalusova MKhalusova Feb 19, 2024

Add the output


@@ -0,0 +1,571 @@
{
Contributor

@MKhalusova MKhalusova Feb 19, 2024

Add the output here as well.


@@ -0,0 +1,571 @@
{
Contributor

@MKhalusova MKhalusova Feb 19, 2024

It would be cool to quickly show how you would update the dataset based on the report (e.g. remove all of the "bad" examples, or add a column indicating which ones are good to use and which ones are not). I imagine one would want to run such a cleanup on a schedule and somehow integrate the results.


Contributor Author

Well, Cleanlab is the package that helps identify these issues. One can simply delete the near-duplicates or outliers from the dataframe and export the CSV. Cleanlab calls the result a Cleanset, as in Cleaned Dataset. I'd want users to take a look and decide for themselves which data points to delete.

The goal of the project is to showcase the problems within a dataset, so in my opinion it would be a bit difficult to integrate the package as a workflow (though it could be done with a GH Action or similar!).

Hence the last paragraph: Cleanlab Studio provides the necessary UI and a long-term solution for maintaining the datasets.
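
The cleanup discussed above (drop the flagged rows, or keep them and add a usability column, then export a CSV) could be sketched as follows. This is a standard-library sketch under assumptions: the flag names are modeled on the boolean `is_*_issue` columns of Datalab's issue table, and the sample rows, the `keep` column, and the helper names are hypothetical.

```python
import csv
import io

# Assumed flag names, modeled on Datalab's per-example issue table.
ISSUE_FLAGS = ("is_label_issue", "is_outlier_issue", "is_near_duplicate_issue")


def add_keep_column(rows, issues, flags=ISSUE_FLAGS):
    """Return copies of `rows` with a boolean `keep` column added."""
    flagged = []
    for row, issue in zip(rows, issues):
        has_issue = any(issue.get(flag, False) for flag in flags)
        flagged.append({**row, "keep": not has_issue})
    return flagged


def to_csv(rows):
    """Serialize a non-empty list of dicts to CSV text."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


# Hypothetical dataset rows and the matching per-row issue flags.
rows = [
    {"text": "card was declined", "label": "card"},
    {"text": "reset my password", "label": "card"},
    {"text": "card was declined!", "label": "card"},
    {"text": "forgot my password", "label": "password"},
]
issues = [
    {"is_label_issue": False, "is_outlier_issue": False, "is_near_duplicate_issue": False},
    {"is_label_issue": True, "is_outlier_issue": False, "is_near_duplicate_issue": False},
    {"is_label_issue": False, "is_outlier_issue": False, "is_near_duplicate_issue": True},
    {"is_label_issue": False, "is_outlier_issue": False, "is_near_duplicate_issue": False},
]

# Option 1: keep every row and mark which ones are good to use.
flagged = add_keep_column(rows, issues)

# Option 2: drop the flagged rows and export only the clean ones.
cleaned = [
    {key: value for key, value in row.items() if key != "keep"}
    for row in flagged
    if row["keep"]
]
cleaned_csv = to_csv(cleaned)
```

Keeping the `keep` column instead of deleting rows leaves the final decision to a human reviewer, which matches the preference stated above.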

@@ -12,3 +12,5 @@
title: Advanced RAG on HuggingFace documentation using LangChain
- local: rag_evaluation
title: RAG Evaluation
- local: issues_in_text_dataset
Contributor

Feel free to move this to the top, right after the index page

Contributor

Also it looks like there's a space missing here that breaks the CI/CD check. Make sure it's aligned with other entries

@@ -12,6 +12,7 @@ Check out the recently added notebooks:
- [Fine-tuning a Code LLM on Custom Code on a single GPU](fine_tuning_code_llm_on_single_gpu)
- [RAG Evaluation Using Synthetic data and LLM-As-A-Judge](rag_evaluation)
- [Advanced RAG on HuggingFace documentation using LangChain](advanced_rag)
- [Detecting Issues in a Text Dataset with Datalab](issues_in_text_dataset)
Contributor

feel free to add it to the top of the list

Contributor

MKhalusova commented Feb 19, 2024

Awesome tutorial, @aravindputrevu !
I left some comments. My main feedback:

  • let's include the informative outputs (where you print things out, where you show the reports, etc.)
  • consider adding an alternative way of loading data - the same dataset is available on Hugging Face Hub, and this can be a great option for larger datasets.
  • At the end of the tutorial, it would be cool to show how to integrate the results back into the dataset.

Also, please add yourself as an author, right after the main title, like this: "Authored by: Your Name". Feel free to use either your Hugging Face profile or your GitHub profile; it's up to you which one to link.

@aravindputrevu
Contributor Author

@MKhalusova Thanks for the review, I will be working on the comments.

@aravindputrevu
Contributor Author

@MKhalusova I have addressed the review comments and responded to the other questions. Please let me know.

@@ -0,0 +1,3635 @@
{
Contributor

@MKhalusova MKhalusova Feb 28, 2024

Please use the actual name and not the account handle for the author, for consistency with other notebooks, i.e.

[FirstName LastName](link_to_HF_profile)


@@ -0,0 +1,3635 @@
{
Contributor

@MKhalusova MKhalusova Feb 28, 2024

The doc-builder that we use to publish notebooks seems to have issues with the <div> tags in this markdown. Please reformat to remove them, and leave only the markdown formatting.
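
One way to do this in bulk is to strip the tags from the notebook's markdown cells programmatically. A minimal sketch, assuming the standard `.ipynb` JSON layout (`cells`, `cell_type`, `source`); the function name and the sample notebook are hypothetical, and this is not part of the doc-builder tooling.

```python
import re

# Matches opening tags such as <div class="..."> and closing </div> tags.
DIV_TAG = re.compile(r"</?div\b[^>]*>", flags=re.IGNORECASE)


def strip_div_tags(notebook):
    """Remove <div> tags from markdown cells, leaving code cells untouched."""
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "markdown":
            continue
        source = cell.get("source", [])
        # `source` may be one string or a list of line strings.
        if isinstance(source, str):
            cell["source"] = DIV_TAG.sub("", source)
        else:
            cell["source"] = [DIV_TAG.sub("", line) for line in source]
    return notebook


# Hypothetical in-memory notebook with one markdown cell and one code cell.
notebook = {
    "cells": [
        {
            "cell_type": "markdown",
            "source": ['<div class="alert">\n', "**Note:** uses a GPU.\n", "</div>\n"],
        },
        {"cell_type": "code", "source": ["print('<div>')\n"]},
    ]
}

cleaned = strip_div_tags(notebook)
```

For a real file, the same function can be applied to the dictionary returned by `json.load` and the result written back with `json.dump`.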


@@ -0,0 +1,3635 @@
{
Contributor

@MKhalusova MKhalusova Feb 28, 2024

Let's remove these outputs as they take a lot of space and are not super informative.


@@ -0,0 +1,3635 @@
{
Contributor

@MKhalusova MKhalusova Feb 28, 2024

Feel free to remove this output as well.


@@ -0,0 +1,3635 @@
{
Contributor

@MKhalusova MKhalusova Feb 28, 2024

Since you added the output of this cell, you can remove this.


@@ -0,0 +1,3635 @@
{
Contributor

@MKhalusova MKhalusova Feb 28, 2024

There is no need to duplicate the output in the markdown cell, it will be shown in the rendered notebook. Please remove the markdown copy.


@@ -0,0 +1,3635 @@
{
Contributor

@MKhalusova MKhalusova Feb 28, 2024

At the moment there are some issues displaying pandas dataframe outputs, so you can actually leave this markdown version of the output


@@ -0,0 +1,3635 @@
{
Contributor

@MKhalusova MKhalusova Feb 28, 2024

Please remove the output duplicated in markdown, only leave the actual cell output


@@ -0,0 +1,3635 @@
{
Contributor

@MKhalusova MKhalusova Feb 28, 2024

Same here. While for the rest of the outputs I encourage you to remove the duplication, you can leave this for pandas dataframes.


@@ -0,0 +1,3635 @@
{
Contributor

@MKhalusova MKhalusova Feb 28, 2024

This reads like a sales pitch, which is not aligned with the goals of the Open Source AI cookbook. Please remove.


@MKhalusova
Contributor

A few finishing touches, and the notebook will be good to merge!

Contributor Author

@MKhalusova I made the requested changes and corrected a few other items. Please review.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@MKhalusova
Contributor

I fixed some missing columns, and we can merge now. Will share the new recipe tomorrow!

@MKhalusova MKhalusova merged commit bce2a75 into huggingface:main Mar 11, 2024
1 check passed
3 participants