
Add functionality to sync gr.Interface to Hugging Face Hub datasets #8634

Draft · wants to merge 13 commits into main
Conversation


@davidberenstein1957 commented Jun 26, 2024

Description

This functionality syncs data from a gr.Interface to the Hugging Face Hub.

Stretch: I can also see this being implemented at a lower level, with a default processing function per Component type, where people could pass a list of component identifiers to determine what to log (see the sketch below the example). @pngwn @abidlabs WDYT?

from transformers import pipeline

import gradio as gr

pipe = pipeline("text-classification", model="mrm8488/bert-tiny-finetuned-sms-spam-detection")
demo = gr.Interface.from_pipeline(pipe)

# Proposed API: periodically push each interaction (inputs and outputs)
# from the demo to a Hugging Face Hub dataset.
demo.sync_with_hub(
    repo_id="davidberenstein1957/bert-tiny-finetuned-sms-spam-detection",
    every=1,  # proposed sync interval
)
demo.launch()

https://huggingface.co/datasets/davidberenstein1957/bert-tiny-finetuned-sms-spam-detection
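
To make the stretch idea concrete: a lower-level variant might look like the sketch below. The components= parameter and the per-Component processing hook are hypothetical; only the sync_with_hub(repo_id=..., every=...) call above exists in this PR.

# Hypothetical extension of the proposed API: restrict what gets logged by
# passing component identifiers. `components=` does NOT exist in this PR;
# it only illustrates the stretch idea above.
demo.sync_with_hub(
    repo_id="davidberenstein1957/bert-tiny-finetuned-sms-spam-detection",
    components=["textbox", "label"],  # hypothetical: which components to log
    every=1,
)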

🎯 PRs Should Target Issues

Closes: #8635

Not adhering to this guideline will result in the PR being closed.

Tests

  1. PRs will only be merged if tests pass on CI. To run the tests locally, please set up your Gradio environment locally and run the tests: bash scripts/run_all_tests.sh

  2. You may need to run the linters: bash scripts/format_backend.sh and bash scripts/format_frontend.sh

@gradio-pr-bot
Contributor

gradio-pr-bot commented Jun 26, 2024

🪼 branch checks and previews

Name     Status  URL
Spaces   ready!  Spaces preview
Website  ready!  Website preview
🦄 Changes detecting...

Install Gradio from this PR

pip install https://gradio-builds.s3.amazonaws.com/3fde2e2ab1eb583f6529821e336396e02f4b14bc/gradio-4.37.1-py3-none-any.whl

Install Gradio Python Client from this PR

pip install "gradio-client @ git+https://github.com/gradio-app/gradio@3fde2e2ab1eb583f6529821e336396e02f4b14bc#subdirectory=client/python"

Install Gradio JS Client from this PR

npm install https://gradio-builds.s3.amazonaws.com/3fde2e2ab1eb583f6529821e336396e02f4b14bc/gradio-client-1.2.0.tgz

@abidlabs
Member

Thanks @davidberenstein1957 for this PR! I quite like this API, though we already have the HuggingFaceDatasetSaver class, which I believe does the same thing; see: https://www.gradio.app/guides/using-flagging#the-hugging-face-dataset-saver-callback

Is this PR aiming at something else?
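
For reference, wiring up the existing callback from that guide looks roughly like this (the token placeholder and dataset name are illustrative):

import gradio as gr
from transformers import pipeline

pipe = pipeline("text-classification", model="mrm8488/bert-tiny-finetuned-sms-spam-detection")

# Writes flagged samples to a Hub dataset; requires a token with write access.
hf_writer = gr.HuggingFaceDatasetSaver(hf_token="hf_...", dataset_name="sms-spam-flags")

demo = gr.Interface.from_pipeline(
    pipe,
    allow_flagging="auto",        # flag every submission, no button press needed
    flagging_callback=hf_writer,
)
demo.launch()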

@davidberenstein1957
Author

Hi @abidlabs, I did not know this functionality existed. It generally covers the same workflow and functionality; however, the implemented flagging callback does not work with API calls.

With Argilla, we were hoping to find ways to sync datasets, models, and of course human feedback a bit better, and to make an iterative data-quality process the default for working with models.

Having different callback mechanisms is nice, but the configuration required alongside allow_flagging="auto" makes it harder to set a default for this. Ideally, people could build Gradio demos, deploy them, and directly collect data into a Hub dataset. We could then open these datasets in Argilla to review and correct them, fine-tune a model on the result, and redeploy it.
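
For illustration, this is roughly how such an API call is made with the Gradio client (the local URL and sample input are assumptions); submissions sent this way reportedly never reach the flagging callback:

from gradio_client import Client

# Point the client at the running demo (local URL assumed here).
client = Client("http://127.0.0.1:7860/")

# This request goes through the API route rather than the UI.
result = client.predict("WINNER!! Claim your free prize now", api_name="/predict")
print(result)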

@abidlabs
Member

abidlabs commented Jun 27, 2024

> With Argilla, we were hoping to find ways to sync datasets, models, and of course human feedback a bit better, and to make an iterative data-quality process the default for working with models.
>
> Having different callback mechanisms is nice, but the configuration required alongside allow_flagging="auto" makes it harder to set a default for this. Ideally, people could build Gradio demos, deploy them, and directly collect data into a Hub dataset. We could then open these datasets in Argilla to review and correct them, fine-tune a model on the result, and redeploy it.

The workflow totally makes sense to me. However, it's less clear to me what the current limitations of HuggingFaceDatasetSaver are. What do you mean by:

> the implemented flagging callback does not work with API calls

@gradio-pr-bot
Contributor

gradio-pr-bot commented Jun 27, 2024

🦄 change detected

This Pull Request includes changes to the following packages.

Package  Version
gradio   minor
  • Maintainers can select this checkbox to manually select packages to update.

With the following changelog entry.

Add functionality to sync gr.Interface to Hugging Face Hub datasets.

Maintainers or the PR author can modify the PR title to modify this entry.

Something isn't right?

  • Maintainers can change the version label to modify the version bump.
  • If the bot has failed to detect any changes, or if this pull request needs to update multiple packages to different versions or requires a more comprehensive changelog entry, maintainers can update the changelog file directly.

@davidberenstein1957
Author

davidberenstein1957 commented Jun 27, 2024

> > With Argilla, we were hoping to find ways to sync datasets, models, and of course human feedback a bit better, and to make an iterative data-quality process the default for working with models.
> >
> > Having different callback mechanisms is nice, but the configuration required alongside allow_flagging="auto" makes it harder to set a default for this. Ideally, people could build Gradio demos, deploy them, and directly collect data into a Hub dataset. We could then open these datasets in Argilla to review and correct them, fine-tune a model on the result, and redeploy it.
>
> The workflow totally makes sense to me. However, it's less clear to me what the current limitations of HuggingFaceDatasetSaver are. What do you mean by:
>
> > the implemented flagging callback does not work with API calls

Firstly, when using the Gradio client for API calls/requests, the outputs don't seem to get flagged when using the HuggingFaceDatasetSaver.

Also, the usability concern is partly about the flow: you need to explicitly import HuggingFaceDatasetSaver and set allow_flagging="auto" to start collecting data without manual intervention. This flow is also somewhat hidden when working with something like gr.Interface.from_pipeline, because both argument names are buried in the **kwargs (see the snippet below).
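
Concretely, the current flow looks roughly like this (token and dataset name are placeholders); allow_flagging and flagging_callback are forwarded through from_pipeline's **kwargs, so they don't show up in its signature:

import gradio as gr
from transformers import pipeline

pipe = pipeline("text-classification", model="mrm8488/bert-tiny-finetuned-sms-spam-detection")

# Both flagging arguments below are forwarded to gr.Interface via **kwargs,
# so they are easy to miss when reading from_pipeline's signature.
hf_writer = gr.HuggingFaceDatasetSaver(hf_token="hf_...", dataset_name="collected-interactions")
demo = gr.Interface.from_pipeline(pipe, allow_flagging="auto", flagging_callback=hf_writer)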

I think "flagging" feels a bit more as if it were designed to be reactive as a response to potentially bad examples, while "logging" (everything), which I intend to do, is more pro-active in nature.

FYI, we are still exploring how this Argilla back-and-forth with the Hub should work, but our initial idea was to create an argilla-settings.yml file, which feels like it should work nicely together with the dataset_info.json (I just realized it sets everything to string for some reason?).

I think it would be interesting to explore some additional features covering some model card enhancements too.

I'm not saying we should keep my code, by the way; I'm more sharing my reasoning on the idea sketched above, and I hope to dive deeper into the matter during a meeting with @pngwn when he is back from his travels. @abidlabs, would you like to join too?

@abidlabs
Member

@davidberenstein1957 for sure, happy to sync. I do think all of these issues could be solvable, though:

> Firstly, when using the Gradio client for API calls/requests, the outputs don't seem to get flagged when using the HuggingFaceDatasetSaver.

Interesting, I hadn't considered this use case. It does sound like a bug that we should resolve, though.

> Also, the usability concern is partly about the flow: you need to explicitly import HuggingFaceDatasetSaver and set allow_flagging="auto" to start collecting data without manual intervention. This flow is also somewhat hidden when working with something like gr.Interface.from_pipeline, because both argument names are buried in the **kwargs.

We can expose these arguments 👍

I think "flagging" feels a bit more as if it were designed to be reactive as a response to potentially bad examples, while "logging" (everything), which I intend to do, is more pro-active in nature.

That's where the "allow_flagging" argument comes in, though I agree the naming could be better.
