diff --git a/notebooks/en/_toctree.yml b/notebooks/en/_toctree.yml
index 16054153..84125fd3 100644
--- a/notebooks/en/_toctree.yml
+++ b/notebooks/en/_toctree.yml
@@ -66,4 +66,6 @@
 - title: Enterprise Hub Cookbook
   sections:
   - local: enterprise_cookbook_overview
-    title: Overview
\ No newline at end of file
+    title: Overview
+  - local: enterprise_cookbook_argilla
+    title: Data annotation with Argilla Spaces
\ No newline at end of file
diff --git a/notebooks/en/enterprise_cookbook_argilla.ipynb b/notebooks/en/enterprise_cookbook_argilla.ipynb
new file mode 100644
index 00000000..fbe10886
--- /dev/null
+++ b/notebooks/en/enterprise_cookbook_argilla.ipynb
@@ -0,0 +1,1413 @@
{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "c9a872bb-d364-4939-865e-6f01b16ca1f4",
   "metadata": {},
   "source": [
    "# Data Annotation with Argilla Spaces\n",
    "This notebook illustrates the workflow for systematically evaluating LLM outputs and creating LLM training data. You can start by using this notebook to evaluate the zero-shot performance of your favourite LLM on your task without any fine-tuning. If you want to improve performance, you can then easily reuse this workflow to create training data.\n",
    "\n",
    "**Example use case: code generation.** For this tutorial, we demonstrate how to create high-quality test and train data for *code generation tasks*. The same workflow can, however, be adapted to any other task that is relevant for your specific use case.\n",
    "\n",
    "**In this notebook, we:**\n",
    "1. Download data for the example task.\n",
    "2. Prompt two LLMs to respond to these tasks. This results in \"synthetic data\" that speeds up manual data creation.\n",
    "3. Create an Argilla annotation interface on HF Spaces to compare and evaluate the outputs from the two LLMs.\n",
    "4. Upload the example data and the zero-shot LLM responses into the Argilla annotation interface.\n",
    "5. Download the annotated data.\n",
    "\n",
    "You can adapt this notebook to your needs, e.g. by using a different LLM and API provider for step (2) or by adapting the annotation interface in step (3)."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a482a2f5-9f0d-4117-a606-6d6bf80c4c14",
   "metadata": {},
   "source": [
    "## Install required packages and connect to HF Hub"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "972076ae-2ad4-4afa-b9be-e3146ffbfe69",
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "!pip install \"argilla[server]~=1.27.0\"\n",
    "!pip install transformers~=4.40.0\n",
    "!pip install datasets~=2.19.0\n",
    "!pip install huggingface_hub~=0.23.2"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "dbc6293c-4f10-4cd3-b009-664929a3cbb9",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "application/vnd.jupyter.widget-view+json": {
       "model_id": "08d5e2d3ab4644c9b2e31ca0649b43ec",
       "version_major": 2,
       "version_minor": 0
      },
      "text/plain": [
       "VBox(children=(HTML(value='
[05/29/24 12:51:18] INFO     INFO:argilla.client.feedback.dataset.local.mixins:✓ Dataset succesfully mixins.py:271\n",
       "                             pushed to Argilla                                                               \n",
       "\n"
      ],
      "text/plain": []
     },
     "metadata": {},
     "output_type": "display_data"
    },
    {
     "data": {
      "text/html": [
       "
INFO INFO:argilla.client.feedback.dataset.local.mixins:RemoteFeedbackDataset( mixins.py:272\n", + " id=b3a9098f-25a9-4b59-9fb6-739e885c5ab3 \n", + " name=code-llm \n", + " workspace=Workspace(id=d9ec781d-8505-430c-ab87-0deac2951f00, \n", + " name=admin, inserted_at=2024-05-02 11:40:30.831848, \n", + " updated_at=2024-05-02 11:40:30.831848) \n", + " url=https://moritzlaurer-argilla-00.hf.space/dataset/b3a9098f-25a9-4b \n", + " 59-9fb6-739e885c5ab3/annotation-mode \n", + " fields=[RemoteTextField(id=UUID('22667685-d826-4281-82ec-fa2d8334aff2 \n", + " '), client=None, name='instruction', title='Instruction:', \n", + " required=True, type='text', use_markdown=True), \n", + " RemoteTextField(id=UUID('17140de6-63c8-43a9-ad04-10606c85d70b'), \n", + " client=None, name='generation_1', title='Response model 1:', \n", + " required=True, type='text', use_markdown=True), \n", + " RemoteTextField(id=UUID('f095d45a-53b3-4531-b541-29c8bbfa0156'), \n", + " client=None, name='generation_2', title='Response model 2:', \n", + " required=True, type='text', use_markdown=True)] \n", + " questions=[RemoteRatingQuestion(id=UUID('7714ab9f-5297-461b-9b9c-53de \n", + " 0bc1633a'), client=None, name='score_response_1', title='Your score for \n", + " the response of model 1:', description='- Add up to +2 points, if the \n", + " code is properly commented, with inline comments and doc strings for \n", + " functions.\\n- Add up to +2 points, if the code contains a good example \n", + " for testing. \\n- Add up to +3 points, if the code runs and works \n", + " correctly. Copy the code into an IDE and test it with at least two \n", + " different inputs. Attribute one point if the code works mostly \n", + " correctly, but has some issues. Attribute three points if the code is \n", + " fully correct and robust against different scenarios. \\n', \n", + " required=True, type='rating', values=[1, 2, 3, 4, 5, 6, 7]), \n", + " RemoteRatingQuestion(id=UUID('43db47a7-1d83-4d2d-b5e2-3992262afb97'), \n", + " client=None, name='score_response_2', title='Your score for the response \n", + " of model 2:', description='- Add up to +2 points, if the code is \n", + " properly commented, with inline comments and doc strings for \n", + " functions.\\n- Add up to +2 points, if the code contains a good example \n", + " for testing. \\n- Add up to +3 points, if the code runs and works \n", + " correctly. Copy the code into an IDE and test it with at least two \n", + " different inputs. Attribute one point if the code works mostly \n", + " correctly, but has some issues. Attribute three points if the code is \n", + " fully correct and robust against different scenarios. 
\\n', \n", + " required=True, type='rating', values=[1, 2, 3, 4, 5, 6, 7]), \n", + " RemoteLabelQuestion(id=UUID('dd2d916c-f0d7-4d2b-aa27-fc1e60575d37'), \n", + " client=None, name='which_response_corrected', title='If both responses \n", + " score below 4, select a response to correct:', description='Select the \n", + " response you will correct in the text field below.', required=False, \n", + " type='label_selection', labels=['Response 1', 'Response 2', 'Combination \n", + " of both', 'Neither'], visible_labels=None), \n", + " RemoteTextQuestion(id=UUID('700649a7-a3f9-4a58-a3b9-3b40b2090a3d'), \n", + " client=None, name='correction', title='Paste the selected response below \n", + " and correct it manually:', description='Your corrected response must \n", + " fulfill all criteria from the annotation guidelines.', required=False, \n", + " type='text', use_markdown=True), \n", + " RemoteTextQuestion(id=UUID('a10e164c-655b-434a-ba8a-fdf8f4b572e6'), \n", + " client=None, name='comments', title='Annotator Comments', \n", + " description='Add any additional comments here. E.g.: edge cases, issues \n", + " with the interface etc.', required=False, type='text', \n", + " use_markdown=True)] \n", + " guidelines=Your task is to evaluate the responses of two LLMs to code \n", + " generation tasks. \n", + " \n", + " First, you need to score each response on a scale from 0 to 7. You \n", + " add points to your final score based on the following criteria: \n", + " - Add up to +2 points, if the code is properly commented, with inline \n", + " comments and doc strings for functions. \n", + " - Add up to +2 points, if the code contains a good example for \n", + " testing. \n", + " - Add up to +3 points, if the code runs and works correctly. Copy the \n", + " code into an IDE and test it with at least two different inputs. \n", + " Attribute one point if the code is overall correct, but has some issues. \n", + " Attribute three points if the code is fully correct and robust against \n", + " different scenarios. \n", + " Your resulting final score can be any value between 0 to 7. \n", + " \n", + " If both responses have a final score of <= 4, select one response and \n", + " correct it manually in the text field. \n", + " The corrected response must fulfill all criteria from above. 
\n", + " \n", + " metadata_properties=[RemoteTermsMetadataProperty(id=UUID('77d28aae-77 \n", + " 44-431d-ada9-817e49b55ae3'), client=<httpx.Client object at \n", + " 0x7f10042af790>, name='annotator-groups', title='Annotator groups', \n", + " visible_for_annotators=True, type='terms', values=['annotator-1', \n", + " 'annotator-2', 'annotator-3']), \n", + " RemoteTermsMetadataProperty(id=UUID('ebeed9fe-23df-4f07-b29e-501e2e42c93 \n", + " f'), client=<httpx.Client object at 0x7f10042af790>, \n", + " name='source-dataset', title='Original dataset source', \n", + " visible_for_annotators=True, type='terms', values=None)] \n", + " vectors_settings=[] \n", + " ) \n", + "\n" + ], + "text/plain": [ + "\u001b[2;36m \u001b[0m\u001b[2;36m \u001b[0m\u001b[34mINFO \u001b[0m INFO:argilla.client.feedback.dataset.local.mixins:\u001b[1;35mRemoteFeedbackDataset\u001b[0m\u001b[1m(\u001b[0m \u001b]8;id=289816;file:///home/user/miniconda/lib/python3.9/site-packages/argilla/client/feedback/dataset/local/mixins.py\u001b\\\u001b[2mmixins.py\u001b[0m\u001b]8;;\u001b\\\u001b[2m:\u001b[0m\u001b]8;id=960637;file:///home/user/miniconda/lib/python3.9/site-packages/argilla/client/feedback/dataset/local/mixins.py#272\u001b\\\u001b[2m272\u001b[0m\u001b]8;;\u001b\\\n", + "\u001b[2;36m \u001b[0m \u001b[33mid\u001b[0m=\u001b[93mb3a9098f\u001b[0m\u001b[93m-25a9-4b59-9fb6-739e885c5ab3\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mname\u001b[0m=\u001b[35mcode\u001b[0m-llm \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mworkspace\u001b[0m=\u001b[1;35mWorkspace\u001b[0m\u001b[1m(\u001b[0m\u001b[33mid\u001b[0m=\u001b[93md9ec781d\u001b[0m\u001b[93m-8505-430c-ab87-0deac2951f00\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mname\u001b[0m=\u001b[35madmin\u001b[0m, \u001b[33minserted_at\u001b[0m=\u001b[1;36m2024\u001b[0m-\u001b[1;36m05\u001b[0m-\u001b[1;36m02\u001b[0m \u001b[1;92m11:40:30\u001b[0m.\u001b[1;36m831848\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mupdated_at\u001b[0m=\u001b[1;36m2024\u001b[0m-\u001b[1;36m05\u001b[0m-\u001b[1;36m02\u001b[0m \u001b[1;92m11:40:30\u001b[0m.\u001b[1;36m831848\u001b[0m\u001b[1m)\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33murl\u001b[0m=\u001b[4;94mhttps\u001b[0m\u001b[4;94m://moritzlaurer-argilla-00.hf.space/dataset/b3a9098f-25a9-4b\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[4;94m59-9fb6-739e885c5ab3/annotation-mode\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mfields\u001b[0m=\u001b[1m[\u001b[0m\u001b[1;35mRemoteTextField\u001b[0m\u001b[1m(\u001b[0m\u001b[33mid\u001b[0m=\u001b[1;35mUUID\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'22667685-d826-4281-82ec-fa2d8334aff2\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32m'\u001b[0m\u001b[1m)\u001b[0m, \u001b[33mclient\u001b[0m=\u001b[3;35mNone\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'instruction'\u001b[0m, \u001b[33mtitle\u001b[0m=\u001b[32m'Instruction:'\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mrequired\u001b[0m=\u001b[3;92mTrue\u001b[0m, \u001b[33mtype\u001b[0m=\u001b[32m'text'\u001b[0m, \u001b[33muse_markdown\u001b[0m=\u001b[3;92mTrue\u001b[0m\u001b[1m)\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[1;35mRemoteTextField\u001b[0m\u001b[1m(\u001b[0m\u001b[33mid\u001b[0m=\u001b[1;35mUUID\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'17140de6-63c8-43a9-ad04-10606c85d70b'\u001b[0m\u001b[1m)\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m 
\u001b[0m \u001b[33mclient\u001b[0m=\u001b[3;35mNone\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'generation_1'\u001b[0m, \u001b[33mtitle\u001b[0m=\u001b[32m'Response model 1:'\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mrequired\u001b[0m=\u001b[3;92mTrue\u001b[0m, \u001b[33mtype\u001b[0m=\u001b[32m'text'\u001b[0m, \u001b[33muse_markdown\u001b[0m=\u001b[3;92mTrue\u001b[0m\u001b[1m)\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[1;35mRemoteTextField\u001b[0m\u001b[1m(\u001b[0m\u001b[33mid\u001b[0m=\u001b[1;35mUUID\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'f095d45a-53b3-4531-b541-29c8bbfa0156'\u001b[0m\u001b[1m)\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mclient\u001b[0m=\u001b[3;35mNone\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'generation_2'\u001b[0m, \u001b[33mtitle\u001b[0m=\u001b[32m'Response model 2:'\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mrequired\u001b[0m=\u001b[3;92mTrue\u001b[0m, \u001b[33mtype\u001b[0m=\u001b[32m'text'\u001b[0m, \u001b[33muse_markdown\u001b[0m=\u001b[3;92mTrue\u001b[0m\u001b[1m)\u001b[0m\u001b[1m]\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mquestions\u001b[0m=\u001b[1m[\u001b[0m\u001b[1;35mRemoteRatingQuestion\u001b[0m\u001b[1m(\u001b[0m\u001b[33mid\u001b[0m=\u001b[1;35mUUID\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'7714ab9f-5297-461b-9b9c-53de\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32m0bc1633a'\u001b[0m\u001b[1m)\u001b[0m, \u001b[33mclient\u001b[0m=\u001b[3;35mNone\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'score_response_1'\u001b[0m, \u001b[33mtitle\u001b[0m=\u001b[32m'Your score for \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mthe response of model 1:'\u001b[0m, \u001b[33mdescription\u001b[0m=\u001b[32m'- Add up to +2 points, if the \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mcode is properly commented, with inline comments and doc strings for \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mfunctions.\\n- Add up to +2 points, if the code contains a good example \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mfor testing. \\n- Add up to +3 points, if the code runs and works \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mcorrectly. Copy the code into an IDE and test it with at least two \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mdifferent inputs. Attribute one point if the code works mostly \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mcorrectly, but has some issues. Attribute three points if the code is \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mfully correct and robust against different scenarios. 
\\n'\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mrequired\u001b[0m=\u001b[3;92mTrue\u001b[0m, \u001b[33mtype\u001b[0m=\u001b[32m'rating'\u001b[0m, \u001b[33mvalues\u001b[0m=\u001b[1m[\u001b[0m\u001b[1;36m1\u001b[0m, \u001b[1;36m2\u001b[0m, \u001b[1;36m3\u001b[0m, \u001b[1;36m4\u001b[0m, \u001b[1;36m5\u001b[0m, \u001b[1;36m6\u001b[0m, \u001b[1;36m7\u001b[0m\u001b[1m]\u001b[0m\u001b[1m)\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[1;35mRemoteRatingQuestion\u001b[0m\u001b[1m(\u001b[0m\u001b[33mid\u001b[0m=\u001b[1;35mUUID\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'43db47a7-1d83-4d2d-b5e2-3992262afb97'\u001b[0m\u001b[1m)\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mclient\u001b[0m=\u001b[3;35mNone\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'score_response_2'\u001b[0m, \u001b[33mtitle\u001b[0m=\u001b[32m'Your score for the response\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mof model 2:'\u001b[0m, \u001b[33mdescription\u001b[0m=\u001b[32m'- Add up to +2 points, if the code is \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mproperly commented, with inline comments and doc strings for \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mfunctions.\\n- Add up to +2 points, if the code contains a good example \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mfor testing. \\n- Add up to +3 points, if the code runs and works \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mcorrectly. Copy the code into an IDE and test it with at least two \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mdifferent inputs. Attribute one point if the code works mostly \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mcorrectly, but has some issues. Attribute three points if the code is \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mfully correct and robust against different scenarios. 
\\n'\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mrequired\u001b[0m=\u001b[3;92mTrue\u001b[0m, \u001b[33mtype\u001b[0m=\u001b[32m'rating'\u001b[0m, \u001b[33mvalues\u001b[0m=\u001b[1m[\u001b[0m\u001b[1;36m1\u001b[0m, \u001b[1;36m2\u001b[0m, \u001b[1;36m3\u001b[0m, \u001b[1;36m4\u001b[0m, \u001b[1;36m5\u001b[0m, \u001b[1;36m6\u001b[0m, \u001b[1;36m7\u001b[0m\u001b[1m]\u001b[0m\u001b[1m)\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[1;35mRemoteLabelQuestion\u001b[0m\u001b[1m(\u001b[0m\u001b[33mid\u001b[0m=\u001b[1;35mUUID\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'dd2d916c-f0d7-4d2b-aa27-fc1e60575d37'\u001b[0m\u001b[1m)\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mclient\u001b[0m=\u001b[3;35mNone\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'which_response_corrected'\u001b[0m, \u001b[33mtitle\u001b[0m=\u001b[32m'If both responses \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mscore below 4, select a response to correct:'\u001b[0m, \u001b[33mdescription\u001b[0m=\u001b[32m'Select the \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mresponse you will correct in the text field below.'\u001b[0m, \u001b[33mrequired\u001b[0m=\u001b[3;91mFalse\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mtype\u001b[0m=\u001b[32m'label_selection'\u001b[0m, \u001b[33mlabels\u001b[0m=\u001b[1m[\u001b[0m\u001b[32m'Response 1'\u001b[0m, \u001b[32m'Response 2'\u001b[0m, \u001b[32m'Combination\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mof both'\u001b[0m, \u001b[32m'Neither'\u001b[0m\u001b[1m]\u001b[0m, \u001b[33mvisible_labels\u001b[0m=\u001b[3;35mNone\u001b[0m\u001b[1m)\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[1;35mRemoteTextQuestion\u001b[0m\u001b[1m(\u001b[0m\u001b[33mid\u001b[0m=\u001b[1;35mUUID\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'700649a7-a3f9-4a58-a3b9-3b40b2090a3d'\u001b[0m\u001b[1m)\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mclient\u001b[0m=\u001b[3;35mNone\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'correction'\u001b[0m, \u001b[33mtitle\u001b[0m=\u001b[32m'Paste the selected response below\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mand correct it manually:'\u001b[0m, \u001b[33mdescription\u001b[0m=\u001b[32m'Your corrected response must \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mfulfill all criteria from the annotation guidelines.'\u001b[0m, \u001b[33mrequired\u001b[0m=\u001b[3;91mFalse\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mtype\u001b[0m=\u001b[32m'text'\u001b[0m, \u001b[33muse_markdown\u001b[0m=\u001b[3;92mTrue\u001b[0m\u001b[1m)\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[1;35mRemoteTextQuestion\u001b[0m\u001b[1m(\u001b[0m\u001b[33mid\u001b[0m=\u001b[1;35mUUID\u001b[0m\u001b[1m(\u001b[0m\u001b[32m'a10e164c-655b-434a-ba8a-fdf8f4b572e6'\u001b[0m\u001b[1m)\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mclient\u001b[0m=\u001b[3;35mNone\u001b[0m, \u001b[33mname\u001b[0m=\u001b[32m'comments'\u001b[0m, \u001b[33mtitle\u001b[0m=\u001b[32m'Annotator Comments'\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mdescription\u001b[0m=\u001b[32m'Add any additional comments here. 
E.g.: edge cases, issues \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32mwith the interface etc.'\u001b[0m, \u001b[33mrequired\u001b[0m=\u001b[3;91mFalse\u001b[0m, \u001b[33mtype\u001b[0m=\u001b[32m'text'\u001b[0m, \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33muse_markdown\u001b[0m=\u001b[3;92mTrue\u001b[0m\u001b[1m)\u001b[0m\u001b[1m]\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[33mguidelines\u001b[0m=\u001b[35mYour\u001b[0m task is to evaluate the responses of two LLMs to code \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m generation tasks. \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m First, you need to score each response on a scale from \u001b[1;36m0\u001b[0m to \u001b[1;36m7\u001b[0m. You \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m add points to your final score based on the following criteria: \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m - Add up to +\u001b[1;36m2\u001b[0m points, if the code is properly commented, with inline \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m comments and doc strings for functions. \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m - Add up to +\u001b[1;36m2\u001b[0m points, if the code contains a good example for \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m testing. \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m - Add up to +\u001b[1;36m3\u001b[0m points, if the code runs and works correctly. Copy the \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m code into an IDE and test it with at least two different inputs. \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m Attribute one point if the code is overall correct, but has some issues. \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m Attribute three points if the code is fully correct and robust against \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m different scenarios. \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m Your resulting final score can be any value between \u001b[1;36m0\u001b[0m to \u001b[1;36m7\u001b[0m. \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m If both responses have a final score of \u001b[1m<\u001b[0m\u001b[39m= \u001b[0m\u001b[1;36m4\u001b[0m\u001b[39m, select one response and\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[39mcorrect it manually in the text field. \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[39m The corrected response must fulfill all criteria from above. \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[39m \u001b[0m\u001b[33mmetadata_properties\u001b[0m\u001b[39m=\u001b[0m\u001b[1;39m[\u001b[0m\u001b[1;35mRemoteTermsMetadataProperty\u001b[0m\u001b[1;39m(\u001b[0m\u001b[33mid\u001b[0m\u001b[39m=\u001b[0m\u001b[1;35mUUID\u001b[0m\u001b[1;39m(\u001b[0m\u001b[32m'77d28aae-77\u001b[0m \u001b[2m \u001b[0m\n", + "\u001b[2;36m \u001b[0m \u001b[32m44-431d-ada9-817e49b55ae3'\u001b[0m\u001b[1;39m)\u001b[0m\u001b[39m, \u001b[0m\u001b[33mclient\u001b[0m\u001b[39m=
\n", + "\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "cbcdf844d9334395af6bf494b35d9b2a", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Output()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n" + ], + "text/plain": [] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n", + "\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "application/vnd.jupyter.widget-view+json": { + "model_id": "7ca6bbf2d60b4b1ca27502790cdf4593", + "version_major": 2, + "version_minor": 0 + }, + "text/plain": [ + "Output()" + ] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "\n" + ], + "text/plain": [] + }, + "metadata": {}, + "output_type": "display_data" + }, + { + "data": { + "text/html": [ + "
\n", + "\n" + ], + "text/plain": [ + "\n" + ] + }, + "metadata": {}, + "output_type": "display_data" + } + ], + "source": [ + "import random\n", + "\n", + "# Iterate over the samples in the dataset\n", + "records = []\n", + "for example in dataset:\n", + " \n", + " # Add the records to the FeedbackDataset\n", + " record = rg.FeedbackRecord(\n", + " fields={\n", + " \"instruction\": example[\"instructions\"],\n", + " \"generation_1\": example[\"response_model_1\"],\n", + " \"generation_2\": example[\"response_model_2\"]\n", + " },\n", + " metadata={\n", + " # we randomly assign a record/task to the annotators\n", + " \"annotator-groups\": random.choice(annotators), \n", + " \"source-dataset\": \"bigcode/self-oss-instruct-sc2-exec-filter-50k\"\n", + " }\n", + " )\n", + " \n", + " # Optional: add prefilled suggestion\n", + " # you can use this to fill Questions with suggestions from an LLM-as-a-judge system\n", + " # to further speed up manual annotation\n", + " #record.suggestions = [\n", + " # {\n", + " # \"question_name\": \"score_response_1\",\n", + " # \"value\": example[\"llm_judge_rating\"],\n", + " # \"agent\": \"llama-3-70b-instruct\"\n", + " # },\n", + " #]\n", + " \n", + " try:\n", + " dataset_argilla.add_records(record, show_progress=True)\n", + " except Exception as e:\n", + " print(\"Exception:\", e)\n" + ] + }, + { + "cell_type": "markdown", + "id": "e6488c2f-d30c-46ad-af7f-15cfc8b2baee", + "metadata": {}, + "source": [ + "**The final annotation interface** will look similar to this:\n", + "\n", + " " + ] + }, + { + "cell_type": "markdown", + "id": "56980744-2394-41e1-b004-89c137afdf5d", + "metadata": { + "tags": [] + }, + "source": [ + "**Assign tasks to annotators**: Argilla supports assigning tasks to multiple users/annotators. There are different ways of implementing task assignments, [documented here](https://docs.argilla.io/en/latest/practical_guides/assign_records.html). For this tutorial, we use the simplest metadata method, where everyone has access to the same full dataset and all annotations (via the `annotators` variable created above). To access the annotations assigned to them, an annotator then needs to use the `Metadata` filter in the interface to filter the data to only see records assigned to them (see image below). For larger teams and to get multiple annotations for the same record, it is better to use other task assignment methods. \n", + "\n", + "" + ] + }, + { + "cell_type": "markdown", + "id": "c998cd39-9a5e-4554-b577-fac62bd3bfe6", + "metadata": {}, + "source": [ + "## Annotate" + ] + }, + { + "cell_type": "markdown", + "id": "1501558a-0c96-4b01-9a25-b0c6d6903d68", + "metadata": {}, + "source": [ + "That's it, we've created our custom data annotation interface with Argilla and we can now start annotating! \n", + "\n", + "\n", + "**Important**: If you use Argilla in a HF Space, you need to activate persistent storage so that your data is safely stored and not automatically deleted after a while. For production settings, make sure that persistent storage is activated **before** making any annotations to avoid data loss. " + ] + }, + { + "cell_type": "markdown", + "id": "a34e3e51-f68f-4980-89e6-d7fb6435109f", + "metadata": {}, + "source": [ + "## Download annotated data\n", + "After annotating, you can pull the data from Argilla and simply store and process them locally in any tabular format (see [docs here](https://docs.argilla.io/en/latest/practical_guides/export_dataset.html)). 
You can also download a filtered version of the dataset ([docs](https://docs.argilla.io/en/latest/tutorials_and_integrations/tutorials/feedback/end2end_examples/filter-and-query-008.html))."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "c12858d4-c1bc-4750-bed2-b84f9ed3afe9",
   "metadata": {
    "tags": []
   },
   "outputs": [
    {
     "data": {
      "text/html": [
\n", + " | instruction | \n", + "generation_1 | \n", + "generation_2 | \n", + "score_response_1 | \n", + "score_response_1-suggestion | \n", + "score_response_1-suggestion-metadata | \n", + "score_response_2 | \n", + "score_response_2-suggestion | \n", + "score_response_2-suggestion-metadata | \n", + "which_response_corrected | \n", + "which_response_corrected-suggestion | \n", + "which_response_corrected-suggestion-metadata | \n", + "correction | \n", + "correction-suggestion | \n", + "correction-suggestion-metadata | \n", + "comments | \n", + "comments-suggestion | \n", + "comments-suggestion-metadata | \n", + "external_id | \n", + "metadata | \n", + "
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | \n", + "Write a Python function named `get_value` that... | \n", + "Here's a Python function that meets your requi... | \n", + "Here is a Python function that does what you d... | \n", + "[] | \n", + "NaN | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "[] | \n", + "NaN | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "[] | \n", + "None | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "[] | \n", + "None | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "[] | \n", + "None | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "None | \n", + "{\"annotator-groups\": \"annotator-2\", \"source-da... | \n", + "
1 | \n", + "Write a Python function `check_collision` that... | \n", + "Here's a Python function `check_collision` tha... | \n", + "Here is a Python function that checks for coll... | \n", + "[] | \n", + "NaN | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "[] | \n", + "NaN | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "[] | \n", + "None | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "[] | \n", + "None | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "[] | \n", + "None | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "None | \n", + "{\"annotator-groups\": \"annotator-3\", \"source-da... | \n", + "
2 | \n", + "Create a Python function to serialize and dese... | \n", + "Here's a Python function that serializes and d... | \n", + "Here is an example of a Python function that s... | \n", + "[] | \n", + "NaN | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "[] | \n", + "NaN | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "[] | \n", + "None | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "[] | \n", + "None | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "[] | \n", + "None | \n", + "{'type': None, 'score': None, 'agent': None} | \n", + "None | \n", + "{\"annotator-groups\": \"annotator-1\", \"source-da... | \n", + "