LLM-as-a-judge cookbook #61
Conversation
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.
@aymeric-roucher I think it's a good idea to split this into two notebooks. Both LLM-as-a-judge and constrained generation are great topics in themselves. I would also suggest having them in two different PRs so we can merge and communicate about them separately.
@MKhalusova I did the split, the notebook now only contains the LLM-as-a-judge.
notebooks/en/_toctree.yml (outdated)
@@ -28,3 +28,6 @@
  title: Prompt tuning with PEFT
- local: semantic_cache_chroma_vector_database
  title: Implementing semantic cache to improve a RAG system.
- local: llm_judge
  title: LLM-as-a-judge 🧑‍⚖️ - Building an automated and versatile evaluation system
Let's not use emojis in the titles in the YAML; they can break the build. Also, if possible, let's make this title a bit shorter.
Thanks! Also I was wondering, does ':' in the title break the build as well? (Given that there is already a ':' after 'title'.)
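(For the record: at the plain-YAML level, an unquoted colon followed by a space inside the value does break parsing, while quoting the title fixes it. A minimal PyYAML check, not part of the PR:)

```python
import yaml

# An unquoted "title: foo: bar" is invalid YAML: the second ": " starts a
# nested mapping where none is allowed.
try:
    yaml.safe_load("title: LLM-as-a-judge: an automated evaluation system")
except yaml.YAMLError as err:
    print(f"Unquoted colon fails: {type(err).__name__}")

# Quoting the value makes the same title parse fine.
parsed = yaml.safe_load('title: "LLM-as-a-judge: an automated evaluation system"')
print(parsed["title"])
```

So quoting the title should be enough at the YAML level; whether the doc build imposes extra constraints on top of that is a separate question.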
missing a word here: "But we'll see that it will not well out-of-the-box: you need to set it up carefully for good results." => ? "But we'll see that it will not work well out-of-the-box: you need to set it up carefully for good results."
Done!
Before the setup it would be helpful to outline the problem or the use case. E.g. we want to set up an LLM-as-a-judge to be nearly as good as our human reviewers, because we want to scale the reviews. Let's say we already have a collection of human reviews for question-answering (introduce the dataset here, add a link to it) as a guide. Let's start by checking the baseline performance we would expect from the LLM; here it can be...
Something along these lines.
It might also be good to rename the title to something more descriptive, e.g. "Problem setup", "Using Human feedback to guide the LLM-as-a-judge", etc.
Thank you, good idea! I changed the title, and developed explanations for this part.
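(For illustration, the kind of baseline check that section could open with; a minimal sketch where the dataset name and column names are placeholders, not the ones used in the notebook:)

```python
from datasets import load_dataset

# Sketch of the suggested problem setup (dataset and column names are
# placeholders): load question-answer pairs that were each rated by two
# human reviewers, to serve as ground truth for the LLM judge.
ratings = load_dataset("your-org/human-rated-qa", split="train")
df = ratings.to_pandas()

# Baseline: inter-rater agreement between the two humans. Their Pearson
# correlation is roughly the ceiling an LLM judge can aim for.
print(df[["rater_1_score", "rater_2_score"]].corr(method="pearson"))
```

Comparing the LLM judge's scores against this inter-rater baseline then shows how much headroom is actually left.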
Why did we initially have the scale between 0 and 10, and now it's 1 to 4? Shouldn't the range be the same as in the dataset? And could the difference in scale have affected the initial performance?
Since we use Pearson correlation to compute the performance, the results should not be affected by a rescaling. But what we show here is that smaller integer scales tend to work better!
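(A quick illustration of that invariance, with made-up scores:)

```python
import numpy as np
from scipy.stats import pearsonr

# Pearson correlation is invariant under positive linear rescaling, so mapping
# judge scores from a 1-10 scale onto 1-4 leaves the correlation unchanged.
human = np.array([1, 3, 2, 4, 4, 2, 1, 3])
judge_1_to_10 = np.array([2, 7, 5, 9, 10, 6, 3, 8])
judge_1_to_4 = (judge_1_to_10 - 1) / 3 + 1  # linear map: 1 -> 1, 10 -> 4

r_10, _ = pearsonr(human, judge_1_to_10)
r_4, _ = pearsonr(human, judge_1_to_4)
print(r_10, r_4)  # identical values
```

What does change with the scale is the LLM's behavior when producing the ratings, which is where the smaller integer scale helps.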
Line #25. Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Lol!
Perhaps it would be useful to give some guidance on using LLM-as-a-judge in production - some basic dos and don'ts. E.g. should it be used as a replacement for human evaluators, or as assistance for them? How should one monitor LLM-as-a-judge? (If the questions drift a lot, will that affect the system?) And so on.
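(One hypothetical shape for such monitoring, sketched here rather than taken from the notebook: keep routing a small sample of items to human reviewers and track the judge's agreement with them over time. The threshold and function below are illustrative assumptions.)

```python
from scipy.stats import pearsonr

AGREEMENT_FLOOR = 0.7  # assumed threshold; calibrate against your own baseline

def judge_still_agrees(human_scores: list[float], judge_scores: list[float]) -> bool:
    """Spot-check the LLM judge against fresh human labels and flag drift."""
    r, _ = pearsonr(human_scores, judge_scores)
    if r < AGREEMENT_FLOOR:
        print(f"Judge/human correlation dropped to {r:.2f}; re-audit the judge.")
        return False
    return True
```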
Love your style! Very engaging and informative; I left a few suggestions. One important thing to fix: remove the emoji in the YAML file. Hopefully that'll help turn the build green.
@MKhalusova I made changes for all your comments! The tests are still broken though, and I don't know why, since the error is not really explicit (it's not related to the emoji in the title; I tested without it and it still broke).
The error is 'ReferenceError: DocNotebookDropdown is not defined'
I think we need @mishig25 to help us understand why the build is failing.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I see that you've resolved the CI/CD issue. Congrats! Can I take another look or is it still WIP?
You can take a look @MKhalusova! I've just solved the CI/CD error using the dumbest method ever: deleting parts of the notebook until the issue disappeared, then restoring them one by one.
A small typo: "This means that our "ground truth" contains noise: hence we cannot expect any algorithmic evaluation to come that close to it a However, we could reduce a bit this noise: " =>
"This means that our "ground truth" contains noise: hence we cannot expect any algorithmic evaluation to come that close to it. However, we could reduce this noise a bit: "
Found two typos. Otherwise looks good to merge :)
What does this PR do?
Add a prompting notebook detailing how to build an LLM-as-a-judge, how to enforce constrained generation, and how asking too much at once in a prompt can reduce performance.
@MKhalusova I've done the first two parts already: LLM-as-a-judge and Constrained generation. I'm wondering if LLM-as-a-judge should go in a separate notebook, given that this topic has a lot of interest and it takes a lot of room to simply compare results between different prompts?