LLM-as-a-judge cookbook #61
Conversation
Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.
@aymeric-roucher I think it's a good idea to split this into two notebooks. Both LLM-as-a-judge and constrained generation are great topics in themselves. I would also suggest having them in two different PRs so we can merge and communicate about them separately.
@MKhalusova I did the split, the notebook now only contains the LLM-as-a-judge.
notebooks/en/_toctree.yml (outdated)
@@ -28,3 +28,6 @@
  title: Prompt tuning with PEFT
- local: semantic_cache_chroma_vector_database
  title: Implementing semantic cache to improve a RAG system.
- local: llm_judge
  title: LLM-as-a-judge 🧑‍⚖️ - Building an automated and versatile evaluation system
Let's not use emojis in the titles in the YAML; they can break the build. Also, if possible, let's make this title a bit shorter.
Thanks! Also I was wondering, does ':' in the title break the build as well? (Given that there is already a ':' after 'title'.)
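(For the record: at the plain-YAML level, an unquoted colon followed by a space inside the value does break parsing, while quoting the title fixes it. A minimal PyYAML check, not part of the PR:)

```python
import yaml

# An unquoted "title: foo: bar" is invalid YAML: the second ": " starts a
# nested mapping where none is allowed.
try:
    yaml.safe_load("title: LLM-as-a-judge: an automated evaluation system")
except yaml.YAMLError as err:
    print(f"Unquoted colon fails: {type(err).__name__}")

# Quoting the value makes the same title parse fine.
parsed = yaml.safe_load('title: "LLM-as-a-judge: an automated evaluation system"')
print(parsed["title"])
```

So quoting the title should be enough at the YAML level; whether the doc build imposes extra constraints on top of that is a separate question.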
missing a word here: "But we'll see that it will not well out-of-the-box: you need to set it up carefully for good results." => ? "But we'll see that it will not work well out-of-the-box: you need to set it up carefully for good results."
Done!
Before the setup it would be helpful to outline the problem or the use case. E.g. we want to set up an LLM-as-a-judge to be nearly as good as our human reviewers, because we want to scale the reviews. Let's say we already have a collection of human reviews for question-answering (introduce the dataset here, add a link to it) as a guide. Let's start by checking the baseline performance we would expect from the LLM; here it can be...
Something along these lines.
It might also be good to rename the title to something more descriptive, e.g. "Problem setup", "Using Human feedback to guide the LLM-as-a-judge", etc.
Thank you, good idea! I changed the title, and developed explanations for this part.
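(For illustration, the kind of baseline check that section could open with; a minimal sketch where the dataset name and column names are placeholders, not the ones used in the notebook:)

```python
from datasets import load_dataset

# Sketch of the suggested problem setup (dataset and column names are
# placeholders): load question-answer pairs that were each rated by two
# human reviewers, to serve as ground truth for the LLM judge.
ratings = load_dataset("your-org/human-rated-qa", split="train")
df = ratings.to_pandas()

# Baseline: inter-rater agreement between the two humans. Their Pearson
# correlation is roughly the ceiling an LLM judge can aim for.
print(df[["rater_1_score", "rater_2_score"]].corr(method="pearson"))
```

Comparing the LLM judge's scores against this inter-rater baseline then shows how much headroom is actually left.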
Why did we initially have the scale between 0 and 10, and now it's 1 to 4? Shouldn't the range be the same as in the dataset? And could the difference in scale have affected the initial performance?
Since we use Pearson correlation to compute the performance, the results should not be affected by a rescaling. But what we show here is that smaller integer scales tend to work better!
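(A quick illustration of that invariance, with made-up scores:)

```python
import numpy as np
from scipy.stats import pearsonr

# Pearson correlation is invariant under positive linear rescaling, so mapping
# judge scores from a 1-10 scale onto 1-4 leaves the correlation unchanged.
human = np.array([1, 3, 2, 4, 4, 2, 1, 3])
judge_1_to_10 = np.array([2, 7, 5, 9, 10, 6, 3, 8])
judge_1_to_4 = (judge_1_to_10 - 1) / 3 + 1  # linear map: 1 -> 1, 10 -> 4

r_10, _ = pearsonr(human, judge_1_to_10)
r_4, _ = pearsonr(human, judge_1_to_4)
print(r_10, r_4)  # identical values
```

What does change with the scale is the LLM's behavior when producing the ratings, which is where the smaller integer scale helps.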
Line #25. Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.
Lol!
Perhaps it would be useful to give some guidance on using LLM-as-a-judge in production - some basic dos and don'ts. E.g. should it be used as a replacement for human evaluators, or as assistance for them? How should one monitor LLM-as-a-judge? (If the questions drift a lot, will that affect the system?) And so on.
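(One hypothetical shape for such monitoring, sketched here rather than taken from the notebook: keep routing a small sample of items to human reviewers and track the judge's agreement with them over time. The threshold and function below are illustrative assumptions.)

```python
from scipy.stats import pearsonr

AGREEMENT_FLOOR = 0.7  # assumed threshold; calibrate against your own baseline

def judge_still_agrees(human_scores: list[float], judge_scores: list[float]) -> bool:
    """Spot-check the LLM judge against fresh human labels and flag drift."""
    r, _ = pearsonr(human_scores, judge_scores)
    if r < AGREEMENT_FLOOR:
        print(f"Judge/human correlation dropped to {r:.2f}; re-audit the judge.")
        return False
    return True
```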
Love your style! Very engaging and informative; I left a few suggestions. One important thing to fix: remove the emoji in the YAML file. Hopefully that'll help turn the build green.
@MKhalusova I made changes for all your comments! The tests are still broken though, and I don't know why, since the error is not really explicit (it's not related to the emoji in the title; I tested without it and it still broke).
The error is 'ReferenceError: DocNotebookDropdown is not defined'
I think we need @mishig25 to help us understand why the build is failing.
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
I see that you've resolved the CI/CD issue. Congrats! Can I take another look or is it still WIP?
You can take a look @MKhalusova! I've just solved the CI/CD error using the dumbest method ever: deleting parts of the notebook until the issue disappeared, then restoring them one by one.
A small typo: "This means that our "ground truth" contains noise: hence we cannot expect any algorithmic evaluation to come that close to it a However, we could reduce a bit this noise: " =>
"This means that our "ground truth" contains noise: hence we cannot expect any algorithmic evaluation to come that close to it. However, we could reduce this noise a bit: "
Found two typos. Otherwise looks good to merge :)
What does this PR do?
Add a prompting notebook detailing how to build an LLM-as-a-judge, how to enforce constrained generation, and how asking too much at once in a prompt can reduce performance.
@MKhalusova I've done the first two parts already: LLM-as-a-judge and Constrained generation. I'm wondering if LLM-as-a-judge should go in a separate notebook, given that this topic has a lot of interest and it takes a lot of room to simply compare results between different prompts?