
LLM-as-a-judge cookbook #61

Merged · 19 commits · Mar 19, 2024

Conversation

aymeric-roucher (Collaborator)

What does this PR do?

Add a prompting notebook detailing how to build an LLM-as-a-judge, how to enforce constrained generation, and how asking too much at once in a prompt can reduce performance.

@MKhalusova I've done the first two parts already: LLM-as-a-judge and constrained generation.
I'm wondering if LLM-as-a-judge should go in a separate notebook, given that this topic attracts a lot of interest and takes a lot of room simply to compare results between different prompts?
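For context, the core pattern the notebook builds on looks roughly like this (a minimal sketch, not the notebook's exact code; the model name, prompt wording, and rating scale are placeholders):

```python
from huggingface_hub import InferenceClient

# Hypothetical setup: any instruction-tuned model served on the HF Inference API
# can stand in as the judge; this model name is just an example.
client = InferenceClient(model="mistralai/Mixtral-8x7B-Instruct-v0.1")

JUDGE_PROMPT = """You will be given a question and an answer.
Rate how well the answer addresses the question, on a scale of 1 to 4:
1 means the answer is not helpful at all, 4 means it is fully correct and complete.
Answer strictly in the format 'Rating: <number>'.

Question: {question}
Answer: {answer}
"""

def judge(question: str, answer: str) -> str:
    # The judge returns free text; the caller parses the rating out of it.
    return client.text_generation(
        JUDGE_PROMPT.format(question=question, answer=answer),
        max_new_tokens=100,
    )

print(judge("What is the capital of France?", "Paris."))
```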

Check out this pull request on ReviewNB to see visual diffs & provide feedback on Jupyter Notebooks.

@MKhalusova (Contributor)

@aymeric-roucher I think it's a good idea to split this into two notebooks. Both LLM-as-a-judge and constrained generation are great topics in themselves. I would also suggest having them in two different PRs so we can merge and communicate about them separately.

@aymeric-roucher (Collaborator, Author)

@MKhalusova I did the split; the notebook now only contains the LLM-as-a-judge part.
And it is finished, so this is ready for review! 😃

@aymeric-roucher aymeric-roucher changed the title [Draft] Prompting Prompting cookbook Mar 12, 2024
@aymeric-roucher aymeric-roucher changed the title Prompting cookbook LLM-as-a-judge cookbook Mar 12, 2024
@@ -28,3 +28,6 @@
title: Prompt tuning with PEFT
- local: semantic_cache_chroma_vector_database
title: Implementing semantic cache to improve a RAG system.
- local: llm_judge
title: LLM-as-a-judge 🧑‍⚖️ - Building an automated and versatile evaluation system
@MKhalusova (Contributor)

Let's not use emojis in the titles in yaml, it can break the build. Also, if possible, let's make this title a bit shorter.

@aymeric-roucher (Collaborator, Author)

Thanks! Also, I was wondering: does ':' in the title break the build as well? (given that there is already a ':' after 'title')
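As a side note, the colon question is easy to check locally. A minimal sketch, assuming PyYAML is installed; the title strings are made up:

```python
import yaml  # PyYAML

# Quoting the value makes the inner ': ' harmless.
print(yaml.safe_load('title: "LLM-as-a-judge: an evaluation system"'))

# Unquoted, the second ': ' is read as a nested mapping and the parse fails.
try:
    yaml.safe_load("title: LLM-as-a-judge: an evaluation system")
except yaml.YAMLError as err:
    print("parse error:", err)
```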

@@ -0,0 +1,744 @@
{
@MKhalusova (Contributor) · Mar 13, 2024

missing a word here: "But we'll see that it will not well out-of-the-box: you need to set it up carefully for good results." => ? "But we'll see that it will not work well out-of-the-box: you need to set it up carefully for good results."



@aymeric-roucher (Collaborator, Author)

Done!

@@ -0,0 +1,744 @@
{
@MKhalusova (Contributor) · Mar 13, 2024

Let's remove this output, it doesn't add much value.



@@ -0,0 +1,744 @@
{
@MKhalusova (Contributor) · Mar 13, 2024

Before the setup, it would be helpful to outline the problem or the use case. E.g., we want to set up an LLM-as-a-judge to be nearly as good as our human reviewers, because we want to scale the reviews. Let's say we already have a collection of human reviews for question-answering (introduce the dataset here, add a link to it) as a guide. Let's start by checking the baseline performance we would expect from the LLM; here it can be...

Something along these lines.

It might also be good to rename the title to something more descriptive, e.g. "Problem setup", "Using Human feedback to guide the LLM-as-a-judge", etc.



@aymeric-roucher (Collaborator, Author)

Thank you, good idea! I changed the title and expanded the explanations for this part.

@@ -0,0 +1,744 @@
{
@MKhalusova (Contributor) · Mar 13, 2024

Why did we initially have the scale between 0 and 10, and now it's 1 to 4? Shouldn't the range be the same as in the dataset? And could the difference in scale have affected the initial performance?



@aymeric-roucher (Collaborator, Author)

Since we use Pearson correlation to compute the performance, the results should not be affected by a rescaling. But what we show here is that smaller integer scales work better!
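As a quick sanity check of the scale-invariance claim, here is a minimal sketch with made-up scores, assuming numpy and scipy are available:

```python
import numpy as np
from scipy.stats import pearsonr

human = np.array([1, 3, 2, 4, 4, 1, 2, 3])                # ratings on a 1-4 scale
llm = np.array([1.5, 2.5, 2.0, 4.0, 3.5, 1.0, 2.5, 3.0])  # judge's scores

# Pearson correlation is invariant under positive linear rescaling,
# so mapping the 1-4 human scale onto 0-10 leaves r unchanged.
rescaled = (human - 1) * (10 / 3)
print(pearsonr(human, llm)[0])     # some correlation value r
print(pearsonr(rescaled, llm)[0])  # exactly the same r
```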

@@ -0,0 +1,744 @@
{
@MKhalusova (Contributor) · Mar 13, 2024

Line #25.    Provide your feedback. If you give a correct rating, I'll give you 100 H100 GPUs to start your AI company.

Lol!



@@ -0,0 +1,744 @@
{
@MKhalusova (Contributor) · Mar 13, 2024

Perhaps it would be useful to give some guidance on using LLM-as-a-judge in production: some basic dos and don'ts. E.g., should it be used as a replacement for human evaluators, or as assistance for them? How should one monitor LLM-as-a-judge? (If the questions drift a lot, will that affect the system?) And so on.



@MKhalusova (Contributor)

Love your style! Very engaging and informative. I left a few suggestions. One important thing to fix: remove the emoji in the yaml file. Hopefully that will turn the build green.
I also recommend adding the notebook to the list of recent notebooks in the index.md file (feel free to place it at the top of the list there, as it's the most recent addition).

@aymeric-roucher (Collaborator, Author)

@MKhalusova I've made changes for all your comments!

The tests are still broken though, and I don't know why, since the error is not really explicit (it's not related to the emoji in the title; I tested without it and it still broke).

@aymeric-roucher (Collaborator, Author)

The error is 'ReferenceError: DocNotebookDropdown is not defined'

@MKhalusova (Contributor)

> The error is 'ReferenceError: DocNotebookDropdown is not defined'

I think we need @mishig25 to help us understand why the build_pr_documentation fails here.

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@MKhalusova (Contributor)

I see that you've resolved the CI/CD issue. Congrats! Can I take another look or is it still WIP?

@aymeric-roucher (Collaborator, Author)

You can take a look @MKhalusova! I've just solved the CI/CD error using the dumbest method ever: deleting parts of the notebook until the issue disappeared, then restoring them one by one.

@@ -0,0 +1,716 @@
{
@MKhalusova (Contributor) · Mar 19, 2024

A small typo: "This means that our "ground truth" contains noise: hence we cannot expect any algorithmic evaluation to come that close to it  a  However, we could reduce a bit this noise: " =>

"This means that our "ground truth" contains noise: hence we cannot expect any algorithmic evaluation to come that close to it. However, we could reduce this noise a bit: "



@@ -0,0 +1,716 @@
{
@MKhalusova (Contributor) · Mar 19, 2024

typo: independent



@MKhalusova (Contributor) left a comment

Found two typos. Otherwise looks good to merge :)

@aymeric-roucher aymeric-roucher merged commit 5bcfd4b into huggingface:main Mar 19, 2024
1 check passed