Dataset Issues for RAG Guide #111

Open
yildize opened this issue Nov 8, 2024 · 1 comment

yildize commented Nov 8, 2024

Hello there, I was examining the RAG guide:
https://github.com/anthropics/anthropic-cookbook/blob/main/skills/retrieval_augmented_generation/guide.ipynb

In particular, I was trying to understand the evaluation logic and the evaluation dataset the guide uses.

As I continued my examination, I noticed some problems with the labeling. Here is an example:

An Example of False Positive Labeling

Eval Item
{
  "id": "efc09699",
  "question": "How can you create multiple test cases for an evaluation in the Anthropic Evaluation tool?",
  "correct_chunks": [
    "https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool#creating-test-cases",
    "https://docs.anthropic.com/en/docs/build-with-claude/develop-tests#building-evals-and-test-cases"
  ],
  "correct_answer": "To create multiple test cases in the Anthropic Evaluation tool, click the 'Add Test Case' button, fill in values for each variable in your prompt, and repeat the process to create additional test case scenarios."
}

Corresponding Positively Labeled Document
{
  "chunk_link": "https://docs.anthropic.com/en/docs/build-with-claude/develop-tests#building-evals-and-test-cases",
  "chunk_heading": "Building evals and test cases",
  "text": "Building evals and test cases\n\n\n"
}

Problem
As you can see, although this positively labeled document shares some keywords with the question, its text is just a bare heading and is clearly not an answer to the question. I am afraid there are other similar examples. I believe this affects both the retrieval performance evaluation and the end-to-end performance evaluation.
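
To make the impact on the retrieval metrics concrete, here is a minimal sketch (my own illustration, not code from the guide) of precision/recall computed against the labeled correct_chunks. A retriever that rightly skips the bare-heading chunk above is still penalized on recall, because the label claims that chunk is relevant:

def retrieval_metrics(retrieved_links, correct_links):
    # Precision/recall of retrieved chunk links against labeled ground truth.
    retrieved, correct = set(retrieved_links), set(correct_links)
    hits = retrieved & correct
    precision = len(hits) / len(retrieved) if retrieved else 0.0
    recall = len(hits) / len(correct) if correct else 0.0
    return precision, recall

correct_chunks = [
    "https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool#creating-test-cases",
    "https://docs.anthropic.com/en/docs/build-with-claude/develop-tests#building-evals-and-test-cases",
]

# Suppose a retriever returns only the chunk that actually answers the question:
retrieved = ["https://docs.anthropic.com/en/docs/test-and-evaluate/eval-tool#creating-test-cases"]

precision, recall = retrieval_metrics(retrieved, correct_chunks)
print(f"precision={precision:.2f} recall={recall:.2f}")  # precision=1.00 recall=0.50

So the better behavior scores worse on recall than a retriever that also returns the empty heading chunk.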

My Question
May I ask how exactly this eval dataset was obtained? I first assumed it was human-generated, but perhaps it is not? I am also wondering whether there can be any "false negative labeled documents", i.e. documents that are actually relevant to the question but are not labeled as positive.

Thank you in advance for your response. :)

yildize (Author) commented Nov 27, 2024

FYI, further examination showed that, in addition to false positive labels, there are also false negative labels (passages that are actually relevant to the question but not labeled as such). So I have serious doubts about whether improving scores on this eval set actually indicates a better retrieval method.
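
For anyone who wants to reproduce this kind of check, below is a hedged sketch of one way candidate false negatives could be surfaced: embed each question, then flag unlabeled chunks whose similarity to it is high. This is my own approach, not how the dataset was audited; it assumes the JSON shapes shown above, uses sentence-transformers purely as an example embedding model, and the 0.6 threshold is arbitrary:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def candidate_false_negatives(eval_item, chunks, threshold=0.6):
    # Unlabeled chunks whose cosine similarity to the question exceeds threshold.
    q_emb = model.encode(eval_item["question"], convert_to_tensor=True)
    labeled = set(eval_item["correct_chunks"])
    candidates = []
    for chunk in chunks:
        if chunk["chunk_link"] in labeled:
            continue
        c_emb = model.encode(chunk["text"], convert_to_tensor=True)
        score = util.cos_sim(q_emb, c_emb).item()
        if score >= threshold:
            candidates.append((score, chunk["chunk_link"]))
    return sorted(candidates, reverse=True)  # most similar first

High-scoring candidates still need a manual read before relabeling, since keyword overlap alone can be misleading, exactly as the false positive above shows.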
