This repository contains the code, dataset, and results of our bachelor thesis at Linnaeus University on the automatic evaluation of chatbots using LLM-as-a-Judge. The authors are Vilgot Lundborg and Yuyao Duan.

The program automatically evaluates the correctness of chatbot answers to a set of questions in history and biology, using another LLM as the judge. The chatbots whose answers are evaluated are Llama 3 70B, ChatGPT 4, and Gemini Advanced; the judge is GPT-4o, accessed through the OpenAI API. GPT-4o is instructed to grade each answer on three parameters: relevance, completeness, and clarity, each on a scale from 1 to 5, together with an explanation. An overall grade is also calculated as the average of the three grades. The results are stored in JSON format in the `results` folder.
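For illustration, below is a minimal sketch of what a single LLM-as-a-Judge grading call might look like, assuming the official `openai` Python client (v1.x). The prompt wording, model parameters, and result fields are illustrative assumptions, not the exact thesis code.

```python
# Hedged sketch of grading one chatbot answer with GPT-4o as the judge.
# Prompt text, parameters, and key names are assumptions for illustration.
import json
from openai import OpenAI

client = OpenAI(api_key="YOUR_OPENAI_API_KEY")

def grade_answer(question: str, answer: str) -> dict:
    """Ask GPT-4o to grade an answer on relevance, completeness, and clarity (1-5)."""
    prompt = (
        "Grade the following answer to the question on three parameters: "
        "relevance, completeness, and clarity, each from 1 to 5, and give a short "
        "explanation. Respond as JSON with the keys 'relevance', 'completeness', "
        "'clarity', and 'explanation'.\n\n"
        f"Question: {question}\nAnswer: {answer}"
    )
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        response_format={"type": "json_object"},
    )
    grades = json.loads(response.choices[0].message.content)
    # The overall grade is the average of the three individual grades.
    grades["overall"] = round(
        (grades["relevance"] + grades["completeness"] + grades["clarity"]) / 3, 2
    )
    return grades
```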
- Install the required libraries using `pip install openai pandas`.
- Add your OpenAI API key as the parameter to the evaluator on line 19 in `main.py`.
- Run `main.py`. A rough sketch of the pipeline is shown below.
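The sketch below gives a rough picture of an end-to-end run, reusing the `grade_answer` sketch above. The input file name and column names are assumptions for illustration; the actual pipeline lives in `main.py`.

```python
# Hypothetical end-to-end run: read chatbot answers with pandas and store the
# grades as JSON in the results folder. File and column names are assumed.
import json
import pandas as pd

answers = pd.read_csv("answers.csv")  # assumed columns: question, answer
results = [
    grade_answer(row["question"], row["answer"]) for _, row in answers.iterrows()
]
with open("results/grades.json", "w", encoding="utf-8") as f:
    json.dump(results, f, indent=2)
```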