Paper title | Authors | Affiliation | Published | Code | Summary | Comments | Languages | Tool |
---|---|---|---|---|---|---|---|---|
OpenFactCheck: A Unified Framework for Factuality Evaluation of LLMs | Iqbal, H., Wang, Y., Wang, M., Georgiev, G., Geng, J., Gurevych, I., & Nakov, P. | MBZUAI (Mohamed bin Zayed University of Artificial Intelligence) | 2024-08 | https://github.com/mbzuai-nlp/openfactcheck | OpenFactCheck has 3 modules: RESPONSEEVAL (customize a fact-checking system and assess the factuality of all claims in an input document), LLMEVAL (assess the overall factuality of an LLM), CHECKEREVAL (evaluate automatic fact-checking systems). | They created two datasets: FactQA (6480 questions) and FactBench (4507 claims). | English, Urdu | OpenFactCheck |
Loki: An Open-Source Tool for Fact Verification | Li, H., Han, X., Wang, H., Wang, Y., Wang, M., Xing, R., ... & Baldwin | LibrAI, MBZUAI, Monash University, The University of Melbourne | 2024-10 | https://github.com/Libr-AI/OpenFactVerification | Demo: https://loki.librai.tech/ | | Multilingual | Loki |
FactScore | ||||||||
| | | | | | A black-box hallucination detection method that relies solely on stochastic sampling of model responses. The core intuition is that factually accurate responses are typically consistent and frequent, whereas hallucinated outputs tend to vary and contradict each other. | https://www.comet.com/site/blog/selfcheckgpt-for-llm-evaluation/ | | SelfCheckGPT |
Long-form factuality in large language models | | | | | | | | LongForm SAFE |
| | | | | | | Not open-source | | Perplexity fact checker |
Hallucination to Truth: A Review of Fact-Checking and Factuality Evaluation in Large Language Models | Rahman, S. S., Islam, M. A., Alam, M. M., Zeba, M., Rahman, M. A., Chowa, S. S., ... & Azam, S. | United International University (Bangladesh), Daffodil International University (Bangladesh), Charles Darwin University (Australia) | 2025-08 | | | | | |
FactTest: Factuality Testing in Large Language Models with Finite-Sample and Distribution-Free Guarantees | Nie, F., Hou, X., Lin, S., Zou, J., Yao, H., & Zhang, L. | Stanford University, Rutgers University, UNC-Chapel Hill | 2024-11 | | Used to "finetune" models so that they do not answer when the answer is likely to be false. | | | |
Seq vs Seq: An Open Suite of Paired Encoders and Decoders | | | | | TinyLettuce is used to build a dataset consisting of hallucinated and correct responses. *"The Problem: Training robust hallucination detection models requires large datasets of both correct and hallucinated responses. Manually creating such datasets is expensive and time-consuming. Our Solution: LettuceDetect's synthetic data generation pipeline can generate realistic hallucinations from factual content."* | | | |
Hallucination Risk Calculator & Prompt Re‑engineering Toolkit (OpenAI‑only) | | | | https://hassana.io/readme.html | Calculates the risk of hallucination for a given prompt. Basically just an entropy calculation? | The problem is: which prompts should we supply? | | |
(Im)possibility of Automated Hallucination Detection in Large Language Models | | | | | Detection is not possible if the detector is trained only on correct samples (duh). | | | |
HaluEval: A Large-Scale Hallucination Evaluation Benchmark for Large Language Models | | | | https://github.com/RUCAIBox/HaluEval | | Many citations | | |
Paper title | Authors | Affiliation | Date | Code | Summary/comments | Dataset |
---|---|---|---|---|---|---|
Snowball | ||||||
SelfAware | ||||||
FreshQA | ||||||
FacTool | ||||||
FELM | ||||||
Factcheck-Bench | ||||||
FactScore-Bio | ||||||
| | | | | | Human annotations | LLM-AGGREFACT |
| | | | | | Binary error detection | ReaLMistake |
| | | | | | Computes the ratio of factually supported sentences to the total response (see the sketch after this table) | LEAF Fact-check Score |
| | | | | | Measures the overlap between human-used and model-used knowledge | Knowledge F1 |
| | | | | | Evaluates how much original content remains intact after hallucination correction | Preservation score |
| | | | | | HotpotQA is a question-answering dataset that requires multi-hop reasoning over multiple paragraphs of information to produce an answer. A successful model must not only report answers as yes/no or a span within the text but also identify supporting facts. | HotpotQA |
SimpleQA | ||||||
| | | | | | Possibly not public/open. | PersonQA |
TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness | Danna Zheng, Danyang Liu, Mirella Lapata, Jeff Z. Pan | University of Edinburgh, Huawei Edinburgh Research Centre | | | | TrustScore |
Know What You Don't Know: Unanswerable Questions for SQuAD | | | 2018-11 | https://rajpurkar.github.io/SQuAD-explorer/ | Many citations | SQuAD |
| | | | | | FactScore is an automatic evaluation metric for factual precision in long-form text generation. It uses large language models and retrieval to break down generations into atomic facts and then measures their correctness with respect to a knowledge source (like Wikipedia). | FactScore |
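The metric rows above are terse, so as a concrete reference, here is a minimal sketch of the two simplest ones: a FactScore/LEAF-style precision ratio (supported facts over total facts) and a Knowledge F1 over sets of knowledge snippets. This is our own illustration, not any paper's official implementation; the function names and inputs are assumptions.

```python
def fact_precision(supported: int, total: int) -> float:
    """FactScore / LEAF-style score: fraction of atomic facts (or sentences)
    in a response that are supported by the knowledge source."""
    return supported / total if total else 0.0


def knowledge_f1(human_knowledge: set[str], model_knowledge: set[str]) -> float:
    """Knowledge F1: overlap between the knowledge snippets a reference
    (human) answer relies on and those the model's answer relies on."""
    if not human_knowledge or not model_knowledge:
        return 0.0
    overlap = human_knowledge & model_knowledge
    if not overlap:
        return 0.0
    precision = len(overlap) / len(model_knowledge)
    recall = len(overlap) / len(human_knowledge)
    return 2 * precision * recall / (precision + recall)


# Toy example: 7 of 10 atomic facts were verified against Wikipedia.
print(fact_precision(supported=7, total=10))            # 0.7
print(knowledge_f1({"a", "b", "c"}, {"b", "c", "d"}))   # 0.666...
```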
Survey on Factuality in Large Language Models
Trustworthiness in Retrieval-Augmented Generation Systems: A Survey
SciTrust: Evaluating the Trustworthiness of Large Language Models for Science
WikiContradict: A Benchmark for Evaluating LLMs on Real-World Knowledge Conflicts from Wikipedia
Identifying Factual Inconsistencies in Summaries: Grounding LLM Inference via Task Taxonomy
Factual consistency evaluation of summarization in the Era of large language models
SIFiD: Reassess Summary Factual Inconsistency Detection with LLM
TofuEval: Evaluating Hallucinations of LLMs on Topic-Focused Dialogue Summarization
Factuality of Large Language Models: A Survey
FactPICO: Factuality Evaluation for Plain Language Summarization of Medical Evidence
TrustScore: Reference-Free Evaluation of LLM Response Trustworthiness
- LLMs do not know what they do not know; they sometimes overestimate their capabilities and confidently output information they do not actually know, leading to false responses.
(Check paper from Friday on multiple questionnaires, where it sometimes says it doesn't know the answer.)
- Studies assessing language models' factuality, or evaluating whether methods to mitigate model hallucinations are effective, use different datasets and metrics.
- This makes it difficult to compare the factuality of different models under the same conditions, and to compare the effectiveness of different factuality enhancement approaches.
- What do we want to measure?
- Which methods do we want to use?
- Which datasets are relevant?
We don't really care how factual the LLMs are; we want to know whether we can trust them. That means we're not trying to test which LLM is most likely to win at Jeopardy, but rather which one would lie or cheat at Jeopardy. At the same time, we do not want a metric that is biased so that an LLM that always answers "I don't know" gets a good score.
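One way to avoid a metric that rewards blanket abstention is to grade each question as correct, incorrect, or abstained, and report accuracy-given-attempted alongside overall accuracy (similar in spirit to how SimpleQA-style grading distinguishes correct, incorrect, and not-attempted answers). A minimal sketch; the labels and structure here are our own assumptions:

```python
from dataclasses import dataclass


@dataclass
class EvalSummary:
    accuracy: float            # correct / all questions
    attempted_accuracy: float  # correct / attempted (ignores abstentions)
    abstention_rate: float     # fraction of "I don't know" answers


def summarize(grades: list[str]) -> EvalSummary:
    """grades: one of 'correct', 'incorrect', 'abstain' per question.
    Reporting both overall accuracy and accuracy-given-attempted keeps a
    model from gaming the metric: abstaining on everything collapses
    overall accuracy, while guessing wildly collapses attempted accuracy."""
    n = len(grades)
    correct = grades.count("correct")
    attempted = n - grades.count("abstain")
    return EvalSummary(
        accuracy=correct / n if n else 0.0,
        attempted_accuracy=correct / attempted if attempted else 0.0,
        abstention_rate=(n - attempted) / n if n else 0.0,
    )
```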
Metric bias (I. Augenstein, T. Baldwin, M. Cha, T. Chakraborty, G. L. Ciampaglia, D. Corney, R. DiResta, E. Ferrara, S. Hale, A. Halevy et al., “Factuality challenges in the era of large language models and opportunities for factchecking,” Nature Machine Intelligence, vol. 6, no. 8, pp. 852–863, 2024.)
Dependence on high-quality annotated evidence datasets
Using LLMs themselves as evaluators (LLM-as-a-judge). Often nearly as good as human evaluators.
RAG
Human evaluation
Detecting hallucinations in language models is challenging. There are three general approaches:
- Measuring token-level probability distributions for indications that a model is "confused" (see the sketch after this list). Though sometimes effective, these methods rely on model internals being accessible, which is often not the case when working with hosted LLMs.
- Referencing external fact-verification systems, like a database or document store. These methods are great for RAG-style use-cases, but they are only effective if you have a useful dataset and the infrastructure to use it.
- Using LLM-as-a-judge techniques to assess whether or not a model hallucinated. These techniques are becoming standard in the LLM ecosystem, but as I’ll explain throughout this piece, using them effectively requires a deceptive amount of work.
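A minimal sketch of the first approach above, assuming you can obtain per-token log-probabilities from the model (with hosted LLMs this is often limited to the top-k logprobs of generated tokens, or unavailable); naming and any thresholding are our own:

```python
import math


def mean_token_entropy(token_logprob_dists: list[dict[str, float]]) -> float:
    """token_logprob_dists: for each generated token, a dict mapping
    candidate tokens to their log-probabilities (e.g. top-k logprobs).
    Returns the average Shannon entropy per token; higher values are a
    rough signal that the model is 'unsure' and may be hallucinating."""
    entropies = []
    for dist in token_logprob_dists:
        probs = [math.exp(lp) for lp in dist.values()]
        total = sum(probs)
        probs = [p / total for p in probs]  # renormalize the top-k tail
        entropies.append(-sum(p * math.log(p) for p in probs if p > 0))
    return sum(entropies) / len(entropies) if entropies else 0.0
```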
The problem with many LLM-as-a-Judge techniques is that they tend towards two polarities: they are either too simple, using a basic zero-shot approach, or they are wildly complex, involving multiple LLMs interacting via multi-turn reasoning.
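For reference, the "too simple" end of that spectrum fits in a few lines: a single zero-shot judge prompt. `call_llm` below is a placeholder for whatever client is in use; the prompt wording and labels are our own and would need tuning:

```python
from typing import Callable

JUDGE_PROMPT = """You are a strict fact-checking judge.
Question: {question}
Context (trusted source): {context}
Answer to evaluate: {answer}

Does the answer contain claims that are not supported by the context?
Reply with exactly one word: FAITHFUL or HALLUCINATED."""


def judge_answer(question: str, context: str, answer: str,
                 call_llm: Callable[[str], str]) -> bool:
    """Returns True if the judge flags the answer as hallucinated.
    call_llm is assumed to take a prompt string and return the model's
    text completion."""
    verdict = call_llm(JUDGE_PROMPT.format(
        question=question, context=context, answer=answer))
    return "HALLUCINATED" in verdict.strip().upper()
```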
- HotpotQA
- SimpleQA
- PersonQA (possibly not public)
- SQuAD
Hallucinations are a feature, not a bug. When is an LLM hallucinating, and when is it being creative?
- Entropy measurements (need output probability distribution)
- Is there a theoretical basis for testing factuality at all?
- Paper: not possible if the detector is trained only on correct samples (duh)
- SelfCheckGPT: Voting system
Check the variance in your model's output: is it highly stochastic/random, or does the model converge towards the same answer?
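A crude way to operationalize this (and the SelfCheckGPT intuition from the first table): sample the same prompt several times at non-zero temperature and measure pairwise agreement; low agreement is a hallucination warning sign. The token-overlap Jaccard similarity here is a deliberately simple stand-in for SelfCheckGPT's NLI/QA-based scoring, and the function names are our own:

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Token-overlap similarity between two answer strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def consistency_score(samples: list[str]) -> float:
    """Average pairwise similarity of N stochastic samples for the same
    prompt. Values near 1.0 mean the model converges on the same answer;
    low values mean high variance, which SelfCheckGPT treats as a
    hallucination signal."""
    pairs = list(combinations(samples, 2))
    if not pairs:
        return 1.0
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)
```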