Companion code and Jupyter notebooks for evaluating factuality, as outlined in this blog post: "Can you trust Llama 2 and ChatGPT to factually summarize?"
Please contact [email protected] if you have any questions.
Parts of the code included here were extracted from an experimental library called Hermetic. For simplicity, the code in this repo is self-contained and has no dependencies on that library.
To get started:
% pip install -r requirements.txt
You also need to set the OPENAI_API_KEY and AE_API_KEY environment variables.
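If you want to fail fast rather than hitting an API error mid-run, a minimal sketch for checking these at the top of a notebook (the variable names come from this README; how you set them, e.g. shell export or a .env file, is up to you):

```python
import os

# Verify the required API keys are present before running the notebooks.
# Raises immediately if either variable is missing or empty.
for key in ("OPENAI_API_KEY", "AE_API_KEY"):
    if not os.environ.get(key):
        raise RuntimeError(f"Environment variable {key} is not set")
```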
Once that's complete, you can run the notebook.
The val_sentence_pairs.json file is from TU Darmstadt:
@misc{falke2019correctness,
url = { https://tudatalib.ulb.tu-darmstadt.de/handle/tudatalib/2002 },
author = { Falke, Tobias and Ribeiro, Leonardo and Utama, Prasetya Ajie and Dagan, Ido and Gurevych, Iryna },
keywords = { 000 Computer science, information and general works },
publisher = { Technical University of Darmstadt },
year = { 2019 },
copyright = { Creative Commons Attribution Share-Alike 4.0 },
title = { Correctness of Generated Summaries }
}
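If you want to inspect the data outside the notebooks, a minimal sketch (assuming the file sits in the repo root; the record structure varies, so check the actual file for its schema before relying on specific keys):

```python
import json

# Load the sentence-pair data; the file is plain JSON.
with open("val_sentence_pairs.json") as f:
    pairs = json.load(f)

print(f"Loaded {len(pairs)} sentence pairs")
```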
In the resources folder we included three versions of the prompts, where we experimented with:
- Asking for the answer last (to give time for reasoning)
- Deliberately calling out the bias and asking the model to work around it.
These additional experiments did not have a major impact on the results; it seems that each version of these queries was in "tradeoff" territory, changing individual results but not the overall outcomes.
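For illustration, the "answer last" pattern looks roughly like the sketch below. This is a hypothetical paraphrase, not the exact text of the prompts in the resources folder:

```python
# Hypothetical paraphrase of the "answer last" prompt pattern; see the
# resources folder for the actual prompt versions used in the experiments.
ANSWER_LAST_PROMPT = """\
You will be shown a source document and a summary sentence.
First, explain step by step whether the summary sentence is supported
by the document. Only after your reasoning, give a final line:
ANSWER: consistent or ANSWER: inconsistent
"""
```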