[06-18-24] HPT is published! Check out the paper here.
Hierarchical Prompting Taxonomy (HPT) is a universal evaluation framework for large language models. It evaluates the performance of LLMs on a variety of tasks and datasets, assigning an HP-Score for each dataset relative to different models. HPT employs the Hierarchical Prompt Framework (HPF), which supports a wide range of tasks, including question-answering, reasoning, translation, and summarization, and provides a set of pre-defined prompting strategies tailored to each task based on its complexity. Refer to the paper at: https://arxiv.org/abs/2406.12644
- Universal Evaluation Framework: HPT is a universal evaluation framework that can support a wide range of datasets and LLMs.
- Hierarchical Prompt Framework: HPF is the set of prompting strategies employed by HPT, tailored to each task based on its complexity. HPF is available in two modes: manual and adaptive. Adaptive HPF selects the best prompting strategy for a given task adaptively using an LLM (the prompt-selector).
- HP-Score: HPT assigns an HP-Score for each dataset relative to different agents (including LLMs and humans). The HP-Score measures an agent's capability to perform the tasks in a dataset; a lower HP-Score indicates better performance on the dataset (see the illustrative sketch below).
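For intuition only, here is a minimal Python sketch of one way such a score could be aggregated. It assumes each sample is credited with the lowest prompt level (1 = simplest, 5 = most complex) at which the agent answers correctly, with an assumed penalty level for samples that no level solves; the exact scoring rules are defined in the paper and may differ from this simplification.

```python
from typing import Iterable, Optional

PENALTY_LEVEL = 10  # assumed penalty for samples unsolved at every prompt level


def hp_score(levels_solved_at: Iterable[Optional[int]]) -> float:
    """Average the prompt level (1-5) at which each sample was first solved.

    `None` marks a sample that no prompt level solved; it receives the penalty
    level. Lower scores mean the agent needed simpler prompts, i.e. it
    performed better on the dataset.
    """
    scores = [lvl if lvl is not None else PENALTY_LEVEL for lvl in levels_solved_at]
    return sum(scores) / len(scores)


# Example: two samples solved at levels 1 and 3, one never solved.
print(hp_score([1, 3, None]))  # (1 + 3 + 10) / 3 ≈ 4.67
```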
Refer to the examples directory for using the framework on different datasets and models.
To clone the repository, run the following command:
git clone https://github.com/devichand579/HPT.git
To get started on a Linux setup, follow these steps:
- Activate your conda environment:
  conda activate hpt
- Navigate to the main codebase:
  cd HPT/hierarchical_prompt
- Install the dependencies:
  pip install -r requirements.txt
- Add your Hugging Face token:
  - Create a .env file in the conda environment:
    HF_TOKEN = "your HF Token"
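The snippet below is a hypothetical illustration of how that token could be consumed at runtime, assuming python-dotenv and huggingface_hub; the repository's actual loading code may differ.

```python
import os

from dotenv import load_dotenv      # pip install python-dotenv
from huggingface_hub import login   # pip install huggingface_hub

load_dotenv()                        # reads HF_TOKEN from the .env file
login(token=os.environ["HF_TOKEN"])  # authenticates downloads of gated models
```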
To run both frameworks, use the following command structure:
  bash run.sh method model dataset [--thres num]
- method
  - man
  - auto
- model
  - llama3
  - phi3
  - gemma
  - mistral
- dataset
  - boolq
  - csqa
  - iwslt
  - samsum
- If the dataset is IWSLT or SamSum, add '--thres num'
- num
  - 0.15
  - 0.20
  - or higher thresholds beyond those used in our experiments
Example commands:
  bash run.sh man llama3 iwslt --thres 0.15
  bash run.sh auto phi3 boolq
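As a convenience sketch (not part of the repository), the following Python snippet sweeps every supported model and dataset through run.sh in manual mode, passing --thres only for the two datasets that require it; the 0.15 threshold is simply one of the values listed above.

```python
import subprocess

MODELS = ["llama3", "phi3", "gemma", "mistral"]
DATASETS = ["boolq", "csqa", "iwslt", "samsum"]
NEEDS_THRESHOLD = {"iwslt", "samsum"}  # translation and summarization datasets

for model in MODELS:
    for dataset in DATASETS:
        cmd = ["bash", "run.sh", "man", model, dataset]
        if dataset in NEEDS_THRESHOLD:
            cmd += ["--thres", "0.15"]
        subprocess.run(cmd, check=True)  # runs one model/dataset evaluation
```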
HPT currently supports the following datasets, models, and prompt engineering methods employed by HPF. You are welcome to add more.
- Question-answering datasets:
- BoolQ
- Reasoning datasets:
- CommonsenseQA
- Translation datasets:
- IWSLT-2017 en-fr
- Summarization datasets:
- SamSum
- Language models:
- Llama 3 8B
- Mistral 7B
- Phi 3 3.8B
- Gemma 7B
- Prompt engineering methods:
  - Role Prompting [1]
  - Zero-shot Chain-of-Thought Prompting [2]
  - Three-shot Chain-of-Thought Prompting [3]
  - Least-to-Most Prompting [4]
  - Generated Knowledge Prompting [5]
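To make the hierarchy concrete, here is a hypothetical Python sketch of how manual-mode HPF could escalate through these five strategies, trying the simplest prompt first and moving to a more complex one only when the response is not accepted. The helper names (build_prompt, query_model, is_correct) are placeholders, not the repository's API.

```python
# Illustrative sketch of manual HPF escalation; all names are placeholders.
STRATEGIES = [
    "role",                 # level 1: Role Prompting
    "zero_shot_cot",        # level 2: Zero-shot Chain-of-Thought Prompting
    "three_shot_cot",       # level 3: Three-shot Chain-of-Thought Prompting
    "least_to_most",        # level 4: Least-to-Most Prompting
    "generated_knowledge",  # level 5: Generated Knowledge Prompting
]


def solve_with_hierarchy(sample, build_prompt, query_model, is_correct):
    """Return the level (1-5) that first solves `sample`, or None if unsolved."""
    for level, strategy in enumerate(STRATEGIES, start=1):
        prompt = build_prompt(strategy, sample)
        answer = query_model(prompt)
        if is_correct(answer, sample):
            return level
    return None  # unsolved at every level; a penalty score would apply
```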
The benchmark results for different datasets and models are available in the leaderboard.
- Kong, A., Zhao, S., Chen, H., Li, Q., Qin, Y., Sun, R., & Zhou, X. (2023). Better Zero-Shot Reasoning with Role-Play Prompting. arXiv, abs/2308.07702.
- Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. arXiv, abs/2205.11916.
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Chi, E. H., Xia, F., Le, Q., & Zhou, D. (2022). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. arXiv, abs/2201.11903.
- Zhou, D., Schärli, N., Hou, L., Wei, J., Scales, N., Wang, X., Schuurmans, D., Bousquet, O., Le, Q., & Chi, E. H. (2022). Least-to-Most Prompting Enables Complex Reasoning in Large Language Models. arXiv, abs/2205.10625.
- Liu, J., Liu, A., Lu, X., Welleck, S., West, P., Le Bras, R., Choi, Y., & Hajishirzi, H. (2021). Generated Knowledge Prompting for Commonsense Reasoning. Annual Meeting of the Association for Computational Linguistics.
This project aims to build open-source evaluation frameworks for assessing LLMs and other agents. Contributions and suggestions are welcome; please see the details on how to contribute.
If you are new to GitHub, here is a detailed guide on getting involved with development on GitHub.
If you find our work useful, please cite us!
@misc{budagam2024hierarchical,
  title={Hierarchical Prompting Taxonomy: A Universal Evaluation Framework for Large Language Models},
  author={Devichand Budagam and Sankalp KJ and Ashutosh Kumar and Vinija Jain and Aman Chadha},
  year={2024},
  eprint={2406.12644},
  archivePrefix={arXiv},
  primaryClass={cs.CL}
}