GPT-4-ENEM

*** Most of the code in this repository has been adapted from Language Model Evaluation Harness. ***

Introduction

This repository contains the code and data used in the papers listed in the Citation section.

The most recent study presents a comprehensive framework for evaluating language models on entrance exams, incorporating both textual and visual elements. We evaluate the models on the three most recent editions of the Exame Nacional do Ensino Médio (ENEM), the main standardized entrance examination adopted by Brazilian universities.

One of the highlights is that text captions transcribing visual content outperform the direct use of images, suggesting that the vision model has room for improvement. Yet, despite improvements afforded by images or captions, mathematical questions remain a challenge for these state-of-the-art models.

Incorporating either textual or visual representations of the images yields noticeable improvements, with the difference nearing 10 points, particularly when captions are used.

Area	GPT-4o (without images)	GPT-4o (with captions)	GPT-4o (with CoT + captions)	Sabiá-3 (without images)	Sabiá-3 (with captions)	Sabiá-3 (with CoT + captions)
Languages and Codes	88.89	91.11	91.11	86.67	91.11	93.33
Human Sciences	100.00	100.00	100.00	100.00	100.00	100.00
Natural Sciences	68.18	84.09	93.18	72.73	81.82	86.36
Mathematics	60.00	66.67	91.11	60.00	75.56	82.22
Total	79.33	85.47	93.85	79.89	87.15	90.50

Table 1: Results of GPT-4o and Sabiá-3 on ENEM 2024, using 3-shot prompts.
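The gain from adding captions or chain-of-thought can be read off the table by subtraction. The following is a minimal sketch that does this for the Total row; the values are copied from the table above, and the dictionary layout and function name are illustrative only, not part of this repository.

```python
# Scores copied from the "Total" row of the results table above.
total = {
    "GPT-4o": {
        "without images": 79.33,
        "with captions": 85.47,
        "with CoT + captions": 93.85,
    },
    "Sabiá-3": {
        "without images": 79.89,
        "with captions": 87.15,
        "with CoT + captions": 90.50,
    },
}


def gains(scores):
    """Return the point gain of each setting over the 'without images' baseline."""
    base = scores["without images"]
    return {
        setting: round(value - base, 2)
        for setting, value in scores.items()
        if setting != "without images"
    }


for model, scores in total.items():
    print(model, gains(scores))
```

On the Total row, captions alone add 6.14 points for GPT-4o and 7.26 for Sabiá-3, while CoT with captions adds 14.52 and 10.61 points, respectively.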

Data

We have made available the ENEM 2022, ENEM 2023, and ENEM 2024 datasets, which encompass all multiple-choice questions from the last three editions. The datasets were created to allow the evaluation of both text-only and text-and-image language models. To evaluate text-only models, we incorporated into the datasets the textual descriptions of the images that appear in the question statements. These descriptions come from the orange ENEM exam booklet, a version of the exam that offers accessibility to people with visual impairments.

The datasets can also be accessed via the 🤗 Datasets library: https://huggingface.co/datasets/maritaca-ai/enem
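As a rough sketch, loading one edition with the 🤗 Datasets library might look like the code below. The repository ID comes from the URL above, but the configuration name ("2024") and the split name ("train") are assumptions; check the dataset card for the actual values. The `load_enem` helper and its `loader` parameter are hypothetical and exist only to keep the example easy to test offline.

```python
def load_enem(edition="2024", split="train", loader=None):
    """Load one ENEM edition from the Hugging Face Hub.

    `edition` and `split` are ASSUMED names; consult the dataset card.
    `loader` lets callers swap in a stand-in for `datasets.load_dataset`.
    """
    if loader is None:
        # Imported lazily so the module can be imported without `datasets`.
        from datasets import load_dataset  # pip install datasets
        loader = load_dataset
    return loader("maritaca-ai/enem", edition, split=split)


if __name__ == "__main__":
    dataset = load_enem("2024")  # requires network access
    print(dataset)
```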

The deprecated ENEM 2022 dataset can be found here.

Warning

We do not recommend using the deprecated dataset, since it does not include the image placeholders, image paths, and textual descriptions. Also, the tables are not well-formatted.

Tasks

We have implemented a set of 22 tasks, described below:

Task	ENEM edition	Experiment	CoT	Uses all questions
enem_2022_blind	ENEM 2022	without images	No	✔️
enem_cot_2022_blind	ENEM 2022	without images	Yes	✔️
enem_2022_images	ENEM 2022	with images	No	✔️
enem_cot_2022_images	ENEM 2022	with images	Yes	✔️
enem_2022_captions	ENEM 2022	with captions	No	✔️
enem_cot_2022_captions	ENEM 2022	with captions	Yes	✔️
enem_2023_blind	ENEM 2023	without images	No	✔️
enem_cot_2023_blind	ENEM 2023	without images	Yes	✔️
enem_2023_images	ENEM 2023	with images	No	✔️
enem_cot_2023_images	ENEM 2023	with images	Yes	✔️
enem_2023_captions	ENEM 2023	with captions	No	✔️
enem_cot_2023_captions	ENEM 2023	with captions	Yes	✔️
enem_2024_blind	ENEM 2024	without images	No	✔️
enem_cot_2024_blind	ENEM 2024	without images	Yes	✔️
enem_2024_images	ENEM 2024	with images	No	✔️
enem_cot_2024_images	ENEM 2024	with images	Yes	✔️
enem_2024_captions	ENEM 2024	with captions	No	✔️
enem_cot_2024_captions	ENEM 2024	with captions	Yes	✔️
enem	Enem Challenge (2009-2017)	-	No	❌
enem_cot	Enem Challenge (2009-2017)	-	Yes	❌
enem_2022_deprecated	ENEM 2022	-	No	❌
enem_cot_2022_deprecated	ENEM 2022	-	Yes	❌
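The task names in the table follow a regular pattern (edition, experiment, and an optional `cot` marker), so they can be generated rather than typed by hand. The sketch below reconstructs the 22 names from that pattern; the helper functions are illustrative and are not part of this repository.

```python
from itertools import product

EDITIONS = ("2022", "2023", "2024")
EXPERIMENTS = ("blind", "images", "captions")
# Tasks from the table that do not follow the edition/experiment pattern.
LEGACY_TASKS = ("enem", "enem_cot", "enem_2022_deprecated", "enem_cot_2022_deprecated")


def task_name(edition, experiment, cot=False):
    """Build a task name such as 'enem_cot_2024_captions'."""
    prefix = "enem_cot" if cot else "enem"
    return f"{prefix}_{edition}_{experiment}"


def all_tasks():
    """Return every task name listed in the table above."""
    names = [
        task_name(edition, experiment, cot)
        for edition, experiment in product(EDITIONS, EXPERIMENTS)
        for cot in (False, True)
    ]
    return names + list(LEGACY_TASKS)


# Comma-separated form, as expected by the --tasks argument of main.py.
print(",".join(task_name("2024", exp, cot=True) for exp in EXPERIMENTS))
```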

Reproducing the results

To reproduce the experiments described in the paper, please follow the steps below:

1. Clone the repository:

git clone https://github.com/piresramon/gpt-4-enem.git

2. Install the required packages:

pip install -e .

3. Set the API keys:

Retrieve your OpenAI API key from OpenAI and your MariTalk API key from MariTalk, then set them as environment variables:

OPENAI_API_SECRET_KEY=sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
MARITALK_API_SECRET_KEY=xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
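Before launching a run, it can help to confirm that the variables are visible to Python. The check below is a minimal sketch: the variable names are the ones shown above, while the `missing_keys` helper is hypothetical. Whether both keys are needed depends on which models you evaluate.

```python
import os

REQUIRED_KEYS = ("OPENAI_API_SECRET_KEY", "MARITALK_API_SECRET_KEY")


def missing_keys(env=None):
    """Return the names of the API key variables that are unset or empty."""
    env = os.environ if env is None else env
    return [key for key in REQUIRED_KEYS if not env.get(key)]


missing = missing_keys()
if missing:
    print("Missing environment variables:", ", ".join(missing))
else:
    print("All API keys are set.")
```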

4. Run the experiments:

To reproduce the results in Table 1, run the following commands:

# running 3-shot with CoT for Sabiá-3 on ENEM 2024
python main.py \
    --model maritalk \
    --model_args engine=sabia-3 \
    --tasks enem_cot_2024_blind,enem_cot_2024_captions \
    --description_dict_path description.json \
    --num_fewshot 3 \
    --conversation_template chatgpt

# running 3-shot with CoT for GPT-4o on ENEM 2024
python main.py \
    --model chatgpt \
    --model_args engine=gpt-4o \
    --tasks enem_cot_2024_blind,enem_cot_2024_images,enem_cot_2024_captions \
    --description_dict_path description.json \
    --num_fewshot 3 \
    --conversation_template chatgpt

To experiment with other Maritaca AI or OpenAI models, just change the engine. The enem_cot_*_images tasks are not supported by text-only models.

It is possible to use a different number of few-shot examples (maximum 3).
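When sweeping over several engines or few-shot settings, the commands above can be assembled programmatically. The sketch below only builds the argument list, using the flags that appear in the commands above; the `build_command` helper is hypothetical, and the lower bound of 0 few-shot examples is an assumption (the text only states a maximum of 3).

```python
import shlex

MAX_FEWSHOT = 3  # maximum stated above


def build_command(model, engine, tasks, num_fewshot=3):
    """Assemble a main.py invocation as an argument list."""
    # The lower bound of 0 is an assumption; only the maximum is documented.
    if not 0 <= num_fewshot <= MAX_FEWSHOT:
        raise ValueError(f"num_fewshot must be between 0 and {MAX_FEWSHOT}")
    return [
        "python", "main.py",
        "--model", model,
        "--model_args", f"engine={engine}",
        "--tasks", ",".join(tasks),
        "--description_dict_path", "description.json",
        "--num_fewshot", str(num_fewshot),
        "--conversation_template", "chatgpt",
    ]


cmd = build_command(
    "maritalk", "sabia-3", ["enem_cot_2024_blind", "enem_cot_2024_captions"]
)
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the repository root to start the run.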

Tip

You can experiment with any other model available in the 🤗 Transformers library. Just change the model and model_args parameters. You must also disable the conversation_template parameter.

Citation

If you use this code or data in your research, please cite the following papers:

@misc{pires2023evaluating,
      title={Evaluating GPT-4's Vision Capabilities on Brazilian University Admission Exams}, 
      author={Ramon Pires and Thales Sales Almeida and Hugo Abonizio and Rodrigo Nogueira},
      year={2023},
      eprint={2311.14169},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}

@misc{nunes2023evaluating,
      title={Evaluating GPT-3.5 and GPT-4 Models on Brazilian University Admission Exams}, 
      author={Desnes Nunes and Ricardo Primi and Ramon Pires and Roberto Lotufo and Rodrigo Nogueira},
      year={2023},
      eprint={2303.17003},
      archivePrefix={arXiv},
      primaryClass={cs.CL}
}
