Added two multimodal experiments: MathVista and ScienceQA #43

Merged 4 commits on Jul 8, 2024
53 changes: 53 additions & 0 deletions evaluation/results/solution_optimization_mm/README.md
@@ -0,0 +1,53 @@
# Results for Solution Optimization on Multimodal Tasks

We are pleased to release the results of our solution optimization experiments on multimodal tasks, including [MathVista](https://mathvista.github.io/) and [ScienceQA](https://scienceqa.github.io/). This release aims to ensure transparency and provide detailed insight into our optimization process.

The files in this repository contain the solution trajectory, final solution, and the answers for each question in the datasets. Each `.json` file is structured as a dictionary where the key is the question index, and the value is another dictionary with the following possible keys:

- **`question`**: The question text.
- **`answer`**: The ground truth answer.
- **`predictions`**: The trajectory of solutions throughout the optimization process.
- **`loss_history`**: The history of loss function values over different iterations of optimization.
- **`performance_history`**: The prediction score, where 1 indicates a correct answer and 0 indicates an incorrect answer.
- **`result_data`**: The intermediate results of predictions.
- **`ques_data`**: The metadata of the question, which is useful for further fine-grained analysis.
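
As a rough illustration of how these files might be inspected, the sketch below computes accuracy at the first and last recorded iteration. It assumes that `performance_history` is a list of 0/1 scores with one entry per iteration, which this README does not state explicitly; `load_results` and `summarize` are hypothetical helper names, not part of the repository. When majority voting is enabled, the last entry may not correspond to the reported final prediction.

```python
import json
from pathlib import Path


def load_results(path):
    """Load one result file into a dict keyed by question index."""
    with Path(path).open(encoding="utf-8") as f:
        return json.load(f)


def summarize(results):
    """Return accuracy at the first and last recorded iteration.

    ASSUMPTION: `performance_history` is a list of 0/1 scores,
    one per optimization iteration.
    """
    first_correct = 0
    last_correct = 0
    counted = 0
    for entry in results.values():
        history = entry.get("performance_history") or []
        if not history:
            continue  # every key is optional, so skip entries without scores
        counted += 1
        first_correct += history[0]
        last_correct += history[-1]
    if counted == 0:
        return {"counted": 0, "initial_accuracy": None, "final_accuracy": None}
    return {
        "counted": counted,
        "initial_accuracy": first_correct / counted,
        "final_accuracy": last_correct / counted,
    }
```

A typical call would be `summarize(load_results("mathvista_predictions.json"))`, run from the directory that holds the result files.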

## Notes
Solution optimization depends heavily on the test-time objective. For example, the objective can drive the model to explore more, which is the approach we took. For this reason, we used majority voting to obtain the final prediction. There are many interesting open questions around identifying good test-time training strategies!
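
For readers unfamiliar with majority voting, here is a minimal generic sketch. It is not the implementation behind the `--majority_voting` flag, which this README does not show; it assumes the candidate answers have already been extracted and normalized into comparable strings, and `majority_vote` is a hypothetical helper name.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer among the candidates.

    ASSUMPTION: answers are already normalized (e.g. "B" vs "b" resolved).
    Ties go to the answer that appeared first, because Counter.most_common
    keeps first-encountered order for equal counts (Python 3.7+).
    """
    if not answers:
        raise ValueError("no answers to vote on")
    counts = Counter(answers)
    return counts.most_common(1)[0][0]
```

For instance, `majority_vote(["B", "A", "B", "C"])` selects `"B"`. Voting of this kind suits an exploratory objective because a single off-track iteration is outvoted by the others.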

## Experiment on MathVista

[MathVista](https://mathvista.github.io/) is a comprehensive benchmark for mathematical reasoning within visual contexts. It emphasizes the diversity and complexity of visual perception and mathematical reasoning challenges.

To conduct an experiment on MathVista, use the following example command:

```sh
cd evaluation
python solution_optimization_mm.py --task mathvista \
--engine=gpt-4o \
--eval_engine=gpt-4o \
--max_iterations 4 \
--num_threads 10 \
--majority_voting
```

The result file is `mathvista_predictions.json`.

## Experiment on ScienceQA

[ScienceQA](https://scienceqa.github.io/) (Science Question Answering) is a multimodal benchmark consisting of multiple-choice questions covering a diverse set of science topics. It challenges participants to understand scientific images, retrieve relevant knowledge, and provide accurate reasoning for high-school-level scientific questions.

To conduct an experiment on ScienceQA, use the following example command:

```sh
cd evaluation
python solution_optimization_mm.py --task scienceqa \
--engine=gpt-4o \
--eval_engine=gpt-4o \
--max_iterations 8 \
--num_threads 20
```

The result file is `scienceqa_predictions.json`.
