ProactiveBench is a benchmark for evaluating proactive agents. It includes a dataset, a reward model, and evaluation scripts.
Our test set currently contains 227 events across three categories: coding, writing, and daily life.
The reward model is trained on the dataset and reaches an F1 score of 0.918 on the test set.
We provide all scripts to evaluate the performance of the proactive agent and the reward model.
The reward model is used to evaluate the performance of the Proactive Agent. You can download the reward model from here (Coming soon) and host it with frameworks like vLLM to provide an OpenAI-style API.
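For illustration, a request to an OpenAI-style chat completion endpoint has a standard payload shape. The prompt wording, model name, and field contents below are placeholders, not the benchmark's actual scoring prompt:

```python
def build_request(event: str, agent_response: str) -> dict:
    """Build an OpenAI-style chat completion payload asking a reward
    model to judge an agent response (hypothetical prompt format)."""
    return {
        # Placeholder model name; use the name your vLLM server reports.
        "model": "reward-model",
        "messages": [
            {
                "role": "user",
                "content": (
                    f"Event:\n{event}\n\n"
                    f"Agent response:\n{agent_response}\n\n"
                    "Is this proactive response acceptable?"
                ),
            }
        ],
    }

payload = build_request("User opens a new Python file.",
                        "Offer to scaffold a module template.")
```

Such a payload can then be POSTed to the server's `/v1/chat/completions` route (e.g. with the `openai` Python client pointed at your vLLM address).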
After that, edit `eval/reward_model_scoring.py` to set the address of your model, then run:

```shell
python eval/reward_model_scoring.py
```

When the run finishes, you will get the final score for your reward model.
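The reward model's score is an F1 over its agreement with the labeled judgments. As a reminder of how F1 is computed, here is a minimal sketch; the confusion counts are made up for illustration (chosen so the result lands near the reported 0.918), not taken from the benchmark:

```python
def f1_score(tp: int, fp: int, fn: int) -> float:
    """Standard F1: harmonic mean of precision and recall."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

# Made-up counts for illustration only.
print(round(f1_score(tp=90, fp=10, fn=6), 3))  # → 0.918
```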
To check your model's performance, edit `./eval/script.py` to load your model (or use the SDK), then run:

```shell
python eval/script.py
```

The test data will be sent to the model, and all traces with the agent's responses will be saved under the `./eval/traces_new` folder.
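If you want to inspect the saved traces programmatically, a generic loader might look like the sketch below. The on-disk format is not documented here, so JSON files are assumed; adjust the glob pattern and parser if `eval/script.py` writes something else:

```python
import json
from pathlib import Path

def load_traces(trace_dir: str) -> list:
    """Load every trace file in the folder.

    Assumes one JSON document per file; change the pattern/parser
    to match the actual output format of eval/script.py.
    """
    return [json.loads(p.read_text())
            for p in sorted(Path(trace_dir).glob("*.json"))]
```

For example, `load_traces("./eval/traces_new")` would return a list with one entry per saved trace.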
After the process, run:

```shell
# Set your reward model's address in judge_agent_prediction.py before running this script.
sh eval/judge_result.sh
```

This has the reward model evaluate whether each response from the agent is acceptable. The results will be saved under the `./eval/judged` folder.
Once the responses have been judged by the reward model, run

```shell
sh calculate.sh
```

to get the final score for your model.