What's Changed
- Bump version to 0.2.4 and remove unused dependencies by @pufanyi in #292
- Load package for NExT-QA evaluation by @zhijian-liu in #295
- Fix MMMU-Pro evaluation by @zhijian-liu in #296
- [Feat] LiveBench 2409 by @pufanyi in #304
- [Doc] add more detailed task guide to explain the variables in yaml configuration file by @Luodian in #303
- [fix] Invalid group in mmsearch.yaml by @skyil7 in #305
- [Fix] Fix cache_dir issue where MVBench cannot be found by @yinanhe in #306
- [Fix] LiveBench 2409 by @pufanyi in #308
- [Fix] A small fix for the LiveBench checker by @pufanyi in #310
- [Fix] Change "Basic Understanding" to "Concrete Recognition" by @pufanyi in #311
- [Feat] LLaMA-3.2-Vision by @kcz358 in #314
- [Fix] Fix extra calling in qwen_vl_api, use tempfile for tmp by @kcz358 in #312
- Fix LMMS_EVAL_PLUGINS by @zhijian-liu in #297
- [feat] changes on llava_vid model by @ZhangYuanhan-AI in #291
- Update video_decode_backend to "decord" by @ZhangYuanhan-AI in #318
- Update the prompt to be consistent with the current LiveBench design by @pufanyi in #319
- Add AI2D evaluation without masks by @zhijian-liu in #325
- add vinoground by @HanSolo9682 in #326
- Update evaluator.py to load datasets first before loading models by @LooperXX in #327
- Update llava_onevision.py to avoid errors on evaluation benchmarks with both single- and multi-image samples by @LooperXX in #338
- Upload Tasks: CinePile by @JARVVVIS in #343
- [Update] Allow pass in max pixels and num frames in qwen2vl by @kcz358 in #346
- funqa update by @Nicous20 in #341
- Update Vinoground to make evaluation consistent with paper by @HanSolo9682 in #354
- Update mmmu_pro_standard.yaml by @zhijian-liu in #353
- Upload tasks: MovieChat-1K, VDC by @Espere-1119-Song in #342
- [Feat] Add AuroraCap, MovieChat, LLaVA-OneVision-MovieChat by @Espere-1119-Song in #358
- update docs for VDC and MovieChat by @rese1f in #359
- [WIP] feat: update to use azure api by @Luodian in #340
- Update MLVU answer parsing by @Xiuyu-Li in #364
- Add task docs for Vinoground by @HanSolo9682 in #372
- [Add Dataset] NaturalBench(NeurIPS24) by @Baiqi-Li in #371
- Update README.md by @kcz358 in #377
- fix model_specific_prompt_kwargs of VDC and MovieChat-1K by @Espere-1119-Song in #382
- Add os import to mathverse_evals.py by @spacecraft1013 in #381
- [Fix] Fix hallu bench by @kcz358 in #392
- Fix "percetion" typo (issue #396) by @Qu3tzal in #397
- Add TemporalBench by @mu-cai in #402
- [Tiny Fix] fix dataset_kwargs in lmms_eval/api/task.py by @Li-Qingyun in #404
- Add model aria & fix on LongVideoBench by @teowu in #391
- [update] NaturalBench to README by @Baiqi-Li in #406
- add model Slime and Benchmark mme_realworld_lite by @yfzhang114 in #409
- Update VDC with SGLang by @Espere-1119-Song in #411
- Add video processing logic for idefics2 by @kcz358 in #418
- update the introduction of mme-realworld by @yfzhang114 in #416
- [Task] add MIA-Bench by @Luodian in #419
- Modify typos in run_example.md by @Espere-1119-Song in #422
- [Release] lmms-eval v0.3.0 release by @kcz358 in #428
- PyPI 0.3.0 by @pufanyi in #432
New Contributors
- @ZhangYuanhan-AI made their first contribution in #291
- @HanSolo9682 made their first contribution in #326
- @LooperXX made their first contribution in #327
- @JARVVVIS made their first contribution in #343
- @Nicous20 made their first contribution in #341
- @Espere-1119-Song made their first contribution in #342
- @rese1f made their first contribution in #359
- @Xiuyu-Li made their first contribution in #364
- @Baiqi-Li made their first contribution in #371
- @spacecraft1013 made their first contribution in #381
- @Qu3tzal made their first contribution in #397
- @mu-cai made their first contribution in #402
- @Li-Qingyun made their first contribution in #404
Full Changelog: v0.2.4...v0.3.0