Releases · EvolvingLMMs-Lab/lmms-eval
v0.3.0
What's Changed
- Bump version to 0.2.4 and remove unused dependencies by @pufanyi in #292
- Load package for NExT-QA evaluation by @zhijian-liu in #295
- Fix MMMU-Pro evaluation by @zhijian-liu in #296
- [Feat] LiveBench 2409 by @pufanyi in #304
- [Doc] add a more detailed task guide explaining the variables in the YAML configuration file by @Luodian in #303 (see the config sketch after this list)
- [fix] Invalid group in mmsearch.yaml by @skyil7 in #305
- [Fix] Fix cache_dir issue where MVBench cannot be found by @yinanhe in #306
- [Fix] LiveBench 2409 by @pufanyi in #308
- [Fix] A small fix for the `LiveBench` checker by @pufanyi in #310
- [Fix] Change "Basic Understanding" to "Concrete Recognition" by @pufanyi in #311
- [Feat] LLaMA-3.2-Vision by @kcz358 in #314
- [Fix] Fix extra calling in qwen_vl_api, use tempfile for tmp by @kcz358 in #312
- Fix `LMMS_EVAL_PLUGINS` by @zhijian-liu in #297
- [feat] changes on llava_vid model by @ZhangYuanhan-AI in #291
- Update video_decode_backend to "decord" by @ZhangYuanhan-AI in #318 (see the frame-sampling sketch at the end of this release entry)
- Update the prompt to be consistent with the current `LiveBench` design by @pufanyi in #319
- Add AI2D evaluation without masks by @zhijian-liu in #325
- add vinoground by @HanSolo9682 in #326
- Update evaluator.py to load datasets first before loading models by @LooperXX in #327
- Update llava_onevision.py to avoid errors on evaluation benchmarks with both single- and multi-image samples by @LooperXX in #338
- Upload Tasks: CinePile by @JARVVVIS in #343
- [Update] Allow pass in max pixels and num frames in qwen2vl by @kcz358 in #346
- funqa update by @Nicous20 in #341
- Update Vinoground to make evaluation consistent with paper by @HanSolo9682 in #354
- Update mmmu_pro_standard.yaml by @zhijian-liu in #353
- Upload tasks: MovieChat-1K, VDC by @Espere-1119-Song in #342
- [Feat] Add AuroraCap, MovieChat, LLaVA-OneVision-MovieChat by @Espere-1119-Song in #358
- update docs for VDC and MovieChat by @rese1f in #359
- [WIP] feat: update to use azure api by @Luodian in #340
- Update MLVU answer parsing by @Xiuyu-Li in #364
- Add task docs for Vinoground by @HanSolo9682 in #372
- [Add Dataset] NaturalBench(NeurIPS24) by @Baiqi-Li in #371
- Update README.md by @kcz358 in #377
- fix model_specific_prompt_kwargs of VDC and MovieChat-1K by @Espere-1119-Song in #382
- Add os import to mathverse_evals.py by @spacecraft1013 in #381
- [Fix] Fix hallu bench by @kcz358 in #392
- Fix "percetion" typo (issue #396) by @Qu3tzal in #397
- Add TemporalBench by @mu-cai in #402
- [Tiny Fix] fix dataset_kwargs in lmms_eval/api/task.py by @Li-Qingyun in #404
- Add model aria & fix on LongVideoBench by @teowu in #391
- [update] NaturalBench to README by @Baiqi-Li in #406
- add model Slime and Benchmark mme_realworld_lite by @yfzhang114 in #409
- Update VDC with SGLang by @Espere-1119-Song in #411
- Add video processing logic for idefics2 by @kcz358 in #418
- update the introduction of mme-realworld by @yfzhang114 in #416
- [Task] add MIA-Bench by @Luodian in #419
- Modify typos in run_example.md by @Espere-1119-Song in #422
- [Release] lmms-eval v0.3.0 release by @kcz358 in #428
- PyPI 0.3.0 by @pufanyi in #432
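The task guide from #303 documents the per-task YAML files. Below is a hedged sketch of what such a config typically contains, using a permissive loader for the `!function` tag these files use to point at Python callables; the dataset name and exact field set are illustrative, not copied from the repo.

```python
import yaml

# Register a constructor so the custom `!function` tag (used by lmms-eval
# task configs to reference Python callables) loads as a plain string here.
yaml.SafeLoader.add_constructor(
    "!function", lambda loader, node: loader.construct_scalar(node)
)

# Hypothetical task config illustrating the variables the guide explains;
# field names follow the lm-eval-harness style that lmms-eval builds on.
TASK_YAML = """
dataset_path: lmms-lab/my_dataset   # hypothetical HF dataset repo
task: my_task
test_split: test
output_type: generate_until
doc_to_visual: !function utils.my_doc_to_visual
doc_to_text: !function utils.my_doc_to_text
doc_to_target: answer
generation_kwargs:
  max_new_tokens: 16
metric_list:
  - metric: exact_match
"""

config = yaml.safe_load(TASK_YAML)
print(config["task"], config["metric_list"])
```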
New Contributors
- @ZhangYuanhan-AI made their first contribution in #291
- @HanSolo9682 made their first contribution in #326
- @LooperXX made their first contribution in #327
- @JARVVVIS made their first contribution in #343
- @Nicous20 made their first contribution in #341
- @Espere-1119-Song made their first contribution in #342
- @rese1f made their first contribution in #359
- @Xiuyu-Li made their first contribution in #364
- @Baiqi-Li made their first contribution in #371
- @spacecraft1013 made their first contribution in #381
- @Qu3tzal made their first contribution in #397
- @mu-cai made their first contribution in #402
- @Li-Qingyun made their first contribution in #404
Full Changelog: v0.2.4...v0.3.0
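As referenced from #318 above, llava_vid's `video_decode_backend` now defaults to "decord". The snippet below is a minimal sketch of uniform frame sampling with the `decord` package, the kind of decode step such a backend performs; the function name and frame count are illustrative.

```python
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path: str, num_frames: int = 8) -> np.ndarray:
    """Uniformly sample `num_frames` RGB frames from a video via decord."""
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return vr.get_batch(indices).asnumpy()  # shape: (num_frames, H, W, 3)

# frames = sample_frames("clip.mp4")  # then hand frames to the model's processor
```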
v0.2.4 add `generate_until_multi_round` to support interactive and multi-round evaluations; add models and fix glitches
What's Changed
- [Fix] Fix bugs in returning result dict and bring back anls metric by @kcz358 in #221
- fix: fix wrong args in wandb logger by @Luodian in #226
- [feat] Add check for existence of accelerator before waiting by @Luodian in #227
- add more language tasks and fix fewshot evaluation bugs by @Luodian in #228
- Remove unnecessary LM object removal in evaluator by @Luodian in #229
- [fix] Shallow copy issue by @pufanyi in #231
- [Minor] Fix max_new_tokens in video llava by @kcz358 in #237
- Update LMMS evaluation tasks for various subjects by @Luodian in #240
- [Fix] Fix async append result in different order issue by @kcz358 in #244
- Update the version requirement for `transformers` by @zhijian-liu in #235
- Add new LMMS evaluation task for wild vision benchmark by @Luodian in #247
- Add raw score to wildvision bench by @Luodian in #250
- [Fix] Strict video to be single processing by @kcz358 in #246
- Refactor wild_vision_aggregation_raw_scores to calculate average score by @Luodian in #252
- [Fix] Bring back process result pbar by @kcz358 in #251
- [Minor] Update utils.py by @YangYangGirl in #249
- Refactor distributed gathering of logged samples and metrics by @Luodian in #253
- Refactor caching module and fix serialization issue by @Luodian in #255
- [Minor] Bring back fix for metadata by @kcz358 in #258
- [Model] support minimonkey model by @white2018 in #257
- [Feat] add regression test and change saving logic related to `output_path` by @Luodian in #259
- [Feat] Add support for llava_hf video, better loading logic for llava_hf ckpt by @kcz358 in #260
- [Model] support cogvlm2 model by @white2018 in #261
- [Docs] Update and sort current_tasks.md by @pbcong in #262
- fix error name with infovqa task by @ZhaoyangLi-nju in #265
- [Task] Add MMT and MMT_MI (Multiple Image) Task by @ngquangtrung57 in #270
- mme-realworld by @yfzhang114 in #266
- [Model] support Qwen2 VL by @abzb1 in #268
- Support new task mmworld by @jkooy in #269
- Update current tasks.md by @pbcong in #272
- [feat] support video evaluation for qwen2-vl and add mix-evals-video2text by @Luodian in #275
- [Feat][Task] Add multi-round evaluation in llava-onevision; Add MMSearch Benchmark by @CaraJ7 in #277
- [Fix] Model name None in Task manager, mix eval model specific kwargs, claude retrying fix by @kcz358 in #278
- [Feat] Add support for evaluation of Oryx models by @dongyh20 in #276
- [Fix] Fix the error when running models caused by `generate_until_multi_round` by @pufanyi in #281 (see the sketch after this list)
- [fix] Refactor GeminiAPI class to add video pooling and freeing by @pufanyi in #287
- add jmmmu by @AtsuMiyai in #286
- [Feat] Add support for evaluation of InternVideo2-Chat && Fix evaluation for mvbench by @yinanhe in #280
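The headline addition, `generate_until_multi_round`, sits alongside the existing single-round generation hook. The sketch below only illustrates the idea; the class, request shape, and signatures are assumptions, not the actual lmms-eval model API.

```python
from typing import List, Tuple

class ToyModel:
    """Illustrative model exposing single- and multi-round generation hooks."""

    def generate_until(self, prompts: List[str]) -> List[str]:
        # Single round: one prompt in, one completion out.
        return [f"answer({p})" for p in prompts]

    def generate_until_multi_round(
        self, requests: List[Tuple[str, List[str]]]
    ) -> List[str]:
        # Multi-round: each request carries the conversation so far, so a task
        # can feed earlier answers back in as context (interactive evaluation,
        # e.g. the MMSearch flow added in #277).
        return [
            f"answer({' | '.join(history + [prompt])})"
            for prompt, history in requests
        ]
```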
New Contributors
- @YangYangGirl made their first contribution in #249
- @white2018 made their first contribution in #257
- @pbcong made their first contribution in #262
- @ZhaoyangLi-nju made their first contribution in #265
- @ngquangtrung57 made their first contribution in #270
- @yfzhang114 made their first contribution in #266
- @jkooy made their first contribution in #269
- @dongyh20 made their first contribution in #276
- @yinanhe made their first contribution in #280
Full Changelog: v0.2.3...v0.2.4
v0.2.3.post1
What's Changed
- [Fix] Fix bugs in returning result dict and bring back anls metric by @kcz358 in #221
- fix: fix wrong args in wandb logger by @Luodian in #226
Full Changelog: v0.2.3...v0.2.3.post1
v0.2.3 add language evaluations and remove registration to speed up loading tasks and models
What's Changed
- Update the blog link by @pufanyi in #196
- Bring back PR#52 by @kcz358 in #198
- fix: update from previous model_specific_prompt to current lmms_eval_kwargs to avoid warnings by @Luodian in #206
- [Feat] SGLang SRT commands in one go, async input for openai server by @kcz358 in #212
- [Minor] Add kill sglang process by @kcz358 in #213
- Support text-only inference for LLaVA-OneVision by @CaraJ7 in #215
- Fix `videomme` evaluation by @zhijian-liu in #209
- [feat] remove registration logic and add language evaluation tasks by @Luodian in #218 (see the lazy-loading sketch below)
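#218 drops the import-time registration pass so task and model loading gets faster. The pattern below is an illustrative sketch of that kind of change (map contents and helper are hypothetical): defer each import until a name is actually requested, instead of importing everything up front to fire registration decorators.

```python
import importlib

# Hypothetical name -> (module path, class name) map; real entries differ.
AVAILABLE_MODELS = {
    "llava": ("lmms_eval.models.llava", "Llava"),
    "qwen_vl": ("lmms_eval.models.qwen_vl", "Qwen_VL"),
}

def get_model(name: str):
    # Import on first use; nothing is loaded just to populate a registry.
    module_path, class_name = AVAILABLE_MODELS[name]
    return getattr(importlib.import_module(module_path), class_name)
```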
New Contributors
- @zhijian-liu made their first contribution in #209
Full Changelog: v0.2.2...v0.2.3
v0.2.2: add llava-onevision/mantis/llava-interleave/VILA and new tasks.
What's Changed
- Include VCR by @tianyu-z in #105
- [Small Update] Update the version of LMMs-Eval by @pufanyi in #109
- add II-Bench by @XinrunDu in #111
- Q-Bench, Q-Bench2, A-Bench by @teowu in #113
- LongVideoBench for LMMs-Eval by @teowu in #117
- Fix the potential risk by PR #117 by @teowu in #118
- add tinyllava by @zjysteven in #114
- Add docs for datasets upload to HF by @pufanyi in #120
- [Model] aligned llava-interleave model results on video tasks by @Luodian in #125
- External package integration using plugins by @lorenzomammana in #126 (see the usage sketch after this list)
- Add task VITATECS by @lscpku in #130
- add task gqa-ru by @Dannoopsy in #128
- add task MMBench-ru by @Dannoopsy in #129
- Add wild vision bench by @kcz358 in #133
- Add detailcaps by @Dousia in #136
- add MLVU task by @shuyansy in #137
- add process sync in evaluation metric computation via a temp file in lmms_eval/evaluator.py by @Dousia in #143
- [Sync Features] add vila, add wildvision, add vibe-eval, add interleave bench by @Luodian in #138
- Add muirbench by @kcz358 in #147
- Add a new benchmark: MIRB by @ys-zong in #150
- Add LMMs-Lite by @kcz358 in #148
- [Docs] Fix broken hyperlink in README.md by @abzb1 in #149
- Changes in llava_hf.py. Corrected the response split by role and added the ability to specify an EOS token by @Dannoopsy in #153
- Add default values for mm_resampler_location and mm_newline_position to make sure the Llavavid model can run successfully by @choiszt in #156
- Update README.md by @kcz358 in #159
- revise llava_vid.py by @Luodian in #164
- Add MMStar by @skyil7 in #158
- Add model Mantis to the LMMs-Eval supported model list by @baichuanzhou in #162
- Fix utils.py by @abzb1 in #165
- Add default prompt for seedbench_2.yaml by @skyil7 in #167
- Fix a small typo for live_bench by @pufanyi in #169
- [New Model] Adding Cambrian Model by @Nyandwi in #171
- Revert "[New Model] Adding Cambrian Model" by @Luodian in #178
- Fixed some issues in InternVL family and ScienceQA task. by @skyil7 in #174
- [Add Dataset] SEEDBench 2 Plus by @abzb1 in #180
- [New Updates] LLaVA OneVision Release; MVBench, InternVL2, IXC2.5 Interleave-Bench integration. by @Luodian in #182
- New pypi by @pufanyi in #184
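Plugin integration from #126 lets an external package contribute tasks without modifying the lmms-eval tree; the `LMMS_EVAL_PLUGINS` environment variable (later fixed in #297) names the packages to scan. A hedged usage sketch, assuming a hypothetical installed package `my_lmms_plugin` that ships extra task definitions:

```python
import os
import subprocess

# Hypothetical plugin package layout:
#   my_lmms_plugin/
#     tasks/
#       my_task/
#         my_task.yaml
#         utils.py
# LMMS_EVAL_PLUGINS is read at startup; the exact discovery mechanics are
# assumptions here, not a verified description of the implementation.
env = dict(os.environ, LMMS_EVAL_PLUGINS="my_lmms_plugin")
subprocess.run(
    ["python", "-m", "lmms_eval",
     "--model", "llava",
     "--tasks", "my_task",
     "--batch_size", "1"],
    env=env,
    check=True,
)
```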
New Contributors
- @tianyu-z made their first contribution in #105
- @XinrunDu made their first contribution in #111
- @teowu made their first contribution in #113
- @zjysteven made their first contribution in #114
- @lorenzomammana made their first contribution in #126
- @lscpku made their first contribution in #130
- @Dannoopsy made their first contribution in #128
- @Dousia made their first contribution in #136
- @shuyansy made their first contribution in #137
- @ys-zong made their first contribution in #150
- @abzb1 made their first contribution in #149
- @choiszt made their first contribution in #156
- @skyil7 made their first contribution in #158
- @baichuanzhou made their first contribution in #162
- @Nyandwi made their first contribution in #171
Full Changelog: v0.2.0...v0.2.2
v0.2.0.post1
What's Changed
- Include VCR by @tianyu-z in #105
- [Small Update] Update the version of LMMs-Eval by @pufanyi in #109
- add II-Bench by @XinrunDu in #111
- Q-Bench, Q-Bench2, A-Bench by @teowu in #113
- LongVideoBench for LMMs-Eval by @teowu in #117
- Fix the potential risk by PR #117 by @teowu in #118
- add tinyllava by @zjysteven in #114
- Add docs for datasets upload to HF by @pufanyi in #120
- [Model] aligned llava-interleave model results on video tasks by @Luodian in #125
New Contributors
- @tianyu-z made their first contribution in #105
- @XinrunDu made their first contribution in #111
- @teowu made their first contribution in #113
- @zjysteven made their first contribution in #114
Full Changelog: v0.2.0...v0.2.0.post1
v0.2.0
What's Changed
- pip package by @pufanyi in #1
- Fix mmbench dataset submission format by @pufanyi in #7
- [Feat] add correct tensor parallelism for larger size model. by @Luodian in #4
- update version to 0.1.1 by @pufanyi in #9
- [Tasks] Fix MMBench by @pufanyi in #13
- [Fix] Fix llava reproduce error by @kcz358 in #24
- add_ocrbench by @echo840 in #28
- Joshua/olympiadbench by @JvThunder in #37
- [WIP] adding mmbench dev evaluation (#75) by @Luodian in #46
- Add `llava` model for 🤗 Transformers by @lewtun in #47
- Fix types to allow nullables in `llava_hf.py` by @lewtun in #55
- Add REC tasks for testing model ability to locally ground objects, given a description. This adds REC for all RefCOCO datasets. by @hunterheiden in #52
- [Benchmarks] RealWorldQA by @pufanyi in #57
- add Llava-SGlang by @jzhang38 in #54
- Add MathVerse by @CaraJ7 in #60
- Fix typo in Qwen-VL that was causing "reference before assignment" by @tupini07 in #61
- New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens by @hunterheiden in #63
- [New Task] WebSRC (multimodal Q&A on web screenshots) by @hunterheiden in #69
- Bugfix: WebSRC should be token-level F1 NOT character-level by @hunterheiden in #70 (see the metric sketch after this list)
- Multilingual LLava bench by @gagan3012 in #56
- [Fix] repr llava doc by @cocoshe in #36
- add idefics2 by @jzhang38 in #59
- [Feat] Add qwen vl api by @kcz358 in #73
- Adding microsoft/Phi-3-vision-128k-instruct model. by @vfragoso in #87
- Add MathVerse in README.md by @CaraJ7 in #97
- add MM-UPD by @AtsuMiyai in #95
- add Conbench by @Gumpest in #100
- Update conbench in README by @Gumpest in #101
- update gpt-3.5-turbo version by @AtsuMiyai in #107
- [Upgrade to v0.2] Embracing Video Evaluations with LMMs-Eval by @Luodian in #108
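#70 above switches WebSRC scoring from character-level to token-level F1. For reference, a standard SQuAD-style token-level F1 (a generic sketch, not necessarily the repo's exact implementation):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: overlap counted over whitespace-delimited tokens."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Character-level matching would give partial credit to near-miss strings;
# token-level does not: token_f1("the red car", "a red car") == 2/3.
```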
New Contributors
- @pufanyi made their first contribution in #1
- @Luodian made their first contribution in #4
- @kcz358 made their first contribution in #24
- @echo840 made their first contribution in #28
- @JvThunder made their first contribution in #37
- @lewtun made their first contribution in #47
- @hunterheiden made their first contribution in #52
- @jzhang38 made their first contribution in #54
- @CaraJ7 made their first contribution in #60
- @tupini07 made their first contribution in #61
- @gagan3012 made their first contribution in #56
- @cocoshe made their first contribution in #36
- @vfragoso made their first contribution in #87
- @AtsuMiyai made their first contribution in #95
- @Gumpest made their first contribution in #100
Full Changelog: v0.1.0...v0.2.0
LMMs-Eval 0.1.0.dev
[Enhancement & Fix] Add tensor parallelism and fix LLaVA-W/MMBench issues.
LMMs-Eval 0.1.0 Release
Currently supports 40+ evaluation datasets with 60+ subsets/variants, and 5 commonly used LMMs.