Releases · EvolvingLMMs-Lab/lmms-eval
v0.3.0
What's Changed
- Bump version to 0.2.4 and remove unused dependencies by @pufanyi in #292
- Load package for NExT-QA evaluation by @zhijian-liu in #295
- Fix MMMU-Pro evaluation by @zhijian-liu in #296
- [Feat] LiveBench 2409 by @pufanyi in #304
- [Doc] add a more detailed task guide explaining the variables in the YAML configuration file by @Luodian in #303 (see the config sketch after this list)
- [fix] Invalid group in mmsearch.yaml by @skyil7 in #305
- [Fix] Fix cache_dir issue where MVBench cannot be found by @yinanhe in #306
- [Fix] LiveBench 2409 by @pufanyi in #308
- [Fix] A small fix for the `LiveBench` checker by @pufanyi in #310
- [Fix] Change "Basic Understanding" to "Concrete Recognition" by @pufanyi in #311
- [Feat] LLaMA-3.2-Vision by @kcz358 in #314
- [Fix] Fix extra calling in qwen_vl_api, use tempfile for tmp by @kcz358 in #312
- Fix `LMMS_EVAL_PLUGINS` by @zhijian-liu in #297
- [feat] changes on llava_vid model by @ZhangYuanhan-AI in #291
- Update video_decode_backend to "decord" by @ZhangYuanhan-AI in #318 (see the frame-sampling sketch at the end of this release entry)
- Update the prompt to be consistent with the current `LiveBench` design by @pufanyi in #319
- Add AI2D evaluation without masks by @zhijian-liu in #325
- add vinoground by @HanSolo9682 in #326
- Update evaluator.py to load datasets first before loading models by @LooperXX in #327
- Update llava_onevision.py to avoid errors on evaluation benchmarks with both single- and multi-image samples by @LooperXX in #338
- Upload Tasks: CinePile by @JARVVVIS in #343
- [Update] Allow pass in max pixels and num frames in qwen2vl by @kcz358 in #346
- funqa update by @Nicous20 in #341
- Update Vinoground to make evaluation consistent with paper by @HanSolo9682 in #354
- Update mmmu_pro_standard.yaml by @zhijian-liu in #353
- Upload tasks: MovieChat-1K, VDC by @Espere-1119-Song in #342
- [Feat] Add AuroraCap, MovieChat, LLaVA-OneVision-MovieChat by @Espere-1119-Song in #358
- update docs for VDC and MovieChat by @rese1f in #359
- [WIP] feat: update to use azure api by @Luodian in #340
- Update MLVU answer parsing by @Xiuyu-Li in #364
- Add task docs for Vinoground by @HanSolo9682 in #372
- [Add Dataset] NaturalBench(NeurIPS24) by @Baiqi-Li in #371
- Update README.md by @kcz358 in #377
- fix model_specific_prompt_kwargs of VDC and MovieChat-1K by @Espere-1119-Song in #382
- Add os import to mathverse_evals.py by @spacecraft1013 in #381
- [Fix] Fix hallu bench by @kcz358 in #392
- Fix "percetion" typo (issue #396) by @Qu3tzal in #397
- Add TemporalBench by @mu-cai in #402
- [Tiny Fix] fix dataset_kwargs in lmms_eval/api/task.py by @Li-Qingyun in #404
- Add model aria & fix on LongVideoBench by @teowu in #391
- [update] NaturalBench to README by @Baiqi-Li in #406
- add model Slime and Benchmark mme_realworld_lite by @yfzhang114 in #409
- Update VDC with SGLang by @Espere-1119-Song in #411
- Add video processing logic for idefics2 by @kcz358 in #418
- update the introduction of mme-realworld by @yfzhang114 in #416
- [Task] add MIA-Bench by @Luodian in #419
- Modify typos in run_example.md by @Espere-1119-Song in #422
- [Release] lmms-eval v0.3.0 release by @kcz358 in #428
- PyPI 0.3.0 by @pufanyi in #432
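The task guide from #303 documents the per-task YAML files. Below is a hedged sketch of what such a config typically contains, using a permissive loader for the `!function` tag these files use to point at Python callables; the dataset name and exact field set are illustrative, not copied from the repo.

```python
import yaml

# Register a constructor so the custom `!function` tag (used by lmms-eval
# task configs to reference Python callables) loads as a plain string here.
yaml.SafeLoader.add_constructor(
    "!function", lambda loader, node: loader.construct_scalar(node)
)

# Hypothetical task config illustrating the variables the guide explains;
# field names follow the lm-eval-harness style that lmms-eval builds on.
TASK_YAML = """
dataset_path: lmms-lab/my_dataset   # hypothetical HF dataset repo
task: my_task
test_split: test
output_type: generate_until
doc_to_visual: !function utils.my_doc_to_visual
doc_to_text: !function utils.my_doc_to_text
doc_to_target: answer
generation_kwargs:
  max_new_tokens: 16
metric_list:
  - metric: exact_match
"""

config = yaml.safe_load(TASK_YAML)
print(config["task"], config["metric_list"])
```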
New Contributors
- @ZhangYuanhan-AI made their first contribution in #291
- @HanSolo9682 made their first contribution in #326
- @LooperXX made their first contribution in #327
- @JARVVVIS made their first contribution in #343
- @Nicous20 made their first contribution in #341
- @Espere-1119-Song made their first contribution in #342
- @rese1f made their first contribution in #359
- @Xiuyu-Li made their first contribution in #364
- @Baiqi-Li made their first contribution in #371
- @spacecraft1013 made their first contribution in #381
- @Qu3tzal made their first contribution in #397
- @mu-cai made their first contribution in #402
- @Li-Qingyun made their first contribution in #404
Full Changelog: v0.2.4...v0.3.0
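As referenced from #318 above, llava_vid's `video_decode_backend` now defaults to "decord". The snippet below is a minimal sketch of uniform frame sampling with the `decord` package, the kind of decode step such a backend performs; the function name and frame count are illustrative.

```python
import numpy as np
from decord import VideoReader, cpu

def sample_frames(video_path: str, num_frames: int = 8) -> np.ndarray:
    """Uniformly sample `num_frames` RGB frames from a video via decord."""
    vr = VideoReader(video_path, ctx=cpu(0))
    indices = np.linspace(0, len(vr) - 1, num_frames).astype(int)
    return vr.get_batch(indices).asnumpy()  # shape: (num_frames, H, W, 3)

# frames = sample_frames("clip.mp4")  # then hand frames to the model's processor
```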
v0.2.4 add `generate_until_multi_round` to support interactive and multi-round evaluations; add models and fix glitches
What's Changed
- [Fix] Fix bugs in returning result dict and bring back anls metric by @kcz358 in #221
- fix: fix wrong args in wandb logger by @Luodian in #226
- [feat] Add check for existence of accelerator before waiting by @Luodian in #227
- add more language tasks and fix fewshot evaluation bugs by @Luodian in #228
- Remove unnecessary LM object removal in evaluator by @Luodian in #229
- [fix] Shallow copy issue by @pufanyi in #231
- [Minor] Fix max_new_tokens in video llava by @kcz358 in #237
- Update LMMS evaluation tasks for various subjects by @Luodian in #240
- [Fix] Fix async append result in different order issue by @kcz358 in #244
- Update the version requirement for `transformers` by @zhijian-liu in #235
- Add new LMMS evaluation task for wild vision benchmark by @Luodian in #247
- Add raw score to wildvision bench by @Luodian in #250
- [Fix] Strict video to be single processing by @kcz358 in #246
- Refactor wild_vision_aggregation_raw_scores to calculate average score by @Luodian in #252
- [Fix] Bring back process result pbar by @kcz358 in #251
- [Minor] Update utils.py by @YangYangGirl in #249
- Refactor distributed gathering of logged samples and metrics by @Luodian in #253
- Refactor caching module and fix serialization issue by @Luodian in #255
- [Minor] Bring back fix for metadata by @kcz358 in #258
- [Model] support minimonkey model by @white2018 in #257
- [Feat] add regression test and change saving logic related to `output_path` by @Luodian in #259
- [Feat] Add support for llava_hf video, better loading logic for llava_hf ckpt by @kcz358 in #260
- [Model] support cogvlm2 model by @white2018 in #261
- [Docs] Update and sort current_tasks.md by @pbcong in #262
- fix error name with infovqa task by @ZhaoyangLi-nju in #265
- [Task] Add MMT and MMT_MI (Multiple Image) Task by @ngquangtrung57 in #270
- mme-realworld by @yfzhang114 in #266
- [Model] support Qwen2 VL by @abzb1 in #268
- Support new task mmworld by @jkooy in #269
- Update current tasks.md by @pbcong in #272
- [feat] support video evaluation for qwen2-vl and add mix-evals-video2text by @Luodian in #275
- [Feat][Task] Add multi-round evaluation in llava-onevision; Add MMSearch Benchmark by @CaraJ7 in #277
- [Fix] Model name None in Task manager, mix eval model specific kwargs, claude retrying fix by @kcz358 in #278
- [Feat] Add support for evaluation of Oryx models by @dongyh20 in #276
- [Fix] Fix the error when running models caused by `generate_until_multi_round` by @pufanyi in #281 (see the sketch after this list)
- [fix] Refactor GeminiAPI class to add video pooling and freeing by @pufanyi in #287
- add jmmmu by @AtsuMiyai in #286
- [Feat] Add support for evaluation of InternVideo2-Chat && Fix evaluation for mvbench by @yinanhe in #280
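The headline addition, `generate_until_multi_round`, sits alongside the existing single-round generation hook. The sketch below only illustrates the idea; the class, request shape, and signatures are assumptions, not the actual lmms-eval model API.

```python
from typing import List, Tuple

class ToyModel:
    """Illustrative model exposing single- and multi-round generation hooks."""

    def generate_until(self, prompts: List[str]) -> List[str]:
        # Single round: one prompt in, one completion out.
        return [f"answer({p})" for p in prompts]

    def generate_until_multi_round(
        self, requests: List[Tuple[str, List[str]]]
    ) -> List[str]:
        # Multi-round: each request carries the conversation so far, so a task
        # can feed earlier answers back in as context (interactive evaluation,
        # e.g. the MMSearch flow added in #277).
        return [
            f"answer({' | '.join(history + [prompt])})"
            for prompt, history in requests
        ]
```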
New Contributors
- @YangYangGirl made their first contribution in #249
- @white2018 made their first contribution in #257
- @pbcong made their first contribution in #262
- @ZhaoyangLi-nju made their first contribution in #265
- @ngquangtrung57 made their first contribution in #270
- @yfzhang114 made their first contribution in #266
- @jkooy made their first contribution in #269
- @dongyh20 made their first contribution in #276
- @yinanhe made their first contribution in #280
Full Changelog: v0.2.3...v0.2.4
v0.2.3.post1
What's Changed
- [Fix] Fix bugs in returning result dict and bring back anls metric by @kcz358 in #221
- fix: fix wrong args in wandb logger by @Luodian in #226
Full Changelog: v0.2.3...v0.2.3.post1
v0.2.3 add language evaluations and remove registration to speed up loading tasks and models
What's Changed
- Update the blog link by @pufanyi in #196
- Bring back PR#52 by @kcz358 in #198
- fix: update from previous model_specific_prompt to current lmms_eval_kwargs to avoid warnings by @Luodian in #206
- [Feat] SGLang SRT commands in one go, async input for openai server by @kcz358 in #212
- [Minor] Add kill sglang process by @kcz358 in #213
- Support text-only inference for LLaVA-OneVision by @CaraJ7 in #215
- Fix `videomme` evaluation by @zhijian-liu in #209
- [feat] remove registration logic and add language evaluation tasks by @Luodian in #218 (see the lazy-loading sketch below)
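#218 drops the import-time registration pass so task and model loading gets faster. The pattern below is an illustrative sketch of that kind of change (map contents and helper are hypothetical): defer each import until a name is actually requested, instead of importing everything up front to fire registration decorators.

```python
import importlib

# Hypothetical name -> (module path, class name) map; real entries differ.
AVAILABLE_MODELS = {
    "llava": ("lmms_eval.models.llava", "Llava"),
    "qwen_vl": ("lmms_eval.models.qwen_vl", "Qwen_VL"),
}

def get_model(name: str):
    # Import on first use; nothing is loaded just to populate a registry.
    module_path, class_name = AVAILABLE_MODELS[name]
    return getattr(importlib.import_module(module_path), class_name)
```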
New Contributors
- @zhijian-liu made their first contribution in #209
Full Changelog: v0.2.2...v0.2.3
v0.2.2: add llava-onevision/mantis/llava-interleave/VILA and new tasks.
What's Changed
- Include VCR by @tianyu-z in #105
- [Small Update] Update the version of LMMs-Eval by @pufanyi in #109
- add II-Bench by @XinrunDu in #111
- Q-Bench, Q-Bench2, A-Bench by @teowu in #113
- LongVideoBench for LMMs-Eval by @teowu in #117
- Fix the potential risk by PR #117 by @teowu in #118
- add tinyllava by @zjysteven in #114
- Add docs for datasets upload to HF by @pufanyi in #120
- [Model] aligned llava-interleave model results on video tasks by @Luodian in #125
- External package integration using plugins by @lorenzomammana in #126 (see the usage sketch after this list)
- Add task VITATECS by @lscpku in #130
- add task gqa-ru by @Dannoopsy in #128
- add task MMBench-ru by @Dannoopsy in #129
- Add wild vision bench by @kcz358 in #133
- Add detailcaps by @Dousia in #136
- add MLVU task by @shuyansy in #137
- add process sync in evaluation metric computation via a temp file in lmms_eval/evaluator.py by @Dousia in #143
- [Sync Features] add vila, add wildvision, add vibe-eval, add interleave bench by @Luodian in #138
- Add muirbench by @kcz358 in #147
- Add a new benchmark: MIRB by @ys-zong in #150
- Add LMMs-Lite by @kcz358 in #148
- [Docs] Fix broken hyperlink in README.md by @abzb1 in #149
- Changes in llava_hf.py. Corrected the response split by role and added the ability to specify an EOS token by @Dannoopsy in #153
- Add default values for mm_resampler_location and mm_newline_position to make sure the Llavavid model can run successfully by @choiszt in #156
- Update README.md by @kcz358 in #159
- revise llava_vid.py by @Luodian in #164
- Add MMStar by @skyil7 in #158
- Add model Mantis to the LMMs-Eval supported model list by @baichuanzhou in #162
- Fix utils.py by @abzb1 in #165
- Add default prompt for seedbench_2.yaml by @skyil7 in #167
- Fix a small typo for live_bench by @pufanyi in #169
- [New Model] Adding Cambrian Model by @Nyandwi in #171
- Revert "[New Model] Adding Cambrian Model" by @Luodian in #178
- Fixed some issues in InternVL family and ScienceQA task. by @skyil7 in #174
- [Add Dataset] SEEDBench 2 Plus by @abzb1 in #180
- [New Updates] LLaVA OneVision Release; MVBench, InternVL2, IXC2.5 Interleave-Bench integration. by @Luodian in #182
- New pypi by @pufanyi in #184
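Plugin integration from #126 lets an external package contribute tasks without modifying the lmms-eval tree; the `LMMS_EVAL_PLUGINS` environment variable (later fixed in #297) names the packages to scan. A hedged usage sketch, assuming a hypothetical installed package `my_lmms_plugin` that ships extra task definitions:

```python
import os
import subprocess

# Hypothetical plugin package layout:
#   my_lmms_plugin/
#     tasks/
#       my_task/
#         my_task.yaml
#         utils.py
# LMMS_EVAL_PLUGINS is read at startup; the exact discovery mechanics are
# assumptions here, not a verified description of the implementation.
env = dict(os.environ, LMMS_EVAL_PLUGINS="my_lmms_plugin")
subprocess.run(
    ["python", "-m", "lmms_eval",
     "--model", "llava",
     "--tasks", "my_task",
     "--batch_size", "1"],
    env=env,
    check=True,
)
```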
New Contributors
- @tianyu-z made their first contribution in #105
- @XinrunDu made their first contribution in #111
- @teowu made their first contribution in #113
- @zjysteven made their first contribution in #114
- @lorenzomammana made their first contribution in #126
- @lscpku made their first contribution in #130
- @Dannoopsy made their first contribution in #128
- @Dousia made their first contribution in #136
- @shuyansy made their first contribution in #137
- @ys-zong made their first contribution in #150
- @abzb1 made their first contribution in #149
- @choiszt made their first contribution in #156
- @skyil7 made their first contribution in #158
- @baichuanzhou made their first contribution in #162
- @Nyandwi made their first contribution in #171
Full Changelog: v0.2.0...v0.2.2
v0.2.0.post1
What's Changed
- Include VCR by @tianyu-z in #105
- [Small Update] Update the version of LMMs-Eval by @pufanyi in #109
- add II-Bench by @XinrunDu in #111
- Q-Bench, Q-Bench2, A-Bench by @teowu in #113
- LongVideoBench for LMMs-Eval by @teowu in #117
- Fix the potential risk by PR #117 by @teowu in #118
- add tinyllava by @zjysteven in #114
- Add docs for datasets upload to HF by @pufanyi in #120
- [Model] aligned llava-interleave model results on video tasks by @Luodian in #125
New Contributors
- @tianyu-z made their first contribution in #105
- @XinrunDu made their first contribution in #111
- @teowu made their first contribution in #113
- @zjysteven made their first contribution in #114
Full Changelog: v0.2.0...v0.2.0.post1
v0.2.0
What's Changed
- pip package by @pufanyi in #1
- Fix mmbench dataset submission format by @pufanyi in #7
- [Feat] add correct tensor parallelism for larger size model. by @Luodian in #4
- update version to 0.1.1 by @pufanyi in #9
- [Tasks] Fix MMBench by @pufanyi in #13
- [Fix] Fix llava reproduce error by @kcz358 in #24
- add_ocrbench by @echo840 in #28
- Joshua/olympiadbench by @JvThunder in #37
- [WIP] adding mmbench dev evaluation (#75) by @Luodian in #46
- Add `llava` model for 🤗 Transformers by @lewtun in #47
- Fix types to allow nullables in `llava_hf.py` by @lewtun in #55
- Add REC tasks for testing model ability to locally ground objects, given a description. This adds REC for all RefCOCO datasets. by @hunterheiden in #52
- [Benchmarks] RealWorldQA by @pufanyi in #57
- add Llava-SGlang by @jzhang38 in #54
- Add MathVerse by @CaraJ7 in #60
- Fix typo in Qwen-VL that was causing "reference before assignment" by @tupini07 in #61
- New Task: ScreenSpot - Grounding (REC) and instruction generation (REG) on screens by @hunterheiden in #63
- [New Task] WebSRC (multimodal Q&A on web screenshots) by @hunterheiden in #69
- Bugfix: WebSRC should be token-level F1 NOT character-level by @hunterheiden in #70 (see the metric sketch after this list)
- Multilingual LLava bench by @gagan3012 in #56
- [Fix] repr llava doc by @cocoshe in #36
- add idefics2 by @jzhang38 in #59
- [Feat] Add qwen vl api by @kcz358 in #73
- Adding microsoft/Phi-3-vision-128k-instruct model. by @vfragoso in #87
- Add MathVerse in README.md by @CaraJ7 in #97
- add MM-UPD by @AtsuMiyai in #95
- add Conbench by @Gumpest in #100
- Update conbench in README by @Gumpest in #101
- update gpt-3.5-turbo version by @AtsuMiyai in #107
- [Upgrade to v0.2] Embracing Video Evaluations with LMMs-Eval by @Luodian in #108
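#70 above switches WebSRC scoring from character-level to token-level F1. For reference, a standard SQuAD-style token-level F1 (a generic sketch, not necessarily the repo's exact implementation):

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1: overlap counted over whitespace-delimited tokens."""
    pred_tokens, ref_tokens = prediction.split(), reference.split()
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Character-level matching would give partial credit to near-miss strings;
# token-level does not: token_f1("the red car", "a red car") == 2/3.
```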
New Contributors
- @pufanyi made their first contribution in #1
- @Luodian made their first contribution in #4
- @kcz358 made their first contribution in #24
- @echo840 made their first contribution in #28
- @JvThunder made their first contribution in #37
- @lewtun made their first contribution in #47
- @hunterheiden made their first contribution in #52
- @jzhang38 made their first contribution in #54
- @CaraJ7 made their first contribution in #60
- @tupini07 made their first contribution in #61
- @gagan3012 made their first contribution in #56
- @cocoshe made their first contribution in #36
- @vfragoso made their first contribution in #87
- @AtsuMiyai made their first contribution in #95
- @Gumpest made their first contribution in #100
Full Changelog: v0.1.0...v0.2.0
LMMs-Eval 0.1.0.dev
[Enhancement & Fix] Add tensor parallelism and fix LLaVA-W/MMBench issues.
LMMs-Eval 0.1.0 Release
Currently supports 40+ evaluation datasets with 60+ subsets/variants, and 5 commonly used LMMs.