Releases: stanford-crfm/helm
v0.5.11
Breaking Changes
- Remove deprecated Claude 3.5 model from LLM juries (#3926)
- Switch Gemini 2.5 to use
GoogleGenAIClient(#3944)
Models
- Add Claude Haiku 4.5 (#3922)
- Add GPT-5.1 model (#3927, 3937)
- Add
GoogleGenAIClient(#3925) - Add Gemini 3 Pro (#3936, 3937)
- Add Gemini Robotics-ER 1.5 (#3939)
- Add Kimi K2 model on Together (#3943)
- Switch Gemini 2.5 to use
GoogleGenAIClient(#3944) - Add more Qwen3 and Qwen3-Next models on Together AI (#3947)
Scenarios
- Convert AraTrust to use multiple_choice_joint adapter (#3920)
- Remove deprecated Claude 3.5 model from LLM juries (#3926)
- Remove stop sequences in ALRAGE scenario (#3948)
Framework
Contributors
Thank you to the following contributors for your work on this HELM release!
v0.5.10
Breaking Changes
- Python 3.9 is no longer supported as it reached end-of-life in October 2025. (#3903)
Models
- Add Palmyra X5 on Bedrock (#3905)
- Add IBM Granite 4 models on Hugging Face (#3906)
- Fix incorrect Arabic model configs (#3907)
Scenarios
- Fix
shc_privacy_medscenario in MedHELM (#3911)
Framework
- Remove Python 3.9 support (#3903)
- Separate the definition of the ArgumentParser from its usage (#3908)
Contributors
Thank you to the following contributors for your work on this HELM release!
v0.5.9
Models
- Add Arabic language models
- Add Qwen3-Next 80B A3B Thinking model (#3875)
- Add DeepSeek-R1-Distill-Llama-70B and DeepSeek-R1-Distill-Qwen-14B (#3873)
- Add Qwen3-Next 80B A3B Thinking model (#3891)
- Add Falcon3 models (#3894)
- Add Claude 4.5 Sonnet (#3897)
Scenarios
- Fix breakages in
shc_privacy_medandshc_proxy_med(#3876) - Allow applying regular expression pattern before mapping output (#3882)
- Add output mapping patterns for Arabic MCQA scenarios (#3885)
- Update speech language pathology scenarios to use Hugging Face Datasets (#3835)
Framework
- Make priority optional for run entries (#3865)
- Add
herror()andhexception()to logger (#3789) - Log error stack traces from within various clients (#3880)
Contributors
Thank you to the following contributors for your work on this HELM release!
v0.5.8
Models
- Add GLM-4.5-AIR-FP8 model (#3785)
- Add Qwen3 235B A22B Instruct 2507 FP8 (#3788)
- Add Gemini 2.5 Flash-Lite GA (#3776)
- Add gpt-oss (#3789, #3794)
- Add GPT-5 (#3793, #3797)
- Handle safety and usage guidelines errors from Grok API (#3770)
- Handle Gemini responses with max tokens reached during thinking (#3804)
- Add OpenRouterClient (#3811)
Scenarios
- Fix instructions and prompt formatting for InfiniteBench En.MC (#3790)
- Add MedQA and MedMCQA to MedHELM (#3781)
- Add or modify Arabic language scenarios:
- Add run expander for Arabic language instructions for Arabic MCQA scenarios (#3833)
- Allow configuration of LLM-as-a-judge models in MedHELM scenarios (#3812)
- Add user-configurable MedHELM scenario (#3844)
Frontend
- Display Arabic text in RTL direction in frontend (#3807)
- Fix regular expression query handling in run predictions (#3826)
- Fix invalid sort column index error in leaderboard (#3845)
Framework
- Migrate to pyproject.toml (#3767)
- Various fixes for proxy server (#3801, #3802, #3803)
- Raise error if helm-summarize is given a non-existent suite (#3805)
- Allow setting reference prefix characters (#3809)
- Auto-generate schema in helm-summarize if
--auto-generate-schemais specified (#3813, #3814, #3828, #3839, #3842, #3848, #3850) - Omit empty tables for metric groups in helm-summarize (#3851)
- Add
get_metadata()method for many scenarios and metrics (#3815, #3829, #3832, #3834, #3841, #3843, #3849, #3840, #3830)
Contributors
Thank you to the following contributors for your work on this HELM release!
v0.5.7
Breaking Changes
- In v0.5.6,
HuggingFaceClientwould callapply_chat_templatewithadd_generation_prompt()unspecified, which caused it to default toFalse. In v0.5.7,HuggingFaceClientnow callsadd_generation_prompt()withapply_chat_template=True. (#3703) - Previously, in the
model_deployments.yamlconfiguration files, when determining the default model deployment for a model, the first model deployment in the file would be preferred, and model deployments in the package would be preferred over those inprod_env/. Now, the last model deployment in the file will be preferred, and model deployments inprod_envwill be preferred over those in the package. (#3694)
Models
- Add o3-pro (#3671)
- Allow passing initialization kwargs to VLLMClient and OpenAIClient (#3690)
- Add support for Granite thinking on vLLM (#3692)
- Add some Brazilian models (#3686)
- Always set
add_generation_prompttoTrueinapply_chat_templateinHuggingFaceClient(#3703) - Add support for chat models on vLLM (#3729)
- Add Kimi K2 (#3758)
- Add Grok 4 (#3759)
Scenarios
- Add LMKT (Language model cultural alignment transfer) (#3682)
- Modify SLPHelm (#3679)
- Add InfiniteBench En.MC (#3687)
- Add MMMLU (#3699)
- Add Arabic MMLU, AlGhafa (#3700)
- Add AlGhafa (#3706)
- Add EXAMS (#3714)
- Add AraTrust (#3715)
- Add HEALTHQA-BR (#3711)
- Add BLEUX (#3711)
- Remove numeracy scenario (#3733)
- Fixes to HEIM (#3765)
Framework
- Improvements to logging (#3683, #3696, #3744)
- Change logic for default model deployments (#3694)
- Loosen various dependency version specifiers (#3725, #3726)
- Add short_description field to RunGroup in schema (#3755)
- Add support for configuring the number of retries using environment variables (#3766)
Frontend
- Show run group name and description from schema on run page (#3753)
Contributors
Thank you to the following contributors for your work on this HELM release!
v0.5.6
Breaking Changes
- Previously,
HuggingFaceClientdid not apply the chat template for chat models.HuggingFaceClientwill now automatically apply the chat template if the tokenizer contains one. This may result in incorrect behavior for some non-chat models that have tokenizers with chat templates e.g. Qwen2, Qwen 2.5. For these models, the default behavior can be overridden by explicitly settingapply_chat_templateinargs. (#3678)
Models
- Fixed bug regarding
reasoning_effortnot being set correct for OpenAI models (#3501) - Add Llama 4 Scout and Maverick on Together AI (#3504)
- Add Qwen2.5 Omni 7B (#3505, #3530)
- Add Grok 3 models (#3518, #3529, #3534)
- Add GPT-4.1 models (#3527, #3528)
- Add Gemini 2.5 models (#3538, #3614)
- Add OpenAI o3 and o4 series models (#3539, #3617)
- Adjust max_tokens restriction for Anthropic models (#3555)
- Add Vietnamese LLMs (#3563)
- Add Qwen3 235B on Together (#3579)
- Add Palmyra X5 (#3580)
- Add COREBench v0 audio scenario (#3567, #3610)
- Add Anthropic Claude 4 models (#3607)
- Add extended thinking to Claude 3.7 and Claude 4 (#3622)
- Support streaming in Anthropic client (#3624)
- Added support for reasoning content (#3632, #3650)
- Add DeepSeek-R1-0528 on Together AI (#3636)
- Add Amazon Nova Premier model (#3656)
- Add Marin 8B Instruct (#3658, #3666)
- Add HuggingFacePipelineClient (#3665)
- Add SmolLM2 (#3665, #3668)
- Add OLMo 2 and OLMoE models (#3667, #3676)
- Add Granite 3.3 8B Instruct (#3675)
- Add Mistral Medium 3 (#3674)
- Add Gemini 2.5 Flash-Lite (#3677)
- Apply chat template in HuggingFaceClient and make chat template usage configurable (#3678)
- Add support for OpenAI Responses API (#3652)
Scenarios
- Improvements to VHELM (#3500)
- Improvements to audio scenarios (#3500, #3523,)
- Improvements to MedHELM scnearios (#3507, #3499, #3506, #3496, #3509, #3503, #3522, #3508, #3540, #3535, #3548, #3545, #3560, #3565, #3557, #3566, #3577, #3625, #3631)
- Add Vietnamese language scenarios (#3464, #3511, #3594, #3536, #3556, #3551, #3611, #3634)
- Add KPI Edgar scneario (#3495)
- Add run entries with additional instructions for MMLU-Winogrande-Afr (#3497)
- Add run entries for reasoning models (#3541, #3544)
- Fix bug with AIR-Bench handling of empty responses (#3543)
- Fix bug regarding the MMLU-Pro computer science subject (#3547)
- Add InfiniteBench En.QA (#3588)
- Add speech pathology scenarios (#3553, #3558)
- Bugfix for Image2Struct: EMS metric fails when the inputs are identical homogenous images files (#3576)
- Fix download URL for TruthfulQA CSV file (#3601)
- Use Python logging library in hlog (#3598, #3606)
- Add OpenAI-MRCR scenario (#3653)
- Rename InfiniteBench Sum to InfiniteBench En.Sum (#3657)
- Change metrics for long context scenarios (#3659)
Framework
- Avoid accessing SQLite accounts file in helm-run (#3573)
- Support video understanding (#3578)
- Use Python logging library (#3598)
Frontend
- Support displaying nested objects in annotations (#3513)
- Add displaying nested objects in annotations (#3512)
- Fix mobile menu z-index (#3514)
- Change favicon URL (#3597)
- Set MIME type for config.js in server (#3605)
- Fixes pagination of metrics displayed on the front-end (#3646)
Contributors
Thank you to the following contributors for your work on this HELM release!
v0.5.5
Breaking Changes
- Optimum Intel OpenVINO is no longer supported (#3153)
Scenarios
- HELM Capabilities scenarios
- Add GPQA scenario (#3068, #3078, #3079, #3100, #3096, #3420, #3445)
- Add MMLU-Pro scenario (#3108, #3125, #3200, #3272, #3458)
- Add IFEval scenario (#3122, #3275)
- Add WildBench (#3150, #3283, #3299, #3339, #3318, #3360)#3476
- Add Omni-MATH (#3271, #3299, #3271, #3299, #3291, #3322, #3348, #3372, #3373, #3407, #3407)
- IBM enterprise scenarios
- Add Gold Commodity News scenario (#3065)
- Add Legal Contract Summarization scenario (#3131)
- Add CASEHold scenario (#3164)
- Add SUMO Web Claims Summarization scenario (#3112)
- Add CTI-to-MITRE scenario (#3249)
- Add Legal Opinion Sentiment Classification scenario (#3286)
- Add Financial Phrasebank scenario (#3302)
- Add ConvFinQACalc (#3453)
- Add ECHR Judgment Classification scenario (#3311)
- Vision-language model scenarios
- Speech / audio model scenarios
- Add AudioMNIST scenario (#3093)
- Add CoVost-2: Speech Machine Translation (#3106)
- Add Vocal Sound scenario (#3130)
- Add Multilingual Librispeech (#3130, #3423)
- Add AudioCaps scenario (#3137)
- Add IEMOCAP Audio scenario (#3139)
- Add MELD Audio scenario (#3142)
- Add FLEURS scenario (#3130, #3151, #3287, #3299)
- Add Casual Conversation V2 audio scenario (#3158)
- Add Common_Voice_15 and RSB audio scenarios (#3147)
- Add Audio PAIRS audio scenario (#3149)
- Add VoxCeleb2Scenario for audio identification (#3179)
- Add AIR-Bench chat and foundation audio scenarios (#3189, #3362, #3486)
- Add MuTox Scenario (#3343)
- Add MUStARDScenario for sarcasm detection (#3345)
- Add AMI, LibriSpeech audio scenarios (#3355)
- Add VoiceJailbreakAttacksScenario (#3361)
- Add GPT4 evaluator for open-ended audio scenarios (#3417)
- Add LibriSpeech and FLEURS gender fairness audio scenarios (#3418)
- Add PARADE audio scenario (#3424)
- Misc
- Add AutoBencher Capabilities scenario (#3178)
- Add AutoBencher Safety Scenario (#3165)
- Add ENEM Challenge Scenario (#3185)
- Add BigCodeBench (#3186, #3310)
- Add experimental CzechBankQA scenario (#3222, #3240)
- Add TweetSentBR Scenario (#3219)
- Add HarmBench GCG-T (#3035)
- Add MMLU and Winogrande human-translated into 11 African languages (#3237, #3256)
- Pin revision in many invocations of Hugging Face load_datasets() (#3124)
- Add BIRD SQL scenario (#3292, #3312)
- Add Spider 1.0 scenario (#3300, #3371)
- Add helpdesk call summarization scenario (#3303, #3387, #3388)
- Adding IMDB_PTBR Scenario (#3284)
- Adding OAB Exams Scenario (#3329)
- Disable test_math_scenario due to DMCA takedown (#3315)
- Add InfiniteBenchSum scenario (#3409, #3476)
- Add HotPotQA and SQuAD scenarios from RULER (#3411, #3476)
- Allow using alternate annotator models for AIR-Bench 2024 (#3468, #3488)
- Set trust_remote_code for TyDiQA and BANKING77 (#3473, #3477)
- MedHELM scenarios
- Fix LiveQA scenario (#3244)
- Add MedHallu scenario (#3483)
- Add MIMIC-IV-BHC scenario (#3459)
- Add all conditions in CLEAR scenario (#3466)
- Add judges to many MedHELM scenarios (#3484)
- Add or update 31 scenarios for MedHELM v1.0.0 (#3038, #3403, #3492, #3493) -
ACI-Bench,
ADHD-Behavior,
ADHD-MedEffects,
BMT-Status,
CDI-QA,
CLEAR-AD,
ClinicReferral,
DischargeMe,
EHRSHOT,
EHRSQL,
ENT-Referral,
HeadQA,
HospiceReferral,
MedAlign,
Medbullets,
MedCalc-Bench,
MedConfInfo,
MedDialog,
Medec,
MedicationQA,
MEDIQA,
MentalHealth,
MIMIC-IV Billing Code,
MIMIC-RRS,
MTSamples,
MTSamples Procedures,
N2C2-CT Matching,
NoteExtract,
PatientInstruct,
PubMedQA,
RaceBias
Models
- Add Mistral and Llama models on AWS Bedrock (#3034, #3092, #3095)
- Add Gemini-1.5-Pro-002 and Gemini-1.5-Flash-002 models (#3032, #3085)
- Add Qwen2.5 Instruct Turbo models on Together AI (#3063)
- Add Anthropic Claude 3.5 Sonnet (20241022) models (#3082)
- Add Mistral Pixtral (2409) (#3073)
- Add GPT-4o Audio Preview (2024-10-01) model (#3091)
- Add Qwen-Audio-Chat and Qwen2-Audio-Instruct (#3104, #3298, #3474)
- Allow setting device for Hugging Face models (#3109)
- Add Mistral Small and Ministral models (#3110)
- Add Llama-Omni-8B (#3119)
- Treat missing AI21 message content as empty string (#3123)
- Add stop sequence support to MistralClient (#3120)
- Deprecate OpenAI legacy completions API (#3144)
- Add Grok Beta model (#3145)
- Add Diva Llama model (#3148)
- Remove OpenVino support (#3153)
- Add support for IBM Granite models on Hugging Face (#3166, #3261)
- Add Claude 3.5 Haiku model (#3171)
- Add Pixtral Large and Mistral Large (2411) models (#3177)
- Add Upstage Solar models (#3181, #3198)
- Add Llama 3.1 Nemotron Instruct (70B) model on Together AI (#3172)
- Add NECTEC model (#3197)
- Add Llama 3.3 model (#3202)
- Add Maritaca AI model (Sabiá 7B) (#3185)
- Add gemini-2.0-flash-exp model (#3210)
- Add Qwen 2 VLM (#3247)
- Add Amazon Nova models (#3251, #3252, #3263, #3264, #3408, #3442)
- Add DeepSeek v3 model (#3253)
- Simplify credential management for Bedrock client (#3255)
- Add Llama 3.1 Instruct on Vertex AI (#3278)
- Add Phi 3.5 models (#3306)
- Add Mistral Small 3 model (#3308)
- Add QwQ model on Together AI (#3307)
- Add Deepseek-R1 model (#3305)
- Add o3-mini model (#3304)
- Handle content filtering from Azure OpenAI (#3319, #3321, #3327)
- Add DeepSeek R1 Distill Llama 8B and DeepSeek Code Instruct 6.7B (#3332)
- Add a version of DeepSeek R1 that hides thinking tokens from output (#3335, #3485)
- Add OpenAI audio models (#3346)
- Add Claude 3.7 model (#3366)
- Add GPT-4.5 model (#3382)
- Added
gemini-2.0-flash-thinking-exp-01-21(#3410) - Add SEA-HELM leaderboard and SEA-LIONv3 models (#3347)
- Added
OpenAITranscriptionThenCompletionClient(#3416) - Add request response format JSON schema support (#3415)
- Make Azure OpenAI deployment name configurable (#3421)
- Use Anthropic tokenizer from Hugging Face (#3467)
- Add Palmyra Fin model (#3475)
- Add Mistral Small 3.1 model (#3478)
- Added support for phi-3.5 through Azure (#3489)
- Add IBM Granite models hosted on IBM WatsonX (#3441)
Frontend
- Add web player for audio objects (#3098)
- Add a badge indicating if the release is latest or stale (#3116, #3126)
- Change title for HELM leaderboards (#3117)
- Add functionality for linking directly to instances in Predictions page (#3121)
- Improve leaderboard frontend navigation (#3330)
- Display messages in instances and reuqests in frontend (#3336, #3341)
- Add latest to the frontend version dropdown (#3338)
- Allow overriding Vite base URL with VITE_HELM_FRONTEND_BASE environment variable (#3426, #3428)
- Load project_metadata.json from the website rather than GitHub (#3427)
Framework
- Add extra_data field to Instance (#3094)
- Fix mean aggregation for tables (#3127, #3309)
- Add encryption for stored data for GPQA (#3216, #3242)
- Allow running recipes from the Unitxt catalog (#3267)
- Add support to redact model outputs (#3301)
- Allow processing output before metrics for reasoning models (#3333)
- Display instances extra_data in frontend (#3340)
- Allow arguments to be passed into annotators (#3487)
- Automatically apply the model=all run expander (#3491)
- Add support for adaptive sampling based on Reliable and Efficient Amortized Model-based Evaluation (Truong et al., 2025) (#3397)
Metrics
- Flip LPIPS so that '1' is better (#3055)
- Fix incorrect handling of labels in ClassificationMetric (#3289)
- Fix ASR WER and MER metrics (#3296)
Misc
- Change GCS download docs to use gcloud storage instead of gsutil (#3083)
- Fix minor bug in punkt installation logic (#3111)
- Improvements to audio utilities (#3128)
- New arg for quasi-exact match (#3257)
- Use uv for Update Requirements GitHub Action (#3444, #3452)
- Resolve static files using importlib_resources for crfm-proxy-server (#3460, #3461)
Contributors
Thank you to the following contributors for your work on this HELM release!
- @Adam-Kasumovic
- @aptandre
- @arseniy-klimovskiy
- @asad-aali
- @asillycat
- @aunell
- @chiheem
- @farzaank
- @haoqiu1
- @HennyJie
- @ImKeTT
- @JackJessada
- @jessicalundin
- @jmbanda
- @LAOS-Y
- @liamjxu
- @lucas-s-p
- @MiguelAFH
- @Miking98
- @mtake
- @raileymontalan
- @raulista1997
- @rbitr
- @RonalddMatias
- @ryokawajp
- @saikiranjakka
- @sangttruong
- @shakatoday
- @siyagoel
- @subhaviv
- @suhana13
- @teetone
- @thallysonjsa
- @vz-ibm
- @yifanmai
- @YiZeng623
- @yuhengtu
v0.5.4
Breaking Changes
- Python 3.8 is no longer supported - please use Python 3.9 to 3.11 instead.(#2978)
Scenarios
- Fix prompt for BANKING77 (#3009)
- Split up LINDSEA scenario (#2938)
- Normalize lpips and ssim for image2struct (#3020)
Models
- Add o1 models (#2989)
- Add Palmyra-X-004 model (#2990)
- Add Palmyra-Med and Palmyra-Fin models (#3028)
- Add Llama 3.2 Turbo models on Together AI (#3029)
- Add Llama 3 Instruct Lite / Turbo on Together AI (#3031)
- Add Llama 3 CPT SEA-Lion v2 models (#3036)
- Add vision support to Together AI client (#3041)
Frontend
- Display null annotator values correctly in the frontend (#3003)
Framework
- Add support for Python 3.11 (#2922)
- Fix incorrect handling of ties in win rate computation (#3001, #2008)
- Add mean row aggregation to HELM summarize (#2997, #3030)
Developer Workflow
Contributors
Thank you to the following contributors for your work on this HELM release!
v0.5.3
Breaking Changes
- The
--models-to-runflag inhelm-runmust now be set if a models run expander such asmodels=textis used (#2852) - The
--jqueryflag has been removed fromhelm-serverbecause the legacy frontend is no longer supported (#2852)
Scenarios
- Improve DecodingTrust scenario (#2734, #2600)
- Add BHASA scenarios (#2648, #2914, #2913, #2937)
- Add BHASA LINDSEA scenarios (#2694)
- Change AIR-Bench main score to refusal rate (#2788, #2802, #2873)
- Add EWoK scenario (#2812, #2850, #2882, #2897, #2899)
- Add FinanceBench scenario (#2798)
- Add XSTest Scenario (#2831)
- Add AnthropicRedTeam scenario (#2830)
- Add SimpleSafetyTests Scenario(#2828)
- Add HarmBench Scenario (#2829, #2935)
- Add BANKING77 scenario (#2947)
- Change source dataset URL for Momentos scenario for VHELM (#2823)
- Add RealWorldQA, EXAMS-V, and FairFace scenarios for VHELM (#2825)
- Update Image2Struct scenarios (#2879, #2878, #2888, #2890, #2891, #2919, #2920)
Models
- Add SambaLingo Thai models (#2747, #2757)
- Add more Typhoon family models (#2745, #2768)
- Add SeaLLM models (#2744)
- Add OpenThaiGPT models (#2743)
- Add SambaLingo-Thai-Base-70B and SambaLingo-Thai-Chat-70B (#2758, #2757, #2782)
- Add Claude 3.5 Sonnet (20240620) (#2763)
- Add multi-GPU support to HuggingFaceClient (#2762)
- Add AI21 Jamba Instruct (#2766)
- Add Gemma 2 and Gemma 2 Instruct models (#2796, #2862)
- Deleted many deprecated models (#2668, #2814)
- Deleted many deprecated window services (#2669)
- Add Phi-3 models (#2815)
- Switched AI21 models to use local tokenizer (#2775)
- Add GPT-4o mini (#2827)
- Add Mistral NeMo (#2826)
- Add Llama 3.1 Instruct Turbo (#2835, #2840, #2844, #2880, #2898)
- Add Mistral Large 2 (#2839)
- Add Nemotron-4-Instruct (#2892, #2896, #2901)
- Add GPT-4o (2024-08-06) (#2900)
- Add Jamba 1.5 models (#2957)
- Add Llama Guard 3 (#2968)
Frontend
- Fix bug causing repeated renders and excessive CPU usage on some HELM landing pages (#2816)
- Fix bug causing Predictions page to repeatedly download schema.json (#2847)
- Fix spurious AbortError warnings in console logs (#2811)
- Fix incorrect handling perturbations in run predictions frontend (#2950)
Framework
- Support other reference prefixes in MultipleChoiceJointAdapter (#2809)
- Add validation for --models-to-run (#2852)
- Remove pyext from dependencies (#2921)
- Make Perspective API dependencies optional (#2924)
Misc
- Add additional instructions for more scenarios in
output_format_instructions(#2789, #2821, #2822, #2824, #2902, #2906, #2952, #2963) - Allow the
output_format_instructionsrun expander to add additional instructions as suffix (#2964) - Changelog messages are now in present tense rather than past tense, to align with Git commit message style
- Leaderboard releases are no longer included in this changelog, and will be included in
LEADERBOARD_CHANGELOG.mdinstead
Contributors
Thank you to the following contributors for your work on this HELM release!
v0.5.2
Scenarios
- Updated VHELM scenarios for VLMs (#2719, #2684, #2685, #2641, #2691)
- Updated Image2Struct scenarios (#2608, #2640, #2660, #2661)
- Added Automatic GPT4V Evaluation for VLM Originality Evaluation
- Added FinQA scenario (#2588)
- Added AIR-Bench 2024 (#2698, #2706, #2710, #2712, #2713)
- Fixed
entity_data_imputationscenario breakage by mirroring source data files (#2750)
Models
- Added google-cloud-aiplatform~=1.48 dependency requirement for Vertex AI client (#2628)
- Fixed bug with Vertex AI client error handling (#2614)
- Fixed bug with for Arctic tokenizer (#2615)
- Added Qwen1.5 110B Chat (#2621)
- Added TogetherCompletionClient (#2629)
- Fixed bugs with Yi Chat and Llama 3 Chat on Together (#2636)
- Added Optimum Intel (#2609, #2674)
- Added GPT-4o model (#2649, #2656)
- Added SEA-LION 7B and SEA-LION 7B Instruct (#2647)
- Added more Gemini 1.5 Flash and Pro versions (#2653, #2664, #2718, #2718)
- Added Gemini 1.0 Pro 002 (#2664)
- Added Command R and Command R+ models (#2548)
- Fixed GPT4V Evaluator Out of Option Range Issue (#2677)
- Added OLMo 1.5 (#2671)
- Added RekaClient (#2675)
- Added PaliGemma (#2683)
- Added Mistral 7B Instruct v0.1, v0.2 and v0.3 (#2665)
- Switched most Together chat models to use the chat client (#2703, #2701, #2705)
- Added MedLM model (#2696, #2709)
- Added Typhoon v1.5 models (#2659)
- Changed HuggingFaceClient to truncate end of text token (#2643)
- Added Qwen2 Instruct (72B) (#2722)
- Added Yi Large (#2723, #1731)
- Added Sailor models (#2658)
- Added BioMistral and Meditron (#2728)
Frontend
Framework
- Removed
adapter_specfromschema_*.yamlfiles (#2611) - Added support for annotators / LLM-as-judge (#2622, #2700)
- Updated documentation (#2626, #2529, #2521)
Evaluation Results
- MMLU v1.2.0
- Added results for DBRX Instruct, DeepSeek LLM Chat (67B), Gemini 1.5 Pro (0409 preview), Mistral Small (2402), Mistral Large (2402), Arctic Instruct
- MMLU v1.3.0
- Added results for Gemini 1.5 Flash (0514 preview), GPT-4o (2024-05-13), Palmyra X V3 (72B)
- MMLU v1.4.0
- Added results for Yi Large (Preview), OLMo 1.7 (7B), Command R, Command R Plus, Gemini 1.5 Flash (001), Gemini 1.5 Pro (001), Mistral Instruct v0.3 (7B), GPT-4 Turbo (2024-04-09), Qwen1.5 Chat (110B), Qwen2 Instruct (72B)
- Image2Struct v1.0.0
- Initial release with Claude 3 Sonnet (20240229), Claude 3 Opus (20240229), Gemini 1.0 Pro Vision, Gemini 1.5 Pro (0409 preview),IDEFICS 2 (8B), IDEFICS-instruct (9B), IDEFICS-instruct (80B), LLaVA 1.5 (13B), LLaVA 1.6 (13B), GPT-4o (2024-05-13), GPT-4V (1106 preview), Qwen-VL Chat
- AIR-Bench v1.0.0
- Initial release with Claude 3 Haiku (20240307), Claude 3 Sonnet (20240229), Claude 3 Opus (20240229), Cohere Command R, Cohere Command R Plus, DBRX Instruct, DeepSeek LLM Chat (67B), Gemini 1.5 Pro (001, default safety), Gemini 1.5 Flash (001, default safety), Llama 3 Instruct (8B), Llama 3 Instruct (70B), Yi Chat (34B), Mistral Instruct v0.3 (7B), Mixtral Instruct (8x7B), Mixtral Instruct (8x22B), GPT-3.5 Turbo (0613), GPT-3.5 Turbo (1106), GPT-3.5 Turbo (0125), GPT-4 Turbo (2024-04-09), GPT-4o (2024-05-13), Qwen1.5 Chat (72B)
Contributors
Thank you to the following contributors for your work on this HELM release!