GTLLMZoo 🦙

A comprehensive framework for aggregating, comparing, and evaluating Large Language Models (LLMs) through benchmark performance data from multiple sources.

📋 Overview

GTLLMZoo provides a unified platform for comparing LLMs across multiple dimensions including performance, efficiency, and safety. The framework aggregates data from various benchmark sources to enable researchers, developers, and decision-makers to make informed selections based on their specific requirements.

Key features:

Unified Benchmarks: Combines data from the Open LLM Leaderboard, LLM Safety Leaderboard, LLM Performance Leaderboard, and Chatbot Arena
Interactive UI: Intuitive filtering and selection interface built with Gradio
Comprehensive Metrics: Compare models across performance, safety, efficiency, and user preference metrics
Customizable Views: Select specific metrics and model attributes for focused comparison

🚀 Getting Started

Prerequisites

Python >= 3.9
gradio==4.9.0
Pandas
Beautiful Soup (for data scraping)

Installation

git clone https://github.com/git-disl/GTLLMZoo.git
cd GTLLMZoo
pip install -r requirements.txt

Running the Application

To run the application locally:

python app.py

For development with hot reloading:

gradio app.py

🔍 Features

LLM Comparison Tab

Compare LLMs based on:

Basic Information: Model name, parameter count, hub popularity
Benchmark Performance: Scores on ARC, HellaSwag, MMLU, TruthfulQA, Winogrande, GSM8K
Efficiency Metrics: Prefill time, decode speed, memory usage, energy efficiency
Safety Metrics: Non-toxicity, non-stereotype, fairness, ethics
Arena Performance: Chatbot Arena ranking, Elo scores, user votes
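
As a rough illustration of how these metric groups could map onto columns of a merged table, the sketch below keeps only the columns for the chosen groups. The column names and the `METRIC_GROUPS` mapping are hypothetical, not the project's actual schema, and plain dictionaries stand in for whatever structures the app really uses.

```python
# Hypothetical mapping from metric group to column names; the real
# column names in merged.csv may differ.
METRIC_GROUPS = {
    "basic": ["model", "params_b", "hub_likes"],
    "benchmark": ["arc", "hellaswag", "mmlu", "truthfulqa", "winogrande", "gsm8k"],
    "arena": ["arena_rank", "arena_elo", "arena_votes"],
}


def select_columns(rows, groups):
    """Keep only the columns belonging to the requested metric groups."""
    wanted = [col for group in groups for col in METRIC_GROUPS[group]]
    # Missing values become None so every row has the same keys.
    return [{col: row.get(col) for col in wanted} for row in rows]


rows = [
    {"model": "model-a", "params_b": 7, "hub_likes": 120, "mmlu": 61.2, "arena_elo": 1105},
    {"model": "model-b", "params_b": 13, "hub_likes": 45, "mmlu": 55.0},
]
view = select_columns(rows, ["basic", "arena"])
print(view[1])
```

Filling absent values with `None` keeps every row the same shape, which matters here because not every model appears on every leaderboard.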

Control Panel

Filter models by:

Model type
Architecture
Precision
License
Weight type
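
The filtering described above can be pictured as matching each model against a set of accepted values per attribute. This is a minimal sketch under assumed names: `filter_models`, the column names, and the sample values are illustrative and are not taken from `control.py`.

```python
def filter_models(rows, **criteria):
    """Return rows whose attributes match every criterion.

    Each criterion maps a column name to a collection of accepted values;
    an empty collection means "no restriction" for that column.
    """
    def matches(row):
        return all(
            not accepted or row.get(column) in accepted
            for column, accepted in criteria.items()
        )

    return [row for row in rows if matches(row)]


# Hypothetical sample records, not real leaderboard entries.
models = [
    {"model": "model-a", "precision": "float16", "license": "apache-2.0"},
    {"model": "model-b", "precision": "4bit", "license": "mit"},
    {"model": "model-c", "precision": "float16", "license": "mit"},
]
selected = filter_models(models, precision={"float16"}, license={"mit", "apache-2.0"})
print([m["model"] for m in selected])
```

Treating an empty selection as "no restriction" mirrors how a checkbox group with nothing ticked would typically leave the table unfiltered; whether the app behaves this way is an assumption.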

Data Export

Export filtered data to CSV for further analysis.
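
For readers who want to reproduce the export outside the UI, the following sketch writes a chosen set of columns to CSV with the standard library. The function name `export_csv` and the sample rows are assumptions for illustration; the app's own export path may work differently.

```python
import csv
import io


def export_csv(rows, columns, stream):
    """Write the given columns of each row to an open text stream as CSV."""
    # extrasaction="ignore" drops keys that are not in the selected columns.
    writer = csv.DictWriter(stream, fieldnames=columns, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)


# Hypothetical filtered rows.
filtered = [
    {"model": "model-a", "mmlu": 61.2, "license": "apache-2.0"},
    {"model": "model-c", "mmlu": 58.4, "license": "mit"},
]
buffer = io.StringIO()
export_csv(filtered, ["model", "mmlu"], buffer)
print(buffer.getvalue())
```

To write to disk instead, pass a file opened with `open(path, "w", newline="", encoding="utf-8")`; the `newline=""` argument prevents blank lines between rows on Windows.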

📊 Data Sources

GTLLMZoo aggregates data from:

Open LLM Leaderboard
LLM Safety Leaderboard
LLM Performance Leaderboard
Chatbot Arena Leaderboard
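
Conceptually, aggregation amounts to an outer join of the leaderboards on the model name. The sketch below shows that idea with plain dictionaries; `merge_sources`, the source labels, and the column names are hypothetical and do not describe what `merge.py` actually does. Matching names across leaderboards usually needs some normalization, which is omitted here.

```python
def merge_sources(sources, key="model"):
    """Outer-join records from several sources on a shared key.

    `sources` maps a source label to a list of row dictionaries. Columns
    other than the key are prefixed with the label to avoid collisions.
    """
    merged = {}
    for label, rows in sources.items():
        for row in rows:
            # Create the record on first sight of this model, then extend it.
            record = merged.setdefault(row[key], {key: row[key]})
            for column, value in row.items():
                if column != key:
                    record[f"{label}_{column}"] = value
    return list(merged.values())


# Hypothetical sample data, not real leaderboard scores.
sources = {
    "open_llm": [{"model": "model-a", "average": 64.1}, {"model": "model-b", "average": 58.3}],
    "arena": [{"model": "model-a", "elo": 1105}],
}
for record in merge_sources(sources):
    print(record)
```

A model that is missing from one source simply lacks that source's columns, so downstream code should treat absent keys as "not evaluated" rather than as zero.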

🏗️ Project Structure

app.py: Main Gradio UI application
leaderboard.py: Functions to load and process leaderboard data
control.py: UI control callbacks and filtering functions
data_structures.py: Data structure definitions for LLMs and datasets
utils.py: Utility functions and enum classes
scrape_llm_lb.py: Script to scrape the latest leaderboard data
merge.py: Functions to merge data from different sources
assets.py: Custom CSS and UI assets

💾 Data Files

llm.json: LLM metadata
dset.json: Dataset information
merged.csv: Merged data from all leaderboards
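
A small loader for these three files might look like the sketch below. It assumes the files sit together in one directory and makes no assumption about their internal schema; `load_data` is a hypothetical helper, not a function from `leaderboard.py`.

```python
import csv
import json
from pathlib import Path


def load_data(data_dir):
    """Load llm.json, dset.json, and merged.csv from `data_dir`."""
    data_dir = Path(data_dir)
    with open(data_dir / "llm.json", encoding="utf-8") as f:
        llms = json.load(f)
    with open(data_dir / "dset.json", encoding="utf-8") as f:
        datasets = json.load(f)
    with open(data_dir / "merged.csv", newline="", encoding="utf-8") as f:
        # DictReader yields one dict per row; every value is read as a string.
        merged = list(csv.DictReader(f))
    return llms, datasets, merged
```

Calling `load_data(".")` from the directory that holds the files returns the two parsed JSON objects and a list of row dictionaries. Numeric columns need explicit conversion before sorting or comparing.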

🤝 Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

1. Fork the repository
2. Create your feature branch (git checkout -b feature/amazing-feature)
3. Commit your changes (git commit -m 'Add some amazing feature')
4. Push to the branch (git push origin feature/amazing-feature)
5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

📬 Contact

Project Link: https://github.com/git-disl/GTLLMZoo

🙏 Acknowledgements

HuggingFace for hosting the original leaderboards
All benchmark creators and maintainers
The open-source LLM community
