Skip to content

alvinreal/awesome-opensource-ai

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

235 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Awesome Open Source AI

Awesome Open Source AI

A curated list of battle-tested, production-proven open-source AI models, libraries, infrastructure, and developer tools. Only elite-tier projects make this list.

Awesome PRs Welcome License: CC0-1.0

by Boring Dystopia Development

boringdystopia.ai   X @alvinunreal   Telegram Join channel


📋 Contents


🧬 1. Core Frameworks & Libraries

Core libraries and frameworks used to build, train, and run AI and machine learning systems.

Deep Learning Frameworks

  • PyTorch GitHub stars - Dynamic computation graphs, Pythonic API, dominant in research and production. The current standard for most frontier AI work.
  • TensorFlow GitHub stars - End-to-end platform with excellent production deployment, TPU support, and large-scale serving tools.
  • JAX GitHub stars + Flax GitHub stars - High-performance numerical computing with composable transformations (JIT, vmap, grad). Rising favorite for research and scientific ML.
  • NumPyro GitHub stars - Probabilistic programming with NumPy powered by JAX for autograd and JIT compilation. Bayesian modeling and inference at scale.
  • Keras GitHub stars - High-level, beginner-friendly API that now runs on multiple backends (TensorFlow, JAX, PyTorch). Perfect for rapid experimentation.
  • tinygrad GitHub stars - Minimalist deep learning framework with tiny code footprint. The "you like pytorch? you like micrograd? you love tinygrad!" philosophy - simple yet powerful.
  • PaddlePaddle GitHub stars - Industrial deep learning platform from Baidu serving 23+ million developers and 760,000+ companies. China's first independent R&D framework with advanced distributed training and deployment capabilities.
  • PyTorch Geometric GitHub stars - Library for deep learning on irregular input data such as graphs, point clouds, and manifolds. Part of the PyTorch ecosystem.

Rust ML Frameworks

  • Burn GitHub stars - Next-generation deep learning framework in Rust. Backend-agnostic with CPU, GPU, WebAssembly support.
  • Candle (Hugging Face) GitHub stars - Minimalist ML framework for Rust. PyTorch-like API with focus on performance and simplicity.
  • linfa GitHub stars - Comprehensive Rust ML toolkit with classical algorithms. scikit-learn equivalent for Rust with clustering, regression, and preprocessing.

Julia ML Frameworks

  • Flux.jl GitHub stars - 100% pure-Julia ML stack with lightweight abstractions on top of native GPU and AD support. Elegant, hackable, and fully integrated with Julia's scientific computing ecosystem.
  • MLJ.jl GitHub stars - Comprehensive Julia machine learning framework providing a unified interface to 200+ models with meta-algorithms for selection, tuning, and evaluation. MIT licensed.

NLP & Transformers

  • spaCy (Explosion AI) GitHub stars - Industrial-strength natural language processing with 75+ languages, transformer pipelines, and production-grade NER, parsing, and text classification.
  • Transformers (Hugging Face) GitHub stars - The de facto standard library for pretrained NLP models. 1M+ models, 250,000+ downloads/day. BERT, GPT, Llama, Qwen, and hundreds more.
  • sentence-transformers GitHub stars - Classic library for sentence and image embeddings.
  • tokenizers (Hugging Face) GitHub stars - Fast state-of-the-art tokenizers for training and inference.

Data Processing & Manipulation

  • Pandas GitHub stars - The gold standard for data analysis and manipulation in Python.
  • Polars GitHub stars - Blazing-fast DataFrame library (Rust backend) - modern alternative to pandas for large-scale workloads.
  • cuDF GitHub stars - GPU DataFrame library from RAPIDS. Accelerates pandas workflows on NVIDIA GPUs with zero code changes using cuDF.pandas accelerator mode.
  • Modin GitHub stars - Parallel pandas DataFrames. Scale pandas workflows by changing a single line of code - distributes data and computation automatically.
  • Dask GitHub stars - Parallel computing for big data - scales pandas/NumPy/scikit-learn to clusters.
  • NumPy GitHub stars - Fundamental array computing library that powers almost every AI stack.
  • SciPy GitHub stars - Scientific computing algorithms (optimization, linear algebra, statistics, signal processing).
  • NetworkX GitHub stars - Creation, manipulation, and study of complex networks. The foundational graph analysis library for Python data science.
  • cuGraph GitHub stars - GPU graph analytics library with NetworkX-compatible API. 10-100x faster than CPU for large-scale graph algorithms. Apache 2.0 licensed.
  • Vaex GitHub stars - Out-of-Core hybrid Apache Arrow/NumPy DataFrame for Python. Visualize and explore billion-row datasets at millions of rows per second. MIT licensed.
  • Datashader GitHub stars - High-performance large data visualization. Renders billions of points interactively without aggregation artifacts. BSD-3-Clause licensed.
  • Zarr GitHub stars - Chunked, compressed, N-dimensional array storage. Scalable tensor data format optimized for cloud and parallel computing. MIT licensed.
  • NVIDIA DALI GitHub stars - GPU-accelerated data loading and augmentation library with highly optimized building blocks for deep learning applications. Apache 2.0 licensed.
  • Narwhals GitHub stars - Lightweight compatibility layer between DataFrame libraries. Write Polars-like code that works seamlessly across pandas, Polars, cuDF, Modin, and more. MIT licensed.
  • Ibis GitHub stars - Portable Python dataframe library with 20+ backends. Write pandas-like code that runs locally with DuckDB or scales to production databases (BigQuery, Snowflake, PostgreSQL) by changing one line. Apache 2.0 licensed.

Classical ML & Gradient Boosting

  • scikit-learn GitHub stars - Industry-standard library for traditional machine learning (classification, regression, clustering, pipelines).
  • XGBoost GitHub stars - Scalable, high-performance gradient boosting library. Still dominates Kaggle and tabular competitions.
  • LightGBM GitHub stars - Microsoft's ultra-fast gradient boosting framework, optimized for speed and memory.
  • CatBoost GitHub stars - Gradient boosting that handles categorical features natively with great out-of-the-box performance.
  • sktime GitHub stars - Unified framework for machine learning with time series. Scikit-learn compatible API for forecasting, classification, clustering, and anomaly detection.
  • StatsForecast GitHub stars - Lightning-fast statistical forecasting with ARIMA, ETS, CES, and Theta models. Optimized for high-performance time series workloads.
  • cuML GitHub stars - GPU-accelerated machine learning algorithms with scikit-learn compatible API. 10-50x faster than CPU implementations for large datasets. Apache 2.0 licensed.
  • SynapseML GitHub stars - Distributed machine learning on Apache Spark. Scalable, composable APIs for text analytics, vision, anomaly detection with seamless Python/Scala/R/.NET integration. MIT licensed.

AutoML & Hyperparameter Optimization

  • Optuna GitHub stars - Modern, define-by-run hyperparameter optimization with pruning and visualizations. Extremely popular in 2026.
  • AutoGluon GitHub stars - AWS AutoML toolkit for tabular, image, text, and multimodal data - state-of-the-art with almost zero code.
  • FLAML GitHub stars - Microsoft's fast & lightweight AutoML focused on efficiency and low compute.
  • AutoKeras GitHub stars - Neural architecture search on top of Keras.

Interactive ML Apps & Notebooks

  • Streamlit GitHub stars - The fastest way to build and share data apps. Transform Python scripts into beautiful web applications with minimal code. Widely used for ML model demos, data visualization, and internal tools.
  • Gradio GitHub stars - Build and share delightful machine learning apps, all in Python. The de facto standard for creating interactive ML demos with automatic UI generation from function signatures. Powers thousands of Hugging Face Spaces.
  • Marimo GitHub stars - A reactive notebook for Python — run reproducible experiments, query with SQL, execute as a script, deploy as an app, and version with git. Stored as pure Python. All in a modern, AI-native editor.

Model Training & Optimization Utilities

  • Hugging Face Accelerate GitHub stars - Simple API to make training scripts run on any hardware (multi-GPU, TPU, mixed precision) with minimal code changes.
  • DeepSpeed GitHub stars - Microsoft's deep learning optimization library for extreme-scale training (ZeRO, offloading, MoE).
  • Transformers GitHub stars - Library of pretrained transformer models and utilities for text, vision, audio, and multimodal training and inference.
  • FlashAttention GitHub stars - Fast exact attention kernels that reduce memory usage and accelerate transformer training and inference.
  • xFormers GitHub stars - Optimized transformer building blocks and attention operators for PyTorch.
  • PyTorch Lightning GitHub stars - High-level wrapper for PyTorch that removes boilerplate and adds best practices.
  • ONNX Runtime GitHub stars - High-performance inference and training for ONNX models across hardware.
  • einops GitHub stars - Flexible, powerful tensor operations for readable and reliable code. Supports PyTorch, JAX, TensorFlow, NumPy, MLX.
  • safetensors GitHub stars - Simple, safe way to store and distribute tensors. Fast, secure alternative to pickle for model serialization.
  • torchmetrics GitHub stars - Machine learning metrics for distributed, scalable PyTorch applications. 80+ metrics with built-in distributed synchronization.
  • torchao GitHub stars - PyTorch native quantization and sparsity for training and inference. Drop-in optimizations for production deployment.
  • SHAP GitHub stars - Game theoretic approach to explain the output of any machine learning model. Industry standard for model interpretability.
  • skorch GitHub stars - Scikit-learn compatible neural network library that wraps PyTorch. Seamlessly integrate PyTorch models with scikit-learn pipelines, grid search, and cross-validation.

🧠 2. Open Foundation Models

Pretrained language, multimodal, speech, and video models with publicly available weights.

Large Language Models (Base + Chat)

  • RWKV-7 "Goose" (BlinkDL) GitHub stars - Novel RNN architecture with transformer-level LLM performance. 100% attention-free, linear-time, constant-space (no kv-cache), infinite ctx_len. Linux Foundation AI project with runtime already deployed in Windows & Office.
  • Qwen3.6-Plus (Alibaba) GitHub stars - Latest flagship series released April 2026 with 1M context window, agentic coding performance competitive with Claude 4.5 Opus, and enhanced multimodal capabilities.
  • Gemma 4 (Google) GitHub stars - Released April 2026 in four sizes (E2B, E4B, 26B MoE, 31B Dense). First major update in a year with Apache 2.0 license, complex logic, and agentic workflows.
  • Kimi K2 (Moonshot AI) GitHub stars - State-of-the-art 1T parameter MoE model with 32B activated parameters and 128K context. Trained with Muon optimizer for exceptional reasoning and coding performance.
  • Kimi K2.5 (Moonshot AI) GitHub stars - Frontier open-weight MoE model with 256K context, strong coding and reasoning performance, and native multimodal + tool-use support for agentic workflows.
  • Phi-4 (Microsoft) GitHub stars - Small but highly capable models optimized for reasoning, edge devices, and on-device inference. Includes Phi-4-reasoning variants with thinking capabilities.
  • GLM-5 (Zhipu AI) GitHub stars - Strong open model line with solid coding, reasoning, and agentic-task performance.
  • OLMo 2 (Allen AI) GitHub stars - Fully open-source LLMs (1B–32B) with complete transparency: models, data, training code, and logs. Designed by scientists, for scientists.
  • Llama 4 (Meta) GitHub stars - First native multimodal MoE open-source models (Scout: 10M context, Maverick: 400B+ params). Released April 2025 with enterprise-grade capabilities.
  • GPT-OSS (OpenAI) GitHub stars - OpenAI's first open-weight models since GPT-2 (120B and 20B MoE). Apache 2.0 licensed with state-of-the-art performance for their size class. Released August 2025.

Coding & Reasoning Models

Multimodal Models (Vision + Language)

  • MMaDA (Gen-Verse) GitHub stars - Open-sourced multimodal large diffusion language model with unified architecture for text, image generation and multimodal reasoning. MIT licensed, NeurIPS 2025.
  • Qwen3-VL (Alibaba) GitHub stars - Latest flagship VLM with native 256K context (expandable to 1M), visual agent capabilities, 3D grounding, and superior multimodal reasoning. Major leap over Qwen2.5-VL.
  • GLM-4.5V / GLM-4.1V-Thinking (Zhipu AI) GitHub stars - Strong multimodal reasoning with scalable reinforcement learning. Compares favorably with Gemini-2.5-Flash on benchmarks.
  • MiniCPM-V 2.6 GitHub stars - Handles images up to 1.8M pixels with top-tier OCR performance. Excellent for on-device deployment.
  • Gemma 4 (Google) GitHub stars - Multimodal model supporting vision-language input, optimized for efficiency, complex logic, and on-device use.
  • Magma (Microsoft) GitHub stars - Foundation model for multimodal AI agents that perceives the world and takes goal-driven actions across digital and physical environments. CVPR 2025.

Speech & Audio Models (TTS, STT, Music)

  • FunASR GitHub stars - Fundamental end-to-end speech recognition toolkit with SOTA pretrained models. Supports ASR, VAD, speaker verification, diarization, and multi-talker ASR. Industrial-grade with 31-language support and real-time transcription services. MIT licensed.
  • Whisper (OpenAI → community forks) GitHub stars - The gold-standard open speech-to-text model. Massive community fine-tunes available.
  • faster-whisper (SYSTRAN) GitHub stars - Reimplementation of Whisper using CTranslate2 for up to 4x faster inference with same accuracy. Supports batched processing and 8-bit quantization.
  • OuteTTS / CosyVoice 2 GitHub stars - High-quality open TTS with natural prosody and multilingual support.
  • Fish Speech / StyleTTS 2 GitHub stars - Zero-shot TTS with excellent voice cloning. Extremely popular in 2026.
  • MusicGen / AudioCraft (Meta) GitHub stars - Open music and audio generation models.
  • VibeVoice (Microsoft) GitHub stars - Open-source frontier voice AI with expressive, longform conversational speech synthesis. 7B parameter TTS with streaming support.
  • Qwen3-TTS (Alibaba) GitHub stars - Open TTS series supporting stable, expressive, and streaming speech generation with free-form voice design and vivid voice cloning. Natural language instruction-driven control over timbre, emotion, and prosody. Apache 2.0 licensed.
  • Chatterbox (Resemble AI) GitHub stars - State-of-the-art open TTS family with 350M parameter Turbo variant. Single-step generation with native paralinguistic tags for realistic dialogue.
  • Dia (Nari Labs) GitHub stars - 1.6B parameter TTS generating ultra-realistic dialogue in one pass with nonverbal communications (laughter, coughing). Emotion and tone control via audio conditioning.
  • Step-Audio (StepFun) GitHub stars - 130B-parameter production-ready audio LLM for intelligent speech interaction. Supports multilingual conversations (Chinese, English, Japanese), emotional tones, regional dialects (Cantonese, Sichuanese), adjustable speech rates, and prosodic styles including rap. Apache 2.0 licensed.
  • Voxtral TTS (Mistral) GitHub stars - 4B parameter state-of-the-art TTS with zero-shot voice cloning, 9-language support, and ~90ms time-to-first-audio for voice agents.
  • WhisperSpeech GitHub stars - Open source text-to-speech system built by inverting Whisper. High-quality voice cloning with zero-shot capabilities. MIT licensed.

Video & Animation Models


⚡ 3. Inference Engines & Serving

Inference runtimes, serving systems, and optimization tools for running models locally or in production.

Local / On-device Inference

  • llama.cpp GitHub stars - Pure C/C++ inference engine with GGUF format support. The gold standard for CPU/GPU/Apple Silicon on-device running. Includes llama-server for OpenAI-compatible API.
  • Ollama GitHub stars - Dead-simple local LLM runner with a one-line install, model registry, and OpenAI-compatible API.
  • MLX GitHub stars (Apple) - High-performance array framework + LLM inference optimized for Apple Silicon.
  • MLC-LLM GitHub stars - Deployment engine that compiles and runs LLMs across browsers, mobile devices, and local hardware.
  • WebLLM GitHub stars - High-performance in-browser LLM inference engine. Runs models directly in the browser with WebGPU acceleration.
  • llama-cpp-python GitHub stars - Official Python bindings for llama.cpp.
  • KoboldCpp GitHub stars - User-friendly llama.cpp fork focused on role-playing and creative writing.
  • RamaLama GitHub stars - Container-centric tool for simplifying local AI model serving. Automatically detects GPUs, pulls optimized container images, and runs models securely in rootless containers with enterprise-grade isolation.

High-performance Serving & API Servers

  • llm-d GitHub stars - Kubernetes-native distributed LLM inference framework. Donated to CNCF by RedHat, Google, and IBM. Intelligent scheduling, KV-cache optimization, and state-of-the-art performance across accelerators.
  • LMDeploy GitHub stars - Toolkit for compressing, deploying, and serving LLMs from OpenMMLab. 4-bit inference with 2.4x higher performance than FP16, distributed multi-model serving across machines.
  • vLLM GitHub stars - State-of-the-art serving engine with PagedAttention and continuous batching. Currently the fastest production-grade LLM server.
  • nano-vLLM GitHub stars - Minimalist vLLM implementation in ~1,200 lines of Python. Educational yet performant with prefix caching, tensor parallelism, and CUDA graph acceleration. Comparable inference speeds to full vLLM. MIT licensed.
  • SGLang GitHub stars - Next-gen serving framework with RadixAttention. Powers xAI's production workloads at 100K+ GPUs scale.
  • TensorRT-LLM GitHub stars - NVIDIA's official high-performance inference backend.
  • Aphrodite Engine GitHub stars - vLLM fork optimized for role-play and creative writing.
  • AIBrix GitHub stars - Cost-efficient and pluggable infrastructure components for GenAI inference. Kubernetes-native control plane for vLLM with distributed KV cache, heterogeneous GPU serving, and intelligent routing. Apache 2.0 licensed.
  • Triton Inference Server GitHub stars - NVIDIA's production-grade open-source inference serving software. Supports multiple frameworks (TensorRT, PyTorch, ONNX) with optimized cloud and edge deployment.
  • mistral.rs GitHub stars - Fast, flexible Rust-native LLM inference engine built on Candle. Supports text, vision, audio, image generation, and embeddings with hardware-aware auto-tuning.
  • KTransformers GitHub stars - Flexible framework for heterogeneous CPU-GPU LLM inference and fine-tuning. Enables running large MoE models by offloading experts to CPU with BF16/FP8 precision support.
  • llamafile GitHub stars - Mozilla's single-file distributable LLM solution. Bundle model weights, inference engine, and runtime into one portable executable that runs on six OSes without installation.
  • Xinference GitHub stars - Unified, production-ready inference API for LLMs, speech, and multimodal models. Drop-in GPT replacement with single-line code changes. Supports thousands of models with auto-batching and distributed inference.
  • LightLLM GitHub stars - Pure Python-based LLM inference and serving framework with lightweight design, easy extensibility, and high-speed performance. Integrates optimizations from FasterTransformer, TGI, vLLM, and SGLang.
  • TabbyAPI GitHub stars - FastAPI-based API server for ExLlamaV2/V3 backends. OpenAI-compatible API with support for model loading/unloading, embeddings, speculative decoding, multi-LoRA, and streaming.
  • GPUStack GitHub stars - GPU cluster manager that orchestrates inference engines like vLLM and SGLang. Automated engine selection, parameter optimization, and distributed multi-GPU deployment for high-performance AI workloads.
  • One-API GitHub stars - LLM API management and key redistribution system. Unifies multiple providers (OpenAI, Anthropic, Azure, etc.) under a single OpenAI-compatible API with built-in rate limiting, quota management, and cost tracking. MIT licensed.
  • OpenLLM (BentoML) GitHub stars - Production-grade platform for running any open-source LLMs as OpenAI-compatible API endpoints. Supports 50+ models with built-in streaming, batching, and auto-acceleration. Apache 2.0 licensed.
  • Higress (Alibaba) GitHub stars - AI-native API gateway born from Alibaba's internal infrastructure with 2+ years of production validation. Provides unified LLM API and MCP (Model Context Protocol) management with enterprise-grade 99.99% availability. Apache 2.0 licensed.

Quantization, Distillation & Optimization

  • GGUF GitHub stars (part of llama.cpp) - Modern quantized format that powers most local inference.
  • bitsandbytes GitHub stars - 8-bit and 4-bit optimizers + quantization.
  • ExLlamaV2 GitHub stars - Highly optimized CUDA kernels for 4-bit/8-bit inference.
  • Optimum GitHub stars - Hardware-specific acceleration and quantization.

🤖 4. Agentic AI & Multi-Agent Systems

Frameworks and platforms for building agent-based systems and multi-agent workflows.

Single-Agent Frameworks

  • LangGraph GitHub stars - Stateful, controllable agent orchestration.
  • CrewAI GitHub stars - Role-based agent framework.
  • AutoGen (AG2) GitHub stars - Flexible multi-agent conversation framework.
  • DSPy GitHub stars - Framework for programming language model pipelines with modules, optimizers, and evaluation loops.
  • Semantic Kernel GitHub stars - SDK for building and orchestrating AI agents and workflows across multiple programming languages.
  • smolagents GitHub stars - Lightweight agent framework centered on tool use and code-executing workflows.
  • LangChain GitHub stars - Foundational library for agents, chains, and memory.
  • Hermes Agent (NousResearch) GitHub stars - The agent that grows with you. Autonomous server-side agent with persistent memory that learns and improves over time.
  • Agno GitHub stars - Build, run, and manage agentic software at scale. High-performance framework for multi-agent systems with memory, knowledge, and tools.
  • Upsonic GitHub stars - Agent framework for fintech and banking with built-in MCP support, guardrails, and tool server architecture.
  • VoltAgent GitHub stars - TypeScript-first AI agent engineering platform with memory, RAG, workflows, MCP integration, and voice support.
  • PocketFlow GitHub stars - 100-line minimalist LLM framework for building agent workflows. Lightweight, extensible architecture for tool use and autonomous task execution.
  • Agent Development Kit (Google) GitHub stars - Code-first Python toolkit for building sophisticated AI agents with multi-agent orchestration, built-in evaluation, and flexible deployment. Model-agnostic with tight Google ecosystem integration. Apache 2.0 licensed.

Multi-Agent Orchestration

  • MetaGPT GitHub stars - Simulates an entire "AI software company".
  • CAMEL GitHub stars - First and best multi-agent framework for building scalable agent systems. Apache 2.0 licensed with extensive tooling for agent communication and task automation.
  • Swarms GitHub stars - Bleeding-edge enterprise multi-agent orchestration.
  • Llama-Agents GitHub stars - Async-first multi-agent system.
  • Mastra GitHub stars - TypeScript-first agent framework with built-in RAG, workflows, tool integrations, observability and observational memory.
  • Deer-Flow (ByteDance) GitHub stars - Open-source long-horizon SuperAgent harness that researches, codes, and creates. Handles tasks from minutes to hours with sandboxes, memories, tools, skills, subagents, and message gateway.
  • OpenAI Agents SDK GitHub stars - Production-ready lightweight framework for multi-agent workflows. The evolution of Swarm with enhanced orchestration capabilities and enterprise-grade features.
  • AgentScope GitHub stars - Alibaba's production-ready multi-agent framework with 23K+ stars. Features built-in MCP and A2A support, message hub for flexible orchestration, and AgentScope Runtime for production deployment.
  • Microsoft Agent Framework GitHub stars - Microsoft's official framework combining AutoGen's agent abstractions with Semantic Kernel's enterprise features. Supports Python and .NET with graph-based workflows.
  • Agency Swarm GitHub stars - Reliable multi-agent orchestration framework built on top of the OpenAI Assistants API with organizational structure modeling.
  • elizaOS GitHub stars - Autonomous multi-agent framework for building and deploying AI-powered applications. Features Discord/Telegram/Farcaster connectors, RAG support, and a modern web dashboard.
  • Agent Squad (AWS Labs) GitHub stars - Flexible multi-agent orchestration framework with intelligent intent classification and context management. Supports Python and TypeScript with pre-built agents for Bedrock, Lex, and custom integrations. Apache 2.0 licensed.
  • DeepResearchAgent GitHub stars - Hierarchical multi-agent system for deep research tasks with automated task decomposition and execution across complex domains.
  • BeeAI Framework (IBM) GitHub stars - Production-ready multi-agent framework in Python and TypeScript. Features workflow orchestration, ACP/MCP protocol support, and deep watsonx integration. Part of Linux Foundation AI & Data program.

Autonomous Coding Agents

  • OpenHands (ex-OpenDevin) GitHub stars - Full-featured open-source AI software engineer.
  • Goose GitHub stars - Extensible on-machine AI agent for development tasks.
  • OpenCode GitHub stars - Terminal-native autonomous coding agent.
  • Aider GitHub stars - Command-line pair-programming agent.
  • Pi (badlogic) GitHub stars - Terminal coding agent with hash-anchored edits, LSP integration, subagents, MCP support, and package ecosystem.
  • Mistral-Vibe (Mistral) GitHub stars - Minimal CLI coding agent by Mistral. Lightweight, fast, and designed for local development workflows.
  • Nanocoder (Nano-Collective) GitHub stars - Beautiful local-first coding agent running in your terminal. Built for privacy and control with support for multiple AI providers via OpenRouter.
  • Gemini CLI (Google) GitHub stars - Open-source AI agent that brings Gemini's power directly into your terminal. Supports code generation, shell execution, and file editing with full Apache 2.0 licensing.

Domain-Specific Agents

  • Composio GitHub stars - Tool integration layer for AI agents with 1000+ toolkits, authentication management, and sandboxed workbench. Powers tool use across major frameworks.
  • Langflow GitHub stars - Visual low-code platform for agentic workflows.
  • Dify GitHub stars - Production-ready agentic workflow platform.
  • OWL (camel-ai/owl) GitHub stars - Advanced multi-agent collaboration system.
  • AI-Scientist-v2 (SakanaAI) GitHub stars - Workshop-level automated scientific discovery via agentic tree search. Generates novel research ideas, runs experiments, and writes papers.
  • PraisonAI GitHub stars - 24/7 AI employee team for automating complex challenges. Low-code multi-agent framework with handoffs, guardrails, memory, RAG, and 100+ LLM providers.
  • Agent-S (Simular AI) GitHub stars - Open agentic framework that uses computers like a human. SOTA on OSWorld benchmark (72.6%) for GUI automation and computer control.
  • Browser Use GitHub stars - Makes websites accessible for AI agents. Enables autonomous web automation, data extraction, and task completion with natural language instructions. MIT licensed.
  • TradingAgents GitHub stars - Multi-agent framework for financial trading. Simulates professional trading firm operations with 6+ specialized agent roles, backtesting, risk management, and portfolio optimization. Built with LangGraph, supports multiple LLM providers.
  • Parlant GitHub stars - Conversational control layer for customer-facing AI agents. Enterprise-grade context engineering framework optimized for consistent, compliant, and on-brand B2C and sensitive B2B interactions. Apache 2.0 licensed.

Agent Memory & State

  • Letta (ex-MemGPT) GitHub stars - Platform for building stateful agents with advanced memory that learn and self-improve over time.
  • Mem0 GitHub stars - Universal memory layer for AI agents. Persistent, multi-session memory across models and environments.
  • Hindsight GitHub stars - State-of-the-art long-term memory for AI agents by Vectorize. Fully self-hosted, MIT-licensed, with integrations for LangChain, CrewAI, LlamaIndex, Vercel AI SDK, and more.

🔍 5. Retrieval-Augmented Generation (RAG) & Knowledge

Retrieval systems, vector databases, embedding models, and related tooling for RAG pipelines.

Vector Databases & Search Engines

  • Chroma GitHub stars - Most popular open-source embedding database.
  • Qdrant GitHub stars - High-performance vector search engine in Rust.
  • Weaviate GitHub stars - GraphQL-native vector search engine.
  • Milvus GitHub stars - Scalable cloud-native vector database.
  • Faiss GitHub stars - Similarity search and clustering library for dense vectors with CPU and GPU implementations.
  • LanceDB GitHub stars - Serverless vector DB optimized for multimodal data.
  • Vespa GitHub stars - AI + Data platform with hybrid search (vector + keyword) and real-time indexing at scale. Battle-tested serving billions of queries daily.
  • pgvector GitHub stars - PostgreSQL extension for vector similarity search.
  • Quickwit GitHub stars - Cloud-native search engine for observability. Open-source alternative to Datadog, Elasticsearch, Loki, and Tempo with native vector search support.
  • Tantivy GitHub stars - Full-text search engine library inspired by Apache Lucene and written in Rust. Powers Quickwit and other production search systems.
  • Manticore Search GitHub stars - Easy to use open source fast database for search. Good alternative to Elasticsearch with SQL-like interface and vector search capabilities.
  • OpenSearch GitHub stars - Open-source distributed and RESTful search and analytics suite with native vector search. Enterprise-grade fork of Elasticsearch with k-NN plugin for semantic search at scale.
  • Marqo GitHub stars - Multimodal vector search for text, image, and structured data. End-to-end indexing and search with built-in embedding models. Apache 2.0 licensed.
  • Vald GitHub stars - Highly scalable distributed vector search engine. Cloud-native architecture with automatic indexing, horizontal scaling, and multiple ANN algorithm support. Apache 2.0 licensed.
  • Annoy GitHub stars - Approximate nearest neighbors library optimized for memory usage and fast loading. Powers Spotify's music recommendation with C++/Python bindings. Apache 2.0 licensed.

Embedding Models

  • BGE (FlagEmbedding) GitHub stars - BAAI's best-in-class embedding family.
  • E5 (Microsoft) GitHub stars - High-performance text embeddings for retrieval.
  • FastEmbed (Qdrant) GitHub stars - Lightweight, fast Python library for embedding generation with ONNX Runtime. Supports text, sparse (SPLADE), and late-interaction (ColBERT) embeddings without GPU dependencies. Apache 2.0 licensed.
  • EmbedAnything GitHub stars - Minimalist, highly performant multimodal embedding pipeline built in Rust. Memory-safe, modular, and production-ready for text, image, and audio embeddings with seamless vector DB integration. Apache 2.0 licensed.

Embedding Benchmarks

  • MTEB GitHub stars - Massive Text Embedding Benchmark covering 1000+ languages and diverse tasks. The industry standard for evaluating and comparing embedding models.

RAG Frameworks & Advanced Retrieval Tools

  • LlamaIndex GitHub stars - Full-featured RAG pipeline with advanced indexing.
  • Haystack GitHub stars - End-to-end NLP and RAG framework.
  • RAGFlow GitHub stars - Deep-document-understanding RAG engine.
  • GraphRAG (Microsoft) GitHub stars - Knowledge-graph-based RAG.
  • Docling GitHub stars - Document processing toolkit for turning PDFs and other files into structured data for GenAI workflows.
  • Unstructured GitHub stars - Best-in-class document preprocessing.
  • MinerU GitHub stars - High-accuracy document parsing for LLM and RAG workflows. Converts PDFs, Word, PPTs, and images into structured Markdown/JSON with VLM+OCR dual engine.
  • Marker GitHub stars - Fast, accurate PDF-to-markdown converter with table extraction, equation handling, and optional LLM enhancement for RAG pipelines.
  • ColPali / ColQwen GitHub stars - Vision-language models for document retrieval.
  • LightRAG GitHub stars - Graph-based RAG with dual-level retrieval system. Simple and fast with comprehensive knowledge discovery (EMNLP 2025).
  • RAG-Anything GitHub stars - All-in-One Multimodal RAG system for seamless processing of text, images, tables, and equations. Built on LightRAG.
  • txtai GitHub stars - All-in-one AI framework for semantic search, LLM orchestration and language model workflows. Embeddings database with customizable pipelines.
  • Infinity GitHub stars - High-throughput, low-latency serving engine for text-embeddings, reranking, CLIP, and ColPali. OpenAI-compatible API.
  • FlashRAG GitHub stars - Efficient toolkit for RAG research with 40+ retrieval and reranking models, 20+ benchmark datasets, and optimized evaluation pipelines (WWW 2025 Resource). MIT licensed.
  • DocsGPT GitHub stars - Private AI platform for building intelligent agents and assistants with enterprise search. Features Agent Builder, deep research tools, multi-format document analysis, and multi-model support. MIT licensed.
  • llmware GitHub stars - Unified framework for building enterprise RAG pipelines with small, specialized models. Optimized for AI PC and local deployment with 300+ models in catalog. Apache 2.0 licensed.
  • AutoFlow GitHub stars - Graph RAG-based conversational knowledge base tool built on TiDB Vector and LlamaIndex. Features Perplexity-style search with built-in website crawler. Apache 2.0 licensed.

Knowledge Graphs for RAG

  • Graphiti GitHub stars - Build real-time temporal knowledge graphs for AI agents. Tracks how facts change over time with provenance to source data. Supports prescribed and learned ontology for evolving real-world data. Apache 2.0 licensed.

Web Data Ingestion

  • Crawl4AI GitHub stars - LLM-friendly web crawler that turns websites into clean Markdown for RAG and agentic workflows.
  • Lightpanda GitHub stars - Machine-first headless browser in Zig; rendering-free and ultra-lightweight for AI agent browsing.
  • Paperless-AI GitHub stars - Automated document analyzer for Paperless-ngx with RAG-powered semantic search across your document archive.
  • Firecrawl GitHub stars - Web Data API for AI - search, scrape, and interact with the web at scale. Clean markdown/JSON output with proxy rotation and JS-blocking handled automatically.

🎨 6. Generative Media Tools

Open-source models and applications for image, video, audio, and 3D generation and editing.

Image Generation & Editing

  • ComfyUI GitHub stars - Node-based visual workflow editor for Stable Diffusion, FLUX, etc.
  • Stable Diffusion WebUI Forge - Neo GitHub stars - Actively maintained Forge-based Stable Diffusion web UI with the familiar extension-driven workflow.
  • Fooocus GitHub stars - Midjourney-style UI with beautiful out-of-the-box results.
  • Diffusers GitHub stars - PyTorch library for diffusion pipelines spanning image, video, and audio generation.
  • InvokeAI GitHub stars - Full-featured creative studio.
  • PowerPaint (OpenMMLab) GitHub stars - Versatile image inpainting model supporting text-guided inpainting, object removal, and outpainting (ECCV 2024).
  • SD.Next GitHub stars - All-in-one WebUI for AI generative image and video creation with multi-platform support, SDNQ quantization, and balanced CPU/GPU memory offload.

Video Generation

  • Wan2.2 (Alibaba) GitHub stars - Leading open Mixture-of-Experts text-to-video model.
  • HunyuanVideo (Tencent) GitHub stars - 13B-parameter systematic video generation framework. Leading quality among open models.
  • SkyReels V2/V3 (Skywork) GitHub stars - First open-source infinite-length film generative model using AutoRegressive Diffusion-Forcing.
  • Mochi 1 (Genmo) GitHub stars - 10B-parameter open video model.
  • LTX-Video (Lightricks) GitHub stars - Fast native 4K video generation.
  • Stable Video Diffusion (Stability AI) GitHub stars - Official image-to-video and text-to-video implementation within Stability AI's generative models repository.
  • Latte (Vchitect) GitHub stars - Latent Diffusion Transformer for video generation with state-of-the-art quality (TMLR 2025). Apache 2.0 licensed.
  • Open-Sora-Plan (PKU-YuanGroup) GitHub stars - Reproduction of Sora with full open-source pipeline for text-to-video generation. MIT licensed.
  • Open-Sora (HPC-AI Tech) GitHub stars - Fully open-source video generation with 11B model achieving on-par performance with HunyuanVideo. Complete training pipeline for $200K. Apache 2.0 licensed.
  • Helios (PKU-YuanGroup) GitHub stars - Efficient long-video generation framework with 24GB VRAM support for up to 10,000 frames (5+ minutes) and 1280×768 resolution. Apache 2.0 licensed.

Audio / Music / Voice Generation

  • AudioCraft / MusicGen (Meta) GitHub stars - Controllable text-to-music and audio models.
  • ACE-Step 1.5 GitHub stars - Local-first music generation model with broad hardware support across Mac, AMD, Intel, and CUDA devices.
  • Fish Speech GitHub stars - Zero-shot TTS and voice cloning.
  • CosyVoice 2 GitHub stars - Natural multilingual TTS with emotional control.
  • OuteTTS GitHub stars - High-quality open TTS.
  • Amphion GitHub stars - Comprehensive toolkit for Audio, Music, and Speech Generation (9.7K stars).

3D & Creative Tools

  • Hunyuan3D-2 (Tencent) GitHub stars - State-of-the-art open image-to-3D and text-to-3D.
  • Trellis (Microsoft) GitHub stars - Structured 3D latents for high-quality generation.
  • gsplat (3D Gaussian Splatting tools) GitHub stars - High-performance 3D Gaussian Splatting library.
  • LichtFeld-Studio GitHub stars - Native application for training, editing, and exporting 3D Gaussian Splatting scenes with MCMC optimization and timelapse generation. GPL-3.0 licensed.
  • OpenSplat GitHub stars - Production-grade, portable implementation of 3D Gaussian Splatting with CPU/GPU support for Windows, Mac, and Linux. Creates 3D scenes from camera poses and sparse points. AGPL-3.0 licensed.

🛠️ 7. Training & Fine-tuning Ecosystem

Tools for model training, fine-tuning, synthetic data generation, and distributed training.

Full Training Frameworks

  • LLaMA-Factory GitHub stars - One-stop unified framework for SFT, DPO, ORPO, KTO with web UI.
  • Axolotl GitHub stars - YAML-driven full pipeline for SFT, DPO, GRPO.
  • ms-swift GitHub stars - Unified training framework for 600+ LLMs and 300+ MLLMs with CPT/SFT/DPO/GRPO (AAAI 2025).
  • Unsloth GitHub stars - 2× faster, 70% less memory fine-tuning.
  • LitGPT GitHub stars - Clean from-scratch implementations of 20+ LLMs.
  • LLM Foundry GitHub stars - Databricks' training framework for composable LLM training with StreamingDataset and Composer.
  • torchtune GitHub stars - PyTorch-native library for post-training, fine-tuning, and experimentation with LLMs.
  • kohya_ss GitHub stars - Gradio-based GUI and CLI for training Stable Diffusion models (LoRA, Dreambooth, fine-tuning, SDXL). Provides accessible interface to Kohya's powerful training scripts.
  • TRL (Transformers Reinforcement Learning) GitHub stars - Official library for RLHF, SFT, DPO, ORPO.
  • verl GitHub stars - Volcano Engine Reinforcement Learning for LLMs with PPO, GRPO, REINFORCE++, DAPO (EuroSys 2025).
  • NeMo-RL GitHub stars - Scalable toolkit for efficient model reinforcement with DTensor and Megatron backends.
  • OpenRLHF GitHub stars - Easy-to-use, scalable RLHF framework based on Ray. Supports PPO, GRPO, REINFORCE++, DAPO with vLLM integration and async training. Apache 2.0 licensed.
  • LMFlow GitHub stars - Extensible toolkit for finetuning and inference of large foundation models. Features RAFT alignment algorithm and comprehensive model support. Apache 2.0 licensed.
  • XTuner GitHub stars - A next-generation training engine built for ultra-large MoE models with efficient QLoRA and full-parameter fine-tuning. Apache 2.0 licensed.
  • H2O LLM Studio GitHub stars - No-code GUI framework for fine-tuning LLMs. Streamlined interface for SFT, reward modeling, and model deployment. Apache 2.0 licensed.

LoRA / PEFT Tools

Synthetic Data Generation

  • distilabel GitHub stars - End-to-end pipeline for synthetic instruction data.
  • Data-Juicer GitHub stars - High-performance data processing for LLM training.
  • Argilla GitHub stars - Open-source data labeling + synthetic data platform.
  • SDV (Synthetic Data Vault) GitHub stars - High-fidelity tabular and relational synthetic data.
  • DataTrove (Hugging Face) GitHub stars - Platform-agnostic data processing pipelines for LLM training at scale. Handles filtering, deduplication, and tokenization on local machines or SLURM clusters.
  • Bespoke Curator GitHub stars - Synthetic data curation for post-training and structured data extraction. Makes it easy to build pipelines around LLMs with batching and progress tracking. Apache 2.0 licensed.
  • SDG (Harbin Institute) GitHub stars - Specialized framework for generating high-quality structured tabular synthetic data with CTGAN models supporting billion-level data processing. Apache 2.0 licensed.

Distributed Training

  • DeepSpeed GitHub stars - Extreme-scale training optimizations.
  • Colossal-AI GitHub stars - Unified system for 100B+ models.
  • Megatron-LM GitHub stars - Distributed training framework and reference codebase for large transformer models at scale.
  • Composer GitHub stars - MosaicML's PyTorch library for scalable, efficient neural network training with algorithmic speedups.
  • Ray Train GitHub stars - Scalable distributed training.
  • Nanotron (Hugging Face) GitHub stars - Minimalistic 3D-parallelism LLM pretraining with tensor, pipeline, and data parallelism. Designed for simplicity and speed.
  • veScale (ByteDance) GitHub stars - Hyperscale PyTorch distributed training with flexible FSDP implementation for LLMs and RL training at scale.
  • GPT-NeoX (EleutherAI) GitHub stars - Production-grade distributed training framework for large autoregressive transformers, powering models like GPT-J and GPT-NeoX-20B.
  • RLinf GitHub stars - Scalable open-source RL infrastructure for post-training foundation models via reinforcement learning. Features M2Flow paradigm for embodied AI and agentic workflows with real-world robotics integrations. Apache 2.0 licensed.
  • Streaming (MosaicML) GitHub stars - High-performance data streaming library for efficient neural network training. Streams training data from cloud storage (S3, GCS, Azure) with local caching and deterministic shuffling. Apache 2.0 licensed.

Model Quantization & Optimization

  • LLM Compressor (vLLM) GitHub stars - Transformers-compatible library for applying various compression algorithms to LLMs for optimized deployment with vLLM. Supports GPTQ, AWQ, SmoothQuant, AutoRound, and FP8/INT8 quantization with seamless Hugging Face integration.
  • NVIDIA Model Optimizer GitHub stars - Unified library of SOTA model optimization techniques including quantization, pruning, distillation, and speculative decoding. Compresses deep learning models for deployment with TensorRT-LLM, TensorRT, and vLLM to optimize inference speed across NVIDIA hardware.

📊 8. MLOps / LLMOps & Production

Tooling for tracking, deploying, monitoring, and operating AI systems in production.

Experiment Tracking & Versioning

  • MLflow GitHub stars - End-to-end open platform for the ML/LLM lifecycle.
  • DVC (Data Version Control) GitHub stars - Git-like versioning for data and models.
  • ClearML GitHub stars - Open-source platform for experiment tracking, orchestration, data management, and model serving.
  • Weights & Biases Weave GitHub stars - Open-source tracing and experiment tracking.
  • Aim GitHub stars - Self-hosted ML experiment tracker designed to handle 10,000s of training runs with performant UI and SDK for programmatic access. Apache 2.0 licensed.
  • Feast GitHub stars - Open source feature store for ML. Manages offline/online feature storage with point-in-time correctness to prevent data leakage. Apache 2.0 licensed.

Deployment & Orchestration

  • BentoML GitHub stars - Unified framework to build, ship, and scale AI apps.
  • Ray Serve GitHub stars - Scalable model serving library.
  • ZenML GitHub stars - Pipeline and orchestration framework for taking ML and LLM systems from development to production.
  • Kubeflow GitHub stars - Kubernetes-native ML/LLM platform.
  • KServe GitHub stars - Kubernetes-based model serving.
  • Seldon Core GitHub stars - MLOps and LLMOps framework for deploying, managing and scaling AI systems in Kubernetes. Standardized deployment across model types with autoscaling, multi-model serving, and A/B experiments.
  • Metaflow GitHub stars - Netflix's ML platform for building and managing real-world AI systems. Powers thousands of projects at Netflix, Amazon, and DoorDash. Apache 2.0 licensed.
  • Flyte GitHub stars - Kubernetes-native workflow orchestration platform for AI/ML pipelines. Dynamic, resilient orchestration with strong type safety and reproducibility. Used by Lyft, Spotify, and Gojek. Apache 2.0 licensed.
  • Prefect GitHub stars - Workflow orchestration framework for building resilient data and ML pipelines. Python-native with modern observability and 200+ integrations. Apache 2.0 licensed.
  • Dagster GitHub stars - Cloud-native orchestration platform for developing and maintaining data assets including ML models. Declarative programming model with integrated lineage and observability. Apache 2.0 licensed.
  • Kubeflow Pipelines GitHub stars - Machine Learning Pipelines for Kubeflow. Platform for building and deploying portable, scalable ML workflows using Kubernetes and Argo. Apache 2.0 licensed.
  • MLRun GitHub stars - Open-source AI orchestration platform for quickly building and managing continuous ML and generative AI applications across their lifecycle. Automates data preparation, model tuning, and deployment. Apache 2.0 licensed.

Monitoring, Evaluation & Observability

  • Langfuse GitHub stars - #1 open-source LLM observability platform.
  • Phoenix (Arize) GitHub stars - AI observability & evaluation platform.
  • Evidently GitHub stars - ML & LLM monitoring framework.
  • Deepchecks GitHub stars - Holistic validation and testing suite for ML models and data. Continuous validation from research to production with 50+ built-in checks for data integrity, distribution drift, and model performance.
  • Opik (Comet) GitHub stars - Production-ready LLM evaluation platform.
  • LiteLLM GitHub stars - AI Gateway to call 100+ LLM APIs in OpenAI format with unified cost tracking, guardrails, load balancing, and logging.
  • OpenLIT GitHub stars - OpenTelemetry-native LLM observability platform with GPU monitoring, evaluations, prompt management, and guardrails.
  • OpenLLMetry (Traceloop) GitHub stars - Open-source observability for GenAI/LLM applications based on OpenTelemetry with 25+ integration backends.
  • Agenta GitHub stars - Open-source LLMOps platform combining prompt playground, prompt management, LLM evaluation, and observability.
  • Helicone GitHub stars - Open-source LLM observability with request logging, caching, rate limiting, and cost analytics.
  • Giskard GitHub stars - Open-source evaluation and testing library for LLM agents. Red teaming, vulnerability scanning, RAG evaluation, and safety testing with modular architecture. Apache 2.0 licensed.
  • Portkey Gateway GitHub stars - Blazing fast AI Gateway to route 200+ LLMs with unified API. Integrated guardrails, load balancing, fallbacks, and cost tracking. MIT licensed.
  • TensorZero GitHub stars - Open-source LLMOps platform unifying LLM gateway, observability, evaluation, and experimentation. Production-grade with sub-1ms latency, used by Fortune 10 companies.

Guardrails & Safety Tools

  • NVIDIA NeMo Guardrails GitHub stars - Programmable guardrails toolkit for LLM-based conversational systems. Uses Colang to define dialog flows with input/output rails, jailbreak detection, fact-checking, and hallucination detection. Apache 2.0 licensed.
  • Guardrails AI GitHub stars - Python framework for adding input/output guardrails to LLM applications. Detects and mitigates risks like PII leakage, toxic language, competitor mentions, with 50+ validators in Guardrails Hub. Apache 2.0 licensed.
  • LLM Guard GitHub stars - Comprehensive security toolkit for LLM interactions with input/output scanners for prompt injection, PII anonymization, toxic content, secrets detection, and adversarial attack prevention. MIT licensed.
  • LlamaGuard (Meta) GitHub stars - Open safety classifier models.
  • Garak GitHub stars - LLM vulnerability scanner.
  • Promptfoo GitHub stars - LLM testing and red-teaming framework.

📈 9. Evaluation, Benchmarks & Datasets

Benchmarks, evaluation frameworks, datasets, and supporting tools for model assessment.

Benchmark Suites

  • LiveBench GitHub stars - Contamination-free LLM benchmark with objective ground-truth scoring. ICLR 2025 spotlight paper featuring frequently-updated questions from recent sources. Tests math, coding, reasoning, language, instruction following, and data analysis.
  • lm-evaluation-harness (EleutherAI) GitHub stars - De-facto standard for generative model evaluation.
  • HELM (Stanford) GitHub stars - Holistic Evaluation of Language Models.
  • SWE-bench GitHub stars - Evaluates LLMs on real-world GitHub issues from 15+ Python repositories.
  • GAIA - Real-world multi-step agentic benchmark.
  • OpenCompass GitHub stars - Evaluation platform for benchmarking language and multimodal models across large benchmark suites.
  • MLPerf Inference GitHub stars - Industry-standard ML inference benchmarks with reference implementations for AI accelerators.
  • SWE-rebench (Nebius) - Continuously updated benchmark with 21,000+ real-world SWE tasks for evaluating agentic LLMs. Decontaminated, mined from GitHub.
  • AgentBench (THUDM) GitHub stars - Comprehensive benchmark to evaluate LLMs as agents across 8 diverse environments including household, web shopping, OS interaction, and database tasks. ICLR 2024. Apache 2.0 licensed.

Evaluation Frameworks

  • DeepEval GitHub stars - The "Pytest for LLMs".
  • Inspect AI GitHub stars - Framework for large language model evaluations from the UK AI Security Institute.
  • RAGAs GitHub stars - End-to-end RAG evaluation framework.
  • Lighteval GitHub stars - Evaluation toolkit for LLMs across multiple backends with reusable tasks, metrics, and result tracking.
  • Hugging Face Evaluate GitHub stars - Standardized evaluation metrics.
  • OpenAI Evals GitHub stars - Framework for evaluating LLMs and LLM systems with an open-source registry of 100+ community-contributed benchmarks. MIT licensed.
  • LMMs-Eval GitHub stars - Unified multimodal evaluation toolkit for text, image, video, and audio tasks with 100+ supported benchmarks.

High-quality Open Datasets & Data Tools

  • Hugging Face Datasets GitHub stars - Largest open repository of datasets.
  • Cleanlab GitHub stars - Data-centric AI package for automatically finding and fixing issues in datasets. Detects label errors, outliers, and ambiguous examples in ML datasets. Apache 2.0 licensed.
  • FineWeb / FineWeb-2 (Hugging Face) - Curated 15T+ token web dataset for pre-training.
  • OSWorld GitHub stars - Multimodal agent benchmark dataset.
  • Great Expectations GitHub stars - Always know what to expect from your data. Data validation, profiling, and documentation for data pipelines. Apache 2.0 licensed.

🛡️ 10. AI Safety, Alignment & Interpretability

Tools for alignment, interpretability, safety evaluation, and adversarial testing.

Safety Evaluation Frameworks

  • Inspect AI GitHub stars - Framework for large language model evaluations from the UK AI Safety Institute. Systematic capability and safety assessments with built-in scaffolding for multi-turn dialog, tool use, and adversarial testing. MIT licensed.
  • DeepEval GitHub stars - LLM evaluation framework with built-in safety metrics including hallucination detection, bias detection, toxicity evaluation, and prompt alignment checking. Apache 2.0 licensed.

Alignment & RLHF Tools

  • Safe-RLHF GitHub stars - Safe reinforcement learning from human feedback.
  • Alignment Handbook GitHub stars - Complete recipes for full-stack alignment.
  • OpenRLHF GitHub stars - High-performance distributed RLHF framework.

Interpretability & Explainability

  • interpret (Microsoft) GitHub stars - Fit interpretable models and explain blackbox machine learning with state-of-the-art explainability techniques including Explainable Boosting Machines and SHAP-based explanations.
  • TransformerLens GitHub stars - Gold-standard for mechanistic interpretability.
  • SAELens GitHub stars - Sparse autoencoders for interpretable features.
  • Captum GitHub stars - PyTorch's official interpretability library.
  • SHAP GitHub stars - Game theoretic approach to explain the output of any machine learning model. Industry standard for model interpretability.
  • XAI GitHub stars - eXplainability toolbox for machine learning with bias evaluation and production monitoring tools.
  • EasyEdit GitHub stars - Easy-to-use knowledge editing framework for LLMs. Enables precise modification of model knowledge and behavior to correct hallucinations or outdated information. ACL 2024. MIT licensed.
  • AIX360 GitHub stars - Comprehensive AI explainability toolkit with interpretability algorithms for data and machine learning models. Includes TED, BRCG, and ProtoNN methods for diverse explanation needs. Apache 2.0 licensed.

Fairness & Bias Mitigation

  • AI Fairness 360 GitHub stars - Comprehensive toolkit for detecting, understanding, and mitigating unwanted algorithmic bias in datasets and ML models.

Adversarial & Red-teaming Tools

  • PyRIT (Microsoft) GitHub stars - Python Risk Identification Tool for generative AI. Open-source framework for security professionals to proactively identify risks in generative AI systems through automated red teaming.
  • Garak GitHub stars - Automated LLM vulnerability scanner.
  • Promptfoo GitHub stars - Systematic prompt testing and red-teaming.
  • LLM Guard GitHub stars - Input/output scanner for LLMs.
  • Adversarial Robustness Toolbox GitHub stars - Python library for machine learning security (evasion, poisoning, extraction, inference attacks).
  • DeepTeam GitHub stars - Framework to red team LLMs and LLM systems.
  • Agentic Security GitHub stars - Agentic LLM vulnerability scanner and AI red teaming kit with multi-step attack simulation and automated security probing. Apache 2.0 licensed.

🧩 11. Specialized Domains

Scientific AI & Drug Discovery

  • Boltz GitHub stars - Open-source biomolecular interaction prediction models. Boltz-1 was the first fully open source model to approach AlphaFold3 accuracy; Boltz-2 adds binding affinity prediction for drug discovery. MIT licensed.
  • Protenix GitHub stars - High-accuracy open-source biomolecular structure prediction model from ByteDance. First fully open-source model to outperform AlphaFold3 across diverse benchmarks with Apache 2.0 licensing for both academic and commercial use.
  • OpenFold GitHub stars - Trainable PyTorch reproduction of AlphaFold2. Complete open-source pipeline for protein structure prediction with competitive accuracy to the original. Apache 2.0 licensed.

Medical Imaging & Healthcare AI

  • MONAI GitHub stars - Medical Open Network for AI. End-to-end framework for healthcare imaging with state-of-the-art, production-ready training workflows. Apache 2.0 licensed.
  • nnU-Net GitHub stars - Self-configuring deep learning method for medical image segmentation. Automatically adapts to any dataset without manual parameter tuning. Widely adopted as the standard baseline for biomedical segmentation challenges. Apache 2.0 licensed.

Game AI & Simulations

  • Unity ML-Agents GitHub stars - Toolkit for training intelligent agents in games and simulations using deep reinforcement learning. Enables NPC behavior control, automated testing, and game design evaluation. Apache 2.0 licensed.
  • OpenSpiel GitHub stars - Collection of environments and algorithms for research in general reinforcement learning and search/planning in games from Google DeepMind. Apache 2.0 licensed.

Finance & Quantitative AI

  • OpenBB GitHub stars - Financial data platform for analysts, quants and AI agents. Open-source investment research infrastructure with extensive data integrations. AGPL-3.0 licensed.
  • FinGPT GitHub stars - Open-source financial large language models. Democratizing financial AI with data-centric training pipeline and multiple model releases for trading, analysis, and robo-advising. MIT licensed.
  • FinRL GitHub stars - Financial reinforcement learning framework for quantitative trading. Deep RL library for stock trading, portfolio allocation, and market execution with pre-built environments and benchmarks. MIT licensed.

Computer Vision

  • OpenCV GitHub stars - World's most widely used computer vision library.
  • Ultralytics YOLO GitHub stars - State-of-the-art real-time object detection.
  • Detectron2 GitHub stars - High-performance object detection library.
  • CVAT GitHub stars - Industry-leading data annotation platform for computer vision. Interactive video and image annotation tool used by tens of thousands of teams for machine learning at any scale.
  • SAM 2 GitHub stars - Promptable image and video segmentation model with released checkpoints and training code.
  • Kornia GitHub stars - Differentiable computer vision library.
  • MediaPipe GitHub stars - Cross-platform multimodal pipelines.

Reinforcement Learning & Robotics

  • LeRobot (Hugging Face) GitHub stars - State-of-the-art machine learning framework for real-world robotics. End-to-end learning with models, datasets, and training tools for robotic manipulation tasks.
  • Stable-Baselines3 GitHub stars - Production-ready RL algorithms.
  • Isaac Lab GitHub stars - GPU-accelerated robot learning framework.
  • MuJoCo GitHub stars - General-purpose physics simulator for robotics, biomechanics, and ML research. High-fidelity contact dynamics with native Python and C++ bindings. Apache 2.0 licensed.
  • Gymnasium (ex-OpenAI Gym) GitHub stars - Standard RL environment API.

Time Series & Scientific AI

  • Time Series Library (TSLib) GitHub stars - Comprehensive benchmark for time-series models.
  • Chronos (Amazon) GitHub stars - Pretrained foundation models for time-series forecasting.
  • GluonTS (AWS Labs) GitHub stars - Probabilistic time series modeling with deep learning. Powers Amazon SageMaker forecasting with PyTorch and MXNet backends. Apache 2.0 licensed.
  • Darts GitHub stars - Easy-to-use time-series forecasting library.
  • AutoTS GitHub stars - Automated time series forecasting with broad model selection, ensembling, anomaly detection, and holiday effects. Designed for production deployment with minimal setup.

Edge / On-device AI

Legal AI & Contract Analysis

  • OpenContracts GitHub stars - Self-hosted document annotation platform for legal AI. Semantic search, contract analysis, version control, and MCP integration for building legal knowledge bases. AGPL-3.0 licensed.

🖥️ 12. User Interfaces & Self-hosted Platforms

Local AI Chat UIs & Personal Assistants

  • OpenClaw GitHub stars - Local-first personal AI assistant with multi-channel integrations and full agentic task execution.
  • Open WebUI GitHub stars - Most popular self-hosted ChatGPT-style interface.
  • text-generation-webui GitHub stars - Web UI for running local LLMs with multiple backends, extensions, and model formats.
  • LobeChat GitHub stars - Sleek modern chat UI.
  • LibreChat GitHub stars - Feature-packed multi-LLM interface.
  • HuggingChat (self-hosted) GitHub stars - Official open-source codebase for HuggingChat.
  • Khoj GitHub stars - Self-hostable personal AI assistant for search, chat, automation, and workflows over local and web data.
  • Newelle GitHub stars - GNOME/Linux desktop virtual assistant with integrated file editor, global hotkeys, and profile manager.
  • NextChat GitHub stars - Light and fast AI assistant supporting Web, iOS, macOS, Android, Linux, and Windows. One-click deploy with multi-model support. MIT licensed.
  • big-AGI GitHub stars - AI suite for power users with multi-model "Beam" chats, AI personas, voice, text-to-image, code execution, and PDF import. MIT licensed.
  • Leon GitHub stars - Your open-source personal assistant. Built around tools, context, memory, and agentic execution. Self-hosted, privacy-focused, and extensible. MIT licensed.
  • Willow GitHub stars - Open source, local, and self-hosted Amazon Echo/Google Home competitive voice assistant alternative with hardware support. Apache-2.0 licensed.
  • CoPaw GitHub stars - Your Personal AI Assistant; easy to install, deploy on your own machine or on the cloud; supports multiple chat apps with easily extensible capabilities. Apache-2.0 licensed.

Full Self-hosted AI Platforms

  • AnythingLLM GitHub stars - All-in-one RAG + agents platform.
  • Dify GitHub stars - Complete AI application platform with visual builder.
  • Langflow GitHub stars - Visual low-code platform for LangChain flows.
  • Flowise GitHub stars - Drag-and-drop LLM app builder.
  • LocalAI GitHub stars - Open-source AI engine running LLMs, vision, voice, image, and video models on any hardware. Self-hosted OpenAI-compatible API. MIT licensed.
  • Onyx GitHub stars - Full-featured AI platform with Chat, RAG, Agents, and Actions. 40+ document connectors and every LLM support. MIT licensed (Community Edition).

Desktop & Mobile AI Apps

  • Jan GitHub stars - Local-first AI app framework.
  • Cherry Studio GitHub stars - AI productivity studio with smart chat, autonomous agents, and 300+ assistants. Unified access to frontier LLMs. AGPL-3.0 licensed.
  • DeepChat GitHub stars - A smart assistant that connects powerful AI to your personal world. Built-in MCP and ACP support, multiple search engines, privacy-focused with local data storage. Apache-2.0 licensed.
  • SillyTavern GitHub stars - Highly customizable role-playing frontend.
  • ChatALL GitHub stars - Concurrently chat with multiple AI bots to discover the best answers. Desktop app for comparing ChatGPT, Claude, Gemini, and 20+ LLMs side-by-side. Apache 2.0 licensed.
  • Chatbox GitHub stars - Powerful desktop AI client for ChatGPT, Claude, and other LLMs. Cross-platform with modern UI. GPLv3 licensed (Community Edition).
  • Maid GitHub stars - Free and open-source Android app for interfacing with llama.cpp models locally and remote APIs (Anthropic, DeepSeek, Mistral, Ollama, OpenAI). MIT licensed.
  • Dive GitHub stars - Open-source MCP Host Desktop Application with dual Tauri/Electron architecture. Seamlessly integrates with any LLMs supporting function calling. MIT licensed.

Agent & Voice Infrastructure

  • LiveKit Agents GitHub stars - Framework for building realtime voice AI agents with WebRTC transport, STT-LLM-TTS pipelines, and production-grade orchestration. Used by Salesforce Agentforce and Tesla. Apache-2.0 licensed.
  • Pipecat GitHub stars - Open-source framework for voice and multimodal conversational AI. Build real-time voice agents with support for speech-to-text, LLMs, text-to-speech, and live video. BSD-2-Clause licensed.
  • AVA AI Voice Agent GitHub stars - Open-source AI voice agent for Asterisk/FreePBX telephony systems. Modular pipeline architecture supporting multiple STT, LLM, and TTS providers with Audiosocket/RTP integration. MIT licensed.
  • Agent Chat UI GitHub stars - Web app for interacting with any LangGraph agent (Python & TypeScript) via a chat interface. Stream messages, handle interruptions, and view agent state. MIT licensed.

🧪 13. Developer Tools & Integrations

AI Coding Assistants (open-source)

  • Continue GitHub stars - Open-source AI coding autopilot for VS Code & JetBrains.
  • Tabby GitHub stars - Self-hosted AI coding assistant.
  • Cline GitHub stars - Open-source IDE coding agent that can edit files, run commands, and use tools with user approval.
  • Open Interpreter GitHub stars - Lets LLMs run code locally.
  • Roo Code GitHub stars - Open-source editor-based coding agent with multiple modes and tool integrations.
  • Aider GitHub stars - Terminal-based AI pair programmer.

IDE Plugins & Extensions

  • llama.vim GitHub stars - Local LLM-powered code completion plugin for Vim/Neovim using llama.cpp. Fast, privacy-first, no API key needed.
  • CodeCompanion.nvim GitHub stars - AI-powered coding assistant for Neovim. Inline code generation, chat, actions, and tool use with support for multiple LLM providers.
  • Continue VS Code / JetBrains GitHub stars - Most installed open-source AI extension.
  • Jupyter AI GitHub stars - Chat and code generation inside notebooks.

UI Components & Chat Libraries

  • Assistant UI GitHub stars - React/TypeScript library for building production-grade AI chat interfaces. Drop-in components for streaming messages, tool calls, and multi-modal inputs.
  • Deep Chat GitHub stars - Fully customizable AI chatbot component for your website. Supports OpenAI, direct API services, and custom endpoints. MIT licensed.
  • CopilotKit GitHub stars - Best-in-class SDK for building full-stack agentic applications, Generative UI, and chat applications. Creators of the AG-UI Protocol adopted by Google, LangChain, AWS, and Microsoft. MIT licensed.

CLI Tools & API Clients

  • PR-Agent (Qodo) GitHub stars - AI-powered code review agent for GitHub, GitLab, BitBucket, and Azure DevOps. Automated PR analysis, improvement suggestions, and multi-platform deployment via CLI, GitHub Actions, or webhooks. AGPL-3.0 licensed.

  • Gemini CLI GitHub stars - Google's open-source AI agent for the terminal. Access Gemini models with built-in tool use, MCP support, and 1M token context. Apache 2.0 licensed.

  • LLM (Simon Willison) GitHub stars - CLI tool and Python library for interacting with dozens of LLMs via remote APIs or locally. Extensible plugin ecosystem, SQLite logging. Apache 2.0 licensed.

  • AIChat GitHub stars - All-in-one LLM CLI in Rust featuring Shell Assistant, Chat-REPL, RAG, AI Tools & Agents. Supports 20+ providers. MIT/Apache 2.0 licensed.

  • aicommits GitHub stars - CLI that writes your git commit messages for you with AI. Never write a commit message again. Supports multiple providers including OpenAI, Groq, xAI, Ollama, and LM Studio. MIT licensed.

  • Codex CLI GitHub stars - OpenAI's lightweight coding agent that runs in your terminal. Code generation, file editing, and command execution with approval. Apache 2.0 licensed.

  • Repomix GitHub stars - Powerful tool that packs your entire repository into a single AI-friendly file. Perfect for feeding codebases to LLMs with smart filtering and token counting. MIT licensed.

Testing & Debugging Tools


📚 14. Resources & Learning

Papers with Open Implementations

Communities, Forums & Newsletters

Courses & Interactive Playgrounds

  • Hugging Face Course - Free hands-on courses using only open models.
  • ML For Beginners (Microsoft) GitHub stars - 12-week, 26-lesson, 52-quiz classic machine learning course for beginners. Comprehensive curriculum covering regression, classification, clustering, and NLP with practical projects.
  • LLM Course (Maxime Labonne) GitHub stars - End-to-end course for getting into Large Language Models with roadmaps and Colab notebooks. Covers pre-training, fine-tuning, RLHF, quantization, and prompt engineering.
  • AI For Beginners (Microsoft) GitHub stars - 12-week, 24-lesson curriculum on Artificial Intelligence. Covers symbolic AI, neural networks, computer vision, NLP, and reinforcement learning with hands-on labs.
  • Generative AI for Beginners (Microsoft) GitHub stars - 21 lessons covering generative AI fundamentals, prompt engineering, RAG applications, fine-tuning, and LLM app deployment with practical exercises.
  • Fast.ai GitHub stars - Legendary practical deep learning course.
  • LangChain Academy - Free courses on agents and RAG.
  • Data Science for Beginners (Microsoft) GitHub stars - 10-week, 20-lesson curriculum on data science fundamentals. Covers data preparation, visualization, modeling, and deployment with practical projects.
  • Learn PyTorch for Deep Learning (Zero to Mastery) GitHub stars - Comprehensive PyTorch deep learning course with hundreds of exercises and real-world projects.
  • The Incredible PyTorch GitHub stars - Curated list of PyTorch tutorials, papers, projects, and communities for deep learning researchers.
  • Deep RL Class (Hugging Face) GitHub stars - Free deep reinforcement learning course with hands-on exercises and trained agent publishing to the Hugging Face Hub.
  • Transformers Tutorials (Niels Rogge) GitHub stars - Comprehensive tutorials and demos using the Hugging Face Transformers library for NLP, vision, and multimodal tasks.
  • Made With ML (Goku Mohandas) GitHub stars - End-to-end course on building production-grade ML systems with MLOps fundamentals, from design to deployment and iteration.

Starter Projects & Examples

Curated Resource Lists

  • Awesome Machine Learning GitHub stars - The definitive curated list of machine learning frameworks, libraries and software organized by language. Covers Python, C++, Java, JavaScript, and more with comprehensive coverage of the ML ecosystem. CC0-1.0 licensed.

Contributing

Contributions are highly welcome! Please read the CONTRIBUTING.md for guidelines (quality standards, formatting, license requirements, etc.).

  • Only OSI-approved licenses
  • Projects must be actively maintained (commits in last 6 months)
  • High-quality, well-documented, real adoption

License

This list itself is licensed under CC0 1.0 Universal. Feel free to use it for any purpose.


Made with ❤️ for the open-source AI community. Star the repo if you find it useful - it helps more people discover the best open tools!


About

Curated list of the best truly open-source AI projects, models, tools, and infrastructure.

Topics

Resources

License

Contributing

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages