A curated list of awesome projects in the Harbor ecosystem.
- terminal-bench-2 - Measures agent ability to complete tasks in a terminal
- terminal-bench-pro - Extension of terminal-bench by Alibaba
- skillsbench - Measures agent ability to use skills
- otel-bench - Measures agent ability to instrument code with OpenTelemetry across multiple languages
- CompileBench - Measures agent ability to build a working binary from source
- harbor-datasets - Popular benchmarks (e.g. SWE-bench Verified) ported to run in Harbor
- RuneBench - Measures agent ability to play RuneScape and complete tasks via a TypeScript SDK
- legacy-bench - Evaluates agents on maintaining, debugging, and modernizing legacy code in COBOL, Java 7, Fortran, C, and Assembly
- SWE-Atlas - Evaluates agents on professional SWE tasks including codebase comprehension and test writing
- SWE-gen-Java - 1000 JVM tasks generated from 16 open-source GitHub repos using SWE-gen
- SWE-gen-JS - 1000 JS/TS tasks generated from 30 open-source GitHub repos using SWE-gen
- SWE-gen-Rust - 1000 Rust SWE tasks generated using SWE-gen
- SWE-gen-Go - 1000 Go SWE tasks generated using SWE-gen
- SWE-gen-Cpp - 1000 C++ SWE tasks generated using SWE-gen
- Nemotron-Terminal-Synthetic-Tasks - Synthetic terminal tasks by NVIDIA
- seta-env - Scaling Environments for Terminal Agents: fully automated Harbor task synthesis and verification
- OpenThoughts-Agent - Generates Harbor tasks, distills trajectories with SFT, and trains with SkyRL
- endless-terminals - Procedurally generates terminal-use tasks and trains terminal agents with SkyRL
- Ares - Framework for online RL training of LLM agents, built on Harbor and SkyRL
- SkyRL Harbor Integration - Guide for RL training of agents with SkyRL and Harbor
- harbor-bot - GitHub bot automating QA on Harbor tasks
- Benchmark Template - Template for building benchmarks on Harbor with automated QA in CI
- SWE-gen - Converts GitHub PRs into Harbor tasks
- Oddish - Eval scheduler for running Harbor tasks with provider-aware queuing and automatic retries
- TerminalBenchTaskGenerator - Desktop app for chat-driven authoring of Harbor benchmark tasks
- AutoAgent - Autonomous agent harness engineering driven by benchmark scores
- Meta-Harness - Autonomous improvement of harness code using previous iterations and Harbor evaluations
Contributions welcome! Open a PR to add a project you have created or love using.