
SWE-Forge (Python)

Python · License: MIT · HuggingFace Dataset

High-performance SWE-bench dataset generator and evaluation harness that mines real GitHub pull requests, produces evaluation-ready task instances, and benchmarks coding agents.

Built on top of SweInfinite by @unconst, rewritten in Python with:

  • Agentic command discovery (NO hardcoded install commands)
  • Language detection (rule-based, OK to hardcode)
  • Difficulty filtering with LLM classification
  • Docker verification of generated tests
  • Full parallelism with semaphore-based concurrency
  • Structured LLM outputs via OpenAI function calling
  • 200k context auto-compaction with smart summarization

What it does

swe-forge connects to GH Archive to discover recently merged pull requests, enriches them via the GitHub API, classifies their difficulty using an LLM, agentically discovers install/test commands, generates test specifications via an agentic loop, and exports SWE-bench-compatible task instances.

Key Features

| Feature | Description |
| --- | --- |
| 🔍 Real GitHub Data | Mines GH Archive for merged PRs across all public repositories |
| 🎯 Difficulty Filtering | Pre-classifies PRs as easy/medium/hard before expensive processing |
| 🤖 Agentic Discovery | Discovers install/test commands from CI/CD (NO hardcoding) |
| 📦 Docker Verification | Verifies tests in Docker before export |
| Full Parallelism | GH Archive 8x, enrichment 20x, Docker 8x concurrent |
| 🧠 Smart Compaction | 200k context limit with structured summary templates |
| 📊 Complete Export | workspace.yaml + patch.diff + tests/ directory |

Installation

From PyPI

pip install swe-forge

From Source

git clone https://github.com/CortexLM/swe-forge.git
cd swe-forge
pip install -e .

Docker

docker pull ghcr.io/cortexlm/swe-forge:latest

Quick Start

Prerequisites

# Required environment variables
export GITHUB_TOKEN="ghp_..."           # GitHub PAT for PR enrichment
export OPENROUTER_API_KEY="sk-or-v1-..." # OpenRouter API key for LLM

Mine Tasks from GH Archive

# Mine 10 tasks with workspace export
swe-forge mine mine \
  --limit 10 \
  --output ./tasks.jsonl \
  --output-folder ./tasks \
  --docker-username myuser \
  --parallel 8

# Mine with difficulty filter
swe-forge mine mine \
  --limit 5 \
  --difficulty hard \
  --min-stars 100

# Mine specific repository
swe-forge mine mine \
  --repo python/cpython \
  --limit 3

Complete Mining with Docker Verification

# Full A-Z pipeline with test verification
swe-forge mine complete \
  --repo owner/repo \
  --pr 12345 \
  --output ./tasks.jsonl \
  --model openai/gpt-5.4

Output Structure

Directory Format (when using --output-folder)

tasks/
├── owner-repo-1234/
│   ├── workspace.yaml      # Complete task configuration
│   ├── patch.diff          # PR patch to apply
│   ├── test_patch.diff     # Test file changes
│   └── tests/              # Extracted test files
│       ├── test_feature.py
│       └── test_another.py
└── owner-repo-5678/
    └── ...
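
A consumer of this layout might want to sanity-check each task directory before use. The sketch below validates the structure shown above; the required-file list mirrors the tree, but the check itself is an illustrative assumption, not part of swe-forge:

```python
# Sketch: validate the on-disk layout produced by --output-folder.
# The required files mirror the directory tree shown above.
from pathlib import Path

REQUIRED = ["workspace.yaml", "patch.diff", "test_patch.diff"]

def check_task_dir(task_dir: Path) -> list[str]:
    """Return the names of required entries missing from a task directory."""
    missing = [name for name in REQUIRED if not (task_dir / name).is_file()]
    if not (task_dir / "tests").is_dir():
        missing.append("tests/")
    return missing
```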

workspace.yaml Format

task_id: owner-repo-1234
repo:
  url: https://github.com/owner/repo.git
  base_commit: abc123def456...
  merge_commit: fed456abc123...
language: python
difficulty_score: 5
prompt: "Fix the bug in..."
environment:
  image: myuser/swe-forge-tasks:owner-repo-1234
  language_version: "3.12"
install:
  commands:
    - pip install -e .
    - pip install pytest
tests:
  fail_to_pass:
    - pytest tests/test_feature.py -v
    - pytest tests/test_another.py::test_case -v
  pass_to_pass:
    - pytest tests/ -v --ignore=tests/test_feature.py
docker:
  image: myuser/swe-forge-tasks:owner-repo-1234
  build: true

CLI Reference

swe-forge mine mine - Mine from GH Archive

swe-forge mine mine [OPTIONS]
| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --repo | -r | All | Target repository (owner/repo format) |
| --limit | -l | 10 | Maximum tasks to mine |
| --output | -o | ./tasks.jsonl | Output JSONL file |
| --output-folder | -O | None | Output folder for workspace format |
| --docker-username | -D | None | Docker Hub username for image names |
| --parallel | -p | 8 | Concurrent Docker containers |
| --difficulty | -d | All | Filter: easy, medium, hard |
| --model | -m | moonshotai/kimi-k2.5 | LLM model for classification |
| --min-stars | | 100 | Minimum repository stars |
| --language | | python | Filter by language |
| --filter | -f | {"easy":10,"medium":10,"hard":10} | JSON max tasks per difficulty |
| --verbose | -v | False | Enable verbose logging |

swe-forge mine complete - Full Pipeline with Verification

swe-forge mine complete [OPTIONS]
| Option | Short | Default | Description |
| --- | --- | --- | --- |
| --repo | -r | Required | Target repository (owner/repo) |
| --pr | -p | Required | Pull request number |
| --output | -o | ./tasks.jsonl | Output file |
| --model | -m | openai/gpt-5.4 | LLM model |
| --verbose | -v | False | Verbose logging |

Architecture

Pipeline Flow

```mermaid
sequenceDiagram
    participant GHA as GH Archive
    participant SF as swe-forge
    participant GH as GitHub API
    participant LLM as LLM
    participant D as Docker

    GHA->>SF: Merged PR events (8x concurrent)
    SF->>SF: Pre-filter (bots, org, stars)
    SF->>GH: Enrich candidates (20x concurrent)
    GH-->>SF: PR metadata + diff
    SF->>LLM: Classify difficulty
    LLM-->>SF: easy / medium / hard
    SF->>D: Agentic discovery (8x concurrent)
    D-->>SF: fail_to_pass + pass_to_pass
    SF->>LLM: Quality scoring
    LLM-->>SF: Accept / reject
    SF-->>SF: Export workspace.yaml
```

Parallelism Configuration

| Stage | Semaphore | Default | Description |
| --- | --- | --- | --- |
| GH Archive Fetch | gh_archive_sem | 8 | Download hourly dumps |
| GitHub Enrichment | enrichment_sem | 20 | Fetch PR metadata (5000/h rate limit) |
| Pre-classification | preclassify_sem | 25 | LLM triage on title+body |
| Deep Processing | deep_sem | 8 | Full pipeline per candidate |
| Docker Containers | docker_sem | 8 | Concurrent test verification |
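
The per-stage limits above can be implemented with one `asyncio.Semaphore` per stage. A minimal sketch of the pattern, with a stand-in worker (`enrich` and its body are placeholders, not swe-forge API):

```python
# Sketch: semaphore-bounded concurrency, one semaphore per pipeline stage.
import asyncio

async def bounded_map(items, worker, limit: int):
    """Run worker(item) for each item, at most `limit` at a time."""
    sem = asyncio.Semaphore(limit)

    async def run(item):
        async with sem:
            return await worker(item)

    # gather preserves input order in its results
    return await asyncio.gather(*(run(i) for i in items))

async def demo():
    async def enrich(pr_number):
        await asyncio.sleep(0)  # stand-in for a GitHub API call
        return pr_number * 2

    # Mirrors the enrichment stage's limit of 20
    return await bounded_map(range(5), enrich, limit=20)

print(asyncio.run(demo()))  # [0, 2, 4, 6, 8]
```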

Agentic Command Discovery

IMPORTANT: Commands are NEVER hardcoded.

```mermaid
sequenceDiagram
    participant AD as Agent Discovery
    participant CI as CI/CD Config
    participant LLM as LLM
    participant SH as Shell (Docker)

    AD->>CI: Parse .github/workflows/, .gitlab-ci.yml
    CI-->>AD: Install patterns, test commands

    AD->>SH: Clone repo in Docker
    AD->>LLM: "Discover how to install and test"

    loop Up to 200 turns
        LLM->>SH: shell("pip install -e .")
        SH-->>LLM: exit_code=0
        LLM->>SH: shell("pytest tests/")
        SH-->>LLM: exit_code=0, output
    end

    LLM->>AD: submit_tests(fail_to_pass, pass_to_pass)
```

What Happens in Docker

  1. Clone repository at base commit
  2. Detect language from files (package.json, pyproject.toml, Cargo.toml, etc.)
  3. Discover commands by:
    • Parsing CI/CD workflows
    • Reading package manager configs
    • Trying commands and checking exit codes
  4. Generate tests via LLM agentic loop
  5. Verify tests fail before patch (proves bug exists)
  6. Apply patch
  7. Verify tests pass after patch (proves fix works)
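
Steps 5–7 amount to a fail-before/pass-after check. A minimal sketch of that logic, with shell execution abstracted away (`run` and `apply_patch` are hypothetical callables, not swe-forge's actual interface):

```python
# Sketch: fail-before / pass-after verification (steps 5-7 above).
def verify_task(run, fail_to_pass, apply_patch) -> bool:
    """run(cmd) -> exit code. True iff every fail_to_pass test fails
    before the patch and passes after it."""
    if any(run(cmd) == 0 for cmd in fail_to_pass):
        return False  # a test already passes: the bug is not demonstrated
    apply_patch()
    return all(run(cmd) == 0 for cmd in fail_to_pass)
```

For example, a stubbed `run` that starts failing and flips to passing once the patch is applied would make `verify_task` return `True`.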

Difficulty Classification

| Level | Score Range | Typical Changes | Examples |
| --- | --- | --- | --- |
| Easy | 0.1 – 0.35 | Typos, config, single-file | Fix import, update version |
| Medium | 0.4 – 0.65 | Bug fixes, features, APIs | Fix race condition, add endpoint |
| Hard | 0.7 – 1.0 | Cross-cutting, architectural | New subsystem, migration |

Classification Models

  • Pre-classification: moonshotai/kimi-k2.5 (fast triage on title+body)
  • Full classification: Uses complete diff and test spec

Auto-Compaction (200k Context)

When context exceeds 200k tokens, the system uses structured summarization:

## Goal
[What goal(s) is the user trying to accomplish?]

## Instructions
- [What important instructions did the user give you]
- [If there is a plan or spec, include information about it]

## Discoveries
[What notable things were learned during this conversation]

## Accomplished
[What work has been completed, in progress, and left?]

## Relevant files / directories
[Structured list of relevant files]

This preserves critical context across long agentic sessions.
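
The trigger logic can be sketched as follows. The 4-characters-per-token estimate and the "keep the last five turns" cutoff are illustrative assumptions; swe-forge fills the structured template above via an LLM, which `summarize` stands in for here:

```python
# Sketch: trigger compaction once the estimated context exceeds the limit.
TOKEN_LIMIT = 200_000

def estimate_tokens(messages) -> int:
    # Crude heuristic: ~4 characters per token (an assumption).
    return sum(len(m) for m in messages) // 4

def maybe_compact(messages, summarize):
    """Replace older turns with one structured summary once over the limit."""
    if estimate_tokens(messages) <= TOKEN_LIMIT:
        return messages
    # Keep the most recent turns verbatim; summarize the rest.
    return [summarize(messages[:-5]), *messages[-5:]]
```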


Configuration

Environment Variables

| Variable | Required | Description |
| --- | --- | --- |
| GITHUB_TOKEN | Yes | GitHub PAT for PR enrichment |
| OPENROUTER_API_KEY | Yes | OpenRouter API key for LLM calls |
| HF_TOKEN | No | HuggingFace token for dataset upload |
| RUST_LOG | No | Log level: debug, info, warn, error |

Supported Languages

| Language | Detection | Package Managers |
| --- | --- | --- |
| Python | pyproject.toml, setup.py, requirements.txt | pip, poetry, uv |
| JavaScript/TypeScript | package.json | npm, yarn, pnpm |
| Rust | Cargo.toml | cargo |
| Go | go.mod | go mod |
| Java | pom.xml, build.gradle | maven, gradle |

Development

Setup

# Clone and install dev dependencies
git clone https://github.com/CortexLM/swe-forge.git
cd swe-forge
pip install -e ".[dev]"

# Install pre-commit hooks
pre-commit install

Testing

# Run all tests
pytest tests/ -v

# Run specific test module
pytest tests/test_swe/test_pipeline.py -v

# Run with coverage
pytest tests/ --cov=src/swe_forge --cov-report=html

Code Quality

# Format
ruff format src/

# Lint
ruff check src/

# Type check
pyright src/

Benchmark Results

Benchmark run with 100 candidate PRs from GH Archive:

Pipeline Funnel

| Stage | Count | % of previous stage |
| --- | --- | --- |
| Raw GH Archive events (12h) | 1,752,426 | 100% |
| Merged PR events | 35,498 | 2.03% |
| After pre-filter | 1,394 | 3.93% |
| Enriched successfully | 21 | 1.51% |
| Tests generated | 11 | 52.38% |
| Quality passed | 8 | 72.73% |

Throughput

| Metric | Value |
| --- | --- |
| Tasks per hour | 8 |
| Avg time per task | 450s |
| Docker parallelism | 8 containers |

API Reference

Python API

from swe_forge.swe.pipeline import SwePipeline, SwePipelineConfig
from swe_forge.export.workspace import export_tasks_to_workspace

# Configure pipeline
config = SwePipelineConfig(
    max_candidates=50,
    max_tasks=10,
    min_stars=100,
    languages=["python"],
)

# Run pipeline
async with SwePipeline(config) as pipeline:
    result = await pipeline.run()
    
    # Export to workspace format
    export_tasks_to_workspace(
        result.tasks,
        output_folder="./tasks",
        docker_username="myuser"
    )

SweTask Model

from dataclasses import dataclass

from swe_forge.swe.models import SweTaskStatus

@dataclass
class SweTask:
    id: str
    repo: str                    # owner/repo format
    base_commit: str             # Git SHA
    merge_commit: str            # Git SHA
    language: str                 # python, rust, etc.
    difficulty_score: int         # 1-10
    patch: str                    # Unified diff
    test_patch: str               # Test file changes
    fail_to_pass: list[str]       # Test commands
    pass_to_pass: list[str]       # Test commands
    install_config: dict          # Discovered install commands
    prompt: str                   # Task description
    quality_score: float          # 0.0-1.0
    status: SweTaskStatus         # candidate, validated, etc.


Testing Tasks

Test Tasks from HuggingFace Dataset

The published dataset CortexLM/swe-forge on HuggingFace contains task instances with pre-built Docker images.

Prerequisites

  • Docker installed and running
  • pip install datasets

CLI Usage

# Test a specific task by ID
python scripts/test_task.py --task-id pydantic-pydantic-12985

# Test 5 random tasks
python scripts/test_task.py --random 5

# Test all tasks and save results
python scripts/test_task.py --all --output results.json

# With verbose output
python scripts/test_task.py --task-id pydantic-pydantic-12985 -v

Or use the shell wrapper:

./scripts/test_task.sh --random 5

Docker Sandbox

Each task is tested in an isolated Docker container:

  1. Pull Docker image - Contains repo at base_commit
  2. Run fail_to_pass tests - Should all PASS
  3. Run pass_to_pass tests - Should all PASS

Docker Image Contents

Pre-built Docker images (platformnetwork/swe-forge:*) contain:

  • /workspace/patch.diff - The patch
  • /workspace/run_tests.sh - Test script
  • Repository cloned at base_commit

Dataset Fields

| Field | Description |
| --- | --- |
| instance_id | Task ID (format: owner-repo-123) |
| docker_image | Pre-built Docker image |
| fail_to_pass | Tests that must pass after patch |
| pass_to_pass | Tests that must stay passing |
| patch | Unified diff to apply |

Benchmark Harness

SWE-Forge provides a Docker-based evaluation harness for benchmarking model-generated patches, similar to SWE-bench.

Installation

pip install datasets  # For HuggingFace dataset loading

Quick Start

# Evaluate gold patches (ground truth) on a specific task
python3 scripts/run_evaluation.py --predictions_path gold --instance_ids pydantic-pydantic-12985

# Evaluate on 5 random tasks
python3 scripts/run_evaluation.py --predictions_path gold --random 5

# Evaluate all tasks
python3 scripts/run_evaluation.py --predictions_path gold --max_workers 8

Prediction Format

Create a JSONL file with model predictions:

{"instance_id": "pydantic-pydantic-12985", "model_patch": "diff --git a/..."}
{"instance_id": "owner-repo-123", "model_patch": "..."}

Then evaluate:

python3 scripts/run_evaluation.py --predictions_path predictions.jsonl --max_workers 4
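
Producing that JSONL file needs only the stdlib. A sketch (the instance IDs and patch text in the usage example are placeholders):

```python
# Sketch: write model predictions in the JSONL format expected by the harness.
import json

def write_predictions(path, predictions):
    """predictions: iterable of (instance_id, model_patch) pairs."""
    with open(path, "w", encoding="utf-8") as f:
        for instance_id, patch in predictions:
            f.write(json.dumps({"instance_id": instance_id,
                                "model_patch": patch}) + "\n")
```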

Evaluation Flow

For each task, the harness:

  1. Pull Docker image - Contains repo at base commit
  2. Run fail_to_pass tests BEFORE patch - Should FAIL (bug exists)
  3. Apply model patch
  4. Run fail_to_pass tests AFTER patch - Should PASS (bug fixed)
  5. Run pass_to_pass tests - Should PASS (no regression)
  6. Grade - Resolved if all tests pass as expected
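
The grading rule in step 6 can be expressed compactly: an instance is resolved only if every fail_to_pass test flips from failing to passing and no pass_to_pass test regresses. A sketch (the dict-based result shape is an assumption for illustration):

```python
# Sketch of step 6's grading rule.
def grade(f2p_before, f2p_after, p2p_after) -> bool:
    """Each argument maps test command -> bool (True = passed)."""
    bug_shown = not any(f2p_before.values())   # all fail_to_pass failed pre-patch
    bug_fixed = all(f2p_after.values())        # all fail_to_pass pass post-patch
    no_regression = all(p2p_after.values())    # pass_to_pass still pass
    return bug_shown and bug_fixed and no_regression
```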

Parameters

| Parameter | Description |
| --- | --- |
| --predictions_path | Path to JSONL or "gold" for ground truth |
| --max_workers | Parallel workers (default: 4) |
| --instance_ids | Specific instances to evaluate |
| --random N | Evaluate N random instances |
| --timeout | Timeout per instance (default: 600s) |
| --run_id | Run identifier |
| --output_dir | Output directory |
| --clean | Cleanup Docker after evaluation |

Output

Results are saved to evaluation_results/{run_id}/:

  • results.json - Overall metrics
  • instance_results.jsonl - Detailed per-instance results

Metrics

  • Resolution Rate: Percentage of patches that fixed the issue
  • Tests Passed/Failed: Test execution results
  • Duration: Evaluation time
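
The resolution rate metric is a simple ratio over per-instance results. A sketch, assuming each record in `instance_results.jsonl` carries a boolean `resolved` field (the exact result schema is an assumption):

```python
# Sketch: resolution rate = resolved instances / total instances.
def resolution_rate(results) -> float:
    """results: list of dicts with a boolean 'resolved' field."""
    if not results:
        return 0.0
    return sum(r["resolved"] for r in results) / len(results)
```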

Credits

Built on top of SweInfinite by @unconst.

Extended with:

  • Python rewrite with full async support
  • Agentic command discovery (NO hardcoding)
  • Docker verification of generated tests
  • Structured workspace export
  • 200k context auto-compaction
  • Configurable parallelism

License

MIT — see LICENSE.


Quality Control Pipeline

SWE-Forge includes a comprehensive quality control pipeline to ensure tasks are valid and appropriately challenging.

Overview

Task Generation
      ↓
┌─────────────────────────────────┐
│ 1. Complexity Evaluation        │
│    LLM assesses task difficulty │
│    Score: 0.0 (trivial) to 1.0  │
│    Reject if < 0.25             │
└─────────────────────────────────┘
      ↓
┌─────────────────────────────────┐
│ 2. Docker Verification          │
│    Tests FAIL before patch      │
│    Apply patch                  │
│    Tests PASS after patch       │
│    Reject if tests don't work   │
└─────────────────────────────────┘
      ↓
   Accept Task

Complexity Scoring

The complexity evaluator uses an LLM agent to analyze:

| Factor | Impact |
| --- | --- |
| Lines changed | More lines → higher score |
| Files modified | More files → higher score |
| Logic complexity | Complex logic → higher score |
| Context needed | More context → higher score |
| Change type | Config/docs → lower score |

Scoring thresholds:

| Score | Difficulty | Action |
| --- | --- | --- |
| 0.0-0.25 | Trivial | REJECTED |
| 0.25-0.40 | Easy | ✅ Accepted |
| 0.40-0.65 | Medium | ✅ Accepted |
| 0.65-1.00 | Hard | ✅ Accepted |
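
The thresholds above map directly to a bucketing function. A sketch; which side of each boundary is inclusive is an assumption:

```python
# Sketch: map a complexity score to the threshold table's buckets.
def classify(score: float) -> str:
    if score < 0.25:
        return "rejected"
    if score < 0.40:
        return "easy"
    if score < 0.65:
        return "medium"
    return "hard"
```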

Docker Verification

Each task is verified in an isolated Docker container:

  1. Before patch: Tests MUST FAIL (proves bug exists)
  2. Apply patch: git apply /workspace/patch.diff
  3. After patch: Tests MUST PASS (proves fix works)
  4. Regression tests: pass_to_pass tests must stay passing

CLI Options

# Mining with quality control (default)
swe-forge mine mine --limit 100

# Adjust minimum complexity
swe-forge mine mine --min-complexity 0.30

# Skip Docker verification (faster, less reliable)
swe-forge mine mine --no-verify

# Skip complexity check (faster, accepts trivial tasks)
swe-forge mine mine --skip-complexity

# Use different model for evaluation
swe-forge mine mine --complexity-model openai/gpt-4

Revalidation Script

Revalidate existing tasks to filter out invalid ones:

# Revalidate all tasks
python scripts/revalidate_tasks.py --tasks-dir ./tasks

# Skip Docker verification (complexity only)
python scripts/revalidate_tasks.py --tasks-dir ./tasks --no-verification

# Limit to N tasks
python scripts/revalidate_tasks.py --tasks-dir ./tasks --limit 10

# Custom threshold
python scripts/revalidate_tasks.py --tasks-dir ./tasks --min-complexity 0.30

# Output report
python scripts/revalidate_tasks.py --tasks-dir ./tasks --report report.json

Expected Results

For a typical mining run:

| Metric | Typical Value |
| --- | --- |
| Tasks generated | 100% |
| Rejected (complexity) | ~20% |
| Rejected (verification) | ~20% |
| Accepted | ~60% |

An acceptance rate in the 30-70% range is normal and indicates the quality filters are doing their job.

Dataset Fields

When tasks are exported to HuggingFace, quality fields are included:

| Field | Description |
| --- | --- |
| complexity_score | 0.0-1.0 complexity rating |
| complexity_difficulty | "easy", "medium", or "hard" |
| verified | True if Docker verification passed |

Filter on HF:

from datasets import load_dataset

ds = load_dataset("CortexLM/swe-forge")
# Only medium+ difficulty, verified tasks
filtered = ds.filter(lambda x: x['complexity_score'] >= 0.4 and x['verified'])