feat(swe): fix harness for Docker agent execution and add test results #12

Merged
alpha1122x merged 2 commits into main from feat/swe-harness-fixes-agent-tests on Feb 17, 2026
@alpha1122x
Contributor

Summary

Fix SWE harness Docker execution issues and add baseagent-echo test results across multiple difficulty levels.

Changes

Harness fixes (src/swe/harness.rs)

  • Add DOCKER_AGENT_DIR env var support for overriding agent directory path in containerized environments
  • Pass OPENROUTER_API_KEY environment variable into Docker containers for LLM-based agents
  • Install python3, python3-pip, and python3-venv alongside system tools in container setup, with a fallback symlink for python
  • Increase system tools install timeout from 120s to 180s
  • Add --break-system-packages flag to pip install for compatibility with externally-managed Python environments
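The first two items in the list above (the DOCKER_AGENT_DIR override and the OPENROUTER_API_KEY passthrough) could look roughly like the following. This is a minimal sketch, not the code in src/swe/harness.rs: the function names `resolve_agent_dir` and `api_key_env_args`, the default path, and the choice to treat empty values as unset are all assumptions made for illustration.

```rust
use std::path::PathBuf;

/// Resolve the agent directory, preferring an override when one is set.
/// In a harness, `override_dir` would typically come from
/// `std::env::var("DOCKER_AGENT_DIR").ok()`. (Hypothetical helper.)
fn resolve_agent_dir(override_dir: Option<String>, default_dir: &str) -> PathBuf {
    match override_dir {
        // Assumption: a blank value is treated the same as an unset variable.
        Some(dir) if !dir.trim().is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(default_dir),
    }
}

/// Build the `-e NAME=value` arguments for a `docker run` or `docker exec`
/// invocation. The key is forwarded only when present, so the container
/// never receives an empty OPENROUTER_API_KEY. (Hypothetical helper.)
fn api_key_env_args(api_key: Option<String>) -> Vec<String> {
    match api_key {
        Some(key) if !key.is_empty() => {
            vec!["-e".to_string(), format!("OPENROUTER_API_KEY={}", key)]
        }
        _ => Vec::new(),
    }
}
```

Taking the values as `Option<String>` rather than reading the environment inside the functions keeps both helpers easy to test; the caller would pass `std::env::var("DOCKER_AGENT_DIR").ok()` and `std::env::var("OPENROUTER_API_KEY").ok()`.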

Agent test results (agent-tests/)

  • Add test results for 9 tasks across easy, medium, and hard difficulties
  • Include per-task JSON results and execution logs
  • Add summary.json with aggregated test outcomes
  • Add README.md documenting test methodology and results

Other

  • Add baseagent-echo submodule reference
  • Switch the linker in .cargo/config.toml from clang+mold to cc for broader build compatibility

Test the baseagent-echo agent (echobt/baseagent-echo) against 9 SWE-bench
tasks from the validated-dataset (3 easy, 3 medium, 3 hard) using the
swe-forge Docker harness. Results: both tasks that passed setup and sanity
checks were resolved (2/2, a 100% effective resolution rate); the other 7
tasks could not be evaluated.

Harness fixes in src/swe/harness.rs:
- Pass OPENROUTER_API_KEY env var into Docker containers so the agent
  can authenticate with OpenRouter for LLM calls
- Install python3/pip/venv in all containers (not just Python-based ones)
  since the agent itself requires Python regardless of task language
- Add --break-system-packages flag to pip install for requirements.txt
  to work on Debian-based images without virtualenv conflicts
- Support DOCKER_AGENT_DIR env var override for agent directory path
  resolution in nested Docker environments
- Increase system tools install timeout from 120s to 180s
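The Python install, pip flag, and timeout changes above might be assembled along these lines. This is a sketch under stated assumptions: it assumes a Debian-based image with apt-get, and the helper names, the `/usr/local/bin/python` symlink target, and the example base tools are illustrative rather than taken from the harness.

```rust
use std::time::Duration;

/// Timeout for the system tools install step (120s before this change).
const SYSTEM_TOOLS_INSTALL_TIMEOUT: Duration = Duration::from_secs(180);

/// Build a shell command that installs the base tools plus Python.
/// Assumption: the container image is Debian-based and provides apt-get.
fn system_tools_install_cmd(base_tools: &[&str]) -> String {
    let mut packages: Vec<&str> = base_tools.to_vec();
    // Python is installed for every task, since the agent itself needs it.
    packages.extend_from_slice(&["python3", "python3-pip", "python3-venv"]);
    format!(
        "apt-get update && apt-get install -y {} && \
         (command -v python >/dev/null 2>&1 || \
         ln -sf \"$(command -v python3)\" /usr/local/bin/python)",
        packages.join(" ")
    )
}

/// Build the pip command for the agent's requirements file.
/// `--break-system-packages` lets pip install into an
/// externally-managed system Python without a virtualenv.
fn pip_install_cmd(requirements: &str) -> String {
    format!(
        "python3 -m pip install --break-system-packages -r {}",
        requirements
    )
}
```

The fallback symlink is created only when no `python` executable is already on the PATH, so images that ship their own `python` are left untouched.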

Build config change in .cargo/config.toml:
- Switch linker from clang+mold to cc for broader build compatibility

Test results in agent-tests/:
- Per-task JSON results organized by difficulty (easy/medium/hard)
- Execution logs for resolved tasks (batocera.linux-15418, happier-35)
- summary.json with aggregate metrics
- README.md documenting methodology and findings
- 5 tasks failed sanity checks (dataset issues), 2 had setup errors
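The "effective resolution rate" follows from these counts: of the 9 tasks, 5 failed sanity checks and 2 had setup errors, leaving 2 evaluable tasks, both resolved. A small sketch of that arithmetic is below; the `Outcome` enum and function names are hypothetical and do not reflect the schema of summary.json.

```rust
/// Outcome of one task run. Variant names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Resolved,
    Unresolved,
    SanityCheckFailed,
    SetupError,
}

/// Returns (resolved, evaluable). A task counts as evaluable only if it
/// got past setup and its sanity checks, i.e. the agent's patch was scored.
fn effective_resolution(outcomes: &[Outcome]) -> (usize, usize) {
    let resolved = outcomes
        .iter()
        .filter(|o| **o == Outcome::Resolved)
        .count();
    let evaluable = outcomes
        .iter()
        .filter(|o| matches!(**o, Outcome::Resolved | Outcome::Unresolved))
        .count();
    (resolved, evaluable)
}

/// Percentage of resolved tasks, or None when the denominator is zero.
fn rate_percent(resolved: usize, total: usize) -> Option<f64> {
    if total == 0 {
        None
    } else {
        Some(100.0 * resolved as f64 / total as f64)
    }
}
```

With the counts reported here, the effective rate is 2/2 = 100%, while the same 2 resolutions measured against all 9 attempted tasks would be about 22%.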

Added baseagent-echo as embedded git repo for local testing.
@alpha1122x merged commit ef19b21 into main on Feb 17, 2026
9 checks passed
@alpha1122x deleted the feat/swe-harness-fixes-agent-tests branch on February 17, 2026 at 15:36
echobt added a commit that referenced this pull request Apr 8, 2026

Co-authored-by: echobt <mathis.massimino+echo@cortex.foundation>
2 participants