feat(swe): fix harness for Docker agent execution and add test results #12
Merged
alpha1122x merged 2 commits into main on Feb 17, 2026
Test the baseagent-echo agent (echobt/baseagent-echo) against 9 SWE-bench tasks from the validated-dataset (3 easy, 3 medium, 3 hard) using the swe-forge Docker harness.

Results: 2/2 resolved on tasks with valid sanity checks (100% effective resolution rate).

Harness fixes in src/swe/harness.rs:
- Pass OPENROUTER_API_KEY env var into Docker containers so the agent can authenticate with OpenRouter for LLM calls
- Install python3/pip/venv in all containers (not just Python-based ones) since the agent itself requires Python regardless of task language
- Add --break-system-packages flag to pip install for requirements.txt to work on Debian-based images without virtualenv conflicts
- Support DOCKER_AGENT_DIR env var override for agent directory path resolution in nested Docker environments
- Increase system tools install timeout from 120s to 180s

Build config change in .cargo/config.toml:
- Switch linker from clang+mold to cc for broader build compatibility

Test results in agent-tests/:
- Per-task JSON results organized by difficulty (easy/medium/hard)
- Execution logs for resolved tasks (batocera.linux-15418, happier-35)
- summary.json with aggregate metrics
- README.md documenting methodology and findings
- 5 tasks failed sanity checks (dataset issues), 2 had setup errors

Added baseagent-echo as embedded git repo for local testing.
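The harness fixes listed in the commit message could be sketched roughly as follows. This is a minimal illustration, not the actual code in src/swe/harness.rs: the function names, the default agent path, the requirements path, and the assumption that the harness shells out to the `docker` CLI and to `apt-get` are all hypothetical. Only the env var names, the package list, the pip flag, and the 180s timeout come from the commit message.

```rust
use std::path::PathBuf;
use std::time::Duration;

/// Timeout for installing system tools (the commit raises it from 120s to 180s).
pub const SYSTEM_TOOLS_TIMEOUT: Duration = Duration::from_secs(180);

/// Pick the agent directory. A non-empty DOCKER_AGENT_DIR value wins;
/// otherwise fall back to `default_dir` (hypothetical parameter).
pub fn resolve_agent_dir(override_dir: Option<&str>, default_dir: &str) -> PathBuf {
    match override_dir.map(str::trim) {
        Some(dir) if !dir.is_empty() => PathBuf::from(dir),
        _ => PathBuf::from(default_dir),
    }
}

/// Build `docker run` flags that forward OPENROUTER_API_KEY.
/// `-e NAME` with no value tells Docker to copy the variable from the
/// caller's environment, so the secret stays out of the argument list.
pub fn env_passthrough_args(key_is_set: bool) -> Vec<String> {
    if key_is_set {
        vec!["-e".to_string(), "OPENROUTER_API_KEY".to_string()]
    } else {
        Vec::new()
    }
}

/// Shell commands to run inside the container before the agent starts.
/// Assumes a Debian-based image; `requirements` is a hypothetical path.
pub fn python_setup_commands(requirements: &str) -> Vec<String> {
    vec![
        // The agent needs Python regardless of the task's language.
        "apt-get update && apt-get install -y python3 python3-pip python3-venv".to_string(),
        // Fallback symlink so a bare `python` resolves to python3.
        r#"command -v python >/dev/null 2>&1 || ln -s "$(command -v python3)" /usr/local/bin/python"#
            .to_string(),
        // Externally-managed environments reject a plain `pip install`.
        format!(
            "python3 -m pip install --break-system-packages -r {}",
            requirements
        ),
    ]
}
```

A caller might feed the override from the environment with `resolve_agent_dir(std::env::var("DOCKER_AGENT_DIR").ok().as_deref(), "/agent")`, where `/agent` stands in for whatever default the harness actually uses.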
echobt added a commit that referenced this pull request on Apr 8, 2026
Co-authored-by: echobt <mathis.massimino+echo@cortex.foundation>
Summary
Fix SWE harness Docker execution issues and add baseagent-echo test results across multiple difficulty levels.
Changes
Harness fixes (src/swe/harness.rs)
- Add DOCKER_AGENT_DIR env var support for overriding the agent directory path in containerized environments
- Pass the OPENROUTER_API_KEY environment variable into Docker containers for LLM-based agents
- Install python3, python3-pip, and python3-venv alongside system tools in container setup, with a fallback symlink for python
- Add the --break-system-packages flag to pip install for compatibility with externally-managed Python environments

Agent test results (agent-tests/)
- summary.json with aggregated test outcomes
- README.md documenting test methodology and results

Other
- baseagent-echo submodule reference
- .cargo/config.toml update
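The aggregated outcomes in summary.json depend on how the "effective" resolution rate is defined: tasks that fail their sanity check or hit a setup error are excluded from the denominator. The sketch below shows one way to compute that. The type and field names are hypothetical and are not taken from the actual summary.json schema.

```rust
/// Per-task outcome categories mentioned in the PR (names are hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    Resolved,
    Unresolved,
    SanityCheckFailed,
    SetupError,
}

/// Aggregate counts across all tasks.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct Summary {
    pub total: usize,
    pub resolved: usize,
    pub unresolved: usize,
    pub sanity_check_failed: usize,
    pub setup_errors: usize,
}

impl Summary {
    /// Tally one counter per outcome category.
    pub fn from_outcomes(outcomes: &[Outcome]) -> Self {
        let mut s = Summary::default();
        for outcome in outcomes {
            s.total += 1;
            match outcome {
                Outcome::Resolved => s.resolved += 1,
                Outcome::Unresolved => s.unresolved += 1,
                Outcome::SanityCheckFailed => s.sanity_check_failed += 1,
                Outcome::SetupError => s.setup_errors += 1,
            }
        }
        s
    }

    /// Tasks where the agent was actually evaluated.
    pub fn valid_tasks(&self) -> usize {
        self.resolved + self.unresolved
    }

    /// Resolved / valid tasks; None when no task was valid.
    pub fn effective_resolution_rate(&self) -> Option<f64> {
        match self.valid_tasks() {
            0 => None,
            valid => Some(self.resolved as f64 / valid as f64),
        }
    }

    /// Resolved / all attempted tasks; None when there were no tasks.
    pub fn raw_resolution_rate(&self) -> Option<f64> {
        match self.total {
            0 => None,
            total => Some(self.resolved as f64 / total as f64),
        }
    }
}
```

With the counts reported in this PR (2 resolved, 5 failed sanity checks, 2 setup errors), the effective rate is 2/2 while the raw rate over all 9 tasks is 2/9, so it helps to report both figures side by side.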