CortexLM · echobt · Feb 17, 2026
diff --git a/README.md b/README.md
@@ -425,6 +425,114 @@ cargo clippy         # Lint
 RUST_LOG=debug cargo run -- swe mine --max-tasks 1 --once  # Debug run
 ```
 
+## Benchmark Results
+
+The benchmark was run on **2026-02-17**, processing 100 candidate PRs from GH Archive through the full pipeline (GH Archive → enrichment → filtering → LLM classification → patch extraction → Docker-based agentic test generation → quality scoring → export). Model: `moonshotai/kimi-k2.5:nitro` via OpenRouter.
+
+### Pipeline Funnel
+
+| Stage | Count | Ratio |
+|-------|------:|------:|
+| Raw GH Archive events (12 hours) | 1,752,426 | 100% |
+| Merged PR events | 35,498 | 2.03% |
+| Pre-filtered candidates (sampled) | 5,000 | — |
+| After bot/org filter | 1,394 | 27.88% of sampled |
+| Enriched & patch extracted | 21 | 1.51% of filtered |
+| Test generation started | 21 | 100% of extracted |
+| Dual-commit validation passed | 11 | 52.38% of test gen |
+| Quality scored | 11 | 100% of validated |
+| Quality passed (accepted) | 8 | 72.73% of scored |
+| Quality failed (rejected) | 3 | 27.27% of scored |
+
+Overall yield: **8 accepted tasks from 1.75M raw events** (0.00046%).
+
+### Difficulty Distribution
+
+| Difficulty | Count | Percentage | Score Range |
+|------------|------:|-----------:|-------------|
+| Easy | 2 | 18.2% | 0.15 – 0.20 |
+| Medium | 9 | 81.8% | 0.40 – 0.62 |
+| Hard | 0 | 0.0% | — |
+
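+Percentages are relative to the 11 scored tasks. All 8 accepted tasks were classified as **medium** difficulty. Of the 3 rejected tasks, 2 were the easy tasks (scores 0.15 and 0.20), which fall below the 0.30 passing threshold; the third was a medium task.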
+
+### Quality Metrics
+
+| Metric | Value |
+|--------|------:|
+| Average quality score | 0.47 |
+| Median quality score | 0.55 |
+| Min quality score | 0.15 |
+| Max quality score | 0.62 |
+| Passing threshold | ≥ 0.30 |
+| Quality pass rate | 72.7% |
+
+### Throughput & Timing
+
+| Metric | Value |
+|--------|------:|
+| Total wall-clock time | 3,600 s (60 min) |
+| PRs extracted per hour | 21.0 |
+| PRs fully processed per hour | 11.0 |
+| PRs accepted per hour | 8.0 |
+| Avg processing time per PR | 171.4 s |
+| Avg time to acceptance | 450.0 s |
+
+The primary bottleneck is Docker-based agentic test generation, which clones each repository, runs multi-turn LLM exploration (up to 200 turns), and performs dual-commit validation with retries.
+
+### Language Distribution (Accepted Tasks)
+
+| Language | Count | Percentage |
+|----------|------:|-----------:|
+| Go | 3 | 37.5% |
+| Java | 2 | 25.0% |
+| Python | 2 | 25.0% |
+| TypeScript | 1 | 12.5% |
+
+### Accepted Tasks
+
+| Task ID | Language | Difficulty | Quality Score |
+|---------|----------|------------|-------------:|
+| Kong/deck-1841 | Go | medium | 0.55 |
+| NeuralTrust/TrustGate-297 | Go | medium | 0.62 |
+| jmix-framework/jmix-5079 | Java | medium | 0.60 |
+| Decomp-Robot/dtk-template-1 | Python | medium | 0.60 |
+| softeerbootcamp-7th/WEB-Team4-Refit-448 | TypeScript | medium | 0.40 |
+| fluxcd/helm-controller-1411 | Go | medium | 0.55 |
+| run-house/kubetorch-2243 | Python | medium | 0.50 |
+| 2026TUKCOMCD/Dalum-108 | Java | medium | 0.55 |
+
+### Test Generation Failure Analysis
+
+| Failure Reason | Count | Percentage |
+|----------------|------:|-----------:|
+| Dual-commit validation failed | 3 | 30% |
+| Patch apply failed | 1 | 10% |
+| String-matching tests rejected | 1 | 10% |
+| Still in progress at timeout | 5 | 50% |
+
+Out of 21 PRs that entered test generation, 11 passed dual-commit validation (52.4%). The percentages in the table above are relative to the 10 PRs that did not. The most common failure mode was timeout: 5 PRs were still being processed when the 60-minute benchmark window ended. These include large repositories (elastic/kibana, LemmyNet/lemmy), where Docker cloning and test execution are slow.
+
+### Running the Benchmark
+
+```bash
+export OPENROUTER_API_KEY="sk-or-v1-..."
+export GITHUB_TOKEN="ghp_..."
+
+# Run benchmark on 100 candidate PRs
+cargo run --release -- swe benchmark --count 100 --cache-db benchmark_cache.db -o ./benchmark-output
+
+# Run with custom settings
+cargo run --release -- swe benchmark \
+  --count 50 \
+  --min-stars 100 \
+  --languages python,rust \
+  --model anthropic/claude-sonnet-4 \
+  -o ./benchmark-output
+```
+
+The benchmark command outputs the full `SweRunResult` as JSON to stdout, including the `benchmark_metrics` object with all pipeline counters.
+
 ## Credits
 
 Built on top of [SweInfinite](https://github.com/unconst/SweInfinite) by [@unconst](https://github.com/unconst). The original architecture for mining GitHub PRs and generating SWE-bench-style datasets was designed by the SweInfinite team. swe-forge extends it with:

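The ratio and throughput figures in the Benchmark Results tables of the README diff above are plain quotients of the reported counts. The following is a minimal Python sketch that recomputes them. The function and constant names are illustrative assumptions and not part of swe-forge, and treating the timing figures as wall-clock time divided by a count is an inference from the numbers, not a documented formula.

```python
def pct(part: int, whole: int, digits: int = 2) -> float:
    """Return part/whole as a percentage, rounded to `digits` decimals."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return round(100.0 * part / whole, digits)


def per_hour(count: int, wall_clock_s: float) -> float:
    """Scale a count observed over `wall_clock_s` seconds to an hourly rate."""
    return round(count * 3600.0 / wall_clock_s, 1)


def seconds_per_item(count: int, wall_clock_s: float) -> float:
    """Average wall-clock seconds spent per item."""
    return round(wall_clock_s / count, 1)


# Counts copied from the Pipeline Funnel and Throughput & Timing tables.
RAW_EVENTS = 1_752_426
MERGED_PRS = 35_498
SAMPLED = 5_000
AFTER_FILTER = 1_394
EXTRACTED = 21
VALIDATED = 11
ACCEPTED = 8
WALL_CLOCK_S = 3_600

funnel = {
    "merged / raw": pct(MERGED_PRS, RAW_EVENTS),
    "filtered / sampled": pct(AFTER_FILTER, SAMPLED),
    "extracted / filtered": pct(EXTRACTED, AFTER_FILTER),
    "validated / test gen": pct(VALIDATED, EXTRACTED),
    "accepted / scored": pct(ACCEPTED, VALIDATED),
    "accepted / raw": pct(ACCEPTED, RAW_EVENTS, digits=5),
}

throughput = {
    "extracted per hour": per_hour(EXTRACTED, WALL_CLOCK_S),
    "accepted per hour": per_hour(ACCEPTED, WALL_CLOCK_S),
    "avg s per PR": seconds_per_item(EXTRACTED, WALL_CLOCK_S),
    "avg s to acceptance": seconds_per_item(ACCEPTED, WALL_CLOCK_S),
}

if __name__ == "__main__":
    for label, value in {**funnel, **throughput}.items():
        print(f"{label}: {value}")
```

Each stage ratio uses the previous stage as its denominator, which is why the sampled row has no ratio of its own: it is a fixed-size sample rather than a filter result.
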
diff --git a/benchmark-output/2026TUKCOMCD/Dalum-108/checks.txt b/benchmark-output/2026TUKCOMCD/Dalum-108/checks.txt
@@ -0,0 +1,4 @@
+cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.global.s3.S3ServiceTest" --no-daemon
+cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.domain.dupe_product.controller.DupeProductControllerTest" --no-daemon
+cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.domain.like_product.service.LikeProductServiceTest" --no-daemon
+cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.domain.search_log.service.SearchLogServiceTest" --no-daemon
diff --git a/benchmark-output/2026TUKCOMCD/Dalum-108/original_pr.md b/benchmark-output/2026TUKCOMCD/Dalum-108/original_pr.md
@@ -0,0 +1,15 @@
+# 2026TUKCOMCD/Dalum-108 (original PR)
+
+2026TUKCOMCD/Dalum (#108): [BE] Add S3 service
+
+## 📝 Work done
+> Implemented saving of user-uploaded photos to an S3 bucket during dupe product search
+
+### Screenshot (optional)
+<img width="1419" height="174" alt="image" src="https://github.com/user-attachments/assets/1f8b8649-0298-4f1f-b269-05c0912ca497" />
+
+
+## 💬 Review requests (optional)
+> None
+
+
diff --git a/benchmark-output/2026TUKCOMCD/Dalum-108/prompt.md b/benchmark-output/2026TUKCOMCD/Dalum-108/prompt.md
@@ -0,0 +1,3 @@
+# 2026TUKCOMCD/Dalum-108
+
+Implement S3 storage for user-uploaded photos. When users upload images during duplicate product searching, the photos should be stored in an S3 bucket with proper file handling and access configuration.
diff --git a/benchmark-output/2026TUKCOMCD/Dalum-108/tests/DupeProductControllerTest.java b/benchmark-output/2026TUKCOMCD/Dalum-108/tests/DupeProductControllerTest.java
@@ -0,0 +1,52 @@
+package dalum.dalum.domain.dupe_product.controller;
+
+import dalum.dalum.domain.dupe_product.dto.request.DupeSearchRequest;
+import dalum.dalum.domain.dupe_product.service.DupeSearchService;
+import org.junit.jupiter.api.DisplayName;
+import org.junit.jupiter.api.Test;
+import org.junit.jupiter.api.extension.ExtendWith;
+import org.mockito.InjectMocks;
+import org.mockito.Mock;
+import org.mockito.junit.jupiter.MockitoExtension;
+
+import java.io.IOException;
+import java.lang.reflect.Method;
+import java.util.Arrays;
+
+import static org.assertj.core.api.Assertions.*;
+
+@ExtendWith(MockitoExtension.class)
+@DisplayName("DupeProductController API test")
+class DupeProductControllerTest {
+
+    @Mock
+    private DupeSearchService dupeSearchService;
+
+    @InjectMocks
+    private DupeProductController dupeProductController;
+
+    @Test
+    @DisplayName("searchDupe method must accept DupeSearchRequest as a parameter")
+    void searchDupe_AcceptsDupeSearchRequest() {
+        // Verify the method exists with correct parameter type
+        assertThat(DupeProductController.class.getMethods())
+            .anyMatch(m -> m.getName().equals("searchDupe") && 
+                          m.getParameterCount() == 1 &&
+                          m.getParameterTypes()[0].equals(DupeSearchRequest.class));
+    }
+
+    @Test
+    @DisplayName("DupeProductController must have DupeSearchService as a dependency")
+    void controller_HasDupeSearchServiceField() {
+        // Verify that DupeProductController has a field of type DupeSearchService
+        assertThat(DupeProductController.class.getDeclaredFields())
+            .anyMatch(field -> field.getType().equals(DupeSearchService.class));
+    }
+
+    @Test
+    @DisplayName("Controller class must have the @RestController annotation")
+    void controller_IsRestController() {
+        assertThat(DupeProductController.class.isAnnotationPresent(org.springframework.web.bind.annotation.RestController.class))
+            .isTrue();
+    }
+}
diff --git a/benchmark-output/2026TUKCOMCD/Dalum-108/tests/DupeSearchServiceS3IntegrationTest.java b/benchmark-output/2026TUKCOMCD/Dalum-108/tests/DupeSearchServiceS3IntegrationTest.java
@@ -0,0 +1,40 @@
+package dalum.dalum.domain.dupe_product.service;
+
+import dalum.dalum.global.s3.S3Service;
+import org.junit.jupiter.api.DisplayName;
+import org.junit.jupiter.api.Test;
+
+import java.lang.reflect.Constructor;
+import java.lang.reflect.Field;
+
+import static org.assertj.core.api.Assertions.*;
+
+@DisplayName("DupeSearchService S3 integration test")
+class DupeSearchServiceS3IntegrationTest {
+
+    @Test
+    @DisplayName("DupeSearchService must have S3Service as a dependency")
+    void dupeSearchService_HasS3ServiceField() {
+        // Verify that DupeSearchService has a field of type S3Service
+        assertThat(DupeSearchService.class.getDeclaredFields())
+            .anyMatch(field -> field.getType().equals(S3Service.class));
+    }
+
+    @Test
+    @DisplayName("DupeSearchService must have a constructor that takes an injected S3Service")
+    void dupeSearchService_HasConstructorWithS3Service() {
+        // Verify constructor injection includes S3Service
+        Constructor<?>[] constructors = DupeSearchService.class.getConstructors();
+
+        assertThat(constructors)
+            .anyMatch(constructor -> {
+                Class<?>[] paramTypes = constructor.getParameterTypes();
+                for (Class<?> paramType : paramTypes) {
+                    if (paramType.equals(S3Service.class)) {
+                        return true;
+                    }
+                }
+                return false;
+            });
+    }
+}
diff --git a/benchmark-output/2026TUKCOMCD/Dalum-108/tests/S3ServiceTest.java b/benchmark-output/2026TUKCOMCD/Dalum-108/tests/S3ServiceTest.java
@@ -0,0 +1,39 @@
+package dalum.dalum.global.s3;
+
+import org.junit.jupiter.api.DisplayName;
+import org.junit.jupiter.api.Test;
+
+import java.io.IOException;
+import java.lang.reflect.Method;
+import java.lang.reflect.Modifier;
+
+import static org.assertj.core.api.Assertions.*;
+
+@DisplayName("S3Service API test")
+class S3ServiceTest {
+
+    @Test
+    @DisplayName("S3Service class must exist")
+    void s3Service_ClassExists() {
+        // Verify the S3Service class exists
+        assertThatCode(() -> Class.forName("dalum.dalum.global.s3.S3Service"))
+            .doesNotThrowAnyException();
+    }
+
+    @Test
+    @DisplayName("S3Service must have an uploadFile method")
+    void s3Service_HasUploadFileMethod() throws NoSuchMethodException {
+        Class<?> clazz = S3Service.class;
+        Method method = clazz.getMethod("uploadFile", org.springframework.web.multipart.MultipartFile.class);
+        assertThat(method).isNotNull();
+        assertThat(method.getExceptionTypes()).contains(IOException.class);
+    }
+
+    @Test
+    @DisplayName("S3Service must have a deleteFile method")
+    void s3Service_HasDeleteFileMethod() throws NoSuchMethodException {
+        Class<?> clazz = S3Service.class;
+        Method method = clazz.getMethod("deleteFile", String.class);
+        assertThat(method).isNotNull();
+    }
+}
diff --git a/benchmark-output/2026TUKCOMCD/Dalum-108/tests/fail_to_pass_1.sh b/benchmark-output/2026TUKCOMCD/Dalum-108/tests/fail_to_pass_1.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+# This test must FAIL on base commit, PASS after fix
+cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.global.s3.S3ServiceTest" --no-daemon
diff --git a/benchmark-output/2026TUKCOMCD/Dalum-108/tests/fail_to_pass_2.sh b/benchmark-output/2026TUKCOMCD/Dalum-108/tests/fail_to_pass_2.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+# This test must FAIL on base commit, PASS after fix
+cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.domain.dupe_product.controller.DupeProductControllerTest" --no-daemon
diff --git a/benchmark-output/2026TUKCOMCD/Dalum-108/tests/pass_to_pass_1.sh b/benchmark-output/2026TUKCOMCD/Dalum-108/tests/pass_to_pass_1.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+# This test must PASS on base commit AND after fix
+cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.domain.like_product.service.LikeProductServiceTest" --no-daemon
diff --git a/benchmark-output/2026TUKCOMCD/Dalum-108/tests/pass_to_pass_2.sh b/benchmark-output/2026TUKCOMCD/Dalum-108/tests/pass_to_pass_2.sh
@@ -0,0 +1,3 @@
+#!/bin/bash
+# This test must PASS on base commit AND after fix
+cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.domain.search_log.service.SearchLogServiceTest" --no-daemon
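The comments in the `fail_to_pass_*.sh` and `pass_to_pass_*.sh` scripts above state the contract behind dual-commit validation: a fail-to-pass test must fail on the base commit and pass after the fix, while a pass-to-pass test must pass on both. Below is a minimal Python sketch of that check, assuming each script's exit code has already been collected on both commits. The `validate_dual_commit` function and `Verdict` type are hypothetical names for illustration, not swe-forge's actual implementation, and the exit codes in the example are made up, not benchmark results.

```python
from dataclasses import dataclass
from typing import List, Mapping, Tuple

# (exit code on base commit, exit code after the fix); 0 means the script passed.
ExitCodes = Tuple[int, int]


@dataclass(frozen=True)
class Verdict:
    ok: bool
    problems: Tuple[str, ...]


def validate_dual_commit(
    fail_to_pass: Mapping[str, ExitCodes],
    pass_to_pass: Mapping[str, ExitCodes],
) -> Verdict:
    """Check recorded exit codes against the fail-to-pass / pass-to-pass contract."""
    problems: List[str] = []

    for name, (base, fixed) in fail_to_pass.items():
        if base == 0:
            problems.append(f"{name}: must FAIL on base commit but passed")
        if fixed != 0:
            problems.append(f"{name}: must PASS after fix but failed")

    for name, (base, fixed) in pass_to_pass.items():
        if base != 0:
            problems.append(f"{name}: must PASS on base commit but failed")
        if fixed != 0:
            problems.append(f"{name}: must PASS after fix but failed")

    return Verdict(ok=not problems, problems=tuple(problems))


# Illustrative exit codes only (hypothetical, not measured).
example = validate_dual_commit(
    fail_to_pass={"fail_to_pass_1.sh": (1, 0), "fail_to_pass_2.sh": (1, 0)},
    pass_to_pass={"pass_to_pass_1.sh": (0, 0), "pass_to_pass_2.sh": (0, 0)},
)

if __name__ == "__main__":
    print("valid" if example.ok else "\n".join(example.problems))
```

Taking exit codes as input keeps the check independent of how the scripts are run (Docker, Gradle, or otherwise), so the same function applies to any task directory with this layout.
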