108 changes: 108 additions & 0 deletions README.md
@@ -425,6 +425,114 @@ cargo clippy # Lint
RUST_LOG=debug cargo run -- swe mine --max-tasks 1 --once # Debug run
```

## Benchmark Results

The benchmark was run on **2026-02-17**, processing 100 candidate PRs from GH Archive through the full pipeline (GH Archive → enrichment → filtering → LLM classification → patch extraction → Docker-based agentic test generation → quality scoring → export). The model was `moonshotai/kimi-k2.5:nitro`, accessed via OpenRouter.

### Pipeline Funnel

| Stage | Count | Ratio |
|-------|------:|------:|
| Raw GH Archive events (12 hours) | 1,752,426 | 100% |
| Merged PR events | 35,498 | 2.03% |
| Pre-filtered candidates (sampled) | 5,000 | — |
| After bot/org filter | 1,394 | 27.88% of sampled |
| Enriched & patch extracted | 21 | 1.51% of filtered |
| Test generation started | 21 | 100% of extracted |
| Dual-commit validation passed | 11 | 52.38% of test gen |
| Quality scored | 11 | 100% of validated |
| Quality passed (accepted) | 8 | 72.73% of scored |
| Quality failed (rejected) | 3 | 27.27% of scored |

Overall yield: **8 accepted tasks from 1.75M raw events** (0.00046%).
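
The ratios in the funnel can be re-derived from the stage counts. The following is a minimal Python sketch, not part of swe-forge itself; the function names `step_ratios` and `overall_yield` are made up for illustration, and the counts are copied from the table above. The sampled stage has no ratio in the table because it is a fixed-size sample, so the value the sketch computes for it is not meaningful.

```python
# Stage counts copied from the funnel table (names shortened slightly).
FUNNEL = [
    ("Raw GH Archive events", 1_752_426),
    ("Merged PR events", 35_498),
    ("Pre-filtered candidates (sampled)", 5_000),
    ("After bot/org filter", 1_394),
    ("Enriched & patch extracted", 21),
    ("Dual-commit validation passed", 11),
    ("Quality passed (accepted)", 8),
]


def step_ratios(stages):
    """Return (stage name, percent of the previous stage) for every stage after the first."""
    return [
        (name, 100.0 * count / prev_count)
        for (_, prev_count), (name, count) in zip(stages, stages[1:])
    ]


def overall_yield(stages):
    """Percent of the first stage's count that survives to the last stage."""
    return 100.0 * stages[-1][1] / stages[0][1]


if __name__ == "__main__":
    for name, pct in step_ratios(FUNNEL):
        print(f"{name}: {pct:.2f}% of previous stage")
    print(f"Overall yield: {overall_yield(FUNNEL):.5f}%")
```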

### Difficulty Distribution

| Difficulty | Count | Percentage | Score Range |
|------------|------:|-----------:|-------------|
| Easy | 2 | 18.2% | 0.15 – 0.20 |
| Medium | 9 | 81.8% | 0.40 – 0.62 |
| Hard | 0 | 0.0% | — |

The table covers all 11 scored tasks. All 8 accepted tasks were classified as **medium** difficulty. The 3 rejected tasks were the 2 easy tasks (scores 0.15 and 0.20) and 1 medium task.

### Quality Metrics

| Metric | Value |
|--------|------:|
| Average quality score | 0.47 |
| Median quality score | 0.55 |
| Min quality score | 0.15 |
| Max quality score | 0.62 |
| Passing threshold | ≥ 0.30 |
| Quality pass rate | 72.7% |

### Throughput & Timing

| Metric | Value |
|--------|------:|
| Total wall-clock time | 3,600 s (60 min) |
| PRs extracted per hour | 21.0 |
| PRs fully processed per hour | 11.0 |
| PRs accepted per hour | 8.0 |
| Avg processing time per PR | 171.4 s |
| Avg time to acceptance | 450.0 s |

The primary bottleneck is Docker-based agentic test generation, which clones each repository, runs multi-turn LLM exploration (up to 200 turns), and performs dual-commit validation with retries.
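
The throughput figures are consistent with dividing the wall-clock window by the stage counts (3,600 s / 21 ≈ 171.4 s and 3,600 s / 8 = 450 s). That reading is inferred from the arithmetic, not confirmed from the implementation, so the averages look like amortized values rather than measured per-PR latencies. A small Python sketch under that assumption, with the hypothetical helper name `throughput`:

```python
def throughput(wall_clock_s, extracted, processed, accepted):
    """Derive per-hour rates and amortized per-PR times from a wall-clock window.

    Assumption: averages are wall-clock time divided by stage counts,
    which reproduces the numbers in the table above.
    """
    hours = wall_clock_s / 3600.0
    return {
        "extracted_per_hour": extracted / hours,
        "processed_per_hour": processed / hours,
        "accepted_per_hour": accepted / hours,
        "avg_processing_time_s": wall_clock_s / extracted,
        "avg_time_to_acceptance_s": wall_clock_s / accepted,
    }


if __name__ == "__main__":
    for key, value in throughput(3600, extracted=21, processed=11, accepted=8).items():
        print(f"{key}: {value:.1f}")
```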

### Language Distribution (Accepted Tasks)

| Language | Count | Percentage |
|----------|------:|-----------:|
| Go | 3 | 37.5% |
| Java | 2 | 25.0% |
| Python | 2 | 25.0% |
| TypeScript | 1 | 12.5% |

### Accepted Tasks

| Task ID | Language | Difficulty | Quality Score |
|---------|----------|------------|-------------:|
| Kong/deck-1841 | Go | medium | 0.55 |
| NeuralTrust/TrustGate-297 | Go | medium | 0.62 |
| jmix-framework/jmix-5079 | Java | medium | 0.60 |
| Decomp-Robot/dtk-template-1 | Python | medium | 0.60 |
| softeerbootcamp-7th/WEB-Team4-Refit-448 | TypeScript | medium | 0.40 |
| fluxcd/helm-controller-1411 | Go | medium | 0.55 |
| run-house/kubetorch-2243 | Python | medium | 0.50 |
| 2026TUKCOMCD/Dalum-108 | Java | medium | 0.55 |
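
The language distribution can be cross-checked against the accepted-task table. The sketch below is illustrative Python, not swe-forge code; the rows are copied from the table and the helper name `language_distribution` is an assumption. Note that the mean computed here covers only the 8 accepted tasks, so it differs from the average in the Quality Metrics table, whose minimum of 0.15 shows that it also includes rejected tasks.

```python
from collections import Counter
from statistics import mean, median

# (task id, language, quality score) copied from the accepted-task table.
ACCEPTED = [
    ("Kong/deck-1841", "Go", 0.55),
    ("NeuralTrust/TrustGate-297", "Go", 0.62),
    ("jmix-framework/jmix-5079", "Java", 0.60),
    ("Decomp-Robot/dtk-template-1", "Python", 0.60),
    ("softeerbootcamp-7th/WEB-Team4-Refit-448", "TypeScript", 0.40),
    ("fluxcd/helm-controller-1411", "Go", 0.55),
    ("run-house/kubetorch-2243", "Python", 0.50),
    ("2026TUKCOMCD/Dalum-108", "Java", 0.55),
]


def language_distribution(tasks):
    """Map each language to (count, percentage of all tasks)."""
    counts = Counter(language for _, language, _ in tasks)
    total = sum(counts.values())
    return {language: (n, 100.0 * n / total) for language, n in counts.most_common()}


if __name__ == "__main__":
    for language, (n, pct) in language_distribution(ACCEPTED).items():
        print(f"{language}: {n} ({pct:.1f}%)")
    scores = [score for _, _, score in ACCEPTED]
    print(f"accepted-only mean: {mean(scores):.3f}, median: {median(scores):.2f}")
```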

### Test Generation Failure Analysis

| Failure Reason | Count | Percentage |
|----------------|------:|-----------:|
| Dual-commit validation failed | 3 | 30% |
| Patch apply failed | 1 | 10% |
| String-matching tests rejected | 1 | 10% |
| Still in progress at timeout | 5 | 50% |

Out of 21 PRs that entered test generation, 11 passed dual-commit validation (52.4%). The percentages in the table are relative to the 10 PRs that did not pass. The most common cause was timeout: 5 PRs were still being processed when the 60-minute benchmark window ended. These include large repositories (elastic/kibana, LemmyNet/lemmy) where Docker cloning and test execution take significant time.
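
The failure table and the funnel use different denominators, which is easy to misread. A short Python sketch of the two calculations follows; the counts come from the table and the text above, and the helper names `failure_shares` and `validation_pass_rate` are illustrative only.

```python
# Counts copied from the failure table.
FAILURES = {
    "Dual-commit validation failed": 3,
    "Patch apply failed": 1,
    "String-matching tests rejected": 1,
    "Still in progress at timeout": 5,
}
ENTERED_TEST_GEN = 21


def failure_shares(failures):
    """Percentage of each reason among PRs that did not pass validation."""
    total = sum(failures.values())
    return {reason: 100.0 * n / total for reason, n in failures.items()}


def validation_pass_rate(entered, failures):
    """Return (number passed, pass percentage) among PRs that entered test generation."""
    passed = entered - sum(failures.values())
    return passed, 100.0 * passed / entered


if __name__ == "__main__":
    for reason, pct in failure_shares(FAILURES).items():
        print(f"{reason}: {pct:.0f}%")
    passed, rate = validation_pass_rate(ENTERED_TEST_GEN, FAILURES)
    print(f"passed: {passed} ({rate:.1f}%)")
```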

### Running the Benchmark

```bash
export OPENROUTER_API_KEY="sk-or-v1-..."
export GITHUB_TOKEN="ghp_..."

# Run benchmark on 100 candidate PRs
cargo run --release -- swe benchmark --count 100 --cache-db benchmark_cache.db -o ./benchmark-output

# Run with custom settings
cargo run --release -- swe benchmark \
  --count 50 \
  --min-stars 100 \
  --languages python,rust \
  --model anthropic/claude-sonnet-4 \
  -o ./benchmark-output
```

The benchmark command outputs the full `SweRunResult` as JSON to stdout, including the `benchmark_metrics` object with all pipeline counters.
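
One possible way to consume that output from a script is sketched below in Python. Only the `benchmark_metrics` key is taken from the description above; the counter names in `SAMPLE` are hypothetical placeholders, and the sketch assumes stdout contains nothing but the JSON document (for example, that log lines go elsewhere), which has not been verified.

```python
import json


def read_benchmark_metrics(stdout_text):
    """Parse the JSON printed by the benchmark and return its metrics object."""
    result = json.loads(stdout_text)
    metrics = result.get("benchmark_metrics")
    if not isinstance(metrics, dict):
        raise KeyError("benchmark_metrics object not found in output")
    return metrics


# Hypothetical sample: the real counter names inside benchmark_metrics may differ.
SAMPLE = '{"benchmark_metrics": {"accepted": 8, "scored": 11}}'

if __name__ == "__main__":
    for counter, value in read_benchmark_metrics(SAMPLE).items():
        print(f"{counter}: {value}")
```

If stdout is redirected to a file when running the command, the file's contents can be passed to `read_benchmark_metrics` in the same way.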

## Credits

Built on top of [SweInfinite](https://github.com/unconst/SweInfinite) by [@unconst](https://github.com/unconst). The original architecture for mining GitHub PRs and generating SWE-bench-style datasets was designed by the SweInfinite team. swe-forge extends it with:
4 changes: 4 additions & 0 deletions benchmark-output/2026TUKCOMCD/Dalum-108/checks.txt
@@ -0,0 +1,4 @@
cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.global.s3.S3ServiceTest" --no-daemon
cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.domain.dupe_product.controller.DupeProductControllerTest" --no-daemon
cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.domain.like_product.service.LikeProductServiceTest" --no-daemon
cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.domain.search_log.service.SearchLogServiceTest" --no-daemon
15 changes: 15 additions & 0 deletions benchmark-output/2026TUKCOMCD/Dalum-108/original_pr.md
@@ -0,0 +1,15 @@
# 2026TUKCOMCD/Dalum-108 (original PR)

2026TUKCOMCD/Dalum (#108): [BE] Add S3 service

## 📝Work description
> Implemented saving of user-uploaded photos to an S3 bucket during dupe product search

### Screenshot (optional)
<img width="1419" height="174" alt="image" src="https://github.com/user-attachments/assets/1f8b8649-0298-4f1f-b269-05c0912ca497" />


## 💬Review requests (optional)
> None


3 changes: 3 additions & 0 deletions benchmark-output/2026TUKCOMCD/Dalum-108/prompt.md
@@ -0,0 +1,3 @@
# 2026TUKCOMCD/Dalum-108

Implement S3 storage for user-uploaded photos. When users upload images during duplicate product searching, the photos should be stored in an S3 bucket with proper file handling and access configuration.
@@ -0,0 +1,52 @@
package dalum.dalum.domain.dupe_product.controller;

import dalum.dalum.domain.dupe_product.dto.request.DupeSearchRequest;
import dalum.dalum.domain.dupe_product.service.DupeSearchService;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;
import org.junit.jupiter.api.extension.ExtendWith;
import org.mockito.InjectMocks;
import org.mockito.Mock;
import org.mockito.junit.jupiter.MockitoExtension;

import java.io.IOException;
import java.lang.reflect.Method;
import java.util.Arrays;

import static org.assertj.core.api.Assertions.*;

@ExtendWith(MockitoExtension.class)
@DisplayName("DupeProductController API test")
class DupeProductControllerTest {

    @Mock
    private DupeSearchService dupeSearchService;

    @InjectMocks
    private DupeProductController dupeProductController;

    @Test
    @DisplayName("searchDupe method must accept DupeSearchRequest as a parameter")
    void searchDupe_AcceptsDupeSearchRequest() {
        // Verify the method exists with correct parameter type
        assertThat(DupeProductController.class.getMethods())
                .anyMatch(m -> m.getName().equals("searchDupe") &&
                        m.getParameterCount() == 1 &&
                        m.getParameterTypes()[0].equals(DupeSearchRequest.class));
    }

    @Test
    @DisplayName("DupeProductController must have DupeSearchService as a dependency")
    void controller_HasDupeSearchServiceField() {
        // Verify that DupeProductController has a field of type DupeSearchService
        assertThat(DupeProductController.class.getDeclaredFields())
                .anyMatch(field -> field.getType().equals(DupeSearchService.class));
    }

    @Test
    @DisplayName("Controller class must have the @RestController annotation")
    void controller_IsRestController() {
        assertThat(DupeProductController.class.isAnnotationPresent(org.springframework.web.bind.annotation.RestController.class))
                .isTrue();
    }
}
@@ -0,0 +1,40 @@
package dalum.dalum.domain.dupe_product.service;

import dalum.dalum.global.s3.S3Service;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

import java.lang.reflect.Constructor;
import java.lang.reflect.Field;

import static org.assertj.core.api.Assertions.*;

@DisplayName("DupeSearchService S3 integration test")
class DupeSearchServiceS3IntegrationTest {

    @Test
    @DisplayName("DupeSearchService must have S3Service as a dependency")
    void dupeSearchService_HasS3ServiceField() {
        // Verify that DupeSearchService has a field of type S3Service
        assertThat(DupeSearchService.class.getDeclaredFields())
                .anyMatch(field -> field.getType().equals(S3Service.class));
    }

    @Test
    @DisplayName("DupeSearchService must have a constructor that injects S3Service")
    void dupeSearchService_HasConstructorWithS3Service() {
        // Verify constructor injection includes S3Service
        Constructor<?>[] constructors = DupeSearchService.class.getConstructors();

        assertThat(constructors)
                .anyMatch(constructor -> {
                    Class<?>[] paramTypes = constructor.getParameterTypes();
                    for (Class<?> paramType : paramTypes) {
                        if (paramType.equals(S3Service.class)) {
                            return true;
                        }
                    }
                    return false;
                });
    }
}
39 changes: 39 additions & 0 deletions benchmark-output/2026TUKCOMCD/Dalum-108/tests/S3ServiceTest.java
@@ -0,0 +1,39 @@
package dalum.dalum.global.s3;

import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

import java.io.IOException;
import java.lang.reflect.Method;
import java.lang.reflect.Modifier;

import static org.assertj.core.api.Assertions.*;

@DisplayName("S3Service API test")
class S3ServiceTest {

    @Test
    @DisplayName("S3Service class must exist")
    void s3Service_ClassExists() {
        // Verify the S3Service class exists
        assertThatCode(() -> Class.forName("dalum.dalum.global.s3.S3Service"))
                .doesNotThrowAnyException();
    }

    @Test
    @DisplayName("S3Service must have an uploadFile method")
    void s3Service_HasUploadFileMethod() throws NoSuchMethodException {
        Class<?> clazz = S3Service.class;
        Method method = clazz.getMethod("uploadFile", org.springframework.web.multipart.MultipartFile.class);
        assertThat(method).isNotNull();
        assertThat(method.getExceptionTypes()).contains(IOException.class);
    }

    @Test
    @DisplayName("S3Service must have a deleteFile method")
    void s3Service_HasDeleteFileMethod() throws NoSuchMethodException {
        Class<?> clazz = S3Service.class;
        Method method = clazz.getMethod("deleteFile", String.class);
        assertThat(method).isNotNull();
    }
}
@@ -0,0 +1,3 @@
#!/bin/bash
# This test must FAIL on base commit, PASS after fix
cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.global.s3.S3ServiceTest" --no-daemon
@@ -0,0 +1,3 @@
#!/bin/bash
# This test must FAIL on base commit, PASS after fix
cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.domain.dupe_product.controller.DupeProductControllerTest" --no-daemon
@@ -0,0 +1,3 @@
#!/bin/bash
# This test must PASS on base commit AND after fix
cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.domain.like_product.service.LikeProductServiceTest" --no-daemon
@@ -0,0 +1,3 @@
#!/bin/bash
# This test must PASS on base commit AND after fix
cd /repo/Dalum-BE && ./gradlew test --tests "dalum.dalum.domain.search_log.service.SearchLogServiceTest" --no-daemon
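
The comments in the four scripts above describe two groups: tests that must fail on the base commit and pass after the fix, and tests that must pass on both. A minimal Python sketch of that check follows. It is a guess at the logic those comments imply, not the pipeline's actual validation code; the function name `validate_dual_commit` and the boolean result maps are hypothetical, while the test class names are taken from the scripts.

```python
# Test classes grouped according to the comments in the scripts.
FAIL_TO_PASS = ["S3ServiceTest", "DupeProductControllerTest"]
PASS_TO_PASS = ["LikeProductServiceTest", "SearchLogServiceTest"]


def validate_dual_commit(base, fixed, fail_to_pass, pass_to_pass):
    """Check test outcomes on both commits and return a list of violations.

    `base` and `fixed` map a test name to True if it passed on that commit.
    An empty list means the task satisfies both rules.
    """
    problems = []
    for name in fail_to_pass:
        if base[name]:
            problems.append(f"{name}: passes on base commit but must fail")
        if not fixed[name]:
            problems.append(f"{name}: fails after fix but must pass")
    for name in pass_to_pass:
        if not base[name]:
            problems.append(f"{name}: fails on base commit but must pass")
        if not fixed[name]:
            problems.append(f"{name}: fails after fix but must pass")
    return problems


if __name__ == "__main__":
    base = {name: False for name in FAIL_TO_PASS}
    base.update({name: True for name in PASS_TO_PASS})
    fixed = {name: True for name in base}
    print(validate_dual_commit(base, fixed, FAIL_TO_PASS, PASS_TO_PASS) or "valid")
```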