Korean Vision Document Retrieval (KoViDoRe) benchmark for evaluating text-to-image retrieval models on Korean visual documents.
KoViDoRe is a comprehensive benchmark for evaluating Korean visual document retrieval capabilities. Built upon the foundation of ViDoRe, it assesses how well models can retrieve relevant Korean visual documents—including screenshots, presentation slides, and office documents—when given Korean text queries.
KoViDoRe v1 comprises 5 distinct tasks, each targeting a different type of visual document commonly found in Korean business and academic environments. This diverse task structure allows for a thorough evaluation of multimodal retrieval performance across various document formats and content types.
KoViDoRe v2 addresses a key limitation of KoViDoRe v1 (single-page matching) by generating queries that require aggregating information across multiple pages. It consists of 4 distinct tasks targeting practical enterprise domains: cybersecurity, economic reports, energy documents, and HR materials.
| Subset | Description | Documents | Queries | Link |
|---|---|---|---|---|
| HR | Workforce outlook and employment policy | 2,109 | 221 | 🤗 Dataset |
| Energy | Energy policy and power market trends | 1,911 | 190 | 🤗 Dataset |
| Economic | Quarterly economic trend reports | 1,477 | 163 | 🤗 Dataset |
| Cybersecurity | Cyber threat analysis and security guides | 1,150 | 149 | 🤗 Dataset |
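The v2 subsets can be pulled directly from the Hugging Face Hub (see the Note further below). A minimal sketch with the `datasets` library follows; the repository ID is a placeholder (use the actual IDs behind the 🤗 Dataset links above), and the `"test"` split name is an assumption.

```python
# Minimal sketch: load one KoViDoRe v2 subset from the Hugging Face Hub.
# The repo ID is a placeholder -- substitute the actual dataset ID from the
# 🤗 Dataset links above; the "test" split name is also an assumption.
from datasets import load_dataset

subset = load_dataset("your-namespace/kovidore-v2-cybersecurity", split="test")
print(len(subset), subset[0])  # example count and one sample record
```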
The following table shows performance across all KoViDoRe v1 tasks (NDCG@5 scores as percentages):
| Model | Model Size (M params) | FinOCR | MIR | Office | Slide | VQA | Average | ViDoRe V2 (Eng) |
|---|---|---|---|---|---|---|---|---|
| nomic-ai/colnomic-embed-multimodal-3b | 3000 | 82.2 | 70.7 | 86.3 | 78.4 | 84.4 | 80.4 | 55.5 |
| nomic-ai/colnomic-embed-multimodal-7b | 7000 | 81.9 | 67.9 | 85.9 | 87.6 | 87.2 | 82.1 | 60.8 |
| vidore/colqwen2.5-v0.2 | 3000 | 67.3 | 62.5 | 75.3 | 78.0 | 81.0 | 72.8 | 59.3 |
| vidore/colqwen2-v1.0 | 2210 | 66.3 | 57.4 | 68.7 | 73.9 | 75.5 | 68.4 | 55.0 |
| jinaai/jina-embeddings-v4 | 3800 | 88.9 | 73.8 | 88.6 | 89.5 | 86.2 | 85.4 | 57.6 |
| vidore/colpali-v1.2 | 2920 | 43.8 | 20.2 | 28.4 | 51.2 | 36.8 | 36.1 | 50.7 |
| vidore/colpali-v1.3 | 2920 | 42.6 | 18.8 | 26.4 | 55.3 | 36.6 | 35.9 | 54.2 |
| vidore/colpali-v1.1 | 2920 | 38.3 | 19.0 | 25.3 | 48.6 | 30.0 | 32.2 | 47.2 |
| nvidia/llama-nemoretriever-colembed-3b-v1 | 3000 | TBA | TBA | TBA | TBA | TBA | TBA | 63.5 |
| nvidia/llama-nemoretriever-colembed-1b-v1 | 2418 | 76.6 | 28.1 | 34.2 | 53.3 | 39.4 | 46.3 | 62.1 |
| vidore/colSmol-500M | 500 | 50.9 | 4.7 | 9.7 | 16.1 | 7.4 | 17.8 | 43.5 |
| vidore/colSmol-256M | 256 | 46.6 | 4.0 | 8.4 | 13.9 | 7.6 | 16.1 | 32.9 |
| google/siglip-so400m-patch14-384 | 878 | 4.0 | 3.9 | 6.3 | 21.3 | 7.3 | 8.6 | 31.4 |
| TIGER-Lab/VLM2Vec-Full | 4150 | 1.4 | 1.6 | 7.2 | 14.9 | 6.8 | 6.4 | 30.1 |
| laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 2540 | 0.5 | 1.9 | 3.7 | 12.5 | 5.6 | 4.8 | 17.6 |
| openai/clip-vit-base-patch16 | 151 | 0.3 | 0.6 | 0.0 | 5.9 | 3.3 | 2.5 | 8.3 |
| ibm-granite/granite-vision-3.3-2b-embedding | 2980 | 0.0 | 0.4 | 0.6 | 0.3 | 0.0 | 0.26 | 58.1 |
The following table shows performance across all KoViDoRe v2 tasks (NDCG@10 scores as percentages):
| Model | Model Size (M params) | Cybersecurity | Economic | Energy | HR | Average | KoViDoRe V1 (Kor) |
|---|---|---|---|---|---|---|---|
| nomic-ai/colnomic-embed-multimodal-3b | 3000 | 73.7 | 17.8 | 61.0 | 37.0 | 47.4 | 80.4 |
| nomic-ai/colnomic-embed-multimodal-7b | 7000 | 72.3 | 19.9 | 56.7 | 35.8 | 46.2 | 82.1 |
| vidore/colqwen2.5-v0.2 | 3000 | 60.8 | 12.6 | 48.1 | 22.9 | 36.1 | 72.8 |
| vidore/colqwen2-v1.0 | 2210 | 59.9 | 10.4 | 37.7 | 23.8 | 33.0 | 68.4 |
| jinaai/jina-embeddings-v4 | 3800 | 77.3 | 25.5 | 61.7 | 50.4 | 53.7 | 85.4 |
| vidore/colpali-v1.2 | 2920 | 40.9 | 2.0 | 18.2 | 5.9 | 16.8 | 36.1 |
| vidore/colpali-v1.3 | 2920 | 37.8 | 1.7 | 17.8 | 7.0 | 16.1 | 35.9 |
| vidore/colpali-v1.1 | 2920 | 35.6 | 2.7 | 17.7 | 6.5 | 15.6 | 32.2 |
| nvidia/llama-nemoretriever-colembed-3b-v1 | 3000 | TBA | TBA | TBA | TBA | TBA | TBA |
| nvidia/llama-nemoretriever-colembed-1b-v1 | 2418 | 52.1 | TBA | TBA | TBA | TBA | 46.3 |
| vidore/colSmol-500M | 500 | 27.0 | 1.1 | 6.3 | 1.2 | 8.9 | 17.8 |
| vidore/colSmol-256M | 256 | 23.1 | 1.1 | 5.7 | 1.3 | 7.8 | 16.1 |
| google/siglip-so400m-patch14-384 | 878 | 15.3 | 1.3 | 3.3 | 1.1 | 5.3 | 8.6 |
| TIGER-Lab/VLM2Vec-Full | 4150 | 9.8 | 1.3 | 2.8 | 1.2 | 3.8 | 6.4 |
| laion/CLIP-ViT-bigG-14-laion2B-39B-b160k | 2540 | 13.7 | 0.3 | 2.4 | 0.4 | 4.2 | 4.8 |
| openai/clip-vit-base-patch16 | 151 | 4.1 | 0.0 | 0.7 | 0.6 | 1.3 | 2.5 |
| ibm-granite/granite-vision-3.3-2b-embedding | 2980 | 0.0 | 0.5 | 0.3 | 0.4 | 0.3 | 0.3 |
We provide interpretability maps to help understand how different models attend to document image patches when processing queries; each map corresponds to a different query word. The examples below show maps from vidore/colpali-v1.3, vidore/colqwen2.5-v0.2, and jinaai/jina-embeddings-v4, and a minimal sketch of how such maps are computed follows the examples.
- Query: 인천 광역시의 CT 설치 비율은 몇 프로니? ("What is Incheon Metropolitan City's CT installation rate, as a percentage?")

  *Per-word interpretability maps for the three models.*
- Query: 지방자치단체가 보건복지부에 제출하는 문서는 무엇인가요? ("What documents do local governments submit to the Ministry of Health and Welfare?")

  *Per-word interpretability maps for the three models.*
- Query: 나무가 주거 공간에서 제공하는 역할은 무엇인가? ("What role do trees play in residential spaces?")

  *Per-word interpretability maps for the three models.*
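For readers who want to reproduce this kind of visualization, here is a minimal sketch of the general late-interaction recipe: score each query-token embedding against every image-patch embedding, then reshape the scores onto the patch grid. This is an illustration under assumed shapes and grid size, not the exact pipeline used to generate the maps above.

```python
# Sketch: per-token heatmaps for a late-interaction retriever (ColPali-style).
# Assumes embeddings are already computed; shapes and the 32x32 grid are
# illustrative, not the models' actual configurations.
import numpy as np

def token_heatmaps(query_embs: np.ndarray, patch_embs: np.ndarray,
                   grid_h: int, grid_w: int) -> np.ndarray:
    """query_embs: (n_tokens, dim); patch_embs: (grid_h * grid_w, dim).
    Returns one (grid_h, grid_w) similarity map per query token."""
    sims = query_embs @ patch_embs.T              # (n_tokens, n_patches)
    return sims.reshape(len(query_embs), grid_h, grid_w)

rng = np.random.default_rng(0)
maps = token_heatmaps(rng.normal(size=(5, 128)),        # 5 query tokens
                      rng.normal(size=(32 * 32, 128)),  # 1024 image patches
                      32, 32)
print(maps.shape)  # (5, 32, 32) -- overlay each map on the page image
```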
```bash
# Install dependencies
uv sync

# Run with custom model
uv run kovidore --model "your-model-name"

# Run specific tasks
uv run kovidore --model "your-model-name" --tasks mir vqa

# Run with custom batch size (default: 16)
uv run kovidore --model "your-model-name" --batch-size 32

# List available tasks
uv run kovidore --list-tasks
```

```python
from src.evaluate import run_benchmark

# Run all tasks
evaluation = run_benchmark("your-model-name")

# Run specific tasks
evaluation = run_benchmark("your-model-name", tasks=["mir", "vqa"])

# Run with custom batch size
evaluation = run_benchmark("your-model-name", batch_size=32)
```

> **Note**: Unlike KoViDoRe v1, KoViDoRe v2 is freely available on Hugging Face. You can access the full dataset collection here.
We provide pre-processed queries and query-corpus mappings for each task. However, due to licensing restrictions, you'll need to download the image datasets manually from AI Hub (see Acknowledgements section for dataset links).
Setup Instructions:
- Download the required datasets from AI Hub
- Extract and place images in the following directory structure:
```
data/
├── mir/images/
├── vqa/images/
├── slide/images/
├── office/images/
└── finocr/images/
```
The benchmark will automatically locate and use the images from these directories during evaluation.
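As an optional sanity check before running the benchmark (this snippet is not part of the benchmark code itself), you can count the image files in each expected directory:

```python
# Optional sanity check: confirm the AI Hub images are in the expected layout.
from pathlib import Path

for task in ["mir", "vqa", "slide", "office", "finocr"]:
    images = Path("data") / task / "images"
    count = sum(1 for _ in images.glob("*")) if images.is_dir() else 0
    print(f"{images}: {'ok' if count else 'MISSING'} ({count} files)")
```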
Results are automatically saved in the `results/` directory after evaluation completes. KoViDoRe v1 uses NDCG@5 and KoViDoRe v2 uses NDCG@10 as the main evaluation metric for all tasks.
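To make the metric concrete, here is a minimal NDCG@k reference implementation assuming binary relevance; the scores in the tables above are produced by the benchmark runner itself, so this is illustrative only.

```python
# Minimal NDCG@k with binary relevance (KoViDoRe v1 reports k=5, v2 reports k=10).
import math

def ndcg_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    # DCG: each relevant document in the top-k contributes 1/log2(rank + 2).
    dcg = sum(1.0 / math.log2(rank + 2)
              for rank, doc_id in enumerate(ranked_ids[:k])
              if doc_id in relevant_ids)
    # Ideal DCG: all relevant documents ranked first.
    ideal = sum(1.0 / math.log2(rank + 2)
                for rank in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal else 0.0

# One relevant page retrieved at the second position:
print(ndcg_at_k(["p7", "p3", "p9"], {"p3"}, k=5))  # 1/log2(3) ~= 0.63
```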
This benchmark is inspired by the ViDoRe benchmark. We thank the original authors for their foundational work that helped shape our approach to Korean visual document retrieval.
We also acknowledge the following Korean datasets from AI Hub that were used to construct each task in KoViDoRe v1:
- 멀티모달 정보검색 데이터 (Multimodal Information Retrieval Data) - Used for the KoVidoreMIRRetrieval task
- 시각화 자료 질의응답 데이터 (Visualization Material Question Answering Data) - Used for the KoVidoreVQARetrieval task
- 오피스 문서 생성 데이터 (Office Document Generation Data) - Used for the KoVidoreSlideRetrieval and KoVidoreOfficeRetrieval tasks
- OCR 데이터(금융 및 물류) (OCR Data for Finance and Logistics) - Used for the KoVidoreFinOCRRetrieval task
For questions or suggestions, please open an issue on the GitHub repository or contact the maintainers.
If you use KoViDoRe in your research, please cite as follows:
```bibtex
@misc{KoViDoRe2025,
  author = {Yongbin Choi and Yongwoo Song},
  title  = {KoViDoRe: Korean Vision Document Retrieval Benchmark},
  year   = {2025},
  url    = {https://github.com/whybe-choi/kovidore-benchmark},
  note   = {A comprehensive benchmark for evaluating visual document retrieval models on Korean document images}
}
```

```bibtex
@misc{choi2026kovidorev2,
  author = {Yongbin Choi},
  title  = {KoViDoRe v2: a comprehensive evaluation of vision document retrieval for enterprise use-cases},
  year   = {2026},
  url    = {https://github.com/whybe-choi/kovidore-data-generator},
  note   = {A benchmark for evaluating Korean vision document retrieval with multi-page reasoning queries in practical domains}
}
```