English | 简体中文

TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition


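📜 Paper | Github | 🤗 Demo (Huggingface) | 🤗 Model Weights (Huggingface)

TRivia is a novel self-supervised fine-tuning framework for table recognition VLMs. This repository releases TRivia-3B, an advanced table recognition VLM fine-tuned from Qwen2.5-VL-3B with the TRivia framework, which shows strong performance on multiple real-world table recognition benchmarks.

Key features:

  • ⭐ Strong table recognition: TRivia-3B handles digital, scanned, and photographed tables, and it automatically separates the table from the background of an image so that only the table itself is recognized.
  • 📃 Reproducible training pipeline: table recognition improves using only unlabeled data, with no distillation required.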


Benchmark Performance

We mainly evaluate on the following three real-world benchmarks: OmniDocBench v1.5, CC-OCR, and OCRBench v2.

| Model | PubTabNet TEDS | PubTabNet S-TEDS | OmniDocBench TEDS | OmniDocBench S-TEDS | CC-OCR TEDS | CC-OCR S-TEDS | OCRBench TEDS | OCRBench S-TEDS | Overall TEDS | Overall S-TEDS |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Expert TR models | | | | | | | | | | |
| SLANet-plus | 86.57 | 96.43 | 81.90 | 89.08 | 50.93 | 65.84 | 65.55 | 77.73 | 68.19 | 79.21 |
| UniTable | 86.44 | 95.66 | 82.76 | 89.82 | 57.84 | 70.47 | 67.73 | 78.65 | 70.86 | 80.81 |
| General-purpose VLMs | | | | | | | | | | |
| InternVL3.5-241B-A30B | 83.75 | 88.76 | 86.03 | 90.53 | 62.87 | 69.52 | 79.50 | 85.81 | 78.41 | 84.18 |
| Qwen2.5-VL-72B | 84.39 | 87.91 | 87.85 | 91.80 | 81.22 | 86.48 | 81.33 | 86.58 | 83.52 | 88.33 |
| Qwen3-VL-235B-A22B | - | - | 91.02 | 94.97 | 80.98 | 86.19 | 84.12 | 88.15 | 85.83 | 90.07 |
| Gemini 2.5 Pro | - | - | 90.90 | 94.32 | 85.56 | 90.07 | 88.94 | 89.47 | 88.93 | 91.23 |
| GPT-4o | 76.53 | 86.16 | 78.27 | 84.56 | 66.98 | 79.04 | 70.51 | 79.55 | 72.44 | 81.15 |
| GPT-5 | - | - | 84.91 | 89.91 | 63.25 | 74.09 | 79.91 | 88.69 | 78.30 | 86.21 |
| Document-parsing VLMs | | | | | | | | | | |
| dots.ocr | 90.65 | 93.76 | 88.62 | 92.86 | 75.42 | 81.65 | 82.04 | 86.27 | 82.95 | 87.58 |
| DeepSeek-OCR | - | - | 83.79 | 87.86 | 68.95 | 75.22 | 82.64 | 87.33 | 80.31 | 85.11 |
| PaddleOCR-VL | - | - | 91.12 | 94.62 | 79.62 | 85.04 | 79.29 | 83.93 | 83.36 | 87.77 |
| MinerU2.5 | 89.07 | 93.11 | 90.85 | 94.68 | 79.76 | 85.16 | 87.13 | 90.62 | 86.82 | 90.81 |
| TRivia-3B (Ours) | 91.79 | 93.81 | 91.60 | 95.01 | 84.90 | 90.17 | 90.76 | 94.03 | 89.88 | 93.60 |
The Overall column is the weighted average score across three benchmarks: OmniDocBench v1.5, CC-OCR, and OCRBench v2.
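As a rough illustration of how a weighted Overall score could be computed, here is a minimal sketch. The per-benchmark weights are not stated in this README, so `EQUAL_WEIGHTS` is a placeholder assumption (presumably the real weights reflect benchmark sizes), and `weighted_average` is a hypothetical helper, not part of this repository.

```python
def weighted_average(scores: dict, weights: dict) -> float:
    """Return sum(score * weight) / sum(weight) over the benchmarks in `scores`."""
    total_weight = sum(weights[name] for name in scores)
    return sum(scores[name] * weights[name] for name in scores) / total_weight


# TEDS scores of TRivia-3B, copied from the table above.
TRIVIA_3B_TEDS = {"OmniDocBench": 91.60, "CC-OCR": 84.90, "OCRBench": 90.76}

# Placeholder weights (assumption): the actual weights are not given in this README.
EQUAL_WEIGHTS = {"OmniDocBench": 1, "CC-OCR": 1, "OCRBench": 1}

if __name__ == "__main__":
    print(round(weighted_average(TRIVIA_3B_TEDS, EQUAL_WEIGHTS), 2))
```

With equal weights the result is a plain mean, which differs from the reported Overall TEDS of 89.88, so the benchmarks are evidently not weighted equally.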

Installation

Because TRivia-3B is fine-tuned from Qwen2.5-VL-3B, you can follow the Qwen2.5-VL-3B installation guide to set up the environment.

We strongly recommend installing vLLM >= 0.7.2 for faster inference.
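To confirm that the installed vLLM meets the recommended minimum, a small check along these lines may help. It is a sketch using only the standard library; `parse_release`, `meets_minimum`, and `check_vllm` are hypothetical helper names, and the version parsing is deliberately simple (it ignores suffixes such as `.post1` or `+cu118`).

```python
from importlib.metadata import PackageNotFoundError, version


def parse_release(text: str) -> tuple:
    """Turn '0.7.2' or '0.7.2.post1' into (0, 7, 2), dropping non-numeric suffixes."""
    parts = []
    for piece in text.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    return tuple(parts)


def meets_minimum(installed: str, minimum: str = "0.7.2") -> bool:
    """Compare release tuples, so that 0.10.0 correctly counts as newer than 0.7.2."""
    return parse_release(installed) >= parse_release(minimum)


def check_vllm(minimum: str = "0.7.2") -> bool:
    """Return True only if vllm is installed and at least `minimum`."""
    try:
        return meets_minimum(version("vllm"), minimum)
    except PackageNotFoundError:
        return False


if __name__ == "__main__":
    print("vLLM OK" if check_vllm() else "Please install vllm >= 0.7.2")
```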

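Usage

TRivia-3B takes a table image as input and outputs OTSL tags.

Note: TRivia-3B is an experimental model that has not undergone rigorous engineering optimization. It cannot output LaTeX formulas or handle tables that contain images.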
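For readers unfamiliar with OTSL, the sketch below shows the general idea of turning an OTSL token sequence into HTML. It assumes a common OTSL vocabulary (`<fcel>` filled cell, `<ecel>` empty cell, `<lcel>` merge with the cell to the left, `<ucel>` merge with the cell above, `<xcel>` interior of a two-way merge, `<nl>` end of row); the exact tokens TRivia-3B emits may differ, and `otsl_to_html` here is a hypothetical illustration. For real outputs, use the repository's own `convert_otsl_to_html` from `otsl_utils`.

```python
import re
from html import escape

# Assumed token vocabulary; not confirmed against TRivia-3B's actual output.
TOKEN_RE = re.compile(r"<(fcel|ecel|lcel|ucel|xcel|nl)>")


def otsl_to_html(otsl: str) -> str:
    """Convert an OTSL string into a bare HTML table (illustrative sketch)."""
    # 1. Split the sequence into rows of (tag, text) pairs.
    rows, row = [], []
    matches = list(TOKEN_RE.finditer(otsl))
    for i, match in enumerate(matches):
        tag = match.group(1)
        end = matches[i + 1].start() if i + 1 < len(matches) else len(otsl)
        text = otsl[match.end():end].strip()
        if tag == "nl":
            rows.append(row)
            row = []
        else:
            row.append((tag, text))
    if row:
        rows.append(row)

    def tag_at(r: int, c: int):
        return rows[r][c][0] if r < len(rows) and c < len(rows[r]) else None

    # 2. Emit one <td> per anchor cell; merge markers only extend the spans.
    out = ["<table>"]
    for r, cells in enumerate(rows):
        out.append("<tr>")
        for c, (tag, text) in enumerate(cells):
            if tag not in ("fcel", "ecel"):
                continue
            colspan = 1
            while tag_at(r, c + colspan) == "lcel":
                colspan += 1
            rowspan = 1
            while tag_at(r + rowspan, c) == "ucel":
                rowspan += 1
            attrs = f' rowspan="{rowspan}"' if rowspan > 1 else ""
            attrs += f' colspan="{colspan}"' if colspan > 1 else ""
            out.append(f"<td{attrs}>{escape(text)}</td>")
        out.append("</tr>")
    out.append("</table>")
    return "".join(out)


if __name__ == "__main__":
    print(otsl_to_html("<fcel>A<lcel><nl><fcel>B<fcel>C<nl>"))
```

In this sketch, the first row's `<lcel>` makes cell "A" span two columns, while the second row holds two ordinary cells.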

vLLM Offline Inference

Make sure vllm >= 0.7.2 is installed. Place the images you want to recognize in a directory and run the following command:

python run_vllm_offline_inf.py --ckpt_root opendatalab/TRivia-3B --image_root /path/to/images --output_path ./vllm_offline_output.json
# Examples
python run_vllm_offline_inf.py --ckpt_root opendatalab/TRivia-3B --image_root ./examples --output_path ./examples_output.json

The output is a JSON file (example) in the following format:

[
    {
        "path": "...", // Image path
        "otsl": "...", // Unprocessed OTSL tags output by the model
        "html": "..."  // Converted HTML tags
    }
]
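As one way to consume this output, the sketch below reads the result file and writes each converted table to its own `.html` file. It relies only on the `path` and `html` keys shown above; `export_html_tables` and the output directory name are hypothetical, and it assumes the real file is plain JSON without the explanatory comments.

```python
import json
from pathlib import Path


def export_html_tables(result_json: str, out_dir: str) -> list:
    """Write the "html" field of every record to <out_dir>/<image stem>.html."""
    records = json.loads(Path(result_json).read_text(encoding="utf-8"))
    target_dir = Path(out_dir)
    target_dir.mkdir(parents=True, exist_ok=True)

    written = []
    for record in records:
        # Name each HTML file after its source image, e.g. table_1.png -> table_1.html.
        target = target_dir / (Path(record["path"]).stem + ".html")
        target.write_text(record["html"], encoding="utf-8")
        written.append(target)
    return written


if __name__ == "__main__":
    result_file = "./vllm_offline_output.json"
    if Path(result_file).exists():
        for html_file in export_html_tables(result_file, "./html_tables"):
            print(html_file)
```

Images that share a file name in different folders would overwrite each other in this sketch, so adjust the naming if your inputs are nested.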

vLLM Online Deployment

You can also deploy TRivia-3B with vLLM or SGLang and query it through an OpenAI-style API.

  • Start the server
vllm serve opendatalab/TRivia-3B --port 10000 --gpu_memory_utilization 0.8 
  • Table Image Request
import base64
from openai import OpenAI
from otsl_utils import convert_otsl_to_html

client = OpenAI(
    api_key="EMPTY",
    base_url="http://127.0.0.1:10000/v1",
    timeout=3600
)

image_path = "./examples/docstructbench_llm-raw-scihub-o.O-ijc.22994.pdf_3_5.png"
# Read the image and base64-encode it so it can be embedded in the request.
with open(image_path, "rb") as image_file:
    base64_image = base64.b64encode(image_file.read()).decode('utf-8')

messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": "You are an AI specialized in recognizing and extracting table from images. Your mission is to analyze the table image and generate the result in OTSL format using specified tags. Output only the results without any other words and explanation." # Make sure to use this prompt for optimal performance.
            },
            {
                "type": "image_url",
                "image_url": {"url": f"data:image/png;base64,{base64_image}"}  # The example image is a PNG.
            }
        ]
    }
]

response = client.chat.completions.create(
    model="opendatalab/TRivia-3B",
    messages=messages,
    temperature=0.0,
    max_tokens=8192
)
otsl_content = response.choices[0].message.content
html_content = convert_otsl_to_html(otsl_content)
print(f"Generated otsl tags: {otsl_content}")
print(f"HTML table: {html_content}")
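When sending images of mixed formats, the MIME type in the data URL can be derived from the file extension rather than hard-coded. The helper below is a minimal sketch using the standard library; `image_to_data_url` is a hypothetical name and not part of this repository.

```python
import base64
import mimetypes


def image_to_data_url(path: str) -> str:
    """Encode an image file as a data URL, guessing the MIME type from its extension."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot determine an image MIME type for: {path}")
    with open(path, "rb") as image_file:
        payload = base64.b64encode(image_file.read()).decode("utf-8")
    return f"data:{mime};base64,{payload}"
```

The returned string can be used as the `url` value of the `image_url` entry in the request above.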

Citation

@misc{zhang2025triviaselfsupervisedfinetuningvisionlanguage,
      title={TRivia: Self-supervised Fine-tuning of Vision-Language Models for Table Recognition}, 
      author={Junyuan Zhang and Bin Wang and Qintong Zhang and Fan Wu and Zichen Wen and Jialin Lu and Junjie Shan and Ziqi Zhao and Shuya Yang and Ziling Wang and Ziyang Miao and Huaping Zhong and Yuhang Zang and Xiaoyi Dong and Ka-Ho Chow and Conghui He},
      year={2025},
      eprint={2512.01248},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2512.01248}, 
}

License

Apache License 2.0