OCR API Documentation (v0.1) - Placeholder

Status: Initial placeholder for OCR endpoints and integration. This page will expand with full request/response schemas and examples.

Summary

The OCR module integrates with media ingestion to extract text from scanned PDFs or images when native text is unavailable. The public API currently exposes lightweight endpoints for backend discovery and preloading, while full OCR execution is driven via the media ingestion APIs.

Auth + Rate Limits

Single-user: X-API-KEY: <key>
Multi-user: Authorization: Bearer <JWT>
Standard limits apply; OCR preloading is low-cost, while end-to-end OCR via media ingestion follows media service limits.
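
The two auth modes above can be wrapped in a small client-side helper. This is a minimal sketch, not part of the server code: the function name auth_headers is hypothetical, and rejecting the case where both credentials are passed is an assumption made here for clarity.

```python
from typing import Dict, Optional


def auth_headers(api_key: Optional[str] = None, jwt: Optional[str] = None) -> Dict[str, str]:
    """Return the request header for whichever auth mode the deployment uses.

    Hypothetical helper: single-user mode sends X-API-KEY, multi-user mode
    sends a Bearer token, as listed in the Auth section.
    """
    if api_key and jwt:
        # Assumption: a client should pick one mode rather than send both.
        raise ValueError("pass either api_key (single-user) or jwt (multi-user), not both")
    if api_key:
        return {"X-API-KEY": api_key}
    if jwt:
        return {"Authorization": f"Bearer {jwt}"}
    raise ValueError("an API key or a JWT is required")
```

For example, auth_headers(api_key="secret") yields the single-user header, and auth_headers(jwt="token") yields the multi-user one.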

Endpoints

GET /api/v1/ocr/backends
- Lists available OCR backends with basic health info.
- Returns a map keyed by backend name (e.g., mineru, points, dots, llamacpp, chatllm) including lightweight backend-specific configuration details.
- Field shape varies by backend. Currently, llamacpp exposes mode, configured_mode, configured, supports_structured_output, supports_json, model, configured_flags, auto-eligibility flags, backend_concurrency_cap, and mode-specific flags such as url_configured, managed_configured, managed_running, allow_managed_start, and cli_configured. chatllm exposes a similar capability set with its own mode-specific details, so do not assume its fields match llamacpp's exactly.
- Code: tldw_Server_API/app/api/v1/endpoints/ocr.py:router.get("/backends")

POST /api/v1/ocr/points/preload
- Attempts to preload the POINTS Transformers model to surface errors early.
- Returns { "status": "ok" | "error", ... }.
- Code: tldw_Server_API/app/api/v1/endpoints/ocr.py:router.post("/points/preload")
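
A client can call the discovery endpoint and pick out usable backends along these lines. This is a sketch using only the standard library. The helper names are hypothetical, and it assumes each backend entry carries an available flag like the one in the MinerU discovery excerpt; since field shape varies by backend, entries without that flag are treated as unavailable here.

```python
import json
import urllib.request
from typing import Any, Dict, List


def build_backends_request(base_url: str, headers: Dict[str, str]) -> urllib.request.Request:
    """Build the GET request for the backend discovery endpoint."""
    url = base_url.rstrip("/") + "/api/v1/ocr/backends"
    return urllib.request.Request(url, headers=headers, method="GET")


def available_backends(payload: Dict[str, Any]) -> List[str]:
    """Return the names of backends whose entry reports a truthy `available`.

    Assumption: `available` is present on entries, as in the MinerU excerpt.
    """
    return sorted(
        name
        for name, info in payload.items()
        if isinstance(info, dict) and info.get("available")
    )


def fetch_backends(base_url: str, headers: Dict[str, str], timeout: float = 10.0) -> Dict[str, Any]:
    """Perform the request and decode the JSON map keyed by backend name."""
    request = build_backends_request(base_url, headers)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.load(response)
```

Against a running server, available_backends(fetch_backends("http://localhost:8000", {"X-API-KEY": key})) would give the list of backend names to offer in a client UI.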

OCR in Media Ingestion

OCR is typically enabled via the media ingestion request options. Key fields (see code for authoritative definitions):

enable_ocr (bool) - enable OCR for scanned/low-text PDFs
ocr_backend (str | null) - backend name (e.g., tesseract, auto, or a specific backend such as mineru)
ocr_lang (str) - language (e.g., eng)
ocr_dpi (int) - DPI for page rendering prior to OCR
ocr_mode (enum) - always or fallback
ocr_min_page_text_chars (int) - threshold to treat a page as “no text” for fallback OCR
ocr_output_format (str | null) - text|markdown|json (controls structured OCR output)
ocr_prompt_preset (str | null) - general|doc|table|spotting|json (backend-specific presets)
The PDF pipeline stores structured OCR data under analysis_details.ocr.structured when the backend returns it.
Per-page OCR concurrency is capped by the smaller of OCR_PAGE_CONCURRENCY and the backend profile's max_page_concurrency.
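
The interaction between ocr_mode, ocr_min_page_text_chars, and the concurrency cap can be sketched as follows. This illustrates the described behavior and is not the pipeline's actual code: the function names are hypothetical, and the strict less-than comparison, the whitespace stripping, and the handling of a backend with no max_page_concurrency are all assumptions.

```python
from typing import List, Optional


def pages_needing_ocr(page_texts: List[str], ocr_mode: str, min_page_text_chars: int) -> List[int]:
    """Return zero-based indexes of pages that would be sent to OCR."""
    if ocr_mode == "always":
        # Every page is OCRed regardless of its native text.
        return list(range(len(page_texts)))
    if ocr_mode != "fallback":
        raise ValueError("ocr_mode must be 'always' or 'fallback'")
    # Assumption: a page counts as "no text" when its stripped native text
    # is shorter than the threshold.
    return [
        index
        for index, text in enumerate(page_texts)
        if len(text.strip()) < min_page_text_chars
    ]


def effective_page_concurrency(global_cap: int, backend_max: Optional[int]) -> int:
    """Smaller of the global cap and the backend profile's cap, at least 1."""
    if backend_max is None:
        # Assumption: a backend without its own cap only obeys the global one.
        return max(1, global_cap)
    return max(1, min(global_cap, backend_max))
```

With a threshold of 5, a page whose native text is "abc" would be OCRed in fallback mode, while a page containing "hello world" would not.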

Reference (code): tldw_Server_API/app/api/v1/schemas/media_request_models.py.

MinerU behavior

ocr_backend=mineru is supported only for PDF ingestion and OCR evaluation in v1.
MinerU is document-level, not per-page image OCR. The PDF pipeline runs it once for the whole PDF and stores the normalized result under analysis_details.ocr.structured.
MinerU appears in GET /api/v1/ocr/backends with capability flags such as pdf_only, document_level, and opt_in_only.
MinerU is excluded from auto, auto_high_quality, and OCR.backend_priority in v1.
ocr_lang and ocr_dpi are advisory for MinerU: they are recorded in metadata but do not currently affect the CLI invocation.

Llama.cpp and ChatLLM behavior

ocr_backend=llamacpp and ocr_backend=chatllm are server-owned OCR profiles that can run in remote, managed, cli, or auto mode.
Both backends use explicit auto-eligibility flags. They only participate in auto / auto_high_quality when the flag is enabled and the backend is locally available.
Managed mode is single-process only in v1. For multi-worker deployments, use remote or cli.
The PDF pipeline records the effective page concurrency it actually used, not just the global cap.
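
To preview which backends auto selection might consider, a client could filter the discovery response roughly as below. The server makes the real decision, so treat this as an approximation. The key name auto_eligible is hypothetical (the actual auto-eligibility flag names are not listed in this document), and the use of available and opt_in_only follows the MinerU discovery excerpt.

```python
from typing import Any, Dict, List

# Backends that join auto selection only when their eligibility flag is set.
FLAG_GATED = ("llamacpp", "chatllm")


def auto_candidates(
    backends: Dict[str, Dict[str, Any]],
    auto_flag_key: str = "auto_eligible",  # hypothetical key name
) -> List[str]:
    """Approximate the set of backends that auto selection may consider."""
    names = []
    for name, info in backends.items():
        if not info.get("available"):
            continue  # unavailable backends never participate
        if info.get("opt_in_only"):
            continue  # e.g. MinerU is excluded from auto in v1
        if name in FLAG_GATED and not info.get(auto_flag_key):
            continue  # llamacpp/chatllm need their flag enabled
        names.append(name)
    return sorted(names)
```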

Quick Examples

List OCR backends

curl -s http://localhost:8000/api/v1/ocr/backends \
  -H "X-API-KEY: $TLDW_API_KEY" | jq

Preload POINTS Transformers

curl -s -X POST http://localhost:8000/api/v1/ocr/points/preload \
  -H "X-API-KEY: $TLDW_API_KEY" | jq

Enable OCR in media ingestion (illustrative JSON fragment)

{
  "enable_ocr": true,
  "ocr_backend": "auto",
  "ocr_lang": "eng",
  "ocr_mode": "fallback",
  "ocr_dpi": 300
}

Structured OCR example (process PDF + inspect analysis_details)

curl -s -X POST http://localhost:8000/api/v1/media/process-pdfs \
  -H "X-API-KEY: $TLDW_API_KEY" \
  -F "enable_ocr=true" \
  -F "ocr_backend=hunyuan" \
  -F "ocr_output_format=json" \
  -F "ocr_prompt_preset=json" \
  -F "files=@/path/to/sample.pdf"

Example response excerpt (truncated)

{
  "results": [
    {
      "analysis_details": {
        "ocr": {
          "backend": "hunyuan",
          "output_format": "json",
          "prompt_preset": "json",
          "structured": {
            "format": "json",
            "text": "...",
            "pages": [
              { "text": "...", "raw": { "blocks": [ { "text": "..." } ] } }
            ]
          }
        }
      }
    }
  ]
}
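
Once the response is decoded, the per-page text can be pulled out of analysis_details.ocr.structured with a short helper. This sketch follows the field names in the excerpt above. The function name is hypothetical, and skipping results that carry no structured data is an assumption based on the note that structured output is stored only when the backend returns it.

```python
from typing import Any, Dict, List


def structured_page_texts(response: Dict[str, Any]) -> List[List[str]]:
    """Collect page texts for each result that carries structured OCR data."""
    collected = []
    for result in response.get("results", []):
        details = result.get("analysis_details") or {}
        structured = (details.get("ocr") or {}).get("structured")
        if not structured:
            # The backend returned no structured output for this result.
            continue
        pages = structured.get("pages", [])
        collected.append([page.get("text", "") for page in pages])
    return collected
```

Each inner list corresponds to one processed file, in the order the results were returned.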

MinerU PDF OCR example

curl -s -X POST http://localhost:8000/api/v1/media/process-pdfs \
  -H "X-API-KEY: $TLDW_API_KEY" \
  -F "enable_ocr=true" \
  -F "ocr_backend=mineru" \
  -F "ocr_mode=fallback" \
  -F "ocr_output_format=markdown" \
  -F "files=@/path/to/scanned-table.pdf"

Example MinerU discovery response excerpt (truncated)

{
  "mineru": {
    "available": true,
    "pdf_only": true,
    "document_level": true,
    "opt_in_only": true,
    "mode": "cli",
    "timeout_sec": 120,
    "max_concurrency": 1
  }
}

Backend Notes

MinerU: document-level PDF OCR with bounded structured artifacts (pages, tables, artifact excerpts)
POINTS Reader: documentation coming soon
Llama.cpp OCR: see Docs/OCR/LlamaCpp-OCR.md
ChatLLM OCR: see Docs/OCR/ChatLLM-OCR.md
OCR Providers overview: see Docs/OCR/OCR_Providers.md

Roadmap (Placeholder)

Expand docs with full request/response schemas
Add examples for common ingestion flows with OCR
Add troubleshooting and performance tips

If you need additional OCR endpoints or deeper docs, please open an issue with your use case.
