Preserve printed page numbers from PP-DocLayoutV3 number regions in SDK output #170

@VooDisss

Summary

GLM-OCR currently detects PP-DocLayoutV3 layout regions, but regions carrying the "number" label (printed page numbers) are not reliably preserved as OCR-derived metadata in the saved SDK JSON output. The SDK should route number regions through OCR, keep the general json_result schema lean, and expose explicit printed-page metadata only when real printed-page data exists.

Proposed solution

Implement printed page number support in the SDK with the following behavior:

  • Route PP-DocLayoutV3 number regions through OCR instead of dropping them
  • Keep extraction strict: only use native_label == "number"
  • Extract printed page number evidence in the result formatter
  • Support both numeric and Roman numeral page labels
  • Add three top-level JSON layers when printed-page data exists:
    • page_number_candidates
    • document_page_numbering
    • page_metadata
  • Keep final general json_result blocks lean
    • do not expose broad transient layout metadata like layout_index / layout_score on all blocks
    • retain native_label in blocks
  • Wrap saved paper.json only when real printed-page data exists
    • feature off -> legacy flat JSON
    • feature on, no hits -> legacy flat JSON
    • feature on, hits -> wrapped JSON with the printed-page layers
  • Expose feature enablement via constructor, environment variable, and YAML config
  • Keep MaaS and self-hosted output contracts aligned
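To make the strict extraction rule concrete, here is a minimal sketch of how candidates could be collected in the result formatter. It is illustrative only, not the SDK's actual code: `parse_page_label` and `collect_candidates` are hypothetical helpers, and the block field `content` and the `page_index` / `label` / `style` / `value` keys are assumptions. Only `native_label == "number"` comes from the proposal itself.

```python
import re
from typing import Optional

# Strict Roman numeral pattern (1..3999), matched case-insensitively.
_ROMAN_RE = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$",
    re.IGNORECASE,
)
_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def parse_page_label(text: str) -> Optional[dict]:
    """Parse an OCR string as a numeric or Roman page label, else return None."""
    label = text.strip()
    if not label:
        return None
    if label.isascii() and label.isdigit():
        return {"label": label, "style": "arabic", "value": int(label)}
    if _ROMAN_RE.match(label):
        upper = label.upper()
        total = 0
        for i, ch in enumerate(upper):
            value = _ROMAN_VALUES[ch]
            # Subtractive notation: a smaller numeral before a larger one is subtracted.
            if i + 1 < len(upper) and value < _ROMAN_VALUES[upper[i + 1]]:
                total -= value
            else:
                total += value
        return {"label": label, "style": "roman", "value": total}
    return None


def collect_candidates(blocks: list, page_index: int) -> list:
    """Collect candidates from one page's blocks (hypothetical block shape)."""
    candidates = []
    for block in blocks:
        # Strict rule from the proposal: only regions labeled "number" qualify.
        if block.get("native_label") != "number":
            continue
        parsed = parse_page_label(block.get("content") or "")
        if parsed is not None:
            candidates.append({"page_index": page_index, **parsed})
    return candidates
```

Under this sketch, a block whose OCR text happens to be a bare number but whose label is, say, "text" is ignored, which matches the "no header/footer/text fallback" intent.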

Why this is needed

Printed page numbers are a distinct layout signal that should be preserved separately from file page position, since the two often differ (front matter numbered in Roman numerals is a common case). They give downstream consumers explicit page labels grounded in OCR-recognized page folios rather than in file-page ordering alone.

Scope constraints

This proposal is intentionally narrow:

  • no header/footer/text fallback in this change
  • no extrapolation of missing page labels in this change
  • no debug-only artifacts in final SDK output

A later follow-up can add inferred full-document page mapping if needed, but this issue is only about preserving and exposing observed printed page number data cleanly in the SDK output contract.
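The wrapping rule and the "observed data only" constraint could be sketched as follows. This is an assumption-laden illustration: `build_saved_json` is a hypothetical function, the `json_result` wrapper key and the inner shape of each layer are not specified in this issue, and candidates are assumed to be dicts with `page_index`, `label`, and `style` keys. Only the three layer names and the three-way wrapping behavior come from the proposal.

```python
def build_saved_json(json_result, candidates, enabled):
    """Return legacy flat JSON unless the feature is on AND there are real hits."""
    # feature off -> legacy flat JSON; feature on, no hits -> legacy flat JSON
    if not enabled or not candidates:
        return json_result

    # Keep only observed labels: the first candidate seen per file page.
    # Pages without a candidate get no entry (no extrapolation).
    observed = {}
    for cand in candidates:
        observed.setdefault(cand["page_index"], cand)

    page_metadata = [
        {"page_index": idx, "printed_page_label": observed[idx]["label"]}
        for idx in sorted(observed)
    ]
    document_page_numbering = {
        # Assumed summary shape: which label styles were seen, and on how many pages.
        "styles": sorted({c["style"] for c in observed.values()}),
        "observed_page_count": len(observed),
    }
    # feature on, hits -> wrapped JSON with the printed-page layers
    return {
        "json_result": json_result,
        "page_number_candidates": candidates,
        "document_page_numbering": document_page_numbering,
        "page_metadata": page_metadata,
    }
```

Returning the original object untouched in the first two cases means existing consumers of the flat format see no change unless printed-page data was actually recognized.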
