Summary
GLM-OCR currently detects PP-DocLayoutV3 layout regions, but printed page number regions labeled as `number` are not reliably preserved as OCR-derived metadata in saved SDK JSON output. The SDK should route `number` regions through OCR, keep the general `json_result` schema lean, and expose explicit printed-page metadata only when real printed-page data exists.
Proposed solution
Implement printed page number support in the SDK with the following behavior:
- Route PP-DocLayoutV3 `number` regions through OCR instead of dropping them
- Keep extraction strict: only use `native_label == "number"`
- Extract printed page number evidence in the result formatter
- Support both numeric and Roman numeral page labels
- Add three top-level JSON layers when printed-page data exists:
  - `page_number_candidates`
  - `document_page_numbering`
  - `page_metadata`
- Keep final general `json_result` blocks lean
  - do not expose broad transient layout metadata like `layout_index` / `layout_score` on all blocks
  - retain `native_label` in blocks
- Wrap saved `paper.json` only when real printed-page data exists
  - feature off -> legacy flat JSON
  - feature on, no hits -> legacy flat JSON
  - feature on, hits -> wrapped JSON with the printed-page layers
- Expose feature enablement via constructor, environment variable, and YAML config
- Keep MaaS and self-hosted output contracts aligned
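The strict extraction and the numeric/Roman label support described above could look roughly like the following. This is a minimal sketch, not the SDK's implementation: the helper names (`roman_to_int`, `parse_page_label`, `extract_page_number_candidates`), the `content` field holding the OCR text, and the shape of each candidate dict are assumptions made for illustration. Only `native_label == "number"` comes from the proposal itself.

```python
import re
from typing import Any, Dict, List, Optional

# Hypothetical helpers for illustration; none of these names exist in the SDK.
_ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}
# Accepts only well-formed Roman numerals (1..3999), lowercase.
_ROMAN_RE = re.compile(r"m{0,3}(cm|cd|d?c{0,3})(xc|xl|l?x{0,3})(ix|iv|v?i{0,3})")


def roman_to_int(text: str) -> Optional[int]:
    """Convert a well-formed Roman numeral to an int, or return None."""
    s = text.strip().lower()
    if not s or not _ROMAN_RE.fullmatch(s):
        return None
    total = 0
    for i, ch in enumerate(s):
        value = _ROMAN_VALUES[ch]
        # A smaller symbol placed before a larger one is subtracted (e.g. "iv").
        if i + 1 < len(s) and _ROMAN_VALUES[s[i + 1]] > value:
            total -= value
        else:
            total += value
    return total


def parse_page_label(text: str) -> Optional[Dict[str, Any]]:
    """Parse OCR text as a printed page label; None if it is neither style."""
    label = text.strip()
    if re.fullmatch(r"[0-9]+", label):
        return {"label": label, "style": "arabic", "value": int(label)}
    roman = roman_to_int(label)
    if roman is not None:
        return {"label": label, "style": "roman", "value": roman}
    return None


def extract_page_number_candidates(blocks: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Strict extraction: only blocks with native_label == "number" are considered."""
    candidates = []
    for block in blocks:
        if block.get("native_label") != "number":
            continue  # no header/footer/text fallback
        # "content" as the OCR text field is an assumption of this sketch.
        parsed = parse_page_label(block.get("content") or "")
        if parsed is not None:
            candidates.append(parsed)
    return candidates
```

Note that a pattern-based check like this treats any well-formed Roman string as a page label, so a lone "c" or "mix" inside a `number` region would parse as 100 or 1009. Whether that needs further guarding is left open here.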
Why this is needed
Printed page numbers are a distinct layout signal that should be preserved separately from file page position. They are useful technical metadata for downstream consumers that need explicit page labeling grounded in OCR-recognized page folios rather than only file-page ordering.
Scope constraints
This proposal is intentionally narrow:
- no header/footer/text fallback in this change
- no extrapolation of missing page labels in this change
- no debug-only artifacts in final SDK output
A later follow-up can add inferred full-document page mapping if needed, but this issue is only about preserving and exposing observed printed page number data cleanly in the SDK output contract.
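For reference, the three saved-output cases from the proposed solution (feature off, feature on with no hits, feature on with hits) can be sketched as a single decision function. This is an illustration under stated assumptions: the function name `build_saved_json`, its parameters, and the `json_result` key used to hold the legacy payload inside the wrapper are hypothetical. Only the three printed-page layer names come from the proposal.

```python
from typing import Any, Dict, List


def build_saved_json(
    json_result: Any,
    page_number_candidates: List[Dict[str, Any]],
    document_page_numbering: Dict[str, Any],
    page_metadata: List[Dict[str, Any]],
    enabled: bool,
) -> Any:
    """Return the legacy flat JSON unless the feature is on AND there are hits."""
    # Feature off, or feature on with no observed page numbers: legacy flat JSON.
    if not enabled or not page_number_candidates:
        return json_result
    # Feature on with hits: wrap the legacy result and add the three layers.
    # The "json_result" wrapper key is an assumption of this sketch.
    return {
        "json_result": json_result,
        "page_number_candidates": page_number_candidates,
        "document_page_numbering": document_page_numbering,
        "page_metadata": page_metadata,
    }
```

Keeping the no-hit case identical to the feature-off case means existing consumers of `paper.json` see a changed shape only when printed-page data was actually observed.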