Preserve printed page numbers from PP-DocLayoutV3 number regions in SDK output #170

## Summary

GLM-OCR already detects PP-DocLayoutV3 layout regions, but regions labeled `number` (printed page numbers) are not reliably preserved as OCR-derived metadata in the saved SDK JSON output. The SDK should route `number` regions through OCR, keep the general `json_result` schema lean, and expose explicit printed-page metadata only when real printed-page data exists.

## Proposed solution

Implement printed page number support in the SDK with the following behavior:

- Route PP-DocLayoutV3 `number` regions through OCR instead of dropping them
- Keep extraction strict: only use `native_label == "number"`
- Extract printed page number evidence in the result formatter
- Support both numeric and Roman numeral page labels
- Add three top-level JSON layers when printed-page data exists:
  - `page_number_candidates`
  - `document_page_numbering`
  - `page_metadata`
- Keep final general `json_result` blocks lean
  - do not expose broad transient layout metadata like `layout_index` / `layout_score` on all blocks
  - retain `native_label` in blocks
- Wrap saved `paper.json` only when real printed-page data exists
  - feature off -> legacy flat JSON
  - feature on, no hits -> legacy flat JSON
  - feature on, hits -> wrapped JSON with the printed-page layers
- Expose feature enablement via constructor, environment variable, and YAML config
- Keep MaaS and self-hosted output contracts aligned
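
The strict label filter and the numeric/Roman parsing listed above can be sketched roughly as follows. This is a minimal illustration, not the SDK's implementation: the function names (`roman_to_int`, `parse_page_label`, `extract_candidates`) and the region field `content` are assumptions made for the example; only the `native_label == "number"` rule comes from this proposal.

```python
import re
from typing import Any, Optional

# Strict Roman numeral grammar (1..3999); rejects malformed input such as "IIII".
_ROMAN_RE = re.compile(
    r"^M{0,3}(CM|CD|D?C{0,3})(XC|XL|L?X{0,3})(IX|IV|V?I{0,3})$"
)
_ROMAN_VALUES = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100, "D": 500, "M": 1000}


def roman_to_int(text: str) -> Optional[int]:
    """Convert a Roman numeral (any case) to an int, or return None if invalid."""
    s = text.upper()
    if not s or not _ROMAN_RE.match(s):
        return None
    total = 0
    for ch, nxt in zip(s, s[1:] + " "):
        value = _ROMAN_VALUES[ch]
        # A symbol followed by a larger one is subtractive (e.g. the I in IV).
        if _ROMAN_VALUES.get(nxt, 0) > value:
            total -= value
        else:
            total += value
    return total


def parse_page_label(text: str) -> Optional[dict[str, Any]]:
    """Parse OCR text as an Arabic or Roman page label; None if it is neither."""
    label = text.strip()
    if re.fullmatch(r"[0-9]+", label):
        return {"label": label, "value": int(label), "style": "arabic"}
    value = roman_to_int(label)
    if value is not None:
        return {"label": label, "value": value, "style": "roman"}
    return None


def extract_candidates(page_index: int, regions: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Collect page number candidates from regions whose native_label is exactly "number"."""
    candidates = []
    for region in regions:
        if region.get("native_label") != "number":  # strict: no header/footer/text fallback
            continue
        parsed = parse_page_label(region.get("content", ""))  # "content" is an assumed field name
        if parsed is not None:
            candidates.append({"page_index": page_index, **parsed})
    return candidates
```

In this sketch, text such as `Page 12` is rejected rather than partially matched, which keeps the extraction as strict as the label filter.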

## Why this is needed

Printed page numbers are a distinct layout signal and should be preserved separately from file page position. Downstream consumers that need explicit page labels can then rely on OCR-recognized folios rather than file-page order alone. For example, in a book whose front matter is numbered i through xii, the page printed as 1 can be the thirteenth page of the file.

## Scope constraints

This proposal is intentionally narrow:

- no header/footer/text fallback in this change
- no extrapolation of missing page labels in this change
- no debug-only artifacts in final SDK output

A later follow-up can add inferred full-document page mapping if needed, but this issue is only about preserving and exposing observed printed page number data cleanly in the SDK output contract.
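
To make the output contract concrete, here is a hedged sketch of the wrap-or-flat decision under the constraints above. The three top-level layer names come from this proposal, but everything inside them is hypothetical: the `build_saved_json` function, the `json_result` wrapper key, the inner fields (`printed_page_label`, `printed_page_value`, `observed_page_count`, `styles`), and the assumed candidate shape (dicts with `page_index`, `label`, `value`, `style`). Taking the first candidate per page is also an assumption.

```python
from typing import Any


def build_saved_json(
    flat_result: list[Any],
    candidates: list[dict[str, Any]],
    enabled: bool,
) -> Any:
    """Return legacy flat JSON unless the feature is on AND real hits exist."""
    if not enabled or not candidates:
        # Feature off, or feature on with no hits: legacy flat JSON, unchanged.
        return flat_result

    # Keep only observed labels; pages without a hit get no entry (no extrapolation).
    first_per_page: dict[int, dict[str, Any]] = {}
    for cand in candidates:
        first_per_page.setdefault(cand["page_index"], cand)

    page_metadata = [
        {
            "page_index": idx,
            "printed_page_label": cand["label"],
            "printed_page_value": cand["value"],
        }
        for idx, cand in sorted(first_per_page.items())
    ]

    return {
        "json_result": flat_result,  # hypothetical wrapper key for the legacy content
        "page_number_candidates": candidates,
        "document_page_numbering": {
            "observed_page_count": len(page_metadata),
            "styles": sorted({cand["style"] for cand in candidates}),
        },
        "page_metadata": page_metadata,
    }
```

Returning the flat result untouched in the first two cases means existing consumers see no schema change unless printed-page data was actually observed.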

